A Python script to sort DICOM files

This script will help you understand and organize your dataset of medical images

Published in Towards Data Science · 5 min read · Oct 30, 2020

This article is a follow-up to my previous introduction to DICOM files. Special thanks to my good friend Dr. Gian Marco Conte for helping write this.

As a brief recap, DICOM files are the primary format for storing medical images. All clinical algorithms must be able to read and write DICOM. But these files can be challenging to organize. Each DICOM file stores information about the image in a header, which can be extensive. Files are structured in four tiers:

Patient
Study
Series
Instance

In this tutorial, I’ll share some Python code that reads a set of DICOM files, extracts the header information, and copies the files to a tiered folder structure that can be easily loaded for data science tasks.

There are many great resources available for parsing DICOM using Python or other languages. DicomSort has a flexible GUI which can organize files based on any field in the header (DicomSort is also available as a Python package with “pip install dicomsort”). I also want to credit this repo for getting me started with code for reading a DICOM pixel dataset. Finally, this great paper includes a section on image compression which I briefly mention here.

Ultimately I decided to write my own utility because I like knowing exactly what my code is doing, and it also provides an introduction to the DICOM header which is essential knowledge for any data scientist who works on medical imaging projects.

I’ve verified this code on both CT and MRI exams. It should work for any modality, because Patient, Study, and Series information is reported for all DICOM files.

Required Code Packages

This code uses the Python package PyDicom for reading and writing DICOM files.

I want to briefly mention the GDCM package. DICOM files may have image compression applied either during storage or during transfer via the DICOM receiver. For example, at our institution, all DICOMs have JPEG2000 compression. GDCM is a C++ library that allows PyDicom to read these compressed files. It’s available as a conda package (“conda install gdcm”) or can be built from source using cmake. I snuck a few lines into my code below that decompress the pixel data using GDCM, so I don’t have to worry about it in the future.

Update — since writing this article, I’ve started using the pylibjpeg package which is a bit easier to install than GDCM. I added more information at the end of the article.

Code Walkthrough

A full, uninterrupted version of the code is at the end of this article.

First, we specify which directory contains our DICOM files (“src”) and where they will be copied (“dst”). Note that files are copied, not moved, so we’ll end up storing two copies of each file. We’ll read the DICOM files in no particular order, since each file’s header contains enough information to identify exactly where it came from.
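A minimal sketch of this first step, assuming the files end in “.dcm” (many DICOM exports have no extension at all). The helper name `list_files` and the two placeholder paths are illustrative, not taken from the script itself.

```python
import os


def list_files(src, extension=".dcm"):
    """Collect the path of every file under src (including subfolders)
    whose name ends with the given extension, ignoring case."""
    paths = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            if name.lower().endswith(extension):
                paths.append(os.path.join(root, name))
    return paths


# Placeholder locations (assumptions); replace with your own folders.
src = "/path/to/unsorted_dicoms"
dst = "/path/to/sorted_dicoms"
```

Passing `extension=""` keeps every file, which can help when the files carry no extension.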

For each file in our list, we’ll use the PyDicom package to load the file, which exposes the header fields through a dictionary-like dataset object.

We’ll be sorting the DICOM files at the Patient, Study, and Series level (for more information on what these terms mean, I encourage you to read through my previous introduction). I’m also including an additional level, the Study Date, which is useful information if you are expecting multiple studies from the same patient.

There are two fields each associated with Patient, Study, and Series: a unique identifier (UID) and a text description. Most DICOM datasets you’ll see are sorted using UIDs. Although UIDs are always unique, they result in long folder trees that are not easy to understand. I’ve chosen to save each file under the text description, while retaining the Patient ID to provide a layer of anonymity (although patient name will still be available in the header).

Having used this code on several datasets, I have yet to come across an instance where the Study or Series information is missing, but on the off chance that it is, we will replace it with “NA”.
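The lookup with an “NA” fallback could be sketched as follows. The function name `header_fields` is hypothetical, and the exact set of keywords is an assumption based on the fields discussed above. The function only needs an object with a dictionary-style `.get(keyword, default)` method, which a PyDicom dataset provides.

```python
def header_fields(ds, missing="NA"):
    """Return the header fields used for sorting, as strings.

    ds is anything with a dict-style .get(keyword, default) method,
    such as a pydicom Dataset.
    """
    keywords = [
        "PatientID",
        "StudyDate",
        "StudyDescription",
        "SeriesDescription",
        "Modality",
        "SeriesInstanceUID",
        "InstanceNumber",
    ]
    fields = {}
    for keyword in keywords:
        value = ds.get(keyword, missing)
        # A tag can be present but empty; treat that as missing too.
        text = "" if value is None else str(value).strip()
        fields[keyword] = text or missing
    return fields


# With pydicom installed, usage would look something like:
#   import pydicom
#   ds = pydicom.dcmread("/path/to/file.dcm")
#   print(header_fields(ds))
```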

Finally, I’m including a small function for cleaning the text description of “forbidden” characters, removing spaces, and converting the text to lowercase, which makes directory names cleaner.
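One way such a cleaning function might look is sketched below. Replacing each run of spaces and special characters with a single underscore is an assumption made here; the exact list of “forbidden” characters and their replacement may differ in the actual script.

```python
import re


def clean_text(text):
    """Make a header value safe to use as a folder name.

    Every run of characters other than letters, digits, and hyphens
    (spaces included) becomes one underscore; the result is lowercased.
    """
    cleaned = re.sub(r"[^A-Za-z0-9\-]+", "_", str(text))
    # Drop underscores left at the ends, and never return an empty name.
    return cleaned.strip("_").lower() or "na"


print(clean_text("CT Head W/O Contrast"))  # ct_head_w_o_contrast
```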

The file name will be a concatenation of the Modality (CT, MR, etc.), the Series UID, and the Instance Number. I would include the Study UID, but this makes for a very long filename.
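As an illustration, the concatenation could look like this. The dot separator, the “.dcm” suffix, and the name `build_filename` are assumptions for the sketch.

```python
def build_filename(modality, series_uid, instance_number):
    """Join Modality, Series UID, and Instance Number into one file name."""
    return f"{modality}.{series_uid}.{instance_number}.dcm"


# Made-up example values, not from a real dataset.
print(build_filename("CT", "1.2.840.1", 7))  # CT.1.2.840.1.7.dcm
```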

Finally, I’m going to remove JPEG2000 or any other image compression. This is done with GDCM, an optional package that PyDicom uses behind the scenes when it is installed. This isn’t 100% foolproof (like much of DICOM), hence my novice try/except statement:
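A hedged sketch of that try/except pattern is below. It relies on the `decompress()` method of a PyDicom dataset; the wrapper name `try_decompress` and its `label` parameter are hypothetical.

```python
def try_decompress(ds, label="file"):
    """Decompress the pixel data in place, reporting failures instead of crashing.

    ds is expected to be a pydicom Dataset, whose decompress() method
    depends on an installed handler such as GDCM or pylibjpeg.
    Returns True on success and False otherwise.
    """
    try:
        ds.decompress()
        return True
    except Exception as error:
        # Typical causes: no suitable handler installed, unreadable pixel
        # data, or (depending on the pydicom version) nothing to decompress.
        print(f"Could not decompress {label}: {error}")
        return False
```

Returning a flag rather than stopping lets the loop carry on with the remaining files and leaves it to the caller to decide what to do with the ones that failed.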

The rest of the script creates our tiered folder tree. Again, I’m adding an extra level that sorts by the study date. Without it, separate studies might be difficult to tell apart, and if two studies had the same description, all of their series would be lumped together in one folder.
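The folder tree described above could be built roughly like this. The function name `make_series_folder` is illustrative, and the arguments are assumed to be already cleaned for use as folder names.

```python
import os


def make_series_folder(dst, patient_id, study_date, study_description,
                       series_description):
    """Create dst/patient/date/study/series if needed and return its path."""
    folder = os.path.join(dst, patient_id, study_date,
                          study_description, series_description)
    # exist_ok=True makes repeated calls for the same series harmless.
    os.makedirs(folder, exist_ok=True)
    return folder


# With a pydicom Dataset `ds` and a file name in hand, saving would then be:
#   ds.save_as(os.path.join(folder, file_name))
```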

Using the code on a dataset

To test out this code, let’s run it on a dataset. I’ve selected the 2019 RSNA Kaggle Competition, which featured head CTs for the diagnosis of intracranial hemorrhage.

Right now all the files are saved into a single directory, which is how the data was originally made available. Let’s see if this script can help us organize it:

Figure 1. The code parses our unstructured list of DICOM files (left) into a nested folder structure by Patient, Date, Study Description, and Series Description (right). Files are also renamed based on Modality, Series UID, and Instance Number. Image by author.

Although the file list is long enough that we can only see the top folder, note that the files have been sorted by the Patient ID (which is deidentified), the Study Date, and the Study and Series descriptions, which tell us something about the images we’re looking at. Also, the files have been renamed with the Modality, Series UID, and Instance Number for easy sorting down the line.

Full Code

Update — A new package for file Compression

Since writing this article I came across the pylibjpeg package which is a bit easier to install and configure than the GDCM package I mention above.

Pylibjpeg has several dependencies. Here’s the list I copied from my requirements.txt:

pylibjpeg
pylibjpeg-openjpeg
pylibjpeg-libjpeg
pydicom
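To check which of these decompression packages are visible in the current environment, a small sketch like the one below may help. The mapping from package names to import names (`openjpeg`, `libjpeg`, `gdcm`) reflects how those packages are normally imported; the function name `installed_decoders` is hypothetical.

```python
from importlib.util import find_spec

# Package name as installed -> module name as imported.
DECODER_MODULES = {
    "pylibjpeg": "pylibjpeg",
    "pylibjpeg-openjpeg": "openjpeg",
    "pylibjpeg-libjpeg": "libjpeg",
    "gdcm": "gdcm",
}


def installed_decoders():
    """Report, for each package, whether its module can be found."""
    return {
        package: find_spec(module) is not None
        for package, module in DECODER_MODULES.items()
    }


print(installed_decoders())
```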