Understanding DICOM

How to read, write, and organize medical images

Alexander Weston, PhD
Towards Data Science


DICOM is the primary file format for storing and transferring medical images in a hospital’s database.

There are other file formats for storing images. Besides DICOM, you may also see medical images saved in the NIfTI format (file suffix “.nii”), as PNG or JPEG files, or even as serialized Python objects such as NumPy arrays.

So why use DICOM? Other file formats may be more convenient, but in clinical practice DICOM is the standard that scanners and hospital imaging systems use. As my projects have become more advanced, I’ve often found myself rewriting code to read and write DICOM files directly. Furthermore, DICOM files are unambiguous because each file contains a header that documents the hospital, patient, scanner, and image information in detail, much like the way a smartphone embeds the device model and even GPS coordinates in the metadata of every photo you take.

Suggested reading for DICOM

The DICOM file format is documented in the DICOM Standard, which is required reading for most informatics specialists.

For a beginner’s background, this blog is also a great introduction.

Challenges of DICOM

For deep learning tasks, the end goal is usually to load the image data as a NumPy array or a similar in-memory object, and in this case DICOM can be difficult:

  1. DICOM typically saves one file per slice, so a 3D scan may have hundreds of files
  2. DICOM files are often named with a unique identifier (UID). This makes it hard to sort files at the folder level (in some cases, the resulting paths are so long that they exceed the default 260-character path limit on Windows, which causes saving and loading issues)
  3. Patient and hospital information is embedded in the file header, which can make DICOM tricky to anonymize

Although confusing, this organizational structure is an important strength of DICOM. Later, I’ll share some example Python code I’ve written that can help bypass some of these issues.
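As a rough illustration of the first challenge, the sketch below reads the per-slice files in one folder, puts them in slice order, and stacks them into a single array. The function names `order_slices` and `load_volume` are made up for this example, and it rests on a few assumptions: every file in the folder ends in “.dcm”, all files belong to one series, and Instance Number reflects slice order, which is not guaranteed for every scanner.

```python
from pathlib import Path


def order_slices(slices):
    """Return slices sorted by their DICOM Instance Number.

    `slices` can be any objects exposing an `InstanceNumber` attribute,
    such as the datasets returned by pydicom.dcmread.
    """
    return sorted(slices, key=lambda s: int(s.InstanceNumber))


def load_volume(folder):
    """Read every .dcm file in `folder` and stack the slices into a 3D array.

    Assumption: the folder holds exactly one series of equally sized slices.
    """
    # Third-party packages, imported here so that order_slices stays
    # usable even when they are not installed.
    import numpy as np
    import pydicom

    slices = [pydicom.dcmread(path) for path in Path(folder).glob("*.dcm")]
    ordered = order_slices(slices)
    # pixel_array decodes the stored pixel data into a NumPy array.
    return np.stack([s.pixel_array for s in ordered], axis=0)
```

Calling `load_volume("path/to/series")` on a suitable folder should return an array whose first axis runs over the slices.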

Identifying a DICOM file

Each DICOM file is designed to be standalone — all the information needed to identify the file is embedded in each header. This information is organized into 4 levels of hierarchy — patient, study, series, and instance.

  • “Patient” is the person receiving the exam
  • “Study” is the imaging procedure being performed, at a certain date and time, in the hospital
  • “Series” — Each study consists of multiple series. A series may represent the patient being physically scanned multiple times in one study (typical for MRI), or it may be virtual, where the patient is scanned once and that data is reconstructed in different ways (typical for CT)
  • “Instance” — every slice of a 3D image is treated as a separate instance. In this context, “instance” is synonymous with the DICOM file itself

To illustrate this hierarchy, here are a few files from the publicly available pancreatic cancer dataset from The Cancer Imaging Archive (TCIA):

Figure 1. Example of a DICOM file organized by the “Patient”, “Study”, “Series”, and “Instance” levels. Image by Author.

Table 1 shows a printout of the header using pydicom (a Python package for reading and writing DICOM files). I’ve removed most of the information that isn’t relevant and replaced the patient details with “fake” data.

Table 1: Printout of a portion of a DICOM header displaying Patient, Series, Study, and Instance UIDs and text descriptions. Image by author.
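A printout along these lines can be produced with a few lines of pydicom. The sketch below is not the exact code behind Table 1: the helper names `summarize_header` and `print_header` and the choice of fields are my own for this example, although the attribute keywords themselves (`PatientID`, `StudyInstanceUID`, and so on) come from the DICOM data dictionary.

```python
# Keywords from the DICOM data dictionary, grouped by hierarchy level.
HIERARCHY_FIELDS = {
    "Patient": ["PatientID", "PatientName"],
    "Study": ["StudyInstanceUID", "StudyDescription"],
    "Series": ["SeriesInstanceUID", "SeriesDescription"],
    "Instance": ["SOPInstanceUID", "InstanceNumber"],
}


def summarize_header(ds):
    """Collect the identifying fields of one header, level by level.

    `ds` is a pydicom dataset (or any object with the same attributes).
    Fields that are missing from the header are reported as None.
    """
    return {
        level: {name: getattr(ds, name, None) for name in names}
        for level, names in HIERARCHY_FIELDS.items()
    }


def print_header(path):
    """Read one file's header, skipping the pixel data, and print it."""
    import pydicom  # third-party; only needed when reading real files

    ds = pydicom.dcmread(path, stop_before_pixels=True)
    for level, fields in summarize_header(ds).items():
        for name, value in fields.items():
            print(f"{level:<10}{name:<20}{value}")
```

Passing `stop_before_pixels=True` keeps the read fast, since only the header is needed here.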

From the two Description fields in Table 1, we can see that this file comes from a CT exam of the abdomen where contrast was administered, which is consistent with this scan coming from a database of pancreatic cancer imaging. This series “ABD” may be the only one, or there may be others. In clinical CT studies, it’s common to see a positioning scan (“Topogram”) as well as “Sagittal” and “Coronal” scans which highlight anatomy in different planes.

Unique Identifiers: UIDs

In addition to the text descriptions, the scan is identified by the unique Patient ID (5553226), Study UID (1.2.826.0.1.3680043.2.1125.1.38381854871216336385978062044218957), Series UID (1.2.826.0.1.3680043.2.1125.1.68878959984837726447916707551399667), and Instance Number (20).

If you were to load the very next DICOM file in this folder, the Patient ID, Study UID, and Series UID would all have the same value, and only the Instance Number would be different (in this case, “21”).
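One way to check this for yourself is to compare the identifiers of two headers field by field. This is only a small sketch: `differing_identifiers` is a hypothetical helper, and it works on any objects that expose the four attributes, such as datasets read with pydicom.

```python
# The four identifiers discussed above, from the broadest level to the narrowest.
IDENTIFIERS = (
    "PatientID",
    "StudyInstanceUID",
    "SeriesInstanceUID",
    "InstanceNumber",
)


def differing_identifiers(first, second):
    """Return the names of the identifiers whose values differ.

    A missing attribute is treated as None, so a field that is absent
    from one header but present in the other counts as a difference.
    """
    return [
        name
        for name in IDENTIFIERS
        if getattr(first, name, None) != getattr(second, name, None)
    ]
```

For two neighboring slices of the same series, I would expect the result to contain only `InstanceNumber`.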

Text descriptions are helpful, but UIDs are key to identifying a scan. Unlike the descriptions, the UIDs are unique for every single patient, series, and study performed at a hospital.

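Furthermore, UIDs are not actually random numbers: they carry information, from the organization that generated the identifier to, in the case of the Transfer Syntax UID, how the image data is encoded and compressed. The registry of standard UIDs is in Part 6 of the DICOM Standard, and the rules for constructing UIDs are in Part 5.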
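To make this concrete, here is a minimal sketch of two things you can do with a UID as a plain string: check that it follows the formatting rules of the standard (at most 64 characters, numeric components separated by dots, no leading zeros), and look up what a Transfer Syntax UID says about compression. The function names are hypothetical, and the lookup table is only a small excerpt of the registry, not a complete list.

```python
import re

# A component is "0" or a number without leading zeros; components are
# joined by single dots.
_UID_PATTERN = re.compile(r"(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*")
MAX_UID_LENGTH = 64

# A few well-known Transfer Syntax UIDs (excerpt only).
TRANSFER_SYNTAXES = {
    "1.2.840.10008.1.2": "Implicit VR Little Endian",
    "1.2.840.10008.1.2.1": "Explicit VR Little Endian",
    "1.2.840.10008.1.2.4.50": "JPEG Baseline (Process 1)",
    "1.2.840.10008.1.2.4.90": "JPEG 2000 Image Compression (Lossless Only)",
    "1.2.840.10008.1.2.5": "RLE Lossless",
}


def is_valid_uid(uid):
    """Check the length and character rules for a DICOM UID string."""
    return len(uid) <= MAX_UID_LENGTH and _UID_PATTERN.fullmatch(uid) is not None


def describe_transfer_syntax(uid):
    """Return the name of a Transfer Syntax UID, if it is in the excerpt."""
    return TRANSFER_SYNTAXES.get(uid, "not in this excerpt")
```

With pydicom, the Transfer Syntax UID of a file read from disk is available as `ds.file_meta.TransferSyntaxUID`.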

Most DICOM studies that I’ve seen are organized in the “three-tiered” folder structure illustrated in Figure 1, with files sorted first by Patient ID, then by Study Instance UID, then by Series Instance UID, with the Instance Number serving as the file name.
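The mapping from header to folder path is simple enough to sketch in a few lines. The function `target_path` below is a hypothetical helper written for illustration, not a finished sorting script, and it makes two choices of its own: the Instance Number is zero-padded to four digits so that file names sort correctly as text, and a “.dcm” suffix is added.

```python
from pathlib import Path


def target_path(root, ds):
    """Build the Patient / Study / Series / Instance path for one header.

    `ds` is a pydicom dataset (or any object with the same attributes).
    """
    # Assumption: zero-pad to four digits so "0020.dcm" sorts before "0100.dcm".
    instance = int(ds.InstanceNumber)
    return (
        Path(root)
        / str(ds.PatientID)
        / str(ds.StudyInstanceUID)
        / str(ds.SeriesInstanceUID)
        / f"{instance:04d}.dcm"
    )
```

Because two of the three folder levels are full UIDs, paths built this way can become long, which is worth keeping in mind on Windows.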

Conclusion

In this post, I’ve given a brief overview of how DICOM files are identified, and how this information is used to sort datasets.

In the next post, I’ll describe a Python script I wrote that uses the UID information in the file header to reorganize a set of DICOM files into a folder structure that is consistent and easy to understand.
