- [12/27/2024]: Released the classification datasets in Med-MAT. The detection/segmentation datasets are coming soon.
Welcome to the repository of Med-MAT, a VQA dataset consisting of 106 open-source medical datasets, which we hope will advance generalization experiments and aid in training powerful medical multimodal large language models (MLLMs).
Through this dataset, we have demonstrated that Compositional Generalization (CG) is one of the key mechanisms for MLLMs to understand unseen images, enabling them to handle unfamiliar images and achieve data-efficient training.
Here is a list of what has been released:
- QA Pairs for 106 Medical Datasets: Image-label pairs converted into VQA pairs for MLLM training.
- QA Pairs for 53 Aggregated Subsets: Datasets categorized by Modality, Anatomical Area, and Task (MAT), with identical entries merged into subsets.
- Image Download Links: Some datasets cannot be redistributed due to licensing, so links are provided for users to download the images themselves into the specified directories.
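As a rough illustration of how datasets sharing the same Modality, Anatomical Area, and Task could be merged into one subset, here is a minimal sketch. The metadata rows and the normalized task names are assumptions made for the example, and `group_by_mat` is a hypothetical helper, not part of the release.

```python
from collections import defaultdict

# Hypothetical metadata rows: (dataset number, modality, area, normalized task).
datasets = [
    (3, "CT", "Lung", "COVID-19 Classification"),
    (4, "CT", "Lung", "COVID-19 Classification"),
    (7, "CT", "Brain", "Brain Hemorrhage Classification"),
]

def group_by_mat(rows):
    """Merge datasets that share the same (Modality, Area, Task) triple."""
    subsets = defaultdict(list)
    for number, modality, area, task in rows:
        subsets[(modality, area, task)].append(number)
    return dict(subsets)

subsets = group_by_mat(datasets)
for mat, members in subsets.items():
    print(mat, "->", members)
```

Keying on the full triple means two datasets land in the same subset only when all three attributes match.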
To enable MLLMs to directly train and test on Med-MAT, the image-label pairs were converted into a Visual Question-Answering (VQA) format. The process involves the following steps:
- Task Definition: Each subset was manually assigned 6 instructions to guide the MLLM in answering the task related to the subset.
- Conversion to VQA Format: All image-label pairs were converted into single-choice questions with up to four answer options.
- Distractor Selection: Distractor options were randomly drawn from other labels within the subset to ensure variety.
- Final Dataset: The resulting dataset consists of VQA pairs, where each image is paired with a question and up to four options, exactly one of which is correct.
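The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' conversion script: `build_choice_question` and its parameters are hypothetical, and the prompt wording is copied from the sample record shown later in this README.

```python
import random
import string

def build_choice_question(instruction, label, subset_labels, rng, max_options=4):
    """Turn one image label into a single-choice question (illustrative only)."""
    # Distractors are drawn at random from the other labels in the subset.
    others = [l for l in dict.fromkeys(subset_labels) if l != label]
    distractors = rng.sample(others, min(max_options - 1, len(others)))
    options = distractors + [label]
    rng.shuffle(options)
    letters = string.ascii_uppercase[:len(options)]
    lines = [instruction]
    lines += [f"{letter}: {option}" for letter, option in zip(letters, options)]
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines), letters[options.index(label)]

rng = random.Random(0)  # seeded so the example is reproducible
question, answer = build_choice_question(
    "Review this kidney CT scan and determine the possible condition it represents.",
    "Cyst",
    ["Normal", "Cyst", "Tumor", "Stone"],
    rng,
)
print(question)
print("Correct option:", answer)
```

When a subset has fewer than four labels, the sketch simply emits fewer options, which matches the "up to four" wording above.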
You can access the QA pairs of Med-MAT through Hugging Face (HF).
The tables below record the download URLs for the images and QA pairs for each dataset and subset. If you only wish to use part of Med-MAT, you can selectively download the corresponding data.
Original_Medical_Datasets
Click to view the details of 106 Medical Datasets
No. | Name with link | Modality | Area | Task | QA |
---|---|---|---|---|---|
1 | Intel and MobileODT Cervical Screening | Co | Cervix | Cervix Type in Screening | HF |
2 | CT Kindney Dataset | CT | Kidney | Normal or Cyst or Tumor | HF |
3 | SARS-COV-2 Ct-Scan | CT | Lung | COVID19, Classification Dataset | HF |
4 | COVID CT COVID-CT | CT | Lung | COVID19, Classification Dataset | HF |
5 | Chest CT-Scan | CT | Lung | Cancer, 3 Cancer Categories, Multiple Classification Dataset | HF |
6 | COVID-19-CT SCAN IMAGES | CT | Lung | COVID19, Classification | HF |
7 | Head CT | CT | Brain | Head Hemorrhage | HF |
8 | CT of Brain | CT | Brain | Head Cancer | HF |
9 | MED-NODE | Der | Skin | Melanoma or Naevus | HF |
10 | ISIC 2020 | Der | Skin | Melanoma, Benign or Malignant | HF |
11 | PED-UFES-20 | Der | Skin | Skin Multi Classification | HF |
12 | Web-scraped Skin Image | Der | Skin | Skin Disease Multi Classification | HF |
13 | ISBI 2016 | Der | Skin | Skin Lesion Classification | HF |
14 | ISIC 2019 | Der | Skin | Skin Disease Multi Classification | HF |
15 | Skin Cancer ISIC | Der | Skin | Skin Cancer Multi Classification | HF |
16 | Dental Condition Dataset | DP | Teeth | Teeth condition classification | HF |
17 | Oral Cancer Dataset | DP | Teeth | Oral cancer Classification | HF |
18 | The Nerthus Dataset | End | Intestine | Cleanliness level | HF |
19 | Endoscopic Bladder Tissue | End | Bladder | Cancer Degree Classification | HF |
20 | Kvasir | End | Intestine | Multi Disease Classification | HF |
21 | ACRIMA | FP | Fundus | Glaucoma | HF |
22 | Augmented ocular diseases AOD | FP | Fundus | Multi Classification of eye diseases | HF |
23 | JSIEC | FP | Fundus | Multi Classification of eye diseases | HF |
24 | Multi-Label Retinal Diseases | FP | Fundus | Multi Classification of eye diseases | HF |
25 | RFMiD 2.0 | FP | Fundus | Multi Classification of eye diseases | HF |
26 | ToxoFundus(Data Processed Paper) | FP | Fundus | Ocular toxoplasmosis | HF |
27 | ToxoFundus(Data Raw 6class All) | FP | Fundus | Ocular toxoplasmosis | HF |
28 | Adam dataset | FP | Fundus | Age-related Macular Degeneration | HF |
29 | APTOS 2019 Blindness | FP | Fundus | Blindness Level Identification 0~4 | HF |
30 | DRIMBD | FP | Fundus | Quality Testing of Retinal Images | HF |
31 | Glaucoma Detection | FP | Fundus | Glaucoma Classification | HF |
32 | AIROGS | FP | Fundus | Glaucoma Classification | HF |
33 | ICPR-HEp-2 | Mic | Cell | Multi Classification | HF |
34 | SICAPv2 | Mic | Cell | Cancer Degree Classification | HF |
35 | Blood Cell Images | Mic | Cell | Blood Cell Classification (Multi) | HF |
36 | BreakHis | Mic | Cell | Cell type and benign or malignant | HF |
37 | Chaoyang | Mic | Cell | Multi Classification of pathologists | HF |
38 | HuSHeM | Mic | Cell | Sperm Head Morphology Classification | HF |
39 | Bone Marrow Cell Classification | Mic | Cell | Bone Marrow Cell Classification | HF |
40 | NCT-CRC-HE-100K | Mic | Cell | Multi Classification | HF |
41 | Malignant Lymphoma Classification | Mic | Cell | Multi Classification | HF |
42 | Histopathologic Cancer Detection | Mic | Cell | Cancer Classification | HF |
43 | LC25000 | Mic | Cell | Multi Classification of Lung and Colon | HF |
44 | Brain Tumor 17 Classes | MRI | Brain | Multi Classification | HF |
45 | Tumor Classification | MRI | Brain | Pituitary or Glioma or Meningioma or Notumor | HF |
46 | Malignant Lymphoma Classification | OCT | Retina | Multi Classification of eye diseases | HF |
47 | Retinal OCT-C8 | OCT | Retina | Multi Classification of eye diseases | HF |
48 | BUSI | US | Breast | Breast Cancer | HF |
49 | Digital Knee X-Ray Images | X-Ray | Bones | Degree Classification of Knee | HF |
50 | Bone Fracture Multi-Region X-ray Data | X-Ray | Bones | Fractured Classification | HF |
51 | Fracture detection | X-Ray | Bones | Fractured Classification | HF |
52 | The vertebrae X-ray image | X-Ray | Bones | Vertebrae | HF |
53 | Knee Osteoarthritis Dataset | X-Ray | Bones | Knee Osteoarthritis with severity grading | HF |
54 | Shenzhen Chest X-Ray Set | X-Ray | Lung | COVID19, Classification Dataset | HF |
55 | Chest X-ray PD | X-Ray | Lung | COVID and Pneumonia | HF |
56 | COVID-19 CHEST X-RAY DATABASE | X-Ray | Lung | COVID and Pneumonia | HF |
57 | COVIDGR | X-Ray | Lung | COVID19, Classification | HF |
58 | MIAS | X-Ray | Breast | Multi Classification of Breast | HF |
59 | Tuberculosis Chest X-Ray Database | X-Ray | Lung | Tuberculosis | HF |
60 | Pediatric Pneumonia Chest X-Ray | X-Ray | Lung | Pneumonia Classification | HF |
61 | Random Sample of NIH Chest X-Ray Dataset | X-Ray | Chest | Multi Classification of Chest | HF |
62 | CoronaHack-Chest X-Ray | X-Ray | Lung | Pneumonia Classification with Virus Type | HF |
63 | Brain Tumor Dataset | X-Ray | Brain | Tumor Classification | HF |
64 | Fitzpatrick 17k (Nine Labels) | Der | Skin | Multi Classification | HF |
65 | BioMediTech | Mic | Cell | Multi Classification | HF |
66 | Diabetic retinopathy | FP | Fundus | Diabetic Retinopathy Level | HF |
67 | Leukemia | Mic | Cell | Cancer Classification | HF |
68 | ODIR-5K | FP | Fundus | Multiple Labels Classification | HF |
69 | Arthrosis | X-Ray | Bones | Bone Age Classification | HF |
70 | HSA-NRL | Mic | Cell | Multi Classification of pathologists | HF |
71 | ISIC 2018 (Task 3) | Der | Skin | Multi Classification | HF |
72 | ISIC 2017 (Task 3) | Der | Skin | Multi Classification | HF |
73 | ChestX-Det | X-Ray | Chest | Multi Classification | HF |
74 | Monkeypox Skin Lesion Dataset | Der | Skin | Only Monkeypox | HF |
75 | Cataract Dataset | FP | Fundus | Multi Classification | HF |
76 | ChestX-rays IndianaUniversity | X-Ray | Chest | Multi-label Classification | HF |
77 | CheXpert v1.0 small | X-Ray | Chest | Multi-label Classification | HF |
78 | CBIS-DDSM | X-Ray | Breast | Multi Classification | HF |
79 | NLM-TB | X-Ray | Lung | Tuberculosis | HF |
80 | ChestXray-NIHCC | X-Ray | Chest | Multi-label Classification | HF |
81 | COVIDx CXR-4 | X-Ray | Lung | COVID19, Classification | HF |
82 | VinDr-Mammo | X-Ray | Breast | Multi-label Classification | HF |
83 | PBC dataset normal DIB | Mic | Cell | Multi Classification | HF |
84 | Human Protein Atlas | Mic | Cell | Multi-label Classification (Only green) | HF |
85 | RSNA Pneumonia Detection Challenge 2018 | X-Ray | Chest | Multi-label Classification | HF |
86 | VinDr-SpineXR | X-Ray | Bones | Multi Classification of Bones Diseases | HF |
87 | VinDr-PCXR | X-Ray | Chest | Multi-label Classification | HF |
88 | PH2 | Der | Skin | Melanoma Segmentation | TODO |
89 | ISBI 2016 (Task3B) | Der | Skin | Melanoma Segmentation | TODO |
90 | ISIC 2016 (Task 1) | Der | Skin | Melanoma Segmentation | TODO |
91 | ISIC 2017 | Der | Skin | Melanoma Segmentation | TODO |
92 | CVC-ClinicDB | End | Intestine | Polyp Segmentation | TODO |
93 | Kvasir-SEG | End | Intestine | Polyp segmentation | TODO |
94 | m2caiseg | End | Intestine | Surgical Instrument Segmentation | TODO |
95 | EDD 2020 | End | Intestine | Multiple Diseases Segmentation in Intestine | TODO |
96 | SICAPv2 | Mic | Cell | Cancer Cells Segmentation | TODO |
97 | BUSI | US | Breast | Cancer Segmentation | TODO |
98 | TN3K | US | Thyroid | Thyroid Nodule Segmentation | TODO |
99 | NLM-TB | X-Ray | Lung | Lung Segmentation (With left or right) | TODO |
100 | VinDr-SpineXR | X-Ray | Bones | Spinal X-ray Anomaly Detection | TODO |
101 | VinDr-PCXR | X-Ray | Chest | Multiple Diseases Segmentation in Chest | TODO |
102 | ChestX-Det | X-Ray | Chest | Multiple Diseases Segmentation in Chest | TODO |
103 | UW-Madison GI Tract Image Segmentation | MRI | Intestine | Surgical Instrument Segmentation | TODO |
104 | Duke Liver Dataset MRI v1 | MRI | Liver | Liver Segmentation | TODO |
105 | Duke Liver Dataset MRI v2 | MRI | Liver | Liver Segmentation | TODO |
106 | SIIM-ACR Pneumothorax Segmentation | X-Ray | Lung | Pneumothorax Segmentation | TODO |
107 | FIVES | FP | Fundus | Fundus Vascular Segmentation | TODO |
108 | RIM-ONE DL | FP | Fundus | Optic Disc and Cup Segmentation | TODO |
109 | PALM19 | FP | Fundus | Optic Disc Segmentation | TODO |
Aggregated_Subsets
Click to view the details of 53 Subsets
No. | Modality | Area | Task | QA |
---|---|---|---|---|
01 | Co | Cervix | Cervical Picture Quality Evaluation | HF |
02 | CT | Kidney | Kidney Diseases Classification | HF |
03 | CT | Lung | COVID-19 Classification | HF |
04 | CT | Lung | Lung Cancer Classification | HF |
05 | CT | Brain | Brain Hemorrhage Classification | HF |
06 | CT | Brain | Brain Cancer Classification | HF |
07 | Der | Skin | Melanoma Type Classification | HF |
08 | Der | Skin | Skin Diseases Classification | HF |
09 | DP | Mouth | Teeth Condition Classification | HF |
10 | DP | Mouth | Oral Cancer Classification | HF |
11 | End | Intestine | Intestine Cleanliness Level | HF |
12 | End | Bladder | Cancer Degree Classification | HF |
13 | End | Intestine | Intestine Diseases Classification | HF |
14 | FP | Fundus | Eye Diseases Classification | HF |
15 | FP | Fundus | Multiple-labels Eye Diseases Classification | HF |
16 | FP | Fundus | Blindness Level | HF |
17 | FP | Fundus | Retinal Images Quality Evaluation | HF |
18 | Mic | Cell | Cell Type Classification | HF |
19 | Mic | Cell | Prostate Cancer Degree Classification | HF |
20 | Mic | Cell | Multiple-labels Blood Cell Classification | HF |
21 | Mic | Cell | Cancer Classification | HF |
22 | MRI | Brain | Head Diseases Classification | HF |
23 | OCT | Retina | Retina Diseases Classification | HF |
24 | US | Breast | Breast Cancer Classification | HF |
25 | X-ray | Bones | Degree Classification of Knee | HF |
26 | X-ray | Bones | Fractured Classification | HF |
27 | X-ray | Bones | Vertebrae Diseases Classification | HF |
28 | X-ray | Lung | COVID-19 and Pneumonia Classification | HF |
29 | X-ray | Breast | Breast Diseases Classification | HF |
30 | X-ray | Lung | Tuberculosis Classification | HF |
31 | X-ray | Chest | Multiple-labels Chest Classification | HF |
32 | X-ray | Brain | Tumor Classification | HF |
33 | Mic | Cell | Multi-labels Diseases | HF |
34 | FP | Fundus | Level Identification | HF |
35 | X-ray | Bones | Level Identification | HF |
36 | X-ray | Bones | Spinal lesion Classification | HF |
37 | X-ray | Breast | Multi-labels Diseases | HF |
38 | Der | Skin | Lesion Det/Seg | TODO |
39 | End | Intestine | Polyp Det/Seg | TODO |
40 | End | Intestine | Surgical Procedures Det/Seg | TODO |
41 | End | Intestine | Multi-labels Det/Seg | TODO |
42 | Mic | Cell | Cancer Cell Det/Seg | TODO |
43 | US | Chest | Cancer Det/Seg | TODO |
44 | US | Thyroid | Thyroid Nodule Region Det/Seg | TODO |
45 | MRI | Intestine | Multi-labels Det/Seg | TODO |
46 | MRI | Liver | Liver Det/Seg | TODO |
47 | X-ray | Lung | Lung Det/Seg | TODO |
48 | X-ray | Lung | Pneumothorax Det/Seg | TODO |
49 | X-ray | Bones | Spinal Anomaly Det | TODO |
50 | X-ray | Chest | Multi-labels Det | TODO |
51 | FP | Fundus | Vessel Seg | TODO |
52 | FP | Fundus | Optic Disc and Cup Seg | TODO |
53 | FP | Fundus | Optic Disc Seg | TODO |
After downloading the images to the "med-mat" folder and placing the corresponding JSON files as shown, you can easily access Med-MAT.
┬─ med-mat
│   ├─ CT_Kindney_Dataset
│   └─ ... (unzipped datasets)
├─ Aggregated_Subsets
│   ├─ Subset--01-train.json
│   ├─ Subset--02-train.json
│   └─ ... (other subsets)
└─ Original_Medical_Datasets
    ├─ Ori--01-train.json
    ├─ Ori--02-train.json
    └─ ... (other medical datasets)
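With that layout in place, loading one QA file might look like the sketch below. It assumes each JSON file holds a list of records with an `image` field whose path is relative to the directory containing `med-mat` (as the sample record's path suggests). `load_qa_file` and the added `image_path` key are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path

def load_qa_file(json_path, root="."):
    """Load one QA file and attach a resolved image path to each record."""
    # Assumption: the file is a JSON list of records like the sample in this README.
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    for record in records:
        # "image" already starts with "med-mat/...", so join it onto the root directory.
        record["image_path"] = Path(root) / record["image"]
    return records

if __name__ == "__main__":
    qa_file = Path("Aggregated_Subsets") / "Subset--02-train.json"
    if qa_file.exists():
        records = load_qa_file(qa_file)
        missing = [r["image"] for r in records if not r["image_path"].exists()]
        print(f"{len(records)} records, {len(missing)} images not found")
```

Checking for missing images up front is a cheap way to confirm that the downloaded datasets were unzipped into the expected folders.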
Here’s a sample from Med-MAT:
- caption: The original label from the collected medical datasets.
- image: Path to the corresponding image.
- Question and Answer: Caption-based QA pairs.
- Question-choice and Answer-choice: Multiple-choice QA pairs.
- data-no: Number of the original medical dataset the sample comes from (the "No." column in the Original_Medical_Datasets table).
{
  "id": 1,
  "caption": "Cyst",
  "image": "med-mat/CT_Kindney_Dataset/CT-KIDNEY-DATASET-Normal-Cyst-Tumor-Stone/CT-KIDNEY-DATASET-Normal-Cyst-Tumor-Stone/Cyst/Cyst- (561).jpg",
  "Question": "Review this kidney CT scan and determine the possible condition it represents.",
  "Answer": "Cyst",
  "Question-choice": "Review this kidney CT scan and determine the possible condition it represents.\nA: Stone\nB: Cyst\nC: Normal\nD: Tumor\nAnswer with the option's letter from the given choices directly.",
  "Answer-choice": "B",
  "data-no": "2"
}
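For evaluation, a record like the one above can be parsed and scored with a few lines of Python. This is only a sketch: `parse_options` and `is_correct` are hypothetical helpers, and the letter-matching heuristic assumes the model follows the instruction to answer with a single option letter.

```python
import re

record = {
    "Question-choice": (
        "Review this kidney CT scan and determine the possible condition it represents.\n"
        "A: Stone\nB: Cyst\nC: Normal\nD: Tumor\n"
        "Answer with the option's letter from the given choices directly."
    ),
    "Answer-choice": "B",
}

def parse_options(question_choice):
    """Extract {letter: option text} from a 'Question-choice' string."""
    return dict(re.findall(r"^([A-D]): (.+)$", question_choice, flags=re.MULTILINE))

def is_correct(model_output, rec):
    """Compare the first standalone option letter in the output with 'Answer-choice'."""
    match = re.search(r"\b([A-D])\b", model_output.strip())
    return match is not None and match.group(1) == rec["Answer-choice"]

options = parse_options(record["Question-choice"])
print(options[record["Answer-choice"]])
print(is_correct("B", record), is_correct("C", record))
```

The heuristic can misfire on free-form answers that begin with a standalone "A", so stricter parsing may be needed for models that do not answer with a bare letter.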
We appreciate the previous efforts in open-sourcing the medical imaging datasets used in this project.
Please be sure to credit them when citing these datasets.