Challenge to the reader: Before reading, sketch your own mental map of where cancer research data comes from. Write down the data repositories, analysis tools, and databases you know. After reading, compare your map to the 15-layer stack below. What layers were you missing?


Modern biology research is drowning in tools and databases — and that’s a feature, not a bug. A single clinical research workflow might pull data from TCGA, preprocess it with an nf-core pipeline, analyze it in Seurat, integrate it through a multi-omics framework, interpret it against pathway databases, and translate findings through drug discovery platforms. Each layer depends on the ones below it.

This post is a curated list of approximately 100 resources used by real clinical and translational biology researchers across cancer, immunology, aging, multi-omics, drug discovery, and computational biology. They are grouped by practical research workflow layers so the map is actually usable.


1. Core Global Biology Data Repositories (Foundational)

These are the primary data backbones for modern biology research.

Cancer / Disease Mega-Datasets

  1. The Cancer Genome Atlas (TCGA) — multi-omics cancer dataset covering more than 20,000 primary cancer and matched normal samples across 33 cancer types. TCGA alone produced petabytes of multi-omics data and transformed molecular cancer classification [1].
  2. COSMIC — somatic mutations in cancer.
  3. Cancer Genome Anatomy Project (CGAP).
  4. The Cancer Imaging Archive (TCIA).
  5. Network of Cancer Genes (NCG).

Major Functional Genomics Repositories

  1. NCBI GEO (Gene Expression Omnibus) — hosts millions of samples across 200,000+ studies [2].
  2. ArrayExpress.
  3. ENCODE.
  4. GTEx.
  5. SRA (Sequence Read Archive).
  6. BioProject.
  7. BioSample.
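
Most of the NCBI-hosted repositories above (GEO, SRA, BioProject, BioSample) can be searched programmatically through NCBI's E-utilities. The sketch below only builds an ESearch URL for GEO and makes no network call. The search term and the helper name build_geo_search_url are illustrative assumptions, not something this list prescribes.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint; GEO records are searched with db="gds".
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that looks up GEO records matching `term`.

    Hypothetical helper for illustration: it only assembles the query
    string and does not contact NCBI.
    """
    params = {
        "db": "gds",            # GEO DataSets database
        "term": term,           # free-text or fielded query
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Example query (an assumption): colorectal cancer series records.
    print(build_geo_search_url("colorectal cancer AND gse[Entry Type]"))
```

Passing the resulting URL to urllib.request.urlopen would run the search. NCBI asks heavy users to identify themselves and rate-limit requests, so check the E-utilities usage guidelines before scripting anything large.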

Multi-Omics Integrated Resources

  1. cBioPortal.
  2. DepMap.
  3. Human Protein Atlas.
  4. ProteomicsDB.
  5. TCGA Pan-Cancer Atlas.

2. GitHub Curated Bioinformatics Resource Lists (Start Here)

These act as meta-indexes to thousands of tools.

  1. openbiox/awesome-bioinformatics
  2. mdozmorov/Immuno_notes
  3. OMICtools search engine — indexes 18,000+ bioinformatics tools [3].
  4. Bioinformatics-papers list repositories.
  5. Biostar handbook repositories.

Challenge: OMICtools indexes 18,000+ tools. Pick one tool from the awesome-bioinformatics list that you’ve never heard of, read its README, and write down one experiment it could enable.


3. Cancer Research Toolchains (GitHub-Heavy)

Key software pipelines used in research labs.

Genomics Analysis

  1. GATK — the callers in this list overlap heavily with the variant callers used in TCGA pipelines [4].
  2. MuTect2.
  3. VarScan2.
  4. Pindel.
  5. Strelka.

RNA-Seq Workflows

  1. nf-core RNA-seq.
  2. STAR aligner.
  3. HISAT2.
  4. Salmon.
  5. kallisto.
  6. DESeq2.
  7. edgeR.

Multi-Omics Integration

  1. DRPPM-EASY.
  2. Cancer Multi-Omics Benchmark (CMOB) — provides ready-processed datasets across 32 cancers [5].
  3. MultiAssayExperiment.
  4. iClusterPlus.

4. Immunology-Specific Research Tools

Critical for immunotherapy and immune system modeling.

Repertoire Sequencing

  1. Immcantation framework.
  2. MiXCR.
  3. AIRRflow.

Immune Deconvolution Tools

  1. CIBERSORT.
  2. TIMER.
  3. xCell.
  4. EPIC.

Immunology Datasets

  1. ImmPort.
  2. IEDB (Immune Epitope Database).
  3. VDJdb.

5. Single-Cell Biology Research Tools

A massive frontier area.

  1. Seurat.
  2. Scanpy.
  3. Monocle.
  4. Cell Ranger.
  5. Harmony.
  6. CellPhoneDB.

Single-Cell Datasets

  1. Human Cell Atlas.
  2. Single Cell Portal.
  3. PanglaoDB.

6. Aging / Longevity Research Databases

Essential for geroscience.

  1. GenAge.
  2. LongevityMap.
  3. Human Ageing Genomic Resources (HAGR).
  4. Aging Atlas.
  5. SenNet.

7. Structural Biology & Protein Tools

Used in drug discovery and immunology.

  1. AlphaFold DB.
  2. PDB (Protein Data Bank).
  3. Rosetta.
  4. FoldX.
  5. PyMOL.

8. Drug Discovery & Pharmacogenomics Resources

Important in translational oncology.

  1. DrugBank.
  2. ChEMBL.
  3. LINCS L1000.
  4. Open Targets Platform.
  5. PharmGKB.

9. Pathway & Systems Biology Tools

  1. KEGG.
  2. Reactome.
  3. STRING.
  4. BioGRID.
  5. Cytoscape.
  6. GenMAPP — integrates gene-level datasets with pathways for disease analysis [6].

10. Machine Learning in Biology Repositories

A rapidly growing frontier.

  1. DeepChem.
  2. BioBERT.
  3. DNABERT.
  4. ESM protein language models.
  5. AlphaFold-Multimer.

Challenge: DeepChem vs. BioBERT — one is for molecules, one is for literature. If you had to build a system that links published cancer mutations to candidate drugs, which would you use for each step of the pipeline?


11. Clinical Research & Translational Platforms

  1. ClinicalTrials.gov dataset APIs.
  2. OHDSI / OMOP.
  3. i2b2.
  4. REDCap open tools.

12. Imaging & Radiomics Resources

  1. TCIA radiomics tools.
  2. PyRadiomics.
  3. MONAI (medical AI).

13. Microbiome / Metagenomics Tools

  1. QIIME2.
  2. Kraken2.
  3. MetaPhlAn.
  4. HUMAnN.

14. Text Mining & Knowledge Graph Resources

  1. PubTator.
  2. Europe PMC mining.
  3. BioASQ datasets.

15. Experimental Protocol Repositories

  1. Protocols.io.
  2. Addgene plasmid repository.
  3. Benchling open tools.

How Frontier Biology Research Actually Works

A real clinical research workflow typically uses:

RAW DATA → GEO / TCGA
     ↓
Preprocessing → nf-core pipelines
     ↓
Analysis → Seurat / DESeq2
     ↓
Integration → Multi-omics frameworks
     ↓
Interpretation → Pathway / protein databases
     ↓
Translation → drug discovery resources

Each arrow in this pipeline is a place where tool selection can make or break a project. Whether a result ends up publishable often comes down to choosing the right tool for each layer, and to knowing that the tool exists in the first place.
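
One way to make the workflow concrete is to write it down as data. The sketch below encodes the six stages from the diagram and lists the handoffs between them, one per arrow. The Stage class and the helper names are hypothetical, chosen only for illustration; the stage names and example resources come from the diagram above.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    """One layer of the workflow and the example resources named for it."""
    name: str
    resources: Tuple[str, ...]


# The six stages of the diagram, in order from raw data to translation.
PIPELINE: Tuple[Stage, ...] = (
    Stage("Raw data", ("GEO", "TCGA")),
    Stage("Preprocessing", ("nf-core pipelines",)),
    Stage("Analysis", ("Seurat", "DESeq2")),
    Stage("Integration", ("Multi-omics frameworks",)),
    Stage("Interpretation", ("Pathway databases", "Protein databases")),
    Stage("Translation", ("Drug discovery resources",)),
)


def handoffs(pipeline: Sequence[Stage]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) stage-name pairs, one per arrow."""
    return [(a.name, b.name) for a, b in zip(pipeline, pipeline[1:])]


def upstream_of(pipeline: Sequence[Stage], stage_name: str) -> List[str]:
    """Return the names of every stage that `stage_name` depends on."""
    names = [stage.name for stage in pipeline]
    return names[: names.index(stage_name)]


if __name__ == "__main__":
    for up, down in handoffs(PIPELINE):
        print(f"{up} -> {down}")
```

Six stages give five handoffs, and each one is a decision point about which tool's output feeds the next layer. A real project would likely branch rather than run as a single chain, so treat this linear version as a planning aid, not a workflow engine.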

Final challenge: You’re a new PI starting a lab focused on immuno-oncology in colorectal cancer. Your first project aims to identify why some patients respond to checkpoint inhibitors while others don’t. Using only resources listed above, map out a complete data-to-drug pipeline: which datasets will you query, which preprocessing and analysis tools will you use, and which drug discovery databases will you search for candidate compounds? Write the pipeline as a numbered list of steps, each annotated with the specific resource from the list above.