DNA as Language
The idea is deceptively simple: DNA is a sequence of characters (A, T, G, C), much like text is a sequence of characters. If transformer-based language models can learn the grammar and semantics of human language from vast text corpora, could similar models learn the "grammar" of genomics from DNA sequences?
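To make the text analogy concrete, here is a tiny sketch of the overlapping k-mer tokenization used by early DNA language models such as the original DNABERT; the example sequence is made up.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into overlapping k-mer tokens (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy sequence for illustration only
seq = "ATGCGTACGTTAGC"
print(kmer_tokenize(seq, k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT', 'TACGTT', 'ACGTTA', 'CGTTAG', 'GTTAGC']
```

Once DNA is tokenized this way, the standard language-model machinery (masked or causal prediction over token sequences) applies essentially unchanged.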
The answer, it turns out, is a resounding yes. Over the past two years, a new class of models — genomic foundation models — has emerged, and they're showing remarkable ability to understand the functional implications of DNA sequence variation.
Key Models in the Landscape
Geneformer
Developed at Harvard, Geneformer is a transformer model pre-trained on approximately 30 million single-cell transcriptomes. Rather than operating directly on DNA sequences, it learns representations of gene expression programs across cell types. After pre-training, Geneformer can be fine-tuned for tasks like predicting the effect of gene perturbations, identifying disease-driving genes, and classifying cell types with minimal labeled data.
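As a rough illustration of the pre-train-then-fine-tune pattern, the sketch below loads a Geneformer-style checkpoint with a freshly initialised classification head and runs a dummy forward pass. The checkpoint id, label count, and random input ids are assumptions; in practice the inputs come from Geneformer's own rank-value encoding of each cell's transcriptome, and the model would then be fine-tuned on labelled cells.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Checkpoint id and label count are illustrative assumptions.
model = AutoModelForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=3
)

# Dummy batch of 2 "cells": real inputs are Geneformer's rank-value encodings
# (each cell's genes ordered by expression), not random ids.
input_ids = torch.randint(low=0, high=model.config.vocab_size, size=(2, 2048))
with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # shape (2, 3): one score per class
print(logits.shape)
```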
scGPT
scGPT applies the generative pre-training paradigm (similar to GPT) to single-cell biology. Pre-trained on over 33 million cells from the CELLxGENE database, scGPT learns a universal cell representation that can be adapted to cell type annotation, multi-batch integration, perturbation prediction, and gene network inference. Its ability to perform these tasks with minimal fine-tuning (few-shot learning) is particularly impressive.
Nucleotide Transformer
InstaDeep's Nucleotide Transformer is trained directly on DNA sequences, spanning the human reference genome, 3,202 genetically diverse human genomes, and 850 genomes from other species. With up to 2.5 billion parameters, it can predict the functional impact of genetic variants, identify regulatory elements, and classify genomic regions by function. The key insight is that by learning from the evolutionary conservation patterns across species, the model develops a deep understanding of which sequences are functionally important.
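To give a feel for what working with a DNA language model looks like, here is a minimal sketch that pulls a per-sequence embedding out of a Nucleotide Transformer checkpoint via the Hugging Face transformers API. The checkpoint id and toy sequence are assumptions, and smaller variants can be substituted if GPU memory is tight.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Checkpoint id is illustrative; smaller variants exist if memory is limited.
ckpt = "InstaDeepAI/nucleotide-transformer-2.5b-multi-species"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(ckpt, trust_remote_code=True)

# Toy sequence (made up); real inputs would be windows around a locus of interest.
seq = "ATGCCGTAAGCTTGGCATCGATCGATCGTAGCTAGCTAGGCTA"
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Mean-pool the final hidden layer into one embedding vector for the sequence.
embedding = out.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```

Embeddings like this are the raw material for the downstream applications discussed later: variant scoring, regulatory element classification, and more.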
Evo
Perhaps the most ambitious model, Evo from the Arc Institute is a 7-billion-parameter model trained on 2.7 million prokaryotic and phage genomes at single-nucleotide resolution. Evo can generate functional DNA sequences, predict the effects of mutations, and even design novel biological systems. It represents a step toward generative biology — the ability to write, not just read, the code of life.
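As a heavily hedged sketch of what generative use looks like, the snippet below prompts a causal DNA model and samples a continuation. The checkpoint id is an assumption, the seed sequence is made up, and whether the model's custom code supports Hugging Face's standard generate() interface depends on the release; the Evo project also ships its own sampling utilities.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: checkpoint id, and that the custom (trust_remote_code) model
# exposes a standard causal-LM generation interface.
ckpt = "togethercomputer/evo-1-8k-base"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)

# Made-up seed sequence; the model operates at single-nucleotide resolution,
# so each generated token is one base.
prompt = "ATGGCGCGTAAA"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0]))
```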
DNABERT-2
Building on the original DNABERT, DNABERT-2 uses byte pair encoding (BPE) instead of fixed k-mer tokenization, allowing it to learn more flexible representations of DNA sequences. It achieves state-of-the-art performance on a wide range of genomic benchmarks while being more efficient than models with larger parameter counts.
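To see how BPE differs from fixed k-mers, the toy sketch below trains a tiny BPE vocabulary on a handful of made-up DNA strings using the Hugging Face tokenizers library; DNABERT-2's actual vocabulary was learned over whole genomes and is far larger.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Toy corpus of made-up DNA strings; the real model learns BPE merges over
# genomes, so recurring motifs collapse into single variable-length tokens.
corpus = ["ATGCGATCGATCGATCG", "TATAAAGGCATGCGATC", "ATGCGATCGTTTTATAA"] * 100

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# Frequent substrings merge into longer tokens, unlike fixed 6-mers which
# always have the same length regardless of content.
print(tokenizer.encode("ATGCGATCGATCGTTTT").tokens)
```

The practical upshot is shorter token sequences for repetitive regions and a vocabulary adapted to the statistics of the genome rather than fixed in advance.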
Why This Matters
Traditional bioinformatics tools rely on hand-crafted features and explicit rules. Foundation models learn features directly from data, potentially capturing complex patterns that human experts might miss. This is particularly valuable for understanding non-coding DNA — the 98% of the genome that doesn't encode proteins but plays crucial regulatory roles.
Practical Applications
Variant Effect Prediction
The most immediately impactful application is predicting the functional consequences of genetic variants. When a patient's genome reveals a variant of uncertain significance (VUS), foundation models can provide a quantitative prediction of its likely impact on gene regulation, splicing, or protein function. Models like Enformer and the Nucleotide Transformer are already outperforming traditional methods on variant effect prediction benchmarks.
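One common zero-shot recipe, sketched below under the assumption of a DNA masked-language-model checkpoint, is to score the reference and alternate sequence windows and compare their log-likelihoods: a large drop for the alternate allele is a hint (not proof) of functional impact. The checkpoint id, window, and alleles are all illustrative, and a rigorous implementation would use a per-position pseudo-likelihood rather than this crude single-pass score.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Checkpoint id is illustrative; any DNA masked-LM could be substituted.
ckpt = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(ckpt, trust_remote_code=True).eval()

def sequence_log_likelihood(seq: str) -> float:
    """Crude sequence score: sum of log-probs of each observed token.
    A stricter version would mask one position at a time (pseudo-likelihood)."""
    enc = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits                      # (1, seq_len, vocab)
    log_probs = logits.log_softmax(dim=-1)
    token_scores = log_probs.gather(-1, enc["input_ids"].unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

# Made-up 41-bp window with the variant at the centre: ref allele G, alt allele A.
ref_window = "TTGACCTAGGCATCCGATTG" + "G" + "CGATTACGGATCCATGCAAT"
alt_window = ref_window[:20] + "A" + ref_window[21:]

llr = sequence_log_likelihood(ref_window) - sequence_log_likelihood(alt_window)
print(f"ref-vs-alt log-likelihood ratio: {llr:.3f}")  # larger => alt looks less 'natural'
```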
Regulatory Element Discovery
Understanding which non-coding sequences are functional — enhancers, promoters, silencers, insulators — is one of the biggest challenges in genomics. Foundation models trained on DNA sequences can identify these elements with high accuracy, even in cell types and tissues for which limited experimental data exists.
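A lightweight way to put foundation models to work here, sketched below, is "probing": freeze the model, pool its embeddings for candidate regions, and train a small classifier on whatever labelled enhancers and matched background regions are available. The arrays below are simulated stand-ins; in practice X would hold pooled embeddings like those in the earlier embedding sketch and y would come from experimental assays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulated placeholders: X = pooled foundation-model embeddings of candidate
# regions, y = 1 for validated enhancers, 0 for matched background.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1280))   # e.g. 1280-dim embeddings, simulated here
y = rng.integers(0, 2, size=1000)   # simulated labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Probing is cheap enough to repeat per cell type, which is exactly where experimental labels tend to be scarce.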
Cell Type Annotation
In single-cell genomics, annotating cell types is a critical but often manual and time-consuming step. Models like scGPT and Geneformer can perform automated cell type annotation with accuracy rivaling expert human annotators, dramatically accelerating single-cell analysis workflows.
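One simple automated-annotation recipe built on these models, sketched with simulated data below, is to embed a labelled reference atlas and the unlabelled query cells with the same foundation model and transfer labels by a k-nearest-neighbour vote in embedding space. The embedding step itself (via Geneformer or scGPT) is assumed to have already produced the arrays.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Simulated placeholders for cell embeddings produced by a model such as
# Geneformer or scGPT.
rng = np.random.default_rng(1)
ref_emb = rng.normal(size=(5000, 512))                  # reference atlas cells
ref_labels = rng.choice(["T cell", "B cell", "NK"], size=5000)
query_emb = rng.normal(size=(200, 512))                 # new, unannotated cells

knn = KNeighborsClassifier(n_neighbors=15).fit(ref_emb, ref_labels)
predicted = knn.predict(query_emb)
confidence = knn.predict_proba(query_emb).max(axis=1)   # flag low-confidence calls
print(predicted[:5], confidence[:5])
```

Keeping the per-cell confidence alongside the label makes it easy to route ambiguous cells back to a human annotator.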
Drug Target Prioritization
By understanding which genes and regulatory elements are most functionally important in disease-relevant cell types, foundation models can help prioritize drug targets. Geneformer has demonstrated the ability to identify therapeutic targets for diseases including cardiomyopathy and COVID-19.
Challenges and Limitations
Despite their promise, genomic foundation models face important challenges:
- Interpretability: Like all deep learning models, foundation models are largely black boxes. Understanding why a model makes a particular prediction is crucial for clinical applications but remains difficult
- Training data bias: Models trained predominantly on European-ancestry genomes may perform less well for other populations — amplifying existing biases in genomics
- Compute requirements: Training these models requires massive GPU resources. Evo's training, for example, required thousands of GPU-hours. Even inference can be computationally expensive for the largest models
- Validation: The field lacks standardized benchmarks for rigorously evaluating genomic foundation models. Performance on academic benchmarks doesn't always translate to real-world utility
- Hallucination: Like text LLMs, genomic models can generate plausible-looking but incorrect predictions. Robust confidence calibration is essential for clinical use (a quick calibration check is sketched after this list)
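On the calibration point above, a first sanity check is a reliability curve: bin the model's predicted probabilities and compare each bin's mean prediction with the observed fraction of positives. The sketch below uses simulated scores standing in for a genomic classifier's output; a well-calibrated model tracks the diagonal.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Simulated predictions and labels standing in for a genomic classifier.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=2000), 0, 1)

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```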
Getting Started
For organizations looking to leverage genomic foundation models, we recommend starting with well-established models (Geneformer, scGPT) applied to well-defined tasks where you have ground truth data for validation. Build evaluation frameworks before deploying predictions in production. And invest in the GPU infrastructure and ML engineering expertise needed to fine-tune and serve these models reliably.
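To make "build evaluation frameworks first" concrete, here is a minimal sketch of the kind of harness we mean: hold out a labelled benchmark with known ground truth, score the model's predictions against it, and gate deployment on agreed thresholds. The metrics, threshold, and simulated prediction arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def evaluate_predictions(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """Score predicted probabilities against held-out ground-truth labels."""
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
    }

# Simulated stand-ins for a labelled validation set and a model's predictions.
rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(y_true * 0.5 + rng.normal(0.25, 0.2, size=500), 0, 1)

metrics = evaluate_predictions(y_true, y_prob)
print(metrics)

# Illustrative deployment gate: block promotion to production below the bar.
assert metrics["auroc"] >= 0.8, "model does not meet the agreed release bar"
```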
At Next Generation Consulting, we help organizations integrate foundation models into their bioinformatics platforms — from infrastructure provisioning to model fine-tuning to production deployment with appropriate guardrails.