Friday, May 23, 2025

Learning Biological Semantics with BioSkipGram: A Deep Dive into Sequence Embeddings

Research Theme: Computational Biology, Sequence Modeling, Representation Learning
Focus: From-scratch Word2Vec (Skip-Gram) on Genomic Data


Rethinking Language Models for Genomics

Much like human languages, biological sequences carry rich, structured information. In natural language processing (NLP), word embedding methods such as Word2Vec have revolutionized our ability to model context and meaning from unstructured text.

With BioSkipGram, I extended this concept to genomic data, building a skip-gram model from scratch in PyTorch and training it on coding sequences (CDS) from Homo sapiens. Instead of words, the model learns embeddings for k-mers, short fixed-length subsequences that act as the biological analogue of linguistic tokens.
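
To make the tokenization step concrete, here is a minimal sketch of how a CDS can be split into overlapping k-mers; the function name, k value, and stride are illustrative assumptions rather than the repository's exact settings.

def tokenize_kmers(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1 by default)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

# Example: a short CDS fragment becomes a "sentence" of 6-mer tokens.
print(tokenize_kmers("ATGGCCATTGTAATGGGCC", k=6))
# ['ATGGCC', 'TGGCCA', 'GGCCAT', 'GCCATT', ...]  (14 overlapping 6-mers in total)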


Why This Matters

This project isn't just about modeling—it’s about discovering latent structure in DNA using ideas from language. It reflects a fundamental shift in how we approach biology:

Genes aren't just biochemical—they're computational.

Viewed through that lens, this project taught me to:

  • Map natural language concepts like context, semantics, and distributional similarity to DNA sequences.

  • Design a full learning pipeline, from data preprocessing and vocabulary construction to training and embedding visualization (a sketch of the vocabulary step follows this list).

  • Understand how unsupervised representation learning uncovers subtle relationships, even in highly repetitive biological data.
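
To ground the preprocessing end of that pipeline, here is a minimal sketch of vocabulary construction over k-mer tokens with a frequency cutoff; the function name and cutoff value are illustrative assumptions.

from collections import Counter

def build_vocab(kmer_sentences, min_count: int = 5):
    """Map each sufficiently frequent k-mer to an integer id, most frequent first."""
    counts = Counter(k for sent in kmer_sentences for k in sent)
    kept = [k for k, n in counts.most_common() if n >= min_count]
    return {k: i for i, k in enumerate(kept)}

vocab = build_vocab([["ATGGCC", "TGGCCA", "ATGGCC"]], min_count=1)
# {'ATGGCC': 0, 'TGGCCA': 1}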


Research Takeaways

Through this project, I developed more than just a working model. I gained meaningful insight into key ideas at the intersection of biology and machine learning:

1. Biological Semantics Are Contextual

Even in DNA, context matters. By training on (center, context) k-mer pairs drawn from a sliding window, the model captures recurring motifs and biological "phrases", a principle shared with natural language semantics.
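
A minimal sketch of how such training pairs can be generated from a tokenized sequence, assuming a symmetric window; the window size and function name are illustrative, not the repository's exact choices.

def skipgram_pairs(kmers: list[str], window: int = 5):
    """Yield (center, context) k-mer pairs from a symmetric context window."""
    for i, center in enumerate(kmers):
        lo = max(0, i - window)
        hi = min(len(kmers), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kmers[j]

pairs = list(skipgram_pairs(["ATGGCC", "TGGCCA", "GGCCAT", "GCCATT"], window=2))
# [('ATGGCC', 'TGGCCA'), ('ATGGCC', 'GGCCAT'), ('TGGCCA', 'ATGGCC'), ...]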

2. From NLP to Genomics: The Role of Transfer Learning

This project reaffirms that ideas born in NLP, such as the skip-gram objective and dense embedding vectors, adapt readily to biological sequence analysis. The transfer here is of ideas and architectures rather than pretrained weights, but it strengthens the case for cross-domain learning paradigms.

3. Building from Scratch Builds Understanding

Implementing everything from the tokenizer to the training loop gave me an engineering-level appreciation of:

  • Negative sampling and sparse gradients

  • Vocabulary construction for biological tokens

  • Embedding space regularization and interpretability

This hands-on process significantly sharpened my ability to connect theoretical learning to real-world biological questions.
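
To make the first of those points concrete, here is a minimal PyTorch sketch of skip-gram with negative sampling over a k-mer vocabulary; the class name, embedding dimension, and batching details are illustrative assumptions rather than the repository's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Skip-gram with negative sampling over a k-mer vocabulary (illustrative)."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        # sparse=True restricts each update to the embedding rows actually used
        self.center = nn.Embedding(vocab_size, dim, sparse=True)
        self.context = nn.Embedding(vocab_size, dim, sparse=True)

    def forward(self, center_ids, context_ids, negative_ids):
        c = self.center(center_ids)        # (batch, dim)
        pos = self.context(context_ids)    # (batch, dim)
        neg = self.context(negative_ids)   # (batch, n_neg, dim)
        pos_score = (c * pos).sum(dim=-1)                         # (batch,)
        neg_score = torch.bmm(neg, c.unsqueeze(-1)).squeeze(-1)   # (batch, n_neg)
        # Pull true (center, context) pairs together, push sampled negatives apart
        loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=-1)
        return loss.mean()

Because the embeddings are created with sparse=True, the gradients are sparse as well, so an optimizer that accepts them (for example torch.optim.SparseAdam or plain SGD) is needed.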


Sample Output

Below is a t-SNE projection of the learned k-mer embeddings. It reveals clustering behavior that hints at shared biological function or sequence origin.
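
A projection like this can be produced with scikit-learn's t-SNE along the following lines; this is a minimal sketch assuming the trained center-embedding matrix from the model sketch above, and the perplexity and other parameters are assumptions, not the figure's exact settings.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Trained center-embedding matrix, shape (vocab_size, dim)
weights = model.center.weight.detach().cpu().numpy()
coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(weights)
plt.scatter(coords[:, 0], coords[:, 1], s=3)
plt.title("t-SNE of learned k-mer embeddings")
plt.show()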



Broader Impact

Projects like this are foundational for future research in:

  • Mutation detection (e.g., variant embeddings)

  • Protein-DNA interaction prediction

  • Functional annotation of non-coding regions

  • Custom embeddings for clinical genomics pipelines

As I continue my journey into bioinformatics, I see this project as a stepping stone toward more ambitious research—where computational abstraction meets biological function.


🔗 GitHub & Dataset

👉 GitHub Repository
📥 Homo sapiens CDS Dataset (NCBI)


