SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification
Akarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee
TLDR
SCOPE introduces a Siamese MLP with protein language model embeddings for scalable operon pair classification, achieving competitive ROC-AUC.
Key contributions
- Introduces SCOPE, a Siamese MLP for operon pair classification using protein language model embeddings.
- Demonstrates that protein language model embeddings substantially outperform traditional physicochemical features.
- Achieves a competitive ROC-AUC of 0.71 on the DGEB benchmark for operonic pair classification.
- Suggests the embedding space geometry already captures functional relationships for this task.
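The contrast between the two scoring approaches above can be sketched in a few lines: the DGEB baseline scores a gene pair by cosine similarity of independently computed embeddings, while SCOPE's Siamese head classifies a fused pair representation. This is a minimal numpy sketch under stated assumptions — the fusion scheme (|u − v| concatenated with u ⊙ v) and the 2-layer head are illustrative choices, not the paper's exact architecture, and training is omitted.

```python
import numpy as np

def cosine_score(u, v):
    """Unsupervised DGEB-style pair score: cosine similarity of two
    independently computed protein language model embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fuse(u, v):
    """Symmetric fusion of a pair of embeddings (illustrative choice:
    concatenate |u - v| with u * v; the paper does not specify its scheme)."""
    return np.concatenate([np.abs(u - v), u * v])

class SiameseMLPHead:
    """Minimal 2-layer MLP head over the fused embedding (forward pass only;
    weights here are random placeholders, not trained parameters)."""

    def __init__(self, in_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def predict_proba(self, u, v):
        x = fuse(u, v)
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # ReLU hidden layer
        z = h @ self.W2 + self.b2
        return float(1.0 / (1.0 + np.exp(-z)))      # sigmoid -> P(operonic)
```

The symmetric fusion guarantees the head gives the same score for (u, v) and (v, u), which matches the unordered nature of a gene pair; a learned head like this can, in principle, exploit directions in the embedding space that cosine similarity weighs poorly.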
Why it matters
This paper provides a scalable computational method for identifying operons, a fundamental step in understanding prokaryotic gene regulation. It enables automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations, with applications to drug candidate development.
Original Abstract
Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.