HLA Peptide Discovery — Scientific Data Platform Modernisation
From fragmented Excel workbooks and email-driven handoffs to a scalable, query-driven scientific intelligence platform — enabling Immuno-Oncology researchers to discover, filter, and interrogate HLA peptide targets across 283,000+ peptides with multi-dimensional precision.

A scalability crisis hidden inside spreadsheets
Immuno-Oncology peptide discovery generates an enormous and ever-growing volume of data. MS/MS experiments produce hundreds of thousands of peptide candidates that must be cross-referenced against HLA typing, RNA-Seq expression, NetMHC binding predictions, tumour expression data, and off-target risk algorithms before any target can be confidently nominated.
The legacy workflow managed this complexity through per-experiment Excel workbooks, manual script execution, and email-based coordination between Proteomics, Bioinformatics, and Therapeutic Focus Area teams. With 283,000+ peptides already accumulated and millions projected, the system was heading toward a data integrity and scalability crisis.
Scientists could not easily answer fundamental discovery questions: Was this peptide ever observed before? In how many human samples? Is it tumour-enriched? What are its off-target risks? The answers existed in the data, but the architecture made them inaccessible.
Key Challenges
No centralised database — experiments tracked in per-experiment Excel workbooks and CSV exports
30–40 MS/MS samples processed monthly and 283,000+ peptides accumulated, with no structured query capability
Manual PEAKS export, script execution, and NetMHC queries, all dependent on a few key individuals
No linkage between summarised results and raw experimental evidence
Email-based coordination between Proteomics, Bioinformatics, and Therapeutic Focus Areas
Peptide counts projected to grow exponentially into the millions, beyond what the existing architecture could support
Key Requirements
Centralised scientific database at patient, experiment, and peptide level
Automated ingestion from PEAKS MS/MS exports with HLA typing and RNA-Seq integration
Complex multi-parameter query engine with threshold-based filtering
Drill-down from summarised results to raw experimental MS evidence
Automated annotation and off-target script triggering
Scalable architecture supporting millions of peptides without performance degradation
From raw MS data to actionable discovery intelligence
Every experiment flows through an automated pipeline — ingested, integrated with genomic data, indexed, and made instantly queryable with drill-down to raw spectral evidence.
A purpose-built scientific discovery engine
The platform was delivered in three structured phases — each building on the last to progressively replace manual workflows, integrate genomic data sources, and unlock increasingly sophisticated discovery capabilities.
Automated Data Ingestion
Event-driven ingestion pipeline processes raw PEAKS MS/MS exports automatically — normalising, deduplicating, and structuring peptide data at patient, experiment, and peptide level without manual intervention
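The normalise-and-deduplicate step can be illustrated with a minimal sketch. The column names (`Peptide`, `Sample`, `Score`), the `ingest_export` helper, and the duplicate rule (same peptide, same sample) are assumptions made for illustration; they are not the actual PEAKS export layout or the platform's implementation.

```python
import csv
import io


def normalise_sequence(raw: str) -> str:
    """Strip whitespace and upper-case a peptide sequence."""
    return raw.strip().upper()


def ingest_export(csv_text: str, experiment_id: str) -> list[dict]:
    """Parse one export, normalise sequences, and drop duplicate rows.

    Assumption: a row is a duplicate when the same peptide appears more
    than once for the same sample; the first occurrence is kept.
    """
    seen = set()
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        peptide = normalise_sequence(row["Peptide"])
        sample = row["Sample"].strip()
        key = (sample, peptide)
        if not peptide or key in seen:
            continue
        seen.add(key)
        records.append(
            {
                "experiment_id": experiment_id,
                "sample_id": sample,
                "peptide": peptide,
                "score": float(row["Score"]),
            }
        )
    return records


# Synthetic example data, not real experimental results.
example = """Peptide,Sample,Score
slyntvatl,P001,42.1
SLYNTVATL ,P001,40.3
GILGFVFTL,P001,55.0
"""
records = ingest_export(example, "EXP-1")
print(len(records))
```

Keying on `(sample, peptide)` means the same peptide seen in a different sample is still recorded, which is what later frequency counts across samples would rely on.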
Scientific Database Architecture
Purpose-built relational database stores peptide records across hundreds of cell lines and patient samples, with HLA typing, RNA-Seq metadata, and NetMHC binding predictions co-located for unified querying
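As a rough illustration of what co-locating these data sets in one relational model might look like, the sketch below creates a small schema in an in-memory SQLite database. Every table and column name here is hypothetical, and SQLite is used only so the example runs anywhere; the source does not state which database engine or schema the platform uses.

```python
import sqlite3

# Hypothetical schema: patient -> experiment -> observation, with HLA typing
# and binding predictions stored alongside for joint querying.
SCHEMA = """
CREATE TABLE patient (
    patient_id  TEXT PRIMARY KEY,
    sample_type TEXT NOT NULL              -- e.g. 'patient' or 'cell_line'
);
CREATE TABLE hla_typing (
    patient_id TEXT NOT NULL REFERENCES patient(patient_id),
    allele     TEXT NOT NULL,
    PRIMARY KEY (patient_id, allele)
);
CREATE TABLE experiment (
    experiment_id TEXT PRIMARY KEY,
    patient_id    TEXT NOT NULL REFERENCES patient(patient_id)
);
CREATE TABLE peptide (
    peptide_id INTEGER PRIMARY KEY,
    sequence   TEXT NOT NULL UNIQUE
);
CREATE TABLE observation (
    experiment_id TEXT NOT NULL REFERENCES experiment(experiment_id),
    peptide_id    INTEGER NOT NULL REFERENCES peptide(peptide_id),
    ion_score     REAL,
    PRIMARY KEY (experiment_id, peptide_id)
);
CREATE TABLE binding_prediction (
    peptide_id INTEGER NOT NULL REFERENCES peptide(peptide_id),
    allele     TEXT NOT NULL,
    pct_rank   REAL NOT NULL,
    PRIMARY KEY (peptide_id, allele)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Storing each peptide sequence once and linking it to experiments through an observation table is one way to answer "was this peptide ever observed before, and in how many samples?" with a single join.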
Multi-Dimensional Query Engine
Researchers can filter across expression thresholds, binding affinity, off-target risk scores, sample frequency, and tissue type simultaneously — enabling complex discovery queries that previously required days of manual consolidation
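A simultaneous multi-threshold filter of this kind could be sketched as a parameterised query builder. The `peptide_summary` table, its columns, the filter names, and the data rows are all hypothetical and synthetic; this is one possible shape for such a query layer, not the platform's actual engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE peptide_summary (
        sequence TEXT, tissue_type TEXT, expression REAL,
        pct_rank REAL, off_target_score REAL, sample_count INTEGER)"""
)
# Synthetic rows for illustration only.
conn.executemany(
    "INSERT INTO peptide_summary VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("PEPTIDEA", "tumour", 12.0, 0.3, 0.1, 5),
        ("PEPTIDEB", "tumour", 2.0, 0.4, 0.1, 7),
        ("PEPTIDEC", "normal", 15.0, 0.2, 0.6, 3),
        ("PEPTIDED", "tumour", 20.0, 1.5, 0.2, 1),
    ],
)

# Whitelist of supported filters; values are always bound as parameters,
# never interpolated into the SQL string.
FILTERS = {
    "min_expression": "expression >= ?",
    "max_pct_rank": "pct_rank <= ?",
    "max_off_target": "off_target_score <= ?",
    "min_samples": "sample_count >= ?",
    "tissue_type": "tissue_type = ?",
}


def query_peptides(conn, **criteria):
    """Return sequences matching every supplied criterion."""
    clauses, params = [], []
    for name, value in criteria.items():
        clauses.append(FILTERS[name])  # raises KeyError for unknown filters
        params.append(value)
    where = " AND ".join(clauses) or "1=1"
    sql = f"SELECT sequence FROM peptide_summary WHERE {where} ORDER BY sequence"
    return [row[0] for row in conn.execute(sql, params)]


hits = query_peptides(
    conn,
    min_expression=10.0,
    max_pct_rank=0.5,
    max_off_target=0.5,
    tissue_type="tumour",
)
print(hits)
```

Because each criterion is optional, the same function serves both simple single-threshold lookups and complex combined queries.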
Drill-Down to Raw Evidence
Every summarised result is traceable to its source MS/MS experimental data — scientists can navigate from a peptide summary down to the raw spectral evidence in seconds
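Traceability from a summary row to its raw evidence can be sketched by deriving the summary from the evidence table itself, so that each summary row resolves back to specific spectra. The `psm` table, its columns, the file names, and the scan numbers below are hypothetical and synthetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE psm (              -- one row per peptide-spectrum match
        sequence      TEXT NOT NULL,
        experiment_id TEXT NOT NULL,
        scan_number   INTEGER NOT NULL,
        raw_file      TEXT NOT NULL,
        ion_score     REAL NOT NULL
    );
    INSERT INTO psm VALUES
        ('PEPTIDEA', 'EXP-1', 1042, 'exp1.raw', 61.5),
        ('PEPTIDEA', 'EXP-2',  877, 'exp2.raw', 48.0),
        ('PEPTIDEB', 'EXP-1', 2210, 'exp1.raw', 39.9);
    """
)


def summarise(conn):
    """One row per peptide: experiments observed in and best ion score."""
    sql = """SELECT sequence, COUNT(DISTINCT experiment_id), MAX(ion_score)
             FROM psm GROUP BY sequence ORDER BY sequence"""
    return conn.execute(sql).fetchall()


def drill_down(conn, sequence):
    """Return the raw spectral evidence behind one summary row."""
    sql = """SELECT experiment_id, raw_file, scan_number, ion_score
             FROM psm WHERE sequence = ? ORDER BY ion_score DESC"""
    return conn.execute(sql, (sequence,)).fetchall()


summary = summarise(conn)
evidence = drill_down(conn, "PEPTIDEA")
print(summary)
print(evidence)
```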
Annotation & Off-Target Automation
Annotation and off-target prediction scripts are triggered automatically on ingestion — eliminating manual execution steps and reducing the risk of missed processing or version inconsistencies
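One way to picture scripts being triggered on ingestion is a small hook registry that runs every registered script for each new experiment and records which version ran. The `IngestionPipeline` class, the two placeholder scripts, and the version strings are hypothetical; the real annotation and off-target algorithms are not described in the source.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IngestionPipeline:
    """Runs every registered script whenever an experiment is ingested."""

    hooks: list[tuple[str, str, Callable]] = field(default_factory=list)
    run_log: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, name: str, version: str, fn: Callable) -> None:
        self.hooks.append((name, version, fn))

    def ingest(self, experiment_id: str, peptides: list[str]) -> dict:
        results = {}
        for name, version, fn in self.hooks:
            results[name] = fn(peptides)
            # Record which script version processed which experiment.
            self.run_log.append((experiment_id, name, version))
        return results


# Placeholder scripts standing in for the real annotation and off-target steps.
def annotate_lengths(peptides):
    return {p: len(p) for p in peptides}


KNOWN_NORMAL_TISSUE_PEPTIDES = {"PEPTIDEB"}  # synthetic reference set


def flag_off_target(peptides):
    return {p: p in KNOWN_NORMAL_TISSUE_PEPTIDES for p in peptides}


pipeline = IngestionPipeline()
pipeline.register("annotation", "1.0.0", annotate_lengths)
pipeline.register("off_target", "2.1.0", flag_off_target)
results = pipeline.ingest("EXP-1", ["PEPTIDEA", "PEPTIDEB"])
print(pipeline.run_log)
```

Logging the script version with every run is what makes it possible to spot experiments processed by an outdated version and re-run them.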
Ecosystem Integration Roadmap
Platform architected for integration with Benchling (research registry), BSI (specimen inventory), and external genomic datasets — TCGA tumour expression and GTEx normal tissue expression already incorporated
Built in three progressive phases
Each phase delivered immediate value while building the foundations for the next capability layer.
Data Ingestion, Storage & Query Foundation
Centralised scientific database design at patient and experiment level
Raw PEAKS MS data and HLA typing ingestion
Simple and complex multi-threshold query engines
Historical dataset migration (283K+ peptides)
Export interfaces for annotation and off-target pipelines
Advanced Querying & System Integration
Patient vs Cell Line frequency field split
Ion Score added as searchable metric
NetMHC predictions integrated directly
TCGA and GTEx expression datasets ingested
Drill-down from summary to experiment-level data
Tissue-type filtering: Tumour vs Normal vs Cell Line
Enhancements & Ecosystem Integration
Script version control and re-run capability
Automated notification workflows
Integration roadmap: Benchling, BSI specimen inventory, Research Data Lake API
Cross-therapeutic reuse positioning
From weeks of manual effort to seconds of structured discovery
The platform fundamentally changed how the organisation interacts with its peptide data. Scientists can now ask complex multi-dimensional questions — combining expression thresholds, binding affinity, off-target risk, and sample frequency — and receive results instantly, with full drill-down to raw spectral evidence.
By eliminating Excel-based restructuring, manual script execution, and email-driven coordination, the platform recovered the equivalent of 3–6 FTEs of effort annually, while accelerating target nomination cycles and reducing the risk of knowledge loss from key-person dependency.
Beyond efficiency — changing what's scientifically possible
Tumour-Specific Prioritisation
Scientists can now identify and prioritise peptides enriched in tumour tissue vs normal tissue — a query impossible in the legacy architecture
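The source does not say how tumour enrichment is scored. A common approach, shown here purely as an assumption, is a log2 fold change between tumour and normal expression with a pseudocount to avoid division by zero. The expression values, the cut-off of 2.0, and the function names are illustrative only.

```python
import math


def log2_fold_change(tumour_tpm: float, normal_tpm: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of tumour to normal expression, stabilised by a pseudocount."""
    return math.log2((tumour_tpm + pseudocount) / (normal_tpm + pseudocount))


def prioritise(candidates, min_log2fc: float = 2.0):
    """Keep candidates at or above the cut-off, most tumour-enriched first."""
    scored = [(seq, log2_fold_change(t, n)) for seq, t, n in candidates]
    enriched = [item for item in scored if item[1] >= min_log2fc]
    return sorted(enriched, key=lambda item: item[1], reverse=True)


# (sequence, tumour expression, normal expression); synthetic values.
candidates = [
    ("PEPTIDEA", 31.0, 1.0),
    ("PEPTIDEB", 7.0, 7.0),
    ("PEPTIDEC", 15.0, 3.0),
]
ranked = prioritise(candidates)
print(ranked)
```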
Off-Target Risk Reduction
Automated off-target prediction at ingestion reduces the risk of nominating candidates with high normal-tissue cross-reactivity
HLA Binding Confidence
Direct NetMHC integration within the platform improves confidence in binding affinity predictions and enables rank-based filtering
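Rank-based filtering might look like the sketch below, which classifies a peptide by its best percentile rank across a patient's alleles. The 0.5 and 2.0 cut-offs are the strong and weak binder defaults commonly used with NetMHC-family tools; whether the platform uses these exact values is an assumption, and the prediction values are synthetic.

```python
def classify_binder(pct_rank: float, strong: float = 0.5, weak: float = 2.0) -> str:
    """Label a prediction by percentile rank (lower rank means stronger binding)."""
    if pct_rank < strong:
        return "strong"
    if pct_rank < weak:
        return "weak"
    return "non-binder"


def best_allele(predictions: dict[str, float]) -> tuple[str, float]:
    """Return the allele with the lowest percentile rank for one peptide."""
    return min(predictions.items(), key=lambda kv: kv[1])


# Synthetic percentile ranks for one peptide across two alleles.
predictions = {"HLA-A*02:01": 0.12, "HLA-B*07:02": 3.4}
allele, rank = best_allele(predictions)
label = classify_binder(rank)
print(allele, rank, label)
```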
Raw Evidence Access
Any summarised result can be drilled to its source MS/MS spectral data — giving scientists full transparency from discovery to evidence
Cross-Therapeutic Applicability
Platform architecture is not limited to Oncology — designed for reuse across multiple Therapeutic Focus Areas as the organisation expands
Millions-Scale Readiness
Structured database architecture handles exponential peptide growth without performance degradation — ready for millions of records