Knowledge Base
mshale-kb provides ontology-backed entity resolution for reagents, cell lines, and species, along with domain classification of free text. It bridges the gap between free-text protocols and machine-comparable structured records.
What it does
🧪
Reagent resolution
Maps "lipofectamine 2000" → CHEBI:33692 with canonical name and SMILES.
🦠
Cell line normalisation
Maps "HEK cells" → EFO:0001187 (HEK293T) + Homo sapiens.
🏷️
Domain classification
Maps free text to a Mishale domain enum using a fine-tuned classifier.
Linked Ontologies
| Ontology | Purpose |
|---|---|
| ChEBI | Chemical Entities of Biological Interest — reagent and small molecule lookup. |
| GO | Gene Ontology — biological process, molecular function, cellular component. |
| EFO | Experimental Factor Ontology — cell line and assay type normalisation. |
| OBI | Ontology for Biomedical Investigations — protocol step verb normalisation. |
| NCBI Taxonomy | Species resolution for cell lines and model organisms. |
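Resolved entities carry compact identifiers of the form `PREFIX:LOCAL_ID`, such as `CHEBI:33692` or `EFO:0001187`. The following is a minimal sketch of how such an identifier could be split and traced back to one of the ontologies in the table above. It is not part of mshale-kb: `parse_curie`, `Curie`, and `ONTOLOGY_BY_PREFIX` are hypothetical names, and the `NCBITaxon` prefix for NCBI Taxonomy follows the usual OBO convention rather than anything stated on this page.

```python
from typing import NamedTuple

# Prefix -> ontology name, taken from the Linked Ontologies table.
# ASSUMPTION: "NCBITaxon" is the prefix used for NCBI Taxonomy (OBO convention).
ONTOLOGY_BY_PREFIX = {
    "CHEBI": "ChEBI",
    "GO": "GO",
    "EFO": "EFO",
    "OBI": "OBI",
    "NCBITAXON": "NCBI Taxonomy",
}


class Curie(NamedTuple):
    prefix: str    # upper-cased prefix, e.g. "CHEBI"
    local_id: str  # identifier within the ontology, e.g. "33692"
    ontology: str  # ontology name from the table, e.g. "ChEBI"


def parse_curie(text: str) -> Curie:
    """Split 'PREFIX:LOCAL_ID' and map the prefix to a linked ontology."""
    prefix, sep, local_id = text.strip().partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a compact identifier: {text!r}")
    key = prefix.upper()  # match prefixes case-insensitively
    if key not in ONTOLOGY_BY_PREFIX:
        raise ValueError(f"unknown ontology prefix: {prefix!r}")
    return Curie(key, local_id, ONTOLOGY_BY_PREFIX[key])
```

Matching the prefix case-insensitively means `ChEBI:33692` and `CHEBI:33692` parse to the same record, which is convenient when identifiers arrive from hand-written protocols.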
Python API
from mshale_kb import KnowledgeBase
kb = KnowledgeBase()
# Reagent lookup
reagent = kb.lookup_reagent("Lipofectamine 2000")
# ReagentEntity(chebi_id="CHEBI:33692", canonical_name="lipofectamine 2000", ...)
# Cell line lookup
cell_line = kb.lookup_cell_line("HEK293T")
# CellLineEntity(efo_id="EFO:0001187", organism="Homo sapiens", ...)
# Domain classification
domain = kb.classify_domain("CRISPR knock-out using RNP electroporation")
# "crispr_ko"
# Enrich a ProtocolSpec
from mshale_schema import ProtocolSpec
spec = ProtocolSpec(...)
enriched_spec = kb.enrich(spec)  # adds .reagents[].chebi_id etc.
CLI Reference
| Command | Description |
|---|---|
| mshale-kb lookup reagent <name> | Resolve a reagent name to ChEBI ID + canonical name. |
| mshale-kb lookup cell-line <name> | Resolve a cell line to EFO term + organism. |
| mshale-kb lookup domain <text> | Classify free text to a Mishale domain identifier. |
| mshale-kb enrich <spec.json> | Enrich a ProtocolSpec with resolved ontology IDs. |
| mshale-kb stats | Show KB coverage stats (reagents, cell lines, domains). |
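Reagent and cell line names often contain spaces, so they need quoting when a lookup is scripted. The sketch below builds the argument list for the three `lookup` sub-commands listed above and renders a shell-safe command line. `lookup_argv` and `shell_line` are hypothetical helpers, not part of mshale-kb, and the output format of the CLI is not described here, so nothing is parsed.

```python
import shlex
from typing import List

# Lookup kinds copied from the CLI table above.
LOOKUP_KINDS = ("reagent", "cell-line", "domain")


def lookup_argv(kind: str, query: str) -> List[str]:
    """Build the argument list for `mshale-kb lookup <kind> <query>`."""
    if kind not in LOOKUP_KINDS:
        raise ValueError(f"unsupported lookup kind: {kind!r}")
    # The query stays a single list element, so spaces need no escaping here.
    return ["mshale-kb", "lookup", kind, query]


def shell_line(argv: List[str]) -> str:
    """Render an argument list as a shell-safe command line for logs or scripts."""
    return shlex.join(argv)


print(shell_line(lookup_argv("reagent", "lipofectamine 2000")))
# mshale-kb lookup reagent 'lipofectamine 2000'
```

Passing the list form to `subprocess.run(argv, capture_output=True, text=True)` avoids shell quoting altogether. The quoted string is only needed when the command is pasted into a terminal or a shell script.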
Offline Cache
The KB ships with a pre-built SQLite cache of the most common reagents and cell lines (~50K entries). When a term is not in the cache, the lookup falls through to the live OLS (Ontology Lookup Service) API. The cache is updated with each Mishale release.
# Force refresh the local cache
mshale-kb cache refresh
# Show cache statistics
mshale-kb cache stats
# Reagents: 51,204 terms
# Cell lines: 4,891 terms
# Last updated: 2025-03-15
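The cache-first behaviour described in this section can be illustrated with a small sketch using Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the actual mshale-kb implementation: the `terms` table, the `normalise` helper, and the `remote` callable (standing in for a live OLS request) are all hypothetical, and writing remote hits back into the local cache is an assumption that this page does not confirm.

```python
import sqlite3
from typing import Callable, Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache with a hypothetical name -> term_id table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS terms ("
        "name TEXT PRIMARY KEY, term_id TEXT NOT NULL)"
    )
    return conn


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so spelling variants share a key."""
    return " ".join(name.lower().split())


def lookup(
    conn: sqlite3.Connection,
    name: str,
    remote: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the cached term ID, falling through to `remote` on a miss."""
    key = normalise(name)
    row = conn.execute(
        "SELECT term_id FROM terms WHERE name = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: no network call
    term_id = remote(key)  # cache miss: ask the remote service
    if term_id is not None:
        # ASSUMPTION: remote hits are stored locally for next time.
        conn.execute(
            "INSERT OR REPLACE INTO terms (name, term_id) VALUES (?, ?)",
            (key, term_id),
        )
        conn.commit()
    return term_id
```

Taking `remote` as a plain callable keeps the sketch runnable offline and makes the fall-through path easy to test with a stub.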