mshale-extract
Convert PDF, Word, HTML, plain-text, and DOI-linked protocols into validated ProtocolSpec JSON using Claude for structured extraction.
Quickstart
pip install mshale-extract
# Extract a single PDF
mshale-extract extract protocol.pdf --output spec.json
# Extract a DOI from the literature
mshale-extract extract 10.1038/s41592-021-01235-0 --doi --output spec.json
# Batch extract a directory
mshale-extract batch ./protocols/ --output ./specs/ --format jsonl
# Validate an existing spec
mshale-extract validate spec.json
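The batch command above requests JSONL output. As a minimal sketch, assuming the tool follows the usual JSONL convention of one ProtocolSpec object per line, the results could be read back as shown here. The helper name `read_specs_jsonl` and the file name in the usage note are hypothetical and not part of mshale-extract.

```python
import json
from pathlib import Path
from typing import Iterator


def read_specs_jsonl(path: str) -> Iterator[dict]:
    """Yield one decoded spec per non-empty line of a JSONL file.

    Assumption: each line holds a single JSON object (one spec).
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
```

For example, `for spec in read_specs_jsonl("./specs/specs.jsonl"): print(spec["title"], len(spec["steps"]))` would list each protocol, provided the output file has that name and the records carry `title` and `steps` keys.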
Python API
import os  # needed for os.environ in the ingest example
from mshale_extract import extract_file, extract_doi, batch_extract
# Single file
spec = extract_file("protocol.pdf")
print(spec.title, spec.domain, len(spec.steps))
# DOI
spec = extract_doi("10.1038/s41592-021-01235-0")
# Batch (returns list[ProtocolSpec])
specs = batch_extract("./protocols/", pattern="**/*.pdf")
# Ingest directly to mshale-api
from mshale_extract import ingest_file
result = ingest_file(
    "protocol.pdf",
    api_base="https://api.mishale.bio",
    api_key=os.environ["MSHALE_API_KEY"],
)
Supported Input Formats
PDF
Uses pdfminer.six for text extraction, with layout-aware paragraph detection.
Word (.docx)
Extracts paragraphs and tables via python-docx, preserving numbered step lists.
HTML
BeautifulSoup parser strips navigation and footer boilerplate.
Plain text
Heuristic line-by-line step detection via numbered / bulleted patterns.
DOI / PubMed ID
Fetches the full text via Unpaywall or PubMed E-utilities, then parses it as HTML.
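The plain-text parser is described above as detecting steps from numbered and bulleted patterns. A minimal sketch of that kind of heuristic follows. The regular expression, the function name `detect_steps`, and the rule that unmarked lines continue the previous step are all assumptions for illustration, not the library's actual implementation.

```python
import re
from typing import List

# Matches "1.", "2)", "Step 3:", "-", "*", or a bullet character at line start.
# A marker must be followed by whitespace, so "1.5 mL" is not treated as a step.
STEP_MARKER = re.compile(
    r"^\s*(?:(?:step\s+)?\d+[.):]|[-*\u2022])\s+(?P<text>\S.*)$",
    re.IGNORECASE,
)


def detect_steps(text: str) -> List[str]:
    """Return the step texts found in a plain-text protocol."""
    steps: List[str] = []
    for line in text.splitlines():
        match = STEP_MARKER.match(line)
        if match:
            steps.append(match.group("text").strip())
        elif steps and line.strip():
            # Unmarked, non-blank line: treat it as a continuation
            # of the previous step (wrapped text).
            steps[-1] += " " + line.strip()
    return steps
```

Lines that appear before the first marker, such as a title, are ignored. A real parser would likely also need to stop at section headings, which this sketch does not attempt.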
Extraction Pipeline
Parse
Format-specific parser extracts raw text and structural hints (numbering, headings).
Chunk
Text is split into chunks of at most 4,000 tokens so that each chunk fits within Claude's context window.
Extract
Claude extracts the title, domain, steps, reagents, and equipment via tool use.
Merge
Multi-chunk results are merged and deduplicated.
Validate
Output is validated against the ProtocolSpec Pydantic schema.
Tier
Data tier is assigned via mshale-fl tier_for_spec().
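The Chunk and Merge stages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: whitespace-separated words stand in for model tokens, paragraphs are separated by blank lines, and per-chunk results are plain dicts whose `steps`, `reagents`, and `equipment` values are lists of strings. The names `chunk_text` and `merge_results` are hypothetical.

```python
from typing import Dict, List

MAX_TOKENS = 4000  # chunk budget stated in the Chunk step


def chunk_text(text: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens words.

    Assumption: one whitespace-separated word approximates one token.
    Whitespace inside each paragraph is normalized to single spaces.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Hard-split any paragraph that alone exceeds the budget.
        for start in range(0, len(words), max_tokens):
            piece = words[start:start + max_tokens]
            if current and used + len(piece) > max_tokens:
                chunks.append("\n\n".join(current))
                current, used = [], 0
            current.append(" ".join(piece))
            used += len(piece)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def merge_results(results: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Concatenate per-chunk lists, dropping repeated entries.

    Two entries count as duplicates when they match after lowercasing
    and collapsing whitespace; the first spelling seen is kept.
    """
    merged: Dict[str, List[str]] = {"steps": [], "reagents": [], "equipment": []}
    seen = {key: set() for key in merged}
    for result in results:
        for key in merged:
            for item in result.get(key, []):
                fingerprint = " ".join(item.lower().split())
                if fingerprint not in seen[key]:
                    seen[key].add(fingerprint)
                    merged[key].append(item)
    return merged
```

Packing on paragraph boundaries keeps a numbered step from being cut in half where possible. Scalar fields such as the title and domain are left out of the merge here. A real merge would also need a rule for them, such as taking the value from the first chunk.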
CLI Reference
| Command | Description |
|---|---|
| mshale-extract extract <file> | Extract a single file → ProtocolSpec JSON |
| mshale-extract extract <doi> --doi | Extract a protocol via DOI or PubMed ID |
| mshale-extract batch <dir> | Batch extract a directory of files |
| mshale-extract validate <file> | Validate a ProtocolSpec JSON against the schema |
| mshale-extract ingest <file> | Extract and submit directly to mshale-api |