
mshale-extract

Convert PDF, Word, HTML, plain-text, and DOI-linked protocols into validated ProtocolSpec JSON using Claude for structured extraction.

Quickstart

pip install mshale-extract

# Extract a single PDF
mshale-extract extract protocol.pdf --output spec.json

# Extract a published protocol by DOI
mshale-extract extract 10.1038/s41592-021-01235-0 --doi --output spec.json

# Batch extract a directory
mshale-extract batch ./protocols/ --output ./specs/ --format jsonl

# Validate an existing spec
mshale-extract validate spec.json
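
Batch output written with --format jsonl can be inspected with a few lines of Python. The sketch below assumes each line holds one JSON object whose keys mirror the `title` and `steps` attributes shown in the Python API; that key layout, the helper name `summarize_jsonl`, and the example path are assumptions made for illustration, not documented behavior.

```python
import json
from pathlib import Path


def summarize_jsonl(path):
    """Return a (title, step_count) pair for each record in a JSONL file.

    Assumes one JSON object per line with "title" and "steps" keys.
    These key names are a guess, not a documented schema.
    """
    summaries = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines between records
                continue
            record = json.loads(line)
            summaries.append((record.get("title"), len(record.get("steps", []))))
    return summaries


if __name__ == "__main__":
    # Hypothetical output path; the actual file name inside ./specs/ may differ.
    example = Path("specs/protocols.jsonl")
    if example.exists():
        for title, step_count in summarize_jsonl(example):
            print(f"{title}: {step_count} steps")
```

Records that lack a `steps` key are reported with a count of 0 rather than raising, which keeps a quick inspection from failing on partial output.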

Python API

from mshale_extract import extract_file, extract_doi, batch_extract

# Single file
spec = extract_file("protocol.pdf")
print(spec.title, spec.domain, len(spec.steps))

# DOI
spec = extract_doi("10.1038/s41592-021-01235-0")

# Batch (returns list[ProtocolSpec])
specs = batch_extract("./protocols/", pattern="**/*.pdf")

# Ingest directly to mshale-api
import os

from mshale_extract import ingest_file

result = ingest_file(
    "protocol.pdf",
    api_base="https://api.mishale.bio",
    api_key=os.environ["MSHALE_API_KEY"],  # read the API key from the environment
)

Supported Input Formats

PDF (pdf)

Uses pdfminer.six for text extraction, with layout-aware paragraph detection.

Word (docx)

Extracts paragraphs and tables via python-docx, preserving numbered step lists.

HTML (html)

Parses with BeautifulSoup and strips navigation and footer boilerplate.

Plain text (txt)

Heuristic line-by-line step detection via numbered and bulleted patterns.

DOI / PubMed ID (doi)

Fetches the full text via Unpaywall or PubMed E-utilities, then parses it as HTML.
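
The plain-text heuristic can be pictured roughly as follows. This is a minimal sketch of line-by-line step detection, not the parser mshale-extract ships: the pattern, the function name `detect_steps`, and the rule that unmarked lines continue the previous step are all assumptions.

```python
import re

# Matches a step marker at the start of a line: "1.", "2)", "(3)",
# "Step 4:", or a "-", "*", "•" bullet, followed by whitespace and text.
STEP_PATTERN = re.compile(
    r"^\s*(?:(?:step\s+)?\(?\d+[.):]|[-*•])\s+(?P<text>\S.*)$",
    re.IGNORECASE,
)


def detect_steps(text):
    """Return the step texts found in a plain-text protocol.

    Lines before the first marker are ignored. An unmarked, non-blank
    line after a step is treated as a wrapped continuation of that step.
    """
    steps = []
    for line in text.splitlines():
        match = STEP_PATTERN.match(line)
        if match:
            steps.append(match.group("text").strip())
        elif line.strip() and steps:
            steps[-1] += " " + line.strip()
    return steps


if __name__ == "__main__":
    sample = "Materials\n1. Thaw cells on ice.\n2) Add 3.5 mL buffer\n   and mix gently.\n"
    for number, step in enumerate(detect_steps(sample), start=1):
        print(number, step)
```

Requiring whitespace after the marker keeps quantities such as "3.5 mL" and identifiers such as DOIs from being mistaken for step numbers.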

Extraction Pipeline

1. Parse: A format-specific parser extracts raw text and structural hints (numbering, headings).

2. Chunk: Text is split into chunks of at most 4,000 tokens so that each fits within Claude's context window.

3. Extract: Claude extracts the title, domain, steps, reagents, and equipment using tool use.

4. Merge: Multi-chunk results are merged and deduplicated.

5. Validate: Output is validated against the ProtocolSpec Pydantic schema.

6. Tier: A data tier is assigned via mshale-fl tier_for_spec().
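
The Chunk and Merge steps can be sketched as below. This is an illustration under stated assumptions, not the package's implementation: token counts are approximated by whitespace-separated words (a real tokenizer will count differently), duplicates are detected by case-insensitive, whitespace-normalized text, and the names `chunk_paragraphs` and `merge_steps` are hypothetical.

```python
def chunk_paragraphs(text, max_tokens=4000):
    """Greedily pack paragraphs into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words. A paragraph
    longer than the budget is split into budget-sized pieces.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        pieces = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
        for piece in pieces:
            # Close the current chunk if this piece would exceed the budget.
            if current and current_len + len(piece) > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(" ".join(piece))
            current_len += len(piece)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def merge_steps(chunk_results):
    """Concatenate per-chunk step lists, dropping repeated steps.

    Two steps count as duplicates when they match after lowercasing and
    collapsing whitespace. The first occurrence is kept, in order.
    """
    seen, merged = set(), []
    for steps in chunk_results:
        for step in steps:
            key = " ".join(step.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(step)
    return merged


if __name__ == "__main__":
    parts = chunk_paragraphs("a b c\n\nd e\n\nf g h i j", max_tokens=4)
    print(parts)
    print(merge_steps([["Add buffer.", "Mix."], ["mix.", "Spin down."]]))
```

Splitting on paragraph boundaries keeps a step's text together where possible, which should make the later merge simpler than cutting at a fixed offset.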

CLI Reference

Command                             Description
mshale-extract extract <file>       Extract a single file → ProtocolSpec JSON
mshale-extract extract <doi> --doi  Extract a protocol via DOI or PubMed ID
mshale-extract batch <dir>          Batch extract a directory of files
mshale-extract validate <file>      Validate a ProtocolSpec JSON against the schema
mshale-extract ingest <file>        Extract and submit directly to mshale-api