Federated Learning
mshale-fl v0.1.0: a three-tier, privacy-preserving data contribution system. Partners share progressively more data in exchange for progressively better model access. Raw protocol text never leaves the partner's infrastructure at any tier.
Three-Tier Architecture
Tier 1
Partner runs mshale-extract locally. Only ProtocolSpec JSON (no raw data, no outcomes) is transmitted. Minimal trust required.
Use mshale-extract --local-mode. Zero raw data transmitted. Suitable for any lab with published or internal protocols.
Tier 2
Partner runs local feature extraction. Only 223-dimensional feature vectors are transmitted, not protocol text. Protocol content is not reconstructable from feature vectors.
Requires a Data Processing Agreement (DPA). Academic partners do not need a DPA attorney.
Tier 3
Full federated training with Flower (flwr): FedProx aggregation, Opacus differential privacy with per-partner ε-budget tracking, cryptographic secure aggregation (SecAgg), and a hash-chain audit log.
Requires a Tier 3 commercial federated training agreement. The DP budget is tracked per (partner_id, model_version).
Data Sensitivity Tiers
Published protocols. Apache 2.0. No restrictions.
Internal procedures, no outcomes. Feature vectors only.
Procedures + efficiency. Requires DPA or IRB agreement.
Full IP. Never transmitted. Local extraction only.
Quickstart — Tier 1
pip install mshale-fl

# Partner: run extraction locally (no raw data transmitted)
mshale-fl extract --tier 1 --corpus ./lab-protocols/ --output ./specs/

# Check what will be transmitted (inspection mode)
mshale-fl inspect --specs ./specs/ --show-features

# Submit Tier 1 data to Mishale (ProtocolSpec JSON only)
mshale-fl submit --specs ./specs/ --api-key "$MSHALE_API_KEY"
Quickstart — Tier 3 (FL Server)
# Mishale: start the Flower aggregation server
mshale-fl serve \
  --model models/mshale1_v0.pt \
  --rounds 10 \
  --min-clients 3 \
  --dp-epsilon 1.0

# Partner: join a federated training round
mshale-fl join \
  --server api.mishale.bio:9090 \
  --corpus ./local-corpus/ \
  --tier 3 \
  --dp-epsilon 0.5  # local DP noise added before gradient transmission

# Check DP budget remaining
mshale-fl budget --partner-id your-lab-id
Privacy Guarantees
Differential Privacy
Opacus wraps the PyTorch training loop: per-sample gradients are clipped and Gaussian noise is added before transmission. A per-partner (ε, δ)-DP budget is tracked, and training pauses automatically when less than 10% of the budget remains.
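The budget rule above can be sketched as a small ledger. This is a minimal illustration, not the mshale-fl implementation: the BudgetLedger class and its method names are hypothetical, and it sums ε per round (basic composition), whereas a real accountant such as the one in Opacus typically gives tighter bounds.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Hypothetical ledger: epsilon spent per (partner_id, model_version)."""

    total_epsilon: float            # budget granted to each (partner, model) pair
    pause_fraction: float = 0.10    # pause once less than 10% of the budget remains
    spent: dict = field(default_factory=dict)

    def remaining(self, partner_id: str, model_version: str) -> float:
        return self.total_epsilon - self.spent.get((partner_id, model_version), 0.0)

    def is_paused(self, partner_id: str, model_version: str) -> bool:
        threshold = self.pause_fraction * self.total_epsilon
        return self.remaining(partner_id, model_version) < threshold

    def record_round(self, partner_id: str, model_version: str, epsilon_used: float) -> bool:
        """Charge one round; refuse (return False) if paused or over budget."""
        if self.is_paused(partner_id, model_version):
            return False
        if epsilon_used > self.remaining(partner_id, model_version):
            return False
        key = (partner_id, model_version)
        # Assumption: simple additive composition of per-round epsilon.
        self.spent[key] = self.spent.get(key, 0.0) + epsilon_used
        return True


ledger = BudgetLedger(total_epsilon=1.0)
ledger.record_round("lab-a", "v0", 0.5)
ledger.record_round("lab-a", "v0", 0.45)
print(ledger.is_paused("lab-a", "v0"))  # budget for (lab-a, v0) is nearly spent
print(ledger.is_paused("lab-a", "v1"))  # a new model version has its own budget
```

Keying the ledger on the (partner_id, model_version) pair means a new model version starts with a fresh budget, which matches the tracking granularity described for Tier 3.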
Secure Aggregation
Cryptographic secure aggregation (SecAgg) via threshold secret sharing. The server sees only the sum of gradients, never an individual partner's gradients. Implemented with the cryptography library (version 42 or later).
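To show why a server can learn the sum without seeing any single contribution, here is a toy sketch of threshold (Shamir) secret sharing over a prime field. It is an illustration under stated assumptions, not the mshale-fl protocol: it uses only the Python standard library rather than the cryptography package, the field size and function names are hypothetical, and each update is assumed to be already quantised to a non-negative integer.

```python
import secrets

PRIME = 2**61 - 1  # Mersenne prime; the field size is an illustrative choice


def make_shares(secret: int, threshold: int, n_shares: int) -> list:
    """Split `secret` into points on a random polynomial of degree threshold - 1."""
    if not 1 <= threshold <= n_shares:
        raise ValueError("need 1 <= threshold <= n_shares")
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares: list) -> int:
    """Recover the polynomial's value at x = 0 by Lagrange interpolation."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


def secure_sum(updates: list, threshold: int) -> int:
    """Each partner shares its update; holders add shares; only the sum is rebuilt."""
    n = len(updates)
    per_partner = [make_shares(u, threshold, n) for u in updates]
    summed = []
    for k in range(n):
        # Holder k adds the k-th share from every partner (shares are additive).
        y = sum(shares[k][1] for shares in per_partner) % PRIME
        summed.append((k + 1, y))
    # Any `threshold` summed shares are enough to recover the aggregate.
    return reconstruct(summed[:threshold])


print(secure_sum([3, 5, 7], threshold=2))
```

Because Shamir shares are additive, adding the shares pointwise yields shares of the sum, so no party ever needs to reconstruct an individual update.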
Hash-Chain Audit
Every FL round is logged to an append-only hash chain: partner ID, round number, DP budget consumed, model version. Stored in per-region S3 with WORM Object Lock.
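An append-only hash chain of the kind described above can be sketched in a few lines. The example is a minimal illustration, assuming SHA-256 and a JSON record layout; the field names and helper functions are hypothetical, and the S3 Object Lock storage layer is out of scope here.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list, record: dict) -> dict:
    """Append a record, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"record": record, "prev_hash": prev, "hash": entry_hash(prev, record)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"partner_id": "lab-a", "round": 1,
                         "dp_epsilon_consumed": 0.05, "model_version": "v0"})
append_entry(audit_log, {"partner_id": "lab-a", "round": 2,
                         "dp_epsilon_consumed": 0.05, "model_version": "v0"})
print(verify_chain(audit_log))
```

Since each hash covers the previous hash, changing any earlier record invalidates every entry after it, which is what makes silent edits detectable.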
No Raw Data Transmitted
At every tier, raw protocol text never leaves the partner. Tier 1: ProtocolSpec JSON. Tier 2: feature vectors. Tier 3: DP-noised gradients from a locally trained model.
FedProx Strategy
Mishale uses FedProx rather than FedAvg because partner datasets are heterogeneous (different cell types, organisms, protocol domains). FedProx adds a proximal term to each client's objective that limits how far local training can drift from the global model:
# Local client objective (FedProx):
#   min_w  F_i(w) + (μ/2) * ||w - w_global||²
#
# μ = 0.01 (default) controls how close local w stays to global w.
# Higher μ → stronger regularisation → less client drift on heterogeneous data,
# at the cost of slower local progress; μ = 0 recovers FedAvg.
#
# Convergence guarantee: O(1/T) under partial participation
# (not all partners need to participate in every round)
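The effect of the proximal term can be seen in a small, dependency-free sketch. It assumes a toy quadratic local loss F_i(w) = ½‖w − target‖² standing in for a partner's real training loss; the function names, learning rate, and step count are illustrative choices, not mshale-fl defaults.

```python
def fedprox_local_update(w_global, local_grad, mu=0.01, lr=0.1, steps=200):
    """Gradient descent on F_i(w) + (mu/2) * ||w - w_global||^2, starting at w_global."""
    w = list(w_global)
    for _ in range(steps):
        grad = local_grad(w)
        # The proximal term contributes mu * (w - w_global) to the gradient.
        w = [wi - lr * (gi + mu * (wi - wg))
             for wi, gi, wg in zip(w, grad, w_global)]
    return w


# Toy local loss: F_i(w) = 0.5 * ||w - TARGET||^2, so its gradient is w - TARGET.
TARGET = [1.0, -2.0]  # hypothetical local optimum for one partner


def quad_grad(w):
    return [wi - ti for wi, ti in zip(w, TARGET)]


w_global = [0.0, 0.0]
for mu in (0.0, 0.01, 1.0):
    w_local = fedprox_local_update(w_global, quad_grad, mu=mu)
    drift = sum((a - b) ** 2 for a, b in zip(w_local, w_global)) ** 0.5
    print(f"mu={mu}: w_local={w_local}, drift from global={drift:.4f}")
```

For this quadratic loss the minimiser is (target + μ·w_global) / (1 + μ), so μ = 0 lets the client move all the way to its local optimum while larger μ pulls the result back toward the global model.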