Federated Learning

mshale-fl v0.1.0 — three-tier privacy-preserving data contribution system. Partners share progressively more data in exchange for progressively better model access. No proprietary protocol data ever leaves the partner's infrastructure.

Three-Layer Architecture

Tier 1 / Layer 1: Local Extraction (Available)

Partner runs mshale-extract locally. Only ProtocolSpec JSON (no raw data, no outcomes) is transmitted. Minimal trust required.

Use mshale-extract --local-mode. Zero raw data transmitted. Suitable for any lab with published or internal protocols.

Tier 2 / Layer 2: Feature Federation (Available)

Partner runs local feature extraction. Only 223-dim feature vectors are transmitted — not protocol text. Protocol content is not reconstructable from feature vectors.

Requires a Data Processing Agreement (DPA). Academic partners do not need an attorney to complete it.

Tier 3 / Layer 3: Gradient Federation (Available)

Full Flower (flwr) federated training. FedProx aggregation. Opacus differential privacy with per-partner ε-budget tracking. Cryptographic SecAgg. Hash-chain audit log.

Requires Tier 3 commercial federated training agreement. DP budget tracked per (partner_id, model_version).
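The per-(partner_id, model_version) budget tracking described for Tier 3 can be pictured with a small ledger. This is a minimal sketch, not the mshale-fl implementation: the class name `BudgetLedger`, its methods, and the use of simple additive composition of ε are all assumptions made for illustration (a production accountant would typically use a tighter composition rule). The 10% pause threshold is taken from the Privacy Guarantees section.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Hypothetical ledger: epsilon spent per (partner_id, model_version)."""

    total_epsilon: float
    pause_fraction: float = 0.10  # pause when less than 10% of the budget remains
    spent: dict = field(default_factory=dict)

    def remaining(self, partner_id: str, model_version: str) -> float:
        key = (partner_id, model_version)
        return self.total_epsilon - self.spent.get(key, 0.0)

    def is_paused(self, partner_id: str, model_version: str) -> bool:
        threshold = self.pause_fraction * self.total_epsilon
        return self.remaining(partner_id, model_version) < threshold

    def charge(self, partner_id: str, model_version: str, epsilon: float) -> float:
        """Record epsilon spent in one round; return what is left.

        Assumption: budgets add up linearly (basic composition).
        """
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.is_paused(partner_id, model_version):
            raise RuntimeError("budget below pause threshold")
        if epsilon > self.remaining(partner_id, model_version):
            raise RuntimeError("charge would exceed the total budget")
        key = (partner_id, model_version)
        self.spent[key] = self.spent.get(key, 0.0) + epsilon
        return self.remaining(partner_id, model_version)
```

Because the key is the pair, a partner that has exhausted its budget against one model version still starts from a full budget on the next version in this sketch.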

Data Sensitivity Tiers

Public

Published protocols. Apache 2.0. No restrictions.

Anonymised

Internal procedures, no outcomes. Feature vectors only.

Outcomes

Procedures + efficiency. Requires DPA or IRB agreement.

Proprietary

Full IP. Never transmitted. Local extraction only.

Quickstart — Tier 1

pip install mshale-fl

# Partner: run extraction locally (no raw data transmitted)
mshale-fl extract --tier 1 --corpus ./lab-protocols/ --output ./specs/

# Check what will be transmitted (inspection mode)
mshale-fl inspect --specs ./specs/ --show-features

# Submit Tier 1 data to Mishale (ProtocolSpec JSON only)
mshale-fl submit --specs ./specs/ --api-key $MSHALE_API_KEY

Quickstart — Tier 3 (FL Server)

# Mishale: start the Flower aggregation server
mshale-fl serve \
  --model models/mshale1_v0.pt \
  --rounds 10 \
  --min-clients 3 \
  --dp-epsilon 1.0

# Partner: join a federated training round
mshale-fl join \
  --server api.mishale.bio:9090 \
  --corpus ./local-corpus/ \
  --tier 3 \
  --dp-epsilon 0.5   # local DP noise added before gradient transmission

# Check DP budget remaining
mshale-fl budget --partner-id your-lab-id

Privacy Guarantees

Differential Privacy

Opacus wraps the PyTorch training loop with differential privacy: per-sample gradients are clipped and Gaussian noise is added before transmission. Each partner's (ε, δ)-DP budget is tracked, and training pauses automatically when less than 10% of the budget remains.
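To make the noise step concrete, the sketch below shows the general clip-then-add-Gaussian-noise mechanism in plain Python. It is an illustration of the technique only and does not use the Opacus API; the function name `clip_and_noise` and its parameters are hypothetical, and gradients are represented as plain lists of floats.

```python
import math
import random


def clip_and_noise(per_sample_grads, max_grad_norm, noise_multiplier, rng=None):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average.

    Noise standard deviation is noise_multiplier * max_grad_norm, applied
    independently to every coordinate of the summed gradient.
    """
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping bound.
        scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * max_grad_norm
    n = len(per_sample_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]
```

Clipping bounds how much any single sample can move the update, which is what lets a fixed noise scale translate into an (ε, δ) guarantee.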

Secure Aggregation

Cryptographic SecAgg via threshold secret sharing. Server sees only the sum of gradients — never individual partner gradients. Implemented with the cryptography library (≥42).
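The property that the server sees only the sum can be illustrated with a toy example. Note the swap: the sketch below uses pairwise cancelling masks rather than threshold secret sharing, so it has no dropout recovery, and it uses Python's `random` module as a stand-in for pairwise agreed keys, which is not cryptographically secure. All names (`pairwise_masks`, `mask_update`, `aggregate`, `MODULUS`) are hypothetical, and updates are assumed to be quantised to integers.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value


def pairwise_masks(partner_ids, dim, seed=0):
    """For each pair (a, b) with a < b, a adds a random mask and b subtracts it."""
    masks = {pid: [0] * dim for pid in partner_ids}
    rng = random.Random(seed)  # illustration only; not a secure key agreement
    ordered = sorted(partner_ids)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            for k in range(dim):
                m = rng.randrange(MODULUS)
                masks[a][k] = (masks[a][k] + m) % MODULUS
                masks[b][k] = (masks[b][k] - m) % MODULUS
    return masks


def mask_update(update, mask):
    """What a partner would send: its integer update plus its net mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Server side: masks cancel in the sum, leaving the sum of raw updates."""
    dim = len(masked_updates[0])
    return [sum(u[k] for u in masked_updates) % MODULUS for k in range(dim)]
```

Each masked update looks uniformly random on its own; only the sum over all partners is meaningful.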

Hash-Chain Audit

Every FL round is logged to an append-only hash chain: partner ID, round number, DP budget consumed, model version. Stored in per-region S3 with WORM Object Lock.
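An append-only hash chain of this kind can be sketched in a few lines. The record fields mirror the ones listed above; the function names, the JSON serialisation, and the choice of SHA-256 are assumptions for illustration, and storage (S3, Object Lock) is out of scope here.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first entry


def _digest(record):
    """SHA-256 over the record's fields, excluding its own hash."""
    payload = {k: v for k, v in record.items() if k != "entry_hash"}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(chain, partner_id, round_number, epsilon_spent, model_version):
    """Append one FL-round record that commits to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    record = {
        "partner_id": partner_id,
        "round_number": round_number,
        "epsilon_spent": epsilon_spent,
        "model_version": model_version,
        "prev_hash": prev_hash,
    }
    record["entry_hash"] = _digest(record)
    chain.append(record)
    return record


def verify_chain(chain):
    """Return True only if every link and every entry hash is intact."""
    prev_hash = GENESIS_HASH
    for record in chain:
        if record["prev_hash"] != prev_hash:
            return False
        if record["entry_hash"] != _digest(record):
            return False
        prev_hash = record["entry_hash"]
    return True
```

Editing any earlier record changes its digest, which breaks the stored `entry_hash` and every later `prev_hash` link, so tampering is detectable by re-running `verify_chain`.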

No Raw Data Transmitted

At every tier, raw protocol text never leaves the partner. Tier 1: ProtocolSpec JSON. Tier 2: feature vectors. Tier 3: DP-noised gradients from a locally-trained model.

FedProx Strategy

Mishale uses FedProx rather than FedAvg because partner datasets are heterogeneous (different cell types, organisms, protocol domains). FedProx adds a proximal term to each client's objective that limits how far local training can drift from the global model:

# Local client objective (FedProx):
#   min_w  F_i(w)  +  (μ/2) * ||w - w_global||²
#
# μ = 0.01 (default): controls how close local w stays to global w
# Higher μ → stronger regularisation → less client drift on heterogeneous data,
# but slower local progress; μ = 0 recovers FedAvg-style local training
#
# Convergence guarantee: O(1/T) under partial participation
# (not all partners need to participate in every round)
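The objective above translates directly into code. The following is a minimal pure-Python sketch with weights as flat lists of floats; the function names `fedprox_loss` and `fedprox_step`, and the learning rate default, are hypothetical and are not taken from mshale-fl or Flower.

```python
def fedprox_loss(local_loss, w, w_global, mu=0.01):
    """F_i(w) + (mu / 2) * ||w - w_global||^2 for flat weight lists."""
    prox = sum((wi - gi) ** 2 for wi, gi in zip(w, w_global))
    return local_loss + 0.5 * mu * prox


def fedprox_step(w, grad_local, w_global, mu=0.01, lr=0.1):
    """One gradient step on the FedProx objective.

    The gradient of the proximal term is mu * (w - w_global), so each
    coordinate is pulled back towards the global model.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, gi, wgi in zip(w, grad_local, w_global)
    ]
```

With `mu=0` the step reduces to plain local gradient descent, which makes the effect of the proximal term easy to compare in experiments.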