Collections

Collections organise candidates, hypotheses, experiments, papers, and provenance into coherent research cohorts. They are scientific operating units, not just lists.

Target shape

Collection is deployed today as a RecordType on the Protean Ledger. Off-chain collection manifests remain as supplemental packaging for review and mirror publication, but the canonical public object is now a Ledger Collection record linked by typed Includes edges.
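As a rough illustration of a record linked by typed edges, the sketch below models a collection and its Includes edges in memory. It is not the Ledger schema: the class names, field names, and member type labels are all hypothetical, and the only thing taken from the text is that a Collection record points at its contents through edges typed Includes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class IncludesEdge:
    """Typed edge from a collection record to one included record (hypothetical shape)."""
    collection_id: str
    member_id: str
    member_type: str          # hypothetical labels, e.g. "candidate" or "experiment"
    edge_type: str = "Includes"


@dataclass
class CollectionRecord:
    """In-memory stand-in for a Collection record (hypothetical fields)."""
    record_id: str
    title: str
    edges: List[IncludesEdge] = field(default_factory=list)

    def include(self, member_id: str, member_type: str) -> IncludesEdge:
        # Attach a member by adding an Includes edge rather than embedding it.
        edge = IncludesEdge(self.record_id, member_id, member_type)
        self.edges.append(edge)
        return edge

    def members(self, member_type: Optional[str] = None) -> List[str]:
        # List member ids, optionally filtered by member type.
        return [
            e.member_id
            for e in self.edges
            if member_type is None or e.member_type == member_type
        ]


collection = CollectionRecord("col-demo", "Demo benchmark set")
collection.include("cand-1", "candidate")
collection.include("exp-1", "experiment")
```

Keeping membership in edges rather than inside the record means a member can belong to several collections without being copied.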

What collections are for

benchmark sets
assay cohorts
experiment batches
validation groups
exploration branches
public full-sequence releases

Lifecycle

private
-> internal review
-> assay pending
-> assay active
-> public reviewed
-> public full
-> archived

Most collection preparation remains private or under internal review before Ledger submission. Public collections expose membership, candidate/family sequences, lifecycle state, disclosure state, and typed lineage. Supplemental manifests pass through the publication guard before mirror publication.
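The lifecycle above can be read as a simple ordered state machine. The sketch below assumes transitions are strictly linear, one step at a time, which the text does not confirm: the real system may allow skipping states or archiving early. The function names are hypothetical, and treating the two states whose names begin with "public" as the publicly visible ones is also an assumption.

```python
# State names copied from the lifecycle list; the strict ordering is an assumption.
LIFECYCLE = [
    "private",
    "internal review",
    "assay pending",
    "assay active",
    "public reviewed",
    "public full",
    "archived",
]


def next_state(current):
    """Return the state that follows `current`, or None for the final state."""
    index = LIFECYCLE.index(current)  # raises ValueError for an unknown state
    if index == len(LIFECYCLE) - 1:
        return None
    return LIFECYCLE[index + 1]


def can_transition(current, target):
    """Allow only a single forward step (assumed rule, not confirmed by the source)."""
    return next_state(current) == target


def is_public(state):
    """Assumption: only the 'public ...' states expose a collection publicly."""
    return state in {"public reviewed", "public full"}
```

Under these assumptions, can_transition("private", "public full") is rejected, so a collection cannot jump from private preparation straight to a full public release.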

Collections as scientific operating units

Collections let the runtime reason about groups, not just individual candidates. They bind candidate families, evidence traces, assay readiness, Ledger records, and review state without exposing private payloads. Collections support wet-lab batch planning, failure-memory studies, exploration branches, and public proof surfaces.

How they connect to the workflow

The canonical workflow DAG places update_collections after prepare_provenance, gated by the public_collection_redaction review gate. The active loop wires collection preparation in after the lineage and provenance stages, where the relevant artifacts are available.

python3 pipelines/collection_system.py --build
python3 pipelines/collection_system.py --verify
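To make the ordering and the gate in this section concrete, here is a minimal sketch using Python's standard graphlib module. It is not the internals of collection_system.py: it contains only the single dependency named in the text, and it assumes a review gate must be approved before its stage runs. The plan function and the two dictionaries are hypothetical.

```python
from graphlib import TopologicalSorter

# Only the dependency named in the text; the real DAG has more stages.
DEPENDENCIES = {"update_collections": {"prepare_provenance"}}

# Stage -> review gate that must be approved first (assumed semantics).
REVIEW_GATES = {"update_collections": "public_collection_redaction"}


def plan(approved_gates):
    """Return stages in dependency order, stopping at the first unapproved gate."""
    runnable = []
    for stage in TopologicalSorter(DEPENDENCIES).static_order():
        gate = REVIEW_GATES.get(stage)
        if gate is not None and gate not in approved_gates:
            break  # conservative: nothing after a blocked stage is scheduled
        runnable.append(stage)
    return runnable
```

With no gates approved, plan(set()) schedules only prepare_provenance; once public_collection_redaction is approved, update_collections follows it. Stopping at the first blocked stage is a deliberately cautious choice for this sketch; a fuller scheduler could continue with stages that do not depend on the blocked one.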

Collection manifests are public-safe by construction: published sequences remain visible, while salts, local paths, embeddings, scoring weights, source traces, and wet-lab automation fields are excluded.
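The exclusion rule above can be sketched as a recursive filter over a manifest. The key names below are hypothetical stand-ins for the excluded categories listed in the text (the actual manifest field names are not given here), and the function is an illustration, not the publication guard itself.

```python
# Hypothetical key names, one per excluded category named in the text.
EXCLUDED_KEYS = {
    "salt",
    "local_path",
    "embedding",
    "scoring_weights",
    "source_trace",
    "wetlab_automation",
}


def redact(value):
    """Return a copy of `value` with excluded keys removed at every nesting level."""
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items() if k not in EXCLUDED_KEYS}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


manifest = {
    "collection": "col-demo",
    "salt": "not-for-publication",
    "members": [
        {"id": "cand-1", "sequence": "MKT", "embedding": [0.1, 0.2]},
        {"id": "cand-2", "sequence": "GAV", "local_path": "/tmp/cand-2.json"},
    ],
}
public_manifest = redact(manifest)
```

A denylist like this mirrors the wording of the text, but it fails open when a new private field is introduced. An allowlist of known public fields would be closer to "public-safe by construction", since anything unlisted is dropped by default.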