Protean

Evidence layer · Intake

Ingestion method

Ingestion is the first control surface. The goal is not to collect the largest possible corpus. The goal is to convert selected scientific records into traceable evidence that can safely influence constraints, proposal, ranking, and review.

Source classes

The runtime currently supports source classes for curated biological records, biomedical literature, open publication metadata, clinical-stage context, and internal research records.

  • UniProt-style sequence and annotation records
  • PubMed and NCBI literature metadata
  • Europe PMC abstracts and publication availability metadata
  • ClinicalTrials.gov study metadata for peptide-related context
  • internal research observations and failure records

These sources are not treated as equally authoritative. Each record enters with source identity, query context, timestamps, lineage, and extraction confidence so downstream systems can reason about provenance.
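As a minimal sketch of what such a provenance envelope could look like, the dataclass below carries the five properties named above alongside the untouched payload. The class name `SourceRecord`, its field names, and the `with_step` helper are assumptions made for illustration, not the runtime's actual types.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical envelope for a record as it enters ingestion."""
    source_type: str              # e.g. "uniprot", "pubmed", "internal"
    source_id: str                # identifier assigned by the source
    query_context: str            # query or acquisition context
    retrieved_at: datetime        # retrieval timestamp (UTC)
    lineage: tuple[str, ...]      # ordered processing steps applied so far
    extraction_confidence: float  # 0.0 to 1.0, set by the extractor
    raw_payload: str              # source text, never modified

    def with_step(self, step: str) -> "SourceRecord":
        """Return a copy with one more lineage step; the original is unchanged."""
        return replace(self, lineage=self.lineage + (step,))


record = SourceRecord(
    source_type="pubmed",
    source_id="PMID:0000000",  # placeholder identifier, not a real citation
    query_context="oral peptide stability",
    retrieved_at=datetime.now(timezone.utc),
    lineage=("retrieved",),
    extraction_confidence=0.0,  # nothing extracted yet
    raw_payload="Example abstract text.",
)
fingerprinted = record.with_step("fingerprinted")
```

Making the envelope immutable is one way to keep lineage honest: each stage produces a new record rather than editing history in place.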

The evidence flow

source selection
-> rate-limited retrieval
-> raw record preservation
-> dedupe and fingerprinting
-> structured extraction
-> evidence normalization
-> negative-signal capture
-> constraint and ranking inputs

Raw records are preserved before extraction. Extracted evidence is then normalised into a fixed internal contract that ranking, explanation, and failure-memory systems can read.
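The ordering above can be sketched as a list of named stages applied in sequence. This is an illustration under stated assumptions: records are plain dicts, only three of the eight stages are shown, and the stage bodies are trivial stand-ins. All function names here are hypothetical.

```python
from typing import Callable

Record = dict  # hypothetical record shape: a plain dict for this sketch
Stage = Callable[[Record], Record]


def preserve_raw(record: Record) -> Record:
    # Copy the payload before any later stage can touch it.
    return {**record, "raw": record["payload"]}


def structured_extraction(record: Record) -> Record:
    # Stand-in for real extraction: collapse runs of whitespace only.
    return {**record, "payload": " ".join(record["payload"].split())}


def normalise(record: Record) -> Record:
    # Stand-in for normalisation: lower-case the working text.
    return {**record, "payload": record["payload"].lower()}


STAGES: list[tuple[str, Stage]] = [
    ("raw_record_preservation", preserve_raw),
    ("structured_extraction", structured_extraction),
    ("evidence_normalisation", normalise),
]


def run_pipeline(record: Record, stages=STAGES) -> Record:
    """Apply each stage in order and note its name in the record's lineage."""
    record = {**record, "lineage": list(record.get("lineage", []))}
    for name, stage in stages:
        record = stage(record)
        record["lineage"] = record["lineage"] + [name]
    return record


result = run_pipeline({"payload": "  Peptide X   showed LOW permeability "})
```

Because preservation runs first, `result["raw"]` still holds the original text even though later stages rewrote `result["payload"]`.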

The normalisation contract

Each normalised record preserves the properties that matter for research orchestration:

  • source type and source identifier
  • query or acquisition context
  • candidate sequence mentions when conservatively detected
  • biological or formulation context when available
  • positive, neutral, and negative evidence language
  • failure cues (degradation, cleavage, instability, poor solubility, low permeability, short half-life, terminated development)
  • traceability back to the originating record
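One way to picture the contract listed above is as a frozen dataclass with a closed set of fields. The sketch below is an assumption about shape only: `NormalisedEvidence`, `Polarity`, and every field name are hypothetical and chosen to mirror the bullet list.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class Polarity(str, Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


@dataclass(frozen=True)
class NormalisedEvidence:
    """Hypothetical fixed contract read by downstream systems."""
    source_type: str                        # source type
    source_id: str                          # source identifier
    query_context: str                      # query or acquisition context
    polarity: Polarity                      # positive, neutral, or negative
    sequence_mentions: tuple[str, ...] = () # only conservatively detected ones
    context: Optional[str] = None           # biological or formulation context
    failure_cues: tuple[str, ...] = ()      # e.g. "degradation", "short half-life"
    origin_ref: str = ""                    # pointer back to the raw record


evidence = NormalisedEvidence(
    source_type="europe_pmc",
    source_id="EXAMPLE-0001",  # placeholder identifier
    query_context="peptide oral delivery",
    polarity=Polarity.NEGATIVE,
    failure_cues=("poor solubility",),
    origin_ref="raw/EXAMPLE-0001",  # hypothetical storage key
)
evidence_fields = set(asdict(evidence))
```

Optional fields default to empty values, so a record with no detected sequence or context still satisfies the same contract.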

The extraction layer is intentionally conservative. If a sequence or claim cannot be parsed with sufficient confidence, it is kept as context rather than promoted into candidate evidence.
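A conservative promotion rule of this kind might be sketched as follows. The pattern, the length bounds, and the distinct-residue check are illustrative assumptions, not the runtime's actual rules; the example sequence is an arbitrary string of one-letter residue codes.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter residue codes

# Assumed rule: 8 to 60 consecutive residue codes, not embedded in a longer word.
SEQ_PATTERN = re.compile(rf"(?<![A-Za-z])[{AMINO_ACIDS}]{{8,60}}(?![A-Za-z])")


def detect_sequences(text: str, min_distinct: int = 4) -> list[str]:
    """Return tokens that look like peptide sequences under the assumed rule."""
    hits = []
    for match in SEQ_PATTERN.finditer(text):
        token = match.group()
        # Reject low-complexity runs such as "AAAAAAAA".
        if len(set(token)) >= min_distinct:
            hits.append(token)
    return hits


def classify(text: str) -> dict:
    """Promote to candidate evidence only when a sequence is detected."""
    sequences = detect_sequences(text)
    kind = "candidate_evidence" if sequences else "context"
    return {"kind": kind, "sequences": sequences, "text": text}


promoted = classify("The analogue HAEGTFTSDVSSYLEG was tested in vitro.")
kept_as_context = classify("No sequence is reported in this ABSTRACT.")
```

A pattern like this still accepts upper-case English words built only from residue letters, so a real detector would need further checks before promotion.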

Negative evidence is first-class

Failure data is not a side note. During ingestion, the runtime looks for structured signs of peptide failure: proteolytic degradation, gastric or intestinal instability, low oral exposure, poor permeability, poor solubility, immunogenicity signals, short half-life, delivery limitations, and abandoned or terminated development paths. Those records can influence failure-motif memory, candidate warnings, ranking penalties, constraint refinement, and bounded feedback signals.
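A rule-based pass over record text is one simple way to surface such cues. The cue names and regular expressions below are assumptions for illustration and cover only part of the list above. The sketch does not handle negation (for example "no degradation was observed").

```python
import re

# Hypothetical cue table: cue name -> pattern matched case-insensitively.
FAILURE_CUES = {
    "proteolytic_degradation": r"\bproteolytic\s+degradation\b|\bproteolysis\b",
    "poor_permeability": r"\b(?:poor|low)\s+permeability\b",
    "poor_solubility": r"\b(?:poor|low)\s+solubility\b",
    "short_half_life": r"\bshort\s+half[- ]life\b",
    "immunogenicity": r"\bimmunogenic(?:ity)?\b",
    "terminated_development": r"\b(?:terminated|abandoned|discontinued)\b",
}

_COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in FAILURE_CUES.items()
}


def detect_failure_cues(text: str) -> list[str]:
    """Return the sorted names of all cues whose pattern appears in the text."""
    return sorted(name for name, rx in _COMPILED.items() if rx.search(text))


cues = detect_failure_cues(
    "The peptide showed rapid proteolytic degradation and a short half-life in plasma."
)
no_cues = detect_failure_cues("The peptide was synthesised on solid phase.")
```

Returning cue names rather than a single flag lets downstream consumers weigh each failure mode separately.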

Dedupe and provenance

Ingestion uses source identifiers, record keys, and fingerprints so that scheduled loops do not inflate the influence of repeated records. Duplicate mentions do not automatically become stronger evidence; source lineage stays visible to the review surface.
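The idea can be sketched with a content fingerprint and an index that records every source for a fingerprint while counting the evidence once. SHA-256 over whitespace-collapsed, lower-cased text is an assumed fingerprinting scheme, and `EvidenceIndex` is a hypothetical name.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lower-casing it."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class EvidenceIndex:
    """Tracks which sources reported each fingerprint."""

    def __init__(self) -> None:
        self._sources: dict[str, list[tuple[str, str]]] = {}

    def add(self, source_type: str, source_id: str, text: str) -> bool:
        """Register a record; return True only if its content is new."""
        sources = self._sources.setdefault(fingerprint(text), [])
        is_new = not sources
        key = (source_type, source_id)
        if key not in sources:
            sources.append(key)  # lineage grows, influence does not
        return is_new

    def weight(self, text: str) -> int:
        """Evidence counts once regardless of how many sources repeat it."""
        return 1 if fingerprint(text) in self._sources else 0

    def lineage(self, text: str) -> list[tuple[str, str]]:
        return list(self._sources.get(fingerprint(text), []))


index = EvidenceIndex()
first = index.add("pubmed", "A-1", "Peptide X shows  poor solubility.")
second = index.add("europe_pmc", "B-7", "peptide x shows poor solubility.")
```

Here the second record is recognised as a repeat, yet both sources remain visible through `index.lineage(...)`.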

Model-assisted extraction is bounded

The urchade/gliner_large-v2 model and local reasoning models can assist extraction when available, but they do not become the source of truth. Model-assisted extraction is bounded by schema, timeout, routing, and deterministic fallback behaviour. If model extraction is unavailable or uncertain, the system continues through conservative rule-based extraction.
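A minimal sketch of this bounding, assuming the model is wrapped in a plain callable: the call runs under a timeout, its output is checked against a small schema and a confidence floor, and any failure falls through to rule-based extraction. The function names, output keys, and thresholds are hypothetical, and no real model API is called here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def rule_based_extract(text: str) -> dict:
    # Deterministic fallback: a trivial keyword rule for illustration.
    cues = ["degradation"] if "degradation" in text.lower() else []
    return {"failure_cues": cues, "method": "rules"}


def is_valid(output) -> bool:
    """Assumed schema: a dict with a list of string cues and a numeric confidence."""
    return (
        isinstance(output, dict)
        and isinstance(output.get("failure_cues"), list)
        and all(isinstance(c, str) for c in output["failure_cues"])
        and isinstance(output.get("confidence"), (int, float))
    )


def bounded_extract(text, model_fn=None, timeout_s=2.0, min_confidence=0.8):
    """Use the model only if it answers in time, validly, and confidently."""
    if model_fn is not None:
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            output = pool.submit(model_fn, text).result(timeout=timeout_s)
            if is_valid(output) and output["confidence"] >= min_confidence:
                return {"failure_cues": output["failure_cues"], "method": "model"}
        except Exception:
            pass  # timeout or model error: fall through to the rules
        finally:
            pool.shutdown(wait=False)  # do not block on a slow worker
    return rule_based_extract(text)


def slow_model(text):  # stand-in for a model that exceeds the timeout
    time.sleep(0.3)
    return {"failure_cues": ["instability"], "confidence": 0.99}


def good_model(text):  # stand-in for a fast, confident model
    return {"failure_cues": ["instability"], "confidence": 0.95}


sample = "Rapid degradation was observed."
timed_out = bounded_extract(sample, slow_model, timeout_s=0.05)
accepted = bounded_extract(sample, good_model)
```

A thread-based timeout stops waiting but cannot cancel the running call, so a production system would more likely isolate the model in a separate process or service.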