Text embeddings
Text embeddings serve scientific evidence organisation. They do not serve peptide similarity. Natural-language records and amino-acid sequences are different data surfaces with different failure modes.
The route
The primary route is BAAI/bge-m3. If it is unavailable, the runtime falls back to nomic-ai/nomic-embed-text-v1.5, and then to deterministic lexical embeddings for degraded operation.
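The route order can be sketched as a simple loader chain. This is a minimal illustration, not the runtime's actual code: `resolve_embedder`, the `Embedder` type, and the loader callables are hypothetical names, and the label `lexical-fallback` is an assumed tag for the deterministic route.

```python
from typing import Callable, List, Sequence, Tuple

# An embedder maps a batch of texts to one vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]

# Route order taken from the text above; "lexical-fallback" is an assumed label.
ROUTE_ORDER = (
    "BAAI/bge-m3",
    "nomic-ai/nomic-embed-text-v1.5",
    "lexical-fallback",
)


def resolve_embedder(
    loaders: Sequence[Tuple[str, Callable[[], Embedder]]],
) -> Tuple[str, Embedder]:
    """Try each loader in order and return the first (name, embedder) that loads."""
    errors = []
    for name, load in loaders:
        try:
            return name, load()
        except Exception as exc:  # e.g. model files not present locally
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no embedding route available: " + "; ".join(errors))
```

Returning the route name alongside the embedder lets the caller record which route produced a given vector, which matters once the degraded path is in use.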
The boundary
Text embeddings are used for literature and evidence record retrieval, candidate explanation context, generated paper context, failure-record search, and local paper/patent record organisation when those records exist.
Text embeddings are not used for peptide sequence similarity, protein language modeling, novelty scoring over amino-acid strings, deterministic validation, or scoring override. The sequence layer (facebook/esm2_t12_35M_UR50D) handles those.
Cache behaviour
data/cache/text_embeddings.sqlite
-> bounded row cache
-> local-only model loading
-> deterministic fallback when models are unavailable
The cache is operational infrastructure: it improves speed and reproducibility, but it does not turn retrieved sources into biological proof.
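A bounded row cache over SQLite might look like the sketch below. The table schema, the `EmbeddingCache` class, the key derivation, and the `max_rows` default are all assumptions made for illustration; the source only states that the cache lives at data/cache/text_embeddings.sqlite and is bounded.

```python
import hashlib
import json
import sqlite3
from typing import List, Optional, Sequence


class EmbeddingCache:
    """Hypothetical bounded cache: keeps at most max_rows of the newest rows."""

    def __init__(self, path: str, max_rows: int = 10000) -> None:
        self.max_rows = max_rows
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " key TEXT UNIQUE NOT NULL,"
            " vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(model: str, text: str) -> str:
        # Key on model name and text so different routes never share a row.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, model: str, text: str) -> Optional[List[float]]:
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?",
            (self._key(model, text),),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, model: str, text: str, vector: Sequence[float]) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO embeddings (key, vector) VALUES (?, ?)",
            (self._key(model, text), json.dumps(list(vector))),
        )
        # Enforce the bound by dropping everything except the newest rows.
        self.conn.execute(
            "DELETE FROM embeddings WHERE id NOT IN ("
            " SELECT id FROM embeddings ORDER BY id DESC LIMIT ?)",
            (self.max_rows,),
        )
        self.conn.commit()
```

Including the model name in the key is a design choice: a vector produced by the degraded route is never served as if it came from BGE.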
Degraded operation
If BGE and Nomic are unavailable, the runtime uses deterministic text hashing and lexical overlap. The fallback is lower quality but stable, local, and explicitly marked in artifacts.
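One way to realise deterministic text hashing and lexical overlap is shown below, using only the standard library. This is a sketch under stated assumptions: the tokenizer, the 256-dimension default, the signed hashing scheme, and the use of Jaccard overlap are illustrative choices, not details confirmed by the runtime.

```python
import hashlib
import math
import re
from typing import List

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def hashed_embedding(text: str, dim: int = 256) -> List[float]:
    """Map tokens to signed buckets via SHA-256, then L2-normalise.

    The same text always yields the same vector, on any machine.
    """
    vec = [0.0] * dim
    for tok in tokenize(text):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[idx] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Because `hashlib` is used instead of Python's built-in `hash`, the result does not vary with per-process hash randomisation, which is what keeps the fallback reproducible across runs.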
