Protean

Evidence layer·Vectorize

Text embeddings

Text embeddings serve scientific evidence organisation. They do not serve peptide similarity. Natural-language records and amino-acid sequences are different data surfaces with different failure modes.

The route

The primary route is BAAI/bge-m3. If it is unavailable, the runtime falls back to nomic-ai/nomic-embed-text-v1.5, and then to deterministic lexical embeddings for degraded operation.
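The fallback order above can be sketched as a small resolver. This is a minimal illustration, not the runtime's actual code: the two model identifiers come from the text, while `resolve_route`, the `try_load` callback, and the `"lexical-fallback"` label are hypothetical names introduced here.

```python
from typing import Callable, Optional, Tuple

# Model identifiers are taken from the text; the last label is hypothetical.
ROUTE = (
    "BAAI/bge-m3",
    "nomic-ai/nomic-embed-text-v1.5",
    "lexical-fallback",
)


def resolve_route(
    try_load: Callable[[str], Optional[object]],
    route: Tuple[str, ...] = ROUTE,
) -> Tuple[str, Optional[object], bool]:
    """Return (route_name, model, degraded) for the first entry that loads.

    `try_load` is a hypothetical loader that returns a model object, returns
    None, or raises when the model is not available locally.
    """
    for name in route[:-1]:
        try:
            model = try_load(name)
        except Exception:
            # Treat any load failure as "unavailable" and try the next entry.
            model = None
        if model is not None:
            return name, model, False
    # No neural model loaded: fall through to the deterministic lexical route.
    return route[-1], None, True
```

A caller would pass whatever loader it uses for local model files; the `degraded` flag is only set when neither model loads.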

The boundary

Text embeddings are used for literature and evidence record retrieval, candidate explanation context, generated paper context, failure-record search, and local paper/patent record organisation when those records exist.

Text embeddings are not used for peptide sequence similarity, protein language modeling, novelty scoring over amino-acid strings, deterministic validation, or scoring override. The sequence layer (facebook/esm2_t12_35M_UR50D) handles those.
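One way to make this boundary hard to cross by accident is an explicit guard at the call site. The sketch below is an assumption about how such a check could look; the task labels and the function name `text_embeddings_allowed` are hypothetical and simply mirror the two lists above.

```python
# Hypothetical task labels mirroring the uses listed in the text.
TEXT_EMBEDDING_TASKS = frozenset({
    "literature_retrieval",
    "candidate_explanation_context",
    "generated_paper_context",
    "failure_record_search",
    "local_record_organisation",
})

# Hypothetical labels for the uses the text explicitly rules out.
EXCLUDED_TASKS = frozenset({
    "peptide_sequence_similarity",
    "protein_language_modeling",
    "amino_acid_novelty_scoring",
    "deterministic_validation",
    "scoring_override",
})


def text_embeddings_allowed(task: str) -> bool:
    """Return True only for tasks on the text-embedding side of the boundary."""
    if task in TEXT_EMBEDDING_TASKS:
        return True
    if task in EXCLUDED_TASKS:
        return False
    # Unknown tasks fail loudly rather than defaulting to either side.
    raise ValueError(f"unclassified task: {task!r}")
```

Raising on unknown labels is a design choice in this sketch: a new task has to be classified deliberately before it can reach either layer.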

Cache behaviour

data/cache/text_embeddings.sqlite
-> bounded row cache
-> local-only model loading
-> deterministic fallback when models are unavailable

The cache is operational infrastructure. It improves speed and reproducibility, but it does not turn retrieved sources into biological proof.
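A bounded row cache over SQLite can be sketched with the standard library alone. The schema, the `EmbeddingCache` class, the float32 storage format, and the oldest-first eviction policy below are all assumptions made for illustration; the text only states that the cache lives at data/cache/text_embeddings.sqlite and is bounded.

```python
import hashlib
import sqlite3
import struct
from typing import List, Optional


class EmbeddingCache:
    """Hypothetical bounded cache: keeps at most `max_rows` most recent rows."""

    def __init__(self, path: str = ":memory:", max_rows: int = 10_000) -> None:
        self.db = sqlite3.connect(path)
        self.max_rows = max_rows
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS emb ("
            "key TEXT PRIMARY KEY, vec BLOB NOT NULL, seq INTEGER NOT NULL)"
        )

    @staticmethod
    def key(model: str, text: str) -> str:
        # Key on model and text together so routes never share vectors.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, model: str, text: str) -> Optional[List[float]]:
        row = self.db.execute(
            "SELECT vec FROM emb WHERE key = ?", (self.key(model, text),)
        ).fetchone()
        if row is None:
            return None
        blob = row[0]
        return list(struct.unpack(f"<{len(blob) // 4}f", blob))

    def put(self, model: str, text: str, vec: List[float]) -> None:
        blob = struct.pack(f"<{len(vec)}f", *vec)  # stored as float32
        with self.db:  # one transaction: insert, then trim the oldest rows
            nxt = self.db.execute(
                "SELECT COALESCE(MAX(seq), 0) + 1 FROM emb"
            ).fetchone()[0]
            self.db.execute(
                "INSERT OR REPLACE INTO emb (key, vec, seq) VALUES (?, ?, ?)",
                (self.key(model, text), blob, nxt),
            )
            self.db.execute(
                "DELETE FROM emb WHERE seq <= (SELECT MAX(seq) FROM emb) - ?",
                (self.max_rows,),
            )
```

Because the key includes the model name, a vector produced by the fallback route is never returned for a request that expects a neural model's output.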

Degraded operation

If BGE and Nomic are unavailable, the runtime uses deterministic text hashing and lexical overlap. The fallback is lower quality but stable, local, and explicitly marked in artifacts.
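As an illustration of what deterministic text hashing and lexical overlap could mean in practice, here is a minimal sketch. The tokenizer, the 256-dimension hashed bag-of-words vector, the Jaccard overlap, and the `degraded` marker are assumptions, not the runtime's documented behaviour.

```python
import hashlib
import math
import re
from typing import Dict, List


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_embedding(text: str, dim: int = 256) -> List[float]:
    """Signed feature hashing of tokens into a fixed-size, L2-normalised vector.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would not be reproducible across runs.
    """
    vec = [0.0] * dim
    for tok in tokens(text):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[idx] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(tokens(a)), set(tokens(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def embed_degraded(text: str) -> Dict[str, object]:
    """Return the vector with an explicit marker, as a hypothetical artifact."""
    return {
        "vector": hashed_embedding(text),
        "route": "lexical-fallback",
        "degraded": True,
    }
```

The same input always yields the same vector on any machine, which is what makes the fallback stable, at the cost of capturing only shared vocabulary rather than meaning.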