Protean

Research study/2026-06-11

Computational prioritization of cysteine-rich secreted-like peptide candidates from psilocybin-producing fungal proteomes

A reproducible computational survey of cysteine-rich secreted-like peptide families across fungal proteomes

Date
2026-06-11
Study type
Formal computational study
Source
Protean Psilocybe Secretome Campaign
Computational survey · Fungal proteomes · Cysteine-rich peptides · Secreted-like candidates · Family clustering · DBAASP context

Proteins analyzed

58,449

Mature candidates

470

70% families

346

Study boundary

This study is computational prioritization only. Secreted-like means sequence features consistent with a deterministic heuristic, not confirmed secretion. DBAASP and other local reference archives provide nearest-neighbor context only and do not establish activity, toxicity profile, molecular function, experimental validation, or global novelty for any Protean candidate.

Abstract

Psilocybin-producing fungi are primarily known for small-molecule biosynthesis, but their predicted proteomes also encode short peptide-like sequences that remain sparsely characterized. This study reports a reproducible computational survey of available fungal proteomes to identify short cysteine-rich, secreted-like peptide candidates. Protean analyzed 58,449 predicted proteins from four included fungal assemblies, retained 8,063 proteins of 150 amino acids or fewer, identified 1,394 secreted-like sequences by deterministic sequence heuristics, retained 1,099 cysteine-rich precursors, and produced 470 unique mature peptide candidates. Phase II family clustering resolved these candidates into 346 families at 70 percent identity and 0.8 coverage, including 292 singleton families and 24 cross-species families. Local reference-space comparison using acquired peptide archives, including a DBAASP snapshot used only for nearest-neighbor context, found no exact DBAASP matches and no candidate nearest neighbors at or above 90 percent identity. These results define a bounded computational map of cysteine-rich, secreted-like fungal peptide candidates for future scientific review. The study does not claim confirmed secretion, biological activity, therapeutic relevance, toxicity profile, global novelty, or experimental validation.

Plain-English Summary

What this study means.

Protean screened public fungal proteomes for short, cysteine-rich peptide candidates with secreted-like sequence features. The campaign reduced tens of thousands of predicted proteins to a smaller set of mature peptide candidates, then grouped those candidates into sequence families so the result could be read as a family-level map instead of a raw candidate list.

This is computational prioritization only. The page identifies candidates and reference-space relationships for future scientific review; it does not show that any candidate is secreted in an organism, has biological activity, has a toxicity profile, or has been experimentally validated.

Key findings

  • The workflow reduced 58,449 analyzed fungal proteins to 470 unique mature peptide candidates. Discovery was not rerun for this publication page.
  • MMseqs2 family collapse yielded 346 primary families at 70 percent identity, showing that the set is not reducible to one repeated sequence motif.
  • The largest 70 percent family contained 17 members; 322 families were species-specific and 24 families included more than one species.
  • Recurring cysteine-spacing architectures included C7-1-33-13-6-1-13C in 16 families, C7-1-33-14-6-1-13C in 10 families, and C22-8-31C in 6 families.
  • The DBAASP comparison used 25,069 raw records, 20,582 normalized rows, and 16,989 unique peptide sequences as local nearest-neighbor context only.
  • No exact DBAASP matches and no candidate nearest neighbors at or above 90 percent identity were observed in the local DBAASP snapshot.

01

Introduction

Fungal proteomes encode many short proteins and peptide-like sequences whose biological roles are incompletely understood. In psilocybin-producing fungal lineages, public attention often centers on secondary metabolites, but predicted proteomes provide a separate computational substrate for asking whether short cysteine-rich, secreted-like peptide candidates are present across available assemblies.

This study frames those sequences as a reproducible sequence-space survey. It does not infer therapeutic relevance, biological activity, secretion, or risk profile from sequence alone. Instead, the goal is to create a transparent family-level map that can support future human scientific review.

02

Study Design

The study was built from existing Protean campaign artifacts. No discovery rerun, score mutation, rank mutation, scoring-weight change, validator change, or new model/tool prediction was performed for the website publication step.

The campaign followed a fixed flow: proteome acquisition, normalization, small-protein extraction, deterministic secreted-like prediction, mature peptide extraction, cysteine-rich filtering, deduplication, candidate generation, scoring, family collapse, and local reference comparison. Phase II shifted the interpretation from 470 individual candidates to a family-level view of redundancy, motifs, and reference-space context.

03

Data Sources

The acquisition phase selected four assemblies that exposed the required NCBI protein FASTA, genomic GFF, and assembly report files: Psilocybe cubensis, Psilocybe cyanescens, Panaeolus cyanescens, and Gymnopilus dilepis. Psilocybe azurescens and Pluteus salicinus were excluded because the acquisition manifest did not find a required protein FASTA plus genomic GFF package.

Reference-space comparison used the acquired local reference archive including DBAASP, DRAMP, APD, MEROPS, and PeptideAtlas where available. DBAASP was integrated as a local-first snapshot from dbaasp.org for nearest-neighbor retrieval, reference-space analysis, and annotation context only.

04

Computational Methods

FASTA records were normalized into source-tracked sequence objects and screened for short proteins of <=150 amino acids. Secreted-like status was assigned by deterministic sequence features including N-terminal hydrophobicity, positively charged N-region features, cleavage-motif heuristics, and transmembrane-risk checks.
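To illustrate what a deterministic secreted-like screen can look like, the sketch below checks two of the features named above: a positively charged N-region and a hydrophobic stretch that follows it. It is a toy example, not Protean's heuristic. The function name, window lengths, and cutoffs are all hypothetical, and the cleavage-motif and transmembrane-risk checks are omitted. The hydropathy values are the published Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy scale (published values).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def secreted_like(seq, n_len=5, h_len=12, min_positive=1, min_h_mean=1.5):
    """Toy secreted-like screen (hypothetical windows and cutoffs).

    Requires at least `min_positive` K/R residues in the first `n_len`
    residues and a mean Kyte-Doolittle hydropathy of at least `min_h_mean`
    over the next `h_len` residues.
    """
    seq = seq.upper()
    if len(seq) < n_len + h_len:
        return False
    n_region = seq[:n_len]
    h_region = seq[n_len:n_len + h_len]
    positive = sum(1 for aa in n_region if aa in "KR")
    # Unknown residue codes contribute 0.0 to the hydropathy sum.
    h_mean = sum(KD.get(aa, 0.0) for aa in h_region) / h_len
    return positive >= min_positive and h_mean >= min_h_mean
```

Because every input is a fixed rule over the sequence, the same proteome always produces the same calls, which is the property the word "deterministic" refers to here.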

Cysteine-rich precursors required mature peptide lengths from 20 to 150 amino acids and either at least four cysteine residues or a cysteine fraction of at least 0.06. Candidate properties, including cysteine count, cysteine fraction, spacing motif, charge, pI, hydrophobicity, aromatic fraction, glycine enrichment, proline enrichment, and protease-vulnerability proxy, were computed for prioritization context.
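The cysteine-rich rule stated above is simple enough to write down directly. The sketch below encodes the length window and the count-or-fraction condition from this section. The function name and argument names are hypothetical, and the campaign's own implementation may differ in details such as how ambiguous residues are handled.

```python
def is_cysteine_rich(mature, min_len=20, max_len=150, min_cys=4, min_frac=0.06):
    """Apply the stated filter to a mature peptide sequence.

    Passes when the length is within [min_len, max_len] and the peptide has
    either at least `min_cys` cysteines or a cysteine fraction >= `min_frac`.
    """
    mature = mature.upper()
    length = len(mature)
    if not (min_len <= length <= max_len):
        return False
    cys = mature.count("C")
    return cys >= min_cys or cys / length >= min_frac
```

The fraction branch matters for shorter peptides: a 30-residue peptide with two cysteines passes on fraction (2/30 is about 0.067) even though it fails the count threshold.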

05

Candidate Generation

The campaign retained 470 mature peptide candidates after filtering and global deduplication. Candidate IDs, precursor sequences, mature sequences, species, source assembly, source database, computed features, and provenance were preserved in native Protean candidate objects.

The public study reports the existing candidate and family outputs. It does not modify the top-25 ranking, scoring stack, weights, validators, frontier behavior, or database-comparison effects.

06

Family Clustering

Phase II clustered mature peptide sequences with MMseqs2 version 18-8cc5c at 90, 80, and 70 percent identity. The 70 percent identity and 0.8 coverage threshold was used as the primary family definition.

At 70 percent identity, the 470 candidates collapsed into 346 families, including 292 singletons. The largest family contained 17 members. Species-specific families numbered 322, while 24 families included candidates from more than one species.
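The family counts above can be derived from a cluster assignment table. The sketch below assumes the two-column representative/member TSV that MMseqs2 writes for clustering results, plus a hypothetical `species_of` mapping from candidate ID to species. A command such as `mmseqs easy-cluster candidates.fasta out tmp --min-seq-id 0.7 -c 0.8` would produce this kind of table, but the campaign's exact invocation is not stated here, and the file and function names are illustrative.

```python
import csv
from collections import defaultdict


def read_cluster_tsv(path):
    """Read (representative, member) pairs from a two-column TSV file."""
    with open(path, newline="") as handle:
        return [(row[0], row[1])
                for row in csv.reader(handle, delimiter="\t")
                if len(row) >= 2]


def summarize_clusters(pairs, species_of):
    """Count families, singletons, largest family, and cross-species families."""
    members = defaultdict(set)
    for representative, member in pairs:
        members[representative].add(member)
    sizes = [len(group) for group in members.values()]
    cross = sum(
        1 for group in members.values()
        if len({species_of[candidate] for candidate in group}) > 1
    )
    return {
        "families": len(members),
        "singletons": sum(1 for size in sizes if size == 1),
        "largest": max(sizes, default=0),
        "cross_species": cross,
        "species_specific": len(members) - cross,
    }
```

Species-specific and cross-species counts always sum to the family total, which is a quick consistency check on the reported 322 + 24 = 346.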

07

Reference Corpus Comparison

The DBAASP snapshot contained 25,069 raw records, 20,582 canonical normalized peptide rows, 16,989 unique peptide sequences, and 171 hydrated detail records. The snapshot was local-first and introduced no live scoring dependency.

Candidate-level DBAASP nearest-neighbor comparison found 0 exact matches, 0 matches at or above 90 percent identity, 4 matches at or above 80 percent identity, 26 at or above 70 percent identity, and 441 at or above 50 percent identity. The median nearest-neighbor identity was 0.5625 and the maximum was 0.875. DBAASP activity, toxicity, hemolysis, structure, and literature annotations, when present, describe nearest neighbors only and are not labels for Protean candidates.
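The threshold counts above are cumulative: a candidate with a nearest-neighbor identity of 0.85 is counted at the 80, 70, and 50 percent thresholds. A minimal sketch of that summary follows, assuming one nearest-neighbor identity value per candidate expressed as a fraction between 0 and 1. The function name and return layout are hypothetical, and exact-match detection is left out because it depends on sequence equality rather than on the identity value.

```python
from statistics import median


def threshold_summary(identities, thresholds=(0.9, 0.8, 0.7, 0.5)):
    """Count candidates at or above each identity threshold (cumulative)."""
    identities = list(identities)
    return {
        "at_or_above": {
            t: sum(1 for value in identities if value >= t) for t in thresholds
        },
        "median": median(identities),
        "max": max(identities),
    }
```

Read this way, the reported counts are nested: the 4 candidates at or above 80 percent are also among the 26 at or above 70 percent and the 441 at or above 50 percent.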

08

Results

The central result is a family-level map rather than a claim about individual molecule function. The set contains broad family structure, a high singleton fraction, several recurring cysteine-spacing architectures, and a small number of cross-species families.

Comparison against experimentally characterized peptide space, represented here by the local DBAASP snapshot, found no exact matches and no candidate nearest neighbors at or above 90 percent identity. Lower-identity nearest neighbors provide reference context but do not support function transfer.

09

Discussion

The results suggest that psilocybin-producing fungal proteomes contain a broad computational landscape of short cysteine-rich, secreted-like peptide candidates. Family collapse matters because the original count of 470 candidates includes both singleton sequence hypotheses and larger repeated architectures.

The recurring motifs should be interpreted as sequence architectures. Some motifs recur across multiple families, but recurrence alone does not establish homology, secretion, activity, or ecological function. The strongest contribution of the campaign is a reproducible, artifact-backed prioritization surface for future review.
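To make the motif notation concrete, the sketch below derives a spacing string such as C7-1-33-13-6-1-13C from a sequence. It assumes each number is the count of residues between consecutive cysteines, so two adjacent cysteines give a gap of 0. That convention is an assumption: the source does not state whether Protean counts intervening residues or position differences, and the function name is hypothetical.

```python
def cysteine_spacing_motif(seq):
    """Return a spacing string like 'C3-1C', or None if fewer than two cysteines.

    Assumed convention: each number is the count of non-cysteine residues
    between two consecutive cysteines.
    """
    positions = [index for index, aa in enumerate(seq.upper()) if aa == "C"]
    if len(positions) < 2:
        return None
    gaps = [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
    return "C" + "-".join(str(gap) for gap in gaps) + "C"
```

Under this convention a motif with seven gap values, such as C7-1-33-13-6-1-13C, describes eight cysteines. Two sequences can share a spacing string while differing at every non-cysteine position, which is why recurrence alone does not establish homology.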

10

Limitations

Secretion status was treated as secreted-like based on deterministic sequence features and was not experimentally validated. Protein annotations, mature peptide boundaries, cysteine-rich filters, and family assignments are all computational and depend on the input assemblies and acquisition state.

Reference comparisons are bounded by the local archive. The term "unmatched" means unmatched against the acquired local reference archive at the stated thresholds. It does not mean globally novel. DBAASP provides experimentally derived peptide context but does not establish activity, risk profile, or function for Protean candidates.

11

Reproducibility

The public study is backed by campaign manifests, Phase II validation artifacts, DBAASP snapshot verification, neighbor JSONL outputs, figure-generation scripts, and a public claims audit. Figure data are generated from the existing campaign outputs rather than synthetic or decorative data.

The publication build preserves the original discovery outputs. The website layer converts existing artifacts into public-safe narrative, figures, tables, metadata, and dynamic social images.

12

Data and Code Availability

The study page exposes public-safe figures, captions, tables, and source-artifact names. Raw campaign data and local database snapshots remain governed by the repository's publication boundary and are not automatically exposed through the website.

The source code for the publication page, content registry, figure export script, dynamic Open Graph image route, and claims audit is retained in the Protean repository for review.

13

Conclusion

Protean's Psilocybe Secretome Campaign moved from 470 computational mature peptide candidates to a family-level map of 346 cysteine-rich, secreted-like candidate families. The family structure, recurring motifs, cross-species families, and local reference-space distances provide a more defensible scientific view than candidate count alone.

The study's strongest claim is deliberately bounded: this is a reproducible computational survey and prioritization resource. It is not a validation study, not an activity claim, and not a therapeutic claim.

Figures

Publication figures from existing artifacts.

figure 1 · campaign_manifest.json, phase2_validation_report.md

Study workflow

Workflow from proteomes through small protein filtering, secreted-like heuristic screening, cysteine-rich filtering, mature peptide candidates, clustering, and reference comparison.
The public study summarizes an existing Protean campaign from immutable proteome acquisition through family-level clustering and reference-space comparison. No additional discovery or model prediction was run for this publication page.

figure 2 · candidate_statistics.json

Filtering funnel

Filtering funnel showing 58,449 proteins, 8,063 small proteins, 1,394 secreted-like sequences, 1,099 cysteine-rich precursors, and 470 mature peptide candidates.
Reduction from 58,449 analyzed fungal proteins to 470 mature peptide candidates using length, deterministic secreted-like, cysteine-rich, mature-peptide, and deduplication filters.

figure 3 · cluster_report.json

Cluster collapse

Cluster summary showing 442 clusters and 415 singletons at 90 percent identity, 401 clusters and 356 singletons at 80 percent identity, and 346 clusters and 292 singletons at 70 percent identity.
MMseqs2 clustering reduced the 470 unique candidates into 346 primary families at 70 percent identity and 0.8 coverage, with 292 singleton families.

figure 4 · candidate_statistics.json

Species contribution

Species contribution bar chart showing Psilocybe cyanescens 142 candidates, Gymnopilus dilepis 125, Panaeolus cyanescens 109, and Psilocybe cubensis 94.
Candidate contribution by included species. The chart reports retained mature peptide candidates, not biological expression or activity.

figure 5 · motif_analysis.json

Motif and family distribution

Recurring cysteine spacing motifs, including C7-1-33-13-6-1-13C in 16 families, C7-1-33-14-6-1-13C in 10 families, and C22-8-31C in 6 families.
Recurring cysteine-spacing architectures among 70 percent identity families. Motifs are sequence patterns and are not functional assignments.

figure 6 · dbaasp_novelty_summary.json, dbaasp_candidate_neighbors.jsonl

DBAASP nearest-neighbor identity

Histogram of DBAASP nearest-neighbor identity values, with most candidates between 50 and 69 percent identity and no exact or 90 percent identity matches.
Nearest-neighbor identity distribution against the local DBAASP snapshot. DBAASP annotations were used only as context for nearest neighbors and were not transferred to Protean candidates.

figure 7 · dbaasp_novelty_summary.json, novelty_report.json

Reference archive comparison

Reference summary showing zero exact DBAASP matches, zero DBAASP matches at or above 90 percent identity, four at or above 80 percent identity, 26 at or above 70 percent identity, and 441 at or above 50 percent identity.
Threshold summary for DBAASP nearest-neighbor comparison. The broader reference-archive statement remains local and acquired-archive bounded.

figure 8 · family_manifest.json, dbaasp_family_neighbors.jsonl

Top family representatives

Largest 70 percent identity families, led by mmseqs_70_0001 with 17 members, mmseqs_70_0002 with 12 members, and mmseqs_70_0003 with 9 members.
Largest 70 percent identity families and representative candidates. These representatives are anchors for family-level review, not experimentally confirmed molecules.

Tables

Counts, thresholds, and boundaries.

Table · assemblies

Included and excluded assemblies

Assembly inclusion was determined during the original acquisition pass by availability of required NCBI protein FASTA, genomic GFF, and assembly report artifacts.

Species | Assembly | Level | Source | Status
Psilocybe cubensis | GCF_017499595.1 MGC_Penvy_1 | Chromosome | RefSeq | Included
Psilocybe cyanescens | GCA_002938375.1 Psicy2 | Scaffold | GenBank | Included
Panaeolus cyanescens | GCA_002938355.1 ASM293835v1 | Scaffold | GenBank | Included
Gymnopilus dilepis | GCA_002938385.1 ASM293838v1 | Scaffold | GenBank | Included
Psilocybe azurescens | No selected assembly | n/a | NCBI | Excluded: no required protein FASTA plus genomic GFF package
Pluteus salicinus | No selected assembly | n/a | NCBI | Excluded: no required protein FASTA plus genomic GFF package

Table · filtering funnel

Filtering funnel counts

Counts are taken from the completed campaign statistics and were not recomputed by the website build.

Stage | Count | Interpretation
Predicted fungal proteins analyzed | 58449 | Input proteome records after acquisition and normalization
Small proteins <=150 amino acids | 8063 | Length screen
Secreted-like by deterministic sequence heuristic | 1394 | Signal-peptide-like feature screen
Cysteine-rich precursors | 1099 | Cysteine count or fraction screen
Mature peptide candidates | 470 | Mature candidate set
Unique mature peptide candidates | 470 | Global deduplication result
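
The funnel counts can be read as stage-to-stage and overall retention fractions. The short sketch below computes both from the counts in this table. The stage labels are abbreviated and the function name is hypothetical; nothing here recomputes the campaign itself.

```python
FUNNEL = [
    ("Predicted fungal proteins analyzed", 58449),
    ("Small proteins <=150 amino acids", 8063),
    ("Secreted-like", 1394),
    ("Cysteine-rich precursors", 1099),
    ("Unique mature peptide candidates", 470),
]


def retention(funnel):
    """Return (stage, count, fraction of previous stage, fraction of input)."""
    rows = []
    first = funnel[0][1]
    previous = first
    for stage, count in funnel:
        rows.append((stage, count, count / previous, count / first))
        previous = count
    return rows


for stage, count, step, overall in retention(FUNNEL):
    print(f"{stage}: {count} ({step:.1%} of previous, {overall:.2%} of input)")
```

The final row shows that the 470 unique candidates are about 0.8 percent of the 58,449 input proteins.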

Table · cluster summary

MMseqs2 cluster summary

MMseqs2 version 18-8cc5c was used for family collapse at 90, 80, and 70 percent identity.

Identity threshold | Clusters | Singletons | Primary use
90% | 442 | 415 | Near-duplicate collapse
80% | 401 | 356 | Intermediate similarity view
70% | 346 | 292 | Primary family definition

Table · dbaasp snapshot

DBAASP snapshot summary

DBAASP was integrated as a local-first nearest-neighbor and reference-context corpus with no live scoring dependency.

Snapshot field | Value | Boundary
Raw DBAASP records | 25069 | Local snapshot from dbaasp.org
Canonical normalized peptide rows | 20582 | Normalization output
Unique peptide sequences | 16989 | Reproducible unique count
Hydrated detail records present | 171 | Context where available
Scoring or ranking mutation | None | DBAASP was not used to change scores, ranks, or weights

Table · dbaasp thresholds

DBAASP nearest-neighbor threshold summary

Threshold counts are candidate-level nearest-neighbor summaries across all 470 mature peptide candidates.

Threshold | Candidate count | Interpretation
Exact DBAASP match | 0 | No exact sequence match in the local DBAASP snapshot
>=90% identity | 0 | No high-identity nearest neighbor at this threshold
>=80% identity | 4 | Small number of higher-similarity neighbors
>=70% identity | 26 | Lower high-similarity context
>=50% identity | 441 | Broad lower-identity neighbor context
Median nearest-neighbor identity | 0.5625 | Candidate-level median
Maximum nearest-neighbor identity | 0.875 | Candidate-level maximum

Table · top 25 candidates

Top 25 Phase II candidates

Ranks are the existing Phase II prioritization output and were not modified for the public study page.

Rank | Family | Candidate | Family size | Species | Phase II score | Novelty
1 | mmseqs_70_0099 | psilocybe-secretome-gymnopilus-dilepis-7248e9bd | 1 | Gymnopilus dilepis:1 | 0.2903 | unmatched
2 | mmseqs_70_0313 | psilocybe-secretome-psilocybe-cyanescens-a17435b5 | 1 | Psilocybe cyanescens:1 | 0.2779 | unmatched
3 | mmseqs_70_0169 | psilocybe-secretome-panaeolus-cyanescens-62372ca7 | 1 | Panaeolus cyanescens:1 | 0.2717 | unmatched
4 | mmseqs_70_0087 | psilocybe-secretome-gymnopilus-dilepis-55ea8408 | 1 | Gymnopilus dilepis:1 | 0.2559 | unmatched
5 | mmseqs_70_0067 | psilocybe-secretome-gymnopilus-dilepis-249f6d4a | 1 | Gymnopilus dilepis:1 | 0.2457 | unmatched
6 | mmseqs_70_0121 | psilocybe-secretome-gymnopilus-dilepis-bdc12010 | 1 | Gymnopilus dilepis:1 | 0.2389 | unmatched
7 | mmseqs_70_0143 | psilocybe-secretome-gymnopilus-dilepis-f3c7cf8e | 1 | Gymnopilus dilepis:1 | 0.2347 | unmatched
8 | mmseqs_70_0212 | psilocybe-secretome-psilocybe-cubensis-076256a3 | 1 | Psilocybe cubensis:1 | 0.2319 | unmatched
9 | mmseqs_70_0265 | psilocybe-secretome-psilocybe-cyanescens-14143d99 | 1 | Psilocybe cyanescens:1 | 0.2315 | unmatched
10 | mmseqs_70_0345 | psilocybe-secretome-psilocybe-cyanescens-f8b2d44c | 1 | Psilocybe cyanescens:1 | 0.2307 | unmatched
11 | mmseqs_70_0064 | psilocybe-secretome-gymnopilus-dilepis-1185c2a6 | 1 | Gymnopilus dilepis:1 | 0.2289 | unmatched
12 | mmseqs_70_0131 | psilocybe-secretome-gymnopilus-dilepis-d8070746 | 1 | Gymnopilus dilepis:1 | 0.2287 | unmatched
13 | mmseqs_70_0250 | psilocybe-secretome-psilocybe-cubensis-c244b7e3 | 1 | Psilocybe cubensis:1 | 0.2262 | unmatched
14 | mmseqs_70_0016 | psilocybe-secretome-panaeolus-cyanescens-3c0b6230 | 3 | Panaeolus cyanescens:3 | 0.2228 | unmatched
15 | mmseqs_70_0332 | psilocybe-secretome-psilocybe-cyanescens-dbacbc9a | 1 | Psilocybe cyanescens:1 | 0.2212 | unmatched
16 | mmseqs_70_0153 | psilocybe-secretome-panaeolus-cyanescens-14669d14 | 1 | Panaeolus cyanescens:1 | 0.2186 | unmatched
17 | mmseqs_70_0053 | psilocybe-secretome-psilocybe-cyanescens-a2f0ebce | 2 | Psilocybe cyanescens:1, Psilocybe cubensis:1 | 0.2181 | unmatched
18 | mmseqs_70_0055 | psilocybe-secretome-gymnopilus-dilepis-00fd0ecb | 1 | Gymnopilus dilepis:1 | 0.2162 | unmatched
19 | mmseqs_70_0278 | psilocybe-secretome-psilocybe-cyanescens-36a5a4ea | 1 | Psilocybe cyanescens:1 | 0.2099 | unmatched
20 | mmseqs_70_0310 | psilocybe-secretome-psilocybe-cyanescens-94a830d9 | 1 | Psilocybe cyanescens:1 | 0.2099 | unmatched
21 | mmseqs_70_0199 | psilocybe-secretome-panaeolus-cyanescens-d1d9dd38 | 1 | Panaeolus cyanescens:1 | 0.2063 | unmatched
22 | mmseqs_70_0083 | psilocybe-secretome-gymnopilus-dilepis-4a527cb2 | 1 | Gymnopilus dilepis:1 | 0.2054 | unmatched
23 | mmseqs_70_0324 | psilocybe-secretome-psilocybe-cyanescens-bfbac08d | 1 | Psilocybe cyanescens:1 | 0.2033 | unmatched
24 | mmseqs_70_0056 | psilocybe-secretome-gymnopilus-dilepis-04232462 | 1 | Gymnopilus dilepis:1 | 0.2017 | unmatched
25 | mmseqs_70_0222 | psilocybe-secretome-psilocybe-cubensis-29265e62 | 1 | Psilocybe cubensis:1 | 0.2006 | unmatched

Table · top 10 families

Top 10 largest family representatives

Representatives are the largest 70 percent identity families. DBAASP values are nearest-neighbor context only.

Family | Representative candidate | Size | Species distribution | Motif | DBAASP identity | Coverage
mmseqs_70_0001 | psilocybe-secretome-panaeolus-cyanescens-4483cb77 | 17 | Gymnopilus dilepis:3; Panaeolus cyanescens:2; Psilocybe cubensis:6; Psilocybe cyanescens:6 | C7-1-33-13-6-1-13C | 0.5000 | 0.2917
mmseqs_70_0002 | psilocybe-secretome-psilocybe-cyanescens-35b32b4b | 12 | Gymnopilus dilepis:2; Psilocybe cubensis:3; Psilocybe cyanescens:7 | C15-32-14C | 0.5500 | 0.2174
mmseqs_70_0003 | psilocybe-secretome-panaeolus-cyanescens-e50cec72 | 9 | Gymnopilus dilepis:1; Panaeolus cyanescens:8 | C7-1-33-14-6-1-13C | 0.4762 | 0.2247
mmseqs_70_0004 | psilocybe-secretome-panaeolus-cyanescens-5e0f6256 | 5 | Panaeolus cyanescens:5 | C7-1-33-13-6-1-13C | 0.5882 | 0.1932
mmseqs_70_0005 | psilocybe-secretome-panaeolus-cyanescens-73d40076 | 5 | Panaeolus cyanescens:5 | C1-4-10-21-7C | 0.6000 | 0.1705
mmseqs_70_0006 | psilocybe-secretome-psilocybe-cubensis-4600dcef | 5 | Psilocybe cubensis:3; Psilocybe cyanescens:2 | C22-8-28C | 0.5385 | 0.2308
mmseqs_70_0007 | psilocybe-secretome-psilocybe-cubensis-901c7b27 | 5 | Psilocybe cubensis:3; Psilocybe cyanescens:2 | C7-1-33-13-6-1-13C | 0.5652 | 0.2556
mmseqs_70_0008 | psilocybe-secretome-psilocybe-cyanescens-518db899 | 5 | Psilocybe cyanescens:5 | C8-10-25-8-40C | 0.5333 | 0.1271
mmseqs_70_0009 | psilocybe-secretome-gymnopilus-dilepis-17ecf577 | 4 | Gymnopilus dilepis:4 | C7-10-25-8-40C | 0.5294 | 0.1518
mmseqs_70_0010 | psilocybe-secretome-gymnopilus-dilepis-bae12773 | 4 | Gymnopilus dilepis:1; Panaeolus cyanescens:1; Psilocybe cubensis:1; Psilocybe cyanescens:1 | C10-11-3-5-11-6C | 0.5625 | 0.1923

Table · limitations

Limitations and unresolved validation steps

These limitations are part of the public claim boundary and should be read with the results.

Topic | Current status | Conservative public wording
Secretion/localization | Deterministic sequence features only | Secreted-like; not experimentally localized
Biological function | Not assayed | No activity or molecular function is claimed
Toxicity or risk profile | Not assayed for Protean candidates | Nearest-neighbor annotations are context only
Reference archive scope | Local acquired archive | Unmatched against the acquired local reference archive, not globally novel
Assembly and annotation dependence | NCBI assembly-dependent | Candidate set depends on available predicted proteomes and annotations
Ranking | Existing scores and ranks retained | DBAASP and publication work did not modify scores, ranks, or weights

Artifacts

Public links and background sources.

Public-safe artifacts

Background sources