Acknowledgements

Integrated methods, tools and databases

AggreProt

AggreProt is a web server for predicting aggregation-prone regions in protein sequences using deep neural networks trained on experimentally validated hexapeptides. It provides per-residue aggregation scores and related sequence features and shows competitive performance relative to existing methods.
Reference: https://pubmed.ncbi.nlm.nih.gov/38801076/
Homepage: https://loschmidt.chemi.muni.cz/aggreprot/

BLAST+

The BLAST+ suite is among the most widely used tools for searching protein and DNA databases for sequence similarity. BLAST finds regions of local similarity between sequences, compares sequences against sequence databases, and calculates the statistical significance of matches.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/20003500
Homepage: https://blast.ncbi.nlm.nih.gov/Blast.cgi

CataPro

CataPro is a deep learning model that predicts three enzyme kinetic parameters: turnover number (kcat), Michaelis constant (Km), and catalytic efficiency (kcat/Km). It uses pre-trained models and molecular fingerprints, demonstrates improved accuracy and generalization compared with previous methods, and can guide enzyme discovery and engineering.
Reference: https://www.nature.com/articles/s41467-025-58038-4
Homepage: https://github.com/zchwang/CataPro

Clustal Omega

Clustal Omega is an accurate and very fast program for creating multiple alignments of protein and nucleotide sequences. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/21988835
Homepage: https://www.ebi.ac.uk/jdispatcher/msa/clustalo

Cytoscape

Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/14597658
Homepage: https://cytoscape.org/

Farthest-first traversal algorithm

The farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously selected points. EnzymeMiner uses this algorithm in the diversification step of the Selection wizard to generate a target number of clusters.
Homepage: https://en.wikipedia.org/wiki/Farthest-first_traversal
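
The greedy selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not EnzymeMiner's actual implementation: the function names, the `start` parameter, the absolute-difference distance used in the example, and the nearest-center assignment step are all hypothetical choices made for the example. Any symmetric distance function (for instance, one derived from pairwise sequence identity) could be passed in instead.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def farthest_first_traversal(
    points: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
    start: int = 0,
) -> List[int]:
    """Return the indices of up to k points picked by farthest-first traversal."""
    n = len(points)
    if n == 0 or k <= 0:
        return []
    k = min(k, n)

    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [distance(points[start], p) for p in points]
    # Already selected points get -inf so they can never be picked again.
    min_dist[start] = float("-inf")

    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist[nxt] = float("-inf")
        # Refresh nearest-selected distances against the new point.
        for i in range(n):
            d = distance(points[nxt], points[i])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def assign_to_nearest(
    points: Sequence[T],
    centers: Sequence[int],
    distance: Callable[[T, T], float],
) -> List[int]:
    """Label each point with the index of its nearest selected center."""
    return [
        min(centers, key=lambda c: distance(points[c], p))
        for p in points
    ]


# Toy example: five points on a line, three clusters requested.
values = [0.0, 1.0, 5.0, 10.0, 11.0]
abs_diff = lambda a, b: abs(a - b)

centers = farthest_first_traversal(values, 3, abs_diff)
labels = assign_to_nearest(values, centers, abs_diff)
print(centers)  # [0, 4, 2]
print(labels)   # [0, 0, 2, 4, 4]
```

Keeping a running nearest-selected distance for every point means each new pick costs one pass over the data, so selecting k representatives from n points needs about n × k distance evaluations rather than recomputing all pairwise distances.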

InterProScan

InterProScan is a tool that combines different protein signature recognition methods from the InterPro consortium member databases into one resource.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/15980438
Homepage: https://www.ebi.ac.uk/interpro/

MGnify protein database

The MGnify protein database is a non-redundant protein sequence database built from metagenomic and metatranscriptomic assemblies.
Reference: https://pubmed.ncbi.nlm.nih.gov/36477304/
Homepage: https://www.ebi.ac.uk/metagenomics

MMseqs2

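MMseqs2 (Many-against-Many sequence searching) is a software suite for searching and clustering large sequence sets. It improves on current search tools over the full range of the speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed.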
Reference: https://www.ncbi.nlm.nih.gov/pubmed/29035372
Homepage: https://mmseqs.com

NCBI NR

The NCBI NR database contains non-redundant protein sequences from GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/20003500
Homepage: https://ftp.ncbi.nlm.nih.gov/blast/db/

NCBI BioProject

The NCBI BioProject database was established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/22139929
Homepage: https://www.ncbi.nlm.nih.gov/bioproject

NCBI Taxonomy

The NCBI Taxonomy database is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/22139910
Homepage: https://www.ncbi.nlm.nih.gov/taxonomy

OpenBabel

OpenBabel is an open-source chemical toolbox designed to facilitate the conversion, analysis, and manipulation of molecular structures and data. It supports a wide range of chemical file formats and provides tools for cheminformatics, molecular modeling, and computational chemistry workflows.
Reference: https://pubmed.ncbi.nlm.nih.gov/21982300/
Homepage: https://openbabel.org/

OphPred

OpHpred predicts the optimal pH range of enzymes from their amino acid sequences using a language-model-based machine learning approach. The method shows high accuracy across diverse protein families and low-similarity sequences, enabling high-throughput in silico screening of enzymes for desired pH activity.
Reference: https://pubs.acs.org/doi/full/10.1021/acssynbio.4c00465
Homepage: https://github.com/i-Molecule/optimalPh

SoluProt

SoluProt is a machine-learning predictor of soluble protein expression in Escherichia coli. Its training set is based on the TargetTrack database, which was carefully filtered to keep only targets expressed in E. coli.
Reference: https://pubmed.ncbi.nlm.nih.gov/33416864/
Homepage: https://loschmidt.chemi.muni.cz/soluprot/

TMHMM

TMHMM predicts transmembrane protein topology with a hidden Markov model.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/11152613
Homepage: https://services.healthtech.dtu.dk/services/TMHMM-2.0/

TmProt

TmProt predicts protein melting temperatures from amino acid sequences using a machine-learning model trained on a curated dataset of experimentally determined Tm values. It supports the identification of stable proteins for biotechnological and biopharmaceutical applications.
Homepage: https://loschmidt.chemi.muni.cz/tmprot/

UniProt/Swiss-Prot

Swiss-Prot is a curated protein sequence database that strives to provide a high level of annotation (such as the description of a protein's function, its domain structure, post-translational modifications, and variants), a minimal level of redundancy, and a high level of integration with other databases.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/25348405
Homepage: https://www.uniprot.org/

USEARCH

USEARCH is an ultrafast algorithm for sequence database search that seeks high-scoring local and global alignments. High throughput is achieved by a fast heuristic designed to rapidly identify one or a few good hits rather than all homologous sequences.
Reference: https://www.ncbi.nlm.nih.gov/pubmed/20709691
Homepage: https://drive5.com/usearch/