Acknowledgements
Integrated methods, tools and databases
AggreProt
- AggreProt is a web server for predicting aggregation-prone regions in protein sequences using deep neural networks trained on experimentally validated hexapeptides. It provides per-residue aggregation scores and related sequence features and shows competitive performance relative to existing methods.
- Reference: https://pubmed.ncbi.nlm.nih.gov/38801076/
- Homepage: https://loschmidt.chemi.muni.cz/aggreprot/
BLAST+
- The BLAST+ suite is the most widely used set of tools for searching protein and DNA databases for sequence similarities. BLAST finds regions of local similarity between sequences, compares sequences to sequence databases and calculates the statistical significance of matches.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/20003500
- Homepage: https://blast.ncbi.nlm.nih.gov/Blast.cgi
CataPro
- CataPro is a deep learning model that uses pre-trained models and molecular fingerprints to predict three enzyme kinetic parameters: turnover number (kcat), Michaelis constant (Km), and catalytic efficiency (kcat/Km). It demonstrates improved accuracy and generalization compared with previous methods and can guide enzyme discovery and engineering.
- Reference: https://www.nature.com/articles/s41467-025-58038-4
- Homepage: https://github.com/zchwang/CataPro
Clustal Omega
- Clustal Omega is an accurate and very fast program for creating multiple alignments of protein and nucleotide sequences. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/21988835
- Homepage: https://www.ebi.ac.uk/jdispatcher/msa/clustalo
Cytoscape
- Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/14597658
- Homepage: https://cytoscape.org/
Farthest-first traversal algorithm
- The farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously selected points. EnzymeMiner uses this algorithm in the diversification step of the Selection wizard to generate a target number of clusters.
- Homepage: https://en.wikipedia.org/wiki/Farthest-first_traversal
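
The traversal described above can be sketched in a few lines of Python. This is only an illustrative sketch, not EnzymeMiner's implementation: the function names `farthest_first_traversal` and `assign_clusters`, the toy 2D points, and the use of Euclidean distance are all assumptions made for the example. The distance measure EnzymeMiner actually uses is not stated here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def farthest_first_traversal(points, k, start=0):
    """Return the indices of k points picked by farthest-first traversal.

    `points` is a sequence of coordinate tuples; `start` is the index of the
    arbitrarily chosen first point (hypothetical parameter for this sketch).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            d = dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def assign_clusters(points, centers):
    """Label each point with the position of its nearest selected center."""
    return [
        min(range(len(centers)), key=lambda c: dist(p, points[centers[c]]))
        for p in points
    ]


# Toy data (assumed for illustration): two tight pairs and one outlier.
points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
centers = farthest_first_traversal(points, k=3)
labels = assign_clusters(points, centers)
print(centers, labels)
```

Choosing `k` as the target number of clusters and assigning every remaining point to its nearest selected point is one common way to turn the traversal into a clustering; replacing `dist` with a sequence-based distance would be needed for protein data.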
InterProScan
- InterProScan is a tool that combines different protein signature recognition methods from the InterPro consortium member databases into one resource.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/15980438
- Homepage: https://www.ebi.ac.uk/interpro/
MGnify protein database
- Non-redundant protein database from metagenomic and metatranscriptomic assemblies.
- Reference: https://pubmed.ncbi.nlm.nih.gov/36477304/
- Homepage: https://www.ebi.ac.uk/metagenomics
MMseqs2
- MMseqs2 is a software suite for fast sequence searching and clustering. It improves on current search tools over the full range of the speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/29035372
- Homepage: https://mmseqs.com
NCBI NR
- The NCBI NR database contains non-redundant protein sequences from GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/20003500
- Homepage: https://ftp.ncbi.nlm.nih.gov/blast/db/
NCBI BioProject
- The NCBI BioProject database was established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/22139929
- Homepage: https://www.ncbi.nlm.nih.gov/bioproject
NCBI Taxonomy
- The NCBI Taxonomy database is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/22139910
- Homepage: https://www.ncbi.nlm.nih.gov/taxonomy
OpenBabel
- OpenBabel is an open-source chemical toolbox designed to facilitate the conversion, analysis, and manipulation of molecular structures and data. It supports a wide range of chemical file formats and provides tools for cheminformatics, molecular modeling, and computational chemistry workflows.
- Reference: https://pubmed.ncbi.nlm.nih.gov/21982300/
- Homepage: https://openbabel.org/
OphPred
- OpHpred predicts the optimal pH range of enzymes from their amino acid sequences using a language-model-based machine learning approach. The method shows high accuracy across diverse protein families and low-similarity sequences, enabling high-throughput in silico screening of enzymes for desired pH activity.
- Reference: https://pubs.acs.org/doi/full/10.1021/acssynbio.4c00465
- Homepage: https://github.com/i-Molecule/optimalPh
SoluProt
- SoluProt is one of the latest additions to the family of solubility predictors based on machine learning. The training set is based on the TargetTrack database, which was carefully filtered to keep only targets expressed in Escherichia coli.
- Reference: https://pubmed.ncbi.nlm.nih.gov/33416864/
- Homepage: https://loschmidt.chemi.muni.cz/soluprot/
TMHMM
- TMHMM predicts transmembrane protein topology with a hidden Markov model.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/11152613
- Homepage: https://services.healthtech.dtu.dk/services/TMHMM-2.0/
TmProt
- TmProt predicts protein melting temperatures from amino acid sequences using a machine-learning model trained on a curated dataset of experimentally determined Tm values. It supports the identification of stable proteins for biotechnological and biopharmaceutical applications.
- Homepage: https://loschmidt.chemi.muni.cz/tmprot/
UniProt/Swiss-Prot
- Swiss-Prot is a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/25348405
- Homepage: https://www.uniprot.org/
USEARCH
- USEARCH is an ultrafast algorithm for sequence database search that seeks high-scoring local and global alignments. High throughput is achieved by using a fast heuristic designed to enable rapid identification of one or a few good hits rather than all homologous sequences.
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/20709691
- Homepage: https://drive5.com/usearch/