v2.0
Automated mining of soluble enzymes with diverse structures, catalytic properties and stabilities

Acknowledgements

Integrated methods, tools and databases

AggreProt

BLAST+

CataPro

  • CataPro is a deep learning model that predicts three enzyme kinetic parameters: turnover number (kcat), Michaelis constant (Km), and catalytic efficiency (kcat/Km). It combines pre-trained models with molecular fingerprints, demonstrates improved accuracy and generalization compared with previous methods, and can guide enzyme discovery and engineering.
  • Reference: https://www.nature.com/articles/s41467-025-58038-4
  • Homepage: https://github.com/zchwang/CataPro
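
To make the three kinetic parameters concrete, the sketch below shows how they relate under the standard Michaelis-Menten rate law. It is not CataPro code and does not call CataPro; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def catalytic_efficiency(kcat_per_s: float, km_molar: float) -> float:
    """Return kcat/Km in M^-1 s^-1 (kcat in s^-1, Km in M)."""
    if km_molar <= 0:
        raise ValueError("Km must be positive")
    return kcat_per_s / km_molar


def michaelis_menten_rate(kcat_per_s: float, km_molar: float,
                          enzyme_molar: float, substrate_molar: float) -> float:
    """Initial rate v = kcat * [E] * [S] / (Km + [S]), in M s^-1."""
    return kcat_per_s * enzyme_molar * substrate_molar / (km_molar + substrate_molar)


# Hypothetical enzyme: kcat = 100 s^-1, Km = 1 mM
efficiency = catalytic_efficiency(100.0, 1e-3)
# When [S] equals Km, the rate is half of the maximum kcat * [E]
half_max_rate = michaelis_menten_rate(100.0, 1e-3, 1e-6, 1e-3)
```

A higher kcat/Km means the enzyme converts substrate more efficiently at low substrate concentrations, which is why it is often used to rank candidates.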

Clustal Omega

  • Clustal Omega is an accurate and very fast program for creating multiple alignments of protein and nucleotide sequences. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models.
  • Reference: https://www.ncbi.nlm.nih.gov/pubmed/21988835
  • Homepage: https://www.ebi.ac.uk/jdispatcher/msa/clustalo

Cytoscape

Farthest-first traversal algorithm

  • The farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously-selected points. EnzymeMiner uses this algorithm in the diversification step of the Selection wizard to generate a target number of clusters.
  • Homepage: https://en.wikipedia.org/wiki/Farthest-first_traversal
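
The traversal described above can be sketched in a few lines of Python. This is a minimal illustration of the general algorithm, not EnzymeMiner's implementation: the function names, the choice of the first point, and the distance function are assumptions, since the source does not say how distances between sequences are measured.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def farthest_first(points: Sequence[T], k: int,
                   dist: Callable[[T, T], float], start: int = 0) -> List[int]:
    """Return indices of up to k points picked by farthest-first traversal."""
    if not points or k <= 0:
        return []
    selected = [start]  # the first point is arbitrary (here: index `start`)
    # min_dist[i] is the distance from point i to its nearest selected point
    min_dist = [dist(points[start], p) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0:
            break  # every remaining point coincides with a selected one
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], dist(points[nxt], p))
    return selected


def assign_clusters(points: Sequence[T], centers: List[int],
                    dist: Callable[[T, T], float]) -> List[int]:
    """Label each point with the position of its nearest selected center."""
    return [min(range(len(centers)), key=lambda c: dist(points[centers[c]], p))
            for p in points]


# Toy example: one-dimensional points with absolute difference as distance
values = [0.0, 1.0, 2.0, 9.0, 10.0]
centers = farthest_first(values, 3, lambda a, b: abs(a - b))
labels = assign_clusters(values, centers, lambda a, b: abs(a - b))
```

Each selected point acts as a cluster representative, so asking for k points yields k clusters once every remaining point is assigned to its nearest representative. Keeping a running minimum distance per point makes each step linear in the number of points.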

InterProScan

MGnify protein database

MMseqs2

NCBI NR

NCBI BioProject

  • The NCBI BioProject database was established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability.
  • Reference: https://www.ncbi.nlm.nih.gov/pubmed/22139929
  • Homepage: https://www.ncbi.nlm.nih.gov/bioproject

NCBI Taxonomy

  • The NCBI Taxonomy database is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases.
  • Reference: https://www.ncbi.nlm.nih.gov/pubmed/22139910
  • Homepage: https://www.ncbi.nlm.nih.gov/taxonomy

OpenBabel

  • OpenBabel is an open-source chemical toolbox designed to facilitate the conversion, analysis, and manipulation of molecular structures and data. It supports a wide range of chemical file formats and provides tools for cheminformatics, molecular modeling, and computational chemistry workflows.
  • Reference: https://pubmed.ncbi.nlm.nih.gov/21982300/
  • Homepage: https://openbabel.org/

OphPred

SoluProt

TMHMM

TmProt

  • TmProt predicts protein melting temperatures from amino acid sequences using a machine-learning model trained on a curated dataset of experimentally determined Tm values. It supports the identification of stable proteins for biotechnological and biopharmaceutical applications.
  • Homepage: https://loschmidt.chemi.muni.cz/tmprot/

UniProt/Swiss-Prot

  • Swiss-Prot is a curated protein sequence database that strives to provide a high level of annotation (such as the description of a protein's function, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.
  • Reference: https://www.ncbi.nlm.nih.gov/pubmed/25348405
  • Homepage: https://www.uniprot.org/

USEARCH

  • USEARCH is an ultrafast algorithm for sequence database search that seeks high-scoring local and global alignments. High throughput is achieved by using a fast heuristic designed to rapidly identify one or a few good hits rather than all homologous sequences.
  • Reference: https://www.ncbi.nlm.nih.gov/pubmed/20709691
  • Homepage: https://drive5.com/usearch/