Help

Outline

Calculation setup
Results
Target Selection Table
Selection Wizard
Sequence similarity network
Advanced options

Precalculated results are available at /enzymeminer/job/example

Calculation setup

Go to the EnzymeMiner homepage.

A. Search for homologous enzymes using the automatic mode
1. To load the Example case, click on the Load example button in the top-right corner of the Job input box.
  The table of haloalkane dehalogenases from the Swiss-Prot database (Enzyme Commission number 3.8.1.5) is loaded, and two sample sequences are selected: D4Z2G1 (LinB) and P22643 (DhlA). Both sequences have five catalytic residues and a single Pfam domain Abhydrolase_1. They represent two different subfamilies of the haloalkane dehalogenase family. Click on the sequence accession for more details about the sequence.
2. To specify your own enzyme instead of loading the Example case, enter the Enzyme Commission number of your enzyme (3.8.1.5 in our previous example).
3. Then select the UniProtKB accession number (one or several) corresponding to your proteins of interest (D4Z2G1 and P22643 in our example).

B. Search for homologous enzymes by Custom sequences & Advanced options
In some cases, specifying the EC number in automatic mode may not yield the desired list of enzymes, or it may fail to identify the correct essential residues. In such cases, the user can specify their protein(s) of interest by their FASTA sequence.
1. For that, click on the Custom sequences tab on the Job input box.
2. Optionally, click on the Load example button.
3. The FASTA sequences from our Example (D4Z2G1 and P22643) will then appear in the Query sequences field automatically.
  If you want to specify your own sequences instead, you can paste the FASTA sequence(s) of your enzymes in the Query sequences field.
4. Alternatively, you can load your sequences from files in FASTA format by clicking on the Upload FASTA button.
  Keep in mind that the FASTA sequences require a header line that begins with the ">" symbol, followed by the protein name or identifier. The header is followed by one or more lines containing the sequence itself, with or without spaces or line breaks.
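As an illustration of the FASTA layout described above, the following minimal sketch parses FASTA text into header/sequence pairs and rejects sequence data that is not preceded by a ">" header line. It is not part of EnzymeMiner; the function name parse_fasta and the toy sequences are assumptions made for this example only.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' header line")
            # Remove spaces inside the sequence line before storing it.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (made-up sequences): spaces and line breaks inside a sequence are tolerated.
example = """>query_1 hypothetical example protein
MSEIGT GFPF
DPHYVE
>query_2
MINAIR
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

Running such a check locally before pasting sequences into the Query sequences field can catch a missing header line early.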
Optionally, you can add sequences to the Other known sequences field, also in FASTA format. These sequences are not used for the database search; EnzymeMiner uses them only to calculate the sequence identity between those known sequences and all the hits returned by the search. This feature can help you identify how similar or different your candidate hits are relative to the “other known” ones.
Next, you need to specify the essential residues (catalytic or otherwise). This field is prefilled for our Example.
To specify your own Essential residue template, follow the sequence of steps:
1. Add a protein (a row).
2. Specify its name or identifier from the list of proteins entered above.
3. Add a residue (a column).
4. Specify its position in the sequence.
5. Click on Set residues to define the amino acid(s) that are allowed for that position.
6. On the new pop-up window, select one or several amino acids.
7. Click on the OK button.
8. You must give each essential residue a unique name, which you can use to describe its function.
To add a new essential residue template for a different protein sequence, repeat steps 1–8.
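The template built in steps 1–8 can be thought of as a list of named positions, each with a set of allowed amino acids. The sketch below is a hypothetical local sanity check, not EnzymeMiner code: it assumes 1-based positions, and the residue names, positions, and toy sequence are invented for illustration. It verifies that each name is unique and that the query sequence itself carries an allowed amino acid at each position.

```python
def check_template(sequence, residues):
    """Return a list of problems found in an essential residue template."""
    problems = []
    seen_names = set()
    for res in residues:
        name, pos, allowed = res["name"], res["position"], res["allowed"]
        # Each essential residue needs a unique name (step 8).
        if name in seen_names:
            problems.append(f"duplicate residue name: {name}")
        seen_names.add(name)
        # Positions are assumed to be 1-based, as in most sequence viewers.
        if not 1 <= pos <= len(sequence):
            problems.append(f"{name}: position {pos} is outside the sequence")
            continue
        actual = sequence[pos - 1]
        if actual not in allowed:
            problems.append(
                f"{name}: found {actual} at position {pos}, allowed {sorted(allowed)}"
            )
    return problems


# Hypothetical template for a made-up 8-residue sequence.
toy_sequence = "MADKLHGW"
toy_residues = [
    {"name": "nucleophile", "position": 3, "allowed": {"D"}},
    {"name": "base", "position": 6, "allowed": {"H"}},
    {"name": "stabilizer", "position": 8, "allowed": {"W", "N"}},
]
print(check_template(toy_sequence, toy_residues))
```

An empty list means the template is consistent with the sequence it was defined on.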
By default, the search for homologs is performed on the NCBI nr database. If this results in fewer sequences than you are looking for, you can:
1. Turn on the Advanced options at the bottom of the Job input panel.
2. Select the EMBL-EBI MGnify database, which contains metagenomics data and potentially more results.
However, for our Example, we will not do this. Keep in mind that searching the EMBL-EBI MGnify database will take considerably longer to complete than searching the NCBI nr database.

C. Specifying substrates
Optionally, users can define the substrates of interest for predicting catalytic activity. Currently, EnzymeMiner allows a maximum of three substrates.
In the Activity prediction panel, type the SMILES code(s) in the dedicated field, separated by lines. You can easily find the SMILES of chemical substances in the PubChem database, for instance. You can also customize the substrate name by specifying it after the SMILES.
For our Example, this field is prefilled. We used the substrates 1,2-dibromoethane (DBE, a commonly used substrate for haloalkane dehalogenases) and 1,2,3-trichloropropane (TCP, a very toxic halogenated pollutant and a difficult substrate for most known haloalkane dehalogenases), and we named them DBE and TCP, respectively.
Note: By default, all the Selection Wizard strategies will maximize the efficiency (kcat/Km) for all the substrates specified here. To change that, you can use the Advanced settings (see below).
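To make the substrate input concrete, here is a minimal sketch that parses one substrate per line, with an optional name after the SMILES, and enforces the three-substrate limit. The whitespace separator between SMILES and name is an assumption about the input format, and parse_substrates is a hypothetical helper, not an EnzymeMiner function. The SMILES strings shown are those of 1,2-dibromoethane and 1,2,3-trichloropropane.

```python
MAX_SUBSTRATES = 3  # limit stated in this help page


def parse_substrates(text):
    """Parse lines of 'SMILES [name]' into a list of (smiles, name) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the optional name is separated from the SMILES by whitespace.
        parts = line.split(maxsplit=1)
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else smiles
        entries.append((smiles, name))
    if len(entries) > MAX_SUBSTRATES:
        raise ValueError(f"at most {MAX_SUBSTRATES} substrates are allowed")
    return entries


substrates = parse_substrates("BrCCBr DBE\nClCC(Cl)CCl TCP\n")
print(substrates)
```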

D. Submission
Normally, users may also specify a Job title and an Email address to receive a notification when the calculation finishes (recommended). These options are not relevant for the example job, as the results are precomputed.
At the bottom of the page, click on the Next button to get the Job summary page.
In the Job summary page:
1. Click on the Run job button.
2. In the Example job modal window, select the Yes, show the results option to continue directly to the precomputed results. Please note that if you change any parameter of the example job input or any advanced option, the Example job modal will not pop up, and the job will be calculated as a new job.

Results

To access the job results, click on the link in the notification email or enter the job ID in the upper-right corner of the EnzymeMiner website and click on the “Find job” button. The result page has four sections: (i) job information, (ii) download results, (iii) selection table, and (iv) sequence similarity network. In this section, we describe the first two panels.

Job output information
In the job information box, you can find the job ID, title, start time, and status of the job.

Download results
In the download results box, you can download the result table in Microsoft Excel XLSX format or tab-separated file TSV format. A ZIP archive containing all output files from the EnzymeMiner workflow can be downloaded by clicking on the Raw results button.

Target Selection Table: selection based on default properties

Once the calculation is finished, users can explore putative enzyme homolog sequences and select targets for experimental characterization using the default target selection table.

The table is organized into eleven sheets:

Selected – All the selected sequences. In this sheet, an additional column (Selection description) is included to track the reason for the selection. By default, it is prefilled with the name of the sheet from which the sequence was selected, or with the name of the selection strategy used in the Selection Wizard (see below). However, it can be edited by double-clicking on the cell.
Full Dataset – All identified sequences.
Extra domain – Sequences with extra domains. Extra domains are Pfam domains found in the sequence but not listed in the Primary domains select box.
Organism – Sequences with known source organism. The taxonomy of the source organism is retrieved from the NCBI Taxonomy database. For most sequences, the source organism is well-defined; however, some sequences are multispecies.
Temperature – Sequences from organisms having optimum temperature annotation in the NCBI BioProject database. Here, sequences from thermophilic or cryophilic organisms can be found.
Salinity – Sequences from organisms having salinity annotation in the NCBI BioProject database.
Biotic Relationship – Sequences from organisms having biotic relationship annotations in the NCBI BioProject database.
Disease – Sequences from organisms having disease annotation in the NCBI BioProject database.
Transmembrane – Sequences with transmembrane regions predicted by the TMHMM tool.
3D Structure – Sequences with available 3D structure in the Protein Data Bank.
Network – Sequences clustered into a node selected from sequence similarity network.

We recommend going through the target selection table sheet by sheet and selecting sequences from each sheet to have a diverse set of proteins for experimental characterization. For example, select sequences from Archaea, Bacteria, and Eukaryota, from thermophiles, cryophiles, and extremely halophilic organisms, from organisms with unusual biotic relationships, and from disease-related organisms.

There are six options to filter the identified sequences displayed in the target selection table:

Minimum solubility threshold – Move the slider to increase the minimum predicted solubility. Sequences with lower solubility will be hidden. We recommend setting the solubility threshold to 0.5 or more to increase the success rate of protein production.
Query identity range – Move the sliders to set the minimum and maximum global sequence identity to a query sequence. We recommend setting at least the maximum query identity to 90% to exclude very similar sequences.
First seen dates – Set the earliest and the latest date to show sequences that were first submitted to some protein database (first seen) after and/or before some date. Click on the Histogram button to see how many sequences were added to databases in different years. The histogram allows you to select the earliest and the latest date by clicking and dragging.
Exclude transmembrane proteins – Click on the switch to exclude sequences with predicted transmembrane regions from all sheets except the Transmembrane sheet. We recommend removing these sequences as they might be hard to produce in the lab. They also tend to have lower predicted solubility.
Exclude extra domain proteins – Click on the switch to exclude sequences with a predicted extra domain from all sheets except the Extra domain sheet. Extra domains are domains found in the sequence but not listed in the Primary domains select box. We recommend avoiding sequences with extra domains to stay safe. On the other hand, these sequences might show unusual activity.
Exclude Swiss-Prot proteins – Click on the switch to exclude sequences found in the Swiss-Prot database. As these proteins are well studied and annotated, they are unlikely to show novel properties.
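The same kind of filtering can be reproduced offline on a downloaded TSV file. The sketch below applies the recommended solubility threshold (0.5), the 90% maximum query identity, and the transmembrane exclusion to an in-memory sample. The column names follow the header descriptions in this help page, but their exact spelling in the exported file, the "yes"/"no" encoding of the Transmembrane column, and all accessions and values in the sample are assumptions.

```python
import csv
import io

# Hypothetical excerpt of a downloaded TSV; accessions and values are made up.
SAMPLE_TSV = (
    "Accession\tSolubility\tIdentity closest query\tTransmembrane\n"
    "SEQ_A\t0.72\t45.1\tno\n"
    "SEQ_B\t0.31\t60.0\tno\n"
    "SEQ_C\t0.80\t97.5\tno\n"
    "SEQ_D\t0.66\t52.3\tyes\n"
)


def filter_hits(rows, min_solubility=0.5, max_identity=90.0, exclude_tm=True):
    """Keep rows that pass the solubility, identity, and transmembrane filters."""
    kept = []
    for row in rows:
        if float(row["Solubility"]) < min_solubility:
            continue  # predicted solubility too low
        if float(row["Identity closest query"]) > max_identity:
            continue  # too similar to a query sequence
        if exclude_tm and row["Transmembrane"].strip().lower() == "yes":
            continue  # predicted transmembrane protein
        kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
kept = filter_hits(rows)
print([row["Accession"] for row in kept])
```

To use it on a real export, replace the in-memory sample with an opened file and adjust the column names to match the file header.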

Description of the column headers in the Target Selection Table:

Header	Source*	Description
Accession	Database	Unique accession code in the source database
Annotation	Database	Annotations and other information for the current protein
First seen	Database	Date when the protein was first identified and reported
Closest query	Analysis	The query protein that is closest in identity to the current sequence
Identity closest query	Analysis	Identity percentage between the closest query and the sequence
Kingdom	Database	Kingdom of the organism where the protein was identified
Solubility	Predicted	Solubility score predicted by SoluProt
Relative aggregation propensity**	Predicted	The base-2 logarithm of the ratio between the aggregation propensity of the sequence and that of the first template; the propensity is the number of aggregation-prone regions (APRs) per residue, as predicted by AggreProt; a value > 0 indicates higher aggregation propensity than the template, whereas a value < 0 indicates lower.
Source databases	Database	The database where the protein was found
Sequence length	Database	Total number of amino acids in the sequence
Optimum pH**	Predicted	Optimum pH, predicted by OphPred
Melting temperature**	Predicted	Melting temperature Tm, predicted by TmProt
Domain annotation	Database	The main domain identified in the protein
Extra domains	Database	Any extra domains, whenever present
Closest known	Analysis	The known protein that is closest in identity to the sequence, when "Other known sequences" were specified
Identity closest known	Analysis	Identity percentage between the sequence and the closest known protein, when "Other known sequences" were specified
Closest all	Analysis	The closest protein to the current sequence, among all those found in the current search
Identity closest all	Analysis	The identity percentage of the closest protein to the current sequence
Swiss-Prot	Database	Swiss-Prot/UniProtKB accession code(s), when available
Organism	Database	Organism(s) where the protein was identified
Salinity	Database	Salinity preference
Optimum temp.	Database	Optimum temperature, when available
Temp. range	Database	Temperature preference
Biotic relationship	Database	Any known interaction occurring between the source organism with others in the same ecosystem
Disease	Database	Relationship with any diseases, when known
Transmembrane	Database	Whether or not the sequence is part of a transmembrane protein
ER template	Database/analysis	Which input protein and template was used for the essential residue search
Essential residues (by name; one column per residue)	Database/analysis	The amino acid present in this sequence for the particular essential residue
Essential residues	Database/analysis	The complete list of essential residues present in the sequence as a list of amino acids
GI	Database	GenInfo Identifier, a unique integer ID assigned to each sequence record in NCBI databases
Structure	Database	The PDB ID code for the 3D structure, when available
kcat (one column per substrate)**	Predicted	kcat predicted by CataPro
kcat/Km (one column per substrate)**	Predicted	kcat/Km predicted by CataPro
Sequence	Database	The complete amino acid sequence of the current protein

* Source of the listed property: database, obtained from the database and its annotations; analysis, obtained from the current analysis; predicted, predicted by external tools.
** Newly available in EnzymeMiner 2.
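The relative aggregation propensity defined in the table can be written out as a short calculation. This is a sketch of the formula as described above (base-2 logarithm of the ratio of APRs per residue); the function name, the example counts, and the choice to raise an error when either sequence has no APRs are assumptions, since this page does not say how EnzymeMiner treats that case.

```python
import math


def relative_aggregation_propensity(n_aprs_seq, len_seq, n_aprs_template, len_template):
    """log2 of (APRs per residue in the sequence) / (APRs per residue in the template)."""
    if n_aprs_seq <= 0 or n_aprs_template <= 0:
        # Assumption: the logarithm is undefined here, so the sketch refuses to guess.
        raise ValueError("both sequences need at least one APR for the ratio to be defined")
    propensity_seq = n_aprs_seq / len_seq
    propensity_template = n_aprs_template / len_template
    return math.log2(propensity_seq / propensity_template)


# Made-up counts: twice as many APRs per residue as the template gives +1.
print(relative_aggregation_propensity(4, 300, 2, 300))
# Half as many APRs per residue as the template gives -1.
print(relative_aggregation_propensity(3, 300, 6, 300))
```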

Selection Wizard: selection of enzymes using smart strategies

The Selection Wizard is a new feature in EnzymeMiner 2 that enables users to smartly select proteins from the current search results. It helps balance multiple properties of interest, allowing users to prioritize candidates according to their specific goals.

To use it, click on the Selection Wizard button, located in the top-left corner of the Target Selection Table panel, and a pop-up window will open.

Number of targets
The target number corresponds to the maximum number of enzymes that the user wants to select and save for each strategy. Type a number in this field (e.g., 5) and press Enter.

Use predefined strategies
Six predefined strategies are available in the Selection Wizard, each listed as a separate button.
To use one of the predefined strategies:
1. Click on the respective button (e.g., Robust enzymes).
2. Press Submit.
3. You may modify the Selection description, which by default is the name of the predefined strategy. This description will be displayed later, in the Selected tab of the Target Selection Table, for the proteins selected in this panel.
After this, the Selected tab of the Target Selection Table will contain the top 5 enzymes prioritized according to the Robust enzymes strategy.
If you subsequently use the Selection Wizard, you can choose to append the new selection to the previous table or replace the existing selection with the new one.
For this Example, select the Stable enzymes strategy, also with 5 target sequences, and select Append to the current selection. The newly selected enzymes will be listed in the Target Selection Table, alongside the previously selected ones, and the table will now have 10 entries (or fewer, if some enzymes were present in both selections).
Description of the strategies:
1. Robust enzymes - Select enzymes with the highest predicted stability and solubility, with a length similar to the query and with the lowest predicted aggregation propensity. Excludes transmembrane proteins and proteins with extra domains.
2. Soluble enzymes - Select enzymes with the highest predicted solubility, with length not much larger than that of the query, and the lowest aggregation propensity. Excludes transmembrane proteins and proteins with extra domains.
3. Stable enzymes - Select enzymes with a melting temperature of at least 40°C, with the highest predicted optimal temperature and salinity. Will likely return thermophilic proteins. Excludes transmembrane proteins and proteins with extra domains.
4. Extremophiles enzymes - Select enzymes from organisms that are annotated as extremophiles, with the highest predicted stability, salinity, and solubility, with length similar to the query and the lowest aggregation propensity. Excludes transmembrane proteins and proteins with extra domains.
5. Close homologs - Select enzymes that are closest to the query in sequence identity, are from the same kingdom as the query, and have the highest predicted stability and solubility and the lowest aggregation propensity. Excludes transmembrane proteins and proteins with extra domains.
6. Distant homologs - Select enzymes that are the most distant to the query in sequence identity (with max. 80% identity), with the highest predicted stability and solubility, and the lowest aggregation propensity. Excludes transmembrane proteins. It will likely return proteins from different organism kingdoms.
Note: If one or more substrates are specified in the Activity Prediction input panel, all strategies will automatically prioritize enzymes by maximizing the catalytic efficiency (k_cat/K_m) for all selected substrates, alongside the other properties defined in the chosen strategy. To modify this behavior, use the Advanced settings (see below).

Types of rules in the Selection Wizard algorithm:

Filtering - Define the hard filters that are applied to different properties. Only enzymes within the thresholds or requirements will appear in the selection list.
Diversification - Define the properties that are diversified in the enzymes. The selected enzymes will be diversified based on the properties that you select here (spanning the entire range of values). For instance, if “Optimal temperature” is selected for diversification, the enzymes listed in the results will span the entire range of optimal temperatures. When different properties are selected for diversification, all of them are considered simultaneously for this diversification procedure. This means that if “Optimal temperature” and “Sequence length” are selected for diversification, the results will contain a combination of small, medium, and large proteins that can operate at low, medium, or high temperatures.
Prioritization - Define the properties that will be prioritized in the selected enzymes. The selected enzymes will have their properties maximized or minimized as defined by the user.
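The interplay of the three rule types can be illustrated with a toy selection routine. This is only a sketch of one possible way to combine them (a hard filter, equal-width bins over one diversified property, and the best-scoring candidate per bin); it is not the algorithm used by the Selection Wizard, and the function name, property keys, and candidate values are all hypothetical.

```python
def select_targets(candidates, n_targets, keep, diversify_key, priority_key):
    """Filter, spread picks over the range of one property, then take the best per bin."""
    # 1. Filtering: drop every candidate that fails the hard requirement.
    pool = [c for c in candidates if keep(c)]
    if not pool:
        return []
    # 2. Diversification: split the observed range into n_targets equal-width bins.
    low = min(c[diversify_key] for c in pool)
    high = max(c[diversify_key] for c in pool)
    width = (high - low) / n_targets
    if width == 0:
        width = 1.0  # all values identical: everything falls into bin 0
    bins = {}
    for c in pool:
        index = min(int((c[diversify_key] - low) / width), n_targets - 1)
        bins.setdefault(index, []).append(c)
    # 3. Prioritization: within each bin, keep the candidate with the highest score.
    return [max(members, key=lambda c: c[priority_key])
            for _, members in sorted(bins.items())]


# Made-up candidates: filter out transmembrane proteins, diversify by optimal
# temperature, and prioritize by solubility.
candidates = [
    {"id": "A", "tm": False, "opt_temp": 10, "solubility": 0.60},
    {"id": "B", "tm": False, "opt_temp": 12, "solubility": 0.90},
    {"id": "C", "tm": False, "opt_temp": 40, "solubility": 0.70},
    {"id": "D", "tm": True, "opt_temp": 80, "solubility": 0.95},
    {"id": "E", "tm": False, "opt_temp": 75, "solubility": 0.50},
    {"id": "F", "tm": False, "opt_temp": 70, "solubility": 0.80},
]
picks = select_targets(candidates, 3, lambda c: not c["tm"], "opt_temp", "solubility")
print([c["id"] for c in picks])
```

In this toy case the picks cover low, medium, and high optimal temperatures, while the transmembrane candidate is never considered despite its high solubility.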

Advanced settings to personalize selection strategies
The Advanced settings of the Selection Wizard allow users to define their own selection rules or modify predefined selection strategies.
Click the Selection Wizard button, and then select Show advanced settings.

The Selection Wizard window expands to show three new sections (items 1–3), together with two further options (items 4 and 5):
1. Filtering rules
2. Diversification rules
3. Prioritization rules
4. If any enzymes are listed in the Selected sheet, you will be able to choose to append the new selection or replace the current one.
5. You can also modify and personalize the Selection description, especially if any rules were modified.
As mentioned above, when several substrates are specified by the user, by default, all strategies prioritize the enzymes by maximizing the catalytic efficiency (k_cat/K_m) for all substrates.
If the user is interested, for instance, only in the activity towards one substrate (e.g., TCP in our Example) and not the other (e.g., DBE in our Example), this can be achieved by changing the Prioritization rules: turn off the prioritization of k_cat/K_m for DBE while continuing to maximize k_cat/K_m for TCP.
On the other hand, if the user is interested in maximizing the activity for one substrate (e.g., TCP in our Example) and minimizing it for the other (e.g., if the goal is to search for enzymes selective towards TCP), this can be achieved by switching the prioritization to minimize k_cat/K_m for DBE while continuing to maximize k_cat/K_m for TCP.
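The effect of switching a prioritization rule between maximize and minimize can be illustrated with a simple rank-sum ordering. This is a sketch under stated assumptions: the rank-sum scheme is a generic multi-criteria heuristic chosen for the example, not the method used by the Selection Wizard, and the enzyme identifiers and k_cat/K_m values are made up.

```python
def rank_candidates(candidates, rules):
    """Order candidate ids by the sum of their ranks over all prioritized properties.

    rules maps a property name to "max" or "min"; properties left out are ignored.
    """
    totals = {c["id"]: 0 for c in candidates}
    for prop, direction in rules.items():
        ordered = sorted(candidates, key=lambda c: c[prop], reverse=(direction == "max"))
        for rank, c in enumerate(ordered):
            totals[c["id"]] += rank  # rank 0 is the best for this property
    return sorted(totals, key=totals.get)


# Hypothetical predicted k_cat/K_m values for the two Example substrates.
enzymes = [
    {"id": "E1", "kcat_km_TCP": 5.0, "kcat_km_DBE": 100.0},
    {"id": "E2", "kcat_km_TCP": 4.5, "kcat_km_DBE": 2.0},
    {"id": "E3", "kcat_km_TCP": 1.0, "kcat_km_DBE": 50.0},
]

# Prioritization of DBE turned off: only TCP activity matters.
tcp_only = rank_candidates(enzymes, {"kcat_km_TCP": "max"})
# Selective towards TCP: maximize TCP activity and minimize DBE activity.
tcp_selective = rank_candidates(enzymes, {"kcat_km_TCP": "max", "kcat_km_DBE": "min"})
print(tcp_only, tcp_selective)
```

With these toy values, E1 leads when only TCP is prioritized, whereas E2 leads once low DBE activity is also rewarded.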

Sequence similarity network

The sequence similarity network (SSN) visualizes the sequence space of all identified sequences. Clusters of similar sequences can be easily identified, as well as sequence outliers.

As there might be thousands of sequences, the sequences are clustered at a given identity threshold, and only the SSN of the representative sequences is shown for performance reasons. Sequences whose identity exceeds the threshold are consolidated into a single metanode. Edges indicate sequence identity between representative sequences of the connected metanodes. To see which sequences are represented by a metanode, hover the cursor over a metanode.

The SSN can be downloaded as a Cytoscape session file for further analysis and visualization. You can select from networks clustered at different identities. The number of nodes and edges is indicated for each identity threshold.

The SSN is interactively linked to the Selection table. All sequences selected in the Selection table are automatically highlighted in the SSN. This helps you to track how your selection covers the whole sequence space. Click on a node to fill the Network tab with sequences that are clustered into the node.

Advanced options

Homology search

Databases - Which database to perform the search on. The NCBI nr database is a large, non-redundant, comprehensive collection of protein sequences curated by the National Center for Biotechnology Information, merging identical or highly similar sequences from major sources (like GenBank, PDB, SwissProt, PIR, and PRF). EMBL-EBI MGnify is a large, comprehensive database with microbiome sequence data, handling everything from raw sequence submissions to public datasets, and optimized for metagenomic, metatranscriptomic, amplicon, and assembly analyses.
E-value - The number of hits one can "expect" to see by chance when searching a database of a particular size. See BLAST FAQ for more details.
Inclusion E-value threshold - The statistical significance threshold to include a sequence in the model used by PSI-BLAST to create the PSSM on the next iteration.
Number of iterations - Number of PSI-BLAST search iterations.
Maximum number of hits - Limit for the number of PSI-BLAST hits. All hits are sorted by E-value, and only the best are used for the EnzymeMiner analysis.
Artificial tags - All sequences having these tags in their description are excluded.

Filtration

Minimum identity - Minimum sequence identity threshold. All hits must have greater sequence identity than the threshold to at least one of the query sequences. Identity is computed using global alignment (Needleman-Wunsch).
Clustal iterations - Number of multiple sequence alignment construction iterations using Clustal Omega.
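To show what a global identity from a Needleman-Wunsch alignment means, here is a compact sketch. The scoring scheme (match +1, mismatch -1, linear gap -2) and the definition of identity as matches divided by the number of alignment columns are assumptions for this example; production tools typically use substitution matrices and affine gap penalties, and this page does not state which settings EnzymeMiner uses.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity over the columns of one optimal Needleman-Wunsch alignment."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, counting matches and columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                matches += a[i - 1] == b[j - 1]
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns if columns else 0.0


print(global_identity("ACGT", "ACGT"))
print(global_identity("AAAA", "AATA"))
print(global_identity("ACGT", "AGT"))
```

A hit would pass the Minimum identity filter if this value, computed against at least one query sequence, exceeds the chosen threshold.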

Visualization

Score threshold - Minimum bit score to include an edge in the network graph. The bit score is calculated by MMseqs2 (BLAST-like local alignment) for a pair of representative sequences.
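The effect of the score threshold on the network can be sketched as follows. The example builds an adjacency map from pairwise bit scores and keeps only edges at or above the threshold; nodes left without edges appear as outliers. The node names and scores are made up, and build_network is a hypothetical helper rather than part of the EnzymeMiner workflow.

```python
def build_network(nodes, pair_scores, score_threshold):
    """Return an adjacency map containing only edges with score >= score_threshold."""
    adjacency = {node: set() for node in nodes}
    for (a, b), score in pair_scores.items():
        if score >= score_threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical representative sequences and pairwise bit scores.
nodes = ["rep1", "rep2", "rep3", "rep4", "rep5"]
pair_scores = {
    ("rep1", "rep2"): 250.0,
    ("rep2", "rep3"): 80.0,
    ("rep3", "rep4"): 120.0,
}
network = build_network(nodes, pair_scores, score_threshold=100.0)
outliers = sorted(node for node, neighbours in network.items() if not neighbours)
print(network)
print(outliers)
```

Raising the threshold removes weaker edges and splits the network into more, tighter clusters; lowering it connects more distant representatives.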