Benchstab client

Client

class benchstab.client.BenchStab(input_file: str | DataFrame, outfolder: str | None = None, predictor_config: Dict[str, Any] | None = None, include: List[str] | None = None, exclude: List[str] | None = None, allow_struct_predictors: bool = True, allow_sequence_predictors: bool = True, verbosity: int = 0, permissive: bool = False, *args, **kwargs)[source]

Bases: object

The BenchStab class is responsible for managing the predictors. It selects the predictors based on the input data and runs them. The results are returned as a pandas DataFrame. If an outfolder is specified, the results will be saved there as a CSV file.

sequence_predictors = [IMutant2, IMutant3, INPS, iStable, DDGun, Mupro, SAAFEC, PONSol2, Prostata]
pdbid_predictors = [IMutant2, IMutant3, CUPSAT, AutoMute, INPS, DUET, iStable, DDGun, SDM, Maestro, PremPS, SRide, PoPMuSiC, DDMut, Dynamut2]
pdbfile_predictors = [IMutant3, DUET, DDGun, mCSM, Mupro, SDM, PremPS, SRide, DDMut, Dynamut2]
filter_predictors()[source]

Filter the predictors based on the include and exclude lists. If the include list is specified, only the predictors in the include list will be selected. If the exclude list is specified, the predictors in the exclude list will be removed from the list of predictors. If both lists are specified, the exclude list will be ignored.
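The precedence rule can be sketched as a standalone function (a simplified stand-in; the real method operates on predictor classes rather than name strings):

```python
def filter_predictors(predictors, include=None, exclude=None):
    """Select predictors by name; `include` takes precedence over `exclude`."""
    if include:
        wanted = {name.lower() for name in include}
        return [p for p in predictors if p.lower() in wanted]
    if exclude:
        unwanted = {name.lower() for name in exclude}
        return [p for p in predictors if p.lower() not in unwanted]
    return list(predictors)

available = ["IMutant2", "DDGun", "Mupro", "PONSol2"]
print(filter_predictors(available, include=["ddgun"]))   # only DDGun
print(filter_predictors(available, exclude=["Mupro"]))   # all but Mupro
# When both are given, exclude is ignored:
print(filter_predictors(available, include=["DDGun"], exclude=["DDGun"]))
```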

__map(row)

Map the predictors to the input data. If the input data is a Fasta object, only sequence predictors will be selected. If the input data is a Pdb object, only structure predictors will be selected. If the input data is a PdbFile object, both sequence and structure predictors will be selected.

Parameters:

row (pandas.Series) – A row of the input data.

Returns:

A list of selected predictors.

Return type:

list

save_results(results: DataFrame)[source]

Save the results to a CSV file in the outfolder.

Parameters:

results (pandas.DataFrame) – The results to be saved.

gather_results()[source]

Gather the results from the predictors.

async __periodically_gather_results()

Gather the results from the predictors. The results are concatenated and saved to a CSV file in the outfolder.

async __run()

Run the predictors asynchronously and gather the results. The results are returned as a pandas DataFrame.

Returns:

The results as a pandas DataFrame.

Return type:

pandas.DataFrame
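The fan-out performed by __run follows the standard asyncio pattern; a minimal, self-contained sketch with stand-in coroutines in place of the real predictors:

```python
import asyncio

async def fake_predictor(name, delay):
    # Stand-in for a predictor's compute(): await the service, return rows.
    await asyncio.sleep(delay)
    return [{"predictor": name, "ddg": 0.0}]

async def run_all(predictors):
    # Run every predictor concurrently and concatenate their row lists.
    results = await asyncio.gather(*(fake_predictor(n, d) for n, d in predictors))
    return [row for rows in results for row in rows]

rows = asyncio.run(run_all([("DDGun", 0.01), ("Mupro", 0.0)]))
print(rows)
```

In the library the concatenated rows end up in a pandas DataFrame; the plain list of dictionaries above keeps the sketch dependency-free.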

Preprocessor

class benchstab.preprocessor.PreprocessorRow(identifier: benchstab.utils.structure.PDB | benchstab.utils.structure.Fasta = None, mutation: str = None, chain: str = None, fasta: benchstab.utils.structure.Fasta = None, fasta_mutation: str = None, ph: float = 7.0, temperature: float = 25.0)[source]

Bases: object

identifier: PDB | Fasta = None
mutation: str = None
chain: str = None
fasta: Fasta = None
fasta_mutation: str = None
ph: float = 7.0
temperature: float = 25.0
to_dict()[source]

Converts the PreprocessorRow object to a dictionary.

Returns:

Dictionary containing the PreprocessorRow object

Return type:

Dict[str, Any]

is_valid() bool[source]

Check if the PreprocessorRow object is valid.

Returns:

True if the PreprocessorRow object is valid, False otherwise

Return type:

bool
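A simplified stand-in for the row dataclass shows how to_dict and is_valid fit together (the real class wraps PDB/Fasta objects rather than plain strings):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Row:
    identifier: Optional[str] = None
    mutation: Optional[str] = None
    chain: Optional[str] = None
    ph: float = 7.0
    temperature: float = 25.0

    def to_dict(self):
        return asdict(self)

    def is_valid(self):
        # A row needs at least an identifier and a mutation to be usable.
        return self.identifier is not None and self.mutation is not None

row = Row(identifier="1CRN", mutation="A2G", chain="A")
print(row.to_dict())
print(row.is_valid())  # True
```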

class benchstab.preprocessor.Preprocessor(input: str | list | TextIO, outfolder: str | None = None, permissive: bool = True, verbosity: int = 0, skip_header: bool = False, *args, **kwargs)[source]

Bases: object

The Preprocessor class is used to parse the input file and create a dataset that can be used by the predictors.

The input file can be in the following formats:
  • PDB identifier, mutation and chain

  • Fasta identifier, mutation, pH and temperature

  • PDB identifier, mutation, pH and temperature

  • Fasta identifier, mutation and chain

The class can also generate a summary of the dataset.

The summary includes the following information:
  • Number of mutations

  • Number of proteins

  • Average number of mutations per protein

  • Number of mutations with positive charge

  • Number of mutations with negative charge

  • Number of mutations with no charge

  • Number of mutations with acidic chemical properties

  • Number of mutations with basic chemical properties

  • Number of mutations with aromatic chemical properties

  • Number of mutations with aliphatic chemical properties

  • Number of mutations with hydroxyl chemical properties

  • Number of mutations with sulfur chemical properties

  • Number of mutations with amide chemical properties

  • Number of mutations with non-polar chemical properties

  • Number of mutations with polar chemical properties

Parameters:
  • input (Union[str, list, TextIO]) – Input file containing the protein identifier, mutation and chain

  • outfolder (str) – Folder where the preprocessed input will be saved

  • permissive (bool) – If True, the preprocessing script will continue if it encounters an error

  • verbosity (int) – Verbosity level

  • skip_header (bool) – If True, the header in the input file will be skipped
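A rough sketch of the line splitting with pH/temperature defaults; the heuristics below (a non-numeric third field is a chain, numeric trailing fields are pH and temperature) are illustrative assumptions, not the library's actual rules:

```python
def parse_input_line(line, sep=None):
    """Split one input line into identifier/mutation plus optional fields."""
    fields = line.strip().split(sep)
    if len(fields) < 2:
        raise ValueError(f"expected at least identifier and mutation: {line!r}")
    record = {"identifier": fields[0], "mutation": fields[1],
              "chain": None, "ph": 7.0, "temperature": 25.0}
    rest = fields[2:]
    # Assumption: a non-numeric third field is a chain identifier.
    if rest and not rest[0].replace(".", "", 1).isdigit():
        record["chain"] = rest.pop(0)
    if rest:
        record["ph"] = float(rest.pop(0))
    if rest:
        record["temperature"] = float(rest.pop(0))
    return record

print(parse_input_line("1CRN A2G A"))
print(parse_input_line("P12345 A2G 6.5 37"))
```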

logger = <Logger benchstab.preprocessor (INFO)>
classmethod print_summary(summary, logger: Logger | None = None)[source]

Print the summary generated by create_summary to stdout using the provided logger or the default logger.

Parameters:
  • summary (Dict[str, str]) – Summary to be printed

  • logger (logging.Logger) – Logger to be used for printing the summary

Returns:

None

Return type:

None

classmethod create_summary(data: PredictorDataset, verbose: bool = True, outfolder: str | None = None, logger: Logger | None = None) Dict[str, str][source]

Create a summary of the dataset.

The summary includes the following information:
  • Number of mutations

  • Number of proteins

  • Average number of mutations per protein

  • Number of mutations with positive charge

  • Number of mutations with negative charge

  • Number of mutations with no charge

  • Number of mutations with acidic chemical properties

  • Number of mutations with basic chemical properties

  • Number of mutations with aromatic chemical properties

  • Number of mutations with aliphatic chemical properties

  • Number of mutations with hydroxyl chemical properties

  • Number of mutations with sulfur chemical properties

  • Number of mutations with amide chemical properties

  • Number of mutations with non-polar chemical properties

  • Number of mutations with polar chemical properties

Parameters:
  • data (PredictorDataset) – Dataset to be summarized

  • verbose (bool) – If True, the summary will be printed to stdout

  • outfolder (str) – If provided, the summary will be saved to a file in the provided folder

  • logger (logging.Logger) – Logger to be used for printing the summary

Returns:

Dictionary with the summary

Return type:

Dict[str, str]
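The charge counts can be derived from a standard residue-property table; a minimal sketch, assuming the summary classifies by the mutant residue (the real implementation may classify differently and covers more properties):

```python
from collections import Counter

# Charge of the mutant residue (illustrative subset of the property table).
CHARGE = {"D": "negative", "E": "negative",
          "K": "positive", "R": "positive", "H": "positive"}

def summarize(rows):
    proteins = {r["identifier"] for r in rows}
    charges = Counter(CHARGE.get(r["mutation"][-1], "no") for r in rows)
    return {
        "Number of mutations": len(rows),
        "Number of proteins": len(proteins),
        "Average mutations per protein": len(rows) / max(len(proteins), 1),
        **{f"Mutations with {c} charge": n for c, n in charges.items()},
    }

rows = [{"identifier": "1CRN", "mutation": "A2K"},
        {"identifier": "1CRN", "mutation": "A2G"},
        {"identifier": "2LZM", "mutation": "T26E"}]
print(summarize(rows))
```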

parse_fasta(data: str) PreprocessorRow[source]

Parse the line containing the fasta identifier and the mutation.

If the fasta identifier is valid, return:
  • Fasta object.

  • mutation in the format WT_RESIDUE + POSITION + MUT_RESIDUE.

  • pH (default: 7.0 if not supplied).

  • temperature (default: 25.0 if not supplied).

Parameters:

data (str) – Line containing the fasta identifier and the mutation

Returns:

Dictionary containing the fasta object, mutation, pH and temperature

Return type:

PreprocessorRow

extract_fasta_from_pdb(identifier: PDB, chain: str, source: str) Fasta[source]

Extract fasta record from PDB file.

Parameters:
  • identifier (PDB) – PDB identifier

  • chain (str) – Chain identifier

  • source (str) – Source of the PDB file. Accepted values are: file, rcsb, uniprot

Returns:

Fasta record

Return type:

Fasta

parse_fasta_mutation(mutation: str, fasta: Fasta, permissive: bool = True) str[source]

Wraps the __parse_fasta_mutation function in a try/except block. If the mutation string is not valid, the function raises a PreprocessorError with the permissive flag set to True, indicating that the error is not critical and that the preprocessing script can continue.

Parameters:
  • mutation (str) – Mutation string to be parsed

  • fasta (Fasta) – Fasta record

Returns:

Mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE

Return type:

str

parse_struct(data: List[str]) PreprocessorRow | None[source]

Parse the line containing the protein identifier, the mutation and the chain.

If the protein identifier is valid, return:
  • PDB object

  • mutation in the format WT_RESIDUE + POSITION + MUT_RESIDUE

  • chain

  • fasta object

  • pH (default: 7.0 if not supplied)

  • temperature (default: 25.0 if not supplied)

Parameters:

data (List[str]) – Line containing the protein identifier, the mutation and the chain

Returns:

Dictionary containing the PDB object, mutation, chain, fasta object, pH and temperature

Return type:

PreprocessorRow

parse_mutation(mutation: str) str[source]

Parse the mutation string and check if it is valid.

Parameters:

mutation (str) – Mutation string to be parsed

Returns:

Mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE

Return type:

str
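Syntactic validation of the WT_RESIDUE + POSITION + MUT_RESIDUE format (e.g. A123G) can be sketched with a regular expression:

```python
import re

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter amino acid codes
MUTATION_RE = re.compile(rf"^([{AA}])(\d+)([{AA}])$")

def parse_mutation(mutation):
    """Validate e.g. 'A123G' and return it normalised to upper case."""
    match = MUTATION_RE.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"invalid mutation string: {mutation!r}")
    wt, pos, mut = match.groups()
    if wt == mut:
        raise ValueError("wild-type and mutant residue are identical")
    return f"{wt}{int(pos)}{mut}"

print(parse_mutation("a123g"))  # A123G
```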

__parse_fasta_mutation(mutation: str, fasta: Fasta, permissive: bool = False) str

Parse the mutation string and check if it is valid. If the mutation is valid, return the mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE.

As this function also handles the parsing of the mutation string for the fasta record extracted from PDBs, it is possible that the mutation string is not valid. In this case, the function will raise a PreprocessorError with the permissive flag set to True. This flag indicates that the error is not critical and that the preprocessing script can continue.

Parameters:
  • mutation (str) – Mutation string to be parsed

  • fasta (Fasta) – Fasta record

  • permissive (bool) – If True, the function will raise a PreprocessorError with the permissive flag set to True

Returns:

Mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE

Return type:

str
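The sequence check and the permissive-error behaviour described above can be sketched with a small exception class (names and signatures are illustrative stand-ins for the real ones):

```python
class PreprocessorError(Exception):
    def __init__(self, message, permissive=False):
        super().__init__(message)
        self.permissive = permissive  # True: non-critical, processing may continue

def check_fasta_mutation(mutation, sequence, permissive=False):
    """Verify the wild-type residue matches the sequence (1-based position)."""
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    if pos < 1 or pos > len(sequence) or sequence[pos - 1] != wt:
        raise PreprocessorError(
            f"{mutation}: residue {pos} of the sequence is not {wt}",
            permissive=permissive,
        )
    return mutation

print(check_fasta_mutation("A2G", "MAGT"))  # sequence[1] == 'A', so valid
```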

parse_line(line: str, sep: str | None = None) PreprocessorRow | None[source]

Parse the line containing the protein identifier and mutation (chain).

If the protein identifier is valid, return:
  • PDB object

  • mutation in the format WT_RESIDUE + POSITION + MUT_RESIDUE

  • chain

  • fasta object

  • pH (default: 7.0 if not supplied)

  • temperature (default: 25.0 if not supplied)

Parameters:
  • line (str) – Line containing the protein identifier, the mutation and the chain

  • sep (str) – Column separator

Returns:

Dictionary containing the PDB/Fasta object, mutation, chain, fasta object, pH and temperature

Return type:

Union[PreprocessorRow, None]

__exception_wrapper(func: callable, *args, **kwargs)

Wrap the function call in a try/except block. If the function raises a PreprocessorError or FileNotFoundError, the function will return None and the error will be logged. If the function raises any other exception, the exception will be raised.

Parameters:

func (callable) – Function to be wrapped

Returns:

Function result or None

Return type:

Union[None, Any]
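The wrapper pattern — catch the expected, recoverable errors, log them, return None, and let anything else propagate — looks like this in outline (ValueError stands in for PreprocessorError):

```python
import logging

logger = logging.getLogger(__name__)

def exception_wrapper(func, *args, **kwargs):
    """Return func(*args), or None if it raises an expected, recoverable error."""
    try:
        return func(*args, **kwargs)
    except (ValueError, FileNotFoundError) as exc:
        logger.error("skipping row: %s", exc)
        return None
    # Any other exception propagates unchanged.

print(exception_wrapper(int, "42"))    # 42
print(exception_wrapper(int, "oops"))  # None (ValueError was logged)
```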

parse() PredictorDataset[source]

Initiates the mutation file parsing process.

Base Predictor

class benchstab.predictors.base.PredictorFlags(webkit: bool = False, group_mutations: bool = False, group_mutations_by: list[str] = <factory>, mutation_delimiter: str = ', ')[source]

Bases: object

Class for storing predictor flags. The flags are used to control the behaviour of the predictor.

webkit: bool = False
group_mutations: bool = False
group_mutations_by: list[str]
mutation_delimiter: str = ','
class benchstab.predictors.base.BaseCredentials(username: str = '', password: str = '', email: str = '', url: str = '')[source]

Bases: object

Base class for predictor credentials, used to authenticate the user. The credentials are stored in a dictionary and sent as a POST request to the url specified in the credentials. The credentials class variable should be overridden by the child class.

username: str = ''
password: str = ''
email: str = ''
url: str = ''
get_payload(**kwargs)[source]

Create a dictionary of parameters to be sent as a POST request. This function should be implemented by the child class.

Parameters:

kwargs (dict) – extra parameters

Returns:

dictionary of parameters

Return type:

dict
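A child class typically maps the stored fields onto the form field names a particular service expects; an illustrative sketch with invented field names and endpoint (not any real predictor's API):

```python
from dataclasses import dataclass

@dataclass
class Credentials:
    username: str = ""
    password: str = ""
    url: str = ""

    def get_payload(self, **kwargs):
        raise NotImplementedError

@dataclass
class ExampleCredentials(Credentials):
    url: str = "https://example.org/login"  # hypothetical endpoint

    def get_payload(self, **kwargs):
        # Map stored fields onto the form field names the service expects.
        return {"user": self.username, "pass": self.password, **kwargs}

creds = ExampleCredentials(username="alice", password="s3cret")
print(creds.get_payload(remember="1"))
```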

class benchstab.predictors.base.PredictorHeader(name: str = '', input_type: str = '', classname: str = '', mutation_column: str = 'mutation')[source]

Bases: object

Class for storing predictor headers. The predictor headers are used to identify the predictor.

name: str = ''
input_type: str = ''
classname: str = ''
mutation_column: str = 'mutation'
class benchstab.predictors.base.BasePredictor(data: PredictorDataset, flags: PredictorFlags | None = None, outfolder: str | None = None, username: str = '', email: str = 'generic@email.com', password: str = '', wait_interval: int = 60, batch_size: int = -1, verbosity: int = 0, *args, **kwargs)[source]

Bases: object

Base class for predictors. The class is responsible for the following:
  1. Sending the query to the predictor.

  2. Retrieving the results from the predictor.

  3. Aggregating the results.

  4. Returning the results as a PredictorDataset.

url = ''
aggr_columns = {'chain', 'fasta_mutation', 'mutation'}
credentials

alias of BaseCredentials

async classmethod is_available_async(url: str) str[source]

Check if the predictor is available. This is done by sending a GET request to the specified url. If the request is successful, the predictor is available.

Parameters:

url (str) – url of the predictor

Returns:

status of the predictor. ‘Available’ if the predictor is available, ‘Offline’ otherwise

Return type:

str

classmethod is_available(url: str) str[source]

Check if the predictor is available. This is done by sending a GET request to the specified url. If the request is successful, the predictor is available.

Parameters:

url (str) – url of the predictor

Returns:

status of the predictor. ‘Available’ if the predictor is available, ‘Offline’ otherwise

Return type:

str

classmethod header()[source]

Return the header of the predictor. The header is used to identify the predictor.

Returns:

predictor header

Return type:

PredictorHeader

async classmethod async_default_callback(index: int, response: ClientResponse, session: ClientSession)[source]

Default callback function for the GET request. It checks if the request was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the GET request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the request was successful, False otherwise

Return type:

bool

process_result(index: int, mutation: str, chain: str, result: Any) None[source]
format_mutation(data: str | Dict | DatasetRow) str[source]

Format the mutation to the format required by the predictor. This function should be implemented by the child class.

Parameters:

data (Union[str, Dict, DatasetRow]) – mutation

Returns:

formatted mutation

Return type:

str

prepare_mutation(row: DatasetRow) str[source]

Prepare the mutation to be sent to the predictor.

This function performs the following steps:
  1. Convert the mutation to the format required by the predictor.

  2. Group the mutations if needed.

Parameters:

row (DatasetRow) – row of the dataset

Returns:

mutation

Return type:

str

async send_query(session: ClientSession, index: int, *args, **kwargs) bool[source]

Send the query to the predictor. This function should be implemented by the child class.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the query was sent successfully, False otherwise

Return type:

bool

async retrieve_result(session: ClientSession, index: int) bool[source]

Retrieve the results of the prediction. This function should be implemented by the child class.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the prediction was successful, False otherwise

Return type:

bool

__prepare_payload(row: DatasetRow) Dict

Wrapper around the prepare_payload function. It catches any exceptions and updates the status of the row accordingly.

Parameters:

row (DatasetRow) – row of the dataset

Returns:

payload

Return type:

dict

prepare_payload(row: DatasetRow) Dict[source]

Prepare the payload to be sent to the predictor. This function should be implemented by the child class.

Parameters:

row (DatasetRow) – row of the dataset

Returns:

payload

Return type:

dict

get_results() PredictorDataset[source]

Get the results of the prediction.

Returns:

prediction results

Return type:

PredictorDataset

_aggregate(data) List[Dict[Any, Any]][source]

Helper function aggregating the data into a list of dictionaries.

Parameters:

data (PredictorDataset) – data to be aggregated

Returns:

aggregated data

Return type:

list[DatasetRow]

setup() None[source]

Set up the dataset. This includes grouping mutations, creating the payload, etc.

async __exception_wrapper(func: Callable, index: int | None = None, *args, **kwargs) bool

Wrapper around the async functions. It catches any exceptions and updates the status of the row accordingly. If the exception is an HTMLParserError with permissive=True, it returns False; otherwise it returns True.

Parameters:
  • func (Callable) – function to be executed

  • index (int) – index of the row

Returns:

True if the function was executed successfully, False otherwise

Return type:

bool

async login(session: ClientSession, index: int, login_extra: Dict[str, Any] | None = None) bool[source]

Login to the predictor. This function should be implemented by the child class.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

  • login_extra (dict) – extra parameters for the login function

Returns:

True if the login was successful, False otherwise

Return type:

bool

async _queue_prediction(queue)[source]

Create a queue of tasks to be executed in parallel. The queue is created from the indices of the dataset. The queue is filled until it reaches the batch_size. If the queue is full, the function waits for the queue to be emptied. If the dataset is exhausted, the function returns.

Parameters:

queue (asyncio.Queue) – queue of tasks

async compute()[source]

The main function of the predictor.

It is responsible for the following:
  1. Check if the predictor is available (if not, return immediately).

  2. Set up the dataset (group mutations, etc.)

  3. Create a queue of tasks to be executed in parallel.

  4. Create a queue of workers to execute the tasks.

  5. Wait for all tasks to be completed (join the queue).

  6. Return the results as a PredictorDataset.

Returns:

prediction results

Return type:

PredictorDataset
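Steps 3–5 correspond to the standard asyncio producer/worker pattern; a self-contained sketch with stand-in work in place of the login/query/retrieve cycle:

```python
import asyncio

async def producer(queue, items):
    for item in items:
        await queue.put(item)  # blocks when the queue (batch) is full

async def worker(queue, results):
    while True:
        index = await queue.get()
        # Stand-in for login / send_query / retrieve_result on row `index`.
        results.append(index * 10)
        queue.task_done()

async def compute(items, batch_size=2, n_workers=2):
    queue = asyncio.Queue(maxsize=batch_size)
    results = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(n_workers)]
    await producer(queue, items)
    await queue.join()  # wait until every queued task is marked done
    for w in workers:
        w.cancel()
    return results

print(asyncio.run(compute([0, 1, 2, 3])))
```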

async _run_prediction(queue)[source]

Run the prediction. This function is executed in parallel by the workers.

It takes an index from the queue and executes the following steps:
  1. Login to the predictor.

  2. Send the query to the predictor.

  3. Retrieve the results from the predictor.

Parameters:

queue (asyncio.Queue) – queue of tasks

make_form(payload)[source]

Create a multipart form from a dictionary of parameters. Since the current version (==3.8.5) of aiohttp does not support assigning a custom boundary to the FormData object directly, we need to create a custom MultipartWriter and assign it to the FormData object.

Parameters:

payload (dict) – dictionary of parameters

Returns:

multipart form

Return type:

aiohttp.FormData
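The workaround itself is aiohttp-specific, but the wire format a custom boundary produces can be illustrated with the stdlib alone (a sketch of multipart/form-data framing, not the library's make_form):

```python
def multipart_body(payload, boundary="----benchstab-boundary"):
    """Build a multipart/form-data body with an explicit boundary."""
    lines = []
    for name, value in payload.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", str(value)]
    lines += [f"--{boundary}--", ""]  # closing boundary
    body = "\r\n".join(lines)
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type

body, ctype = multipart_body({"pdb_id": "1CRN", "mutation": "A2G"})
print(ctype)
```

Some services reject requests whose boundary does not match an expected pattern, which is why controlling it explicitly matters here.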

async __get(session: ClientSession, callback: Callable, index: int, *args, **kwargs) bool

Wrapper around the aiohttp GET request. It catches any exceptions and updates the status of the row accordingly.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

Return type:

bool

async get(session: ClientSession, dataset: DatasetRow | Dict, callback: Callable, index: int | None = None, *args, **kwargs) bool[source]

Send a GET request to the predictor. The request is sent to the url specified in the dataset. The response is handled by the callback function. The callback function should return True if the request was successful, False otherwise. If the callback function is not specified, the default callback function is used.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • dataset (Union[DatasetRow, Dict]) – dataset

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

Return type:

bool

async __post(session: ClientSession, callback: Callable, index: int, *args, **kwargs) bool

Wrapper around the aiohttp POST request. It catches any exceptions and updates the status of the row accordingly.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

Return type:

bool

async post(session: ClientSession, dataset: DatasetRow | Dict, callback: Callable, index: int | None = None, *args, **kwargs) bool[source]

Send a POST request to the predictor. The request is sent to the url specified in the dataset. The response is handled by the callback function. The callback function should return True if the request was successful, False otherwise. If the callback function is not specified, the default callback function is used.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • dataset (Union[DatasetRow, Dict]) – dataset

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

class benchstab.predictors.base.BasePostPredictor(data: PredictorDataset, flags: PredictorFlags | None = None, outfolder: str | None = None, username: str = '', email: str = 'generic@email.com', password: str = '', wait_interval: int = 60, batch_size: int = -1, verbosity: int = 0, *args, **kwargs)[source]

Bases: BasePredictor

Base class for predictors that require a POST request. The POST request is sent to the url specified in the dataset. The response is handled by the default_post_handler.

async send_query(session: ClientSession, index: int, *args, **kwargs) bool[source]

Send the query to the predictor. If the predictor is a form-data predictor, the query is sent as a multipart form. Otherwise, it is sent as a JSON object.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the query was sent successfully, False otherwise

Return type:

bool

async default_post_handler(index: int, response: ClientResponse, session: ClientSession)[source]

Default callback function for the POST request. It checks if the request was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the POST request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the request was successful, False otherwise

Return type:

bool

class benchstab.predictors.base.BaseAuthentication(data: PredictorDataset, flags: PredictorFlags | None = None, outfolder: str | None = None, username: str = '', email: str = 'generic@email.com', password: str = '', wait_interval: int = 60, batch_size: int = -1, verbosity: int = 0, *args, **kwargs)[source]

Bases: BasePostPredictor

Base class for predictors that require authentication. The authentication is done by sending a POST request to the url specified in the credentials. The response is handled by the login_handler.

async login(session: ClientSession, index: int, login_extra: Dict[str, Any] | None = None) bool[source]

Login to the predictor. The login is done by sending a POST request to the url specified in the credentials. The response is handled by the login_handler function. The login_handler function should return True if the login was successful, False otherwise. The function uses the credentials specified in the credentials class variable.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

  • login_extra (dict) – extra parameters for the login function

Returns:

True if the login was successful, False otherwise

Return type:

bool

async login_handler(index: int, response: ClientResponse, session: ClientSession) bool[source]

Default callback function for the login request. It checks if the login was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the login request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the login was successful, False otherwise

Return type:

bool

class benchstab.predictors.base.BaseGetPredictor(max_retries: int = 100, *args, **kwargs)[source]

Bases: BasePostPredictor

Base class for predictors that require a GET request. The GET request is sent to the url specified in the dataset. The response is handled by the default_get_handler.

async retrieve_result(session: ClientSession, index: int) bool[source]

Retrieve the results of the prediction. The results are retrieved by sending a GET request to the url specified in the dataset. If the datapoint is already processed, the function returns True; otherwise it returns the result of the default_get_handler function.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the request was successful, False otherwise

Return type:

bool

async default_get_handler(index: int, response: ClientResponse, session: ClientSession) bool[source]

Default callback function for the GET request. It checks if the request was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the GET request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the request was successful, False otherwise

Return type:

bool