Benchstab client

Client

class benchstab.client.BenchStab(input_file: str | DataFrame, outfolder: str | None = None, predictor_config: Dict[str, Any] | None = None, include: List[str] | None = None, exclude: List[str] | None = None, allow_struct_predictors: bool = True, allow_sequence_predictors: bool = True, verbosity: int = 0, permissive: bool = False, *args, **kwargs)[source]

Bases: object

The BenchStab class is responsible for managing the predictors. It selects the predictors based on the input data and runs them. The results are returned as a pandas DataFrame. If an outfolder is specified, the results will be saved there as a CSV file.

sequence_predictors = [IMutant2, IMutant3, INPS, iStable, DDGun, Mupro, SAAFEC, PONSol2, Prostata]
pdbid_predictors = [IMutant2, IMutant3, CUPSAT, AutoMute, INPS, DUET, iStable, DDGun, SDM, Maestro, PremPS, SRide, PoPMuSiC, DDMut, Dynamut2]
pdbfile_predictors = [IMutant3, DUET, DDGun, mCSM, Mupro, SDM, PremPS, SRide, DDMut, Dynamut2]
filter_predictors()[source]

Filter the predictors based on the include and exclude lists. If the include list is specified, only the predictors in the include list will be selected. If the exclude list is specified, the predictors in the exclude list will be removed from the list of predictors. If both lists are specified, the exclude list will be ignored.
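The precedence rule can be sketched as a standalone function (a simplified stand-in; the real method operates on predictor classes rather than name strings):

```python
def filter_predictors(predictors, include=None, exclude=None):
    """Select predictors by name; `include` takes precedence over `exclude`."""
    if include:
        wanted = {name.lower() for name in include}
        return [p for p in predictors if p.lower() in wanted]
    if exclude:
        unwanted = {name.lower() for name in exclude}
        return [p for p in predictors if p.lower() not in unwanted]
    return list(predictors)

available = ["IMutant2", "DDGun", "Mupro", "PONSol2"]
print(filter_predictors(available, include=["ddgun"]))   # only DDGun
print(filter_predictors(available, exclude=["Mupro"]))   # all but Mupro
# When both are given, exclude is ignored:
print(filter_predictors(available, include=["DDGun"], exclude=["DDGun"]))
```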

__map(row)

Map the predictors to the input data. If the input data is a Fasta object, only sequence predictors will be selected. If the input data is a Pdb object, only structure predictors will be selected. If the input data is a PdbFile object, both sequence and structure predictors will be selected.

Parameters:

row (pandas.Series) – A row of the input data.

Returns:

A list of selected predictors.

Return type:

list

save_results(results: DataFrame)[source]

Save the results to a CSV file in the outfolder.

Parameters:

results (pandas.DataFrame) – The results to be saved.

gather_results()[source]

Gather the results from the predictors.

async __periodically_gather_results()

Gather the results from the predictors. The results are concatenated and saved to a CSV file in the outfolder.

async __run()

Run the predictors asynchronously and gather the results. The results are returned as a pandas DataFrame.

Returns:

The results as a pandas DataFrame.

Return type:

pandas.DataFrame
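The fan-out performed by __run follows the standard asyncio pattern; a minimal, self-contained sketch with stand-in coroutines in place of the real predictors:

```python
import asyncio

async def fake_predictor(name, delay):
    # Stand-in for a predictor's compute(): await the service, return rows.
    await asyncio.sleep(delay)
    return [{"predictor": name, "ddg": 0.0}]

async def run_all(predictors):
    # Run every predictor concurrently and concatenate their row lists.
    results = await asyncio.gather(*(fake_predictor(n, d) for n, d in predictors))
    return [row for rows in results for row in rows]

rows = asyncio.run(run_all([("DDGun", 0.01), ("Mupro", 0.0)]))
print(rows)
```

In the library the concatenated rows end up in a pandas DataFrame; the plain list of dictionaries above keeps the sketch dependency-free.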

Preprocessor

class benchstab.preprocessor.PreprocessorRow(identifier: benchstab.utils.structure.PDB | benchstab.utils.structure.Fasta = None, mutation: str = None, chain: str = None, fasta: benchstab.utils.structure.Fasta = None, fasta_mutation: str = None, ph: float = 7.0, temperature: float = 25.0)[source]

Bases: object

identifier: PDB | Fasta = None
mutation: str = None
chain: str = None
fasta: Fasta = None
fasta_mutation: str = None
ph: float = 7.0
temperature: float = 25.0
to_dict()[source]

Converts the PreprocessorRow object to a dictionary.

Returns:

Dictionary containing the PreprocessorRow object

Return type:

Dict[str, Any]

is_valid() bool[source]

Check if the PreprocessorRow object is valid.

Returns:

True if the PreprocessorRow object is valid, False otherwise

Return type:

bool
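A simplified stand-in for the row dataclass shows how to_dict and is_valid fit together (the real class wraps PDB/Fasta objects rather than plain strings):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Row:
    identifier: Optional[str] = None
    mutation: Optional[str] = None
    chain: Optional[str] = None
    ph: float = 7.0
    temperature: float = 25.0

    def to_dict(self):
        return asdict(self)

    def is_valid(self):
        # A row needs at least an identifier and a mutation to be usable.
        return self.identifier is not None and self.mutation is not None

row = Row(identifier="1CRN", mutation="A2G", chain="A")
print(row.to_dict())
print(row.is_valid())  # True
```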

class benchstab.preprocessor.Preprocessor(input: str | list | TextIO, outfolder: str | None = None, permissive: bool = True, verbosity: int = 0, skip_header: bool = False, *args, **kwargs)[source]

Bases: object

The Preprocessor class is used to parse the input file and create a dataset that can be used by the predictors.

The input file can be in the following formats:
  • PDB identifier, mutation and chain

  • Fasta identifier, mutation, pH and temperature

  • PDB identifier, mutation, pH and temperature

  • Fasta identifier, mutation and chain

The class can also generate a summary of the dataset.

The summary includes the following information:
  • Number of mutations

  • Number of proteins

  • Average number of mutations per protein

  • Number of mutations with positive charge

  • Number of mutations with negative charge

  • Number of mutations with no charge

  • Number of mutations with acidic chemical properties

  • Number of mutations with basic chemical properties

  • Number of mutations with aromatic chemical properties

  • Number of mutations with aliphatic chemical properties

  • Number of mutations with hydroxyl chemical properties

  • Number of mutations with sulfur chemical properties

  • Number of mutations with amide chemical properties

  • Number of mutations with non-polar chemical properties

  • Number of mutations with polar chemical properties

Parameters:
  • input (Union[str, list, TextIO]) – Input file containing the protein identifier, mutation and chain

  • outfolder (str) – Folder where the preprocessed input will be saved

  • permissive (bool) – If True, the preprocessing script will continue if it encounters an error

  • verbosity (int) – Verbosity level

  • skip_header (bool) – If True, the header in the input file will be skipped
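A rough sketch of the line splitting with pH/temperature defaults; the heuristics below (a non-numeric third field is a chain, numeric trailing fields are pH and temperature) are illustrative assumptions, not the library's actual rules:

```python
def parse_input_line(line, sep=None):
    """Split one input line into identifier/mutation plus optional fields."""
    fields = line.strip().split(sep)
    if len(fields) < 2:
        raise ValueError(f"expected at least identifier and mutation: {line!r}")
    record = {"identifier": fields[0], "mutation": fields[1],
              "chain": None, "ph": 7.0, "temperature": 25.0}
    rest = fields[2:]
    # Assumption: a non-numeric third field is a chain identifier.
    if rest and not rest[0].replace(".", "", 1).isdigit():
        record["chain"] = rest.pop(0)
    if rest:
        record["ph"] = float(rest.pop(0))
    if rest:
        record["temperature"] = float(rest.pop(0))
    return record

print(parse_input_line("1CRN A2G A"))
print(parse_input_line("P12345 A2G 6.5 37"))
```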

logger = <Logger benchstab.preprocessor (INFO)>
classmethod print_summary(summary, logger: Logger | None = None)[source]

Print the summary generated by create_summary to stdout using the provided logger or the default logger.

Parameters:
  • summary (Dict[str, str]) – Summary to be printed

  • logger (logging.Logger) – Logger to be used for printing the summary

Returns:

None

Return type:

None

classmethod create_summary(data: PredictorDataset, verbose: bool = True, outfolder: str | None = None, logger: Logger | None = None) Dict[str, str][source]

Create a summary of the dataset.

The summary includes the following information:
  • Number of mutations

  • Number of proteins

  • Average number of mutations per protein

  • Number of mutations with positive charge

  • Number of mutations with negative charge

  • Number of mutations with no charge

  • Number of mutations with acidic chemical properties

  • Number of mutations with basic chemical properties

  • Number of mutations with aromatic chemical properties

  • Number of mutations with aliphatic chemical properties

  • Number of mutations with hydroxyl chemical properties

  • Number of mutations with sulfur chemical properties

  • Number of mutations with amide chemical properties

  • Number of mutations with non-polar chemical properties

  • Number of mutations with polar chemical properties

Parameters:
  • data (PredictorDataset) – Dataset to be summarized

  • verbose (bool) – If True, the summary will be printed to stdout

  • outfolder (str) – If provided, the summary will be saved to a file in the provided folder

  • logger (logging.Logger) – Logger to be used for printing the summary

Returns:

Dictionary with the summary

Return type:

Dict[str, str]
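The charge counts can be derived from a standard residue-property table; a minimal sketch, assuming the summary classifies by the mutant residue (the real implementation may classify differently and covers more properties):

```python
from collections import Counter

# Charge of the mutant residue (illustrative subset of the property table).
CHARGE = {"D": "negative", "E": "negative",
          "K": "positive", "R": "positive", "H": "positive"}

def summarize(rows):
    proteins = {r["identifier"] for r in rows}
    charges = Counter(CHARGE.get(r["mutation"][-1], "no") for r in rows)
    return {
        "Number of mutations": len(rows),
        "Number of proteins": len(proteins),
        "Average mutations per protein": len(rows) / max(len(proteins), 1),
        **{f"Mutations with {c} charge": n for c, n in charges.items()},
    }

rows = [{"identifier": "1CRN", "mutation": "A2K"},
        {"identifier": "1CRN", "mutation": "A2G"},
        {"identifier": "2LZM", "mutation": "T26E"}]
print(summarize(rows))
```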

parse_fasta(data: str) PreprocessorRow[source]

Parse the line containing the fasta identifier and the mutation.

If the fasta identifier is valid, return:
  • Fasta object.

  • mutation in the format WT_RESIDUE + POSITION + MUT_RESIDUE.

  • pH (default: 7.0 if not supplied).

  • temperature (default: 25.0 if not supplied).

Parameters:

data (str) – Line containing the fasta identifier and the mutation

Returns:

Dictionary containing the fasta object, mutation, pH and temperature

Return type:

PreprocessorRow

extract_fasta_from_pdb(identifier: PDB, chain: str, source: str) Fasta[source]

Extract fasta record from PDB file.

Parameters:
  • identifier (PDB) – PDB identifier

  • chain (str) – Chain identifier

  • source (str) – Source of the PDB file. Accepted values are: file, rcsb, uniprot

Returns:

Fasta record

Return type:

Fasta

parse_fasta_mutation(mutation: str, fasta: Fasta, permissive: bool = True) str[source]

Wraps the __parse_fasta_mutation function in a try/except block. If the mutation string is not valid, the function raises a PreprocessorError with the permissive flag set to True, indicating that the error is not critical and that the preprocessing script can continue.

Parameters:
  • mutation (str) – Mutation string to be parsed

  • fasta (Fasta) – Fasta record

Returns:

Mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE

Return type:

str

parse_struct(data: List[str]) PreprocessorRow | None[source]

Parse the line containing the protein identifier, the mutation and the chain.

If the protein identifier is valid, return:
  • PDB object

  • mutation in the format WT_RESIDUE + POSITION + MUT_RESIDUE

  • chain

  • fasta object

  • pH (default: 7.0 if not supplied)

  • temperature (default: 25.0 if not supplied)

Parameters:

data (List[str]) – Line containing the protein identifier, the mutation and the chain

Returns:

Dictionary containing the PDB object, mutation, chain, fasta object, pH and temperature

Return type:

PreprocessorRow

parse_mutation(mutation: str) str[source]

Parse the mutation string and check if it is valid.

Parameters:

mutation (str) – Mutation string to be parsed

Returns:

Mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE

Return type:

str
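Syntactic validation of the WT_RESIDUE + POSITION + MUT_RESIDUE format (e.g. A123G) can be sketched with a regular expression:

```python
import re

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter amino acid codes
MUTATION_RE = re.compile(rf"^([{AA}])(\d+)([{AA}])$")

def parse_mutation(mutation):
    """Validate e.g. 'A123G' and return it normalised to upper case."""
    match = MUTATION_RE.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"invalid mutation string: {mutation!r}")
    wt, pos, mut = match.groups()
    if wt == mut:
        raise ValueError("wild-type and mutant residue are identical")
    return f"{wt}{int(pos)}{mut}"

print(parse_mutation("a123g"))  # A123G
```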

__parse_fasta_mutation(mutation: str, fasta: Fasta, permissive: bool = False) str

Parse the mutation string and check if it is valid. If the mutation is valid, return the mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE.

As this function also handles the parsing of the mutation string for the fasta record extracted from PDBs, it is possible that the mutation string is not valid. In this case, the function will raise a PreprocessorError with the permissive flag set to True. This flag indicates that the error is not critical and that the preprocessing script can continue.

Parameters:
  • mutation (str) – Mutation string to be parsed

  • fasta (Fasta) – Fasta record

  • permissive (bool) – If True, the function will raise a PreprocessorError with the permissive flag set to True

Returns:

Mutation string in the format WT_RESIDUE + POSITION + MUT_RESIDUE

Return type:

str
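The sequence check and the permissive-error behaviour described above can be sketched with a small exception class (names and signatures are illustrative stand-ins for the real ones):

```python
class PreprocessorError(Exception):
    def __init__(self, message, permissive=False):
        super().__init__(message)
        self.permissive = permissive  # True: non-critical, processing may continue

def check_fasta_mutation(mutation, sequence, permissive=False):
    """Verify the wild-type residue matches the sequence (1-based position)."""
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    if pos < 1 or pos > len(sequence) or sequence[pos - 1] != wt:
        raise PreprocessorError(
            f"{mutation}: residue {pos} of the sequence is not {wt}",
            permissive=permissive,
        )
    return mutation

print(check_fasta_mutation("A2G", "MAGT"))  # sequence[1] == 'A', so valid
```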

parse_line(line: str, sep: str | None = None) PreprocessorRow | None[source]

Parse the line containing the protein identifier and mutation (chain).

If the protein identifier is valid, return:
  • PDB object

  • mutation in the format WT_RESIDUE + POSITION + MUT_RESIDUE

  • chain

  • fasta object

  • pH (default: 7.0 if not supplied)

  • temperature (default: 25.0 if not supplied)

Parameters:
  • line (str) – Line containing the protein identifier, the mutation and the chain

  • sep (str) – Column separator

Returns:

Dictionary containing the PDB/Fasta object, mutation, chain, fasta object, pH and temperature

Return type:

Union[PreprocessorRow, None]

__exception_wrapper(func: callable, *args, **kwargs)

Wrap the function call in a try/except block. If the function raises a PreprocessorError or FileNotFoundError, the function will return None and the error will be logged. If the function raises any other exception, the exception will be raised.

Parameters:

func (callable) – Function to be wrapped

Returns:

Function result or None

Return type:

Union[None, Any]
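The wrapper pattern — catch the expected, recoverable errors, log them, return None, and let anything else propagate — looks like this in outline (ValueError stands in for PreprocessorError):

```python
import logging

logger = logging.getLogger(__name__)

def exception_wrapper(func, *args, **kwargs):
    """Return func(*args), or None if it raises an expected, recoverable error."""
    try:
        return func(*args, **kwargs)
    except (ValueError, FileNotFoundError) as exc:
        logger.error("skipping row: %s", exc)
        return None
    # Any other exception propagates unchanged.

print(exception_wrapper(int, "42"))    # 42
print(exception_wrapper(int, "oops"))  # None (ValueError was logged)
```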

parse() PredictorDataset[source]

Initiates the mutation file parsing process.

Base Predictor

class benchstab.predictors.base.PredictorFlags(webkit: bool = False, group_mutations: bool = False, group_mutations_by: list[str] = <factory>, mutation_delimiter: str = ', ')[source]

Bases: object

Class for storing predictor flags. The flags are used to control the behaviour of the predictor.

webkit: bool = False
group_mutations: bool = False
group_mutations_by: list[str]
mutation_delimiter: str = ','
class benchstab.predictors.base.BaseCredentials(username: str = '', password: str = '', email: str = '', url: str = '')[source]

Bases: object

Base class for predictor credentials, used to authenticate the user. The credentials are stored in a dictionary and sent as a POST request to the url specified in the credentials. The credentials class variable should be overridden by the child class.

username: str = ''
password: str = ''
email: str = ''
url: str = ''
get_payload(**kwargs)[source]

Create a dictionary of parameters to be sent as a POST request. This function should be implemented by the child class.

Parameters:

kwargs (dict) – extra parameters

Returns:

dictionary of parameters

Return type:

dict
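A child class typically maps the stored fields onto the form field names a particular service expects; an illustrative sketch with invented field names and endpoint (not any real predictor's API):

```python
from dataclasses import dataclass

@dataclass
class Credentials:
    username: str = ""
    password: str = ""
    url: str = ""

    def get_payload(self, **kwargs):
        raise NotImplementedError

@dataclass
class ExampleCredentials(Credentials):
    url: str = "https://example.org/login"  # hypothetical endpoint

    def get_payload(self, **kwargs):
        # Map stored fields onto the form field names the service expects.
        return {"user": self.username, "pass": self.password, **kwargs}

creds = ExampleCredentials(username="alice", password="s3cret")
print(creds.get_payload(remember="1"))
```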

class benchstab.predictors.base.PredictorHeader(name: str = '', input_type: str = '', classname: str = '', mutation_column: str = 'mutation')[source]

Bases: object

Class for storing predictor headers. The predictor headers are used to identify the predictor.

name: str = ''
input_type: str = ''
classname: str = ''
mutation_column: str = 'mutation'
class benchstab.predictors.base.BasePredictor(data: PredictorDataset, flags: PredictorFlags | None = None, outfolder: str | None = None, username: str = '', email: str = 'generic@email.com', password: str = '', wait_interval: int = 60, batch_size: int = -1, verbosity: int = 0, *args, **kwargs)[source]

Bases: object

Base class for predictors. The class is responsible for the following:
  1. Sending the query to the predictor.

  2. Retrieving the results from the predictor.

  3. Aggregating the results.

  4. Returning the results as a PredictorDataset.

url = ''
aggr_columns = {'chain', 'fasta_mutation', 'mutation'}
credentials

alias of BaseCredentials

async classmethod is_available_async(url: str) str[source]

Check if the predictor is available. This is done by sending a GET request to the specified url. If the request is successful, the predictor is available.

Parameters:

url (str) – url of the predictor

Returns:

status of the predictor. ‘Available’ if the predictor is available, ‘Offline’ otherwise

Return type:

str

classmethod is_available(url: str) str[source]

Check if the predictor is available. This is done by sending a GET request to the specified url. If the request is successful, the predictor is available.

Parameters:

url (str) – url of the predictor

Returns:

status of the predictor. ‘Available’ if the predictor is available, ‘Offline’ otherwise

Return type:

str

classmethod header()[source]

Return the header of the predictor. The header is used to identify the predictor.

Returns:

predictor header

Return type:

PredictorHeader

async classmethod async_default_callback(index: int, response: ClientResponse, session: ClientSession)[source]

Default callback function for the GET request. It checks if the request was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the GET request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the request was successful, False otherwise

Return type:

bool

process_result(index: int, mutation: str, chain: str, result: Any) None[source]
format_mutation(data: str | Dict | DatasetRow) str[source]

Format the mutation to the format required by the predictor. This function should be implemented by the child class.

Parameters:

data (Union[str, Dict, DatasetRow]) – mutation

Returns:

formatted mutation

Return type:

str

prepare_mutation(row: DatasetRow) str[source]

Prepare the mutation to be sent to the predictor.

This function performs the following steps:
  1. Convert the mutation to the format required by the predictor.

  2. Group the mutations if needed.

Parameters:

row (DatasetRow) – row of the dataset

Returns:

mutation

Return type:

str

async send_query(session: ClientSession, index: int, *args, **kwargs) bool[source]

Send the query to the predictor. This function should be implemented by the child class.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the query was sent successfully, False otherwise

Return type:

bool

async retrieve_result(session: ClientSession, index: int) bool[source]

Retrieve the results of the prediction. This function should be implemented by the child class.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the prediction was successful, False otherwise

Return type:

bool

__prepare_payload(row: DatasetRow) Dict

Wrapper around the prepare_payload function. It catches any exceptions and updates the status of the row accordingly.

Parameters:

row (DatasetRow) – row of the dataset

Returns:

payload

Return type:

dict

prepare_payload(row: DatasetRow) Dict[source]

Prepare the payload to be sent to the predictor. This function should be implemented by the child class.

Parameters:

row (DatasetRow) – row of the dataset

Returns:

payload

Return type:

dict

get_results() PredictorDataset[source]

Get the results of the prediction.

Returns:

prediction results

Return type:

PredictorDataset

_aggregate(data) List[Dict[Any, Any]][source]

Helper function aggregating the data into a list of dictionaries.

Parameters:

data (PredictorDataset) – data to be aggregated

Returns:

aggregated data

Return type:

list[DatasetRow]

setup() None[source]

Set up the dataset. This includes grouping mutations, creating the payload, etc.

async __exception_wrapper(func: Callable, index: int | None = None, *args, **kwargs) bool

Wrapper around the async functions. It catches any exceptions and updates the status of the row accordingly. If the exception is an HTMLParserError with permissive=True, it returns False; otherwise it returns True.

Parameters:
  • func (Callable) – function to be executed

  • index (int) – index of the row

Returns:

True if the function was executed successfully, False otherwise

Return type:

bool

async login(session: ClientSession, index: int, login_extra: Dict[str, Any] | None = None) bool[source]

Login to the predictor. This function should be implemented by the child class.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

  • login_extra (dict) – extra parameters for the login function

Returns:

True if the login was successful, False otherwise

Return type:

bool

async _queue_prediction(queue)[source]

Create a queue of tasks to be executed in parallel. The queue is created from the indices of the dataset. The queue is filled until it reaches the batch_size. If the queue is full, the function waits for the queue to be emptied. If the dataset is exhausted, the function returns.

Parameters:

queue (asyncio.Queue) – queue of tasks

async compute()[source]

The main function of the predictor.

It is responsible for the following:
  1. Check if the predictor is available (if not, return immediately).

  2. Set up the dataset (group mutations, etc.)

  3. Create a queue of tasks to be executed in parallel.

  4. Create a queue of workers to execute the tasks.

  5. Wait for all tasks to be completed (join the queue).

  6. Return the results as a PredictorDataset.

Returns:

prediction results

Return type:

PredictorDataset
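Steps 3–5 correspond to the standard asyncio producer/worker pattern; a self-contained sketch with stand-in work in place of the login/query/retrieve cycle:

```python
import asyncio

async def producer(queue, items):
    for item in items:
        await queue.put(item)  # blocks when the queue (batch) is full

async def worker(queue, results):
    while True:
        index = await queue.get()
        # Stand-in for login / send_query / retrieve_result on row `index`.
        results.append(index * 10)
        queue.task_done()

async def compute(items, batch_size=2, n_workers=2):
    queue = asyncio.Queue(maxsize=batch_size)
    results = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(n_workers)]
    await producer(queue, items)
    await queue.join()  # wait until every queued task is marked done
    for w in workers:
        w.cancel()
    return results

print(asyncio.run(compute([0, 1, 2, 3])))
```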

async _run_prediction(queue)[source]

Run the prediction. This function is executed in parallel by the workers.

It takes an index from the queue and executes the following steps:
  1. Login to the predictor.

  2. Send the query to the predictor.

  3. Retrieve the results from the predictor.

Parameters:

queue (asyncio.Queue) – queue of tasks

make_form(payload)[source]

Create a multipart form from a dictionary of parameters. Since the current version (==3.8.5) of aiohttp does not support assigning a custom boundary to the FormData object directly, we need to create a custom MultipartWriter and assign it to the FormData object.

Parameters:

payload (dict) – dictionary of parameters

Returns:

multipart form

Return type:

aiohttp.FormData
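The workaround itself is aiohttp-specific, but the wire format a custom boundary produces can be illustrated with the stdlib alone (a sketch of multipart/form-data framing, not the library's make_form):

```python
def multipart_body(payload, boundary="----benchstab-boundary"):
    """Build a multipart/form-data body with an explicit boundary."""
    lines = []
    for name, value in payload.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", str(value)]
    lines += [f"--{boundary}--", ""]  # closing boundary
    body = "\r\n".join(lines)
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type

body, ctype = multipart_body({"pdb_id": "1CRN", "mutation": "A2G"})
print(ctype)
```

Some services reject requests whose boundary does not match an expected pattern, which is why controlling it explicitly matters here.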

async __get(session: ClientSession, callback: Callable, index: int, *args, **kwargs) bool

Wrapper around the aiohttp GET request. It catches any exceptions and updates the status of the row accordingly.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

Return type:

bool

async get(session: ClientSession, dataset: DatasetRow | Dict, callback: Callable, index: int | None = None, *args, **kwargs) bool[source]

Send a GET request to the predictor. The request is sent to the url specified in the dataset. The response is handled by the callback function. The callback function should return True if the request was successful, False otherwise. If the callback function is not specified, the default callback function is used.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • dataset (Union[DatasetRow, Dict]) – dataset

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

Return type:

bool

async __post(session: ClientSession, callback: Callable, index: int, *args, **kwargs) bool

Wrapper around the aiohttp POST request. It catches any exceptions and updates the status of the row accordingly.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

Return type:

bool

async post(session: ClientSession, dataset: DatasetRow | Dict, callback: Callable, index: int | None = None, *args, **kwargs) bool[source]

Send a POST request to the predictor. The request is sent to the url specified in the dataset. The response is handled by the callback function. The callback function should return True if the request was successful, False otherwise. If the callback function is not specified, the default callback function is used.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • dataset (Union[DatasetRow, Dict]) – dataset

  • callback (Callable) – callback function

  • index (int) – index of the row

Returns:

result of the callback function

class benchstab.predictors.base.BasePostPredictor(data: PredictorDataset, flags: PredictorFlags | None = None, outfolder: str | None = None, username: str = '', email: str = 'generic@email.com', password: str = '', wait_interval: int = 60, batch_size: int = -1, verbosity: int = 0, *args, **kwargs)[source]

Bases: BasePredictor

Base class for predictors that require a POST request. The POST request is sent to the url specified in the dataset. The response is handled by the default_post_handler.

async send_query(session: ClientSession, index: int, *args, **kwargs) bool[source]

Send the query to the predictor. If the predictor is a form-data predictor, the query is sent as a multipart form. Otherwise, it is sent as a JSON object.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the query was sent successfully, False otherwise

Return type:

bool

async default_post_handler(index: int, response: ClientResponse, session: ClientSession)[source]

Default callback function for the POST request. It checks if the request was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the POST request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the request was successful, False otherwise

Return type:

bool

class benchstab.predictors.base.BaseAuthentication(data: PredictorDataset, flags: PredictorFlags | None = None, outfolder: str | None = None, username: str = '', email: str = 'generic@email.com', password: str = '', wait_interval: int = 60, batch_size: int = -1, verbosity: int = 0, *args, **kwargs)[source]

Bases: BasePostPredictor

Base class for predictors that require authentication. The authentication is done by sending a POST request to the url specified in the credentials. The response is handled by the login_handler.

async login(session: ClientSession, index: int, login_extra: Dict[str, Any] | None = None) bool[source]

Login to the predictor. The login is done by sending a POST request to the url specified in the credentials. The response is handled by the login_handler function. The login_handler function should return True if the login was successful, False otherwise. The function uses the credentials specified in the credentials class variable.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

  • login_extra (dict) – extra parameters for the login function

Returns:

True if the login was successful, False otherwise

Return type:

bool

async login_handler(index: int, response: ClientResponse, session: ClientSession) bool[source]

Default callback function for the login request. It checks if the login was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the login request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the login was successful, False otherwise

Return type:

bool

class benchstab.predictors.base.BaseGetPredictor(max_retries: int = 100, *args, **kwargs)[source]

Bases: BasePostPredictor

Base class for predictors that require a GET request. The GET request is sent to the url specified in the dataset. The response is handled by the default_get_handler.

async retrieve_result(session: ClientSession, index: int) bool[source]

Retrieve the results of the prediction. The results are retrieved by sending a GET request to the url specified in the dataset. If the datapoint is already processed, the function returns True; otherwise it returns the result of the default_get_handler function.

Parameters:
  • session (aiohttp.ClientSession) – aiohttp session

  • index (int) – index of the row

Returns:

True if the request was successful, False otherwise

Return type:

bool

async default_get_handler(index: int, response: ClientResponse, session: ClientSession) bool[source]

Default callback function for the GET request. It checks if the request was successful and updates the status of the row accordingly.

Parameters:
  • index (int) – index of the row

  • response (aiohttp.ClientResponse) – response of the GET request

  • session (aiohttp.ClientSession) – aiohttp session

Returns:

True if the request was successful, False otherwise

Return type:

bool