Utilities

HTML Parser

class benchstab.utils.html_parser.HTMLParser[source]

Bases: object

with_xpath(xpath: str, html: str | None = None, root=None, index: int = 0, permissive=True)[source]

Parse HTML with an XPath expression and return the result. If index is not None, return only the value at that index.

Parameters:

xpath – XPath expression
html – HTML string
root – lxml.html
index – int
permissive – bool

Returns:

str or list

Raises:

HTMLParserError

with_pandas(html: str, index: int = 0, pandas_args: Dict[Any, Any] | None = None, permissive=True)[source]

Parse HTML with pandas and return the result. If index is not None, return the value at the index.

Parameters:

html – HTML string
index – int
pandas_args – dict
permissive – bool

Returns:

pd.DataFrame or list

Raises:

HTMLParserError

__check_enough_values(result, index, permissive)
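The index handling described for with_xpath and with_pandas can be sketched roughly as below. This is a minimal illustration, not the library's code: select is a hypothetical helper, HTMLParserError is a local stand-in, the standard-library XML parser replaces lxml, and the meaning given to permissive (return None instead of raising when too few values were found) is an assumption that this page does not confirm.

```python
from typing import List, Optional, Union
import xml.etree.ElementTree as ET


class HTMLParserError(Exception):
    """Local stand-in for benchstab.utils.exceptions.HTMLParserError."""


def select(result: List[str], index: Optional[int] = 0,
           permissive: bool = True) -> Union[str, List[str], None]:
    """Return the whole list when index is None, else the item at index."""
    if index is None:
        return result
    if index >= len(result) or index < -len(result):
        if permissive:
            # Assumed behaviour: tolerate a short result instead of raising.
            return None
        raise HTMLParserError(
            f"no value at index {index}: only {len(result)} value(s) found"
        )
    return result[index]


# Well-formed markup, so the standard-library parser can stand in for lxml.
doc = ET.fromstring("<table><tr><td>1.25</td><td>-0.40</td></tr></table>")
cells = [td.text for td in doc.findall(".//td")]

first = select(cells)                   # index defaults to 0
everything = select(cells, index=None)  # no index: the full list
missing = select(cells, index=5)        # permissive: no exception
```

Passing permissive=False in this sketch turns the short result into an HTMLParserError, which is the exception both parser methods are documented to raise.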

Exceptions

exception benchstab.utils.exceptions._BaseError(*args: object, **kwargs: object)[source]: Bases: Exception

exception benchstab.utils.exceptions.HTMLParserError(*args: object, **kwargs: object)[source]: Bases: _BaseError

exception benchstab.utils.exceptions.PreprocessorError(*args: object, **kwargs: object)[source]: Bases: _BaseError

exception benchstab.utils.exceptions.PredictorError(*args: object, **kwargs: object)[source]: Bases: _BaseError

exception benchstab.utils.exceptions.DatasetError(*args: object, **kwargs: object)[source]: Bases: _BaseError

exception benchstab.utils.exceptions.BenchStabError(*args: object, **kwargs: object)[source]: Bases: _BaseError

Structure

class benchstab.utils.structure.File(file: str = '', name: str = '')[source]

Bases: object

Base class for all file types. Contains methods for converting the file to different formats, as well as methods for fetching files from URLs and opening files.

_rcsb_url = ''

_uniprot_url = ''

__to_multipart(content_type: str) → Dict[str, str | bytes | StringIO]

Convert the file to a multipart (form-data) file. If the file does not exist, create a temporary file. The final format depends on the content type.

Parameters:: content_type (str) – content type of the file. Allowed types: - text/plain - application/octet-stream
Returns:: multipart file
Return type:: dict

to_bytes() → bytes[source]

Convert the file to BytesIO.

Returns:: file as bytes
Return type:: BytesIO object

to_octet_stream() → Dict[str, str | bytes | StringIO][source]

Convert the file to an octet-stream.

Returns:: file as octet-stream
Return type:: bytes

to_plain_text() → Dict[str, str | bytes | StringIO][source]

Convert the file to plain text.

Returns:: file as plain text
Return type:: str

classmethod get_from_url_by_id(url: str, **kwargs) → Response[source]

Get the file from a URL by ID. Raise an exception if the file does not exist.

Parameters:

id (str) – ID
url (str) – URL

Returns:

file

Return type:

requests.Response

Raises:

PreprocessorError – if the record does not exist

classmethod open(file_path: str, mode: str = 'r', encoding: str = 'utf-8') → TextIOWrapper | BufferedReader[source]

Open a file. Raise an exception if the file does not exist.

Parameters:

file_path (str) – file path
mode (str) – file mode
encoding (str) – file encoding

Returns:

file

Return type:

Union[TextIOWrapper, BufferedReader]

Raises:

FileNotFoundError – if the file does not exist

class benchstab.utils.structure.Fasta(sequence: str, chain: str, header: str, name: str | None = None, offsets: Dict[int, int] | None = None)[source]

Bases: File

FASTA file class. Contains methods for extracting FASTA sequences from different sources.

_sifts_url = 'https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{id}'

_rcsb_url = 'https://www.rcsb.org/fasta/entry/{id}'

_uniprot_url = 'https://rest.uniprot.org/uniprotkb/{id}.fasta'

_rcsb_polymer_instance_url = 'https://data.rcsb.org/rest/v1/core/polymer_entity_instance/{id}/{chain}'

logger = <Logger benchstab.utils.structure (INFO)>
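The URL attributes above are templates with {id} and {chain} placeholders. A plausible way to fill them is str.format, as sketched below; that the class uses str.format internally is an assumption, and the identifiers are illustrative values only. The snippet builds the strings and makes no network request.

```python
# Templates copied from the class attributes listed above.
SIFTS_URL = "https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{id}"
RCSB_URL = "https://www.rcsb.org/fasta/entry/{id}"
UNIPROT_URL = "https://rest.uniprot.org/uniprotkb/{id}.fasta"
POLYMER_INSTANCE_URL = (
    "https://data.rcsb.org/rest/v1/core/polymer_entity_instance/{id}/{chain}"
)

# Illustrative identifiers, chosen only to show the substitution.
pdb_id, chain, accession = "1CRN", "A", "P01542"

sifts = SIFTS_URL.format(id=pdb_id)
rcsb = RCSB_URL.format(id=pdb_id)
uniprot = UNIPROT_URL.format(id=accession)
instance = POLYMER_INSTANCE_URL.format(id=pdb_id, chain=chain)
```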

classmethod _find_delimiter(header: str)[source]

classmethod create(datapoint: str)[source]

Determine the FASTA sequence format and return a FASTA object.

Parameters:: datapoint (str) – FASTA string or file path
Returns:: FASTA object
Return type:: Fasta
Raises:: PreprocessorError – if the sequence is invalid
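To make the "FASTA string" input concrete, the sketch below splits a single-record FASTA string into its header and sequence using only the standard library. split_fasta is a hypothetical helper, not part of the package; the checks it performs and the ValueError it raises are assumptions (the documented method raises PreprocessorError), and file-path input is not handled here.

```python
from typing import Tuple


def split_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record: missing '>' header line")
    header = lines[0][1:]
    # Sequence lines may be wrapped, so join them back together.
    sequence = "".join(lines[1:]).upper()
    if not sequence or not sequence.isalpha():
        raise ValueError("not a FASTA record: empty or non-letter sequence")
    return header, sequence


record = ">1CRN_1|Chain A|CRAMBIN\nTTCCPSIVAR\nSNFNVCRLPG\n"
header, sequence = split_fasta(record)
```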

classmethod from_uniprot(datapoint)[source]

Extract the FASTA sequence from the Uniprot database.

Parameters:: datapoint (str) – Uniprot ID
Returns:: FASTA object
Return type:: Fasta

classmethod __get_author_residue_number_offset(pdb_id: str, chain: str) → List[str]

Get the author residue number start from the RCSB database.

Parameters:

pdb_id (str) – PDB ID
chain (str) – chain ID

Returns:

author residue number offset

Return type:

List[str]

classmethod __fasta_offsets_from_sifts(pdb_id: str, chain: str) → Dict[str, str | List[str] | int]

Get the offset between the PDB and Uniprot FASTA sequences.

Parameters:

pdb_id (str) – PDB ID
chain (str) – chain ID

Returns:

Uniprot ID, PDB offsets, Uniprot start, Uniprot end

Return type:

Dict[str, Union[str, List[str], int]]

classmethod from_uniprot_by_pdb_id(pdb_id: str, chain: str)[source]

Extract the FASTA sequence from the Uniprot database.

Parameters:

pdb_id (str) – PDB ID
chain (str) – chain ID

Returns:

FASTA object

Return type:

Fasta

Raises:

PreprocessorError – if the chain is invalid for the structure
PreprocessorError – if the structure is not mapped to Uniprot

classmethod from_rcsb_by_pdb_id(pdb_id: str, chain: str)[source]

Extract the FASTA sequence from the RCSB PDB database.

Parameters:

pdb_id (str) – PDB ID
chain (str) – chain ID

Returns:

FASTA object

Return type:

Fasta

Raises:

PreprocessorError – if the chain is invalid for the structure
PreprocessorError – if the structure is not mapped to PDB

classmethod from_file(file_path)[source]

Extract the FASTA sequence from a FASTA file.

Parameters:

file_path (str) – FASTA file path

Returns:

FASTA object

Return type:

Fasta

Raises:

PreprocessorError – if the structure is invalid by BioPython standards
PreprocessorError – if the file does not contain any sequences

classmethod from_pdb_file(file_path: str, chain: str)[source]

Extract the FASTA sequence from a PDB file.

Parameters:

file_path (str) – PDB file path
chain (str) – chain ID

Returns:

FASTA object

Return type:

Fasta

Raises:

PreprocessorError – if the structure is invalid by BioPython standards
PreprocessorError – if the chain is invalid for the structure
PreprocessorError – if the file does not contain any sequences

classmethod extract_chain(chain)[source]

Extract the chain ID from the FASTA header.

Parameters:: chain (str) – chain ID
Returns:: chain ID
Return type:: str

class benchstab.utils.structure.PDB(pdb_file: str, pdb_id: str | None = None, source: str = 'file', chains: List[str] | None = None)[source]

Bases: File

PDB file class. Contains methods for extracting PDB structures from different sources.

_rcsb_url = 'https://files.rcsb.org/download/{id}.pdb'

logger = <Logger benchstab.utils.structure (INFO)>

classmethod create(pdb: str)[source]

Determine the PDB structure format and return a PDB object.

Parameters:: pdb (str) – PDB structure or file path
Returns:: PDB object
Return type:: PDB

classmethod from_id(pdb_id: str)[source]

Fetch the PDB record from the RCSB PDB database, process it and return a PDB object.

Parameters:: pdb_id (str) – PDB ID
Returns:: PDB object
Return type:: PDB

classmethod from_file(file_handle: str | StringIO)[source]

Process the PDB file and return a PDB object. Log a warning if the structure is faulty by BioPython standards. Raise an exception if the structure is invalid by BioPython standards.

Parameters:: file_handle (Union[str, StringIO]) – PDB file handle
Returns:: PDB object
Return type:: PDB
Raises:: PreprocessorError – if the structure is invalid by BioPython standards

Status

class benchstab.utils.status._Status(status: str = '', message: str = '')[source]

Bases: object

blocking = False

status: str = ''

message: str = ''

class benchstab.utils.status.NotStarted(status: str = 'not started', message: str = 'The job has not started yet.')[source]

Bases: _Status

This status represents the initial state of the job.

blocking = False

status: str = 'not started'

message: str = 'The job has not started yet.'

class benchstab.utils.status.Authenticaton(status: str = 'authentication', message: str = 'The job is being authenticated.')[source]

Bases: _Status

This status represents that the user is being authenticated to the predictor.

blocking = True

status: str = 'authentication'

message: str = 'The job is being authenticated.'

class benchstab.utils.status.Waiting(status: str = 'waiting', message: str = "The job is waiting in predictor's queue.")[source]

Bases: _Status

This status represents that the job is waiting in the predictor’s queue (in case of POST-GET predictors), or that the job is waiting for the response (in case of POST predictors).

blocking = True

status: str = 'waiting'

message: str = "The job is waiting in predictor's queue."

class benchstab.utils.status.Processing(status: str = 'processing', message: str = 'The job request is being currently processed.')[source]

Bases: _Status

This status represents that the job has not been queued to the predictor yet, but the predictor is processing the payload. It is usually associated with predictors that require an intermediate step (e.g., multiple POST requests) before the job is queued.

blocking = True

status: str = 'processing'

message: str = 'The job request is being currently processed.'

class benchstab.utils.status.Finished(status: str = 'finished', message: str = 'The job has finished successfully.')[source]

Bases: _Status

This status represents that the job has finished successfully.

status: str = 'finished'

message: str = 'The job has finished successfully.'

class benchstab.utils.status.Failed(status: str = 'failed', message: str = 'The job has failed for unknown reasons.')[source]

Bases: _Status

This status represents that the job has failed for unknown (other) reasons. It is a good practice to pass the reason for the failure (exception) in the message.

status: str = 'failed'

message: str = 'The job has failed for unknown reasons.'

class benchstab.utils.status.ParsingFailed(status: str = 'parsing failed', message: str = 'The job has failed during data parsing.')[source]

Bases: _Status

This status represents that the job has failed during data parsing. The failure can occur in HTML processing or in data manipulation such as indexing or slicing. It is a good practice to pass the reason for the failure (exception) in the message.

status: str = 'parsing failed'

message: str = 'The job has failed during data parsing.'

class benchstab.utils.status.ConnectionFailed(status: str = 'connection failed', message: str = 'The job has failed during connection.')[source]

Bases: _Status

This status represents that the job has failed during network communication with the predictor. This often means that the predictor is down, the network connection is unstable or that the predictor has moved to a different URL.

status: str = 'connection failed'

message: str = 'The job has failed during connection.'

class benchstab.utils.status.AuthenticationFailed(status: str = 'authentication failed', message: str = 'The job has failed during authentication.')[source]

Bases: _Status

This status represents failed attempts to authenticate to the predictor.

status: str = 'authentication failed'

message: str = 'The job has failed during authentication.'

class benchstab.utils.status.PredictorNotAvailable(status: str = 'predictor not available', message: str = 'The predictor is not available.')[source]

Bases: _Status

This status represents that the predictor is not available.

status: str = 'predictor not available'

message: str = 'The predictor is not available.'

class benchstab.utils.status.Timeout(status: str = 'timeout', message: str = 'The job has timed out.')[source]

Bases: _Status

This status represents that the job has timed out.

status: str = 'timeout'

message: str = 'The job has timed out.'
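The status classes share one shape: a status string, a message, and a class-level blocking flag. The sketch below reproduces that shape with dataclasses. The printed representations elsewhere on this page suggest dataclasses, but that is an assumption, and Status, Waiting and Failed here are local stand-ins rather than the package's classes. It also shows the recommended practice of passing the failure reason in the message.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class Status:
    """Stand-in for the shared base: two fields plus a class-level flag."""
    status: str = ""
    message: str = ""
    blocking: ClassVar[bool] = False


@dataclass
class Waiting(Status):
    status: str = "waiting"
    message: str = "The job is waiting in predictor's queue."
    blocking: ClassVar[bool] = True


@dataclass
class Failed(Status):
    status: str = "failed"
    message: str = "The job has failed for unknown reasons."


def failed_from(exc: Exception) -> Failed:
    """Build a Failed status that carries the reason for the failure."""
    return Failed(message=f"{type(exc).__name__}: {exc}")


queued = Waiting()
failure = failed_from(TimeoutError("no reply after 30 s"))
```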

Amino Acids

class benchstab.utils.aminoacids.Mapper[source]

Bases: object

Aminoacid mapper class. It maps aminoacid one letter code to three letter code and vice versa. It also provides information about aminoacid properties. The information is based on the IMGT Aide-memoire for aminoacids.

Url: https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/IMGTclasses.html

Full table of aminoacid properties is as follows:

	Three letters	One letter	Polarity	Charge	Chemical property	Volume	Hydropathy
0	ALA	A	Non-Polar	Uncharged	Aliphatic	Very small	Hydrophobic
1	ARG	R	Polar	Positive	Basic	Large	Hydrophilic
2	ASN	N	Polar	Uncharged	Amide	Small	Hydrophilic
3	ASP	D	Polar	Negative	Acidic	Small	Hydrophilic
4	ASX	B	Polar	Uncharged	Aliphatic	Medium	Hydrophilic
5	CYS	C	Non-Polar	Uncharged	Sulfur	Small	Hydrophobic
6	GLU	E	Polar	Negative	Acidic	Medium	Hydrophilic
7	GLN	Q	Polar	Uncharged	Amide	Medium	Hydrophilic
8	GLY	G	Non-Polar	Uncharged	Aliphatic	Very small	Neutral
9	HIS	H	Polar	Positive	Basic	Medium	Neutral
10	ILE	I	Non-Polar	Uncharged	Aliphatic	Large	Hydrophobic
11	LEU	L	Non-Polar	Uncharged	Aliphatic	Large	Hydrophobic
12	LYS	K	Non-Polar	Positive	Basic	Large	Hydrophilic
13	MET	M	Non-Polar	Uncharged	Sulfur	Large	Hydrophobic
14	PHE	F	Non-Polar	Uncharged	Aromatic	Very large	Hydrophobic
15	PRO	P	Non-Polar	Uncharged	Aliphatic	Small	Neutral
16	SER	S	Polar	Uncharged	Hydroxyl	Very small	Neutral
17	THR	T	Polar	Uncharged	Hydroxyl	Small	Neutral
18	TRP	W	Non-Polar	Uncharged	Aromatic	Very large	Hydrophobic
19	TYR	Y	Non-Polar	Uncharged	Aromatic	Very large	Neutral
20	VAL	V	Non-Polar	Uncharged	Aliphatic	Medium	Hydrophobic

classmethod three_to_one_letter(aminoacid: str) → str[source]

Converts three letter aminoacid code to one letter aminoacid code.

Parameters:: aminoacid (str) – three letter aminoacid code
Returns:: one letter aminoacid code
Return type:: str

classmethod one_to_three_letter(aminoacid: str) → str[source]

Converts one letter aminoacid code to three letter aminoacid code.

Parameters:: aminoacid (str) – one letter aminoacid code
Returns:: three letter aminoacid code
Return type:: str

classmethod get_polarity(aminoacid: str)[source]

Returns aminoacid polarity.

Parameters:: aminoacid (str) – one letter aminoacid code
Returns:: aminoacid polarity
Return type:: str

classmethod get_charge(aminoacid: str)[source]

Returns aminoacid charge.

Parameters:: aminoacid (str) – one letter aminoacid code
Returns:: aminoacid charge
Return type:: str

classmethod get_chemical_properties(aminoacid: str)[source]

Returns aminoacid chemical properties.

Parameters:: aminoacid (str) – one letter aminoacid code
Returns:: aminoacid chemical properties
Return type:: str

classmethod get_volume_size(aminoacid: str)[source]

Returns aminoacid volume size.

Parameters:: aminoacid (str) – one letter aminoacid code
Returns:: aminoacid volume size
Return type:: str

classmethod get_hydropathy(aminoacid: str)[source]

Returns aminoacid hydropathy.

Parameters:: aminoacid (str) – one letter aminoacid code
Returns:: aminoacid hydropathy
Return type:: str
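The lookups above amount to dictionary access keyed by the one-letter or three-letter code. The sketch below rebuilds a small part of the mapping from four rows of the table. The module-level dictionaries, the free functions and the upper-casing of input are illustrative assumptions and say nothing about how Mapper is actually implemented.

```python
# (three letters, one letter, polarity, charge) copied from four table rows.
ROWS = [
    ("ALA", "A", "Non-Polar", "Uncharged"),
    ("ARG", "R", "Polar", "Positive"),
    ("ASP", "D", "Polar", "Negative"),
    ("GLY", "G", "Non-Polar", "Uncharged"),
]

THREE_TO_ONE = {three: one for three, one, _, _ in ROWS}
ONE_TO_THREE = {one: three for three, one, _, _ in ROWS}
CHARGE = {one: charge for _, one, _, charge in ROWS}


def three_to_one_letter(aminoacid: str) -> str:
    """Convert a three-letter code to its one-letter code."""
    return THREE_TO_ONE[aminoacid.upper()]  # upper-casing is an assumption


def one_to_three_letter(aminoacid: str) -> str:
    """Convert a one-letter code to its three-letter code."""
    return ONE_TO_THREE[aminoacid.upper()]


def get_charge(aminoacid: str) -> str:
    """Return the charge class for a one-letter code."""
    return CHARGE[aminoacid.upper()]


one = three_to_one_letter("Arg")
three = one_to_three_letter("d")
charge = get_charge("R")
```

In this sketch a code missing from the table raises KeyError; how the real class reports unknown codes is not stated on this page.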

Dataset

class benchstab.utils.dataset.DatasetRow(data=None, index=None, dtype: Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | lib.NoDefault = <no_default>)[source]

Bases: Series

property _constructor: Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim: Used when a manipulation result has one higher dimension than the original, such as Series.to_frame()

property fasta

Get fasta sequence from the dataset’s row.

Returns:: fasta sequence
Return type:: str

property ph

Get pH from the dataset’s row.

Returns:: pH
Return type:: str

property temperature

Get temperature from the dataset’s row.

Returns:: temperature
Return type:: str

property mutation

Get mutation from the dataset’s row.

Returns:: mutation
Return type:: str

property fasta_mutation

Get fasta mutation from the dataset’s row.

Returns:: fasta mutation
Return type:: str

class benchstab.utils.dataset.PredictorDataset(*args, **kwargs)[source]

Bases: DataFrame

Wrapper around pandas.DataFrame with additional methods.

blocking_statuses = [Authenticaton(status='authentication', message='The job is being authenticated.'), Waiting(status='waiting', message="The job is waiting in predictor's queue."), Processing(status='processing', message='The job request is being currently processed.')]

update_statuses = [Authenticaton(status='authentication', message='The job is being authenticated.'), Waiting(status='waiting', message="The job is waiting in predictor's queue."), Processing(status='processing', message='The job request is being currently processed.'), NotStarted(status='not started', message='The job has not started yet.')]

classmethod concat(*args, **kwargs)[source]: Wrapper around pandas.concat, casting it back to PredictorDataset.

property _constructor: Used when a manipulation result has the same dimensions as the original.

property _constructor_sliced

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values– they need not be the same length. The result index will be the sorted union of the two indexes.

Parameters

data (array-like, Iterable, dict, or scalar value) – Contains data stored in Series. If data is a dict, argument order is maintained.
index (array-like or Index (1d)) – Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
dtype (str, numpy.dtype, or ExtensionDtype, optional) – Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
name (Hashable, default None) – The name to give to the Series.
copy (bool, default False) – Copy input data. Only affects Series or 1d ndarray input. See examples.

Notes

Please reference the User Guide for more information.

Examples

Constructing Series from a dictionary with an Index specified

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])
>>> ser
a   1
b   2
c   3
dtype: int64

The keys of the dictionary match with the Index values, hence the Index values have no effect.

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64

Note that the Index is first built with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result.

Constructing Series from a list with copy=False.

>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64

Due to input data type the Series has a copy of the original data even though copy=False, so the data is unchanged.

Constructing Series from a 1d ndarray with copy=False.

>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999,   2])
>>> ser
0    999
1      2
dtype: int64

Due to input data type the Series has a view on the original data, so the data is changed as well.

property logger: Getter for the logger.

start_timer(index)[source]

Start the timer for the prediction.

Parameters:: index (int) – index of the row

update_status(index, status)[source]

Update status of the prediction. If prediction succeeded/failed, stop the timer.

Parameters:

index (int) – index of the row to update
status (str) – new status

format_to_output(verbose: int = 0)[source]

Format the dataset to output format.

Parameters:: verbose (int) – verbosity level
Returns:: formatted dataset
Return type:: PredictorDataset

is_blocking_status(index)[source]

Check if the status is blocking.

Parameters:: index (int) – index of the row
Returns:: True if the status is blocking, False otherwise
Return type:: bool
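How start_timer, update_status and is_blocking_status fit together can be sketched without pandas, as below. JobTable is a hypothetical stand-in for the dataset's status and timing columns. The blocking set is taken from blocking_statuses above, while the rule for when the timer stops (any status that is neither blocking nor "not started") is an assumption.

```python
import time
from typing import Dict, List

# Status strings of the classes listed in blocking_statuses.
BLOCKING = {"authentication", "waiting", "processing"}


class JobTable:
    """Hypothetical stand-in for the status and timer columns of a dataset."""

    def __init__(self, size: int) -> None:
        self.status: List[str] = ["not started"] * size
        self.started: Dict[int, float] = {}
        self.elapsed: Dict[int, float] = {}

    def start_timer(self, index: int) -> None:
        self.started[index] = time.monotonic()

    def update_status(self, index: int, status: str) -> None:
        self.status[index] = status
        # Assumed rule: a status that is neither blocking nor "not started"
        # is terminal, so the timer for that row stops.
        terminal = status not in BLOCKING and status != "not started"
        if terminal and index in self.started:
            self.elapsed[index] = time.monotonic() - self.started.pop(index)

    def is_blocking_status(self, index: int) -> bool:
        return self.status[index] in BLOCKING


jobs = JobTable(2)
jobs.start_timer(0)
jobs.update_status(0, "waiting")
still_waiting = jobs.is_blocking_status(0)
jobs.update_status(0, "finished")
done = not jobs.is_blocking_status(0)
```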

get_sequence(index)[source]

Get fasta sequence from the dataset.

Parameters:: index (int) – index of the row
Returns:: fasta sequence
Return type:: str