Utilities

HTML Parser

class benchstab.utils.html_parser.HTMLParser[source]

Bases: object

with_xpath(xpath: str, html: str | None = None, root=None, index: int = 0, permissive=True)[source]

Parse HTML with XPath expression and return the result. If index is not None, return the value at the index.

Parameters:
  • xpath – XPath expression

  • html – HTML string

  • root – lxml.html

  • index – int

  • permissive – bool

Returns:

str or list

Raises:

HTMLParserError

with_pandas(html: str, index: int = 0, pandas_args: Dict[Any, Any] | None = None, permissive=True)[source]

Parse HTML with pandas and return the result. If index is not None, return the value at the index.

Parameters:
  • html – HTML string

  • index – int

  • pandas_args – dict

  • permissive – bool

Returns:

pd.DataFrame or list

Raises:

HTMLParserError

__check_enough_values(result, index, permissive)
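
The index and permissive parameters above can be illustrated with a small stand-in. This is only a sketch under assumptions, not the benchstab implementation: it uses the standard library's xml.etree.ElementTree instead of lxml, the helper name pick_result is hypothetical, and the guess that permissive=True returns None when too few values match (while permissive=False raises HTMLParserError) is an assumption that the documentation above does not confirm.

```python
import xml.etree.ElementTree as ET


class HTMLParserError(Exception):
    """Local stand-in for benchstab.utils.exceptions.HTMLParserError."""


def pick_result(matches, index=0, permissive=True):
    """Return matches[index], or the whole list when index is None.

    Assumed behaviour: with too few matches, return None when permissive
    is true and raise HTMLParserError otherwise.
    """
    if index is None:
        return matches
    if index >= len(matches):
        if permissive:
            return None
        raise HTMLParserError(
            f"expected at least {index + 1} matches, got {len(matches)}"
        )
    return matches[index]


# A well-formed snippet, so the standard-library XML parser can read it.
doc = ET.fromstring("<html><body><td>1.5</td><td>-0.3</td></body></html>")
cells = [cell.text for cell in doc.findall(".//td")]

first = pick_result(cells, index=0)                      # single value
everything = pick_result(cells, index=None)              # full list
missing = pick_result(cells, index=5, permissive=True)   # too few matches
```

Calling pick_result(cells, index=5, permissive=False) raises the stand-in HTMLParserError in this sketch.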

Exceptions

exception benchstab.utils.exceptions._BaseError(*args: object, **kwargs: object)[source]

Bases: Exception

exception benchstab.utils.exceptions.HTMLParserError(*args: object, **kwargs: object)[source]

Bases: _BaseError

exception benchstab.utils.exceptions.PreprocessorError(*args: object, **kwargs: object)[source]

Bases: _BaseError

exception benchstab.utils.exceptions.PredictorError(*args: object, **kwargs: object)[source]

Bases: _BaseError

exception benchstab.utils.exceptions.DatasetError(*args: object, **kwargs: object)[source]

Bases: _BaseError

exception benchstab.utils.exceptions.BenchStabError(*args: object, **kwargs: object)[source]

Bases: _BaseError

Structure

class benchstab.utils.structure.File(file: str = '', name: str = '')[source]

Bases: object

Base class for all file types. Contains methods for converting the file to different formats, as well as methods for fetching files from URLs and opening files.

_rcsb_url = ''
_uniprot_url = ''
__to_multipart(content_type: str) Dict[str, str | bytes | StringIO]

Convert the file to a multipart (form-data) file. If the file does not exist, create a temporary file. The final format depends on the content type.

Parameters:

content_type (str) – content type of the file. Allowed types:
  • text/plain

  • application/octet-stream

Returns:

multipart file

Return type:

dict

to_bytes() bytes[source]

Convert the file to BytesIO.

Returns:

file as bytes

Return type:

BytesIO object

to_octet_stream() Dict[str, str | bytes | StringIO][source]

Convert the file to an octet-stream.

Returns:

file as octet-stream

Return type:

bytes

to_plain_text() Dict[str, str | bytes | StringIO][source]

Convert the file to plain text.

Returns:

file as plain text

Return type:

str

classmethod get_from_url_by_id(url: str, **kwargs) Response[source]

Get the file from a URL by ID. Raise an exception if the file does not exist.

Parameters:
  • id (str) – ID

  • url (str) – URL

Returns:

file

Return type:

requests.Response

Raises:

PreprocessorError – if the record does not exist

classmethod open(file_path: str, mode: str = 'r', encoding: str = 'utf-8') TextIOWrapper | BufferedReader[source]

Open a file. Raise an exception if the file does not exist.

Parameters:
  • file_path (str) – file path

  • mode (str) – file mode

  • encoding (str) – file encoding

Returns:

file

Return type:

Union[TextIOWrapper, BufferedReader]

Raises:

FileNotFoundError – if the file does not exist

class benchstab.utils.structure.Fasta(sequence: str, chain: str, header: str, name: str | None = None, offsets: Dict[int, int] | None = None)[source]

Bases: File

FASTA file class. Contains methods for extracting FASTA sequences from different sources.

_sifts_url = 'https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{id}'
_rcsb_url = 'https://www.rcsb.org/fasta/entry/{id}'
_uniprot_url = 'https://rest.uniprot.org/uniprotkb/{id}.fasta'
_rcsb_polymer_instance_url = 'https://data.rcsb.org/rest/v1/core/polymer_entity_instance/{id}/{chain}'
logger = <Logger benchstab.utils.structure (INFO)>
classmethod _find_delimiter(header: str)[source]
classmethod create(datapoint: str)[source]

Determine the FASTA sequence format and return a FASTA object.

Parameters:

datapoint (str) – FASTA string or file path

Returns:

FASTA object

Return type:

Fasta

Raises:

PreprocessorError – if the sequence is invalid

classmethod from_uniprot(datapoint)[source]

Extract the FASTA sequence from the Uniprot database.

Parameters:

datapoint (str) – Uniprot ID

Returns:

FASTA object

Return type:

Fasta

classmethod __get_author_residue_number_offset(pdb_id: str, chain: str) List[str]

Get the author residue number start from the RCSB database.

Parameters:
  • pdb_id (str) – PDB ID

  • chain (str) – chain ID

Returns:

author residue number offset

Return type:

List[str]

classmethod __fasta_offsets_from_sifts(pdb_id: str, chain: str) Dict[str, str | List[str] | int]

Get the offset between the PDB and Uniprot FASTA sequences.

Parameters:
  • pdb_id (str) – PDB ID

  • chain (str) – chain ID

Returns:

Uniprot ID, PDB offsets, Uniprot start, Uniprot end

Return type:

Dict[str, Union[str, List[str], int]]

classmethod from_uniprot_by_pdb_id(pdb_id: str, chain: str)[source]

Extract the FASTA sequence from the Uniprot database.

Parameters:
  • pdb_id (str) – PDB ID

  • chain (str) – chain ID

Returns:

FASTA object

Return type:

Fasta

classmethod from_rcsb_by_pdb_id(pdb_id: str, chain: str)[source]

Extract the FASTA sequence from the RCSB PDB database.

Parameters:
  • pdb_id (str) – PDB ID

  • chain (str) – chain ID

Returns:

FASTA object

Return type:

Fasta

classmethod from_file(file_path)[source]

Extract the FASTA sequence from a FASTA file.

Parameters:

file_path (str) – FASTA file path

Returns:

FASTA object

Return type:

Fasta

classmethod from_pdb_file(file_path: str, chain: str)[source]

Extract the FASTA sequence from a PDB file.

Parameters:
  • file_path (str) – PDB file path

  • chain (str) – chain ID

Returns:

FASTA object

Return type:

Fasta

classmethod extract_chain(chain)[source]

Extract the chain ID from the FASTA header.

Parameters:

chain (str) – chain ID

Returns:

chain ID

Return type:

str
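
The URL attributes listed for the Fasta class contain {id} and {chain} placeholders. A minimal sketch of how such templates can be filled in is shown below. It assumes plain str.format substitution, which the documentation does not confirm. The helper name fill is hypothetical, and the identifiers 1CSE, E and P00698 are arbitrary example values.

```python
# URL templates copied from the Fasta class attributes listed above.
SIFTS_URL = "https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{id}"
RCSB_URL = "https://www.rcsb.org/fasta/entry/{id}"
UNIPROT_URL = "https://rest.uniprot.org/uniprotkb/{id}.fasta"
RCSB_POLYMER_INSTANCE_URL = (
    "https://data.rcsb.org/rest/v1/core/polymer_entity_instance/{id}/{chain}"
)


def fill(template: str, **fields: str) -> str:
    """Hypothetical helper: substitute named placeholders in a URL template."""
    return template.format(**fields)


rcsb_fasta = fill(RCSB_URL, id="1CSE")
uniprot_fasta = fill(UNIPROT_URL, id="P00698")
polymer_instance = fill(RCSB_POLYMER_INSTANCE_URL, id="1CSE", chain="E")
```

No network request is made here. The sketch only builds the strings that a method such as get_from_url_by_id would presumably receive.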

class benchstab.utils.structure.PDB(pdb_file: str, pdb_id: str | None = None, source: str = 'file', chains: List[str] | None = None)[source]

Bases: File

PDB file class. Contains methods for extracting PDB structures from different sources.

_rcsb_url = 'https://files.rcsb.org/download/{id}.pdb'
logger = <Logger benchstab.utils.structure (INFO)>
classmethod create(pdb: str)[source]

Determine the PDB structure format and return a PDB object.

Parameters:

pdb (str) – PDB structure or file path

Returns:

PDB object

Return type:

PDB

classmethod from_id(pdb_id: str)[source]

Fetch the PDB record from the RCSB PDB database, process it and return a PDB object.

Parameters:

pdb_id (str) – PDB ID

Returns:

PDB object

Return type:

PDB

classmethod from_file(file_handle: str | StringIO)[source]

Process the PDB file and return a PDB object. Log a warning if the structure is faulty by BioPython standards. Raise an exception if the structure is invalid by BioPython standards.

Parameters:

file_handle (Union[str, StringIO]) – PDB file handle

Returns:

PDB object

Return type:

PDB

Raises:

PreprocessorError – if the structure is invalid by BioPython standards

Status

class benchstab.utils.status._Status(status: str = '', message: str = '')[source]

Bases: object

blocking = False
status: str = ''
message: str = ''
class benchstab.utils.status.NotStarted(status: str = 'not started', message: str = 'The job has not started yet.')[source]

Bases: _Status

This status represents the initial state of the job.

blocking = False
status: str = 'not started'
message: str = 'The job has not started yet.'
class benchstab.utils.status.Authenticaton(status: str = 'authentication', message: str = 'The job is being authenticated.')[source]

Bases: _Status

This status represents that the user is being authenticated to the predictor.

blocking = True
status: str = 'authentication'
message: str = 'The job is being authenticated.'
class benchstab.utils.status.Waiting(status: str = 'waiting', message: str = "The job is waiting in predictor's queue.")[source]

Bases: _Status

This status represents that the job is waiting in the predictor’s queue (in case of POST-GET predictors), or that the job is waiting for the response (in case of POST predictors).

blocking = True
status: str = 'waiting'
message: str = "The job is waiting in predictor's queue."
class benchstab.utils.status.Processing(status: str = 'processing', message: str = 'The job request is being currently processed.')[source]

Bases: _Status

This status represents that the job has not been queued to the predictor yet, but the predictor is processing the payload. It is usually associated with predictors that require an intermediate step (e.g., multiple POST requests) before the job is queued.

blocking = True
status: str = 'processing'
message: str = 'The job request is being currently processed.'
class benchstab.utils.status.Finished(status: str = 'finished', message: str = 'The job has finished successfully.')[source]

Bases: _Status

This status represents that the job has finished successfully.

status: str = 'finished'
message: str = 'The job has finished successfully.'
class benchstab.utils.status.Failed(status: str = 'failed', message: str = 'The job has failed for unknown reasons.')[source]

Bases: _Status

This status represents that the job has failed for unknown (other) reasons. It is a good practice to pass the reason for the failure (exception) in the message.

status: str = 'failed'
message: str = 'The job has failed for unknown reasons.'
class benchstab.utils.status.ParsingFailed(status: str = 'parsing failed', message: str = 'The job has failed during data parsing.')[source]

Bases: _Status

This status represents that the job has failed during data parsing. This can be HTML processing or data manipulation (indexing, slicing, etc.). It is a good practice to pass the reason for the failure (exception) in the message.

status: str = 'parsing failed'
message: str = 'The job has failed during data parsing.'
class benchstab.utils.status.ConnectionFailed(status: str = 'connection failed', message: str = 'The job has failed during connection.')[source]

Bases: _Status

This status represents that the job has failed during network communication with the predictor. This often means that the predictor is down, the network connection is unstable or that the predictor has moved to a different URL.

status: str = 'connection failed'
message: str = 'The job has failed during connection.'
class benchstab.utils.status.AuthenticationFailed(status: str = 'authentication failed', message: str = 'The job has failed during authentication.')[source]

Bases: _Status

This status represents failed attempts to authenticate to the predictor.

status: str = 'authentication failed'
message: str = 'The job has failed during authentication.'
class benchstab.utils.status.PredictorNotAvailable(status: str = 'predictor not available', message: str = 'The predictor is not available.')[source]

Bases: _Status

This status represents that the predictor is not available.

status: str = 'predictor not available'
message: str = 'The predictor is not available.'
class benchstab.utils.status.Timeout(status: str = 'timeout', message: str = 'The job has timed out.')[source]

Bases: _Status

This status represents that the job has timed out.

status: str = 'timeout'
message: str = 'The job has timed out.'
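
The status classes above share two string fields and a class-level blocking flag. The sketch below mirrors three of them with dataclasses to show how the flag can drive a polling decision. Using dataclasses is an assumption, the poll helper is hypothetical, and the classes are local stand-ins, not imports from benchstab.utils.status.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class _Status:
    status: str = ""
    message: str = ""
    # Class-level flag, not a constructor argument.
    blocking: ClassVar[bool] = False


@dataclass
class Waiting(_Status):
    status: str = "waiting"
    message: str = "The job is waiting in predictor's queue."
    blocking: ClassVar[bool] = True


@dataclass
class Finished(_Status):
    status: str = "finished"
    message: str = "The job has finished successfully."


@dataclass
class Failed(_Status):
    status: str = "failed"
    message: str = "The job has failed for unknown reasons."


def poll(states):
    """Hypothetical helper: return the first state that is not blocking."""
    for state in states:
        if not state.blocking:
            return state
    return None


# Pass the failure reason in the message, as the Failed description recommends.
history = [Waiting(), Waiting(), Failed(message="HTTP 500 from predictor")]
final = poll(history)
```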

Amino Acids

class benchstab.utils.aminoacids.Mapper[source]

Bases: object

Amino acid mapper class. It maps one-letter amino acid codes to three-letter codes and vice versa. It also provides information about amino acid properties. The information is based on the IMGT Aide-memoire for amino acids.

Url: https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/IMGTclasses.html

Full table of aminoacid properties is as follows:

Index  Three letters  One letter  Polarity   Charge     Chemical properties  Volume      Hydropathy
0      ALA            A           Non-Polar  Uncharged  Aliphatic            Very small  Hydrophobic
1      ARG            R           Polar      Positive   Basic                Large       Hydrophilic
2      ASN            N           Polar      Uncharged  Amide                Small       Hydrophilic
3      ASP            D           Polar      Negative   Acidic               Small       Hydrophilic
4      ASX            B           Polar      Uncharged  Aliphatic            Medium      Hydrophilic
5      CYS            C           Non-Polar  Uncharged  Sulfur               Small       Hydrophobic
6      GLU            E           Polar      Negative   Acidic               Medium      Hydrophilic
7      GLN            Q           Polar      Uncharged  Amide                Medium      Hydrophilic
8      GLY            G           Non-Polar  Uncharged  Aliphatic            Very small  Neutral
9      HIS            H           Polar      Positive   Basic                Medium      Neutral
10     ILE            I           Non-Polar  Uncharged  Aliphatic            Large       Hydrophobic
11     LEU            L           Non-Polar  Uncharged  Aliphatic            Large       Hydrophobic
12     LYS            K           Non-Polar  Positive   Basic                Large       Hydrophilic
13     MET            M           Non-Polar  Uncharged  Sulfur               Large       Hydrophobic
14     PHE            F           Non-Polar  Uncharged  Aromatic             Very large  Hydrophobic
15     PRO            P           Non-Polar  Uncharged  Aliphatic            Small       Neutral
16     SER            S           Polar      Uncharged  Hydroxyl             Very small  Neutral
17     THR            T           Polar      Uncharged  Hydroxyl             Small       Neutral
18     TRP            W           Non-Polar  Uncharged  Aromatic             Very large  Hydrophobic
19     TYR            Y           Non-Polar  Uncharged  Aromatic             Very large  Neutral
20     VAL            V           Non-Polar  Uncharged  Aliphatic            Medium      Hydrophobic

classmethod three_to_one_letter(aminoacid: str) str[source]

Converts three letter aminoacid code to one letter aminoacid code.

Parameters:

aminoacid (str) – three letter aminoacid code

Returns:

one letter aminoacid code

Return type:

str

classmethod one_to_three_letter(aminoacid: str) str[source]

Converts one letter aminoacid code to three letter aminoacid code.

Parameters:

aminoacid (str) – one letter aminoacid code

Returns:

three letter aminoacid code

Return type:

str

classmethod get_polarity(aminoacid: str)[source]

Returns aminoacid polarity.

Parameters:

aminoacid (str) – one letter aminoacid code

Returns:

aminoacid polarity

Return type:

str

classmethod get_charge(aminoacid: str)[source]

Returns aminoacid charge.

Parameters:

aminoacid (str) – one letter aminoacid code

Returns:

aminoacid charge

Return type:

str

classmethod get_chemical_properties(aminoacid: str)[source]

Returns aminoacid chemical properties.

Parameters:

aminoacid (str) – one letter aminoacid code

Returns:

aminoacid chemical properties

Return type:

str

classmethod get_volume_size(aminoacid: str)[source]

Returns aminoacid volume size.

Parameters:

aminoacid (str) – one letter aminoacid code

Returns:

aminoacid volume size

Return type:

str

classmethod get_hydropathy(aminoacid: str)[source]

Returns aminoacid hydropathy.

Parameters:

aminoacid (str) – one letter aminoacid code

Returns:

aminoacid hydropathy

Return type:

str
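
The Mapper methods can be pictured as lookups into the property table for this class. The following sketch uses only a handful of rows from that table and plain dictionaries. The internal data layout, the upper-casing of the input, and the exact strings returned by the real class are assumptions.

```python
# A few rows from the property table, keyed by three-letter code:
# (one letter, polarity, charge, chemical property, volume, hydropathy)
AMINO_ACIDS = {
    "ALA": ("A", "Non-Polar", "Uncharged", "Aliphatic", "Very small", "Hydrophobic"),
    "ARG": ("R", "Polar", "Positive", "Basic", "Large", "Hydrophilic"),
    "ASP": ("D", "Polar", "Negative", "Acidic", "Small", "Hydrophilic"),
    "GLY": ("G", "Non-Polar", "Uncharged", "Aliphatic", "Very small", "Neutral"),
    "TRP": ("W", "Non-Polar", "Uncharged", "Aromatic", "Very large", "Hydrophobic"),
}

# Reverse index: one-letter code -> three-letter code.
ONE_TO_THREE = {row[0]: three for three, row in AMINO_ACIDS.items()}


def three_to_one_letter(aminoacid: str) -> str:
    return AMINO_ACIDS[aminoacid.upper()][0]


def one_to_three_letter(aminoacid: str) -> str:
    return ONE_TO_THREE[aminoacid.upper()]


def get_charge(aminoacid: str) -> str:
    # Takes a one-letter code, like the documented property getters.
    return AMINO_ACIDS[one_to_three_letter(aminoacid)][2]


def get_hydropathy(aminoacid: str) -> str:
    return AMINO_ACIDS[one_to_three_letter(aminoacid)][5]


one = three_to_one_letter("ARG")
three = one_to_three_letter("W")
charge = get_charge("D")
hydropathy = get_hydropathy("G")
```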

Dataset

class benchstab.utils.dataset.DatasetRow(data=None, index=None, dtype: Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | lib.NoDefault = <no_default>)[source]

Bases: Series

property _constructor

Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim

Used when a manipulation result has one higher dimension than the original, such as Series.to_frame()

property fasta

Get fasta sequence from the dataset’s row.

Returns:

fasta sequence

Return type:

str

property ph

Get pH from the dataset’s row.

Returns:

pH

Return type:

str

property temperature

Get temperature from the dataset’s row.

Returns:

temperature

Return type:

str

property mutation

Get mutation from the dataset’s row.

Returns:

mutation

Return type:

str

property fasta_mutation

Get fasta mutation from the dataset’s row.

Returns:

fasta mutation

Return type:

str

class benchstab.utils.dataset.PredictorDataset(*args, **kwargs)[source]

Bases: DataFrame

Wrapper around pandas.DataFrame with additional methods.

blocking_statuses = [Authenticaton(status='authentication', message='The job is being authenticated.'), Waiting(status='waiting', message="The job is waiting in predictor's queue."), Processing(status='processing', message='The job request is being currently processed.')]
update_statuses = [Authenticaton(status='authentication', message='The job is being authenticated.'), Waiting(status='waiting', message="The job is waiting in predictor's queue."), Processing(status='processing', message='The job request is being currently processed.'), NotStarted(status='not started', message='The job has not started yet.')]
classmethod concat(*args, **kwargs)[source]

Wrapper around pandas.concat, casting it back to PredictorDataset.

property _constructor

Used when a manipulation result has the same dimensions as the original.

property _constructor_sliced

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values– they need not be the same length. The result index will be the sorted union of the two indexes.

Parameters

data : array-like, Iterable, dict, or scalar value

Contains data stored in Series. If data is a dict, argument order is maintained.

index : array-like or Index (1d)

Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.

dtype : str, numpy.dtype, or ExtensionDtype, optional

Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.

name : Hashable, default None

The name to give to the Series.

copy : bool, default False

Copy input data. Only affects Series or 1d ndarray input. See examples.

Notes

Please reference the User Guide for more information.

Examples

Constructing Series from a dictionary with an Index specified

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])
>>> ser
a   1
b   2
c   3
dtype: int64

The keys of the dictionary match with the Index values, hence the Index values have no effect.

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64

Note that the Index is first built with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result.

Constructing Series from a list with copy=False.

>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64

Due to input data type the Series has a copy of the original data even though copy=False, so the data is unchanged.

Constructing Series from a 1d ndarray with copy=False.

>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999,   2])
>>> ser
0    999
1      2
dtype: int64

Due to input data type the Series has a view on the original data, so the data is changed as well.

property logger

Getter for the logger.

start_timer(index)[source]

Start the timer for the prediction.

Parameters:

index (int) – index of the row

update_status(index, status)[source]

Update status of the prediction. If prediction succeeded/failed, stop the timer.

Parameters:
  • index (int) – index of the row to update

  • status (str) – new status

format_to_output(verbose: int = 0)[source]

Format the dataset to output format.

Parameters:

verbose (int) – verbosity level

Returns:

formatted dataset

Return type:

PredictorDataset

is_blocking_status(index)[source]

Check if the status is blocking.

Parameters:

index (int) – index of the row

Returns:

True if the status is blocking, False otherwise

Return type:

bool

get_sequence(index)[source]

Get fasta sequence from the dataset.

Parameters:

index (int) – index of the row

Returns:

fasta sequence

Return type:

str
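
The interplay of start_timer, update_status and is_blocking_status can be sketched with plain dictionaries instead of a pandas.DataFrame subclass. The status names come from blocking_statuses and update_statuses above. Treating every status outside update_statuses as final (and therefore stopping the timer) is an assumption, as are the row layout and the extra rows argument.

```python
import time

# Status names taken from blocking_statuses and update_statuses above.
BLOCKING = {"authentication", "waiting", "processing"}
UPDATABLE = BLOCKING | {"not started"}


def new_row():
    """Hypothetical row layout: status plus timer bookkeeping."""
    return {"status": "not started", "started": None, "elapsed": None}


def start_timer(rows, index):
    rows[index]["started"] = time.monotonic()


def update_status(rows, index, status):
    row = rows[index]
    row["status"] = status
    # Assumption: any status outside UPDATABLE is final and stops the timer.
    if status not in UPDATABLE and row["started"] is not None:
        row["elapsed"] = time.monotonic() - row["started"]


def is_blocking_status(rows, index):
    return rows[index]["status"] in BLOCKING


rows = [new_row()]
start_timer(rows, 0)
update_status(rows, 0, "waiting")
blocked_while_waiting = is_blocking_status(rows, 0)
update_status(rows, 0, "finished")
blocked_when_finished = is_blocking_status(rows, 0)
```

After the last call the row carries an elapsed time, while the intermediate waiting status leaves the timer running.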