Utilities
HTML Parser
- class benchstab.utils.html_parser.HTMLParser
Bases:
object
- with_xpath(xpath: str, html: str | None = None, root=None, index: int = 0, permissive=True)
Parse HTML with an XPath expression and return the result. If index is not None, return only the value at that index.
- Parameters:
xpath – XPath expression
html – HTML string
root – lxml.html
index – int
permissive – bool
- Returns:
str or list
- Raises:
HTMLParserError
- with_pandas(html: str, index: int = 0, pandas_args: Dict[Any, Any] | None = None, permissive=True)
Parse HTML with pandas and return the result. If index is not None, return only the value at that index.
- Parameters:
html – HTML string
index – int
pandas_args – dict
permissive – bool
- Returns:
pd.DataFrame or list
- Raises:
HTMLParserError
- __check_enough_values(result, index, permissive)
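The index handling described for with_xpath and with_pandas can be sketched with the standard library alone. This is an illustration, not the library's implementation: the helper name select, the local HTMLParserError stand-in, and above all the meaning given to permissive (return None instead of raising when too few values are found) are assumptions, since the reference above does not spell that behavior out.

```python
import xml.etree.ElementTree as ET


class HTMLParserError(Exception):
    """Local stand-in for benchstab.utils.exceptions.HTMLParserError."""


def select(results, index=0, permissive=True):
    """Hypothetical helper: pick one value by index, or return all values if index is None."""
    if index is None:
        return results
    if index >= len(results):
        if permissive:
            # Assumption: permissive mode yields no value instead of raising.
            return None
        raise HTMLParserError(
            f"needed a value at index {index}, found only {len(results)}"
        )
    return results[index]


# Well-formed markup, so the standard-library XML parser can read it.
page = "<table><tr><td>-1.2</td><td>0.4</td></tr></table>"
cells = [td.text for td in ET.fromstring(page).findall(".//td")]

first = select(cells)                    # default index=0
everything = select(cells, index=None)   # index=None returns the whole list
missing = select(cells, index=5)         # permissive: no exception is raised
```

With permissive=False the same out-of-range lookup raises the stand-in HTMLParserError.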
Exceptions
- exception benchstab.utils.exceptions._BaseError(*args: object, **kwargs: object)
Bases:
Exception
- exception benchstab.utils.exceptions.HTMLParserError(*args: object, **kwargs: object)
Bases:
_BaseError
- exception benchstab.utils.exceptions.PreprocessorError(*args: object, **kwargs: object)
Bases:
_BaseError
- exception benchstab.utils.exceptions.PredictorError(*args: object, **kwargs: object)
Bases:
_BaseError
- exception benchstab.utils.exceptions.DatasetError(*args: object, **kwargs: object)
Bases:
_BaseError
- exception benchstab.utils.exceptions.BenchStabError(*args: object, **kwargs: object)
Bases:
_BaseError
Structure
- class benchstab.utils.structure.File(file: str = '', name: str = '')
Bases:
object
Base class for all file types. Contains methods for converting the file to different formats, as well as methods for fetching files from URLs and opening files.
- _rcsb_url = ''
- _uniprot_url = ''
- __to_multipart(content_type: str) → Dict[str, str | bytes | StringIO]
Convert the file to a multipart (form-data) file. If the file does not exist, create a temporary file. The final format depends on the content type.
- Parameters:
content_type (str) – content type of the file. Allowed types: text/plain, application/octet-stream
- Returns:
multipart file
- Return type:
dict
- to_bytes() → bytes
Convert the file to BytesIO.
- Returns:
file as bytes
- Return type:
BytesIO object
- to_octet_stream() → Dict[str, str | bytes | StringIO]
Convert the file to an octet-stream.
- Returns:
file as octet-stream
- Return type:
bytes
- to_plain_text() → Dict[str, str | bytes | StringIO]
Convert the file to plain text.
- Returns:
file as plain text
- Return type:
str
- classmethod get_from_url_by_id(url: str, **kwargs) → Response
Get the file from a URL by ID. Raise an exception if the file does not exist.
- Parameters:
id (str) – ID
url (str) – URL
- Returns:
file
- Return type:
requests.Response
- Raises:
PreprocessorError – if the record does not exist
- classmethod open(file_path: str, mode: str = 'r', encoding: str = 'utf-8') → TextIOWrapper | BufferedReader
Open a file. Raise an exception if the file does not exist.
- Parameters:
file_path (str) – file path
mode (str) – file mode
encoding (str) – file encoding
- Returns:
file
- Return type:
Union[TextIOWrapper, BufferedReader]
- Raises:
FileNotFoundError – if the file does not exist
- class benchstab.utils.structure.Fasta(sequence: str, chain: str, header: str, name: str | None = None, offsets: Dict[int, int] | None = None)
Bases:
File
FASTA file class. Contains methods for extracting FASTA sequences from different sources.
- _sifts_url = 'https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{id}'
- _rcsb_url = 'https://www.rcsb.org/fasta/entry/{id}'
- _uniprot_url = 'https://rest.uniprot.org/uniprotkb/{id}.fasta'
- _rcsb_polymer_instance_url = 'https://data.rcsb.org/rest/v1/core/polymer_entity_instance/{id}/{chain}'
- logger = <Logger benchstab.utils.structure (INFO)>
- classmethod create(datapoint: str)
Determine the FASTA sequence format and return a FASTA object.
- Parameters:
datapoint (str) – FASTA string or file path
- Returns:
FASTA object
- Return type:
Fasta
- Raises:
PreprocessorError – if the sequence is invalid
- classmethod from_uniprot(datapoint)
Extract the FASTA sequence from the Uniprot database.
- Parameters:
datapoint (str) – Uniprot ID
- Returns:
FASTA object
- Return type:
Fasta
- classmethod __get_author_residue_number_offset(pdb_id: str, chain: str) → List[str]
Get the author residue number start from the RCSB database.
- Parameters:
pdb_id (str) – PDB ID
chain (str) – chain ID
- Returns:
author residue number offset
- Return type:
List[str]
- classmethod __fasta_offsets_from_sifts(pdb_id: str, chain: str) → Dict[str, str | List[str] | int]
Get the offset between the PDB and Uniprot FASTA sequences.
- Parameters:
pdb_id (str) – PDB ID
chain (str) – chain ID
- Returns:
Uniprot ID, PDB offsets, Uniprot start, Uniprot end
- Return type:
Dict[str, Union[str, List[str], int]]
- classmethod from_uniprot_by_pdb_id(pdb_id: str, chain: str)
Extract the FASTA sequence from the Uniprot database.
- Parameters:
pdb_id (str) – PDB ID
chain (str) – chain ID
- Returns:
FASTA object
- Return type:
Fasta
- Raises:
PreprocessorError – if the chain is invalid for the structure
PreprocessorError – if the structure is not mapped to Uniprot
- classmethod from_rcsb_by_pdb_id(pdb_id: str, chain: str)
Extract the FASTA sequence from the RCSB PDB database.
- Parameters:
pdb_id (str) – PDB ID
chain (str) – chain ID
- Returns:
FASTA object
- Return type:
Fasta
- Raises:
PreprocessorError – if the chain is invalid for the structure
PreprocessorError – if the structure is not mapped to PDB
- classmethod from_file(file_path)
Extract the FASTA sequence from a FASTA file.
- Parameters:
file_path (str) – FASTA file path
- Returns:
FASTA object
- Return type:
- Raises:
PreprocessorError – if the structure is invalid by BioPython standards
PreprocessorError – if the file does not contain any sequences
- classmethod from_pdb_file(file_path: str, chain: str)
Extract the FASTA sequence from a PDB file.
- Parameters:
file_path (str) – PDB file path
chain (str) – chain ID
- Returns:
FASTA object
- Return type:
Fasta
- Raises:
PreprocessorError – if the structure is invalid by BioPython standards
PreprocessorError – if the chain is invalid for the structure
PreprocessorError – if the file does not contain any sequences
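The URL attributes of Fasta are templates with {id} and {chain} placeholders. A minimal sketch of filling them in follows; it assumes the placeholders are substituted with str.format, which the reference above does not state, and the identifiers 1CSE, E and P00720 are example values only. The constants and the build_url helper are hypothetical names, and no network request is made.

```python
# Templates copied from the class attributes documented above.
RCSB_FASTA_URL = "https://www.rcsb.org/fasta/entry/{id}"
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{id}.fasta"
POLYMER_INSTANCE_URL = (
    "https://data.rcsb.org/rest/v1/core/polymer_entity_instance/{id}/{chain}"
)


def build_url(template: str, **fields: str) -> str:
    """Hypothetical helper: substitute the named placeholders in a template."""
    return template.format(**fields)


rcsb = build_url(RCSB_FASTA_URL, id="1CSE")
uniprot = build_url(UNIPROT_FASTA_URL, id="P00720")
instance = build_url(POLYMER_INSTANCE_URL, id="1CSE", chain="E")
```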
- class benchstab.utils.structure.PDB(pdb_file: str, pdb_id: str | None = None, source: str = 'file', chains: List[str] | None = None)
Bases:
File
PDB file class. Contains methods for extracting PDB structures from different sources.
- _rcsb_url = 'https://files.rcsb.org/download/{id}.pdb'
- logger = <Logger benchstab.utils.structure (INFO)>
- classmethod create(pdb: str)
Determine the PDB structure format and return a PDB object.
- Parameters:
pdb (str) – PDB structure or file path
- Returns:
PDB object
- Return type:
PDB
- classmethod from_id(pdb_id: str)
Fetch the PDB record from the RCSB PDB database, process it and return a PDB object.
- Parameters:
pdb_id (str) – PDB ID
- Returns:
PDB object
- Return type:
PDB
- classmethod from_file(file_handle: str | StringIO)
Process the PDB file and return a PDB object. Log a warning if the structure is faulty by BioPython standards. Raise an exception if the structure is invalid by BioPython standards.
- Parameters:
file_handle (Union[str, StringIO]) – PDB file handle
- Returns:
PDB object
- Return type:
PDB
- Raises:
PreprocessorError – if the structure is invalid by BioPython standards
Status
- class benchstab.utils.status._Status(status: str = '', message: str = '')
Bases:
object
- blocking = False
- status: str = ''
- message: str = ''
- class benchstab.utils.status.NotStarted(status: str = 'not started', message: str = 'The job has not started yet.')
Bases:
_Status
This status represents the initial state of the job.
- blocking = False
- status: str = 'not started'
- message: str = 'The job has not started yet.'
- class benchstab.utils.status.Authenticaton(status: str = 'authentication', message: str = 'The job is being authenticated.')
Bases:
_Status
This status represents that the user is being authenticated to the predictor.
- blocking = True
- status: str = 'authentication'
- message: str = 'The job is being authenticated.'
- class benchstab.utils.status.Waiting(status: str = 'waiting', message: str = "The job is waiting in predictor's queue.")
Bases:
_Status
This status represents that the job is waiting in the predictor’s queue (in case of POST-GET predictors), or that the job is waiting for the response (in case of POST predictors).
- blocking = True
- status: str = 'waiting'
- message: str = "The job is waiting in predictor's queue."
- class benchstab.utils.status.Processing(status: str = 'processing', message: str = 'The job request is being currently processed.')
Bases:
_Status
This status represents that the job has not been queued to the predictor yet, but the predictor is processing the payload. It is usually associated with predictors that require an intermediate step (e.g., multiple POST requests) before the job is queued.
- blocking = True
- status: str = 'processing'
- message: str = 'The job request is being currently processed.'
- class benchstab.utils.status.Finished(status: str = 'finished', message: str = 'The job has finished successfully.')
Bases:
_Status
This status represents that the job has finished successfully.
- status: str = 'finished'
- message: str = 'The job has finished successfully.'
- class benchstab.utils.status.Failed(status: str = 'failed', message: str = 'The job has failed for unknown reasons.')
Bases:
_Status
This status represents that the job has failed for unknown (other) reasons. It is good practice to pass the reason for the failure (the exception) in the message.
- status: str = 'failed'
- message: str = 'The job has failed for unknown reasons.'
- class benchstab.utils.status.ParsingFailed(status: str = 'parsing failed', message: str = 'The job has failed during data parsing.')
Bases:
_Status
This status represents that the job has failed during data parsing. The failure can occur in HTML processing or in data manipulation (indexing, slicing, etc.). It is good practice to pass the reason for the failure (the exception) in the message.
- status: str = 'parsing failed'
- message: str = 'The job has failed during data parsing.'
- class benchstab.utils.status.ConnectionFailed(status: str = 'connection failed', message: str = 'The job has failed during connection.')
Bases:
_Status
This status represents that the job has failed during network communication with the predictor. This often means that the predictor is down, the network connection is unstable, or the predictor has moved to a different URL.
- status: str = 'connection failed'
- message: str = 'The job has failed during connection.'
- class benchstab.utils.status.AuthenticationFailed(status: str = 'authentication failed', message: str = 'The job has failed during authentication.')
Bases:
_Status
This status represents a failed attempt to authenticate to the predictor.
- status: str = 'authentication failed'
- message: str = 'The job has failed during authentication.'
- class benchstab.utils.status.PredictorNotAvailable(status: str = 'predictor not available', message: str = 'The predictor is not available.')
Bases:
_Status
This status represents that the predictor is not available.
- status: str = 'predictor not available'
- message: str = 'The predictor is not available.'
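The status classes pair a status string and a message with a class-level blocking flag. The sketch below mirrors that shape with local dataclasses so it can run on its own. That the library classes are dataclasses is an assumption, and the names Status and needs_polling are hypothetical. It also shows the practice recommended above of passing the failure reason in the message.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class Status:
    """Local stand-in for _Status; not the library class."""
    status: str = ""
    message: str = ""
    blocking: ClassVar[bool] = False  # class-level flag, not a constructor argument


@dataclass
class Waiting(Status):
    status: str = "waiting"
    message: str = "The job is waiting in predictor's queue."
    blocking: ClassVar[bool] = True


@dataclass
class Failed(Status):
    status: str = "failed"
    message: str = "The job has failed for unknown reasons."


def needs_polling(state: Status) -> bool:
    """A job in a blocking state is still in flight and should be checked again."""
    return state.blocking


queued = Waiting()
broken = Failed(message="HTTP 500 from predictor")  # carry the failure reason along
```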
Amino Acids
- class benchstab.utils.aminoacids.Mapper
Bases:
object
Amino acid mapper class. It maps one-letter amino acid codes to three-letter codes and vice versa. It also provides information about amino acid properties. The information is based on the IMGT Aide-memoire for amino acids.
URL: https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/IMGTclasses.html
The full table of amino acid properties is as follows:
    Three letters  One letter  Polarity   Charge     Chemical property  Volume      Hydropathy
0   ALA            A           Non-Polar  Uncharged  Aliphatic          Very small  Hydrophobic
1   ARG            R           Polar      Positive   Basic              Large       Hydrophilic
2   ASN            N           Polar      Uncharged  Amide              Small       Hydrophilic
3   ASP            D           Polar      Negative   Acidic             Small       Hydrophilic
4   ASX            B           Polar      Uncharged  Aliphatic          Medium      Hydrophilic
5   CYS            C           Non-Polar  Uncharged  Sulfur             Small       Hydrophobic
6   GLU            E           Polar      Negative   Acidic             Medium      Hydrophilic
7   GLN            Q           Polar      Uncharged  Amide              Medium      Hydrophilic
8   GLY            G           Non-Polar  Uncharged  Aliphatic          Very small  Neutral
9   HIS            H           Polar      Positive   Basic              Medium      Neutral
10  ILE            I           Non-Polar  Uncharged  Aliphatic          Large       Hydrophobic
11  LEU            L           Non-Polar  Uncharged  Aliphatic          Large       Hydrophobic
12  LYS            K           Non-Polar  Positive   Basic              Large       Hydrophilic
13  MET            M           Non-Polar  Uncharged  Sulfur             Large       Hydrophobic
14  PHE            F           Non-Polar  Uncharged  Aromatic           Very large  Hydrophobic
15  PRO            P           Non-Polar  Uncharged  Aliphatic          Small       Neutral
16  SER            S           Polar      Uncharged  Hydroxyl           Very small  Neutral
17  THR            T           Polar      Uncharged  Hydroxyl           Small       Neutral
18  TRP            W           Non-Polar  Uncharged  Aromatic           Very large  Hydrophobic
19  TYR            Y           Non-Polar  Uncharged  Aromatic           Very large  Neutral
20  VAL            V           Non-Polar  Uncharged  Aliphatic          Medium      Hydrophobic
- classmethod three_to_one_letter(aminoacid: str) → str
Convert a three-letter amino acid code to its one-letter code.
- Parameters:
aminoacid (str) – three letter aminoacid code
- Returns:
one letter aminoacid code
- Return type:
str
- classmethod one_to_three_letter(aminoacid: str) → str
Convert a one-letter amino acid code to its three-letter code.
- Parameters:
aminoacid (str) – one letter aminoacid code
- Returns:
three letter aminoacid code
- Return type:
str
- classmethod get_polarity(aminoacid: str)
Return the polarity of an amino acid.
- Parameters:
aminoacid (str) – one letter aminoacid code
- Returns:
aminoacid polarity
- Return type:
str
- classmethod get_charge(aminoacid: str)
Return the charge of an amino acid.
- Parameters:
aminoacid (str) – one letter aminoacid code
- Returns:
aminoacid charge
- Return type:
str
- classmethod get_chemical_properties(aminoacid: str)
Return the chemical properties of an amino acid.
- Parameters:
aminoacid (str) – one letter aminoacid code
- Returns:
aminoacid chemical properties
- Return type:
str
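The lookups offered by Mapper amount to reading columns of the property table. The sketch below reproduces four rows of that table in a plain dictionary to show the idea. It is not the library's implementation: the names TABLE and THREE_TO_ONE are hypothetical, and the upper-casing of the input is an assumption.

```python
# Four rows copied from the property table above, keyed by one-letter code:
# (three-letter code, polarity, charge, chemical property)
TABLE = {
    "A": ("ALA", "Non-Polar", "Uncharged", "Aliphatic"),
    "R": ("ARG", "Polar", "Positive", "Basic"),
    "D": ("ASP", "Polar", "Negative", "Acidic"),
    "W": ("TRP", "Non-Polar", "Uncharged", "Aromatic"),
}

# Reverse index for three-letter lookups.
THREE_TO_ONE = {row[0]: one for one, row in TABLE.items()}


def three_to_one_letter(aminoacid: str) -> str:
    return THREE_TO_ONE[aminoacid.upper()]  # assumption: input case is normalised


def one_to_three_letter(aminoacid: str) -> str:
    return TABLE[aminoacid.upper()][0]


def get_polarity(aminoacid: str) -> str:
    return TABLE[aminoacid.upper()][1]


def get_charge(aminoacid: str) -> str:
    return TABLE[aminoacid.upper()][2]


def get_chemical_properties(aminoacid: str) -> str:
    return TABLE[aminoacid.upper()][3]
```

As in the reference above, the property getters take the one-letter code.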
Dataset
- class benchstab.utils.dataset.DatasetRow(data=None, index=None, dtype: Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | lib.NoDefault = <no_default>)
Bases:
Series
- property _constructor
Used when a manipulation result has the same dimensions as the original.
- property _constructor_expanddim
Used when a manipulation result has one more dimension than the original, such as Series.to_frame().
- property fasta
Get fasta sequence from the dataset’s row.
- Returns:
fasta sequence
- Return type:
str
- property ph
Get pH from the dataset’s row.
- Returns:
pH
- Return type:
str
- property temperature
Get temperature from the dataset’s row.
- Returns:
temperature
- Return type:
str
- property mutation
Get mutation from the dataset’s row.
- Returns:
mutation
- Return type:
str
- property fasta_mutation
Get fasta mutation from the dataset’s row.
- Returns:
fasta mutation
- Return type:
str
- class benchstab.utils.dataset.PredictorDataset(*args, **kwargs)
Bases:
DataFrame
Wrapper around pandas.DataFrame with additional methods.
- blocking_statuses = [Authenticaton(status='authentication', message='The job is being authenticated.'), Waiting(status='waiting', message="The job is waiting in predictor's queue."), Processing(status='processing', message='The job request is being currently processed.')]
- update_statuses = [Authenticaton(status='authentication', message='The job is being authenticated.'), Waiting(status='waiting', message="The job is waiting in predictor's queue."), Processing(status='processing', message='The job request is being currently processed.'), NotStarted(status='not started', message='The job has not started yet.')]
- classmethod concat(*args, **kwargs)
Wrapper around pandas.concat that casts the result back to PredictorDataset.
- property _constructor
Used when a manipulation result has the same dimensions as the original.
- property _constructor_sliced
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their associated index values– they need not be the same length. The result index will be the sorted union of the two indexes.
Parameters
- data: array-like, Iterable, dict, or scalar value
Contains data stored in Series. If data is a dict, argument order is maintained.
- index: array-like or Index (1d)
Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
- dtype: str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
- name: Hashable, default None
The name to give to the Series.
- copy: bool, default False
Copy input data. Only affects Series or 1d ndarray input. See examples.
Notes
Please reference the User Guide for more information.
Examples
Constructing Series from a dictionary with an Index specified
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])
>>> ser
a    1
b    2
c    3
dtype: int64
The keys of the dictionary match with the Index values, hence the Index values have no effect.
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64
Note that the Index is first built with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result.
Constructing Series from a list with copy=False.
>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64
Due to input data type the Series has a copy of the original data even though copy=False, so the data is unchanged.
Constructing Series from a 1d ndarray with copy=False.
>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999, 2])
>>> ser
0    999
1      2
dtype: int64
Due to input data type the Series has a view on the original data, so the data is changed as well.
- property logger
Getter for the logger.
- start_timer(index)
Start the timer for the prediction.
- Parameters:
index (int) – index of the row
- update_status(index, status)
Update the status of the prediction. If the prediction has succeeded or failed, stop the timer.
- Parameters:
index (int) – index of the row to update
status (str) – new status
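The interplay between start_timer and update_status can be sketched without pandas. This is a hypothetical stand-in, not the PredictorDataset code: the class JobClock, its attributes, and the use of plain status strings are assumptions. The set of still-active states mirrors the update_statuses list documented above, and the timer stops on any other status.

```python
import time

# States in which a job is still in flight (mirrors update_statuses above).
ACTIVE = {"not started", "authentication", "waiting", "processing"}


class JobClock:
    """Hypothetical per-row timer keyed by row index."""

    def __init__(self):
        self.started = {}   # index -> start time
        self.elapsed = {}   # index -> duration in seconds
        self.status = {}    # index -> latest status string

    def start_timer(self, index):
        self.started[index] = time.monotonic()

    def update_status(self, index, status):
        self.status[index] = status
        # Stop the timer once the job has succeeded or failed.
        if status not in ACTIVE and index in self.started:
            self.elapsed[index] = time.monotonic() - self.started.pop(index)


clock = JobClock()
clock.start_timer(0)
clock.update_status(0, "waiting")    # still running, timer keeps going
clock.update_status(0, "finished")   # terminal status, timer stops
```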
- format_to_output(verbose: int = 0)
Format the dataset to output format.
- Parameters:
verbose (int) – verbosity level
- Returns:
formatted dataset
- Return type: