This is the landing page of SoluProt, a web application for predicting protein solubility.
Protein solubility is a key prerequisite for the practical use of proteins in biotechnology and biomedicine. Today, poor solubility and aggregation pose a major bottleneck in the production of many therapeutic and industrially attractive proteins. Tools that predict a solubility score from the entire protein sequence are most useful in genomic projects, where they help prioritize protein sequences selected for laboratory production.
SoluProt is one of the latest additions to the family of solubility predictors based on machine learning. The training set is based on the TargetTrack database, which was carefully filtered to keep only targets expressed in Escherichia coli. The negative and positive samples were balanced and their protein length distributions equalized. The independent validation set is derived from the NESG dataset.
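The balancing step described above can be illustrated with a minimal sketch. The exact procedure used for SoluProt is not described here, so the approach below (binning sequences by length and drawing equal numbers of soluble and insoluble samples from each bin) is an assumption, as are the function name `balance_by_length`, the `bin_width` parameter, and the toy data.

```python
import random
from collections import defaultdict


def balance_by_length(samples, bin_width=50, seed=0):
    """Return a class-balanced, length-matched subset of `samples`.

    `samples` is a list of (sequence, label) pairs, where label 1 means
    soluble and label 0 means insoluble. Hypothetical helper, not the
    actual SoluProt procedure.
    """
    rng = random.Random(seed)

    # Group sequences into length bins, separately for each class.
    bins = defaultdict(lambda: {0: [], 1: []})
    for seq, label in samples:
        bins[len(seq) // bin_width][label].append(seq)

    balanced = []
    for key in sorted(bins):
        negatives, positives = bins[key][0], bins[key][1]
        # Keep the same number of samples from both classes in every bin,
        # which balances the classes and matches their length distributions.
        n = min(len(negatives), len(positives))
        balanced += [(s, 0) for s in rng.sample(negatives, n)]
        balanced += [(s, 1) for s in rng.sample(positives, n)]
    return balanced


# Toy data: placeholder sequences of different lengths (not real proteins).
toy = [
    ("M" * 30, 1), ("M" * 40, 1), ("M" * 45, 0),
    ("M" * 120, 0), ("M" * 130, 0), ("M" * 140, 1),
]
balanced = balance_by_length(toy)
```

With this toy input, each length bin contributes one soluble and one insoluble sequence, so the result is balanced both overall and within each length range. A narrower `bin_width` matches lengths more tightly but discards more data.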
In its current version, the predictor is based on a random forest regression model and employs 36 sequence-based features, e.g., amino acid content, predicted disorder, alpha-helix and beta-sheet content, sequence identity to PDB, and several aggregated physico-chemical properties. SoluProt currently achieves an accuracy of 58.2%, higher than other comparable tools, and is under active development. The tool will be implemented as an intuitive web interface and made available to the community.
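To give a feel for what sequence-based features look like, the sketch below computes two of the simpler kinds mentioned above: amino acid content and a few aggregated physico-chemical properties. It is illustrative only and does not reproduce the 36 SoluProt features. The function name `sequence_features`, the feature names, and the choice of the Kyte-Doolittle hydropathy scale are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (chosen here for illustration).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Compute a small feature dictionary from a protein sequence.

    Hypothetical helper: returns the fraction of each of the 20 standard
    amino acids plus two aggregated properties.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    counts = Counter(seq)

    # Amino acid content: fraction of each standard residue.
    features = {f"frac_{aa}": counts[aa] / n for aa in AMINO_ACIDS}

    # Average hydropathy over the full length; non-standard residues
    # contribute 0 to the sum.
    features["gravy"] = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq) / n

    # Fraction of charged residues (Asp, Glu, Lys, Arg).
    features["frac_charged"] = sum(counts[aa] for aa in "DEKR") / n
    return features


example = sequence_features("AAKK")
```

A feature vector of this kind, computed for every sequence in the training set, is the sort of input a random forest regressor (for instance `RandomForestRegressor` in scikit-learn) could be trained on. Features such as predicted disorder, secondary structure content, and sequence identity to PDB require external tools and are not covered by this sketch.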