This is a preliminary version of the SoluProt web application for prediction of soluble protein expression in Escherichia coli.
SoluProt is one of the latest additions to the family of solubility predictors based on machine learning (Musil et al., 2019). The training set is based on the TargetTrack database (Berman et al., 2017), which was carefully filtered to keep only targets expressed in Escherichia coli. The negative and positive samples were balanced and equalized for protein length. The independent validation set is derived from the NESG dataset (Price et al., 2011).
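The balancing step described above can be illustrated with a minimal sketch. It assumes fixed-width length bins and random undersampling of the larger class within each bin; the function name `balance_by_length`, the `bin_width` value, and the input layout are hypothetical and are not taken from the SoluProt pipeline, whose exact procedure is not described here.

```python
import random
from collections import defaultdict


def balance_by_length(samples, bin_width=50, seed=0):
    """Balance soluble and insoluble samples within sequence-length bins.

    samples: list of (sequence, is_soluble) pairs.
    bin_width: assumed bin size in residues (illustrative value only).
    Returns a list of (sequence, is_soluble) pairs in which every length
    bin contains the same number of soluble and insoluble sequences.
    """
    rng = random.Random(seed)

    # Group sequences by length bin and by class label.
    bins = defaultdict(lambda: {True: [], False: []})
    for seq, is_soluble in samples:
        bins[len(seq) // bin_width][bool(is_soluble)].append(seq)

    balanced = []
    for key in sorted(bins):
        pos, neg = bins[key][True], bins[key][False]
        # Undersample the larger class down to the size of the smaller one.
        n = min(len(pos), len(neg))
        balanced += [(s, True) for s in rng.sample(pos, n)]
        balanced += [(s, False) for s in rng.sample(neg, n)]
    return balanced
```

Because both classes are equalized inside each bin, the resulting set is balanced overall and the two classes share the same binned length distribution, so a model cannot separate them by sequence length alone.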
In its current version, the predictor is based on a random forest regression model and employs 36 sequence-based features, e.g., amino acid content, predicted disorder, alpha-helix and beta-sheet content, sequence identity to the PDB, and several aggregated physico-chemical properties. SoluProt currently achieves an accuracy of 58.2%, which is higher than that of other comparable tools, and is under active development.
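As an illustration of the simplest feature type listed above, amino acid content can be computed directly from the sequence. This is a minimal sketch, not the SoluProt feature extraction code: the function name `amino_acid_content` and the choice to ignore non-standard residues are assumptions made for the example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_content(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues (e.g. 'X') are ignored,
    which is an assumption of this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


features = amino_acid_content("MKTAYIAKQR")
print(features["K"])  # fraction of lysine residues
```

A vector of such fractions, combined with the other features, could then be passed to a random forest regressor, for example scikit-learn's `RandomForestRegressor`; whether SoluProt uses that particular library is not stated here.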
- Musil, M., Konegger, H., Hon, J., Bednar, D., Damborsky, J. (2019). Computational Design of Stable and Soluble Biocatalysts. ACS Catalysis 9: 1033–1054.
- Berman, H. M., Gabanyi, M. J., Kouranov, A., Micallef, D. I., Westbrook, J., Protein Structure Initiative Network Of Investigators (2017). Protein Structure Initiative – TargetTrack 2000–2017.
- Price, W. N., Handelman, S. K., Everett, J. K., Tong, S. N., Bracic, A., Luff, J. D., Hunt, J. F. (2011). Large-Scale Experimental Studies Show Unexpected Amino Acid Effects on Protein Expression and Solubility in vivo in E. coli. Microbial Informatics and Experimentation 1: 6.