Published in

American Chemical Society, Journal of Chemical Information and Modeling, 3(54), p. 844-856, 2014

DOI: 10.1021/ci4005805

Uniting Cheminformatics and Chemical Theory To Predict the Intrinsic Aqueous Solubility of Crystalline Druglike Molecules

This paper is available in a repository.

Preprint: archiving allowed
  • Must obtain written permission from Editor
  • Must not violate ACS ethical Guidelines
Postprint: archiving restricted
  • Must obtain written permission from Editor
  • Must not violate ACS ethical Guidelines
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

We present four models of solution free energy prediction for druglike molecules utilising cheminformatics descriptors and theoretically calculated thermodynamic values. We make predictions of solution free energy using physics-based theory alone and using machine learning/quantitative structure-property relationship (QSPR) models. We also develop machine learning models in which the theoretical energies and cheminformatics descriptors are used as combined input. These models are used to predict solvation free energy. While direct theoretical calculation does not give accurate results in this approach, machine learning gives predictions with a root mean squared error (RMSE) of around 1.1 log S units in a 10-fold cross-validation for our Drug-Like-Solubility-100 (DLS-100) dataset of 100 druglike molecules. We find that a model built using energy terms from our theoretical methodology as descriptors is marginally less predictive than one built on Chemistry Development Kit (CDK) descriptors. Combining both sets of descriptors allows a further but very modest improvement in the predictions, although in some cases this enhancement is statistically significant. These results suggest that there is little complementarity between the chemical information provided by these two sets of descriptors, despite their different sources and methods of calculation. Our machine learning models are also able to predict the well-known Solubility Challenge dataset with an RMSE of between 0.9 and 1.0 log S units.
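The evaluation protocol named in the abstract, an RMSE computed over a 10-fold cross-validation, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify the learning algorithms, so a one-descriptor least-squares line stands in for the model, the data are synthetic rather than DLS-100, and every function name (`rmse`, `k_fold_splits`, `fit_line`, `cross_validated_rmse`) is hypothetical.

```python
import math
import random


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Assumption: a stand-in for whichever model is actually being assessed.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=10, seed=0):
    """Predict each sample from a model that never saw it, then score once."""
    predictions = [0.0] * len(ys)
    for train, test in k_fold_splits(len(ys), k, seed):
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in test:
            predictions[i] = slope * xs[i] + intercept
    return rmse(predictions, ys)


# Synthetic example: 100 "molecules", one descriptor, noisy linear response.
rng = random.Random(1)
descriptor = [rng.uniform(-5.0, 5.0) for _ in range(100)]
log_s = [0.8 * x - 2.0 + rng.gauss(0.0, 1.0) for x in descriptor]
print(round(cross_validated_rmse(descriptor, log_s), 2))
```

Pooling the out-of-fold predictions before computing a single RMSE is one common convention; averaging per-fold RMSE values is another, and the abstract does not say which was used.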