Abstract: MATH/CHEM/COMP 2002, Dubrovnik, June 24-29, 2002
Modelling Aqueous Solubility
Darko Butina and Joelle Gola
ArQule (UK) Ltd, Cambridge Science Park, Cambridge, UK
Of all the molecular properties that can profoundly affect a compound's biological activity, aqueous solubility is probably one of the most fundamental and deserves attention in the early drug discovery phase. Not surprisingly, aqueous solubility has been extensively studied, and a large number of computational methods for estimating this highly important property have been reported. They generally use various molecular descriptors based on partition coefficients, chromatographic parameters, activity coefficients, solvation parameters and a variety of geometric, electronic, and topological molecular descriptors.
This paper describes the development of an aqueous solubility model based on solubility data from the Syracuse database, using the calculated octanol/water partition coefficient and 51 2D-based descriptors. Two methodologies were used, PLS (SIMCA) and Cubist (Ross Quinlan). Cubist, which combines decision-tree rules with Multiple Linear Regression (MLR), gave the better results on the independent, randomly selected test set, with R2 = 0.74 and an average absolute error of 0.68 log units. The training and test sets had similar distributions of the functionalities present: 60% neutral molecules, 14% acids, 8% phenols, 11% monobasic, 4% polybasic and 3% zwitterions. The two sets were created by random selection, with 81% (2688 molecules) in the training set and 19% (640) in the test set. The test set was used only once, to produce the final statistics.
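As an illustration of the evaluation protocol described above, the following is a minimal sketch of a random training/test split and of the two reported statistics. It is not the authors' code: the function names, the seed, and the integer placeholders standing in for molecules are all assumptions, and R2 is computed here as the coefficient of determination, which the abstract does not specify.

```python
import random


def random_split(items, n_test, seed=0):
    """Shuffle a copy of items and hold out n_test of them as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def average_absolute_error(observed, predicted):
    """Mean of |observed - predicted|, in the units of the data (log units here)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholders for the 3328 molecules; not real solubility data.
molecule_ids = list(range(2688 + 640))
train_ids, test_ids = random_split(molecule_ids, n_test=640, seed=42)
print(len(train_ids), len(test_ids))  # 2688 640
```

In this sketch the test identifiers would be set aside until a final model has been chosen, and the two statistics would then be computed once on the observed and predicted log solubilities of the test molecules.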