OOD (Out-of-Distribution) Data Competition
How can machine learning models be used to robustly predict chemical properties of compounds while guaranteeing performance? This project will be run as a competition. The top-scoring individuals at the end of the year will have the opportunity to join Dr. Oliva's lab.
The goal of this competition is to devise a model (or models) trained on in-distribution data that accurately classifies out-of-distribution molecules as potentially viable with high precision (an extremely low false positive rate).
In order to advance scientific discovery, machine learning models need to make accurate predictions about data that lies outside the training distribution.
In collaboration with Dr. Oliva and his research team, help advance the role of ML in science.
Machine learning predictions work best on data that follows the same distribution as the training set, i.e., instances that are not particularly novel or interesting. The challenge is that the goal of scientific discovery is to explore novel ground on the foundation of existing knowledge: in this realm, we aim to extrapolate beyond what has previously been characterized.
In chemoinformatics, any training set will ultimately be limited compared to the vastness of possible molecules. Being able to characterize molecules very different from those in the training set (i.e., out of distribution) is important for discovering new and potentially useful chemicals, such as for medical applications. To verify whether a molecule does in fact have the relevant chemical property, it must be synthesized in a lab. However, synthesis of molecules requires labor, time, and money. As a result, false positives are both costly and disappointing.
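Because scoring hinges on keeping false positives rare, it helps to be explicit about the metrics involved. The helpers below are an illustrative sketch (not part of any provided starter code) that computes precision and false positive rate from binary labels and predictions:

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary (0/1) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

def false_positive_rate(y_true, y_pred):
    """Fraction of true negatives incorrectly flagged as positive."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0
```

For example, with y_true = [1, 0, 1, 0, 1] and y_pred = [1, 1, 1, 0, 0], there are 2 true positives and 1 false positive, so precision is 2/3 and the false positive rate is 0.5. In the competition's terms, each false positive corresponds to a wasted synthesis.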
Using the provided training and validation datasets, your goal is to develop a binary classification model that performs well on the out-of-distribution test dataset.
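One possible starting point is sketched below, assuming molecules have already been featurized into fixed-length vectors (e.g., fingerprints). The synthetic data, the logistic regression model, and the 0.9 decision threshold are all illustrative placeholders, not part of the competition materials:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

# Synthetic stand-in for featurized molecules (e.g., fingerprint vectors).
X, y = make_classification(n_samples=500, n_features=64, random_state=0)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Raising the decision threshold above 0.5 trades recall for precision,
# which matters here because false positives (wasted syntheses) are the
# costly error.
probs = clf.predict_proba(X_val)[:, 1]
preds = (probs >= 0.9).astype(int)
print("validation precision:", precision_score(y_val, preds, zero_division=0))
```

Note that validation precision measured on in-distribution data can be optimistic; the real test is how well the thresholded model holds up on the out-of-distribution test set.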