General information
The goal of this project is to develop a “programming language” for molecules. We seek to advance the frontier of molecular modeling with large language models (LLMs), building on the successes of models such as Meta’s Galactica and our own BARTSmiles by leveraging publicly available data sources such as PubChem. The model should be capable not only of predicting various properties of given molecules but also of generating diverse sets of molecules with desired properties. We will explore ways to maximize the knowledge the LLM captures from the training data and to extract that knowledge efficiently for each downstream application. Eventually, the model will also handle protein sequences, enabling improved modeling of interactions between molecules and proteins.