Language Models for Molecule Generation

Program Start Date:

01-05-2024

Program End Date:

01-04-2027

Program lead:

Hrant Khachatryan

General information

Program Code:

24FP-1A058

Budget:

54,000,000 AMD

The goal of this project is to develop a “programming language” for molecules. We seek to advance the frontier of molecular modeling using large language models (LLMs), building on the successes of models such as Meta’s Galactica and our own BARTSmiles, and leveraging publicly available data sources such as PubChem. The model should be capable not only of predicting various properties of given molecules but also of generating diverse sets of molecules with desired properties. We will explore ways to maximize the knowledge the LLM captures from the training data and to extract that knowledge efficiently for each downstream application. Eventually, the model will also handle protein sequences, improving the modeling of interactions between molecules and proteins.