Abstract Search

ISEF | Projects Database | Finalist Abstract

Back to Search Results | Print PDF

De Novo Drug Design as GPT Language Modeling: Large Chemistry Models With Supervised and Reinforcement Learning

Booth Id:
CBIO048

Category:
Computational Biology and Bioinformatics

Year:
2024

Finalist Names:
Ye, Gavin (School: Columbia Grammar and Preparatory School)

Abstract:
In recent years, generative machine learning algorithms have been successful in designing innovative drug-like molecules. SMILES is a sequence-like language used in most effective drug design models. Due to the data’s sequential structure, models such as recurrent neural networks and transformers can design pharmacological compounds with optimized efficacy. Large language models have advanced recently, but their implications on drug design have not yet been explored. Although one study successfully pre-trained a large chemistry model (LCM), its application to specific tasks in drug discovery is unknown. In this study, the drug design task is modeled as a causal language modeling problem. Thus, the procedure of reward modeling, supervised fine-tuning, and proximal policy optimization was used to transfer the LCM to drug design, similar to Open AI's ChatGPT and InstructGPT procedures. By combining the SMILES sequence with chemical descriptors, the novel efficacy evaluation model exceeded its performance compared to previous studies. After proximal policy optimization, the drug design model generated molecules with 99.2% having efficacy pIC50 > 7 towards the amyloid precursor protein, with 100% of the generated molecules being valid and novel. This demonstrated the applicability of LCMs in drug discovery, with benefits including less data consumption while fine-tuning. The applicability of LCMs to drug discovery opens the door for larger studies involving reinforcement learning with human feedback, where chemists provide feedback to LCMs and generate higher-quality molecules. LCMs’ ability to design similar molecules from datasets paves the way for more accessible, non-patented alternatives to drug molecules.