Abstract
In recent years, deep learning has revolutionized protein structure prediction, achieving remarkable speed and accuracy. RNA structure prediction, however, has lagged behind. Although several methods have shown moderate success in predicting RNA secondary and tertiary structures, none has reached the accuracy of contemporary protein models. This shortfall has been attributed to the limited amount of high-quality structural information available as training data. To probe this proposed limitation, we developed RNASSTR, a large and diverse dataset of RNA sequences paired with their corresponding secondary structures. We assessed the utility of this dataset by retraining two deep learning models, SincFold and MXfold2. SincFold exhibited improved generalization to some previously unseen RNA families, enhancing its ability to predict accurate de novo RNA secondary structures. By contrast, retraining MXfold2 proved too computationally expensive on the large RNASSTR dataset, and the model did not achieve high performance on the test set. The RNASSTR dataset provides a substantial advance for RNA structure modeling, laying a strong foundation for the development of future RNA secondary structure prediction algorithms.
Competing Interest Statement
J.H.C. is founder, board and SAB member of Initial Therapeutics. The Regents of the University of California have patents issued and pending for CRISPR technologies on which J.A.D. is an inventor. J.A.D. is a cofounder of Azalea Therapeutics, Caribou Biosciences, Editas Medicine, Evercrisp, Scribe Therapeutics, Intellia Therapeutics, and Mammoth Biosciences. J.A.D. is a scientific advisory board member at Evercrisp, Caribou Biosciences, Intellia Therapeutics, Scribe Therapeutics, Mammoth Biosciences, The Column Group and Inari. J.A.D. is Chief Science Advisor to Sixth Street, a Director at Johnson & Johnson, Altos and Tempus, and has a research project sponsored by Apple Tree Partners. The remaining authors declare no competing interests.
Funder Information Declared
National Institutes of Health, https://ror.org/01cwqze88, 5R35GM148352, U19AI171110, U54AI170792, U19AI135990, UH3AI150552, U01AI142817
National Science Foundation, https://ror.org/021nxhr62, 2334028
Department of Energy, DE-AC02-05CH11231, 2553571, B656358
Apple Tree Partners, 24180
Howard Hughes Medical Institute, https://ror.org/006w34k90