What is data science? Why do machine learning algorithms perform so well on some problems and so poorly on others? How can institutions come together to help improve our standards for model design and testing?

Despite the rapid growth and apparent ubiquity of data science in recent years, little research has been conducted into these fundamental questions. This project seeks to establish the epistemological foundations of machine learning, with a focus on applications in computational biology.

High-throughput technologies like whole-genome sequencing have radically transformed modern biological research, yet the dream of personalised medicine remains elusive. Machine learning promises to address this urgent social need, but without a better understanding of the discipline’s methodological imperatives, efforts will continue to be plagued by avoidable errors and disappointing results.

Through a combination of formal reasoning, computer simulations, and expert interviews, we intend to place data science on a firm theoretical footing. We will then put our theory to work by developing new tools for multi-omic meta-analysis and proposing institutional reforms designed to promote more rigorous standards for machine learning research.

Image credit: Jer Thorp, Attribution 2.0 Generic (CC BY 2.0)