The initiative, known as Polymathic AI, uses technology similar to that behind large language models such as ChatGPT. Instead of ingesting text, however, its models learn from scientific datasets in fields such as astrophysics, biology, chemistry, and fluid dynamics. This approach gives the models cross-disciplinary scientific capabilities.
"These groundbreaking datasets are by far the most diverse large-scale collections of high-quality data for machine learning training ever assembled for these fields," said Michael McCabe, a research engineer at the Flatiron Institute in New York City and a member of Polymathic AI. "Curating these datasets is a critical step in creating multidisciplinary AI models that will enable new discoveries about our universe."
The Polymathic AI team has released two open-source datasets that together comprise 115 terabytes of data sourced from dozens of contributors. The collection is freely available to the public and is expected to accelerate the development of AI models capable of solving complex scientific problems. For comparison, GPT-3 drew on about 45 terabytes of unfiltered text for its training.
"The freely available datasets are an unprecedented resource for developing sophisticated machine learning models that can then tackle a wide range of scientific problems," added Ruben Ohana, a research fellow at the Flatiron Institute's Center for Computational Mathematics. "Open-sourcing this data benefits both the machine learning and scientific communities, creating a win-win situation."
The datasets are hosted on Hugging Face, a popular platform for AI models and data, and are detailed in papers accepted for presentation at the NeurIPS conference in Vancouver, Canada.
"We've seen again and again that the most effective way to advance machine learning is to take difficult challenges and make them accessible to the wider research community," said McCabe. "When a new benchmark is released, it initially seems insurmountable. But opening access accelerates progress far beyond what any individual group could achieve."
Polymathic AI is a collaborative effort involving researchers from institutions such as the Simons Foundation, Flatiron Institute, New York University, and the Lawrence Berkeley National Laboratory.
The first dataset, named the Multimodal Universe, focuses on astrophysics and includes hundreds of millions of observations, such as images from NASA's James Webb Space Telescope and stellar data from ESA's Gaia spacecraft. "Machine learning has been happening for around 10 years in astrophysics, but it's still very hard to use across instruments, missions, and disciplines," said Polymathic AI researcher François Lanusse. "Datasets like the Multimodal Universe allow us to create models that natively understand this data and act as a Swiss Army knife for astrophysics."
The second dataset, dubbed the Well, comprises 15 terabytes of data across 16 diverse datasets. It features numerical simulations of biological systems, fluid dynamics, supernovae, and more, all governed by partial differential equations. These equations arise in a wide array of scientific problems but are notoriously difficult to solve. "This dataset encompasses a diverse range of physics simulations designed to address key limitations of current machine learning models," said Polymathic AI member Rudy Morel.
Building these datasets required extensive collaboration. "The creators of numerical simulations are sometimes skeptical of machine learning because of the hype, but they're curious about how it can benefit their research," Ohana explained.
The team is now using the datasets to train AI models, with early results showing promise. "Understanding how machine learning models generalize and interpolate across datasets from different physical systems is an exciting research challenge," said Polymathic AI member Bruno Régaldo-Saint Blancard.
Shirley Ho, project lead and group leader at the Flatiron Institute, noted, "Just like the Protein Data Bank spawned AlphaFold, I'm excited to see what the Well and the Multimodal Universe will help create." Ho will present Polymathic AI's findings at NeurIPS.