We are always looking for new, better materials for applications including energy and communications. To do so systematically, we need to interrogate millions of potential material candidates. This lends itself to AI, and we are arguably already on the cusp of useful, fully autonomous approaches to materials discovery.
Our search for materials is driven by one or more desirable properties in tandem, like high electrical and low thermal conductivity. These properties in turn depend on lower-level factors, such as molecular orientation, so we need to incorporate very large amounts of data to screen effectively.
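A screen over several properties in tandem can be sketched as a simple filter; all material names, property values, and thresholds below are hypothetical, chosen only to illustrate the shape of the problem:

```python
# Illustrative multi-property screen over a toy candidate list.
# Names, values, and thresholds are hypothetical.
candidates = [
    {"name": "material_A", "electrical_conductivity": 5.9e7, "thermal_conductivity": 400.0},
    {"name": "material_B", "electrical_conductivity": 1.0e6, "thermal_conductivity": 2.0},
    {"name": "material_C", "electrical_conductivity": 2.0e3, "thermal_conductivity": 1.5},
]

def screen(cands, min_electrical=1.0e5, max_thermal=10.0):
    """Keep candidates with high electrical and low thermal conductivity."""
    return [
        c for c in cands
        if c["electrical_conductivity"] >= min_electrical
        and c["thermal_conductivity"] <= max_thermal
    ]

shortlist = screen(candidates)  # only material_B satisfies both thresholds
```

In practice each extra property multiplies the data we need per candidate, which is what makes the database-building step below so important.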
Creating and maintaining databases of this information is highly laborious for a number of reasons. Firstly, new experimental data is continuously created and validated, and the volume of publications generally increases year on year. Secondly, the literature does not follow a unified reporting standard, which makes new information difficult to incorporate. Thirdly, the literature is generally not published in a machine-readable format, often appearing only as PDFs.
To generate a database from unstructured text, we use language models. They work by approximating the semantic meaning of words, making them arguably the only tool flexible enough for systematic literature analysis at scale. Language models are high-dimensional, nonlinear function approximators suited to messy problems, and natural language understanding is exactly such a noisy task.
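To illustrate what "approximating semantic meaning" looks like, here is a toy sketch of word embeddings compared by cosine similarity. The three-dimensional vectors are invented purely for illustration; a real language model learns much higher-dimensional embeddings from large corpora:

```python
import math

# Toy "word vectors". These values are made up purely to illustrate
# the idea that related words map to nearby vectors; a real model
# learns such embeddings from data.
embeddings = {
    "conductivity": [0.9, 0.1, 0.3],
    "resistivity":  [0.8, 0.2, 0.4],
    "banana":       [0.1, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(embeddings["conductivity"], embeddings["resistivity"])
sim_unrelated = cosine_similarity(embeddings["conductivity"], embeddings["banana"])
# The related physics terms score higher than the unrelated word.
```

It is this geometric notion of "nearby meaning" that lets a model recognise a property report even when its exact wording has never been seen before.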
We use language models for a variety of tasks: usually a combination of entity extraction (such as finding all the chemical names in a text) and more qualitative tasks (such as identifying the formation conditions of a material). We benchmark the performance of our information extraction by comparing its output against a manually curated dataset.
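A benchmark of this kind reduces to precision and recall against the curated gold set; a minimal sketch, with hypothetical extraction output and reference entities:

```python
# Benchmarking extracted chemical names against a manually curated
# gold set. Both sets here are hypothetical examples; note the model
# output contains one error ("watr").
extracted = {"TiO2", "graphene", "PEDOT:PSS", "watr"}   # model output
gold      = {"TiO2", "graphene", "PEDOT:PSS", "water"}  # hand-curated reference

true_positives = extracted & gold
precision = len(true_positives) / len(extracted)  # share of extractions that are correct
recall = len(true_positives) / len(gold)          # share of gold entities recovered
f1 = 2 * precision * recall / (precision + recall)
```

Precision and recall pull in opposite directions (an extractor that returns everything has perfect recall and poor precision), which is why both are tracked together.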
We can try to improve the performance of our language models on qualitative tasks by training them on domain-specific data. We call this fine-tuning. Fine-tuning typically involves a trade-off, where a language model loses general capability in favour of task-specific performance. The effect of fine-tuning on a model is not well understood, and many of the emergent properties of language models come from training on extremely broad sources, so we must be careful not to overfit on domain data.
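One way to make this trade-off concrete is to score the model on both a domain task and a general task before and after fine-tuning; the benchmark names and scores below are purely illustrative:

```python
# Hypothetical benchmark scores illustrating the fine-tuning trade-off:
# the fine-tuned model gains on the domain task but slips on a general one.
scores = {
    "base":       {"domain_extraction": 0.62, "general_reasoning": 0.81},
    "fine_tuned": {"domain_extraction": 0.88, "general_reasoning": 0.74},
}

def deltas(before, after):
    """Per-task score change introduced by fine-tuning."""
    return {task: round(after[task] - before[task], 2) for task in before}

change = deltas(scores["base"], scores["fine_tuned"])
# A positive domain delta alongside a negative general delta is the
# signature of the trade-off described above.
```

Tracking both deltas during fine-tuning gives an early warning that the model is starting to overfit on domain data.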
At the end of this process, we end up with a database of relevant material information. Stepping up a level of abstraction, it is important to keep in mind the constraints of this database: which properties we chose to extract, what the experimentalists chose to report, and what information may have been lost along the way.
The database is a snapshot of material behaviours under static conditions, and we are usually interested in utilising materials in dynamic applications, so it is far from the end of the investigation. Actually selecting a material involves multiple stages of screening: we first disqualify as many candidates as possible using cheap methods, then follow up with medium- and finally high-resolution atomistic simulations. An example of a medium-resolution technique is molecular dynamics, which uses approximate forces to simulate molecular behaviour. Higher-resolution techniques simulate the behaviour of individual atoms and electrons, which requires more expensive calculations, such as those of density functional theory (DFT).
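The "approximate forces" of classical molecular dynamics come from empirical interatomic potentials. A minimal sketch of the standard Lennard-Jones pair potential, with roughly argon-like parameters:

```python
# Lennard-Jones pair potential: a classic empirical force model used
# in medium-resolution molecular dynamics. Parameters are roughly
# argon-like, for illustration.
EPSILON = 0.0104  # well depth, eV
SIGMA = 3.40      # zero-crossing distance, angstroms

def lj_energy(r):
    """V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (SIGMA / r) ** 6
    return 4.0 * EPSILON * (sr6 ** 2 - sr6)

def lj_force(r):
    """Radial force magnitude, F = -dV/dr; positive means repulsive."""
    sr6 = (SIGMA / r) ** 6
    return 24.0 * EPSILON * (2.0 * sr6 ** 2 - sr6) / r

r_min = 2.0 ** (1.0 / 6.0) * SIGMA  # equilibrium separation, where F = 0
# Below r_min the force is repulsive; beyond it, attractive. Summing
# such cheap pairwise forces over all atom pairs is what lets MD reach
# system sizes and timescales that DFT cannot.
```

The compromise is visible in the functional form itself: two fitted constants stand in for all the electronic structure that DFT would compute explicitly.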
As DFT is so computationally expensive, we struggle to model many atoms over large volumes and long timeframes. An emerging alternative is to use DFT data to train machine-learning interatomic potentials (ML-IPs): neural networks acting as high-dimensional functions that predict atomistic behaviour. Empirical potentials are imperfect, making compromises such as neglecting secondary electron shielding or instantaneous magnetic moments. ML-IPs are interesting because they are computationally far lighter than DFT (discounting the compute required for the initial training), so they can handle larger simulations, with more molecules and longer timeframes. This gives better insight into dynamic behaviour that may only bear out at larger scales, and lets us simulate systems that would be almost impossible to model empirically, like amorphous biological systems.

To conclude, we have summarised the process of:

Creating a materials database using a language model, with the freedom to define exactly what to extract.

Applying selection criteria at a variety of resolutions to determine a final set of candidates.

In the future, as the volume of experimental data increases and the efficacy of language models for tasks like entity extraction continues to improve, we will become increasingly effective at utilising latent information from the literature for rapid scientific iteration and discovery.