Prioritizing Structures for Neuroevolution Potentials

C. Strandby
Master’s Thesis (2024)
doi: 20.500.12380/

This Master’s thesis investigates how informed data selection using principal component analysis impacts the performance of machine-learned interatomic potentials in molecular dynamics simulations, with a focus on improving training efficiency and accuracy for neuroevolution potential models. The central hypothesis is that selecting structures based on their separation in principal component space can enhance dataset diversity and improve model robustness against overfitting.

The primary method, Large Variance Selection, creates diverse training datasets by selecting the structures with the largest norm of descriptor variance in principal component space. The method was applied to crystalline benzene datasets of varying sizes, and model performance was evaluated using root-mean-square error (RMSE) scores for energy and force predictions. Two additional selection methods were also explored: one selecting structures with the smallest norm of descriptor variance, and another that attempts to maintain the original target distribution.
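
The selection step described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis, and it rests on an assumed reading of the criterion: each structure is represented by one descriptor vector, the vectors are centered and projected onto the leading principal components, and structures are ranked by the Euclidean norm of their projected coordinates. The function names, the number of components, and the synthetic descriptors are all hypothetical.

```python
import numpy as np


def pca_project(descriptors, n_components=2):
    """Center the descriptors and project them onto the leading principal axes."""
    centered = descriptors - descriptors.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def select_by_norm(descriptors, n_select, n_components=2, largest=True):
    """Return indices of the n_select structures ranked by norm in PC space.

    largest=True mimics a large-variance style selection (assumed reading);
    largest=False picks the structures closest to the dataset mean instead.
    """
    scores = pca_project(descriptors, n_components)
    norms = np.linalg.norm(scores, axis=1)
    order = np.argsort(norms)  # ascending by norm
    if largest:
        order = order[::-1]
    return order[:n_select]


# Synthetic stand-in for per-structure descriptors (200 structures, 30 features).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 30))

selected = select_by_norm(descriptors, n_select=20)
selected_small = select_by_norm(descriptors, n_select=20, largest=False)
```

Setting `largest=False` gives a rough analogue of the small-norm variant mentioned above. The distribution-preserving variant is not sketched here, since the abstract does not describe how it is constructed.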

The results showed that Large Variance Selection consistently outperformed random selection, reducing overfitting and improving test accuracy. The other two methods, while providing insight into different selection biases, generally performed worse not only than Large Variance Selection but also than random selection, particularly in force predictions.

In conclusion, this study demonstrates that informed data selection based on large variance in principal component space can enhance the performance of neuroevolution potential models by increasing dataset diversity. Informed selection in principal component space may also improve the reproducibility of results by mitigating biases present in random selection. These findings highlight the importance of dataset diversity and offer a promising approach to more efficient training of neuroevolution potential models for molecular dynamics simulations.