Bayesian statistics allows us to express domain knowledge about model parameters as a probability distribution and, by means of Bayes’ theorem, to update this knowledge using measured data. It is thus a prime example of interpretable data science and a proven tool for making probabilistic predictions. It forces us to conceptualize our knowledge about the system and its dominant sources of uncertainty in the form of a stochastic model for the measured data. However, Bayesian inference is rarely applied consistently to non-trivial stochastic models, because it is computationally extremely expensive. In recent years, sophisticated and scalable algorithms have emerged that have the potential to make Bayesian inference for complex stochastic models feasible, even for large data sets. The primary goal of this project is to explore the potential of these algorithms and to make them accessible to researchers from various domains. For this endeavour, the SDSC is an ideal partner for our team of experts from data, statistical, computational and domain sciences.
The Bayesian inference algorithms we will apply fall into two classes: Approximate Bayesian Computation (ABC) and Hamiltonian Monte Carlo (HMC). While the former class is technically easy to apply but yields only approximate results, the latter requires much more tailoring to a particular problem, but has the potential of yielding exact results. The basic idea behind ABC is to compress the data into a few so-called summary statistics and to accept or reject model parameters depending on how well the associated model outputs (pseudo-data generated via forward simulation of the model) agree with the real data in terms of these statistics. Today, ABC is used in many domains, but little is known about (i) how the summary statistics should be chosen and (ii) how accurate the inference results are. One of the goals of this project is to compare the performance of automatically generated summary statistics against those derived from domain knowledge. For this purpose, an HPC-enabled Python package, ABCpy, has already been developed by members of our team. The ground truth, against which we will compare the approximate ABC results, will be calculated by means of HMC. HMC algorithms are rarely applied to stochastic models because they involve calculating (i) very high-dimensional integrals and (ii) exact derivatives. Inspired by statistical physics, members of our team have recently introduced and implemented the idea of a generic time-scale separation, which boosts the performance of HMC inference for stochastic differential equation models. A proof of concept has been given, but parallelization will be required for inference with long time series. The problem with the derivatives is solved by employing Automatic Differentiation (AD) routines. The use of AD and parallelization will make HMC algorithms considerably more powerful and open up a large number of potential applications. The combined expertise of the SDSC and our team is well suited to this task.
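The accept/reject idea behind ABC described above can be illustrated with a minimal sketch. It is a toy example and not the project's implementation: it does not use ABCpy, and the Gaussian model, the uniform prior on [-5, 5], the sample mean as summary statistic, the tolerance `epsilon`, and all function names are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Forward-simulate pseudo-data: n draws from Normal(theta, 1) (assumed toy model)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic: the sample mean (a hand-picked, illustrative choice)."""
    return statistics.fmean(data)


def abc_rejection(observed, n_draws, epsilon, rng):
    """Keep prior draws whose simulated summary lies within epsilon of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw a candidate from the assumed uniform prior
        pseudo = simulate(theta, len(observed), rng)
        if abs(summary(pseudo) - s_obs) <= epsilon:
            accepted.append(theta)  # candidate reproduces the data well enough
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # synthetic stand-in for real data, true theta = 2
posterior_sample = abc_rejection(observed, n_draws=20000, epsilon=0.1, rng=rng)
```

The accepted values in `posterior_sample` approximate the posterior of `theta`. Shrinking `epsilon` reduces the approximation error but lowers the acceptance rate, which illustrates why the choice of summary statistics and the resulting accuracy are open questions for ABC.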
ABC and HMC scale to large data sets and are applicable in many domains. We therefore consider them highly significant for data science. We will test the algorithms on problems of intermediate computational size from the domains of solar physics and quantum magnetism. The lessons learned and the tools developed here should allow researchers from various domains to tackle much larger problems.
Professor Antonietta Mira; PI; ICS Institute of Computational Science
Professor C. Ruegg; PI; PSI
Professor C. Alberto; PI; EAWAG
Professor Ritabrata Dutta
Swiss Data Science Center (SDSC);
Paul Scherrer Institute (PSI);
Swiss Federal Institute of Aquatic Science and Technology (EAWAG);
Swiss National Supercomputing Center (CSCS);