Bayesian identifications of data intrinsic dimensions
Tuesday, April 2, 2019, 14:30, aula 402 (main building)
Authors: Michele Allegra, Francesco Denti, Elena Facco, Alessandro Laio, Michele Guindani, Antonietta Mira
Presenter: Antonietta Mira, Università della Svizzera italiana and University of Insubria
Even when defined in a high-dimensional space, data points usually lie on one or more hypersurfaces, or manifolds, of much smaller intrinsic dimension (ID). The recent TWO-NN method (Facco et al., 2017, Scientific Reports) estimates the ID when all points lie on a single sub-manifold.
TWO-NN assumes only that the density of points is approximately constant in a small neighborhood around each point. Under this hypothesis, the ratio between the distances from a point to its second and first nearest neighbors follows a Pareto distribution whose only parameter is the ID. We first extend the TWO-NN model to the case in which the data lie on several sub-manifolds, each with its own ID. While the idea behind the extension is simple (the Pareto is replaced by a finite mixture of $K$ Pareto distributions), a non-trivial Bayesian algorithm is required to estimate the model and assign each point to its manifold. Applying this method, which we dub Hidalgo (Heterogeneous Intrinsic Dimension ALGOrithm), we uncover a surprising ID variability in several real-world datasets. In particular, we show how this methodology helps discover latent clusters hidden in data of different kinds, ranging from protein folding trajectories to financial indexes computed on balance sheets.
Hidalgo obtains remarkable results, but its main limitation is that the number of sub-manifolds, i.e. of mixture components, must be fixed a priori. To overcome this issue, we adopt a flexible Bayesian nonparametric approach and model the data as an infinite mixture of Pareto distributions using a Dirichlet Process Mixture Model. This framework makes it possible to assess the uncertainty about the number of mixture components and about the assignment of data points to sub-manifolds. Since the posterior distribution has no closed form, we perform inference with the Slice Sampler algorithm.
In preliminary analyses on simulated and well-known datasets (e.g. Fisher's Iris dataset), the fully Bayesian nonparametric version of TWO-NN gives promising results: it recovers a rich data structure starting from the intrinsic dimension, a purely geometric feature of the data, and requires only the definition of a distance measure.
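As a rough illustration of the single-manifold case only, the Pareto law for the neighbor-distance ratio can be turned into a simple ID estimate. The sketch below is not the authors' code: it assumes a Pareto distribution with scale 1 and shape equal to the ID, uses the maximum-likelihood estimate N / sum(log mu_i) (a choice made here for brevity), and computes Euclidean distances by brute force. The function names `two_nn_ratios` and `estimate_id_mle` are hypothetical, and the data are assumed to contain no duplicate points.

```python
import numpy as np


def two_nn_ratios(points):
    """Return mu_i = r2 / r1, the ratio of second to first nearest-neighbor distance."""
    x = np.asarray(points, dtype=float)
    # Brute-force pairwise Euclidean distances (fine for a few thousand points).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    return r2 / r1  # assumes no duplicate points, so r1 > 0


def estimate_id_mle(points):
    """Maximum-likelihood shape of a Pareto(scale=1) fitted to the ratios."""
    mu = two_nn_ratios(points)
    return len(mu) / np.log(mu).sum()


# Toy data (an assumption for illustration): a flat 2-dimensional sheet,
# rotated into a 5-dimensional ambient space.
rng = np.random.default_rng(0)
sheet = np.hstack([rng.uniform(size=(500, 2)), np.zeros((500, 3))])
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # orthogonal matrix
points = sheet @ rotation

mu = two_nn_ratios(points)
d_hat = estimate_id_mle(points)
print(f"estimated ID: {d_hat:.2f}")
```

On this toy sheet the estimate should land close to 2 rather than 5, up to sampling noise. The finite-mixture (Hidalgo) and Dirichlet Process extensions described above replace this single Pareto with a mixture and require a sampler, which this sketch does not attempt.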