Discovering data source stability patterns in biomedical repositories based on simplicial projections from probability distribution distances
The degree of homogeneity of statistical distributions among data sources is a critical issue when reusing data from Integrated Data Repositories (IDRs). Evaluating this data source stability is essential to reusing data with confidence. This work addresses the task of discovering and classifying patterns among the statistical distributions of multiple sources in IDRs by means of a novel approach based on simplicial projections from probability distribution distances, combined with density-based spatial clustering of applications with noise (DBSCAN). The results on the 16 public repositories evaluated support the existence of four main data source stability patterns in biomedical repositories: the global stability pattern (GSP), the local stability pattern (LSP), the sparse stability pattern (SSP) and the instability pattern (IP).
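As a rough illustration of the idea, the following minimal sketch computes pairwise distances between the distributions of several sources and groups the sources with a small DBSCAN run directly on the distance matrix. It rests on several assumptions that are not stated in the text above: the Jensen-Shannon distance is used as the probability distribution distance, the simplicial projection step is omitted (clustering is applied to the raw distances), and the toy distributions, source count, `eps` and `min_pts` values are all hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    Assumption: the metric is chosen for illustration only; the value lies in [0, 1].
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence, skipping zero-probability terms of x.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


def distance_matrix(dists):
    """Symmetric matrix of pairwise distances between source distributions."""
    n = len(dists)
    return [[js_distance(dists[i], dists[j]) for j in range(n)] for i in range(n)]


def dbscan(D, eps, min_pts):
    """Minimal DBSCAN over a precomputed distance matrix; label -1 marks noise."""
    n = len(D)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if D[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = [k for k in range(n) if D[j][k] <= eps]
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical toy data: one categorical variable observed at four sources.
sources = [
    [0.50, 0.30, 0.20],  # source A
    [0.48, 0.32, 0.20],  # source B
    [0.52, 0.28, 0.20],  # source C
    [0.05, 0.15, 0.80],  # source D, clearly different
]

D = distance_matrix(sources)
labels = dbscan(D, eps=0.1, min_pts=2)
print(labels)
```

In this toy setting, sources A, B and C fall into one cluster and source D is flagged as noise. How cluster structures of this kind map onto the four stability patterns named above is not something this sketch attempts to reproduce.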