Two-level infinite mixture for multi-domain data

Journal article published in 2008 by Simon Rogers, Arto Klami, Janne Sinkkonen, Mark Girolami, Samuel Kaski
This paper is available in a repository.

Abstract

The combined, unsupervised analysis of coupled data sources is an open problem in machine learning. A particularly important example from the biological domain is the analysis of mRNA and protein profiles derived from the same set of genes (either over time or under different conditions). Such analysis has the potential to provide a far more comprehensive picture of the mechanisms of transcription and translation than the individual analysis of the separate data sets. The problem is similar to that attacked with traditional Canonical Correlation Analysis (CCA), but in many application areas the CCA assumptions are too restrictive. Probabilistic CCA [1] and kernel CCA [2] have both been recently proposed, but the former is still limited to linear relationships and the latter compromises interpretability in the original space. In this work, we present a non-parametric model for coupled data that provides an interpretable description of the shared variability in the data (as well as the variability that is not shared) whilst being free of restrictive assumptions such as those found in CCA. The hierarchical model is built from two marginal mixtures (one for each representation; generalisation to three or more is straightforward). Each object is assigned to one component in each marginal, and the contingency table describing these joint assignments is assumed to have been generated by a mixture of tables with independent margins. This top-level mixture captures the shared variability whilst the marginal models are free to capture variation specific to the respective data sources. The number of components in all three mixtures is inferred from the data using a novel Dirichlet Process (DP) formulation.
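
The generative story described in the abstract can be illustrated with a small simulation. The following is only a rough sketch under stated assumptions, not the authors' formulation: the DP priors are approximated by truncated stick-breaking, the marginal components are given spherical Gaussian emissions, and the margins of each top-level table are themselves drawn by stick-breaking. All function names, dimensions, and parameter values (`stick_breaking`, `sample_coupled_data`, `truncation`, `alpha`) are hypothetical choices made for illustration.

```python
import numpy as np


def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking weights (an approximation to a DP prior)."""
    betas = rng.beta(1.0, alpha, size=truncation)
    betas[-1] = 1.0  # close the stick so the weights sum to one
    # Length of stick remaining before each break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining


def sample_coupled_data(n_objects=200, truncation=10, alpha=1.0, seed=0):
    """Sample paired observations (x, y) from a two-level mixture sketch."""
    rng = np.random.default_rng(seed)

    # Top-level mixture: each component is a "table" with independent margins.
    top_weights = stick_breaking(alpha, truncation, rng)
    row_margins = np.array(
        [stick_breaking(alpha, truncation, rng) for _ in range(truncation)]
    )
    col_margins = np.array(
        [stick_breaking(alpha, truncation, rng) for _ in range(truncation)]
    )

    # Assumed Gaussian emissions for the two marginal mixtures
    # (2-D for the first representation, 3-D for the second).
    means_x = rng.normal(0.0, 5.0, size=(truncation, 2))
    means_y = rng.normal(0.0, 5.0, size=(truncation, 3))

    # Each object picks a top-level component, then one marginal component
    # per representation, independently given the top-level component.
    top = rng.choice(truncation, size=n_objects, p=top_weights)
    zx = np.array([rng.choice(truncation, p=row_margins[k]) for k in top])
    zy = np.array([rng.choice(truncation, p=col_margins[k]) for k in top])

    x = means_x[zx] + rng.normal(size=(n_objects, 2))
    y = means_y[zy] + rng.normal(size=(n_objects, 3))

    # Contingency table of joint marginal assignments.
    table = np.zeros((truncation, truncation), dtype=int)
    np.add.at(table, (zx, zy), 1)
    return x, y, zx, zy, top, table
```

Calling `sample_coupled_data()` returns the two data views, the marginal and top-level assignments, and the contingency table of joint assignments. The code only samples from the sketched model; inferring the assignments and the number of components from data, which is the paper's contribution, is not attempted here.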