For multivariate continuous data, the contaminated Gaussian distribution, which has two parameters indicating the proportion of outliers and the degree of contamination, represents a convenient and natural way to model and detect outliers. In this paper, we introduce a mixture model whereby each mixture component is itself a contaminated Gaussian distribution. To make the approach parsimonious, a family of fourteen mixtures of contaminated Gaussian distributions is developed by applying constraints on eigen-decomposed component covariance matrices. Although these models could be used for model-based clustering, model-based classification, and discriminant analysis, we focus on the more general model-based classification framework. An ECM algorithm is used to find maximum likelihood estimates of the parameters and thereby give classifications for the observations. A simulation study is performed to evaluate the behavior of the BIC and the ICL in model selection. This novel family of models is applied to artificial and real data in order to illustrate some of its advantages. Among these advantages, and in contrast to the trimming approach: 1) each observation has a posterior probability of belonging to a particular group and, within each group, of being an outlier or not; 2) the models do not require pre-specification of quantities such as the proportion of observations to trim; and 3) the approach can be easily used in high dimensions.
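As an illustration of how a single contaminated Gaussian component can flag outliers, the following is a minimal sketch and not the authors' implementation. It assumes a common parameterization in which `alpha` is the proportion of "good" (non-outlying) observations and `eta > 1` inflates the covariance of the "bad" ones; the abstract names neither symbol. The function names `gaussian_pdf`, `contaminated_gaussian_pdf`, and `outlier_probability` are hypothetical, and the parameters are taken as known rather than estimated by an ECM algorithm.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density evaluated at each row of x."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance, computed without forming an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))


def contaminated_gaussian_pdf(x, mu, sigma, alpha, eta):
    """Assumed form: alpha * N(mu, sigma) + (1 - alpha) * N(mu, eta * sigma)."""
    sigma = np.asarray(sigma, dtype=float)
    good = alpha * gaussian_pdf(x, mu, sigma)
    bad = (1.0 - alpha) * gaussian_pdf(x, mu, eta * sigma)
    return good + bad


def outlier_probability(x, mu, sigma, alpha, eta):
    """Posterior probability that each row of x comes from the inflated part."""
    sigma = np.asarray(sigma, dtype=float)
    good = alpha * gaussian_pdf(x, mu, sigma)
    bad = (1.0 - alpha) * gaussian_pdf(x, mu, eta * sigma)
    return bad / (good + bad)


# Illustrative values only (not taken from the paper).
mu = np.zeros(2)
sigma = np.eye(2)
points = np.array([[0.0, 0.0], [6.0, 6.0]])
probs = outlier_probability(points, mu, sigma, alpha=0.9, eta=10.0)
```

Under these assumed values, the point at the component mean receives a small outlier probability and the distant point a probability close to one. In the full mixture, this within-group probability would be combined with the posterior probability of group membership.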
|Title:||Outlier Detection via Mixtures of Contaminated Gaussian Distributions|
|Publication date:||2013|
|Appears in types:||4.2 Abstract in conference proceedings|