Outlier Detection via Mixtures of Contaminated Gaussian Distributions

PUNZO, ANTONIO
2013

Abstract

For multivariate continuous data, the contaminated Gaussian distribution – having two parameters indicating the proportion of outliers and the degree of contamination – represents a convenient and natural way to model and detect outliers. In this paper, we introduce a mixture model whereby each mixture component is itself a contaminated Gaussian distribution. To make the approach parsimonious, a family of fourteen mixtures of contaminated Gaussian distributions is developed by applying constraints on eigen-decomposed component covariance matrices. Although these models could be used for model-based clustering, model-based classification, and discriminant analysis, we focus on the more general model-based classification framework. An ECM algorithm is used to find maximum likelihood estimates of the parameters and thereby give classifications for the observations. A simulation study is performed to evaluate the behavior of the BIC and the ICL in model selection. This novel family of models is applied to artificial and real data to illustrate some of its advantages. Among them, and in contrast to the trimming approach: 1) each observation has a posterior probability of belonging to a particular group and, within each group, of being an outlier or not, 2) the models do not require pre-specification of quantities like the proportion of observations to trim, and 3) the approach can be easily used in high dimensions.
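
As an illustration of the two-parameter idea described in the abstract, the following is a minimal sketch of a single contaminated Gaussian component and of the posterior probability that a point is an outlier. It assumes a common parameterization that the abstract does not spell out: a "good" Gaussian with covariance `sigma` mixed with a "bad" Gaussian that shares the same mean and has the inflated covariance `eta * sigma`, with `eta > 1`. The names `outlier_prop`, `eta`, and all function names are hypothetical and are not taken from the paper. This is not the ECM algorithm of the paper, only the density and the outlier posterior for known parameters.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density evaluated at each row of x."""
    x = np.atleast_2d(x)
    d = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance of each row from mu.
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    _, logdet = np.linalg.slogdet(sigma)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))


def contaminated_pdf(x, mu, sigma, outlier_prop, eta):
    """Assumed form: (1 - outlier_prop) * N(mu, sigma) + outlier_prop * N(mu, eta * sigma)."""
    good = (1.0 - outlier_prop) * gaussian_pdf(x, mu, sigma)
    bad = outlier_prop * gaussian_pdf(x, mu, eta * sigma)
    return good + bad


def outlier_posterior(x, mu, sigma, outlier_prop, eta):
    """Posterior probability that each row of x comes from the inflated ('bad') part."""
    bad = outlier_prop * gaussian_pdf(x, mu, eta * sigma)
    return bad / contaminated_pdf(x, mu, sigma, outlier_prop, eta)


# Hypothetical parameter values chosen only for illustration.
mu = np.zeros(2)
sigma = np.eye(2)
points = np.array([[0.0, 0.0], [5.0, 5.0]])
post = outlier_posterior(points, mu, sigma, outlier_prop=0.05, eta=10.0)
print(post)  # small for the central point, close to 1 for the distant one
```

Under these assumptions, a point near the mean is attributed almost entirely to the good part, while a point far from the mean is attributed to the inflated part, so no trimming proportion has to be fixed in advance. For points extremely far from the mean both densities can underflow to zero, so a production version would work on the log scale.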

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/108139