Multiple scaled contaminated normal distribution and its application in clustering

Punzo A.; Tortora C.

2021-01-01

Abstract

The multivariate contaminated normal (MCN) distribution represents a simple heavy-tailed generalization of the multivariate normal (MN) distribution to model elliptical contoured scatters in the presence of mild outliers (also referred to as ‘bad’ points herein) and automatically detect bad points. The price of these advantages is two additional parameters: the proportion of good observations and the degree of contamination. However, in a multivariate setting, only one proportion of good observations and only one degree of contamination may be limiting. To overcome this limitation, we propose a multiple scaled contaminated normal (MSCN) distribution. Among its parameters, we have an orthogonal matrix Γ. In the space spanned by the vectors (principal components) of Γ, there is a proportion of good observations and a degree of contamination for each component. Moreover, each observation has a posterior probability of being good with respect to each principal component. Thanks to this probability, the method provides directional robust estimates of the parameters of the nested MN and automatic directional detection of bad points. The term ‘directional’ is added to specify that the method works separately for each principal component. Mixtures of MSCN distributions are also proposed, and an expectation-maximization algorithm is used for parameter estimation. Real and simulated data are considered to show the usefulness of our mixture with respect to well-established mixtures of symmetric distributions with heavy tails.
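As a rough illustration of the two extra MCN parameters mentioned in the abstract, the sketch below evaluates a contaminated normal density as a two-component mixture with a shared mean, where the 'bad' component has an inflated covariance. It is not the authors' code. The symbols `alpha` (proportion of good observations) and `eta` (degree of contamination, assumed greater than 1) follow common usage in the contaminated normal literature and are assumptions here, as are the function names `mvn_pdf`, `mcn_pdf`, and `prob_good`. The sketch covers only the single-proportion MCN case, not the directional MSCN extension.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of a multivariate normal N(mu, sigma) at a single point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    log_det = np.linalg.slogdet(sigma)[1]       # log|sigma|
    return float(np.exp(-0.5 * (d * np.log(2 * np.pi) + log_det + maha)))

def mcn_pdf(x, mu, sigma, alpha, eta):
    """Contaminated normal density: good part N(mu, sigma), bad part N(mu, eta*sigma)."""
    sigma = np.asarray(sigma, dtype=float)
    good = alpha * mvn_pdf(x, mu, sigma)
    bad = (1.0 - alpha) * mvn_pdf(x, mu, eta * sigma)
    return good + bad

def prob_good(x, mu, sigma, alpha, eta):
    """Posterior probability that x comes from the good (uninflated) component."""
    good = alpha * mvn_pdf(x, mu, sigma)
    return good / mcn_pdf(x, mu, sigma, alpha, eta)

# Illustrative values only (assumed, not taken from the paper).
mu = np.zeros(2)
sigma = np.eye(2)
alpha, eta = 0.9, 10.0

p_center = prob_good([0.0, 0.0], mu, sigma, alpha, eta)  # point at the mean
p_far = prob_good([5.0, 5.0], mu, sigma, alpha, eta)     # point far from the mean
```

With these assumed values, a point at the mean gets a posterior probability of being good above 0.5 and a distant point gets one below 0.5. Thresholding at 0.5 is one natural way to flag bad points. The MSCN distribution described above would instead compute such a probability separately along each principal component.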


Year: 2021

Keywords: contaminated normal distribution, heavy-tailed distributions, multiple scaled distributions, EM algorithm, mixture models, model-based clustering

Appears in types: 1.1 Journal article

Files in this item:

File	Description	Type	Size	Format	Access
Punzo & Tortora (2019) - SM.pdf	Main article	Pre-print	842.38 kB	Adobe PDF	Archive managers only

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/376027
