Statistical Data Analysis



STATISTICAL LEARNING

Senior investigator: Dr. Matteo Falasconi

STATISTICAL METHODS FOR ARTIFICIAL OLFACTION

We apply statistical and pattern recognition techniques to gas sensor arrays (also called Electronic Noses – EN).
EN applications fall into four main areas:
1) food quality control
2) environmental monitoring
3) industrial safety and security aspects
4) medical applications

We tackle all main aspects of data analysis:

a) Exploratory data analysis (EDA)
This part covers importing data, organizing them in an appropriate data structure, filtering, calculating summary statistics, and plotting (principal component analysis, correspondence analysis, clustering). Within the two EU projects WOUNDMONITOR and NANOS4, we developed user-friendly software for exploratory data analysis (see e.g. Vezzoli et al. [1]). The EDA software has since been used successfully in a number of projects and applications (see the “Olfaction” section of the SENSOR website).
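To illustrate the projection step behind a PCA score plot, the minimal Python sketch below centres a small matrix of sensor responses, estimates the first principal component by power iteration, and projects the samples onto it. The readings, the function names and the choice of power iteration are assumptions made for this example and are not taken from the EDA software described above; a real analysis would normally use a numerical library.

```python
import math

def pca_first_component(rows, iters=200):
    """Return (mean, unit direction, explained variance ratio) of the first PC."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of the covariance.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return mean, v, eigenvalue / total_variance

def project(rows, mean, v):
    """Score of each sample along direction v after centring."""
    return [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]

# Hypothetical responses of a 3-sensor array to two groups of samples.
readings = [
    [1.0, 2.1, 0.9], [1.2, 2.4, 1.1], [0.9, 1.9, 1.0],
    [3.0, 6.2, 1.0], [3.2, 6.1, 0.9], [2.9, 5.8, 1.1],
]
mean, pc1, ratio = pca_first_component(readings)
scores = project(readings, mean, pc1)
print("explained variance ratio:", round(ratio, 3))
print("PC1 scores:", [round(s, 2) for s in scores])
```

In this toy data set the two groups end up on opposite sides of zero along the first component, which is the kind of separation one looks for visually in a score plot.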

b) Advanced cluster analysis and cluster validity
In the gas sensor field, Principal Component Analysis (PCA) is still the most widely used technique for exploratory data analysis, although the classification results then often depend on human judgment of the PCA plots. We have proposed a new approach based on cluster analysis (CA) in combination with cluster validity (CLV) that can be used to objectively infer and assess the data structure, and we have applied it to EN data sets [2].
Cluster analysis (CA) is the unsupervised classification of patterns (feature vectors) into groups (clusters) so that individuals within the same group are more similar to each other than to those belonging to different groups (see A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999) 264–323). Clustering is particularly appropriate for exploratory data analysis. This classification methodology enables us to summarize information, e.g. by representing classes through prototypes, and can help detect important relationships and structures within the data sets. Cluster validity (CLV) techniques can be used to objectively and quantitatively assess the structure of experimental data (see A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988, Chapter 4).
SENSOR developed a Matlab-based platform for Cluster Analysis and Cluster Validity that implements different types of clustering algorithms (hierarchical agglomerative clustering, k-means and fuzzy c-means) and a number of state-of-the-art internal and external validity indices. We recently proposed a novel stability-based strategy that is very effective for validating fuzzy c-means clustering results [3].
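The idea of pairing a clustering algorithm with an internal validity index can be sketched as follows. This Python example runs k-means for several candidate numbers of clusters and keeps the one with the highest mean silhouette width. The data points, the silhouette index and the farthest-point initialisation are assumptions chosen for illustration; the text does not say which indices the platform implements, and this is not the stability-based method of [3].

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:               # keep the old centre if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette width, an internal validity index in [-1, 1]."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:                   # singleton cluster contributes 0
            continue
        a = sum(own) / len(own)       # mean distance within own cluster
        b = min(                      # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical 2-D feature vectors forming three groups.
points = [
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2],
    [9.0, 1.0], [9.2, 1.1], [8.9, 0.9], [9.1, 1.2],
]
scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, "-> selected k =", best_k)
```

The index replaces visual inspection with a number that can be compared across partitions, which is the role validity indices play in the approach described above.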
These promising CLV techniques have also been applied to biological olfaction data sets [4] in cooperation with the lab of Prof. S. Marco, University of Barcelona (http://isp.el.ub.es/) and the lab of Prof. M. Leon, University of California, Irvine (http://gara.bio.uci.edu/). This work was partially supported by the European Network of Excellence GOSPEL, General Olfaction and Sensing Projects on a European Level (FP6-IST-2002-507610). This research received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216916. Partial support has been given by CNR under the Short-Term Mobility Program 2007.

c) Preprocessing and adaptive drift correction
In the past, we worked on feature extraction and particularly on feature selection. Within several international collaborations, we analyzed data produced by hybrid arrays using several search strategies and feature subset evaluation measures [6].
More recently we started working on bio-inspired adaptive drift correction methods. Electronic Noses (ENs) could be a simple, fast, high-throughput and economical alternative to conventional analytical instruments. However, gas sensor drift still limits EN adoption in real industrial setups because of the effort and cost of recalibration. Pattern recognition (PaRC) models built in the training phase become useless after a period of time, in some cases a few weeks.
Although algorithms to mitigate drift date back to the early 1990s, this is still a challenging issue for the chemical sensor community. Among other approaches, adaptive drift correction methods adjust the PaRC model in parallel with data acquisition, without the need for periodic calibration. Self-Organizing Maps (SOMs) and Adaptive Resonance Theory (ART) networks have already been tested in the past with fair success.
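The principle of adapting the model while data are acquired can be shown with a deliberately simple Python sketch: a nearest-centroid classifier that moves the winning centroid towards each new sample, compared with a static copy of the same model. This is neither a SOM nor an ART network; the class name, the learning rate, the two-class layout and the linear drift are all assumptions made for illustration.

```python
import math

class AdaptiveNearestCentroid:
    """Nearest-centroid classifier whose centroids can follow slow drift."""

    def __init__(self, centroids, rate=0.5):
        # Copy the trained centroids so that adaptation never alters the input.
        self.centroids = {label: list(c) for label, c in centroids.items()}
        self.rate = rate

    def predict(self, x, adapt=True):
        label = min(self.centroids,
                    key=lambda l: math.dist(x, self.centroids[l]))
        if adapt:
            # Move the winning centroid a fraction of the way towards x.
            c = self.centroids[label]
            self.centroids[label] = [ci + self.rate * (xi - ci)
                                     for ci, xi in zip(c, x)]
        return label

# Hypothetical centroids learned in the training phase.
trained = {"A": [1.0, 1.0], "B": [3.0, 1.0]}
static = AdaptiveNearestCentroid(trained)
adaptive = AdaptiveNearestCentroid(trained)

n_steps = 60
static_hits = adaptive_hits = 0
for t in range(n_steps):
    truth = "A" if t % 2 == 0 else "B"
    # Simulated drift: the first feature shifts by 0.05 at every measurement.
    x = [trained[truth][0] + 0.05 * t, trained[truth][1]]
    static_hits += static.predict(x, adapt=False) == truth
    adaptive_hits += adaptive.predict(x, adapt=True) == truth

static_accuracy = static_hits / n_steps
adaptive_accuracy = adaptive_hits / n_steps
print("static:", static_accuracy, "adaptive:", adaptive_accuracy)
```

In this simulation the static model starts mislabelling class A once the accumulated drift pushes its samples past the midpoint between the two trained centroids, while the adaptive model keeps tracking both classes. Because the adaptation uses the model's own predictions as labels, an early misclassification can pull a centroid in the wrong direction, so the scheme is only safe when drift is slow relative to the class separation.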
In cooperation with the CAD group, led by Prof. G. Squillero, of the Control and Computer Engineering Department, Politecnico di Torino (http://www.cad.polito.it/), SENSOR has developed an original methodology based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is suited to the stochastic optimization of complex problems [5].
The CMA-ES approach has shown higher classification rates than other drift correction methods (e.g. OSC correction). The results corroborate the hypothesis that the proposed methodology can systematically adapt to drift even when the amount of data is relatively small. CMA-ES also works with different types of classifiers, although the choice of classifier affects absolute classification performance (we tested it with kNN, PLS, radial basis functions, and SVM).
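To give a feel for how an evolution strategy can estimate a drift correction from unlabelled measurements, the Python sketch below uses a much simpler relative of CMA-ES: a (1+1) evolution strategy with the one-fifth success rule for step-size adaptation and no covariance adaptation. The cost function (squared distance of each corrected sample to its nearest training centroid), the additive drift model and all names and data are assumptions made for this example; the actual formulation is given in [5].

```python
import random

def one_plus_one_es(cost, x0, sigma=0.5, iters=400, seed=0):
    """Minimise cost with a (1+1)-ES and one-fifth-rule step-size adaptation."""
    rng = random.Random(seed)
    x, fx = list(x0), cost(x0)
    for _ in range(iters):
        child = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fc = cost(child)
        if fc <= fx:                # elitist selection: never accept a worse child
            x, fx = child, fc
            sigma *= 1.5            # success: widen the search
        else:
            sigma *= 1.5 ** -0.25   # failure: narrow it (balances at 1/5 successes)
    return x, fx

def make_cost(samples, centroids):
    """Cost of a candidate offset: distance of corrected samples to the model."""
    def cost(offset):
        total = 0.0
        for s in samples:
            corrected = [si - oi for si, oi in zip(s, offset)]
            total += min(sum((c - v) ** 2 for c, v in zip(cen, corrected))
                         for cen in centroids)
        return total
    return cost

# Hypothetical class centroids from the training phase.
centroids = [[1.0, 1.0], [3.0, 1.0]]
# Hypothetical unlabelled samples acquired later, shifted by roughly (0.8, -0.5).
drifted = [
    [1.8, 0.5], [1.9, 0.4], [1.7, 0.6],
    [3.8, 0.5], [3.9, 0.6], [3.7, 0.4],
]
cost = make_cost(drifted, centroids)
initial_cost = cost([0.0, 0.0])
offset, final_cost = one_plus_one_es(cost, [0.0, 0.0])
print("estimated offset:", [round(o, 2) for o in offset])
print("cost before/after:", round(initial_cost, 3), round(final_cost, 3))
```

Subtracting the estimated offset from new samples before classification leaves the trained classifier untouched, which is one way a correction stage can be combined with different classifiers.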

d) Pattern recognition and supervised learning
We consider both classical chemometrics and pattern recognition techniques, such as PLS, kNN, LDA and QDA, and more advanced machine learning techniques, such as multilayer perceptrons (MLP), support vector machines (SVM) and ensembles of learning machines (boosting, random forests). The Sensor Lab has been a forerunner in applying SVM, boosting and random forests to the analysis of e-nose data [7].
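As a small example of the supervised setting, the Python sketch below implements a k-nearest-neighbour (kNN) classifier and estimates its accuracy by leave-one-out cross-validation, a common choice when the number of measurements is small. The feature vectors, the class labels and the function names are hypothetical and serve only to illustrate the technique.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k=3):
    """Classify each sample with a model built from all the other samples."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)

# Hypothetical two-feature measurements for two classes of samples.
data = [
    ([1.0, 1.1], "fresh"), ([1.2, 0.9], "fresh"),
    ([0.9, 1.0], "fresh"), ([1.1, 1.2], "fresh"),
    ([3.0, 3.1], "spoiled"), ([3.2, 2.9], "spoiled"),
    ([2.9, 3.0], "spoiled"), ([3.1, 3.2], "spoiled"),
]
accuracy = leave_one_out_accuracy(data, k=3)
print("leave-one-out accuracy:", accuracy)
print("new sample:", knn_predict(data, [1.0, 1.0]))
```

The same evaluation loop can wrap any of the classifiers listed above, since it only needs a function that maps training data and a query to a predicted label.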

References
[1] M. Vezzoli, A. Ponzoni, M. Pardo, M. Falasconi, G. Faglia, G. Sberveglieri, Exploratory data analysis for industrial safety application, Sensors and Actuators B 131 (2008) 100-109.
[2] M. Falasconi, M. Pardo, M. Vezzoli, G. Sberveglieri, Cluster Validation for Electronic Nose data, Sensors and Actuators B: Chemical 125 (2) (2007) 596-606.
[3] M. Falasconi, A. Gutierrez, M. Pardo, G. Sberveglieri, S. Marco, A stability based validity method for fuzzy clustering, Pattern Recognition 43 (2010) 1292-1305.
[4] M. Falasconi, A. Gutierrez, B. Auffarth, G. Sberveglieri, S. Marco, Cluster Analysis of the Rat Olfactory Bulb Activity in Response to Different Odorants, Proc. of the 13th International Symposium on Olfaction & Electronic Nose, Brescia, 2009.
[5] S. Di Carlo, M. Falasconi, E. Sánchez, A. Scionti, G. Squillero, A. Tonda, Exploiting Evolution for an Adaptive Drift-Robust Classifier in Chemical Sensing, Applications of Evolutionary Computation, Lecture Notes in Computer Science 6024 (2010) 412-421, Springer Berlin/Heidelberg.
[6] M. Pardo, G. Sberveglieri, Comparing the Performance of Different Features in Sensor Arrays, Sensors and Actuators B 123 (2007) 437-443.
[7] M. Pardo, G. Sberveglieri, Random Forests and Nearest Shrunken Centroids for the Classification of Sensor Array Data, Sensors and Actuators B 131 (1) (2008) 93-99.