PRECISE

Program for Research in Computing and Information Sciences and Engineering

Computational Statistics and Data Analysis Group

Edgar Acuña, edgar@math.uprm.edu, Coordinator
Daniel McGee, mcgee@cs.uprm.edu
Jaime Ramirez-Vick, jaime@ge.uprm.edu

Description

This group conducts research in two areas:

• Computational Statistics, where we study the explicit impact of computers on statistical methodology, including algorithms, computer graphics, computer-intensive inferential methods, expert systems, neural networks, parallel computing, and statistical databases.

• Statistical Methodology for data analysis, where we explore new data analysis strategies and methodologies, including classification, data exploration, density estimation, design of experiments, pattern recognition/image analysis, robust procedures, comparison of statistical methodology, and simulation of experiments.

Current Participation in Competitive Research Grants

“Combining Classifiers Involving Kernel Density Estimates and Gaussian Mixtures”, sponsored by the Office of Naval Research (PI: Edgar Acuña), 2003-2005.

“Water Quality and Marine System Indicators: Development of a Statistical Model for an Integral Assessment”, sponsored by NOAA (J. González, D. McGee (Co-PI), and A. San-Juan (Co-PI)).

Strategic R&D with other Institutions

Medical University of South Carolina: Barbara Tilley, Zhen Zhang
Navy Laboratory at Virginia: David Marchette and Jeffrey Solka.

Research Summaries

Improvement of Supervised Pattern Recognition Techniques - Dr. Edgar Acuña

The research deals with the use of computer-intensive methods in statistics to improve pattern recognition techniques. Computer-intensive methods involve three aspects: first, the use of powerful computers, including parallel computers; second, the development of efficient algorithms to carry out the procedures; and third, efficient programming that implements the algorithms accurately while minimizing running time. Pattern recognition has many applications, but we are most interested in engineering and biomedical (bioinformatics) applications. The expected research outcomes are:

• Feature selection for nonparametric classifiers. This was the subject of the master's thesis of my student Frida Coaquira, who will continue working on this topic in her doctoral thesis.

• Improvement of classifiers based on Gaussian mixtures through the use of bagging and boosting. Combining classifiers has many applications, even in unsupervised pattern recognition, and involves parallel computation. This was addressed in the master's thesis of my student Luis Daza, who will continue working on this topic in his doctoral thesis.

• Reduction of dimensionality using partial least squares. We are combining partial least squares with logistic regression to reduce dimensionality. This technique will be an alternative to the overused principal component method. This topic will be the subject of the doctoral thesis of my student Jose Vega, who began his research in January 2003.

• Application of statistical pattern recognition to microarray data. We are exploring the application of nonparametric classifiers and new clustering algorithms to gene expression data obtained using microarrays. These themes will be addressed in the master's theses of my students Marggie Gonzalez and Santiago Velasco. They have already started their research and expect to finish by summer 2004.

• Application of parallel computation to pattern recognition. The main disadvantage of the nonparametric techniques that we use is that they require a lot of computing time. However, most of the algorithms needed can be parallelized. We have already seen excellent results working in a parallel environment of 8 processors. My student Elio Lozano is working on this topic in his master's thesis, which he will defend in July 2003. Elio will extend this research in his doctoral thesis.

• Visualization techniques for microarray data. The preprocessing of microarray data and the analysis of the images obtained from them are the research topic of my master's student Caroline Rodriguez.
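
As an illustration of the bagging idea mentioned in the outcomes above, the following is a minimal sketch and not the group's implementation. It assumes one-dimensional features, a Gaussian kernel with a fixed bandwidth, and a plain majority vote over bootstrap resamples. The function names, the bandwidth value, and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def gaussian_kde(points, x, bandwidth):
    """Gaussian kernel density estimate at x from 1-D sample points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def kde_classify(train, x, bandwidth=0.5):
    """Assign x to the class maximizing (class prior) * (estimated class density)."""
    by_class = {}
    for value, label in train:
        by_class.setdefault(label, []).append(value)
    n = len(train)
    scores = {
        label: (len(values) / n) * gaussian_kde(values, x, bandwidth)
        for label, values in by_class.items()
    }
    return max(scores, key=scores.get)


def bagged_kde_classify(train, x, n_bags=25, bandwidth=0.5, seed=0):
    """Majority vote over KDE classifiers fit on bootstrap resamples of train."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_bags):
        # Bootstrap resample: draw len(train) observations with replacement.
        bag = [rng.choice(train) for _ in train]
        votes[kde_classify(bag, x, bandwidth)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class "a" clusters near 0, class "b" near 5.
train = [(-0.4, "a"), (0.1, "a"), (0.3, "a"), (0.8, "a"),
         (4.2, "b"), (4.9, "b"), (5.3, "b"), (5.6, "b")]

print(bagged_kde_classify(train, 0.2))
print(bagged_kde_classify(train, 5.0))
```

Each bootstrap classifier is independent of the others, which is one reason ensembles of this kind lend themselves to the parallel computation described above.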

Bioinformatics - Dr. Daniel McGee

The research is conducted in coordination with the BioInformatics Department of the Medical University of South Carolina and concentrates on the application and development of bioinformatics techniques for medical databases. In particular, neural networks, cluster analysis, genetic algorithms, principal component analysis, and standard statistical methodologies will be used and improved upon in this setting.

Goals:

• Make significant improvements to the training process for neural networks when used on medical and educational databases
• Compare the effectiveness of neural networks with traditional methods when applied to medical databases
• Create an overall system for analyzing medical databases that will determine the minimum necessary dimensionality per record, identify appropriate endpoints with which records may be associated, and rapidly train neural networks that estimate the probability that a record should be associated with a given endpoint

Bioinformatics - Dr. Jaime Ramirez-Vick

Development of an expert system for microarray statistical data analysis.

An essential source of genetic information for drug and functional genomics research comes from data generated by high-throughput gene screening technologies. Of all these technologies, DNA microarrays are the standard source of these data. For this reason, a major interest of our research is the development of bioinformatics tools to extract information from this type of data.

The massive data sets generated by DNA microarrays usually consist of mRNA levels of two distinct cell populations (e.g., diseased vs. healthy, drug-treated vs. control). The first step in the analysis of these data is to determine which genes are differentially expressed. We are interested in exploring the different alternatives for the statistical determination of differential gene expression. These alternatives will be incorporated into an expert system that will choose the most appropriate method based on the type of microarray (e.g., oligonucleotide, cDNA) and the type of experiment.
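
One commonly used alternative for flagging differentially expressed genes is a per-gene two-sample t-statistic. The sketch below uses Welch's unequal-variance form as an illustration only; it is not necessarily one of the methods the expert system will include. The gene names and expression values are hypothetical, and a real analysis would also need normalization and a multiple-testing correction, both omitted here.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    std_err = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / std_err


def rank_genes(expression):
    """Return gene names sorted by |t|, strongest evidence of difference first."""
    scores = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Hypothetical log-expression values per gene: (diseased, healthy).
expression = {
    "gene_1": ([8.1, 8.4, 7.9, 8.3], [5.0, 5.2, 4.9, 5.1]),
    "gene_2": ([6.0, 6.2, 5.9, 6.1], [6.1, 5.9, 6.0, 6.2]),
    "gene_3": ([4.0, 4.5, 3.8, 4.2], [5.0, 5.4, 4.9, 5.3]),
}

for gene in rank_genes(expression):
    print(gene, round(welch_t(*expression[gene]), 2))
```

A positive statistic indicates higher expression in the first population and a negative one indicates lower expression; ranking by absolute value treats both directions alike.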

Probabilistic Inference of Gene Function and Regulation

More recent approaches to the study of gene function and regulation are based on the extraction of fundamental patterns of gene expression inherent in the data using functional classification methods. An improvement to gene functional classification or clustering has been the use of simple network models to infer regulatory interactions between genes, an approach known as reverse engineering (or gene network inference). We are interested in ways of integrating knowledge on gene function with microarray gene expression data (e.g., transcriptome, proteome) to generate new biological knowledge. To do this we are exploring the use of Bayesian networks because of their ability to model stochasticity, incorporate prior knowledge, and handle hidden (unknown) variables and missing data in a principled way. We are currently exploring two applications: the discovery of (1) genetic regulatory network models and (2) signaling network models.
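
To make the treatment of hidden variables and missing data concrete, here is a minimal sketch of exact inference in a very small Bayesian network: one unobserved binary regulator with binary target genes that are conditionally independent given the regulator. The structure, the probabilities, and the function name are assumptions for illustration, not a model from this research.

```python
def posterior_regulator(prior_on, p_target_on_given, observed):
    """Return P(regulator = on | observed targets) by enumeration.

    prior_on: prior probability that the hidden regulator is on.
    p_target_on_given: {target: (P(T=on | R=on), P(T=on | R=off))}.
    observed: {target: True/False}. Targets missing from this dict are
        marginalized out, which here means they contribute a factor of 1.
    """
    joint_on, joint_off = prior_on, 1.0 - prior_on
    for target, is_on in observed.items():
        p_on, p_off = p_target_on_given[target]
        joint_on *= p_on if is_on else 1.0 - p_on
        joint_off *= p_off if is_on else 1.0 - p_off
    return joint_on / (joint_on + joint_off)


# Hypothetical conditional probability tables for two target genes.
cpt = {"target_1": (0.9, 0.2), "target_2": (0.8, 0.1)}

# target_2 was not measured: it is simply left out of the evidence.
print(posterior_regulator(0.3, cpt, {"target_1": True}))
print(posterior_regulator(0.3, cpt, {"target_1": True, "target_2": True}))
```

The prior on the regulator is where existing knowledge about gene function would enter, and a missing measurement requires no imputation because it is summed out of the joint distribution.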

Publications

Journals

E. Acuña and A. Rojas, “Bagging classifiers based on kernel density estimators”. Proceedings of the International Conference on New Trends in Computational Statistics with Biomedical Applications, August 2001, pp. 343-350 (an extended version of this paper will appear in the Journal of the Japanese Society of Computational Statistics by the end of this year).

Acuña, E., (2002) Combining Classifiers based on Kernel density classifiers and Gaussian mixtures. Computing Science and Statistics. Vol 33.

Acuña, E., Rojas, A., and Coaquira, F. (2002). The Effect of Feature Selection on Combining Classifiers Based on Kernel Density Estimates. In K. Jajuga, A. Sokołowski, H.-H. Bock (Eds.). Classification, Clustering and Data Analysis. Springer, Heidelberg, 161-168.

McGee, D., Lackland, D. et al. (2003). Trends in Blood Pressure Treatment: Some observations based on the Framingham study. Cardiovascular Reports and Reviews. (In Press).

Acuña, E. (2003) Combining classifiers based on kernel density estimators. Submitted to the Journal of Statistical Computation and Simulation.

Acuña, E. (2003) Filters and wrappers for supervised classification. Submitted to Communications in Statistics: Simulation and Computation.

Refereed Conferences (with proceedings)

Acuña, E., (2002) Combining Classifiers based on Kernel density classifiers and Gaussian mixtures. Proceedings of the Interface 2002 Computing Science and Statistics. Vol 33.

Lozano, E. and Acuña, E. (2003) Parallel computation of kernel density estimates classifiers and their ensembles. To appear in Proceedings of the Conference in Computers, Communications and Control 2003. July 2003.

Acuña, E., (2003) A comparison of filters and wrappers for feature selection in supervised classification. Proceedings of the Interface 2003 Computing Science and Statistics. Vol 34.

Daza, L. and Acuña, E. (2003) Combining classifiers based on Gaussian Mixtures. To appear in Proceedings of the Conference in Computers, Communications and Control 2003. July 2003.

Acuña, E. and Coaquira, F. (2003) On the performance of ensembles based on kernel density estimation. To appear in Proceedings of the Conference in Computers, Communications and Control 2003. July 2003.

Acuña, E., Coaquira, F., and Gonzalez, M. (2003) A comparison of feature selection procedures for classifiers based on kernel density estimation. To appear in Proceedings of the Conference in Computers, Communications and Control 2003. July 2003.

McGee, D., and Maldonado. (2003). Using coefficients of backpropagating neural networks to identify change points. To appear in Proceedings of the Conference in Computers, Communications and Control 2003. July 2003.

Book Chapters/Articles in Collections

Acuña, E., “Análisis Estadístico de Datos usando MINITAB para Windows”, Segunda Edición. John Wiley and Sons, New York (2002).
