Modeling Multivariate Biomedical Data

Nawaz, Uzma

URI: http://142.54.178.187:9060/xmlui/handle/123456789/11817

Date: 2013

Abstract:

This study addresses the importance of having biomedical data analyzed by the statistical community in collaboration with experts in the biomedical field. Such data carry their own constraints and difficulties: they are privacy-sensitive, heterogeneous and voluminous. The mathematical understanding of patterns and structures, and the estimation procedures, may be fundamentally different from those for data collected in other fields. For this purpose, the complex genomic leukemia data set of Golub et al. (1999) was selected. This data set comes from a study of gene expression in two types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The training data set consists of 38 bone marrow samples, 27 taken from ALL patients (19 B-ALL and 8 T-ALL) and 11 taken from AML patients. Each gene expression value is the quantitative level of messenger RNA found in the cells. Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Exploring genomic data has therefore drawn on mathematical, statistical, and computational methods to discover meaningful genetic relationships from large-scale measurements of genes. This is a continuously growing area that is constantly being seeded with new approaches and interpretations, and most of the new material is easily accessible given a familiarity with basic genetics and multivariate statistics. The application of multivariate techniques needs a thorough study of the data in hand, and the primary objective of the study has been to “let the data speak for itself”. For the proper interpretation of these data, experimental and computational genomics need a firm grasp of statistical methodology, an aspect of prime importance that has been keenly taken into consideration in the study.
For the multivariate genomic leukemia data, an initial exploratory data analysis was performed using histograms and box plots in conjunction with one another. This revealed that such a data set is well fitted by extreme value distributions, a finding that, apart from the present study, has not been reported in the literature for this data type. The fitting of extreme value distributions opens many new avenues for researchers working with this type of data. Another outcome of the exploratory data analysis is the application of an appropriate transformation (the classical Box-Cox transformation) to deal with the sharp skewness of the data, rather than relying only on the traditionally used logarithmic transformation. The appropriate data transformation has been another high point in the application of PCA for visualizing the clusters present in the data set. Previously, PCA and more complicated techniques such as self-organizing maps (SOM) and support vector machines (SVM) have been applied, and new adaptations of these are continuously being tried, apart from the traditional clustering methodologies. Here the focus has not been merely on applying multivariate techniques to locate the clusters predefined by biological knowledge, but on the methodologically simple yet most appropriate technique to apply after a thorough look into the interior of the data set. The data set revealed a patterned correlation matrix which in itself explained the number and configuration of clusters. This provided the groundwork for applying PCA to the Box-Cox transformed data using the patterned correlation matrix as the interrelationship matrix. A comparison was also made with other interrelationship matrices. The resulting cluster structure was clear, with no misclassification in the configuration of clusters, and coincided exactly with the prior biological knowledge.
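The pipeline described above (Box-Cox transformation followed by PCA on a correlation matrix) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that genes are the variables and samples the observations, that one lambda is estimated per gene on a grid from -2 to 2 by maximising the normal-theory profile log-likelihood, and that only the first principal component is needed, found here by power iteration. All function names are hypothetical and only the Python standard library is used.

```python
import math
import random


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]  # limiting case lambda -> 0
    return [(v ** lam - 1.0) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Normal-theory profile log-likelihood of lambda (constants dropped)."""
    n = len(values)
    z = box_cox(values, lam)
    mean = sum(z) / n
    variance = sum((x - mean) ** 2 for x in z) / n
    return -0.5 * n * math.log(variance) + (lam - 1.0) * sum(math.log(v) for v in values)


def best_lambda(values, grid=None):
    """Grid-search estimate of lambda; the grid from -2 to 2 is an assumption."""
    grid = grid or [step / 10.0 for step in range(-20, 21)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


def standardise(column):
    """Centre a column and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def correlation_matrix(z_columns):
    """Pearson correlations between already standardised columns."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]


def leading_axis(matrix, iterations=200, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    rng = random.Random(seed)  # random start avoids a start orthogonal to the axis
    vec = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    image = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    return sum(a * b for a, b in zip(image, vec)), vec


def first_pc_scores(gene_columns):
    """Box-Cox each gene, then score every sample on the first correlation-matrix PC."""
    z_columns = [standardise(box_cox(col, best_lambda(col))) for col in gene_columns]
    eigenvalue, axis = leading_axis(correlation_matrix(z_columns))
    n = len(z_columns[0])
    scores = [sum(w * col[i] for w, col in zip(axis, z_columns)) for i in range(n)]
    return eigenvalue, scores
```

Calling first_pc_scores on a list of gene columns (one list of positive expression values per gene, with samples in the same order in every column) returns the leading eigenvalue and one score per sample. Plotting or sorting those scores is one simple way to look for the kind of cluster separation the abstract reports. The sign of the scores is arbitrary, as with any principal component.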
Therefore, as hoped, this introduction to prototypical methods for studying the data and interpreting them in the context of biological genomic knowledge has been a successful starting point. The next immediate issue in the study of the biomedical genomic data was finding genes that may be specific to one leukemia type or cluster. The initial exploratory data analysis exposed certain data values that were of prime biological significance and played a statistically significant role in specifying the genes for each defined cluster or leukemia type. As a result, a criterion was developed from the data set that classifies each gene into a single specific cluster, into two of the three clusters, or into all three clusters (the common genes). A classified data set of the most variant genes across all the samples was thus taken as a training data set. Based on the classified grouping, a linear discriminant analysis was successfully performed to find the discriminating genes for the specific leukemia type, with a 99.97% probability of correct classification. The collections of discriminating genes from the three clusters then needed to be checked against the previously found, externally valid cluster structure. PCA was therefore applied in a new dimension as a check on the discriminating genes. For the discriminating genes, the clusters formed by the sample expression profiles were expected to be distinctly clear if the genes were to be termed leukemia-type specific or cluster specific. The clusters formed were indeed very clearly distinguishable from one another, in contrast to the clusters of the sample expression profiles comprising the genes common to all. These presented no distinct cluster but rather one large cluster that did not show any difference between the biologically different leukemia types. The two major issues of the biomedical genomic data have thus been addressed successfully with an appropriate proposed model for the data type.
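The discriminant step can be sketched in a similar spirit. The abstract does not give the exact formulation used, so what follows is a generic illustration of Fisher's linear discriminant, simplified to two classes (the study itself works with three clusters), with equal priors assumed and a midpoint threshold on the projected class means. The function names and the small linear solver are hypothetical helpers, written with the Python standard library only.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, centre):
    """Sum of outer products of deviations from the class mean."""
    p = len(centre)
    scatter = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - c for x, c in zip(row, centre)]
        for i in range(p):
            for j in range(p):
                scatter[i][j] += dev[i] * dev[j]
    return scatter


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (aug[r][p] - tail) / aug[r][r]
    return x


def fit_lda(class0, class1):
    """Fisher's two-class discriminant: direction w and midpoint threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    within = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(s0, s1)]
    w = solve(within, [b - a for a, b in zip(m0, m1)])  # w = S_w^{-1} (m1 - m0)
    proj0 = sum(wi * a for wi, a in zip(w, m0))
    proj1 = sum(wi * b for wi, b in zip(w, m1))
    return w, 0.5 * (proj0 + proj1)  # equal priors assumed


def predict(w, threshold, row):
    """Return 1 if the projected point falls on the class-1 side, else 0."""
    return int(sum(wi * x for wi, x in zip(w, row)) > threshold)
```

Given two lists of feature vectors, fit_lda returns the discriminant direction and threshold, and predict assigns a new vector to class 0 or class 1. The share of training vectors assigned to their own class gives a resubstitution estimate of correct classification, which is optimistic compared with an estimate from held-out data.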
Thus the study has been based on methodologically simple yet appropriate statistical techniques for this data type, filling for the very first time the gap left by a statistical community, the Pakistani statistical community, in an internationally important field, the genomic biomedical field. The results are unequivocal: simplest is best! The approach can cluster genes, cell samples, or both. Yet the study has opened many new dimensions that need to be explored to establish the relationship between an experiment-based leukemia class and its subclass and a clinical outcome. Because the data have many dimensions, concentrating precisely on a few has been a difficult task, yet it has been accomplished.
