Abstract:
This study addresses the importance of having biomedical data analyzed by the statistical
community in collaboration with experts in the biomedical field. The
data has its own particular constraints and difficulties, being privacy-sensitive,
heterogeneous and voluminous. The mathematical understanding of patterns and
structures and estimation procedures may be fundamentally different from those of data
collected in other fields. For this purpose, the complex leukemia genomic data
of Golub et al. (1999) was selected for the study. This dataset comes from a study of
gene expression in two types of acute leukemia, acute lymphoblastic leukemia (ALL)
and acute myeloid leukemia (AML). The training data set consisted of 38 bone marrow
samples, 27 of which were taken from ALL patients (19 B-ALL and 8 T-ALL) and 11 of
which were taken from AML patients. Each gene expression value is the quantitative level of
messenger RNA found in the cells. Understanding the genetic underpinnings of disease is
important for screening, treatment, drug development, and basic biological insight. Thus
exploring genomic data has drawn on mathematical, statistical, and computational
methods to discover meaningful genetic relationships from large-scale measurements of
genes. This is a continuously growing area that is constantly being seeded with new
approaches and interpretations. Most of this new material is easily accessible given a
familiarity with basic genetics and multivariate statistics. The application of multivariate
techniques needs a thorough study of the data in hand, and the primary objective of the
study has been to “let the data speak for itself”. For the proper interpretation of these
data, researchers in experimental and computational genomics need a firm grasp of statistical
methodology, an aspect of prime importance that has been keenly taken into consideration in this
study. For the multivariate leukemia genomic data, an initial exploratory
data analysis has been performed using histograms and
box plots in conjunction with one another. This revealed that such a data set is
well fitted by extreme value distributions, a finding which, apart from the present study, has
not been reported in the literature for this data type. The fitting of extreme value distributions
opens many new avenues of research on this data type.
Another outcome of the exploratory data analysis is the application of an appropriate
transformation (the classical Box-Cox transformation) to deal with the sharp skewness of the
data, rather than relying only on the traditionally used logarithmic transformation. The
appropriate data transformation has been another high point in the application of principal
component analysis (PCA) for visualizing clusters present in the data set. Previously, PCA and
more complicated techniques such as self-organizing maps (SOM) and support vector machines
(SVM) have been applied, and new adaptations of these are continuously
being tried alongside the traditional clustering methodologies. Here the focus
has not been merely on applying multivariate techniques to locate the clusters
predefined by biological knowledge; rather, it is on finding the methodologically simple yet
most appropriate technique to apply after a thorough look into the interior of the data
set. The data set revealed a patterned correlation matrix which in itself explained the
number and configuration of clusters. This provided the groundwork for applying
PCA to the Box-Cox transformed data, using the patterned correlation matrix as the
interrelationship matrix. A comparison has also been made with other interrelationship
matrices. The resulting cluster structure was clear, with no misclassification
in the configuration of clusters, and coincided exactly with the prior biological
knowledge. Thus, as hoped, this introduction to prototypical methods for
studying the data and interpreting it in the context of biological genomic knowledge has
been a successful starting point. The next immediate issue in the study of the
biomedical genomic data was finding genes that may be specific to one leukemia type or
cluster. The initial exploratory data analysis exposed certain data values that were of
prime biological significance and played a statistically significant role in the specification
of genes for each defined cluster or leukemia type. As a result, a criterion was developed
from the data set, classifying each gene into a single specific cluster, two of the three
clusters, or all three clusters (the common genes). Thus a classified data set of the
most variant genes across all the samples was taken as a training data set. Based on the
classified grouping, a linear discriminant analysis was successfully performed to find the
discriminating genes for each specific leukemia type, with a 99.97% probability of correct
classification. The collections of discriminating genes from the three clusters
then needed to be checked against the previously found, externally valid cluster
structure. PCA was therefore applied in a new way, as a check on the discriminating
genes. For the discriminating genes, the clusters formed by the sample expression profiles
were expected to be distinctly clear if the genes were to be termed leukemia-type specific or
cluster specific. The clusters formed were indeed very clearly distinguishable from one
another, in contrast to the clusters of the sample expression profiles comprising the
common genes. These showed no distinct clusters, but rather one large
cluster that did not show any difference between the biologically different leukemia types.
The two major issues of the biomedical genomic data have thus been addressed successfully
with an appropriate proposed model for the data type. The study has been based on
methodologically simple yet appropriate statistical techniques for such a data type, filling
a gap left for the statistical community, and the Pakistani statistical community in particular,
for the very first time in such an internationally important field as genomic biomedicine.
The results are unequivocal: simplest is best! The methods can cluster genes, cell
samples, or both. The study has also opened up many new dimensions that need to be
explored in order to establish the relationship between an experiment-based leukemia class, its
subclass and a clinical outcome. Because the data has many dimensions, concentrating
precisely on a few of them has been a difficult task, yet it has been accomplished.
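
As an illustration of the transformation step described in this abstract, the following is a minimal Python sketch of a Box-Cox transformation whose parameter lambda is chosen by a grid search over the profile log-likelihood. It is not the study's own code. The synthetic log-normal values merely stand in for skewed expression measurements, and the function names, the grid range and the random seed are all assumptions made for the example.

```python
import math
import random


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        # lambda = 0 is the limiting case: the plain logarithm
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Normal profile log-likelihood of lambda, up to an additive constant."""
    n = len(values)
    y = box_cox(values, lam)
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n  # maximum-likelihood variance
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + jacobian


def fit_lambda(values, grid=None):
    """Pick the lambda on a grid that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 201)]  # assumed range [-2, 2]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


def skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic, strongly right-skewed data (an assumption, not the leukemia data).
rng = random.Random(0)
expression = [rng.lognormvariate(5.0, 1.0) for _ in range(500)]

lam_hat = fit_lambda(expression)
transformed = box_cox(expression, lam_hat)

print("estimated lambda:", lam_hat)
print("skewness before:", skewness(expression))
print("skewness after:", skewness(transformed))
```

When the estimated lambda is close to zero, the transformation is essentially the logarithm; a clearly different estimate would suggest that the logarithmic transformation alone is not the best choice for the data at hand.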