Abstract:
There has been a growing interest in representing real-life applications with data
sets having binary-valued features. These data sets due to the advancements in
computer and data management systems consist of tens or hundreds of thousands
of features. In this dissertation, we investigate two problems in machine learning
which have been relatively less studied for high-dimensional binary data. The first
problem is to select a subset of features useful for supervised learning applications
from the entire feature set and is known as the feature selection (FS) problem.
The second problem is to compare two orderings of features induced by feature
ranking (FR) algorithms and to determine which one is better.
For the feature selection problem, we have proposed a new feature ranking measure
termed as the diff-criterion. Its distinct attribute is that it estimates the usefulness
of binary features by using their probability distributions. The diff-criterion has
been evaluated against two well-known FS algorithms with four widely used clas-
sifiers on six binary data sets on which it has achieved up to about 99% reduction
in the feature set size. To further improve the performance, we have suggested a
two-stage FS algorithm. The novelty of our two-stage algorithm is that the first
stage provides the second stage with a reduced subset without losing valuable in-
formation about the class. Two-stage feature selection used with the diff-criterion
not only significantly improves the classification accuracy but also exhibits up to
about 99% reduction in the feature set size. We have also compared our proposed
FS algorithms against the winning entries of the “Agnostic Learning versus Prior
Knowledge” challenge. The algorithms have shown results better or comparable
to the winners of the challenge.
For the problem of ranking features using FR algorithms, different FR algorithms
estimate the importance of features with respect to the class variable differently
thus generating different orderings. To determine which ordering is better, we
propose a new evaluation method termed as feature ranking evaluation strategy
(FRES). It uses the individual predictive power of features for estimating howAbstract
correct is an ordering of features. We found that compared to Relief and mu-
tual information algorithms our proposed diff-criterion generates the most correct
orderings of binary features.