Development of Feature Selection Algorithms for High-Dimensional Binary Data

Javed, Kashif

DSpace Home
→
Natural Sciences
→
Thesis
→
View Item

Development of Feature Selection Algorithms for High-Dimensional Binary Data

Javed, Kashif

URI: http://142.54.178.187:9060/xmlui/handle/123456789/2911

Date: 2012

Abstract:

There has been a growing interest in representing real-life applications with data sets having binary-valued features. These data sets due to the advancements in computer and data management systems consist of tens or hundreds of thousands of features. In this dissertation, we investigate two problems in machine learning which have been relatively less studied for high-dimensional binary data. The first problem is to select a subset of features useful for supervised learning applications from the entire feature set and is known as the feature selection (FS) problem. The second problem is to compare two orderings of features induced by feature ranking (FR) algorithms and to determine which one is better. For the feature selection problem, we have proposed a new feature ranking measure termed as the diff-criterion. Its distinct attribute is that it estimates the usefulness of binary features by using their probability distributions. The diff-criterion has been evaluated against two well-known FS algorithms with four widely used clas- sifiers on six binary data sets on which it has achieved up to about 99% reduction in the feature set size. To further improve the performance, we have suggested a two-stage FS algorithm. The novelty of our two-stage algorithm is that the first stage provides the second stage with a reduced subset without losing valuable in- formation about the class. Two-stage feature selection used with the diff-criterion not only significantly improves the classification accuracy but also exhibits up to about 99% reduction in the feature set size. We have also compared our proposed FS algorithms against the winning entries of the “Agnostic Learning versus Prior Knowledge” challenge. The algorithms have shown results better or comparable to the winners of the challenge. For the problem of ranking features using FR algorithms, different FR algorithms estimate the importance of features with respect to the class variable differently thus generating different orderings. To determine which ordering is better, we propose a new evaluation method termed as feature ranking evaluation strategy (FRES). It uses the individual predictive power of features for estimating howAbstract correct is an ordering of features. We found that compared to Relief and mu- tual information algorithms our proposed diff-criterion generates the most correct orderings of binary features.

Show full item record