Abstract:
Real-world data is now expanding so rapidly that accumulating and
processing it has become a huge task. This growth is exponential, and
when Data Mining (DM) tools are applied to analyze such enormous
volumes of data, the algorithms become time-consuming and expensive.
One of the most important tasks in DM for analyzing data is
classification. Classification is a DM function that predicts the
class of a sample by building a classifier, or prediction model, on the
basis of previously collected samples with known classes. The dataset
used for classification is supervised (labeled) data with different
features, or attributes.
During classification, some features can be of great significance while
others may be irrelevant or redundant. Feature selection reduces the
learning and prediction time of classification algorithms, because no
time is spent processing the features that are not selected. Feature
selection also provides insight into the nature of the problem to be
solved. There is therefore a vital need to remove irrelevant and
redundant features before building a classifier. This research
addresses the problem of
feature subset selection (FSS), which chooses the features/attributes
that are of significant value for the classifier to be built. Keeping
only these significant features reduces the data, which in turn helps
to improve the accuracy and reliability of big data analytics. This
reduction also increases the accuracy and reliability of decision
support systems, especially critical health-related decision support
systems. Other application areas include sentiment analysis, opinion
mining, drug discovery, tumor detection, stroke detection, and many
others.
The first phase of this research has the novelty of treating the FSS
problem as a multi-objective problem and solving it with two
metaheuristic techniques: the Non-dominated Sorting Genetic Algorithm
II (NSGA-II) and Multi-objective Particle Swarm Optimization adapted
to solve FSS as a binary problem (BMOPSO). The experimental results
demonstrate the importance of treating FSS as a multi-objective
problem, as this approach outperforms current FSS techniques not only
in classifier accuracy but also in the reduction of the number of
features. The second phase of this research explores Ant Colony
Optimization (ACO), another metaheuristic technique, for FSS. To
further refine the search, the significance of each feature is
measured using the
minimum Redundancy Maximum Relevance (mRMR) technique before
applying ACO. The results show that the proposed technique performs
better than other existing biologically inspired algorithms for FSS.
Both phases of this research use different real-world datasets taken
from the UCI Machine Learning Repository, and k-fold cross-validation
is used to further validate the results of the proposed techniques.
Feature subset selection primarily deals with the data representation
for the classification process; it reduces the computational
complexity and improves prediction accuracy.
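
As an illustrative sketch only (not the implementation used in this
research), the multi-objective view of FSS can be pictured as follows.
Each candidate subset is a binary mask over the features, and two
objectives are minimized together: classification error and the number
of selected features. The error values below are made up for
illustration, and the helper names (objectives, dominates, pareto_front)
are hypothetical. A search method such as NSGA-II or BMOPSO would
generate and evolve the masks; this sketch only shows the dominance
comparison that such methods rely on.

```python
from typing import List, Tuple

Mask = Tuple[int, ...]          # 1 = feature kept, 0 = feature dropped
Objectives = Tuple[float, int]  # (classification error, number of features)


def objectives(mask: Mask, error: float) -> Objectives:
    """Pair a (hypothetical) classifier error with the subset size."""
    return (error, sum(mask))


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up error rates for four candidate subsets of a 5-feature dataset.
candidates = [
    ((1, 1, 1, 1, 1), 0.10),  # all features
    ((1, 0, 1, 0, 0), 0.10),  # same error with only two features
    ((0, 0, 1, 0, 0), 0.25),  # one feature, higher error
    ((1, 1, 0, 1, 0), 0.30),  # three features, highest error
]
scored = [objectives(mask, err) for mask, err in candidates]
front = pareto_front(scored)
print(front)
```

In this toy case the two-feature and one-feature subsets remain on the
front: neither is better than the other on both objectives, so the
choice between them is a trade-off between accuracy and subset size.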