dc.description.abstract |
The Internet has touched every part of our lives, including our interactions and communications.
Printed books are being replaced by electronic books (e-books), personal and official correspondence
has shifted to electronic mail (e-mail), and news is now read online. This is generating
huge volumes of unstructured textual data that needs to be analyzed, filtered, and organized
automatically in order to harness its wealth of information for profitable gains. The worldwide
volume of e-mail is projected to reach 507 billion messages per day by 2013, of which
89% will be spam [Radicati (2009)]. In 2008, the cost of spam to businesses in terms of
hardware, software, and human resources was around $140 billion [Research (2008)].
Content-based text classification can automatically organize text documents into predefined
thematic categories. However, text classification is challenging in the modern Internet environment.
Firstly, text documents are sparsely represented in a very high-dimensional feature space (easily in
the hundreds of thousands of dimensions), making learning and generalization difficult. Secondly, due to the high cost
of labeling documents, researchers are forced to collect training data from sources different from
the target domain, which results in a distribution shift between training and test data. Thirdly,
although unlabeled data is easily available, its utilization in practical text classification for improved
performance remains a challenge. One important domain for text classification, which embodies
these challenges, is that of e-mail spam filtering. A typical e-mail service provider (ESP) serves
thousands to millions of users, each of whom can have their own topical interests and preferences
regarding spam and non-spam e-mails. Personalized service-side spam filtering provides a solution to this
problem; however, for such solutions to be practically usable they must be efficient, scalable, and
robust to distribution shifts.
In this thesis, we propose a robust text classification technique that combines local generative
models and global discriminative classifiers through the use of discriminative term weighting and
linear opinion pooling. Terms in the documents are assigned weights that quantify the discrimination
information they provide for one category over the others. These weights, called discriminative
term weights (DTW), also serve to partition the terms into two sets. An opinion pooling strategy
consolidates the discrimination information of terms in the sets to yield a two-dimensional feature
space, in which a discriminant function is learned to categorize the documents. In addition to a
supervised technique, we also develop two semi-supervised variants for personalizing the local and
global models using unlabeled data. We then generalize our technique into a classifier framework
that integrates different feature selection criteria, discriminative term weighting schemes, information
pooling strategies, and discriminative classifiers. We provide a theoretical comparison of
our proposed framework with existing generative, discriminative, and hybrid classifiers. Our text
classification framework is evaluated with five discriminative term weighting strategies, six opinion
consolidation techniques, and four discriminative classifiers. We employ nine real-world datasets
from different domains in our experimental evaluation, and the results are compared with four
benchmark text classification algorithms via accuracy and AUC values. Our framework is also
evaluated under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying
classifier size. Scalability of our spam filter is also demonstrated for personalized service-side spam
filtering.
Statistical significance tests confirm that our technique performs significantly better than the
compared techniques in both supervised and semi-supervised settings, and in global and personalized
spam filtering. In particular, it performs remarkably well when distribution shift is high
between training and test data, a phenomenon common in e-mail systems.
Additional contributions of this thesis include a systematic analysis of the spam filtering problem
and the challenges to effective global and personalized spam filtering at the service side. We formally
define key characteristics of e-mail classification such as distribution shift and gray e-mails, and
relate them to machine learning problem settings. The concept of term discrimination introduced
in this work has also found applications in text clustering, visualization, and feature extraction,
and it can be extended to keyword extraction and topic identification from textual documents. |
en_US |