Abstract:
The Web is a goldmine of knowledge, but exploiting it requires effective and efficient discovery
algorithms. Information on the Web ranges from textual documents to social content to usage
patterns. Such information is huge and dynamic in nature, making useful knowledge discovery a
challenging task. In recent years, data mining techniques have been utilized for various knowledge
discovery tasks with success. Data clustering, in particular, has two key advantages for Web mining:
(1) it is an unsupervised technique that does not require labeled data; (2) it is a conceptually
simple task that can produce readily understandable patterns. In this thesis, we develop and
evaluate discriminative clustering algorithms for textual document understanding, social content
tag recommendation, and Web surfing behavior analysis. Our discriminative clustering algorithms
are efficient and semantically rich for effective knowledge discovery on the Web.
For textual document clustering and understanding, we develop and evaluate a new algorithm
called CDIM (Clustering via Discrimination Information Maximization). CDIM is an iterative
partitional clustering algorithm that maximizes the sum of discrimination information provided by
documents in the collection. A key advantage of CDIM is that its clusters are describable by their
highly discriminating terms, or equivalently, their highly topically-related terms. This is achieved
by incorporating statistically sound measures of discrimination that have been shown to convey
semantic relatedness of terms to topics into the clustering algorithm. A hierarchical version of
CDIM is also presented. CDIM’s superior performance is demonstrated on benchmark datasets in
comparison with current state-of-the-art text clustering algorithms.
For social content tag recommendation, we develop a model of contents and tags using CDIM
to recommend tags for new content. User textual posts (contents) are clustered to yield a
list of discriminative terms for each cluster. Likewise, textual tagging history is clustered to produce
another list of terms. These lists are combined with the user’s personal tagging history, if available, to
produce the final tag recommendations. Our approach is evaluated on data from the social
bookmarking system Bibsonomy. We observe that the recommendation accuracy can be improved by
updating the recommendation model from time to time. To realize this in an efficient manner, we
build a self-optimizing version of our tag recommendation system. The self-optimization strategy
decides when and how to update the system by solving a nonlinear optimization problem, constrained
by the available time, to determine the best clustering parameters (number of clusterable records
and number of clusters). A better alternative to rebuilding the complete clustering models is to
correct only those clusters that are becoming outdated and contributing to errors. We achieve this
by developing a self-calibration strategy for our system, which is shown to be a better and more
practical option. We also perform an analysis of personalized and non-personalized versions of
our tag recommendation system. Besides our discriminative clustering-based tag recommendation
algorithm, the performance of other algorithms, including PITF (Pairwise Interaction Tensor
Factorization), FolkRank, and adapted PageRank, is analyzed on our proposed personalization groups
(beginners, followers, and leaders) in folksonomies.
For Web surfer behavior analysis, we find patterns of Web navigation paths among users and
then develop discriminative and generative models for predicting future paths of users. Navigation
patterns or behaviors are discovered by adapting the k-modes clustering algorithm with a new
similarity measure appropriate for comparing navigation paths and a new method for cluster
initialization. Our experiments, conducted on two real-world datasets, demonstrate that predictions
based on navigation behaviors are not necessarily better, because of the diversity of behaviors on the
Web. Likewise, we find that including the start time of navigation sessions in prediction models
has little effect on accuracy but significantly degrades efficiency. On the other hand, predictions
based on cluster centroids are very cost-efficient without significant loss in accuracy.
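To make the k-modes step mentioned above concrete, the following is a minimal sketch of the standard k-modes procedure on categorical sequences. It is not the adapted algorithm of this thesis: the new similarity measure and the cluster initialization method are not described in this abstract, so the sketch assumes simple matching dissimilarity, hand-picked initial modes, and fixed-length paths. The page names and the helper names (mismatch, mode_of, k_modes) are hypothetical.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: count of positions where a and b differ.
    Assumption: stands in for the thesis's own path similarity measure."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(rows):
    """Position-wise most frequent value across a list of equal-length paths."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(rows, init_modes, max_iter=20):
    """Alternate between assigning paths to the nearest mode and updating modes."""
    modes = list(init_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda c: mismatch(r, modes[c]))
            for r in rows
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for c in range(len(modes)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old mode if a cluster becomes empty
                modes[c] = mode_of(members)
    return labels, modes

# Hypothetical navigation paths, one tuple of visited pages per session.
paths = [
    ("home", "news", "sports"),
    ("home", "news", "weather"),
    ("home", "news", "sports"),
    ("home", "shop", "cart"),
    ("home", "shop", "checkout"),
    ("home", "shop", "cart"),
]
# Assumption: initial modes are chosen by hand rather than by the thesis's method.
labels, modes = k_modes(paths, init_modes=[paths[0], paths[3]])
```

In this sketch each cluster is summarized by its mode, which plays the role of a centroid for categorical data. Comparing a new session against a handful of modes rather than against every stored path is one way to see why centroid-based prediction can be cheap.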
This thesis demonstrates the usefulness and versatility of clustering algorithms for Web mining,
and highlights the importance of semantics in textual document analysis and self-management in
practical Web systems. Directions for future work include semantic enhancements to CDIM and
development of self-management strategies for data mining applications.