Discriminative Clustering Algorithms for Document Understanding, Tag Recommendation, and Web Surfer Behavior Prediction

Hassan, Malik Tahir

URI: http://142.54.178.187:9060/xmlui/handle/123456789/2469

Date: 2013

Abstract:

The Web is a goldmine of knowledge, but realizing its value requires effective and efficient discovery algorithms. Information on the Web ranges from textual documents to social content to usage patterns. Such information is huge and dynamic in nature, making useful knowledge discovery a challenging task. In recent years, data mining techniques have been used successfully for various knowledge discovery tasks. Data clustering, in particular, has two key advantages for Web mining: (1) it is an unsupervised technique that does not require labeled data; (2) it is a conceptually simple task that can produce readily understandable patterns. In this thesis, we develop and evaluate discriminative clustering algorithms for textual document understanding, social content tag recommendation, and Web surfing behavior analysis. Our discriminative clustering algorithms are efficient and semantically rich for effective knowledge discovery on the Web.

For textual document clustering and understanding, we develop and evaluate a new algorithm called CDIM (Clustering via Discrimination Information Maximization). CDIM is an iterative partitional clustering algorithm that maximizes the sum of discrimination information provided by documents in the collection. A key advantage of CDIM is that its clusters are describable by their highly discriminating terms, or equivalently, their highly topically related terms. This is achieved by incorporating into the clustering algorithm statistically sound measures of discrimination that have been shown to convey semantic relatedness of terms to topics. A hierarchical version of CDIM is also presented. CDIM’s superior performance is demonstrated on benchmark datasets in comparison with current state-of-the-art text clustering algorithms.

For social content tag recommendation, we develop a model of contents and tags using CDIM to recommend tags for new content.
User textual posts (contents) are clustered to yield a list of discriminative terms for each cluster. Likewise, textual tagging history is clustered to produce another list of terms. These lists are combined with the user’s personal tagging history, if available, to produce the final tag recommendations. Our approach is evaluated on data from the social bookmarking system Bibsonomy. We observe that recommendation accuracy can be improved by updating the recommendation model from time to time. To realize this efficiently, we build a self-optimizing version of our tag recommendation system. The self-optimization strategy decides when and how to update the system by solving a nonlinear optimization problem, constrained by the available time, to determine the best clustering parameters (number of clusterable records and number of clusters). A better alternative to rebuilding the complete clustering models is to correct clusters that are becoming outdated and contributing to errors. We achieve this by developing a self-calibration strategy for our system, which is shown to be a better and more practical option. We also analyze personalized and non-personalized versions of our tag recommendation system. Besides our discriminative clustering based tag recommendation algorithm, the performance of other algorithms, including PITF (Pairwise Interaction Tensor Factorization), FolkRank, and adapted PageRank, is analyzed on our proposed personalization groups (beginners, followers, and leaders) in folksonomies.

For Web surfer behavior analysis, we find patterns of Web navigation paths among users and then develop discriminative and generative models for predicting future paths of users. Navigation patterns or behaviors are discovered by adapting the k-modes clustering algorithm with a new similarity measure appropriate for comparing navigation paths and a new method for cluster initialization.
Our experiments, conducted on two real-world datasets, demonstrate that predictions based on navigation behaviors are not necessarily better, because of the diversity of behaviors on the Web. Likewise, it is found that including the start time of navigation sessions in prediction models has little effect on accuracy but significantly hurts efficiency. On the other hand, predictions based on cluster centroids are very cost-efficient without significant loss in accuracy.

This thesis demonstrates the usefulness and versatility of clustering algorithms for Web mining, and highlights the importance of semantics in textual document analysis and of self-management in practical Web systems. Directions for future work include semantic enhancements to CDIM and the development of self-management strategies for data mining applications.
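
The centroid-based prediction idea mentioned for Web surfer behavior can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not specify the new similarity measure or the initialization scheme, so the sketch assumes plain k-modes with position-wise matching as the similarity and the first k distinct paths as initial modes. The session data, page names, and function names are all hypothetical.

```python
from collections import Counter


def matching_similarity(path_a, path_b):
    """Count positions at which two paths visit the same page (assumed stand-in measure)."""
    return sum(a == b for a, b in zip(path_a, path_b))


def column_modes(paths):
    """Most frequent page at each position: the k-modes analogue of a centroid."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*paths))


def k_modes(paths, k, max_iter=20):
    """Plain k-modes on fixed-length paths, seeded with the first k distinct paths."""
    modes = list(dict.fromkeys(paths))[:k]  # naive initialization, an assumption
    assignment = None
    for _ in range(max_iter):
        # Assign every path to the mode it matches at the most positions.
        new_assignment = [
            max(range(len(modes)), key=lambda c: matching_similarity(p, modes[c]))
            for p in paths
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the modes will not change either
        assignment = new_assignment
        # Recompute each mode from its current members.
        for c in range(len(modes)):
            members = [p for p, a in zip(paths, assignment) if a == c]
            if members:
                modes[c] = column_modes(members)
    return modes, assignment


def predict_next(prefix, modes):
    """Predict the next page from the mode that best matches the observed prefix."""
    best = max(modes, key=lambda m: matching_similarity(prefix, m))
    return best[len(prefix)] if len(prefix) < len(best) else None


# Hypothetical navigation sessions, each a fixed-length tuple of page names.
sessions = [
    ("home", "news", "sports"),
    ("shop", "cart", "pay"),
    ("home", "news", "weather"),
    ("home", "news", "sports"),
    ("shop", "cart", "pay"),
    ("shop", "deals", "pay"),
]

modes, assignment = k_modes(sessions, k=2)
print(modes)                                   # [('home', 'news', 'sports'), ('shop', 'cart', 'pay')]
print(predict_next(("shop", "deals"), modes))  # pay
```

Prediction here compares a new prefix against k modes rather than against every stored session, which is the kind of cost saving the abstract attributes to centroid-based prediction. The result depends on the seeding, which is presumably one reason the thesis proposes its own initialization method.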
