Document Clustering based on Semantic Notions

Rafi, Muhammad

dc.contributor.author	Rafi, Muhammad
dc.date.accessioned	2019-05-30T05:35:15Z
dc.date.accessioned	2020-04-11T15:35:26Z
dc.date.available	2020-04-11T15:35:26Z
dc.date.issued	2017
dc.identifier.govdoc	16948
dc.identifier.uri	http://142.54.178.187:9060/xmlui/handle/123456789/5044
dc.description.abstract	The exponential growth of electronic documents, in both proprietary and public information systems, poses new challenges in finding relevant information in these large repositories. Document clustering is a specialized technique that has found its niche in effectively browsing, filtering, managing and summarizing these collections. The document clustering process has three distinct steps: (i) document representation, (ii) computation of pair-wise document similarity, and (iii) application of a clustering algorithm. Document clustering methods are very sensitive to the document representation scheme. Conventionally, document representations are based on extracting simple features, such as terms, n-grams, frequent words or sequences, that can be used as meta-descriptors for the documents. These features reduce the dimensionality of the problem but fail to capture the semantics of the text in the transformed compact representation, and they ignore the order of and relationships among words and features. Documents written in human languages generally carry a context, and the use of words depends mainly on that context. Motivated by this, a novel document representation scheme is proposed that first extracts lexical chains from the documents and then exploits the topic map structure for those lexical chains. The scheme takes advantage of the lexical cohesion structure along with topic map relationships to obtain a semantics-based representation of a document. Topic Maps (TM) is an international standard for the codification of knowledge. Moreover, a good similarity measure is essential for the clustering task. The similarity function should make use of the semantic relationships among features (lexical topics) to provide a viable clue for the relatedness of any pair of documents. A similarity function based on lexical chain similarity and frequent common tree patterns extracted from the topic maps of documents is defined. Hence these patterns (hierarchical lexical topics with different granularity) also inherently capture semantics in the similarity calculation. An extensive set of experiments on four publicly available document datasets is performed. Evaluation measures such as F-score, purity and entropy show that the proposed approach outperforms traditional document clustering approaches.	en_US
dc.description.sponsorship	Higher Education Commission, Pakistan	en_US
dc.language.iso	en_US	en_US
dc.publisher	National University of Computer & Emerging Sciences, Islamabad	en_US
dc.subject	Computer Science	en_US
dc.title	Document Clustering based on Semantic Notions	en_US
dc.type	Thesis	en_US
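
The three steps named in the abstract (representation, pair-wise similarity, clustering) and the purity measure can be illustrated with a minimal sketch. It deliberately uses the conventional term-frequency baseline that the abstract contrasts against, not the thesis's lexical-chain and topic-map method, whose details are not given in this record. The function names, the single-link threshold grouping, and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Step (i): bag-of-words term-frequency representation (baseline only)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Step (ii): cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_clusters(docs, threshold=0.3):
    """Step (iii): single-link grouping (an assumed, simple stand-in).

    Any two documents whose similarity meets the threshold end up
    in the same cluster, directly or through a chain of neighbours.
    """
    vectors = [term_vector(d) for d in docs]
    labels = list(range(len(docs)))  # start with one cluster per document
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[j], labels[i]
                # Merge the whole cluster of j into the cluster of i.
                labels = [new if lab == old else lab for lab in labels]
    return labels


def purity(labels, truth):
    """Fraction of documents that fall in their cluster's majority class."""
    matched = 0
    for cluster in set(labels):
        members = [t for lab, t in zip(labels, truth) if lab == cluster]
        matched += Counter(members).most_common(1)[0][1]
    return matched / len(labels)


if __name__ == "__main__":
    docs = [
        "cats chase mice",
        "mice fear cats",
        "stocks rise today",
        "stocks fall today",
    ]
    truth = ["animals", "animals", "finance", "finance"]
    labels = threshold_clusters(docs)
    print(labels, purity(labels, truth))
```

Because this baseline matches only identical surface terms, two documents on the same topic with no shared words receive a similarity of zero. That is the limitation the abstract attributes to conventional representations.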