PASTIC Dspace Repository

Document Clustering based on Semantic Notions

dc.contributor.author Rafi, Muhammad
dc.date.accessioned 2019-05-30T05:35:15Z
dc.date.accessioned 2020-04-11T15:35:26Z
dc.date.available 2020-04-11T15:35:26Z
dc.date.issued 2017
dc.identifier.govdoc 16948
dc.identifier.uri http://142.54.178.187:9060/xmlui/handle/123456789/5044
dc.description.abstract The exponential growth of electronic documents, in both proprietary and public information systems, poses new challenges in finding relevant information in these large repositories. Document clustering is a specialized technique that has found its niche in effectively browsing, filtering, managing and summarizing these collections. The document clustering process has three distinct steps: (i) document representation, (ii) computation of pair-wise document similarity, and (iii) application of a clustering algorithm. Document clustering methods are very sensitive to document representation schemes. Conventionally, document representations are based on extracting simple features, such as terms, n-grams, frequent words or sequences, from the documents that can be used as meta-descriptors for documents. These features reduce the dimensionality of the problem but fail to capture the semantics of the text in the transformed compact representation. These representations ignore the order of, and relationships among, words/features. Documents written in human languages generally have a context, and the use of words depends mainly on that context. Motivated by this, a novel document representation scheme is proposed that first extracts lexical chains from the documents and then exploits the topic map structure for the lexical chains. The scheme takes advantage of lexical cohesion structure along with topic map relationships to obtain a semantics-based representation of a document. Topic Maps (TM) is an international standard for the codification of knowledge. Moreover, a good similarity measure is essential for the clustering task. The similarity function should make use of semantic relationships among features (lexical topics) to provide a viable clue for relatedness between any pair of documents. A similarity function based on lexical chain similarity and frequent common tree patterns extracted from the topic maps of documents is defined. Hence these patterns (hierarchical lexical topics with different granularity) also inherently capture semantics in the similarity calculation. An extensive set of experiments on four publicly available document datasets is performed. Evaluation measures such as F-score, purity and entropy show that the proposed approach performs better than traditional document clustering approaches. en_US
dc.description.sponsorship Higher Education Commission, Pakistan en_US
dc.language.iso en_US en_US
dc.publisher National University of Computer & Emerging Sciences, Islamabad en_US
dc.subject Computer Science en_US
dc.title Document Clustering based on Semantic Notions en_US
dc.type Thesis en_US
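
The abstract names purity and entropy among its evaluation measures but this record does not give their formulas. The following is a minimal sketch of the commonly used definitions, not the thesis's own code: the function names `purity` and `entropy`, the natural logarithm, and the normalisation of entropy by the log of the number of classes are all assumptions made for illustration.

```python
import math
from collections import Counter


def _group_by_cluster(labels_true, labels_pred):
    """Collect the true class labels of the documents in each predicted cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, []).append(true_label)
    return clusters


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = _group_by_cluster(labels_true, labels_pred)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted mean class entropy of the clusters.

    Assumption: the value is normalised by log(number of classes) so that it
    lies in [0, 1]; lower is better.
    """
    n_docs = len(labels_true)
    n_classes = len(set(labels_true))
    if n_classes < 2:
        return 0.0  # a single class carries no uncertainty
    total = 0.0
    for members in _group_by_cluster(labels_true, labels_pred).values():
        size = len(members)
        cluster_entropy = -sum(
            (count / size) * math.log(count / size)
            for count in Counter(members).values()
        )
        total += (size / n_docs) * cluster_entropy
    return total / math.log(n_classes)


# Hypothetical toy labels: six documents, three classes, three clusters.
true_classes = ["a", "a", "a", "b", "b", "c"]
cluster_ids = [0, 0, 1, 1, 1, 2]
print(purity(true_classes, cluster_ids))
print(entropy(true_classes, cluster_ids))
```

Higher purity and lower entropy both indicate clusters that align more closely with the reference classes, which is the sense in which the abstract compares the proposed approach with traditional ones.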

