PASTIC Dspace Repository

GRAND: Graph based Author Name Disambiguation Framework


dc.contributor.author Hussain, Ijaz
dc.date.accessioned 2019-10-02T06:59:33Z
dc.date.accessioned 2020-04-11T15:39:21Z
dc.date.available 2020-04-11T15:39:21Z
dc.date.issued 2019
dc.identifier.govdoc 17896
dc.identifier.uri http://142.54.178.187:9060/xmlui/handle/123456789/5234
dc.description.abstract Author name disambiguation is a challenging research area in bibliometric analysis, scientometrics, and informetrics. Author name ambiguity may occur in two ways: when multiple authors share a common name, or when multiple variations of one author’s name appear in bibliographic databases such as DBLP, ACM, and Google Scholar. In both scenarios, it is difficult to be certain about the accuracy of the retrieved results. Proper identification of one author’s work from another’s is necessary for many reasons; for example, in author ranking sites such as Arnetminer, author name ambiguity in citations leads to wrong metrics such as h-index, g-index, and i-index. Author name ambiguity is one of the main sources of erroneous analysis in these bibliographic databases. To improve the accuracy of the aforementioned metrics, it is necessary to disambiguate these ambiguous authors. Similarly, these bibliographic databases provide content as input to visual bibliographic information retrieval systems that are currently used for expert (supervisor) finding, specific literature searching, selecting reviewers, and detecting potential conflicts of interest. Existing author name disambiguation techniques require a representative labeled data set for training the model, or require the number of ambiguous authors to be known a priori, or require extra information from the Web, or need user feedback, and are less scalable due to the requirement of training thousands of models for each ambiguous author. In this dissertation, a complete author name disambiguation framework called “GRAND” is presented that consists of four main algorithms, one each for the resolution of homonyms, synonyms, sole authors, and incremental author name ambiguity. The first algorithm is DISC, which exploits graph semantics, similarity measures, and community detection algorithms to disambiguate homonyms.
The citation data set is preprocessed and ambiguous author blocks are created. DISC utilizes only two citation attributes, co-authors and titles, which are implicit bibliographic information in all bibliographic databases. The co-author graph of the citation data set is constructed, and “GSkeletonClu: A graph Structural Clustering Algorithm for networks” is used to identify hub vertices, outliers, and clusters of nodes in the co-author graph. Homonyms are resolved by splitting these clusters of nodes across the hub nodes if the similarity between their title feature vectors is less than a threshold. The second algorithm is SISTER, which uses the graph-based semantic similarity measure “SynGeo”. It preprocesses the citation data set and constructs its co-author graph. Synonyms are resolved by exploiting SynGeo, which is based on syntactic similarity and graph geodesics between the compared nodes. The third algorithm is GCLUSIM, which detects and disambiguates sole authors. In GCLUSIM, title feature vectors of sole authors and of disambiguated authors are constructed to find the similarity between them. On the basis of this similarity, a sole author may be merged with the disambiguated clusters. As our final contribution, the fourth algorithm is CAND, which exploits author name indices, author profiles, and a comparison function to solve incremental author name ambiguity. Author name indices enhance the overall system performance, and author profile models help in the disambiguation of incremental insertions. The comparison function utilizes the strongest bibliometric features: co-authors, titles, and self-citations. The proposed algorithms are more effective than state-of-the-art methods in terms of clustering metrics. Furthermore, we believe that the algorithms proposed in this dissertation can serve as a baseline for future author name disambiguation studies. en_US
dc.description.sponsorship Higher Education Commission, Pakistan en_US
dc.language.iso en_US en_US
dc.publisher COMSATS Institute of Information Technology, Islamabad en_US
dc.subject Computer Sciences en_US
dc.title GRAND: Graph based Author Name Disambiguation Framework en_US
dc.type Thesis en_US
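
Illustrative note (not part of the deposited record): the abstract says DISC splits clusters across hub nodes when the similarity between their title feature vectors falls below a threshold, but it does not name the similarity measure, the feature construction, or the threshold value. The sketch below is only one plausible reading, assuming bag-of-words title vectors, cosine similarity, and an arbitrary threshold of 0.3. The function names title_vector, cosine_similarity, and should_split are hypothetical and do not come from the thesis.

```python
import math
import re
from collections import Counter


def title_vector(titles):
    """Build a bag-of-words term-count vector from a group of paper titles.

    Assumption: plain lowercase word counts; the thesis may weight or
    filter terms differently.
    """
    words = []
    for title in titles:
        words.extend(re.findall(r"[a-z]+", title.lower()))
    return Counter(words)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_split(titles_a, titles_b, threshold=0.3):
    """Return True when two title groups look dissimilar enough to split.

    Assumption: threshold=0.3 is an arbitrary illustrative value, not a
    figure reported in the abstract.
    """
    similarity = cosine_similarity(title_vector(titles_a), title_vector(titles_b))
    return similarity < threshold
```

For instance, calling should_split with one group of titles about graph clustering and another about protein folding would compare their term-count vectors and report whether the groups fall below the assumed threshold.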

