GRAND: Graph based Author Name Disambiguation Framework

Hussain, Ijaz

dc.contributor.author	Hussain, Ijaz
dc.date.accessioned	2019-10-02T06:59:33Z
dc.date.accessioned	2020-04-11T15:39:21Z
dc.date.available	2020-04-11T15:39:21Z
dc.date.issued	2019
dc.identifier.govdoc	17896
dc.identifier.uri	http://142.54.178.187:9060/xmlui/handle/123456789/5234
dc.description.abstract	Author name disambiguation is a challenging research area in the fields of bibliometric analysis, scientometrics, and informetrics. Author name ambiguity occurs in two ways: when multiple authors share a common name, or when multiple variations of one author's name appear in bibliographic databases such as DBLP, ACM, and Google Scholar. In both scenarios, it is difficult to be certain about the accuracy of the retrieved results. Proper identification of one author's work from another's is necessary for many reasons. For example, in author ranking sites such as Arnetminer, author name ambiguity in citations leads to wrong metrics such as h-index, g-index, and i-index. Author name ambiguity is one of the main sources of erroneous analysis in these bibliographic databases. To improve the accuracy of the aforementioned metrics, it is necessary to disambiguate these ambiguous authors. Similarly, these bibliographic databases provide content as input to visual bibliographic information retrieval systems that are currently used for expert (supervisor) finding, specific literature searching, selecting reviewers, and detecting potential conflicts of interest. Existing author name disambiguation techniques require a representative labeled data set for training the model, or require the number of ambiguous authors to be known a priori, or require extra information from the Web, or need user feedback, and are less scalable due to the requirement of training thousands of models for each ambiguous author. In this dissertation, a complete author name disambiguation framework called "GRAND" is presented that consists of four main algorithms, one each for the resolution of homonyms, synonyms, sole authors, and incremental author name ambiguity. The first algorithm is DISC, which exploits graph semantics, similarity measures, and community detection algorithms to disambiguate homonyms.
The citation data set is preprocessed and ambiguous author blocks are created. DISC utilizes only two citation attributes, co-authors and titles, which are implicit bibliographic information in all bibliographic databases. The co-author graph of the citation data set is constructed, and "GSkeletonClu: A graph Structural Clustering Algorithm for networks" is used to identify hub vertices, outliers, and clusters of nodes in the co-author graph. Homonyms are resolved by splitting these clusters of nodes across the hub nodes if the similarity between their title feature vectors is less than a threshold. The second algorithm is SISTER, which uses the graph-based semantic similarity measure "SynGeo". It preprocesses the citation data set and constructs its co-author graph. Synonyms are resolved by exploiting SynGeo, which is based on syntactic similarity and graph geodesics between the compared nodes. The third algorithm is GCLUSIM, which detects and disambiguates sole authors. In GCLUSIM, title feature vectors of sole authors and of already disambiguated authors are constructed to find the similarity between them. On the basis of this similarity, a sole author may be merged with the disambiguated clusters. As our final contribution, the fourth algorithm is CAND, which exploits author name indices, author profiles, and a comparison function to solve incremental author name ambiguity. Author name indices enhance the overall system performance, and author profile models help in disambiguating incremental insertions. The comparison function utilizes the strongest bibliometric features: co-authors, titles, and self-citations. The proposed algorithms are more effective than state-of-the-art methods in terms of clustering metrics. Furthermore, we believe that the algorithms proposed in this dissertation can serve as a baseline for future author name disambiguation studies.	en_US
dc.description.sponsorship	Higher Education Commission, Pakistan	en_US
dc.language.iso	en_US	en_US
dc.publisher	COMSATS Institute of Information Technology, Islamabad	en_US
dc.subject	Computer Sciences	en_US
dc.title	GRAND: Graph based Author Name Disambiguation Framework	en_US
dc.type	Thesis	en_US
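
The abstract states that DISC splits clusters of nodes across hub nodes when the similarity between their title feature vectors is less than a threshold. The following is a minimal sketch of such a check, not the dissertation's implementation. The bag-of-words representation, the cosine measure, the word-length filter, the threshold value of 0.2, and the function names title_vector, cosine_similarity, and should_split are all assumptions made for illustration; the record does not specify any of them.

```python
import math
from collections import Counter


def title_vector(titles):
    """Build a term-count vector from a group of titles.

    Assumption: a plain bag of lowercase words longer than three
    characters; the record does not say how features are built.
    """
    counts = Counter()
    for title in titles:
        counts.update(w for w in title.lower().split() if len(w) > 3)
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_split(titles_a, titles_b, threshold=0.2):
    """Return True when two clusters look like different authors.

    The threshold of 0.2 is a hypothetical value, not one from the thesis.
    """
    sim = cosine_similarity(title_vector(titles_a), title_vector(titles_b))
    return sim < threshold


# Hypothetical title groups for two clusters joined by one ambiguous name.
graph_titles = [
    "Graph clustering for social networks",
    "Community detection in large graphs",
]
biology_titles = [
    "Protein folding with molecular dynamics",
    "Enzyme kinetics in yeast",
]
citation_titles = ["Graph clustering of citation networks"]

print(should_split(graph_titles, biology_titles))
print(should_split(graph_titles, citation_titles))
```

In this sketch, the two clusters with no shared title vocabulary would be split, while the two that share several terms would stay together. Any real threshold would have to be tuned on labeled data.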