Disambiguating Authors in Bibliographic Databases

Shoaib, Muhammad

dc.contributor.author	Shoaib, Muhammad
dc.date.accessioned	2019-07-02T11:15:57Z
dc.date.accessioned	2020-04-11T15:35:45Z
dc.date.available	2020-04-11T15:35:45Z
dc.date.issued	2016
dc.identifier.govdoc	17875
dc.identifier.uri	http://142.54.178.187:9060/xmlui/handle/123456789/5064
dc.description.abstract	Author name disambiguation in bibliographic databases such as DBLP, CiteSeer, and Scopus is a specialized entity resolution problem. Various approaches have been proposed in the literature, most of them based on machine learning: supervised learning, unsupervised learning, or a combination of the two. Supervised approaches require manual labeling effort to produce training data. Unsupervised approaches use the available attributes to group an author's citations by applying similarity measures and clustering algorithms. The performance of unsupervised methods depends on the clustering algorithm, the attributes, and the similarity measures. Previous research has focused on devising clustering algorithms and identifying attributes, while similarity measures have not received due attention. In this research work, we propose an improved similarity measure for each type of attribute, together with a clustering algorithm. To estimate author name similarity, we divide name tokens into five categories and devise a similarity measure that accommodates them by assigning a different weight to each type of token. Our proposed similarity measure for the co-author attribute assigns a higher similarity value to a pair of citations the more co-authors they share, irrespective of the total number of co-authors. For textual attributes, we propose a conditional absolute measure (for attributes with short texts) and the SDK index (for attributes with long texts). Experiments on the DBDComp datasets show that our similarity measures outperform the baseline measures by 16.2% in k-measure and 14.20% in f-measure. We also propose using the references of publications as an additional source of information. Using the titles of references improves k-measure by 0.6% and f-measure by 8% on the DBLP-Ref datasets. Finally, we propose a clustering algorithm obtained by modifying heuristic-based hierarchical clustering.
Experiments on three different types of author name disambiguation collections show that our proposed methodology (similarity measures, clustering algorithm, and use of references) helps improve both k-measure and f-measure.	en_US
dc.description.sponsorship	Higher Education Commission, Pakistan	en_US
dc.language.iso	en_US	en_US
dc.publisher	International Islamic University, Islamabad.	en_US
dc.subject	Computer Sciences	en_US
dc.title	Disambiguating Authors in Bibliographic Databases	en_US
dc.type	Thesis	en_US
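
The abstract says the co-author similarity grows with the number of shared co-authors, irrespective of how many co-authors each citation lists in total. The record does not give the thesis's formula, so the following is only a minimal sketch of that idea, contrasting a ratio-based measure (Jaccard) with a plain count of shared co-authors. The function names and the sample co-author sets are hypothetical and are not taken from the thesis.

```python
def jaccard_similarity(coauthors_a, coauthors_b):
    """Ratio-based baseline: shared co-authors divided by all distinct co-authors."""
    a, b = set(coauthors_a), set(coauthors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_coauthor_count(coauthors_a, coauthors_b):
    """Absolute measure (illustrative assumption): number of shared co-authors,
    ignoring how long either co-author list is."""
    return len(set(coauthors_a) & set(coauthors_b))


# Hypothetical citation pairs.
short_pair = (["A", "B"], ["A", "B"])  # 2 shared, 2 distinct in total
long_pair = (["A", "B", "C", "D", "E", "F"],
             ["A", "B", "C", "G", "H", "I"])  # 3 shared, 9 distinct in total

for name, (x, y) in {"short": short_pair, "long": long_pair}.items():
    print(name, round(jaccard_similarity(x, y), 3), shared_coauthor_count(x, y))
```

With these inputs the ratio-based measure ranks the short pair higher (1.0 against about 0.333), whereas the count-based measure ranks the long pair higher (3 shared co-authors against 2), which is the behaviour the abstract describes for the co-author attribute.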