dc.description.abstract |
Author name disambiguation is a challenging research area in bibliometric
analysis, scientometrics, and informetrics. Author name ambiguity arises in
two ways: when multiple authors share a common name, or when a single author
appears under multiple name variations in bibliographic databases such as DBLP,
ACM, and Google Scholar. In both scenarios, it is difficult to be certain
about the accuracy of the retrieved results. Distinguishing one author’s work
from another’s is necessary for many reasons. For example, in author ranking
sites such as Arnetminer, author name ambiguity in citations leads to incorrect
metrics such as the h-index, g-index, and i-index. Author name ambiguity is one
of the main sources of erroneous analysis in these bibliographic databases. To
improve the accuracy of the aforementioned metrics, it is necessary to
disambiguate these ambiguous authors. Similarly, these bibliographic databases
provide input to visual bibliographic information retrieval systems that are
currently used for expert (supervisor) finding, specific literature searching,
selecting reviewers, and detecting potential conflicts of interest.
Existing author name disambiguation techniques require a representative labeled
data set for training the model, or require the number of ambiguous authors to
be known a priori, or require extra information from the Web, or need user
feedback. They are also less scalable because they require training thousands
of models for each ambiguous author. In this dissertation, a complete author
name disambiguation framework called “GRAND” is presented. It consists of four
main algorithms, one each for the resolution of homonyms, synonyms, sole
authors, and incremental author name ambiguity.
The first algorithm is DISC, which exploits graph semantics, similarity
measures, and community detection algorithms to disambiguate homonyms. The
citation data set is preprocessed and ambiguous author blocks are created. DISC
utilizes only two citation attributes, co-authors and titles, which are
implicitly available in all bibliographic databases. The co-author graph of the citation
data set is constructed, and “GSkeletonClu: A graph Structural Clustering
Algorithm for networks” is used to identify hub vertices, outliers, and clusters
of nodes in the co-author graph. Homonyms are resolved by splitting these
clusters of nodes across the hub nodes if the similarity between their title
feature vectors is less than a threshold. The second algorithm is SISTER, which
uses the graph-based semantic similarity measure “SynGeo”. It preprocesses the
citation data set and constructs its co-author graph. Synonyms are resolved by
exploiting SynGeo, which is based on syntactic similarity and graph geodesics
between the compared nodes. The third algorithm is GCLUSIM, which detects and disambiguates sole
authors. In GCLUSIM, title feature vectors of sole authors and of disambiguated
authors are constructed and compared for similarity. On the basis of this
similarity, a sole author may be merged into a disambiguated cluster. As our
final contribution, the fourth algorithm is CAND, which exploits author name
indices, author profiles, and a comparison function to resolve incremental
author name ambiguity. Author name indices enhance overall system performance,
and author profile models help disambiguate incremental insertions. The
comparison function utilizes the strongest bibliometric features: co-authors,
titles, and self-citations. The proposed algorithms are more effective than
state-of-the-art methods in terms of clustering metrics. Furthermore, we believe
that the algorithms proposed in this dissertation can serve as a baseline for
future author name disambiguation studies. |
en_US |