dc.description.abstract |
Citation indexes and digital libraries index millions of research papers and make
them available to the scientific community; however, finding the intended information
in these huge repositories remains a challenge. Every day, the number of research
papers in online digital libraries grows because of the many conferences,
workshops, and journals organized throughout the world.
According to 2017 statistics, PubMed, a digital library in the medical domain,
contained 28 million research documents. Manually
searching for relevant research papers in such a large collection is a
very difficult task. Therefore, this area has attracted the attention of researchers
worldwide, who propose and implement innovative techniques that can recommend
relevant papers to researchers.
The identification of relevant research papers has become an important research
area. The research community has proposed more than 90 different approaches
in the past 15 years. These approaches have utilized different data sources, such as
metadata, content, profile-based data, and citations of research papers. These techniques
have certain strengths and limitations, which are critically reviewed
in this document.
One of the important approaches in this area is co-citation analysis, which considers
two documents relevant to each other if they are co-cited in other scientific documents.
The original approach used only the reference lists of scientific documents
to make such observations. In recent years, however, the content of
documents has also been exploited along with the reference list to enhance
accuracy. These approaches include Citation Proximity Analysis (CPA), Citation
Order Analysis (COA), and techniques that exploit byte offsets in the content of scientific papers. These
approaches model the occurrence of co-citations at different levels of proximity
and give more weight to documents that are co-cited closely together.
However, documents co-cited closely in the "Methodology/Results" sections
may be considered more relevant than papers co-cited closely in the
"Introduction/Discussion" sections. This thesis explores the structural organization
of scientific documents by assigning weights according to the importance of different
generic sections, and investigates whether such an approach can increase the
accuracy of identifying relevant papers.
This work addresses the following research challenges, which constitute
the contributions of the thesis: (1) generic section identification in citing
documents; (2) identification of in-text citation patterns and frequencies in citing documents;
and (3) design of an algorithm that uses evidence from the above-mentioned
sources (section names, their weights, and the frequency of co-citations) to identify
and recommend relevant papers.
For each contribution, the detailed architecture, dataset, and evaluation are
discussed in this thesis. First, the generic section identification component was
designed, implemented, and then evaluated against state-of-the-art approaches. The
proposed approach was evaluated on two datasets consisting of 150 and 300 citing
documents, respectively. The aggregated F-score of the proposed approach was 92%
over both datasets, while the F-score of the state-of-the-art technique was 81%.
Second, the component for identifying in-text citation patterns and frequencies
was implemented with a detailed architecture, dataset, and evaluation. For the evaluation,
two datasets were prepared from openly available digital libraries: the Journal
of Universal Computer Science (J.UCS) and CiteSeerX. The proposed model
outperformed the state-of-the-art approach, increasing the F-score from 0.58 to
0.97. The third contribution of this thesis is section-wise co-citation analysis,
which depends on the two earlier components. The proposed approach was designed
to rank the co-cited documents. For the evaluation, two benchmark rankings,
based on JSD and cosine similarity, were selected for comparing the
proposed and state-of-the-art approaches. The scores of
the proposed and state-of-the-art approaches were compared using Spearman's rank correlation and Kendall's tau
measures. The results show that the proposed approach outperformed
state-of-the-art techniques such as standard co-citation and CPA
based on byte offsets. |
en_US |