Abstract:
The volume of scientific publications is growing exponentially. For example, more than 50 million journal papers have been published to date, and more than 2 million are added every year. Published conference papers number in the billions, with millions more added every year. Well-known scientific databases such as Web of Science, Scopus, and PubMed index millions of these papers, even though each index is either restricted to a specialized domain or selective in its coverage. Google Scholar, a more comprehensive index, covers scientific literature from many different domains. These systems make scientific knowledge available to researchers, and research advances by standing on the shoulders of others. However, when users search these or similar systems for relevant papers, they are presented with millions of results and must pick out the most relevant ones manually by skimming them. This is frustrating, and not all of the selected papers turn out to be ones the user needs to read. Many important papers are also overlooked in the process.
The problem of identifying relevant papers in such large collections has attracted researchers across the globe. Contemporary approaches use a variety of techniques to identify relevant documents, including content-based approaches, metadata-based approaches, collaborative filtering, co-citation analysis, and bibliographic analysis. However, the state of the art falls short in several respects: it cannot determine the nature of the relationship between two scientific documents, and it cannot measure how strongly the two documents are linked.
To address these issues, this thesis designs, implements, and evaluates a novel approach that helps researchers identify the most relevant papers in their domains. Given a cited paper, the proposed approach identifies the most relevant papers from the list of papers that cite it. To do so, it uses in-text citation frequencies and in-text citation patterns. In-text citation frequency is the number of times one paper is cited in the text of another paper. In-text citation patterns describe how those in-text citations are distributed across the different sections of the citing paper. The system has been implemented as a prototype for
CiteSeer and evaluated in a number of user studies. The approach shows encouraging results and helps the scientific community identify the most relevant papers from a large list of candidates.
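
The two measures defined above, in-text citation frequency and the per-section citation pattern, can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `in_text_citation_profile`, the assumption that citations appear as numeric markers such as [3], [1, 3] or [2-5], and the sample section texts are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def in_text_citation_profile(sections, ref_number):
    """Count in-text citations of reference `ref_number` per section.

    `sections` maps a section name to its text. Assumption: citations are
    numeric markers such as [3], [1, 3, 7] or ranges such as [2-5].
    Returns (pattern, frequency): the per-section counts and their total.
    """
    marker = re.compile(r"\[([0-9,\s\-]+)\]")
    per_section = Counter()
    for name, text in sections.items():
        for match in marker.finditer(text):
            cited = set()
            for part in match.group(1).split(","):
                part = part.strip()
                if "-" in part:
                    # Expand a range such as "2-5" into 2, 3, 4, 5.
                    lo, hi = (p.strip() for p in part.split("-", 1))
                    if lo.isdigit() and hi.isdigit():
                        cited.update(range(int(lo), int(hi) + 1))
                elif part.isdigit():
                    cited.add(int(part))
            if ref_number in cited:
                per_section[name] += 1
    return dict(per_section), sum(per_section.values())


# Hypothetical citing paper, split into sections.
sections = {
    "Introduction": "Prior work [3] introduced the idea. Others [1, 2] followed.",
    "Methodology": "We extend [3] and reuse the corpus of [3, 4]. See also [2-5].",
    "Results": "Our results agree with [1].",
}
pattern, frequency = in_text_citation_profile(sections, 3)
print(pattern)    # {'Introduction': 1, 'Methodology': 3}
print(frequency)  # 4
```

In this sketch, a citing paper that mentions the cited paper several times, and does so in its methodology section rather than only in passing in the introduction, would be a candidate for a strong relationship. How frequencies and sections are actually weighted is defined by the thesis itself, not by this example.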