dc.description.abstract |
Biomedical knowledge is usually presented as unstructured text, which makes extracting information from it a complex task. Although manual information extraction often produces the best results, it is hard to manage for biomedical data because the volume of that data is growing exponentially. There is therefore a need for automatic tools and techniques for information extraction and knowledge discovery in biomedical text mining. Named entity recognition and relation extraction are two active areas of research in biomedical information extraction. Relation extraction identifies the relationships between named entities, so the two tasks depend on each other, yet research often treats them as independent steps.
Much work has been done on biomedical named entity recognition, mostly on supervised and semi-supervised solutions, while unsupervised methods have received far less attention. Because annotated corpora are in limited supply, researchers have now directed their efforts towards unsupervised named entity recognition systems. Named entity recognition from annotated corpora has matured, leaving little margin for further performance gains. Named entity recognition from unannotated corpora remains an open challenge in all domains, and in the biological and biomedical domains in particular.
Biomedical text expresses relationships between different entities that are important to practitioners and researchers. Relation extraction is a significant area of biomedical knowledge discovery and has gained much importance over the last two decades. Much of the work on biomedical relation extraction and identification falls into two major areas: 1) rule-based techniques and 2) machine learning techniques. In the last decade, the focus has shifted to hybrid approaches, which have shown better results.
This research presents an unsupervised named entity recognition framework along with a hybrid feature set for classifying relations between biomedical entities. Our named entity recognition uses Unified Medical Language System (UMLS) concepts to build signature vectors automatically. Vectorizing UMLS concepts allows the framework to be applied in a generic way. Our framework differs from previous unsupervised methods in that it relies on UMLS, rather than corpus statistics, to create the vector space. The relation extraction approach combines bag-of-words features, noun and verb phrases identified with Natural Language Processing (NLP), and semantic features based on UMLS concepts. This hybrid feature set is a better representation of the relation extraction task. Its main contribution is the semantic feature set, in which verb phrases are ranked using UMLS and a ranking algorithm selects the most suitable concepts as features for the classifier.
For named entity recognition, we used the Arizona Disease Corpus (AZDC), a gold-standard corpus for this task. Our framework achieved an accuracy of 72.56%, which is competitive with supervised techniques on the same corpus. Our relation extraction approach was validated on a standard biomedical text corpus obtained from MEDLINE 2001. It achieved accuracies of 96.19%, 97.45%, and 96.49% and F-measures of 98.05%, 93.55%, and 88.89% for the cure, prevent, and side effect relations, respectively. |
en_US |