Abstract:
The rise of social networking sites and blogs has stimulated a bull market in personal opinion:
consumer recommendations, product reviews, ratings, and other types of online expression. For
computational linguistics researchers, this fast-growing body of information has opened an
exciting research frontier known as Sentiment Analysis (SA). For English, this area has been
studied for the last decade, but other major languages, such as Urdu, have been largely
overlooked by the research community. Urdu is a morphologically rich and resource-poor
language. Its distinctive features, such as complex morphology, flexible grammar rules,
context-sensitive orthography and free word order, make Urdu language processing a challenging
problem domain. For the same reasons, sentiment analysis approaches and techniques developed
for other well-explored languages are not workable for Urdu text.
This dissertation presents a grammatically motivated sentiment classification framework to
handle these distinctive features of the Urdu language. The main research contributions are: to
highlight the linguistic (orthography, grammar, morphology, etc.) as well as the technical
(parsing algorithm, lexicon, corpus, etc.) aspects of this multidimensional research problem; to
explore Urdu morphological operations, grammar and orthographic rules; and to redefine these
operations and rules with respect to the requirements of a sentiment analysis framework. The
orthographic, morphological, grammatical and, finally, conceptual details of the language
are our main concerns. Additionally, our approach can help in the sentiment analysis of other
languages, such as Arabic, Persian, Hindi and Punjabi.
The proposed framework emphasizes the identification of SentiUnits rather than
subjective words in the given text. SentiUnits are sentiment-carrying expressions that reveal
the inherent sentiment of a sentence toward a specific target. The targets are the noun phrases
about which an opinion is expressed. The system extracts SentiUnits and target expressions through
shallow-parsing-based chunking. A dependency parsing algorithm then creates associations between
these extracted expressions. The framework takes a lexicon-based approach built on a
sentiment-annotated lexicon. Each entry of the lexicon is marked with its orientation (positive
or negative) and an intensity (force of orientation) score. An experimental evaluation of the
system, with a sentiment-annotated lexicon of Urdu words and two corpora of reviews as test beds,
shows encouraging results in terms of accuracy, precision, recall and F-measure.
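
As a rough illustration of the lexicon-based step described above, the following minimal Python
sketch scores SentiUnits that have already been linked to their targets. It is not the
dissertation's implementation: the names LexiconEntry, LEXICON, score_senti_unit and
classify_targets, the romanized sample words, the intensity values, the additive scoring rule and
the "neutral" fallback are all assumptions made for the example. Chunking and dependency parsing
are assumed to have happened upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    orientation: int   # +1 for positive, -1 for negative
    intensity: float   # force of orientation (assumed range: 0 to 1)


# Hypothetical toy lexicon. The romanized words and scores are
# illustrative only and are not taken from the dissertation's lexicon.
LEXICON = {
    "acha": LexiconEntry(+1, 0.5),       # "good"
    "behtareen": LexiconEntry(+1, 1.0),  # "excellent"
    "bura": LexiconEntry(-1, 0.5),       # "bad"
}


def score_senti_unit(words):
    """Sum the signed intensities of the lexicon words in one SentiUnit.

    Words missing from the lexicon contribute nothing (an assumption).
    """
    return sum(
        LEXICON[w].orientation * LEXICON[w].intensity
        for w in words
        if w in LEXICON
    )


def classify_targets(pairs):
    """Aggregate SentiUnit scores per target and label each target.

    `pairs` is an iterable of (target, senti_unit_words) tuples, assumed
    to be the output of the chunking and dependency-association steps.
    """
    totals = {}
    for target, words in pairs:
        totals[target] = totals.get(target, 0.0) + score_senti_unit(words)
    return {
        target: "positive" if s > 0 else "negative" if s < 0 else "neutral"
        for target, s in totals.items()
    }


# Example input: two targets, each linked to one SentiUnit.
pairs = [("camera", ["bohat", "acha"]), ("battery", ["bura"])]
labels = classify_targets(pairs)
```

Here "bohat" ("very") is absent from the toy lexicon and is simply ignored; a fuller treatment of
modifiers would need its own rule, which this sketch does not attempt.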