Abstract:
Computational grammar development and deep linguistic analysis
provide structural details for natural language understanding by machines.
Modern multilingual information processing systems use these details for
understanding and processing of information represented in different
languages. Work on the Sindhi language has focused on areas such as
part-of-speech tagging and machine learning, but Sindhi still lacks resources
such as computational grammars and deep linguistic analysis systems.
Development of such resources is an open research area in computational
linguistics and natural language processing.
This work presents the development of Sindhi language morphology
and grammar in the Finite State Technology and Lexical Functional Grammar
(LFG) frameworks. The work includes the investigation and identification of
morphological and syntactic patterns in the Sindhi language, development of a
Sindhi finite state lexicon by modeling the identified morphological patterns in
LEXC, development of a Sindhi LFG by incorporating the finite state lexicon in
XLE, and evaluation of the developed morphological lexicon and LFG grammar.
Various parts of speech of the Sindhi language are investigated and their
morphological patterns are identified. Nouns are marked for number, gender
and case. Ten different noun cases are identified, namely nominative,
accusative, dative, participant, instrumental, locative, ablative, agentive,
genitive and vocative. Adjectives are declined like nouns. Pronouns are
declined for number and gender and are marked for nominative, oblique and
genitive cases. Generally, adverbs are not inflected, but when adjectives are
used as adverbs they retain the inflectional properties of adjectives. Genitive
postpositions are inflected and marked for number and gender. Conjunctions
and interjections do not inflect. Verbs are the most complex part of speech and
are classified into main, auxiliary, copula and modal verbs. Verbs are conjugated
for number and gender and are marked for tense, aspect and mood.
Morphological analysis of the developed model shows that a verb can have up to
75 different morphological forms in Sindhi. Present, past and future tense
patterns along with aspect and mood are analyzed. Aspect in Sindhi can either
be perfective or imperfective (continuous and habitual) and can be marked
morphologically or syntactically. Many alternative patterns of different aspects
exist. Nine different mood patterns are identified including subjunctive,
presumptive, imperative, declarative, permissive, prohibitive, capacitive,
compulsive and suggestive. Pronominal suffixes in Sindhi may appear on
nouns, postpositions and verbs. Pronominal suffixation can lead to
subject and object pro-drop.
Sindhi syntax is analyzed from an LFG perspective. Different noun phrase
constructions are implemented with coordination patterns including adjective
phrases, postpositional phrases, participle phrases, and relative clauses.
Genitive case marking patterns along with syntactic agreement are identified
and modeled in LFG. Verbal subcategorization frames are defined for different
grammatical functions including SUBJ (Subject), OBJ (Object), OBJ2
(Secondary Object), OBL (Oblique), COMP (Complement), XCOMP (Open
Complement), and PREDLINK (Predicate link). Phrase-level and sentence-level
adjunct (ADJUNCT) and open adjunct (XADJUNCT) patterns are also
identified and implemented in LFG.
The developed grammar is tested against two different test suites. The first
test suite contains 617 handcrafted sentences in 10 different test files,
each covering sentences with different syntactic features. The second test
suite contains a real-world corpus of 258 sentences taken from two Sindhi
class-one textbooks. The grammar parses 98.05% of the sentences in test
suite 1 and 96.5% of the sentences in test suite 2.
Morphological coverage includes 862 stems of different POS classes with
a total of 10,327 inflectional forms. The developed finite state morphology is
tested and evaluated against a corpus of 9,050 words in terms of coverage,
ambiguity, precision, recall and F-measure (F1). The results show 97.8%
precision, 96.08% recall and an average ambiguity of 1.65 solutions per word,
with 91.1% coverage. The morphological features covered include
number, gender, case, tense, aspect and mood. Syntactic coverage includes
nominal elements, coordination, subordination, agreement, verbal
subcategorization, tense, aspect and mood.
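For reference, the F-measure implied by the precision and recall figures above can be worked out from the standard harmonic-mean definition. The value below is derived here from those two reported numbers (P = 0.978, R = 0.9608) and is not quoted from the evaluation itself, so a separately reported F1 may differ slightly through rounding.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.978 \times 0.9608}{0.978 + 0.9608}
    \approx 0.969
```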
Research and development results include a Sindhi part-of-speech tagset,
a Roman script for the Sindhi language, a morphological lexicon and an LFG
grammar of Sindhi. As a side development, a corpus of about 4 million words
is also developed. In the absence of linguistic resources for the Sindhi language,
these developments will have a significant impact on Sindhi language processing
and further research in computational linguistics and related domains.