PASTIC Dspace Repository

Developing a Sindhi Computational Resource Grammar in Lexical Functional Grammar Framework

dc.contributor.author Rahman, Mutee U
dc.date.accessioned 2019-07-12T11:17:53Z
dc.date.accessioned 2020-04-11T15:36:26Z
dc.date.available 2020-04-11T15:36:26Z
dc.date.issued 2017
dc.identifier.govdoc 17341
dc.identifier.uri http://142.54.178.187:9060/xmlui/handle/123456789/5102
dc.description.abstract Computational grammar development and deep linguistic analysis provide structural details for natural language understanding by machines. Modern multilingual information processing systems use these details to understand and process information represented in different languages. Work on the Sindhi language has focused on areas such as part-of-speech tagging and machine learning, while Sindhi still lacks resources such as computational grammars and deep linguistic analysis systems. Development of such resources is an open research area in the computational linguistics and natural language processing domains. This work presents the development of Sindhi language morphology and grammar in the Finite State Technology and Lexical Functional Grammar (LFG) frameworks. The work includes the investigation and identification of morphology and syntax patterns in the Sindhi language, development of a Sindhi finite state lexicon by modeling the identified morphological patterns in LEXC, development of a Sindhi LFG by incorporating the finite state lexicon in XLE, and evaluation of the developed morphological lexicon and LFG grammar. Various parts of speech of the Sindhi language are investigated and their morphological patterns are identified. Nouns are marked for number, gender and case. Ten different noun cases are identified, namely nominative, accusative, dative, participant, instrumental, locative, ablative, agentive, genitive and vocative. Adjectives are also declined like nouns. Pronouns are declined for number and gender and are marked for nominative, oblique and genitive cases. Generally, adverbs are not inflected, but when adjectives are used as adverbs they retain the inflectional properties of adjectives. Genitive postpositions are inflected and marked for number and gender. Conjunctions and interjections do not inflect. Verbs are the most complex part of speech and are classified into main, auxiliary, copula and modal verbs.
Verbs are conjugated for number and gender and are marked for tense, aspect and mood. Morphological analysis of the developed model shows that a verb can have up to 75 different morphological forms in Sindhi. Present, past and future tense patterns along with aspect and mood are analyzed. Aspect in Sindhi can be either perfective or imperfective (continuous and habitual) and can be marked morphologically or syntactically. Many alternative patterns of the different aspects exist. Nine different mood patterns are identified: subjunctive, presumptive, imperative, declarative, permissive, prohibitive, capacitive, compulsive and suggestive. Pronominal suffixes in Sindhi may appear on nouns, postpositions and verbs. Pronominal suffixation can cause subject and object pro-drop. Sindhi syntax is analyzed from an LFG perspective. Different noun phrase constructions are implemented with coordination patterns, including adjective phrases, postpositional phrases, participle phrases and relative clauses. Genitive case marking patterns along with syntactic agreement are identified and modeled in LFG. Verbal subcategorization frames are defined for different grammatical functions including SUBJ (Subject), OBJ (Object), OBJ2 (Secondary Object), OBL (Oblique), COMP (Complement), XCOMP (Open Complement) and PREDLINK (Predicate Link). Phrase-level and sentence-level adjunct (ADJUNCT) and open adjunct (XADJUNCT) patterns are also identified and implemented in LFG. The developed grammar is tested against two different test suites. The first test suite contains 617 handcrafted sentences in 10 different test files, each containing sentences with different syntactic features. The second test suite contains a real-world corpus of 258 sentences from two class-one Sindhi textbooks. Results show parsing rates of 98.05% and 96.5% for test suite 1 and test suite 2, respectively. Morphology coverage includes 862 stems of different POS classes with a total of 10327 inflectional forms.
The developed finite state morphology is tested and evaluated against a corpus of 9050 words in terms of coverage, ambiguity, precision, recall and F-measure (F1). The results show 97.8% precision, 96.08% recall and an average ambiguity of 1.65 solutions per word with 91.1% coverage. Coverage of morphological features includes number, gender, case, tense, aspect and mood. Syntactic coverage includes nominal elements, coordination, subordination, agreement, verbal subcategorization, tense, aspect and mood. Research and development results include a Sindhi part-of-speech tagset, a roman script for the Sindhi language, a morphological lexicon and an LFG grammar of Sindhi. As a side development, a corpus of about 4 million words is also developed. In the absence of linguistic resources for the Sindhi language, these developments will have a significant impact on Sindhi language processing and on further research in computational linguistics and related domains. en_US
dc.description.sponsorship Higher Education Commission, Pakistan en_US
dc.language.iso en_US en_US
dc.publisher Isra University, Hyderabad en_US
dc.subject Computer Sciences en_US
dc.title Developing a Sindhi Computational Resource Grammar in Lexical Functional Grammar Framework en_US
dc.type Thesis en_US

