SELECTING PROPER TOOLS FOR THE PROCESSING OF KOREAN, ARABIC AND INDONESIAN CORPUS

PRIHANTORO, PRIHANTORO. SELECTING PROPER TOOLS FOR THE PROCESSING OF KOREAN, ARABIC AND INDONESIAN CORPUS. In: Trans-disciplinary Linguistics Seminar.

Abstract

Prihantoro
Universitas Diponegoro
prihantoro2001@yahoo.com, prihantoro@undip.ac.id

A fundamental issue in corpus processing is whether the processing tools can recognize the characters in the documents. This issue is trivial for languages written in the Roman alphabet, the writing system English uses (Indonesian is one example). Most corpus processing tools are designed to process English, or romanized versions of documents, so tools that handle languages with their own writing systems need to be built. Consider Korean, whose writing system arranges letters into syllable blocks, or Arabic, whose writing system is built on a consonant skeleton. The organization of characters in these languages is not completely concatenative, as it is in English. A description of the writing system must therefore underlie the design of the tools. Once the character recognition issue is solved, much more complex processing tasks can be performed. This paper demonstrates AntConc, Geuljabi and Unitex and shows how these tools recognize characters in these languages.

Keywords: corpus processing tools, writing system, character recognition, automatic retrieval
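The point about non-concatenative writing systems can be illustrated at the level of Unicode code points. The following is a minimal sketch in Python, not a description of how AntConc, Geuljabi or Unitex work internally. The function names decompose_hangul and strip_arabic_marks are hypothetical and chosen only for this illustration. The first function splits a precomposed Hangul syllable block into its component letters (jamo) using the arithmetic defined in the Unicode Standard. The second removes combining vowel marks from Arabic text, leaving the consonant skeleton.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition scheme.
HANGUL_BASE = 0xAC00      # first precomposed syllable block
LEAD_BASE = 0x1100        # first leading consonant (choseong)
VOWEL_BASE = 0x1161       # first vowel (jungseong)
TAIL_BASE = 0x11A7        # one below the first trailing consonant (jongseong)
VOWEL_COUNT = 21
TAIL_COUNT = 28           # includes the "no trailing consonant" case
SYLLABLE_COUNT = 19 * VOWEL_COUNT * TAIL_COUNT


def decompose_hangul(syllable: str) -> tuple:
    """Split one precomposed Hangul syllable block into conjoining jamo.

    Hypothetical helper for illustration; characters outside the
    Hangul syllable range are returned unchanged.
    """
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        return (syllable,)
    lead, rest = divmod(index, VOWEL_COUNT * TAIL_COUNT)
    vowel, tail = divmod(rest, TAIL_COUNT)
    jamo = [chr(LEAD_BASE + lead), chr(VOWEL_BASE + vowel)]
    if tail:  # tail index 0 means the block has no trailing consonant
        jamo.append(chr(TAIL_BASE + tail))
    return tuple(jamo)


def strip_arabic_marks(text: str) -> str:
    """Drop nonspacing marks (such as short-vowel signs) from the text.

    Hypothetical helper for illustration; it filters on the Unicode
    general category 'Mn' and does not normalize the input.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    # U+D55C is a single code point made of three letters.
    print([f"U+{ord(j):04X}" for j in decompose_hangul("\uD55C")])
    # kaf + fatha, ta + fatha, ba + fatha reduces to kaf, ta, ba.
    vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(len(vocalized), len(strip_arabic_marks(vocalized)))
```

A tool that treats each code point as one independent letter would count a Hangul block as a single character and a vocalized Arabic word as twice as many characters as its skeleton, which is why a search for a letter or a root can miss matches unless the tool models the writing system.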

Item Type:	Conference or Workshop Item (Paper)
Subjects:	P Language and Literature > P Philology. Linguistics
Divisions:	Faculty of Humanities > Department of English
ID Code:	44969
Deposited On:	16 Jan 2015 18:40
Last Modified:	16 Jan 2015 18:40
