  • Selangor, Malaysia
Parallel texts, or bitexts, in which the same content is available in several languages as a result of document translation, are becoming plentiful, both in private data warehouses and on publicly accessible sites on the WWW.
Sourcing large amounts of text and translating them are among the challenges in building an Example-Based Machine Translation (EBMT) system. These large amounts of translated text are annotated in the S-SSTC format to cover an extensive vocabulary and range of sentence structures. However, the Bilingual Knowledge Bank (BKB), which is a collection of S-SSTCs, will normally contain redundancy. Hence the idea of an optimized BKB. An optimized BKB (one with redundancy reduced) is smaller in size but equally extensive in its coverage of sentence structures compared to an un-optimized BKB. An optimized BKB therefore enhances the performance of the EBMT system. In this paper, we introduce the idea of an optimized BKB and propose re-using it to construct new BKBs effectively, in order to adapt an existing EBMT system to new language pairs.
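The abstract does not say how redundancy in a BKB is detected, so the following is only an illustrative sketch under a strong assumption: two examples are treated as redundant when their trees have the same shape and category labels. The tuple-based tree encoding, the function names and the sample sentences are all hypothetical and are not taken from the paper.

```python
def structure_signature(tree):
    """Nested tuple of category labels, ignoring the words themselves.

    A tree is assumed to be a (label, children) pair; this encoding is
    hypothetical and much simpler than a real S-SSTC.
    """
    label, children = tree
    return (label, tuple(structure_signature(child) for child in children))


def reduce_redundancy(examples):
    """Keep the first example seen for each distinct structure signature."""
    kept = {}
    for sentence, tree in examples:
        kept.setdefault(structure_signature(tree), (sentence, tree))
    return list(kept.values())


# Made-up examples: the first two share one structure, the third differs.
examples = [
    ("the cat sleeps", ("S", [("NP", []), ("VP", [])])),
    ("a dog barks", ("S", [("NP", []), ("VP", [])])),
    ("birds eat seeds", ("S", [("NP", []), ("VP", [("NP", [])])])),
]
optimized = reduce_redundancy(examples)
```

Under this assumption the reduced collection is smaller, yet every structure signature present in the original collection is still represented, which mirrors the "smaller but equally extensive" property described above.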
Long short-term memory (LSTM) networks have been gaining popularity in modeling sequential data, in tasks such as phoneme recognition, speech translation, language modeling, speech synthesis, chatbot-like dialog systems and others. This paper investigates attention-based encoder-decoder LSTM networks for Malay part-of-speech (POS) tagging, comparing them with weighted finite state transducer (WFST) and hidden Markov model (HMM) approaches. The attractiveness of LSTM networks lies in their strength in modeling long-distance dependencies. Malay POS tagging is examined under two different conditions: with and without morphological information. The experimental results show that LSTM networks trained without any explicit morphological knowledge perform nearly as well as the WFST approach, and better than the HMM approach trained with morphological information.
Search over structured web resources, such as XML data and services, still lacks a method of its own and relies on contemporary search systems. This paper presents a method that learns semantics from the structured information of these resources. Instead of committing the semantic meaning of resources to strict and formal vocabularies like ontology or data dictionary, we are interested to
Structured retrieval aims to exploit the structural information of documents when searching for them. It makes use of both the content and the structure of documents to improve information retrieval. Therefore, the availability of semantic structure in the documents is an important factor for the success of structured retrieval. However, the majority of documents on the Web still lack semantically rich structure.
The String-Tree Correspondence Grammar (STCG) [1] is a grammar formalism for defining: (i) a set of strings (a language), (ii) a set of trees (valid representation/interpretation structures), and (iii) the mapping between the two (to be interpreted for analysis and generation). The formalism is argued to be a totally declarative grammar formalism that can associate, to strings in a language, arbitrary tree structures as desired by the grammar writer to be the linguistic representation structures of the strings. More important is the facility to specify the correspondence between the string and the associated tree in a very natural manner. These features are highly desirable in grammar writing, in particular for the treatment of certain 'non-standard' linguistic phenomena, namely featurisation, lexicalisation and crossed dependencies [2,3]. Furthermore, a grammar written in this way naturally inherits the desired property of bi-directionality (in fact non-directionality [4]), such that the same grammar can be interpreted for both analysis and generation. In this paper, we investigate the properties of the STCG for interpretation towards analysis (as understood within the context of Machine Translation (MT)). Other than using STCG
A system was proposed for phoneme segmentation of connected words in the Malay language. The system consists of a front-end speech preprocessing part, which focuses on the use of zero crossing rates. A detection algorithm was used to determine the beginning and end of each phoneme based on silence intervals and valleys. Object-Oriented Programming (OOP) approach and Graphical
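As a rough illustration of the zero crossing rate mentioned above, the sketch below computes the fraction of adjacent sample pairs whose signs differ, frame by frame. It is a minimal sketch, not the system's actual front end: the function names, the frame length and the hop size are assumptions, and the boundary detection from silence intervals and valleys is not reproduced here.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_zcr(samples, frame_len=4, hop=4):
    """Zero crossing rate of each frame; frame_len and hop are illustrative."""
    return [
        zero_crossing_rate(samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


# Toy signal: an alternating stretch followed by a steadily rising one.
rates = frame_zcr([1, -1, 1, -1, 1, 2, 3, 4])
```

A segmenter could then compare each frame's rate against a threshold to flag candidate boundaries; any such threshold would have to be tuned on real speech data.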
Abstract. Word sense disambiguation (WSD) requires the establishment of a list of the different meanings of words. WSD efforts in machine translation require, in addition, the equivalent translation words in target languages. To facilitate WSD in machine translation systems, we propose ...
Kamus Dewan is the authoritative dictionary for Bahasa Malaysia, containing a wealth of linguistic and cultural information about Bahasa Malaysia. It is currently available in print, as well as a searchable online dictionary. However, the online dictionary lacks advanced search capabilities that target specific fields within each headword and lemma entry. For this information to be targeted and extracted efficiently by computers, the macro- and micro-structures of Kamus Dewan entries need to be first annotated or marked up explicitly. We describe how TEI-P5 guidelines have been applied in this endeavour to make the Kamus Dewan more machine-tractable. We also give some examples of how the machine-tractable data from Kamus Dewan can be used for linguistic research and analysis, as well as for producing other language resources.
Categories are used to organize information and knowledge in directory systems, folders and the like. As the amount of information increases and the types of information diversify, it is common for more categories to be created. As the number of categories increases, it becomes more difficult to organize, manage and look up information in existing categories. In this paper, categories are annotated with concept features to facilitate the access, retrieval and sharing of information in the categories. We have observed that training texts are crucial in learning the concept of a category and serve as a good measure to help humans construct the category model. Hence, we present a study on training text selection and evaluate the effectiveness of training texts, as well as their capability to complement human knowledge in constructing the category model. Experimental evaluation shows that using the training texts approach in category model construction gives promising results on both the effectiveness and complement measures.
EXTENDED ABSTRACT The retrieval of structured resources using unstructured queries is challenging, as we need to deal with matching between entities of two different types. Consider an unstructured query, “publications of K.H. Gan in WI”, in a structured retrieval system. To match this query to structured resources, the system needs to transform it into a format that is comparable to the structure of the resources. We therefore develop a solution that automatically transforms an unstructured query into a mediated query enhanced with structural information. The mediated query is then matched against structured resources to obtain relevant results.
Automatic question answering (QA) is playing an increasingly important role in intelligent answer searching. Many approaches have been employed for retrieving answers to natural language questions, the rule-based approach being one of them. Traditionally, rules for automatic QA have been written manually, which can be time-consuming and limited in scope. To address this issue, we propose an automatic rule extraction approach that generates rules for QA from training data via structural clustering. Keywords: Automatic question answering, rule extraction, structural clustering.
This paper outlines the creation of an open combined semantic lexicon as a resource for the study of lexical semantics in the Malay languages (Malaysian and Indonesian). It is created by combining three earlier wordnets, each built using different resources and approaches: the Malay Wordnet (Lim & Hussein 2006), the Indonesian Wordnet (Riza, Budiono & Hakim 2010) and the Wordnet Bahasa (Nurril Hirfana, Sapuan & Bond 2011). The final wordnet has been validated and extended as part of sense annotation of the Indonesian portion of the NTU Multilingual Corpus (Tan & Bond 2012). The wordnet has over 48,000 concepts and 58,000 words for Indonesian and 38,000 concepts and 45,000 words for Malaysian.
In this paper, we present an approach to constructing a large Bilingual Knowledge Bank (BKB) from an English-Malay bilingual dictionary, based on the idea of the synchronous Structured String-Tree Correspondence (SSTC). The SSTC is a general structure that can associate an arbitrary tree structure with a string in a language, as desired by the annotator, to be the interpretation structure of the string. More important is the facility to specify the correspondence between the string and the associated tree, which can be non-projective. With this structure, we are able to match linguistic units at different internal levels of the structure (i.e. define the correspondence between substrings in the sentence, nodes in the tree, subtrees in the tree and sub-correspondences in the SSTC). This flexibility makes the synchronous SSTC very well suited to the construction of the Bilingual Knowledge Bank we need for the English-Malay MT application.
This paper describes the construction of a Malay speech corpus for use in building Malay speech systems. The corpus is represented with a syntax-prosody tree structure, adapted from the Structured String-Tree Correspondence (SSTC) representation. To build the Malay speech corpus in the syntax-prosody representation, existing text sentences in the SSTC representation are used as the recording script. From the voice recordings based on this script, prosodic features are extracted and annotated on the SSTC tree structure, and at the same time the sound files are linked to the nodes of the SSTC tree. At the end of the recording and annotation process, a mini speech corpus in the syntax-prosody representation was successfully produced, containing 422 sentences, 1720 phrases and 6978 word units.
We present the S-SSTC framework for machine translation (MT), introduced in 2002 and developed since then as a set of working MT systems (SiSTeC-ebmt). Our approach is example-based, but differs from other EBMT approaches in that it uses alignments of string-tree alignments, and in that supervised learning is an integral part of the approach. Our model deals directly with three main difficulties in the traditional treatment of MT that stem from its separation from the "translation task" (the 'world'). First, by allowing the system to learn directly from real translation examples, we avoid the need to pursue indefinitely the elusive goal of writing grammars that exactly describe intermediate syntactico-semantic monolingual representations and their correspondences. Second, we make explicit the dependence of the MT system's performance on the input from the environment. That is possible only because the learning process uses feedback from the real translation knowledge when co...
... Malaysia enyakong@mmu.edu.my Alvin Yeo Wee Universiti Malaysia Sarawak Faculty of Computer Science and Information Technology 94300 Kota Samarahan, Malaysia alvin@fit.unimas.my Wong Chui Yin Multimedia University ...
In this paper we sketch an approach to natural language parsing. Our approach is example-based: it relies mainly on examples that have already been parsed into their representation structures, and on the knowledge that these examples provide the information required to parse a new input sentence. In our approach, examples are annotated with the Structured String-Tree Correspondence (SSTC) annotation schema, where each SSTC describes a sentence, a representation tree, and the correspondence between substrings in the sentence and subtrees in the representation tree. In the process of parsing, we first try to build subtrees for phrases in the input sentence that have been found in the example base (a bottom-up approach). These subtrees are then combined to form a single rooted representation tree, based on an example with a similar representation structure (a top-down approach).
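The three components of an SSTC named above (a sentence, a representation tree, and a correspondence between substrings and subtrees) can be pictured with a small data structure. This is a simplified sketch under stated assumptions: the class and field names are hypothetical, the sentence and tree are made up, and each correspondence is a single contiguous word interval, whereas the non-projective cases mentioned elsewhere on this page would need something richer.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    node_span: tuple      # (start, end) word interval for the node's own word(s)
    subtree_span: tuple   # (start, end) word interval covered by the whole subtree
    children: list = field(default_factory=list)


@dataclass
class SSTC:
    words: list
    root: Node

    def substring(self, span):
        """Return the words in the half-open interval [start, end)."""
        start, end = span
        return " ".join(self.words[start:end])


def subtree_strings(sstc, node=None):
    """Yield (label, substring) for every subtree, depth-first."""
    if node is None:
        node = sstc.root
    yield node.label, sstc.substring(node.subtree_span)
    for child in node.children:
        yield from subtree_strings(sstc, child)


# Made-up example: a dependency-style tree over a five-word sentence.
words = ["he", "picks", "up", "the", "ball"]
tree = Node("picks up", (1, 3), (0, 5), [
    Node("he", (0, 1), (0, 1)),
    Node("ball", (4, 5), (3, 5), [Node("the", (3, 4), (3, 4))]),
])
example = SSTC(words, tree)
pairs = list(subtree_strings(example))
```

Listing the subtree/substring pairs in this way is the kind of lookup a bottom-up step could use to find phrases of a new sentence that already occur in the example base.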
Abstract. This paper presents a research proposal for a user-oriented evaluation method to compare the usability of Internet search tools. Cognitive style and problem-solving style are the individual difference factors identified. Meta-search engines, portals and individual search engines are the Internet search tools available. The usability of each search tool, based on relevancy and satisfaction, is another factor in this study. The ultimate aim
ABSTRACT This research work describes our approaches to using dependency parse tree information to derive useful hidden word statistics, in order to improve the baseline of a Malay large vocabulary automatic speech recognition system. The traditional approaches to training language models are mainly based on Chomsky hierarchy type 3, which approximates natural language as a regular language. This approach ignores the characteristics of natural language. Our work attempted to overcome these limitations by extending the approach to consider Chomsky hierarchy type 1 and type 2. We extracted dependency tree based lexical information and incorporated it into the language model. Second-pass lattice rescoring was then performed to produce better hypotheses for the Malay large vocabulary continuous speech recognition system. The absolute WER reduction was 2.2% and 3.8% for the MASS and MASS-NEWS corpora, respectively.
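For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words; an absolute reduction is the baseline WER minus the new WER, in percentage points. The sketch below shows that standard computation only. The function name and the sample sentences are made up, and nothing here reproduces the paper's scoring setup.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: one reference word is missing from the hypothesis.
wer = word_error_rate("saya suka makan nasi", "saya makan nasi")
```

Comparing this value before and after lattice rescoring, on the same test set, gives the kind of absolute reduction reported above.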
ABSTRACT Many R&D projects have been conducted under PPSKOMP (School of Computer Sciences, Universiti Sains Malaysia) since its establishment in 1995. PPSKOMP has eight major research groups: Artificial Intelligence Lab, Computer Aided Translation Unit, Computer Vision Research Group, Health Information Research Group, Information Systems Engineering, Multimedia Research Group, Network Research Group, and Parallel and Distributed Computing. Many knowledge resources and processing components have been developed by researchers and are available in each of these research groups. However, these resources reside in, and are accessible only within, their respective research groups, and were mostly developed using different methodologies, paradigms and platforms. In this paper, we present a Service-Oriented Architecture (SOA) framework that is capable of resolving this problem and enabling the synergisation of research and development strengths in PPSKOMP.
ABSTRACT On the web, most structured document collections consist of documents from different sources, marked up with different types of structures. This diversity of structures has led to the emergence of heterogeneous structured documents. The heterogeneity of structured documents is one of the reasons for query-document mismatch in structured document retrieval. In structured document retrieval, users are assumed to have intimate knowledge of the document structures and to be able to specify contextual constraints in their queries. However, it is impossible for a user to know all the structures in heterogeneous structured document collections. In this paper, we propose to include similar correspondence relations in the representation model for structured document retrieval. The similar correspondences make the relations between similar contents explicit in order to improve structured document retrieval effectiveness. We introduce a generic and flexible structured document model to represent heterogeneous structured documents as well as the similar correspondences in the document collections. We also illustrate how the proposed model can be utilized in structured document retrieval.
This study focused on how human translators (HTs) perform translation tasks, which could provide a good starting point for designing and prototyping a computer-aided translation (CAT) system. Data gathered from 20 subjects were analyzed with the cognitive task analysis (CTA) technique. The user model derived from CTA was integrated into the CAT system, where the user modeling (UM) technique served in prototyping the adaptive and interactive system. UM would customize the properties of individual HTs with their task within the CAT system. HTs can use the help facilities available on the system to support their routine and non-routine tasks.
During the improvement of Malay Speech Synthesizer ver2 (MSS ver2), we focused on how the selection of target syllable utterance is to be concatenated. The selection is based on the best match of phonetic context similarity between target utterance and recorded ...
In this paper, we give updated information on the existing speech synthesizer systems that our unit has, their limitations, and our future plans to enhance them. ... Keywords: Speech Synthesis, TTS engine, concatenative synthesis, distortion, spectral discontinuity, ...