Shu-Kai Hsieh

National Taiwan University, Linguistics, Faculty Member

Papers by Shu-Kai Hsieh

Skillex, an action labelling efficiency score: the case for French and Mandarin

Proceedings of the Annual Meeting of the Cognitive Science Society (UC Merced)

We propose a model to compute two measurements of semantic efficiency of verbs as action labels. It is based on the exploration of the specific structure of synonymy networks of verbs. We use these measurements to analyse and compare the semantic efficiency of [Children/Adults] productions in action labelling tasks, in French and Mandarin. The combination of these two measurements leads to a generic score of semantic efficiency, Skillex. Assigned to participants of the Approx protocol experiment, this score enables us to accurately classify them into Children and Adults categories, be they French or Mandarin native speakers.

Sentiment detection in micro-blogs using unsupervised chunk extraction

Lingua Sinica, 2016

Sketching the Dependency Relations of Words in Chinese

Wiktionary and NLP

Proceedings of the 2009 Workshop on The People's Web Meets NLP Collaboratively Constructed Semantic Resources - People's Web '09, 2009

Chinese Sentiments on the Clouds

This study proposes a novel pipeline architecture for building and analyzing large-scale linguistic data in a cloud-based environment, taking an experimental survey on a Chinese Polarity Lexicon as an example. In this experiment, data are evaluated and tagged through a crowdsourcing approach using an online Google Form. All data processing and analysis procedures are completed on the fly, automatically and dynamically, with free cloud services. The paper shows the advantages of using cloud-based ...

Incorporating structural topic modeling into short text analysis

Concentric. Studies in Linguistics

The past few decades have seen the rapid development of topic modeling. So far, research has been more concerned with determining the ideal number of topics or meaningful topic clustering words than with applying topic modeling techniques to evaluate linguistic theories. This study proposes the Structural Topic Model (STM)-led framework to facilitate the interpretation of topic modeling results and standardize text analysis. STM encompasses various model training mechanisms, thereby requiring systematic designs to properly combine language studies. “Structural” in STM refers to the inclusion of metadata structure. Unlike the corpus-based keyness approach, STM can capture contextual cues and meta-information for the interpretation of topical results. Besides, STM can make cross-corpora comparisons via topical contrast, a challenging task for corpus-driven related models such as the Biterm Topic Model (BTM). Stylistic variations in song lyrics are taken as an illustration to show how ...

2009 Index International Journal of Computational Linguistics &

Assessing Text Readability Using Hierarchical Lexical Relations Retrieved from WordNet

Although some traditional readability formulas have shown high predictive validity in the r = 0.8 range and above (Chall & Dale, 1995), they are generally not based on genuine linguistic processing factors, but on statistical correlations (Crossley et al., 2008). Improvement of readability assessment should focus on finding variables that truly represent the comprehensibility of text as well as the indices that accurately measure the correlations. In this study, we explore the hierarchical relations between lexical items based on the conceptual categories advanced from Prototype Theory (Rosch et al., 1976). According to this theory and its development, basic level words like guitar represent the objects humans interact with most readily. They are acquired by children earlier than their superordinate words like stringed instrument and their subordinate words like acoustic guitar. Accordingly, the readability of a text is presumably associated with the ratio of basic level words it co...

Hanzi Grid: Toward a Knowledge Infrastructure for Chinese Character-based Cultures

The long-term historical development and broad geographical variation of Chinese characters (Hanzi/Kanji) have made them a cross-cultural information sharing platform in East Asia. However, due to the lack of a proper research framework, the integration of heterogeneous knowledge grounded in Hanzi and its variants has been a thorny problem. In this paper, we propose a theoretical framework for the knowledge representation of Hanzi in the cross-cultural context. Our proposal is mainly based on two resources: Hantology and Generative Lexicon Theory. Hantology is a comprehensive Chinese character-based knowledge resource created to provide a solid foundation both for philological surveys and language processing tasks, while Generative Lexicon Theory is extended to capture the abundant knowledge information of Chinese characters within its proposed qualia structure. We believe that the proposed theoretical framework will have great influence on the current research paradigm of Hanz...

Stance Classification on PTT Comments

With the development of social media and online forums, users have grown accustomed to expressing their agreement and disagreement via short texts. Elements that reveal the user’s stance or subjectivity thus become an important resource in identifying the user’s position on a given topic. In the current study, we observe comments on an online bulletin board in Taiwan to see how people express their stance when responding to other people’s posts in Chinese. A lexicon is built based on linguistic analysis and annotation of the data. We performed a binary classification task using these linguistic features and were able to reach an average of 71 percent accuracy. A linguistic analysis of the confusion arising in the classification task is carried out to inform future work on improving accuracy for such tasks.

Query Expansion using LMF-Compliant Lexical Resources

This paper reports a prototype multilingual query expansion system relying on LMF-compliant lexical resources. The system is one of the deliverables of a three-year project aiming at establishing an international standard for language resources which is applicable to Asian languages. Our important contributions to ISO 24613, the Lexical Markup Framework (LMF) standard, include its robustness in dealing with Asian languages and its applicability to cross-lingual query tasks, as illustrated by the prototype introduced in this paper.

Computational Modeling of Affixoid Behavior in Chinese Morphology

Proceedings of the 28th International Conference on Computational Linguistics

Formal Description of Lexical Semantic Relations

Lexical semantic relations have played an important role in the recent developments of Natural Language Processing and Computational Lexical Resources as well. This paper reviews the notion of lexical semantic relations in the WordNet-like lexical resources, and proposes a formal modeling of lexical semantic relations using the extended Formal Concept Analysis. I believe that the proposed formalization will be able to highlight problems with regard to lexical and cultural gaps, and serve as a foundation for solutions that support lexical theoretical explorations and applications for multilingual wordnets in the future.

Entrenchment and Creativity in Chinese Quadrasyllabic Idiomatic Expressions

This paper aims to explore a special type of idiomatic expressions of even length called Quadrasyllabic Idiomatic Expressions (QIEs) in Chinese, and explain their variations with reference to semantic and structural constraints on the elements imposed by the construction of QIEs on the one hand, and its interplay with individual semantic elements in semantic space in the comprehension task of QIEs variants. Results of both human ratings and a behavioral experiment show that semantic distance affects the speed of comprehension with the construction entrenchment. For those QIEs with idiomaticity, semantic distance leads to no major effect. We show that Chinese QIEs provide an ideal testing ground for the empirical investigation of the functional linguistic notion of entrenchment in processing multi-morphemic strings.

Computational Representation of Chinese Characters: Comparison Between Singular Value Decomposition and Variational Autoencoder

The Journal of Cognitive Science, 2020

Being a notoriously complex problem, writing is generally decomposed into a series of subtasks: idea generation, expression, revision, etc. Given some goal, the author generates a set of ideas (brainstorming), which he integrates into some skeleton (outline, text plan, outline). This leads to a first draft which is submitted then for revision possibly yielding changes at various levels (content, structure, form). Having made a draft, authors usually revise, edit, and proofread their documents. We confine ourselves here only to academic writing, focusing on sentence production. While there has been quite some work on this topic, most writing assistance has mainly dealt with grammatical errors, editing and proofreading, the goal being the correction of surface-level problems such as typography, spelling, or grammatical errors. We broaden the scope by also including cases where the entire sentence needs to be rewritten in order to express properly all of the information planned. Hence,...

Leveraging Morpho-semantics for the Discovery of Relations in Chinese Wordnet

Semantic relations of different types have played an important role in wordnet, and have been widely recognized in various fields. In recent years, with the growing interest in constructing semantic networks in support of intelligent systems, automatic semantic relation discovery has become an urgent task. This paper aims to extract semantic relations relying on the in situ morpho-semantic structure in Chinese, which can dispense with an outside source such as corpus or web data. Manual evaluation of thousands of word pairs shows that most relations can be successfully predicted. We believe that it can serve as a valuable starting point to complement other approaches, which holds promise for robust lexical relation acquisition.

The Secret to Popular Chinese Web Novels: A Corpus-Driven Study

What is the secret to writing popular novels? The issue is an intriguing one among researchers from various fields. The goal of this study is to identify the linguistic features of several popular web novels as well as how the textual features found within and the overall tone interact with the genre and themes of each novel. Apart from writing style, non-textual information may also reveal details behind the success of web novels. Since web fiction has become a major industry with top writers making millions of dollars and their stories adapted into published books, determining essential elements of “publishable” novels is of importance. The present study further examines how non-textual information, namely, the number of hits, shares, favorites, and comments, may contribute to several features of the most popular published and unpublished web novels. Findings reveal that keywords, function words, and lexical diversity of a novel are highly related to its genres and writing style w...

Features of Verb Complements in Co-composition: A case study of Chinese baking verb using Weibo corpus

In the Generative Lexicon Theory (GLT), co-composition is one of the generative devices proposed to explain the cases of verbal polysemous behavior where more than one function application is allowed. The English baking verbs were used as one of the examples to illustrate how their complements co-specify the verb with qualia unification. In this paper, we begin by exploring the polysemy of Chinese baking verb, where the first two senses in Chinese Wordnet (CWN) are assumed. Features including linguistic cues and common sense knowledge are involved in the experiment with Weibo corpus and computed with SVM for closer investigation. From the analysis, it is found that though there are various cases found in senses of change of state and creation, a coarse but systematic approach combined with certain features in disambiguating CWN senses could be arranged. In addition, we further observe that the usage of various instruments cases and classifiers would be harnessed by underlying backgr...

CogALex-V Shared Task: LOPE

Automatic discovery of semantically related words is one of the most important NLP tasks, and has great impact on the theoretical psycholinguistic modeling of the mental lexicon. In this shared task, we employ a word embedding model to test two ideas explicitly or implicitly assumed by the NLP community: (1) word embedding models can map syntagmatic similarities in usage between words to distances in the projected vector space; (2) word embedding models can reflect paradigmatic relationships between words.
