I work on language as a knowledge system, applying interdisciplinary approaches that include, but are not limited to, analytical methods, behavioural experiments, corpus-driven analysis and computational modelling. The fields I have published most in are computational and corpus linguistics, lexical semantics, Chinese linguistics and ontology. Recently I have become interested in language as a self-adaptive complex system, as well as the cognitive and ontological basis of language.
Leech’s corpus-based comparison of English modal verbs from 1961 to 1992 showed the steep decline of all modal verbs together, which he ascribed to continuing changes towards a more equal and less authority-driven society. This study inspired many diachronic and synchronic studies, mostly on English modal verbs and largely assuming the correlation between the use of modal verbs and power relations. Yet, there are continuing debates on sampling design and the choices of corpora. In addition, this hypothesis has not been attested in any other language with comparable corpus size or examined with longitudinal studies. This study tracks the use of Chinese modal verbs from 1901 to 2009, covering the historical events of the New Culture Movement, the establishment of the PRC, the implementation of simplified characters and the completion and finalization of simplification of the Chinese writing system. We found that the usage of modal verbs did rise and fall during the last century, and f...
Sentiment analysis of tweets is a complex task, because these short messages employ unconventional language to increase their expressiveness. The task becomes even more difficult when people use figurative language (e.g. irony, sarcasm and metaphors), because it causes a mismatch between the literal meaning and the sentiment actually expressed. In this paper, we describe a sentiment analysis system designed for handling ironic and sarcastic tweets. Features grounded on several linguistic levels are proposed and used to classify the tweets on an 11-point scale, using a decision tree. The system is evaluated on the dataset released by the organizers of SemEval 2015, Task 11. The results show that our method largely outperforms the systems proposed by the participants of the task on ironic and sarcastic tweets.
While neural embeddings represent a popular choice for word representation in a wide variety of NLP tasks, their usage for thematic fit modeling has been limited, as they have been reported to lag behind syntax-based count models. In this paper, we propose a complete evaluation of count models and word embeddings on thematic fit estimation, taking into account a larger number of parameters and verb roles and also introducing dependency-based embeddings into the comparison. Our results show a complex scenario, in which a determining factor for performance seems to be whether the model has access to reliable syntactic information for building the distributional representations of the roles.
This paper adopts a comparable corpus-based approach to light verb variations in two varieties of Mandarin Chinese and proposes a theoretical account based on transitivity (Hopper and Thompson 1980). Light verbs are highly grammaticalized and lack strong collocation restrictions; hence they have been a challenge for empirical accounts. It is even more challenging to consider their variations between different varieties (e.g. Taiwan and Mainland Mandarin). The current study follows the research paradigm set up in Lin et al. (2014) for differentiating different light verbs and Huang et al. (2014) for automatic discovery of light verb variations. In our study, a corpus-based statistical approach is adopted to show that both internal variety differences between light verbs and external differences between different variants can be detected effectively. The distributional differences between Mainland and Taiwan can also shed light on the re-classification of syntactic types of the taken comple...
The present paper explored changes in topical focus and in language in the Report on the Work of the Government delivered by Premiers of the People’s Republic of China (hereinafter Report texts). Text clustering and correspondence analysis showed how the focal topics changed across Report texts from three selected periods. The Report texts were represented by their clause length distributions and clustered; the clustering result showed differences in clause length usage among the Report texts. The relationship between clause length and word length was also studied: average word length decreases with clause length, and the data were fitted using the function y = ax^b, based on the Menzerath-Altmann Law. The relationship among the Report texts of the three periods, as represented by the fitted parameters a and b, was explored.
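As a rough illustration of the kind of fit described in this abstract, the sketch below estimates the two parameters a and b of a power-law curve y = a * x^b, the commonly used truncated form of the Menzerath-Altmann Law. The function name `fit_power_law`, the log-log least-squares procedure and the sample numbers are assumptions made for illustration; the abstract does not say how the authors fitted their data.

```python
import math


def fit_power_law(xs, ys):
    """Estimate (a, b) in y = a * x**b by least squares on (log x, log y).

    Hypothetical helper: the fitting procedure is an assumption, not the
    method reported in the paper. All xs and ys must be positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx
    # The intercept in log-log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Made-up example: clause lengths (in words) and mean word lengths.
clause_lengths = [1, 2, 4, 8]
mean_word_lengths = [2.0 * x ** -0.5 for x in clause_lengths]
a, b = fit_power_law(clause_lengths, mean_word_lengths)
```

A negative b corresponds to the pattern the abstract reports, where average word length falls as clause length grows. Comparing the fitted (a, b) pairs across periods is one way to read the final sentence of the abstract.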
In this paper, we claim that vector cosine, which is generally considered among the most efficient unsupervised measures for identifying word similarity in Vector Space Models, can be outperformed by an unsupervised measure that calculates the extent of the intersection among the most mutually dependent contexts of the target words. To support this claim, we describe and evaluate APSyn, a variant of the Average Precision that, without any optimization, outperforms the vector cosine and the co-occurrence on the standard ESL test set, with an improvement ranging between +9.00% and +17.98%, depending on the number of chosen top contexts.
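The idea of scoring two words by the overlap of their strongest contexts can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `top_contexts` and `apsyn_like`, the input format (a dictionary mapping each context to an association weight such as PPMI) and the choice to weight each shared context by the inverse of its average rank are all assumptions made here.

```python
def top_contexts(weights, n):
    """Map each of the n highest-weighted contexts to its rank (1 = strongest).

    Ties are broken alphabetically so the ranking is deterministic.
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return {ctx: rank for rank, (ctx, _) in enumerate(ranked, start=1)}


def apsyn_like(weights_a, weights_b, n=1000):
    """Score two words by the intersection of their top-n contexts.

    Each shared context contributes the inverse of its average rank in the
    two lists, so contexts ranked highly for both words count the most.
    (Assumed weighting scheme, for illustration only.)
    """
    ranks_a = top_contexts(weights_a, n)
    ranks_b = top_contexts(weights_b, n)
    shared = ranks_a.keys() & ranks_b.keys()
    return sum(1.0 / ((ranks_a[c] + ranks_b[c]) / 2.0) for c in shared)


# Toy association weights for two target words (made-up numbers).
cat = {"purr": 3.0, "pet": 2.0, "tail": 1.0}
dog = {"pet": 3.0, "bark": 2.0, "tail": 1.0}
score = apsyn_like(cat, dog, n=3)
```

The parameter `n` plays the role of the "number of chosen top contexts" mentioned in the abstract. Unlike the cosine, this score uses only the ranks of the shared contexts and ignores the rest of the vectors.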