Strong background in natural language processing and the theory of linguistics. Experience collecting representative corpora (English & Indian languages). Supervisor: Dr. Rashmi Agrawal
The amount of unstructured text present in electronic media is increasing day after day. To extract relevant and succinct information, extraction algorithms are often limited to entity relationships. This paper is a compendium of different bootstrapping approaches, each with its own subtask of extracting dependencies such as who did what to whom from a natural language sentence. Such a survey can be extremely helpful for both feature design and error analysis when applying machine learning to natural language processing.
We discuss language-specific issues and an unsupervised approach to morphological analysis. The algorithm is probabilistic, using the distance, frequency, and length of strings. In the future it could address large corpora and agglutinative languages as well. We run the algorithm on English as well as Punjabi data; more morphemes are recognized in English than in Punjabi, since fluctuations in random behavior produce smaller segmentations at small data sizes. There will always be room for improvement as the language grows.
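The abstract names the ingredients of the algorithm (frequency and length of strings) but gives no pseudocode. A minimal sketch of one plausible reading, under our own assumptions, scores every split point of a word by the corpus frequency of its candidate stem and suffix, with a small length bonus (the 0.1 weight is purely illustrative):

```python
from collections import Counter
from math import log

def segment(words):
    """For each word, pick the stem/suffix split that maximizes a
    frequency-based score with a small length bonus on the stem.
    This is an illustrative sketch, not the paper's actual algorithm."""
    # Count how often each prefix and each suffix occurs across the corpus.
    prefixes = Counter(w[:i] for w in words for i in range(1, len(w) + 1))
    suffixes = Counter(w[i:] for w in words for i in range(len(w)))
    result = {}
    for w in words:
        best, best_score = (w, ""), float("-inf")
        for i in range(1, len(w)):
            stem, suf = w[:i], w[i:]
            # Frequent substrings shared across words score higher;
            # the 0.1 * len(stem) term breaks ties toward longer stems.
            score = log(prefixes[stem]) + log(suffixes[suf]) + 0.1 * len(stem)
            if score > best_score:
                best_score, best = score, (stem, suf)
        result[w] = best
    return result

words = ["walking", "walked", "walks", "talking", "talked", "talks"]
result = segment(words)
print(result["walking"])  # ('walk', 'ing')
print(result["talked"])   # ('talk', 'ed')
```

On this toy corpus the shared stems "walk" and "talk" and the shared suffixes "ing", "ed", and "s" reinforce each other, which is the intuition behind frequency-based unsupervised segmentation; on real data the weighting and tie-breaking would need tuning.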
NLTK (3.2.5) incorporated features such as Arabic stemmers, NIST evaluation, a MOSES tokenizer, the Stanford segmenter, a Treebank detokenizer, VerbNet, and VADER. NLTK was created in 2001 in the Computational Linguistics Department at the University of Pennsylvania and has been tested and developed ever since. The important packages of this system are 1) corpus builder, 2) tokenizer, 3) collocation, 4) tagging, 5) parsing, 6) metrics, and 7) probability distribution system. NLTK was built to meet four primary requirements: 1) Simplicity: a substantive framework of building blocks; 2) Consistency: a consistent interface; 3) Extensibility: easily scaled; and 4) Modularity: all modules are independent of each other.
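As a small illustration of two of the pieces named above, the Treebank tokenizer and the Treebank detokenizer (the latter among the 3.2.5 additions) form a round-trip pair; the sentence used here is just an example:

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Tokenize with the Penn Treebank conventions (punctuation split off).
tokens = TreebankWordTokenizer().tokenize("NLTK was created in 2001.")
print(tokens)        # ['NLTK', 'was', 'created', 'in', '2001', '.']

# The detokenizer reattaches punctuation to rebuild the sentence.
round_trip = TreebankWordDetokenizer().detokenize(tokens)
print(round_trip)    # NLTK was created in 2001.
```

Neither class needs any downloaded corpora, which makes them a convenient first contact with the toolkit's consistent-interface design goal.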
Papers by Simran Kaur