Moshe Koppel

Bar-Ilan University, Computer Science, Faculty Member

Papers by Moshe Koppel

Identification of Parallel Passages Across a Large Hebrew/Aramaic Corpus

by Avi Shmidman and Moshe Koppel

We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are representing each word in the corpus by its two most infrequent letters, finding matched pairs of strings of four or five words that differ by at most one word, and then identifying clusters of such matched pairs. Using this method, over 4600 parallel pairs of passages were identified in the Babylonian Talmud, a Hebrew-Aramaic corpus of over 1.8 million words, in just over 11 seconds. Empirical comparisons on sample data indicate that the coverage obtained by our method is essentially the same as that obtained using slow exhaustive methods.

Keywords: approximate matching; fuzzy matching; text reuse

INTRODUCTION

Ancient text corpora in classical languages such as Greek, Latin, Hebrew and Aramaic typically include numerous examples of text reuse, including repetitions of long passages of 20 words or more. Identifying such passages is important because it allows scholars to trace the development of ideas and concepts through time and across geographical ranges. Additionally, even within a given time period and geographical location, the identification of multiple parallel sources for any given idea provides a platform for scholarly inquiry. Identifying all examples of text reuse within such a large corpus is challenging for several reasons, including the large number of comparisons that must be done and the fact that matches tend to be only approximate.
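The three ideas named in the abstract (two-rarest-letter word keys, four-word strings that differ by at most one word, and clustering of matched pairs) can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the wildcard-based index, the tie-breaking rule, and the `max_gap` clustering parameter are all assumptions made for illustration, and the sketch only handles a one-word substitution at the same position, not insertions or deletions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def letter_frequencies(words):
    """Count how often each letter occurs across the whole corpus."""
    return Counter(ch for w in words for ch in w)


def word_key(word, freq):
    """Reduce a word to its two rarest letters, kept in their original order.

    Ties are broken by position (an assumption of this sketch).
    """
    if len(word) <= 2:
        return word
    rarest = sorted(range(len(word)), key=lambda i: (freq[word[i]], i))[:2]
    return "".join(word[i] for i in sorted(rarest))


def skipgram_index(keys, n=4):
    """Index every n-word window with one position masked out.

    Two windows that land in the same bucket agree on at least n - 1 keys,
    i.e. they differ by at most one word.
    """
    index = defaultdict(list)
    for start in range(len(keys) - n + 1):
        window = keys[start:start + n]
        for skip in range(n):
            masked = tuple(window[:skip]) + ("*",) + tuple(window[skip + 1:])
            index[(skip, masked)].append(start)
    return index


def matched_pairs(keys, n=4):
    """Return (a, b) start offsets of non-overlapping windows that match."""
    pairs = set()
    for starts in skipgram_index(keys, n).values():
        for a, b in combinations(starts, 2):
            if b - a >= n:  # skip windows that overlap each other
                pairs.add((a, b))
    return pairs


def cluster_pairs(pairs, max_gap=3):
    """Merge matched pairs that share an offset difference and lie close together.

    Returns (first_start, last_start, offset_difference) triples.
    """
    by_diagonal = defaultdict(list)
    for a, b in pairs:
        by_diagonal[b - a].append(a)
    clusters = []
    for diff, starts in by_diagonal.items():
        starts.sort()
        run_start = prev = starts[0]
        for s in starts[1:]:
            if s - prev > max_gap:
                clusters.append((run_start, prev, diff))
                run_start = s
            prev = s
        clusters.append((run_start, prev, diff))
    return clusters


if __name__ == "__main__":
    words = ("alpha beta gamma delta epsilon one two three "
             "alpha beta gamma zeta epsilon").split()
    freq = letter_frequencies(words)
    keys = [word_key(w, freq) for w in words]
    print(cluster_pairs(matched_pairs(keys)))
```

Masking one position per window lets matches be found by bucket lookup rather than by comparing every window against every other, which is one plausible way to avoid the "large number of comparisons" the introduction mentions.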

A Systemic Functional Approach to Automated Authorship Analysis

Proceedings of the ECAI'08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008

Ecai, 2008

Bias-Driven Revision of Logical Domain Theories

arXiv e-print 1105.2365, May 1, 2011

Web Based Textual Entailment

Translationese and Its Dialects

Markers of translator gender: do they really matter?

Theory Revision Using Noisy Exemplars

Refinement of Approximate Rule Bases

Distilling Reliable Information From Unreliable Theories

Getting the Most from Flawed Theories

The Relevance of Bias in the Revision of Approximate Domain Theories

Probabilistic Revision of Relational Theories

Foundations of artificial intelligence II. Selected papers from the 2nd Bar-Ilan Symposium, Ramat Gan, Israel, June 1991

Annals of Mathematics and Artificial Intelligence

Identifying the Information Contained in a Flawed Theory

Probabilistic Revision of Propositional Domain Theories

Detecting and removing shadows

This paper describes a method for detecting and removing shadows with hard borders in RGB images. The method begins by segmenting the color image. Each segment is then classified as shadow or non-shadow by examining its neighboring segments. Following Finlayson et al. [1], we remove the shadows by zeroing the shadow borders in an edge representation of the image and then re-integrating the edges using the method introduced by Weiss [2]. This is done for each color channel, leaving a shadow-free color image. Unlike previous methods, the present method requires neither a calibrated camera nor multiple images. It is complementary to current illumination correction algorithms. Examination of a number of examples indicates that this method yields a significant improvement over previous methods.
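The "zero the borders, then re-integrate" step can be illustrated with a one-dimensional toy in Python. This is only a sketch of the general idea, not the methods of [1] or [2]: it assumes the shadow lowers a signal by a constant offset (as it would in a log-intensity representation, which is an assumption here), that the border positions are already known, and that re-integration is a plain cumulative sum. A real image would need a two-dimensional re-integration instead. The function name `remove_shadow_1d` is hypothetical.

```python
def remove_shadow_1d(signal, border_indices):
    """Zero the derivative at known shadow borders, then re-integrate.

    signal: list of floats (one scanline of one color channel).
    border_indices: positions i where the step signal[i] -> signal[i + 1]
        crosses a shadow border.
    """
    # Edge representation: first differences between neighboring samples.
    grads = [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]

    # Remove the shadow's edges while keeping every other edge intact.
    for i in border_indices:
        grads[i] = 0.0

    # Re-integrate by cumulative summation, anchored at the first sample.
    restored = [signal[0]]
    for g in grads:
        restored.append(restored[-1] + g)
    return restored


if __name__ == "__main__":
    # Samples 2..4 are in shadow (each lowered by 3.0 in this toy).
    scanline = [5.0, 6.0, 3.0, 2.0, 3.0, 6.0, 5.0]
    print(remove_shadow_1d(scanline, border_indices=[1, 4]))
```

Zeroing a border also discards whatever genuine texture gradient sat exactly on that border, so the recovery is exact in this toy only because the underlying signal happens to be flat at both borders.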

Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining

We present a web mining method for discovering and enhancing relationships in which a specified concept (word class) participates. We discover a whole range of relationships focused on the given concept, rather than generic known relationships as in most previous work. Our method is based on clustering patterns that contain concept words and other words related to them. We evaluate the method on three different rich concepts and find that in each case the method generates a broad variety of relationships with good precision.

Proceedings of the SIGIR 2007 International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, PAN 2007, Amsterdam, Netherlands, July 27, 2007

Shamela: A Large-Scale Historical Arabic Corpus.

by Alexander Magidow, Yonatan Belinkov, Maxim Romanov, Avi Shmidman, and Moshe Koppel

Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies in which we show its application to the digital humanities.