We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are representing each word in the corpus by its two most infrequent letters, finding matched pairs of strings of four or five words that differ by at most one word, and then identifying clusters of such matched pairs. Using this method, over 4600 parallel pairs of passages were identified in the Babylonian Talmud, a Hebrew-Aramaic corpus of over 1.8 million words, in just over 11 seconds. Empirical comparisons on sample data indicate that the coverage obtained by our method is essentially the same as that obtained using slow exhaustive methods.
Keywords: approximate matching; fuzzy matching; text reuse
INTRODUCTION
Ancient text corpora in classical languages such as Greek, Latin, Hebrew and Aramaic typically include numerous examples of text reuse, including repetitions of long passages of 20 words or more. Identifying such passages is important because it allows scholars to trace the development of ideas and concepts through time and across geographical ranges. Additionally, even within a given time period and geographical location, the identification of multiple parallel sources for any given idea provides a platform for scholarly inquiry. Identifying all examples of text reuse within such a large corpus is challenging for several reasons, including the large number of comparisons that must be made and the fact that matches tend to be only approximate.
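The word representation and near-match search described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `word_code` and `matched_pairs` are hypothetical, the "drop one position and bucket the rest" trick is one assumed way of finding four-word strings that differ by at most one word, ties between equally rare letters are broken by position, and the final clustering step is omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def word_code(word, letter_freq):
    """Represent a word by its two rarest letters, kept in word order."""
    if len(word) <= 2:
        return word
    # Rank letter positions by corpus frequency (ties broken by position).
    ranked = sorted(range(len(word)), key=lambda i: (letter_freq[word[i]], i))
    keep = sorted(ranked[:2])
    return "".join(word[i] for i in keep)


def matched_pairs(words, n=4):
    """Return start-index pairs of n-word windows differing in at most one word.

    Assumption: near matches are found by dropping each position in turn
    and bucketing windows on the remaining n-1 word codes.
    """
    letter_freq = Counter("".join(words))
    codes = [word_code(w, letter_freq) for w in words]

    buckets = defaultdict(set)
    for start in range(len(codes) - n + 1):
        window = codes[start:start + n]
        for skip in range(n):
            key = (skip, tuple(window[:skip] + window[skip + 1:]))
            buckets[key].add(start)

    pairs = set()
    for starts in buckets.values():
        for a, b in combinations(sorted(starts), 2):
            if b - a >= n:  # ignore windows that overlap each other
                pairs.add((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    text = "the old king went home and later the old queen went home again"
    print(matched_pairs(text.split()))
```

Because every window is indexed once per dropped position, the work grows roughly linearly with corpus size rather than with the number of window pairs, which is the kind of saving the abstract alludes to. Reducing words to two letters can also produce spurious matches, so a real system would need a verification or clustering pass afterwards.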
This paper describes a method for the detection and removal of shadows in RGB images. The shadows are assumed to have hard borders. The proposed method begins with a segmentation of the color image. Each segment is then classified as shadow or non-shadow by examining its neighboring segments. We use the method introduced by Finlayson et al. [1] to remove the shadows by zeroing the shadow's borders in an edge representation of the image, and then re-integrate the edges using the method introduced by Weiss [2]. This is done for each of the color channels, leaving a shadow-free color image. Unlike previous methods, the present method requires neither a calibrated camera nor multiple images. This method is complementary to current illumination correction algorithms. Examination of a number of examples indicates that this method yields a significant improvement over previous methods.
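The "zero the border edges, then re-integrate" idea in this abstract can be illustrated on a one-dimensional signal. This is only a sketch under stated assumptions: the function name `remove_shadow_1d` is hypothetical, the signal is assumed to be a single row of log-intensities (so a shadow acts as an additive offset), the first sample is assumed to be lit, and a plain cumulative sum stands in for the two-dimensional re-integration of Weiss [2], which is not reproduced here.

```python
def remove_shadow_1d(log_signal, border_edges):
    """Remove a hard-bordered shadow from a 1-D log-intensity signal.

    border_edges lists the indices i of the edges between samples i and
    i + 1 that lie on a shadow border (assumed to be known already).
    """
    # Edge representation: differences between neighbouring samples.
    edges = [b - a for a, b in zip(log_signal, log_signal[1:])]

    # Zero the edges that cross a shadow border.
    for i in border_edges:
        edges[i] = 0.0

    # Re-integrate by cumulative summation from the first (lit) sample.
    result = [log_signal[0]]
    for e in edges:
        result.append(result[-1] + e)
    return result


if __name__ == "__main__":
    # Samples 3..5 are in shadow: the same texture, shifted down by 3.
    row = [5.0, 6.0, 5.0, 2.0, 3.0, 2.0, 5.0, 6.0]
    print(remove_shadow_1d(row, border_edges=[2, 5]))
```

In this toy case the texture inside the shadow is preserved while the offset disappears. In two dimensions the modified edge field is generally no longer integrable, which is why a dedicated re-integration method is needed there.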
We present a web mining method for discovering and enhancing relationships in which a specified concept (word class) participates. We discover a whole range of relationships focused on the given concept, rather than generic known relationships as in most previous work. Our method is based on clustering patterns that contain concept words and other words related to them. We evaluate the method on three different rich concepts and find that in each case the method generates a broad variety of relationships with good precision.
Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies in which we show its application to the digital humanities.
Papers by Moshe Koppel