Finite State Models for the Generation of Large Corpora of Natural Language Texts

Cantone, Domenico; Cristofaro, Salvatore; Faro, Simone; Giaquinta, Emanuele

doi:10.3233/978-1-58603-975-2-175

Authors

Domenico Cantone, Salvatore Cristofaro, Simone Faro, Emanuele Giaquinta

Pages

175 - 182

DOI

10.3233/978-1-58603-975-2-175

Series

Frontiers in Artificial Intelligence and Applications

Ebook

Volume 191: Finite-State Methods and Natural Language Processing

Abstract

Natural language texts are probably one of the most common types of input for text processing algorithms. It is therefore often desirable to have a large training or testing set of such input, especially when dealing with algorithms tuned for natural language texts. In many cases, the lack of a large corpus of natural language texts can be overcome by simply concatenating a set of collected texts, even ones from heterogeneous contexts and by different authors.

In this note we present a preliminary study of a finite-state model for text generation that preserves statistical and structural characteristics of natural language texts, namely Zipf's law and the inverse-rank power law, thus providing a very good approximation for testing purposes.
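As an illustration of what conformance to Zipf's law means in practice, the following minimal sketch estimates the exponent s of a rank-frequency distribution, assuming the common formulation in which the frequency of the word of rank r is roughly proportional to 1/r^s, with s close to 1 for natural language. The function names rank_frequency and zipf_exponent are hypothetical and introduced only for this example; this is not the authors' model, only one possible way to check a generated corpus against the law.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies sorted in decreasing order (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on the log-log plot.

    Assumes `frequencies` is sorted by rank and holds at least two
    positive values; the result is the negated slope of log(frequency)
    against log(rank).
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy input, far too small for a meaningful fit; it only shows the call shape.
tokens = "the cat and the dog and the bird".split()
print(rank_frequency(tokens))  # [3, 2, 1, 1, 1]
```

On a real corpus, an estimate near 1 would be consistent with Zipf's law. A plain least-squares fit is a rough diagnostic only, since the low-frequency tail tends to dominate the fit.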

IOS Press Copyright 2024
