
Using Stochastic Automaton for Data Consolidation

Naukovì vìstì Nacìonalʹnogo tehnìčnogo unìversitetu Ukraïni "Kiïvsʹkij polìtehnìčnij ìnstitut", 2017
UDC 615.47:616-085
DOI: 10.20535/1810-0546.2017.2.100011

O.V. Koval*, V.A. Kuzminykh, D.V. Khaustov
Igor Sikorsky KPI, Kyiv, Ukraine
*corresponding author: avkovalgm@gmail.com

USING STOCHASTIC AUTOMATON FOR DATA CONSOLIDATION

Background. Development of methods and algorithms for the efficient search of relevant information on demand. The article deals with the consolidation of data for subsequent use in information-analytical systems.
Objective. The aim of the paper is to identify the capabilities for, and to build, algorithms for searching relevant information in disparate sources by analyzing probabilistic information that indicates the possible presence of relevant documents in these sources.
Methods. To find information relevant to search queries, an approach is used that is based on estimates of the probability that relevant documents are present in the sources, with a subsequent increase in the number of documents selected from these sources for analysis of their relevance to the query.
Results. A programmable stochastic automaton structure that ensures selection of the information sources most promising by relevance parameters, and an information retrieval algorithm based on this stochastic automaton, were developed.
Conclusions. The described algorithm of using a stochastic automaton for data consolidation makes it possible to develop a set of software tools that provides a sufficiently complete and holistic solution of the data consolidation problem for diverse systems that search for information in sources differing in composition and presentation type.
Keywords: open data sources; data consolidation; information-analytical systems; information retrieval systems; probabilistic models; relevance; big data tasks.

Introduction

Directed information gathering from open sources is considered one of the standard methods of information gathering in different spheres of modern society. Traditionally, specialists collect and analyze information from the media, official data, press conference materials, public statements, professional and academic reports, conferences, and articles. The transition to electronic media largely determines the approaches and methods of directed search and the increase in efficiency of both individual procedures and information search in general.

The amount of information produced by humanity is constantly increasing, which greatly complicates the problem of storing large amounts of data and extracting relevant and important information from them. Every day the problem of processing such information assumes greater significance. The sheer volume of electronic materials makes it difficult for users to find the necessary information [1]. Information consolidation can help solve these problems [2, 3]. In a broad sense, consolidation can be understood as the process of searching, selecting, analyzing, structuring, converting, storing, cataloging, and providing to the consumer information on a given topic. The task of information consolidation is one of the most important problems of processing large amounts of data (big data) [4].

Consolidation is generally regarded as a set of techniques and procedures for extracting data from various sources, ensuring the necessary level of information content and quality, and converting the data into a single format in which they can be loaded into a data warehouse or analytical system. In some cases data consolidation is the initial stage of any analytical task or project [3].
The basis of consolidation is the process of collecting and storing data in a form optimal for their processing on a specific analytic platform or for solving a specific analytical problem. The associated consolidation tasks are to assess the quality of the data and enrich them, and to reduce the amount of information that has to be processed by the data retrieval or information-analytical system. The main results that data consolidation has to ensure for further in-depth processing are as follows:
- high access rate to large volumes of data;
- compactness of large-volume data storage;
- support of data structure integrity;
- monitoring of data consistency and relevance.

A distinctive feature of information gathering from open sources is the instability of the informational content of these sources, the lack of reliable prior information about their content and its relevance, and the typically low accuracy and efficiency of expert assessments of the state of the sources and their compliance with the topics and query parameters.

Therefore, to process data from open information sources it is necessary to conduct effective data consolidation using specialized software tools that provide the following possibilities:
- automatic selection of the information sources most relevant to the request;
- accumulation of information on the sources' state in the course of serving the request;
- consideration of possible changes in the sources' state by the time of a repeated request;
- analysis of the most promising sources from the point of view of relevance, as well as of the less relevant ones;
- creation of an informational assessment of the promising sources in terms of relevance.

Problem statement

The purpose of the study is the development of a model and an algorithm for consolidating information from a certain number of heterogeneous sources with high document content. The task is to select the maximum number of documents relevant to the request while processing (checking for relevance to the request) the minimum number of documents, by detecting and using the information sources most appropriate to the request.

Review of existing solutions

Today there are many information search models, built on various mathematical methods, that can form the basis of an information consolidation system. Modern information-analytical and information retrieval systems are built on these models and their various modifications. Among them, the most common types of models are the following [2]:
- Boolean model;
- fuzzy set model;
- vector model;
- latent-semantic model;
- probabilistic model.

The Boolean model is based on the apparatus of mathematical logic and set theory [5]. The model is based on a matrix that relates documents to the indexing terms [6].

The model advantages:
- construction simplicity;
- ease of program implementation.

The model disadvantages:
- difficulty of query building without knowledge of Boolean algebra;
- the search results must contain all terms of the user's request;
- automatic ranking of the retrieved documents is virtually impossible;
- the search scales badly.

The fuzzy set model is based on fuzzy set theory, which, in contrast to traditional set theory, allows partial membership of an element in a set [7, 8]. In this model the whole array of documents is described as a set of fuzzy term sets.
The model advantages:
- the ability to rank results;
- indexing and determining the relevance of a selected document to a request requires less computation;
- easy to implement;
- no requirements for large volumes of memory.

The model disadvantages:
- computing and storage costs are higher than for the Boolean model;
- lack of a sharp distinction between search results.

The vector model [8, 9] represents documents and user queries as n-dimensional vectors in an n-dimensional vector space, where the dimension n is the number of different terms in all documents [10].

The model advantages:
- construction simplicity;
- the ability to rank results.

The model disadvantages:
- requires large amounts of data processing;
- requires exact word matching.

The latent-semantic model is commonly referred to as latent semantic indexing in information retrieval theory [11]. Latent semantic analysis (LSA) is a method of extracting and representing context-dependent word meanings by statistical analysis of large sets of text documents. LSA consists of two stages: study and analysis of the indexed data [12].

The model advantages:
- a space of much smaller dimension than the vector space is used;
- no need to match exact words;
- the model does not require a complicated setup.

The model disadvantages:
- complexity leads to a large number of calculations;
- there are no rules for choosing the dimension, on which the effectiveness of the results depends.

The probabilistic model. These search models are based on the methods of probability theory. They use statistical indicators that determine the probability that a document is relevant to the search query [13]. The models rest on the probabilistic ranking principle: documents are ranked in descending order of the probability of their relevance to the user request. Probabilistic models differ from each other in the procedures for calculating probability estimates [14].

The model advantages:
- the ability to adapt during use;
- the volume of calculations does not depend on the number of terms or other parameters;
- high efficiency when working with dynamically updated data sources.

The model disadvantages:
- the need for continuous learning of the system;
- low efficiency at the initial stages of work.

Today there is no model that possesses quantitative and qualitative advantages over all the others [15, 16]. Therefore, the development of new algorithms and methods combining the advantages of the different approaches to model building is highly relevant and requires new solutions. This article describes an approach to the implementation of a software system built on elements of stochastic automata theory for selecting information sources and on fuzzy sets for assessing the relevance of information sources.

The mathematical formulation of the problem

The structure of the interaction of the main information elements, reflecting the consolidation of documents (information units), is shown in Fig. 1. It contains the following elements:
1. Information sources.
2. Query parameters.
3. Document packets.

The information sources are information resources that are important and considered in the construction of a particular information system, or in a certain case of information search. Currently, the following information resources are distinguished:
1. The media, which include various kinds of news sites (RSS feeds) and semantic sites (or electronic versions of media).
2. Digital libraries: distributed information systems that reliably store and effectively use electronic documents through a global network.
3. Electronic databases: collections of files organized in a special way, documents grouped by topic, and spreadsheets combined into groups.
4. Corporate and personal sites: online resources dedicated to an organization, company, enterprise, particular issue, or person. They are distinguished by the completeness of information covering all aspects of the activity.
5. Information portals: groups of sites where a variety of services can be used. They can contain various scientific, political, economic, and other information, as well as electronic mailboxes, blogs, catalogs, dictionaries, directories, weather forecasts, television programs, exchange rates, etc. Generally, they are updated in real time.

Fig. 1. The basic elements of information consolidation: query parameters Z_1,...,Z_m; sources D_1,...,D_n; document packets S_1,...,S_n

The request content settings are mostly simple and complex terms (terms related by operators), parameters that define various features of time and place (of appearance, publication, storage, etc.), and many other characteristics that determine the specifics of the request.

The document packets are specific sets of information units (articles, records, tables, reports, newspapers, magazines, books, abstracts, news, reviews, etc.) with a similar presentation format.

Let us consider the main characteristics describing information consolidation in accordance with the stochastic model of information source selection based on the theory of stochastic automata [17].

To assess the relevance of the information sources to queries, a model based on the theory of fuzzy sets is used [7]. It links the query parameters and information sources and allows a partial matching between the request and the information source; this matching takes values in the range [0, 1]. The value that estimates the correspondence between the request and the information source is based on determining the coincidence between the query parameters and the characteristics of the information units included in the information source in question.

The quantitative matching assessment of the l-th document (l = 1,...,V) from the i-th information source to the j-th query parameter in a single sampling for source relevance evaluation is defined as q_{ijl} = 0 when the j-th parameter is not present in the l-th document of the i-th source, and q_{ijl} = 1 when it is present. After sampling and evaluating several information units from a certain information source, this value is averaged over the number of selected items (documents).

The quantitative assessment of the compliance of the i-th source with the j-th query parameter determines the partial match; it lies in the range [0, 1] and is defined as

$$k_{ij} = \frac{1}{V} \sum_{l=1}^{V} q_{ijl},$$

where V, the volume of a single sample from one document source, has to meet the limitation

$$V \le S_i \quad \text{for } i = 1, \ldots, n,$$

with S_i the number of information units (documents) in each of the n sources considered for this request.

Then the i-th information source relevance evaluation is defined as

$$R_i = \frac{1}{m} \sum_{j=1}^{m} k_{ij} v_j, \qquad (1)$$

where 0 ≤ v_j ≤ 1 and v_j is the weight coefficient of the j-th topical query parameter, determined by expert assessments or on the basis of priorities that can be independently set by the customer of the information request. Thus the evaluation of the relevance of the i-th source to a specific request lies in the range [0, 1].
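Both formulas translate directly into code. Below is a minimal sketch in Python; the paper prescribes only the formulas, not an implementation, so the names k_coefficient and source_relevance are hypothetical, and the callback match(doc, param) stands in for whatever per-document parameter test a concrete system uses.

```python
from typing import Callable, Sequence

def k_coefficient(docs: Sequence, param, match: Callable) -> float:
    """k_ij: fraction of the V sampled documents from source i that
    contain query parameter j (the average of the binary q_ijl)."""
    return sum(1 for doc in docs if match(doc, param)) / len(docs)

def source_relevance(docs: Sequence, params: Sequence,
                     weights: Sequence[float], match: Callable) -> float:
    """R_i of Eq. (1): weighted mean of k_ij over the m query parameters."""
    m = len(params)
    return sum(v * k_coefficient(docs, p, match)
               for p, v in zip(params, weights)) / m
```

Both values stay in [0, 1] as long as every weight v_j does, matching the ranges stated above.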
The consolidation algorithm based on the use of a stochastic automaton

To organize the procedures of selecting information sources relevant to specific queries, algorithms based on stochastic automata can be used. Such a stochastic automaton is a Mealy-type automaton [17], i.e. an automaton whose conditional probability density satisfies

$$p(u', y \mid u, x) = p(u' \mid u, x)\, p(y \mid u, x),$$

where x ∈ X are the input values of the stochastic automaton, y ∈ Y are its output values, and u, u' ∈ U are its possible states. For such a stochastic automaton, the next state u' does not depend on the output y, and the output y does not depend on the next state u', for any possible input value x and any possible state u. So for a stochastic Mealy-type automaton the output and the next state are independent of one another; each depends only on the input value and the previous state.

If a stochastic Mealy-type automaton satisfies the relation

$$p(y \mid u, x, u') = p(y \mid u) \quad \text{whenever } p(u' \mid u, x) \ne 0,$$

then it is a Moore-type automaton. A stochastic Moore-type automaton [17] is a special case of the stochastic Mealy-type automaton: its output depends on its state and does not depend on the input, while its next state is determined by the previous state and the input. So for a stochastic Moore-type automaton it can be stated that

$$p(u', y \mid u, x) = p(u' \mid u, x)\, p(y \mid u).$$

Considering this automaton as an automaton with discrete time (a discrete stochastic automaton), where the transition points are the iteration numbers of the information seeking process with a finite number of iterations, we can write

$$p(u(t+1), y(t) \mid u(t), x(t)) = p(u(t+1) \mid u(t), x(t))\, p(y(t) \mid u(t)).$$

From the perspective of a search system built on the principles described above, these values can be defined as follows:
- u(t) is the state of the system at the current time (the current iteration), which determines the probabilities of selecting the information sources (D_1,...,D_n) in the current iteration;
- u(t+1) is the state of the system in the next time period (the next iteration), which determines the probabilities of selecting the information sources (D_1,...,D_n) in the next iteration;
- x(t) is the system input data in the current iteration: the results of evaluating a sample of size V from the information source selected in the current iteration;
- y(t) is the system output data in the current iteration: the selected information source (one of D_1,...,D_n).

For greater clarity, in some cases a stochastic Moore-type automaton can be described in the canonical form

$$u(t+1) = F(u(t), x(t+1)), \qquad y(t) = f(u(t)),$$

where t is the variable that determines the time, viz. the automaton response times. This time is defined as an integer t = 1,...,N, where N is the given number of information search iterations, in each of which V documents are chosen from one of the information sources (D_1,...,D_n) selected at that iteration.

The use of the automaton for selecting information sources for consistent consolidation can be implemented so that the automaton state change u(t+1) is defined by a regular (deterministic) rule, while its output y(t) is determined as a stochastic process.
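Before the algorithm is stated, the stochastic output can be illustrated with a short sketch (an illustration under assumptions, not the authors' code): the state u(t) is a probability vector over the n sources, and the output y(t) is a source index drawn from it by the cumulative-interval rule used in steps 2-3 below.

```python
import random

def draw_source(p: list[float]) -> int:
    """Moore-type output y(t) = f(u(t)): draw a source index from the
    probability vector u(t) using a uniform w on [0, 1] and the
    cumulative intervals of p (roulette-wheel selection)."""
    w = random.random()
    cumulative = 0.0
    for z, p_z in enumerate(p):
        cumulative += p_z
        if w <= cumulative:
            return z
    return len(p) - 1  # guard against floating-point rounding at the top end
```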
The information consolidation algorithm based on such an automaton consists of the following steps:

1. The initial state u(t) of the stochastic automaton is set as a vector of probabilities

$$P(t) = \{p_1(t), p_2(t), \ldots, p_i(t), \ldots, p_n(t)\}, \qquad \sum_{i=1}^{n} p_i(t) = 1.$$

2. A uniformly distributed random variable w on the interval [0, 1] is generated.

3. Depending on the value of w, the interval corresponding to one of the n information sources is determined. Accepting p_0 = 0, if

$$\sum_{i=0}^{z-1} p_i(t) < w \le \sum_{i=1}^{z} p_i(t),$$

then this realization of the automaton corresponds to the information source with number z.

4. Sampling and processing of V information units (documents) from information source number z is performed.

5. The relevance assessment R of Eq. (1) is computed for the V information units (documents) sampled from source number z.

6. The probability values are recalculated by a chosen rule; an example of conversion coefficients is given in the Table.

Table. Stochastic automaton probability vector conversion coefficients

R_i:  0.2   0.4   0.6   0.8   1.0
k_R:  0.5   0.75  1     1.5   2

Here $p_z(t+1) = p_z(t)\, k_R$. The vector P(t+1) is then computed so that

$$D(t) = 1 - p_z(t), \qquad D(t+1) = 1 - p_z(t+1), \qquad p_i(t+1) = p_i(t)\,\frac{D(t+1)}{D(t)} \quad \text{for } i = 1, \ldots, n,\ i \ne z.$$

This rescaling keeps the vector normalized: the components with i ≠ z sum to D(t) before the update and to D(t+1) = 1 - p_z(t+1) after it, so the whole vector again sums to 1.

The algorithm then returns to step 2 for further sampling of documents (information units), the source now being chosen using the information about the sources' relevance to the request and its parameters accumulated in the preceding selection steps.
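Assembled, steps 1-6 give the following loop, sketched here under explicit assumptions: a uniform initial vector P(t) (the paper leaves initialization open), a step-function reading of the Table for k_R, and a clamp keeping p_z(t+1) below 1, which the paper does not discuss. The callbacks sample_documents and relevance are hypothetical stand-ins for step 4 and for R of Eq. (1); random.choices is a library shorthand for the cumulative-interval draw sketched earlier.

```python
import random
from typing import Callable

def k_r(r: float) -> float:
    """Conversion coefficient k_R for a relevance estimate R (read from
    the Table as a step function: an assumed interpretation)."""
    for bound, coeff in [(0.2, 0.5), (0.4, 0.75), (0.6, 1.0), (0.8, 1.5)]:
        if r <= bound:
            return coeff
    return 2.0

def consolidate(n: int, iterations: int, v: int,
                sample_documents: Callable, relevance: Callable) -> list[float]:
    """N iterations of stochastic source selection (steps 1-6);
    returns the final probability vector over the n sources."""
    p = [1.0 / n] * n                               # step 1: uniform start (assumption)
    for _ in range(iterations):
        z = random.choices(range(n), weights=p)[0]  # steps 2-3: roulette draw
        docs = sample_documents(z, v)               # step 4: sample V documents
        p_z_new = min(p[z] * k_r(relevance(docs)),  # steps 5-6: rescale p_z ...
                      0.99)                         # ... clamped (added safeguard)
        scale = (1.0 - p_z_new) / (1.0 - p[z])      # renormalize the others
        p = [p_z_new if i == z else p_i * scale for i, p_i in enumerate(p)]
    return p
```

Sources whose samples repeatedly score above R = 0.6 see their selection probability grow, so later iterations spend the document budget mostly on them; this is the behaviour measured in the test described below.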
As a result of executing the described algorithm we obtain a sequence of probability vectors that reflects a consistent process of identifying the information sources most relevant to the request and its parameters.

When another search is performed for the same request, the already accumulated probabilistic model of selecting the most relevant sources is available, which significantly reduces the search time; at the same time it does not rule out changes in the state of the information sources over time, nor the possibility that other sources among those considered become more relevant to the request.

The algorithm was tested in a specialized test software environment developed for analyzing and testing the capabilities of algorithms that select relevant documents for a request. The test environment consists of ten information sources formed in accordance with the request, each containing one thousand generated documents. Each source received from 1 to 10 % of relevant documents, which makes it possible to evaluate how many relevant documents different search algorithms find.

Fig. 2. An example of the results of testing the algorithm built on the stochastic automaton in comparison with the Boolean and vector models (number of relevant documents found versus number of processed documents)

Sample test results are shown in Fig. 2. When processing a large number of documents (more than 500), the described stochastic algorithm found about twice as many relevant documents as direct search algorithms based on the Boolean and vector models.

Conclusions

The use of a stochastic automaton for data consolidation, for subsequent use in information-analytical systems, is presented as an approach to one of the most important problems of processing large amounts of data. An integrated approach is proposed that combines a stochastic model of information source selection with a fuzzy set model for assessing the relevance of individual documents.

The paper describes the information consolidation algorithm that implements the use of a stochastic automaton for selecting relevant documents, and shows the structure of the interaction of the basic elements of a system for consolidating information from open sources.

The described algorithm was tested during the development of a pilot project of a system for geocoding information requests [18, 19].

The developed algorithm makes it possible to build a set of software tools that provides a sufficiently complete and holistic solution of the data consolidation problem for various systems that search for information in sources differing in composition and presentation type.

One promising direction for the consolidation algorithm based on stochastic automata is using it to build systems that search for scientific and technical information by a single request across different scientific bibliographic and abstract databases and other open sources (SCOPUS, RSCI, UKRINTEI, scientific RSS feeds, etc.). This would allow wide circles of scientists, graduate students, and students to considerably reduce the time and improve the quality of searching for the materials their scientific and educational activities require.
References

[1] D.V. Khaustov et al., "Standardization of educational resources based on object-oriented approach", in Proc. V(XXIX) Int. Interuniversity School-Seminar "Methods and Diagnostic Tools in Technology and Society (MiZD TS-201)", Ivano-Frankivsk, 2015, pp. 81-85 (in Ukrainian).
[2] G.V. Pevtsov et al., "Analysis of information consolidation methods and their application features", Visnyk NTU "KhPI". Special Edition: Information and Modelling, no. 39, pp. 45-153, 2007 (in Ukrainian).
[3] N.B. Shakhovska, "Processing methods of consolidated data using data space", Problemy Prohramuvannya, no. 4, pp. 72-84, 2011 (in Ukrainian).
[4] L. Cherniak. (2011). Big Data: A New Theory and Practice [Online]. Available: https://www.osp.ru/os/2011/10/13010990 (in Russian).
[5] G. Salton et al., "Extended Boolean information retrieval", Communications of the ACM, vol. 26, no. 11, pp. 1022-1036, 1983. doi: 10.1145/182.358466
[6] I.M. Yaglom, Boolean Structure and Its Models. Moscow, SU: Sovetskoe Radio, 1980 (in Russian).
[7] V.I. Ukhobotov, Selected Chapters of the Theory of Fuzzy Sets. Chelyabinsk, Russia: Publishing House of Chelyabinsk State University, 2011 (in Russian).
[8] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Menlo Park, CA: ACM Press / Addison-Wesley, 1999.
[9] G. Salton et al., "A vector space model for automatic indexing", Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975. doi: 10.1145/361219.361220
[10] A.G. Dubinskiy, "Some questions of application of the vector model of document representation in information search", Upravljajushhie Sistemy i Mashiny, no. 4, pp. 77-83, 2001 (in Russian).
[11] T. Landauer et al., "An introduction to latent semantic analysis", Discourse Processes, vol. 25, no. 2-3, pp. 259-284, 1998. doi: 10.1080/01638539809545028
[12] D.V. Bondarchuk, "The use of latent-semantic analysis in the case of text classification by emotional coloring", Bjulleten' Rezul'tatov Nauchnyh Issledovanij, vol. 2, no. 3, pp. 146-151, 2012 (in Russian).
[13] S.E. Robertson, "The probabilistic ranking principle in IR", Journal of Documentation, vol. 33, no. 4, pp. 294-304, 1977. doi: 10.1108/eb026647
[14] M.E. Maron and J.L. Kuhns, "On relevance, probabilistic indexing, and information retrieval", Journal of the ACM, vol. 7, no. 3, pp. 216-244, 1960. doi: 10.1145/321033.321035
[15] D.V. Lande et al., Internetics: Navigation in Complex Networks: Models and Algorithms. Moscow, Russia: LIBROCOM, 2009 (in Russian).
[16] C.D. Manning et al., Introduction to Information Retrieval (Russian translation). Moscow, Russia: I.D. Williams, 2011.
[17] L.A. Rastrigin and K.K. Ripa, Automaton Theory of Random Search. Riga, Latvia: Zinatne, 1973 (in Russian).
[18] V.O. Kuzminykh and O.S. Boichenko, "The information request automatic geocoding system", in Proc. V(XXIX) Int. Interuniversity School-Seminar "Methods and Diagnostic Tools in Technology and Society (MiZD TS-201)", Ivano-Frankivsk, 2015, pp. 12-17 (in Ukrainian).
[19] V.O. Kuzminykh and O.S. Boichenko, "The user request automatic geocoding system", in Environmental Security of Territorial-Production Complexes: Energy, Environment, Information Technology. Kyiv, Ukraine: MP Lesia, NTUU "KPI", 2015, pp. 217-222 (in Ukrainian).
Recommended by the Council of the Educational-Scientific Complex "Institute for Applied System Analysis", Igor Sikorsky KPI. Received by the editorial board on 28 February 2017.