document vector
Recently Published Documents


TOTAL DOCUMENTS

37
(FIVE YEARS 11)

H-INDEX

4
(FIVE YEARS 0)

Author(s):  
Rifiana Arief ◽  
Achmad Benny Mutiara ◽  
Tubagus Maulana Kusuma ◽  
Hustinawaty Hustinawaty

This research proposes automated hierarchical classification of scanned documents whose content combines unstructured text with special patterns (specific, short strings), using a convolutional neural network (CNN) and a regular expression method (REM). The research data are digital correspondence documents, in PDF image format, from the Pusat Data Teknologi dan Informasi (Technology and Information Data Center). The document hierarchy covers the type of letter, type of manuscript letter, origin of letter, and subject of letter. The research method consists of preprocessing, classification, and storage to a database. Preprocessing covers text extraction using Tesseract optical character recognition (OCR) and formation of word document vectors with Word2Vec. Hierarchical classification uses the CNN to classify 5 types of letters and regular expressions to classify 4 types of manuscript letter, 15 origins of letter, and 25 subjects of letter. The classified documents are stored in a Hive database in a Hadoop big data architecture. The data set comprises 5200 documents: 4000 for training, 1000 for testing, and 200 for classification prediction. In the trial on the 200 new documents, 188 were classified correctly and 12 incorrectly, giving an accuracy of 94% for automated hierarchical classification. As a next step, content-based search of the classified scanned documents can be developed.
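
The regular-expression stage and the reported accuracy can be illustrated with a minimal sketch. The abstract does not list the actual expressions or labels, so the patterns, label names, and the `classify_by_pattern` helper below are hypothetical placeholders, not the authors' implementation.

```python
import re

# Hypothetical patterns: the abstract does not give the real expressions
# or the real label names, so these are illustrative assumptions only.
MANUSCRIPT_PATTERNS = {
    "decree": re.compile(r"\bSK[-/ ]?\d+", re.IGNORECASE),
    "memo": re.compile(r"\bmemo(randum)?\b", re.IGNORECASE),
    "invitation": re.compile(r"\binvitation\b", re.IGNORECASE),
}


def classify_by_pattern(text, patterns, default="unknown"):
    """Return the first label whose pattern matches the OCR text."""
    for label, pattern in patterns.items():
        if pattern.search(text):
            return label
    return default


def accuracy(correct, total):
    """Fraction of correctly classified documents."""
    return correct / total


label = classify_by_pattern("Ref: SK-1023 regarding budget", MANUSCRIPT_PATTERNS)
trial_accuracy = accuracy(188, 200)  # counts reported in the abstract
```

With the counts from the abstract, 188 correct out of 200 gives the stated 94%.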


2021 ◽  
Author(s):  
Yue Zhao ◽  
Ajay Anand ◽  
Gaurav Sharma

We develop and evaluate an automated data-driven framework for providing reviewer recommendations for submitted manuscripts. Given inputs comprising a set of manuscripts for review and a list of prospective reviewers, our system uses a publisher database to extract papers authored by the reviewers, from which a Paragraph Vector (doc2vec) neural network model is learned and used to obtain vector space embeddings of documents. Similarities between the embeddings of an individual reviewer’s papers and a manuscript are then used to compute manuscript-reviewer match scores and to generate a ranked list of recommended reviewers for each manuscript. Our mainline proposed system uses full-text versions of the reviewers’ papers, which we demonstrate performs significantly better than models based on abstracts alone, the predominant paradigm in prior work. Direct retrieval of reviewers’ manuscripts from a publisher database reduces reviewer burden, ensures up-to-date data, and eliminates the potential for misuse through data manipulation. We also propose an evaluation methodology that addresses hyperparameter selection and enables indirect comparisons with alternative approaches and on prior datasets. Finally, the work contributes a large-scale retrospective reviewer matching dataset and evaluation that we hope will be useful for further research in this field. For the mainline approach, expert judges rated 38% of the recommendations as Very Relevant, 33% as Relevant, 24% as Slightly Relevant, and only 5% as Irrelevant.
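
The scoring step can be sketched as follows, assuming the document embeddings are already available as plain lists of floats. The abstract does not say how per-paper similarities are combined into one match score; taking the maximum cosine similarity is an assumption made here for illustration, and the reviewer names and vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_reviewers(manuscript_vec, reviewer_papers):
    """Rank reviewers by their best-matching paper (assumed aggregation: max)."""
    scores = {
        name: max((cosine(manuscript_vec, p) for p in papers), default=0.0)
        for name, papers in reviewer_papers.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for doc2vec output.
manuscript = [1.0, 0.0]
reviewers = {
    "reviewer_a": [[0.9, 0.1], [0.0, 1.0]],
    "reviewer_b": [[0.2, 0.8]],
}
ranking = rank_reviewers(manuscript, reviewers)
```

Other aggregations, such as the mean over a reviewer's papers, would fit the same interface.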


2021 ◽  
pp. 016327872110142
Author(s):  
Shotaro Komaki ◽  
Fuminori Muranaga ◽  
Yumiko Uto ◽  
Takashi Iwaanakuchi ◽  
Ichiro Kumamoto

Nursing records are an account of patient condition and treatment during a hospital stay. In this study, we developed a system that can automatically analyze nursing records to predict the occurrence of diseases and incidents (e.g., falls). To develop an onset prediction system, nursing records were vectorized and compared with past case data on aspiration pneumonia. Nursing records for a patient group that developed aspiration pneumonia during hospitalization and for a non-onset control group were randomly assigned to definitive diagnostic (for learning), preliminary survey, and test datasets. Data from the preliminary survey were used to adjust parameters and influencing factors. The final verification on the test data showed that the onset of aspiration pneumonia was predicted best (sensitivity = 90.9%, specificity = 60.3%) with the parameter values size = 80 (number of dimensions of the sentence vector), window = 13 (number of words before and after the learned word), and min_count = 2 (minimum word count for a word to be included). This method represents the foundation for a discovery/warning system that uses machine-based automated monitoring to predict the onset of diseases and prevent adverse incidents such as falls.
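
For reference, sensitivity and specificity follow from the confusion counts as sketched below. The abstract reports only the percentages, so the counts used here are purely illustrative and are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual onset cases that were predicted as onset."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of non-onset cases that were predicted as non-onset."""
    return true_neg / (true_neg + false_pos)


# Illustrative counts only (not taken from the study).
example_sensitivity = sensitivity(true_pos=8, false_neg=2)
example_specificity = specificity(true_neg=6, false_pos=4)
```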


Authorship verification is the task of determining whether two text documents were written by the same author by evaluating the veracity and authenticity of the writings. It is used in applications such as the analysis of anonymous emails for forensic investigations, verification of historical literature, continuous authentication in cyber-security, and detection of changes in writing style. The authorship verification problem depends primarily on the similarity among documents. In this work, a new approach is proposed based on the similarity between the known documents of an author and an anonymous document. The most frequent terms are extracted from the dataset and used to represent the training and test documents as document vectors, with a term weight measure giving the value of each term in the vector. The cosine similarity measure is used to determine the similarity between the training and test documents. Based on a threshold on the similarity score, the test document is verified as written by the suspected author or not. The PAN 2014 competition Authorship Verification dataset is used in this experiment. The proposed approach achieved the best results for authorship verification when compared with various solutions proposed in this domain.
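
The pipeline described above (frequent terms, term weights, cosine similarity, threshold) can be sketched roughly as follows. The abstract does not name the term weight measure, the number of terms, or the threshold, so relative term frequency, `k=50`, and `threshold=0.5` are assumptions chosen only to make the example concrete.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def top_terms(documents, k):
    """Return the k most frequent terms across all documents."""
    counts = Counter(term for doc in documents for term in tokenize(doc))
    return [term for term, _ in counts.most_common(k)]


def vectorize(text, vocabulary):
    """Weight each vocabulary term by its relative frequency in the text."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[term] / total for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def same_author(known_docs, unknown_doc, k=50, threshold=0.5):
    """Return (verdict, score); k and threshold are illustrative defaults."""
    vocabulary = top_terms(known_docs + [unknown_doc], k)
    known_vec = vectorize(" ".join(known_docs), vocabulary)
    unknown_vec = vectorize(unknown_doc, vocabulary)
    score = cosine(known_vec, unknown_vec)
    return score >= threshold, score
```

In practice the threshold would be tuned on held-out problems from the dataset rather than fixed in advance.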


2020 ◽  
Vol 1631 ◽  
pp. 012111
Author(s):  
Taotao Fu ◽  
Dezhen Kong ◽  
Yilin Jin ◽  
Canghong Jin ◽  
Min Zheng ◽  
...  

A tool that can search a large code corpus directly and list ranked snippets can be an invaluable resource for programmers looking for similar code snippets using natural language queries. It must have a deep understanding of the semantics of source code and queries to evaluate their intent correctly. Over the years, many tools that rely on the textual similarity between source code and query have proven ineffective because they fail to learn a high-level semantic understanding of source code and query. While previous code search models based on deep neural networks perform well, most of them are evaluated on only a single programming language, mostly Java. In this paper, we propose a novel deep neural network model called Unified Code Net that can handle the intricacies of different programming languages. This model borrows several vital features from previous models and builds on those ideas to make a unified model that generates document vector embeddings from source code and, using similarity search with the query vector embedding, returns the most similar code snippets in any language. This tool can drastically reduce the programmer’s effort in looking for an efficient and viable code snippet for the problem at hand, and ideally can replace the use of search engines for the same purpose.


Recent advances in technology are generating huge amounts of data, and the extraction of information from them is being outpaced by data accumulation. Hybrid approaches that combine different algorithms to extract the required information from this stockpile of data are the need of the hour. One such algorithm is the vector space model for inverted indexing, which has traditionally been used for search engine indexing. In bioinformatics it has also been used for the assembly of DNA fragments generated by sequencing, but it has not been applied to the retrieval of protein sequences relevant to a query based on the presence or absence of motifs and domains. In this paper, the concept of inverted indexing is applied to small motif/domain data of proteins contained in the Motivated Proteins database at http://motif.gla.ac.uk/motif/index.html. The index was built using 17 small hydrogen-bonded motifs present in a dataset of 430 proteins, which was divided into 19 classes. Seven classes, for example cyanovirin, antibiotic, and concanavalin, had very few instances (1 or 2) and were omitted from further study. The remaining 12 classes, each with more than 10 proteins, were used to test the information retrieval (IR) strategy. The document vectors of all proteins belonging to one class were averaged, and 12 queries with averaged vectors were prepared for testing. The similarity coefficient (SC) was then computed between each query and all the proteins of the dataset. This approach successfully classified each query as belonging to the class from which it was derived. To further validate the importance of the document vector as a novel attribute for classification, the entire dataset of document vectors was clustered into ten (10) clusters. Testing was then performed using the similarity coefficient (SC) of the query with the clusters obtained.
The allocation of clusters to the 12 query sequences followed the same pattern as the relevant-document search using the inverted indexing approach, but clustering allocated the queries to only four (4) classes. The largest number of query proteins (7 proteins, or 58%) was found to belong to cluster 5.
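
The indexing and query steps can be sketched on toy data. The motif names, protein identifiers, and class assignments below are hypothetical, the vectors are simple presence/absence indicators, and the similarity coefficient is assumed to be the inner product used in the classic vector space model; the abstract does not specify any of these details.

```python
from collections import defaultdict


def build_inverted_index(protein_motifs):
    """Map each motif to the set of proteins that contain it."""
    index = defaultdict(set)
    for protein, motifs in protein_motifs.items():
        for motif in motifs:
            index[motif].add(protein)
    return index


def to_vector(motifs, vocabulary):
    """Presence/absence document vector over a fixed motif vocabulary."""
    return [1.0 if motif in motifs else 0.0 for motif in vocabulary]


def average_vectors(vectors):
    """Component-wise mean, used here as the class query vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def similarity_coefficient(query, document):
    """Assumed SC: inner product of query and document vectors."""
    return sum(q * d for q, d in zip(query, document))


# Hypothetical toy data (not from the Motivated Proteins database).
protein_motifs = {
    "p1": {"motif_a", "motif_b"},
    "p2": {"motif_a"},
    "p3": {"motif_c"},
}
classes = {"class_x": ["p1", "p2"], "class_y": ["p3"]}

index = build_inverted_index(protein_motifs)
vocabulary = sorted(index)
vectors = {p: to_vector(m, vocabulary) for p, m in protein_motifs.items()}

# Build the averaged query for one class and rank all proteins against it.
query = average_vectors([vectors[p] for p in classes["class_x"]])
ranked = sorted(
    ((p, similarity_coefficient(query, v)) for p, v in vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

On this toy data the proteins of `class_x` rank above the protein of `class_y`, which mirrors the behaviour the abstract reports for queries built from their own class.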

