Digitizing Notes Using Optical Character Recognition and Automatic Topic Identification and Classification Using Natural Language Processing
https://doi.org/10.22214/ijraset.2023.52950
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 11 Issue V May 2023- Available at www.ijraset.com
Abstract: Digital documents are now a major part of everyday life because of their wide range of uses. However,
handwritten notes still contain a great deal of valuable information. In this paper, we explore the different methods of
Optical Character Recognition (OCR) that can be used for digitizing handwritten notes. We also examine Topic Detection
and Identification and the methods used to implement it, which are useful for extracting the crux of any document or
piece of information. With the aim of integrating both processes into a single system, we study neural-network
approaches such as ANN, RNN, and CNN, and methods such as Tesseract, KNN, and LSTM that are used for implementing
OCR, while techniques such as K-means clustering, TF-IDF, LDA, and LINGO have been employed to perform topic
detection and identification. Based on our study and the results reported in various papers, we have decided to use CNN for OCR.
Keywords: Optical Character Recognition, Natural Language Processing, Convolutional Neural Network, Support Vector
Machine, Deep Learning.
I. INTRODUCTION
Handwritten notes and documents are ubiquitous and have considerable practical worth. Although digital documents are widely
used and have been rapidly adopted across major applications and domains, especially since the Covid-19 pandemic, a large
amount of information still exists only as handwritten documents. Extracting this information from physical documents and
identifying its important parts is therefore a crucial task. It can be performed using Machine Learning and related
subdomains such as Image Processing and Natural Language Processing. For the conversion of handwritten documents and the
extraction of their content, techniques such as Optical Character Recognition (OCR), Topic Detection, and Topic
Identification are now widely used.
Optical Character Recognition (OCR) uses a technology or model to convert images of typed, handwritten, or printed text,
such as a scanned document or a photograph of a document, into a digital format. It is mainly used to convert physical
documents (hard copies) into soft copies that can be stored in a database or repository and edited using word processors.
The converted documents can later be used for applications such as feature recognition, pattern recognition, and feature
extraction. Many approaches and techniques have been explored for OCR and Topic Detection. Deep Learning models involving
neural networks such as ANN[20], RNN[16], and CNN[5], and Machine Learning algorithms such as KNN[21] and SVM[22], have
been employed for implementing OCR. Although OCR is an important part of converting and extracting data from handwritten
documents, topic detection and identification form an equally crucial part of the overall process. Natural Language
Processing (NLP) is the subdomain of Artificial Intelligence that combines computational power and linguistics to create
systems that understand, extract, and analyze the meaning of text and human speech. It analyzes the semantic, syntactic,
and pragmatic features of text through processes such as topic detection and identification. Various other techniques and
frameworks have also been proposed for creating systems for specific applications based on either of these processes. The
goal of this paper is to survey the wide range of methods used for these processes and to present the state-of-the-art
techniques and results achieved so far by reviewing the related work and comparing them. The findings will be used in
further research and in the implementation of a system that integrates the two procedures and gives users a single
solution for digitizing handwritten notes or documents and classifying them according to their respective topics.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 5433
S. R. Vispute et al. in 2013[12] proposed a system for retrieving documents and providing personalized documents to end users
with the help of their browsing history.
The paper categorizes Marathi documents using Lingo clustering, a clustering algorithm based on the Vector Space Model (VSM).
The system achieved an accuracy of 91.10% on a dataset of 107 Marathi documents from three different categories.
Shunji Mori et al. in 1992[13] gave a brief account of the research and development of OCR throughout history. The paper is
divided into two parts, historical development and R&D, which are further divided into structural analysis and template matching.
The paper also presents the authors' views on neural networks, expert systems, and the future scope of OCR.
Joris D’hondt et al. in 2011[14] proposed a technique that divides a textual document into components using a coherence
function, which is based on lexical chains and produces a coherence graph of the document as output. The proposed
methodology gave the best results in randomized test scenarios and outperformed other identification techniques.
Pema Gurung et al. in 2017[15] applied cluster analysis to document collections of various sizes. The paper gives a brief
study of the K-means clustering algorithm for topic identification and compares the results of cluster analysis for small
and large documents.
Bhagyashree P V et al. in 2019[19] used an advanced deep learning technique, DAG-CNN (Directed Acyclic Graph Convolutional
Neural Network), for handwritten character recognition. The method overcomes some of the disadvantages of CNN, such as
misclassifying visually similar cursive words. Because the network is structured as a directed acyclic graph, its layers can
have multiple inputs and outputs, and every layer is connected to the final layer through skip connections. This allows
various types of features to contribute to the overall performance.
III. TECHNIQUES
A. Neural Networks
A neural network is a technique in the field of Artificial Intelligence in which machines are trained to process data in a way
inspired by the structure of biological neurons in the human brain. It falls under the specialized domain of Deep Learning and
uses interconnected nodes arranged in layers to process the given input. It generally contains an input layer, an output layer,
and multiple hidden layers. Types of neural networks used primarily for OCR include Artificial Neural Networks (ANN)[20],
Convolutional Neural Networks (CNN)[18], Recurrent Neural Networks (RNN)[16], and similar subtypes. Fig. 1 [29] shows the
structure of a neural network with an input layer, an output layer, and multiple hidden layers.
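As a rough illustration of the layered structure described above, the following Python sketch passes an input vector through two hidden layers and an output layer. The layer sizes, the sigmoid activation, and the random weights are illustrative assumptions and are not taken from any of the surveyed papers; a real OCR network would learn its weights from labelled character images.

```python
import math
import random


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: every output node takes a weighted
    sum of all inputs, adds a bias, and applies the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the input through each layer in turn (input -> hidden -> output)."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


def random_layer(n_in, n_out, rng):
    """Hypothetical helper: untrained weights, used only for illustration."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
# Assumed sizes: 4 input features, two hidden layers of 5 nodes, 3 output nodes.
layers = [random_layer(4, 5, rng), random_layer(5, 5, rng), random_layer(5, 3, rng)]
output = forward([0.2, 0.7, 0.1, 0.9], layers)
print(output)
```

The output is one activation per output node; in a character classifier each output node would typically correspond to one candidate character.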
D. K-means Clustering
K-means is a popular unsupervised machine learning technique used for finding hidden similarities and patterns in unlabelled
data. It creates clusters of similar data points around computed centroids, assigning each point to the centroid nearest to it
in Euclidean distance.
It iterates over the data points, alternately reassigning points and recomputing centroids, until the clusters group together
points with similar characteristics. It is extensively used for topic detection in various applications[15]. Fig. 4 shows
clustering done by the K-means clustering algorithm.
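The assign-and-update loop of K-means can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the points are small two-dimensional toy data, and the initial centroids are simply picked at evenly spaced positions in the input list so that the run is deterministic. Practical implementations usually initialize centroids randomly or with a scheme such as k-means++.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Cluster points into k groups by Euclidean distance to centroids."""
    # Assumption for reproducibility: evenly spaced points as initial centroids.
    step = len(points) // k
    centroids = [tuple(points[i * step]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

For topic detection, the points would be document vectors (for example TF-IDF vectors) rather than 2-D coordinates, and each resulting cluster would be read as one candidate topic.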
E. TF-IDF
Term Frequency (TF) measures how often a particular word occurs in a document. Because the length of the document affects the
raw count, the term frequency has to be normalized. Each document is vectorized over the vocabulary, giving a vector with one
entry for every word in the corpus.
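TF(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}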
Inverse Document Frequency (IDF) measures the importance of a word across the corpus, which TF does not consider. It weights
each word by how rare it is, so that words appearing in many documents receive a low weight. While TF is document-specific,
the IDF of a word is constant for a given corpus.
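IDF(t) = \log \frac{N}{n_t}
where N is the total number of documents in the corpus and n_t is the number of documents that contain the term t.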
The two measures are combined as the product TF × IDF, which is very popular and extensively used for topic detection in
Natural Language Processing (NLP).
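To make the two measures concrete, the sketch below computes TF-IDF scores for a tiny corpus in plain Python. It assumes the common definitions TF = (occurrences of the term in the document) / (total terms in the document) and IDF = log(N / number of documents containing the term); other variants (smoothed or differently scaled) exist, and the three toy sentences are invented for illustration only.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency normalized by document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus (assumed data): each document is a list of lowercase tokens.
corpus = [
    "the cell membrane controls transport".split(),
    "the derivative measures the rate of change".split(),
    "the cell nucleus stores dna".split(),
]

# Score every distinct word of the second document.
scores = {term: tf_idf(term, corpus[1], corpus) for term in set(corpus[1])}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:12s} {score:.4f}")
```

The word "the" occurs in every document, so its IDF is log(3/3) = 0 and its score vanishes, while words unique to the second document keep a positive weight. This is the property that makes TF-IDF useful for picking out topic-bearing terms.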
IV. METHODOLOGY
OCR is generally done in two major steps:
1) Text detection: Detecting the position of the text, i.e. the words and letters on a page, and drawing bounding boxes around
them. The document could be densely or sparsely populated with text. After detection, the next step is to identify the
words.
2) Recognition: There are three approaches that can be taken here:
Image Acquisition: The system first scans handwritten documents or notes using a mobile phone’s camera, which is assumed to
be of decent quality.
Text Detection: A significant step in the pipeline, used to determine whether text is present in the given image and, if so,
to record its coordinates. This is done using text localization and verification. Usually, bounding boxes are drawn around
the regions where text is identified.
Transformation: An optional step used to extract and clean the detected text so as to provide the deep learning model
with quality inputs. It handles distortions in the text, for example by removing tilt and aligning the text horizontally.
Text Recognition: This is where a neural network is applied to actually recognize the text, and convert it into digital form.
Final Text: The final text obtained is usually not 100% accurate, and hence NLP can be used to fix mistakes and
misidentifications. For example, the word “dictionary” may be recognized as “dictlonary” or “dlctlonany”, where the letter “i”
is confused with the letter “l” (and, in the second case, “r” with “n”) due to variations in handwriting. Such words can be
rectified using NLP models.
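The correction step at the end of the pipeline above can be illustrated with a very simple stand-in for an NLP model: matching each recognized word against a known vocabulary by string similarity. The sketch below uses `difflib.get_close_matches` from the Python standard library; the small `VOCABULARY` list and the `correct_word` helper are hypothetical, and a production system would more likely use a full dictionary or a language model that also considers context.

```python
import difflib

# Hypothetical vocabulary; a real system would load a full word list.
VOCABULARY = ["dictionary", "document", "digital", "recognition", "handwriting"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the most similar vocabulary word, or the input unchanged
    when nothing is similar enough (similarity below the cutoff)."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


recognized = ["dictlonary", "dlctlonany", "xyz"]
corrected = [correct_word(word) for word in recognized]
print(corrected)
```

Both misrecognized variants from the example map back to "dictionary", while a string with no close vocabulary entry is left untouched rather than being forced to a wrong word.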
After the conversion, topic detection and identification are done by employing Natural Language Processing (NLP) techniques
such as Latent Dirichlet Allocation (LDA), Term Frequency - Inverse Document Frequency (TF-IDF), or similar methods.
Topic identification can be done using two techniques: topic modeling and topic classification. Topic modeling uses unsupervised
machine learning to group together documents with similar words, taking the relations between those words into account.
Topic classification, on the other hand, uses supervised machine learning to identify which topic a document belongs to on the
basis of the training documents previously provided to it. Topic classification systems are of three types: rule-based systems,
machine learning systems, and hybrid systems.
Hence, using topic classification, the notes are classified under their respective topics. Through the app or platform, the user
will be able to access and store topic-wise classified collections of notes for various subjects and their relevant topics.
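Of the three types of topic classification system mentioned above, the rule-based one is the simplest to sketch. The Python example below assigns a note to the topic whose keyword set overlaps most with the note's words. The `RULES` table, the topic names, and the `classify` helper are invented for illustration; a machine learning system would instead learn such associations from labelled training notes, and a hybrid system would combine both.

```python
import re

# Hypothetical hand-written rules: topic name -> keywords that signal it.
RULES = {
    "biology": {"cell", "dna", "enzyme", "membrane", "nucleus"},
    "calculus": {"derivative", "integral", "limit", "function"},
}


def classify(text, rules=RULES, default="unclassified"):
    """Pick the topic whose keyword set shares the most words with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {topic: len(tokens & keywords) for topic, keywords in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


notes = [
    "The derivative is the limit of a difference quotient.",
    "DNA is stored in the nucleus of the cell.",
    "Buy milk and bread.",
]
topics = [classify(note) for note in notes]
print(topics)
```

Notes that match no rule fall back to a default label instead of being forced into a topic, which keeps the topic-wise collections clean at the cost of leaving some notes for manual sorting.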
V. RESULTS
Based on the survey conducted, the performance of the different algorithms is compared in the table below:
Table 1. Analysis of Results
Sr. No. Topic References Algorithms/Technique Accuracy
REFERENCES
[1] Chin-Yew Lin. 1995. Knowledge-based Automatic Topic Identification. In the 33rd Annual Meeting of the Association for Computational Linguistics, pages
308–310, Cambridge, Massachusetts, USA. Association for Computational Linguistics.
[2] Gader, Paul & Mohamed, Magdi & Chiang, Jung Hsien. (1997). Handwritten word recognition with character and inter-character neural networks. Systems,
Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on. 27. 158 - 164. 10.1109/3477.552199.
[3] Arora, Shefali & Bhatia, M.. (2018). Handwriting recognition using Deep Learning in Keras. 142-145. 10.1109/ICACCCN.2018.8748540.
[4] Hamad, Karez & Kaya, Mehmet. (2016). A Detailed Analysis of Optical Character Recognition Technology. International Journal of Applied Mathematics,
Electronics and Computers. 4. 244-244.
[5] Sara Aqab and Muhammad Usman Tariq, “Handwriting Recognition using Artificial Intelligence Neural Network and Image Processing” International Journal
of Advanced Computer Science and Applications(IJACSA), 11(7), 2020.
[6] Hazen, T.J. (2011). Topic Identification. In Spoken Language Understanding (eds G. Tur and R. De Mori). https://doi.org/10.1002/9781119992691.ch12
[7] J. Memon, M. Sami, R. A. Khan and M. Uddin, "Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR),"
in IEEE Access, vol. 8, pp. 142642-142668, 2020, doi: 10.1109/ACCESS.2020.3012542.
[8] Burcu Caglor Gencosman, Huseyin C. Ozmutlu, Seda Özmutlu. Character n-gram application for automatic new topic identification. ELSEVIER 26 June 2014.
[9] Srivastav, A., Singh, S. Proposed Model for Context Topic Identification of English and Hindi News Article Through LDA Approach with NLP Technique. J.
Inst. Eng. India Ser. B 103, 591–597 (2022).
[10] Kim, SW., Gil, JM. Research paper classification systems based on TF-IDF and LDA schemes. Hum. Cent. Comput. Inf. Sci. 9, 30 (2019).
https://doi.org/10.1186/s13673-019-0192-7
[11] G. Xu, Y. Meng, Z. Chen, X. Qiu, C. Wang and H. Yao, "Research on Topic Detection and Tracking for Online News Texts," in IEEE Access, vol. 7, pp.
58407-58418, 2019, doi: 10.1109/ACCESS.2019.2914097.
[12] S. R. Vispute and M. A. Potey, "Automatic text categorization of marathi documents using clustering technique," 2013 15th International Conference on
Advanced Computing Technologies (ICACT), 2013, pp. 1-5, doi: 10.1109/ICACT.2013.6710543.
[13] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and development," in Proceedings of the IEEE, vol. 80, no. 7, pp. 1029-1058, July
1992, doi: 10.1109/5.156468.
[14] Joris D’hondt, Paul-Armand Verhaegen, Joris Vertommen, Dirk Cattrysse, Joost R. Duflou, Topic identification based on document coherence and spectral
analysis, Information Sciences, Volume 181, Issue 18, 2011,https://doi.org/10.1016/j.ins.2011.04.044.
[15] Gurung, Pema & Wagh, Rupali. (2017). A study on Topic Identification using K means clustering algorithm: Big vs. Small Documents. Advances in
Computational Sciences and Technology, ISSN 0973-6107, Volume 10, Number 2.
[16] R. Parthiban, R. Ezhilarasi and D. Saravanan, "Optical Character Recognition for English Handwritten Text Using Recurrent Neural Network," 2020
International Conference on System, Computation, Automation and Networking (ICSCAN), 2020, pp. 1-5, doi: 10.1109/ICSCAN49426.2020.9262379.
[17] Patel, Chirag & Patel, Atul & Patel, Dharmendra. (2012). Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study. International
Journal of Computer Applications. 55. 50-56. 10.5120/8794-2784.
[18] Mittal, Usha & Srivastava, Sonal & Chawla, Priyanka. (2019). Object Detection and Classification from Thermal Images Using Region based Convolutional Neural
Network. Journal of Computer Science. 10.3844/jcssp.2019.961.971.
[19] Bhagyashree P V, Ajay James, Chandran Sarvanan, A Proposed Framework for Recognition of Handwritten Cursive English Characters using DAG-CNN.
IEEE Doi: 10.1109/ICIICT1.2019.8741412
[20] Tiwari, Usha & Jain, Monika & Mehfuz, Shabana. (2019). Handwritten Character Recognition—An Analysis. 10.1007/978-981-13-0665-5_18.
[21] Tawde, Gaurav Y. and Jayashree M. Kundargi. “An Overview of Feature Extraction Techniques in OCR for Indian Scripts
Focused on Offline Handwriting.” (2013).
[22] Nasien, Dewi & Haron, Habibollah & Yuhaniz, Siti. (2010). Support Vector Machine (SVM) for English Handwritten Character Recognition. Computer
Engineering and Applications, International Conference on. 1. 249-252. 10.1109/ICCEA.2010.56.
[23] Magdi Mohamed and Paul Gader. Handwritten Word Recognition Using Segmentation-Free Hidden Markov Modelling and Segmentation-Based Dynamic
Programming Techniques (1996) in IEEE. Doi: 10.1109/34.494644
[24] Aisha Sharaf, Bhagya Viswanath, Kavya Chandran, Nishana Salim, Anju S Oommen. Handwritten Text Recognition and Digitization System. IJIRSET 2019
[25] Manoj Sonkusare and Narendra Sahu. A Survey On Handwritten Character Recognition(HCR) Techniques For English Alphabets. Advances in Vision
Computing: An International Journal (AVC) Vol.3, No.1, March 2016. DOI:10.5121/avc.2016.3101
[26] K.Karthick, K.B.Ravindrakumar, R.Francis, S.Ilankannan. Steps Involved in Text Recognition and Recent Research in OCR; A Study. International Journal of
Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-8, Issue-1, May 2019
[27] S. Jain and J. Pareek, "Automatic Topic(s) Identification from Learning Material: An Ontological Approach," 2010 Second International Conference on
Computer Engineering and Applications, 2010, pp. 358-362, doi: 10.1109/ICCEA.2010.221.
[28] Xing, W., & Du, D. (2019). Dropout Prediction in MOOCs: Using Deep Learning for Personalized Intervention. Journal of Educational Computing Research.
[29] Baek J, Choi Y. Deep Neural Network for Predicting Ore Production by Truck-Haulage Systems in Open-Pit Mines. Applied Sciences. 2020
[30] Buenaño-Fernández, Diego & Gonzalez, Mario & Gil, David & Luján-Mora, Sergio. (2020). Text Mining of Open-Ended Questions in Self-Assessment of
University Teachers: An LDA Topic Modeling Approach. IEEE Access. PP. 1-1. 10.1109/ACCESS.2020.2974983.