
INTERNATIONAL SCIENTIFIC JOURNAL OF ENGINEERING AND MANAGEMENT (ISJEM)   ISSN: 2583-6129
Volume: 02, Issue: 06 | June 2023 | DOI: 10.55041/ISJEM00630 | www.isjem.com


Smart Image to Text and Text to Speech Recognition Using Machine Learning

Nikita Pakhale
Electronics & Telecommunication
JSPM's Rajarshi Shahu College of Engineering, Pune, India
niku.pakhale@gmail.com

Harshada Kamble
Electronics & Telecommunication
JSPM's Rajarshi Shahu College of Engineering, Pune, India
harshadarupeshkamble@gmail.com

Mrs. Lata More
Electronics & Telecommunication
JSPM's Rajarshi Shahu College of Engineering, Pune, India
ismore_51105@gmail.com

Dr. S. C. Wagaj
Head of Department, Electronics & Telecommunication
JSPM's Rajarshi Shahu College of Engineering, Pune, India

Abstract—The optical character recognition (OCR) and text-to-speech (TTS) concepts are combined in this project. By successfully establishing a voice interface connection with computers, this type of framework helps persons who are visually handicapped. Image-to-text and text-to-speech conversion is a technique that uses the OCR method to read and scan 20+ different languages and numbers in an image and convert them to voices. The voice processing module and the picture processing module are both implemented in this project. Numerous methods have been used in the past, such as the Edge-Based Method, Connected Component Method, Texture-Based Method, and Mathematical Morphology Method; however, they have significant limitations when measured by precision, F-score, and recall. Picture texts can be found in magazines, photographs, newspapers, banners, and other media. The development of intelligent systems to enhance quality of life is the focus of current technological developments in the fields of natural language processing and image processing. An efficient method for text detection, extraction from images, and text-to-speech conversion is proposed in this paper. The incoming image is first improved by using greyscale conversion. Then, using the maximally stable extremal regions (MSER) feature detector, the text portions of the improved image are located. The following step is to use geometric filtering along with a stroke width transform to effectively gather and filter text sections in a picture. The geometric properties and stroke width transform are used to remove the non-text MSER regions. Individual letters and alphabets are then grouped to find text sequences, which are subsequently broken up into words. In order to digitize the words, optical character recognition (OCR) is used. The text is converted to speech in the final phase by feeding it through our text-to-speech synthesizer (TTS). The suggested method is tested on images ranging from documents to natural scenes. The correctness and robustness of the suggested framework have been demonstrated by promising findings, which support its practical use in real-world applications.

Keywords— Image Processing, Text Recognition and Extraction, Maximally Stable Extremal Regions, OCR (Optical Character Recognition), SWT (Stroke Width Transform), TTS (Text-to-Speech Synthesizer)

I. INTRODUCTION

Modern performance on tasks requiring knowledge of documents and natural language has been shown for sequence modelling. Due to its practical potential for automatically transforming unstructured text input into structured information to gain insight into a document's contents, form-based document understanding is a burgeoning study area. However, due to the variety of layout patterns in form-like documents, it can be difficult to correctly serialise tokens in practice.




We suggest FormNet, a sequence model that takes structure into account in order to improve the serialisation of forms. First, we create Rich Attention, which makes use of the spatial relationship between tokens in a way that allows for a more accurate calculation of attention scores. Second, by embedding representations from their neighbouring tokens through graph convolutions, we create Super-Tokens for each word. As a result, FormNet explicitly restores any possibly lost local syntactic information.
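The spatial-attention idea described above can be illustrated, in a greatly simplified form, as scaled dot-product attention whose scores receive an additive bias computed from the 2-D offsets between token positions on the page. The following NumPy sketch is only a toy illustration of that general mechanism, not the actual FormNet Rich Attention; the function name and the toy bias weights are assumptions.

import numpy as np

def spatially_biased_attention(Q, K, V, centers, w_bias):
    # Q, K, V: (n, d) query/key/value matrices for n tokens.
    # centers: (n, 2) box centers of the tokens on the page.
    # w_bias:  (2,) toy weights turning a 2-D offset into a scalar score bias.
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                        # content-based scores
    offsets = centers[:, None, :] - centers[None, :, :]  # pairwise 2-D offsets
    scores = scores + offsets @ w_bias                   # add the spatial bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

# Example: three tokens with 4-dimensional embeddings and page positions.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
centers = np.array([[10.0, 20.0], [12.0, 20.0], [300.0, 400.0]])
print(spatially_biased_attention(Q, K, V, centers, np.array([-0.01, -0.01])).shape)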
II. LITERATURE REVIEW

1) Title: A Novel Method based on Character Segmentation for Slant Chinese Screen-rendered Text Detection and Recognition
Author: Tianlun Zheng, Xiaofeng Wang, Xin Yuan, and Shiqin Wang
Methodology: Chinese characters were first extracted using vertical projection and error correction; they were then recognized via inception-module-based convolutional neural networks. The proposed model can effectively segment Chinese characters from screen-rendered images and significantly reduce the training time (a minimal sketch of this projection-based segmentation is given after this review).
Database Used: ICDAR 2013 Robust Reading Competition - Chinese (RRC-2013): a dataset containing over 2,000 high-resolution images of natural scenes and born-digital images with bounding boxes around the text regions.
Source of impaired speech: There are several sources of impaired speech that can affect Urdu-text detection and recognition in natural scene images using deep learning. Some of the most common sources of impaired speech include background noise, accent or dialect, and speech disorders.
Observation: One of the main observations of the paper is that traditional segmentation-free approaches may not be effective for slant Chinese text, as they do not account for deformation and overlapping of characters. This highlights the need for a new approach that can handle these challenges.

2) Title: A Novel Method based on Character Segmentation for Slant Chinese Screen-rendered Text Detection and Recognition
Author: Tianlun Zheng, Xiaofeng Wang, Xin Yuan, and Shiqin Wang
Methodology: Screen-rendered text has broad application prospects in the fields of medical records, dictionary screen capture, and screen-assisted reading. However, Chinese screen-rendered text always has the challenges of small font size and low resolution, and a screen-rendered text image in a natural scene will have a certain tilt angle. These all pose great challenges for screen text recognition. This paper proposes a method based on character segmentation.
Database Used: ICDAR 2017 Robust Reading Competition - Chinese (RRC-2017): a dataset containing over 2,000 high-resolution images of natural scenes and born-digital images with bounding boxes around the text regions.
Source of impaired speech: There are several sources of impaired speech that can affect Urdu-text detection and recognition in natural scene images using deep learning. Some of the most common sources of impaired speech include background noise, accent or dialect, and speech disorders.

3) Title: Novel Approach for Image Text Recognition and Translation
Author: Srinandan Komondor, Y. Mohana Roopa, M. Madhu Bala
Methodology: One of the most pressing problems of today is to exactly translate the text present in an image into human-readable text. This has been gaining attention because of the immense work done by the computer vision community. The main concept behind this technology is OCR (Optical Character Recognition). With the help of OCR, text in electronic documents can be searched, recognized, and easily converted into a human-readable form.
Database Used: The paper is likely inspired by the increasing demand for image text recognition and translation systems that can process text in multiple languages from images captured in the real world. This demand is driven by the growing use of social media and the internet, which generates large amounts of multilingual image content that requires accurate and efficient processing.
Source of impaired speech: The researchers conducted experiments to evaluate the accuracy and effectiveness of the proposed approach. They tested the system using images with text in English, Spanish, and Chinese, and evaluated the system's ability to recognize and translate the text into other languages. The results showed that the proposed approach achieved high accuracy in recognizing text in images and translating it into different languages.
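As flagged in item 1 above, the vertical-projection idea can be illustrated with a minimal sketch: sum the ink in each pixel column of a binarized text-line image and split the line wherever a sufficiently long run of near-empty columns occurs. This is a generic illustration under assumed thresholds, not the cited paper's implementation.

import numpy as np

def segment_by_vertical_projection(binary_line, gap_thresh=0, min_gap=2):
    # binary_line: 2-D array (height, width) with text pixels = 1, background = 0.
    # Returns a list of (start_col, end_col) intervals, one per detected character.
    projection = binary_line.sum(axis=0)       # ink per column
    is_gap = projection <= gap_thresh
    segments, start, gap_run = [], None, 0
    for col, gap in enumerate(is_gap):
        if not gap:
            if start is None:
                start = col                    # a character begins here
            gap_run = 0
        else:
            gap_run += 1
            if start is not None and gap_run >= min_gap:
                segments.append((start, col - gap_run + 1))
                start = None
    if start is not None:
        segments.append((start, len(projection)))
    return segments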




III. METHODOLOGY

The voice processing module and the image processing module are the two main components that make up the text-to-speech device. The image processing module uses the camera to capture photos and converts them into text. In order for the sound to be perceived, the speech processing module turns the text into audio and processes it with explicit physical properties. The voice processing module then converts .txt to speech from .jpg, where OCR alters the extension. OCR (optical character recognition) is a technology that accurately recognizes characters using an optical system. The camera functions as the equivalent of the eye, while the computer serves as the equivalent of the human intellect when it comes to processing images.

A. AIM:
In this work, an effective approach is suggested for text recognition and extraction from images and text-to-speech conversion.

B. OBJECTIVES:
To detect and extract text from images and convert it into a digital form of speech for an effective medium of communication. To extract information (text), convert it into digital form, and recite it accordingly. To serve as an effective medium of communication.
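The detection stage outlined in the abstract and in the objectives above (greyscale conversion, maximally stable extremal regions, geometric filtering) can be sketched roughly with OpenCV as follows. This is a simplified illustration under assumed thresholds: the stroke width transform step is replaced here by basic size and aspect-ratio checks, and 'input.jpg' is a placeholder file name.

import cv2

def detect_text_regions(image_path):
    # Greyscale conversion -> MSER region detection -> simple geometric filtering.
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    boxes = []
    for points in regions:
        x, y, w, h = cv2.boundingRect(points)
        aspect = w / float(h)
        # Assumed geometric filters standing in for SWT-based filtering.
        if 10 < w * h < 0.5 * gray.size and 0.1 < aspect < 10:
            boxes.append((x, y, w, h))
    return boxes

for box in detect_text_regions("input.jpg"):
    print(box)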
IV. SOFTWARE DESIGN

The software used in Smart Image to Text & Text to Speech Recognition using Machine Learning typically involves several components that work together to enable this functionality; Fig. 4.1 shows the system architecture process.

Fig. 4.1: System Architecture
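To make the two modules of Fig. 4.1 concrete, a minimal sketch of how the image-processing and voice-processing components could be glued together is given below, assuming the pytesseract wrapper for the Tesseract OCR engine and the pyttsx3 offline TTS engine; neither library is named in the paper, and 'input.jpg' is a placeholder.

import pytesseract      # OCR: picture -> text
import pyttsx3          # TTS: text -> speech
from PIL import Image

def image_to_speech(image_path):
    # Image-processing module: extract text from the captured picture.
    text = pytesseract.image_to_string(Image.open(image_path))
    # Voice-processing module: turn the recognized text into audio.
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
    return text

print(image_to_speech("input.jpg"))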
V. TESTING AND RESULT

1. Unit Testing: It is the testing of individual software units of the application. It is done after the completion of an individual unit, before integration. Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.

2. Integration Testing: Integration tests are designed to test integrated software components to determine if they run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that, although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.

Table 5.1: Testcase 1
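As an illustration of the unit-testing level described above (the kind of check a test case such as Table 5.1 records), a small test for a hypothetical extract_text() helper could look like this; the helper name and behaviour are assumptions for illustration, not code from this paper.

import unittest
from unittest import mock

def extract_text(image_path, ocr=None):
    # Hypothetical helper: run an injected OCR callable and return cleaned text.
    ocr = ocr or (lambda path: "")
    return ocr(image_path).strip()

class ExtractTextTests(unittest.TestCase):
    def test_whitespace_is_stripped(self):
        fake_ocr = mock.Mock(return_value="  HELLO WORLD \n")
        self.assertEqual(extract_text("any.jpg", ocr=fake_ocr), "HELLO WORLD")

    def test_empty_image_gives_empty_string(self):
        fake_ocr = mock.Mock(return_value="")
        self.assertEqual(extract_text("blank.jpg", ocr=fake_ocr), "")

if __name__ == "__main__":
    unittest.main()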




Table 5.2: Testcase 2

The proposed method successfully detects the text regions in most of the images and is quite accurate in extracting the text from the detected regions. Based on the experimental analysis performed, it was found that the proposed method can accurately detect text regions in images with different text sizes, styles, and colours. Although our approach overcomes most of the challenges faced by other algorithms, it still struggles with images where the text regions are very small or blurred. The test-case tables list the word confidences obtained after performing optical character recognition on the image used in the experimental analysis. Word confidence is a metric indicating the confidence of the recognition result; confidence values range between 0 and 1 and should be interpreted as probabilities. As the tables show, words with fewer characters have a better word confidence than words comprising a greater number of characters. The average word confidence comes out to be 0.8361.
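The word-confidence figures discussed above can be reproduced in spirit with Tesseract's per-word confidence output. A minimal sketch assuming pytesseract is shown below; Tesseract reports confidences on a 0-100 scale (with -1 for non-word entries), so they are divided by 100 here to match the 0-1 probabilities used in this section, and 'input.jpg' is a placeholder.

import pytesseract
from PIL import Image

def word_confidences(image_path):
    # Return (word, confidence) pairs with confidences scaled to the 0-1 range.
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    return [(w, float(c) / 100.0)
            for w, c in zip(data["text"], data["conf"])
            if w.strip() and float(c) >= 0]      # skip blanks and -1 entries

pairs = word_confidences("input.jpg")
if pairs:
    print("average word confidence:", sum(c for _, c in pairs) / len(pairs))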
Final Testing:
The home page of Smart Image to Text & Text to Speech Recognition using Machine Learning typically includes a graphical user interface (GUI) built using the Tkinter module in Python. The GUI displays a welcome message or a title such as "Smart Image to Speech Using Machine Learning" along with the logo of the application. The GUI typically contains several buttons that allow the user to interact with the application. These buttons may include a "Login" button, a "Register" button, and an "Exit" button.

Fig. 5.3: First Interface Screen
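A minimal Tkinter sketch of the home page shown in Fig. 5.3 is given below; the window title, button labels, and callback bodies are placeholders for illustration, not the application's actual code.

import tkinter as tk

def open_login():        # placeholder callback
    print("login clicked")

def open_register():     # placeholder callback
    print("register clicked")

root = tk.Tk()
root.title("Smart Image to Speech Using Machine Learning")
tk.Label(root, text="Smart Image to Speech Using Machine Learning",
         font=("Arial", 14, "bold")).pack(padx=20, pady=10)
tk.Button(root, text="Login", width=20, command=open_login).pack(pady=5)
tk.Button(root, text="Register", width=20, command=open_register).pack(pady=5)
tk.Button(root, text="Exit", width=20, command=root.destroy).pack(pady=5)
root.mainloop()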
Image to Text and Text to Voice:
The speech conversion feature in Smart Image to Text & Text to Speech Recognition using Machine Learning allows the user to convert the recognized text into audible speech. Once the text has been detected from the uploaded image and translated into the desired language, the user can select the "Text-to-Speech" button on the output page. Upon clicking this button, the system uses a Text-to-Speech (TTS) engine to convert the recognized text into speech, which can be heard through the device's speakers or headphones.

Fig. 5.4: English Image to Text Interface Screen

The TTS engine uses machine learning algorithms to generate natural-sounding speech that closely resembles human speech patterns, with appropriate intonation, stress, and pauses. Users may have the option to adjust the speed and volume of the speech, as well as choose from different voices or accents available in the TTS engine. The speech conversion feature provides an accessible and convenient way for users to consume and understand the detected text, especially for those with visual impairments or those who prefer auditory learning.
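The speed, volume, and voice options mentioned above map directly onto properties exposed by an offline TTS engine such as pyttsx3; the sketch below is one possible way to wire them up, with arbitrary example values.

import pyttsx3

def speak(text, rate=150, volume=0.9, voice_index=0):
    # Speak the recognized text with adjustable speed, volume, and voice.
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)        # speaking rate (words per minute)
    engine.setProperty("volume", volume)    # 0.0 to 1.0
    voices = engine.getProperty("voices")   # available voices/accents
    if voices:
        engine.setProperty("voice", voices[voice_index].id)
    engine.say(text)
    engine.runAndWait()

speak("Hello, this text was recognized from an image.")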




Fig. 5.5: Image to Text and Text to Speech Interface Screen

Language Translation Testing:
Smart Image to Text & Text to Speech Recognition using Machine Learning is a system that can recognize and convert text in images to speech using machine learning algorithms. The system uses image processing techniques to identify and extract text from images. Then, it applies machine learning algorithms to recognize the text and convert it to speech using a text-to-speech (TTS) engine.

Fig. 5.6: English to Hindi Language Conversion Interface Screen
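For the English-to-Hindi conversion shown in Fig. 5.6, one possible implementation is sketched below. It assumes the third-party deep-translator package for translation and gTTS for Hindi speech synthesis; neither library is named in this paper, so this is purely an illustrative pipeline.

from deep_translator import GoogleTranslator   # assumed translation backend
from gtts import gTTS                          # assumed Hindi TTS backend

def translate_and_speak(english_text, out_file="hindi_speech.mp3"):
    # Translate the recognized English text to Hindi and save spoken Hindi audio.
    hindi_text = GoogleTranslator(source="en", target="hi").translate(english_text)
    gTTS(text=hindi_text, lang="hi").save(out_file)
    return hindi_text, out_file

text, path = translate_and_speak("Welcome to the smart image to speech system.")
print(text, "->", path)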

VI. FUTURE SCOPE

In future, this work can be extended to detect text from video or real-time analysis, and the output can be automatically documented in WordPad or any other editable format for further use.

1. Taking Over Education
Another wild reality is that it could possibly overthrow some forms of education. The technology is already being used to help people with disabilities who are not able to read, but it could advance even more. It is possible that text-to-speech converters end up taking over parts of the education system, because it would be cheaper for a school to pay for a converter than for a full-time teacher. Even though that would be a long way off, it is still a striking thought to ponder.

2. No More Typing
The last point to leave you with is that typing could become a skill of the past if we continue down this road. Text-to-speech converters are big, but what if speech-to-text takes over? Typing out your message or paper would be a thing of the past, because the user could just use their voice. It does make sense in some regards, because most people can speak much faster than they can type. But there are certain drawbacks that could hinder the expansion of this idea.

