Visual Assist

CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

Navigating through the world poses significant challenges for individuals with
visual impairments, often requiring reliance on external assistance or memorized
routes. Traditional navigation aids fall short in providing real-time, context-aware
guidance, limiting the independence and mobility of visually impaired individuals.
However, advancements in technology, particularly in the field of computer vision
and smartphone capabilities, offer promising solutions to address these challenges.

The Visual Assist navigation app is a groundbreaking initiative aimed at
revolutionizing navigation support for the visually impaired. By leveraging the
power of modern smartphones and sophisticated computer vision algorithms,
Visual Assist seeks to provide accurate, real-time navigation guidance tailored to
the unique needs of visually impaired users. This introduction sets the stage for
exploring the design, implementation, and potential impact of Visual Assist in
enhancing the independence and mobility of individuals with visual impairments.
Through a combination of innovative technology, user-focused design, and
continuous improvement, Visual Assist aims to empower visually impaired
individuals to navigate their surroundings with confidence and autonomy.

1.2 PROBLEM BACKGROUND

Manually transcribing the speech in video recordings is slow and expensive.
Text-from-video systems can address this challenge by automatically extracting
speeches from video recordings and converting them into text documents. In this
project, we propose a text-from-video system that can extract speeches from a
video recording file and convert them into a text document using speech
recognition techniques. The system's core components are a video processing
module and a speech recognition module.

1.3 SCOPE OF THE PROJECT

The aim is to develop a text-from-video system that can extract speeches from a
video recording file and convert them into a text document using speech
recognition techniques. The project aims to automate the process of speech
recognition and text extraction, eliminating the need for manual transcription
or captioning and making the process more efficient and scalable. The project's
scope includes evaluating the effectiveness of the text-from-video system
through a user study. The study will involve comparing the text generated by
our system with manually transcribed text from the video recording. We will
measure the accuracy and completeness of the text and gather feedback on the
system's usability and effectiveness.

The project's scope also includes implementing pre-processing techniques, such
as noise reduction and normalization, to improve the accuracy of the speech
recognition. We will also use language model adaptation techniques to adapt the
speech recognition system to the speaker's voice and speech patterns.

1.4 OBJECTIVE OF THE PROJECT

The video processing module is responsible for extracting the audio from the
video recording file, while the speech recognition module converts the
extracted audio into a text document. To adapt the speech recognition system to
the speaker's voice and speech patterns, we will use language model adaptation
techniques. These techniques will enhance the system's ability to recognize the
speaker's words and phrases accurately.

The proposed text-from-video system has several potential applications, such as
creating subtitles for videos, making video content accessible to
hearing-impaired individuals, and providing a convenient way to extract
speeches from recorded lectures or meetings. The system can also be used in
content-based video retrieval systems to enable users to search for specific
videos based on the spoken content.

1.5 ORGANIZATION OF THE REPORT

The rest of this report is organized as follows: Chapter 2 presents the
literature survey. Chapter 3 gives a brief overview of the system analysis.
Chapter 4 covers the system specification and Chapter 5 the system design.
Chapter 6 describes the system implementation, and Chapter 7 covers system
testing. Chapter 8 presents the experimental results. Finally, the conclusion
and future enhancements are given in Chapter 9, and the appendices in
Chapter 10.

CHAPTER 2

LITERATURE SURVEY

TITLE: Generating Subtitles Automatically Using Audio Extraction and
Speech Recognition

Authors: Abhinav Mathur; Tanya Saxena; Rajalakshmi Krishnamurthi, 2015

In the present scenario, video plays a vital role in helping people understand
and comprehend information, for example songs, movies, video lectures, or any
other multimedia data relevant to the user. Hence, it becomes important to make
videos available to people with auditory problems, and even more so to bridge
the gaps of their native language. This can best be done by the use of
subtitles for the video. However, downloading subtitles for a video from the
internet is a monotonous process. Consequently, generating subtitles
automatically through the software itself, without the use of the internet, is
a valid subject of research. Hence, this research paper resolves the above
issue through three distinct modules, namely Audio Extraction, which converts
an input file of any format supported by MPEG standards to .wav format (here a
24% reduction in the size of the song has been achieved after extraction);
Speech Recognition of the extracted .wav file; and finally Subtitle Generation,
in which a .txt/.srt file is generated that is synchronized with the input
file.

TITLE: JARVIS: An Interpretation of AIML with Integration of GTTS and
Python

Authors: Ravivanshikumar Sangpal; Tanvee Gawand; Sahil Vaykar; Neha
Madhavi, 2019

This paper presents JARVIS, a virtual integrated voice assistant comprising
GTTS, AIML (Artificial Intelligence Markup Language), and Python-based
state-of-the-art technology in personalized assistant development. JARVIS
incorporates the power of AIML with the industry-leading Google platform for
text-to-speech conversion and the voice of the Male Pitch in the GTTS
libraries, inspired by the Marvel world. This is the result of the adoption of
Python's dynamic pyttsx base, which operates in adjacent phases with GTTS and
AIML, facilitating the establishment of considerably smooth dialogues between
the assistant and the users. This is a unique result of the contribution of
several components, such as the feasible use of AIML and its dynamic fusion
with platforms like Python (pyttsx) and GTTS (Google Text to Speech), resulting
in a consistent and modular structure for JARVIS that exposes widespread
reusability and negligible maintenance.

TITLE: A Text-to-Speech Pipeline, Evaluation Methodology, and Initial
Fine-Tuning Results for Child Speech Synthesis

Authors: Rishabh Jain; Mariam Yahayah Yiwere; Dan Bigioi; Peter
Corcoran; Horia Cucu, 2022

Speech synthesis has come a long way, as current text-to-speech (TTS) models
can now generate natural human-sounding speech. However, most TTS research
focuses on adult speech data, and there has been very limited work done on
child speech synthesis. This study developed and validated a training pipeline
for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech
datasets. The approach adopts a multi-speaker TTS retuning workflow to provide
a transfer-learning pipeline. A publicly available child speech dataset was
cleaned to provide a smaller subset of approximately 19 hours, which formed the
basis of the fine-tuning experiments. Both subjective and objective evaluations
were performed, using a retrained MOSNet for objective evaluation and a novel
subjective framework for mean opinion score (MOS) evaluations. Subjective
evaluations achieved an MOS of 3.95 for speech intelligibility, 3.89 for voice
naturalness, and 3.96 for voice consistency.

Objective evaluation using a retrained MOSNet showed a strong correlation
between real and synthetic child voices. Speaker similarity was also verified
by calculating the cosine similarity between the embeddings of utterances. An
automatic speech recognition (ASR) model was also used to provide a word error
rate (WER) comparison between the real and synthetic child voices. The final
trained TTS model was able to synthesize child-like speech from reference audio
samples as short as 5 seconds.

TITLE: Artificial Intelligence-based Voice Assistant

Authors: S Subhash; Prajwal N Srivatsa; S Siddesh; A Ullas; B Santhosh, 2020

Voice control is a major growing feature that is changing the way people live.
Voice assistants are commonly used in smartphones and laptops. AI-based voice
assistants are systems that can recognize human voice and respond via
integrated voices. This voice assistant gathers audio from the microphone,
converts it into text, and then sends it through GTTS (Google Text to Speech).
The GTTS engine converts the text into an audio file in the English language,
and that audio is then played using the playsound package of the Python
programming language.

TITLE: Audio-based Near-Duplicate Video Retrieval with Audio Similarity
Learning

Authors: Pavlos Avgoustinakis; Giorgos Kordopatis-Zilos; Symeon
Papadopoulos; Andreas L. Symeonidis; Ioannis Kompatsiaris, 2021

In this work, we address the problem of audio-based near-duplicate video
retrieval. We propose the Audio Similarity Learning (AuSiL) approach that
effectively captures temporal patterns of audio similarity between video pairs.
For robust similarity calculation between two videos, we first extract
representative audio-based video descriptors by leveraging transfer learning
based on a Convolutional Neural Network (CNN) trained on a large-scale dataset
of audio events, and then we calculate the similarity matrix derived from the
pairwise similarity of these descriptors. The similarity matrix is subsequently
fed to a CNN that captures the temporal structures existing within its content.
We train the network following a triplet generation process and optimizing the
triplet loss function. To evaluate the effectiveness of the proposed approach,
we manually annotated two publicly available video datasets based on the audio
duplicity between their videos. The proposed approach achieves very competitive
results compared to three state-of-the-art methods. Also, unlike the competing
methods, it is very robust to the retrieval of audio duplicates generated with
speed transformations.

2.1 INFERENCE FROM THE LITERATURE

AI-powered navigation is very useful for blind people, helping them identify
objects and known and unknown faces. The device can provide audio descriptions,
such as the presence of a tree or a park bench, recognize and announce the
pre-programmed faces of friends and family, and provide both color and object
recognition.

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

A transcriber is someone who makes a handwritten or hand-typed copy of either
live or recorded spoken content. In short, they convert speech to text, just
without the automated transcription and speedy turnaround. Manual transcription
services use professional human transcribers to create a written record of the
audio or video. Transcribers can also be used for text-to-text services,
converting, for example, PDFs into Word documents. While human transcribers are
generally considered to be more accurate when it comes to specialized
terminology, they cannot compete with the simplicity, cost, or security of
automated transcription. For every minute of audio, it can take up to four
minutes for a human to transcribe, and when you are paying up to $5 per minute,
every second counts.

The existing methods for extracting text from video recordings involve manual
transcription or captioning, which is a time-consuming and expensive process. In
manual transcription, a human transcriber listens to the audio portion of the video
recording and types out the speech into a text document. Captioning, on the other
hand, involves adding subtitles to the video recording manually. While this method
can provide a convenient way for hearing-impaired individuals to access video
content, it requires a significant amount of time and effort to produce captions.

Disadvantages:

 Very time-consuming
 Waste of paper material
 High cost
 Difficult to organize

3.2 PROPOSED SYSTEM

In our proposed system, we aim to automate the process of speech recognition
and text extraction from video recordings using artificial intelligence
techniques. The system will use the MoviePy library to extract the audio
portion of the video recording file and pass it through Google's speech
recognition library for text conversion. We will use pre-processing techniques,
such as noise reduction and normalization, to improve the accuracy of the
speech recognition. We will also use language model adaptation techniques to
adapt the speech recognition system to the speaker's voice and speech patterns.
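As an illustration of this pipeline, the following is a minimal sketch in
Python, assuming the moviepy and SpeechRecognition packages are installed; the
file names are placeholders, not the project's actual paths.

import moviepy.editor as mp
import speech_recognition as sr

# Video processing module: extract the audio track to a WAV file.
clip = mp.VideoFileClip("lecture.mp4")
clip.audio.write_audiofile("lecture.wav")

# Speech recognition module: pass the audio to Google's recognizer.
recognizer = sr.Recognizer()
with sr.AudioFile("lecture.wav") as source:
    recognizer.adjust_for_ambient_noise(source)  # simple noise handling
    audio = recognizer.record(source)

text = recognizer.recognize_google(audio)
with open("lecture.txt", "w") as f:
    f.write(text)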

The proposed system has several advantages over the existing methods. Firstly,
it eliminates the need for manual transcription or captioning, making the
process more efficient and scalable. Secondly, it provides a convenient way for
hearing-impaired individuals to access video content by automatically
generating subtitles. Thirdly, it can be used to extract speeches from recorded
lectures or meetings, providing a convenient way to review and analyze the
content of these recordings. Finally, it can be used in content-based video
retrieval systems to enable users to search for specific videos based on the
spoken content. The system also uses gTTS (Google Text-to-Speech), a Python
library and CLI tool that interfaces with the Google Translate text-to-speech
API. It writes spoken MP3 data to a file, a file-like object (bytestring) for
further audio manipulation, or to stdout, and it features flexible
pre-processing and tokenizing.

Fig.3.1 Proposed Diagram

MoviePy is a Python module for video editing, which can be used for basic
operations on videos and GIFs. A video is formed from frames: each frame is an
individual image, and the combination of frames creates the video.

Fig 3.2 Audio Extraction Process

Overall, our proposed system can significantly improve the accessibility and
convenience of video content for everyone, regardless of their hearing ability or
preference for reading over watching videos.

10
CHAPTER 4

SYSTEM SPECIFICATION

4.1 HARDWARE SPECIFICATION

 Processor: Intel® Core™ i5 4300M at 2.60 GHz or 2.59 GHz
(1 socket, 2 cores, 2 threads per core), 8 GB of DRAM
 Disk space: 320 GB
 Operating systems: Windows® 10, macOS, and Linux

4.2 SOFTWARE SPECIFICATION

 Server Side : Python 3.7.4 (64-bit or 32-bit)
 Client Side : HTML, CSS, Bootstrap
 Framework : Flask 1.1.1
 Back end : MySQL 5.
 Server : Wamp Server 2i
 OS : Windows 10 64-bit or Ubuntu 18.

4.3 SOFTWARE DESCRIPTION

4.3.1 Python 3.7.4


Python is a general-purpose interpreted, interactive, object-oriented, and
high-level programming language. It was created by Guido van Rossum during
1985-1990. Like Perl, Python source code is also available under the GNU
General Public License (GPL). This section gives enough understanding of the
Python programming language.
Python is designed to be highly readable. It uses English keywords frequently
whereas other languages use punctuation, and it has fewer syntactical
constructions than other languages.

Python is a must for students and working professionals who want to become
great software engineers, especially when they are working in the web
development domain. Some of the key advantages of learning Python:
 Python is Interpreted − Python is processed at runtime by the interpreter.
You do not need to compile your program before executing it. This is similar
to PERL and PHP.
 Python is Interactive − You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
 Python is Object-Oriented − Python supports the object-oriented style or
technique of programming that encapsulates code within objects.
 Python is a Beginner's Language − Python is a great language for
beginner-level programmers and supports the development of a wide range of
applications, from simple text processing to WWW browsers to games.
 The Python Package Index (PyPI) hosts thousands of third-party modules for
Python. Both Python's standard library and the community-contributed modules
allow for endless possibilities.
The most basic use case for Python is as a scripting and automation language.
Python isn't just a replacement for shell scripts or batch files; it is also
used to automate interactions with web browsers or application GUIs, or to do
system provisioning and configuration in tools such as Ansible and Salt. But
scripting and automation represent only the tip of the iceberg with Python.
General application programming with Python
You can create both command-line and cross-platform GUI applications with
Python and deploy them as self-contained executables. Python doesn't have the
native ability to generate a standalone binary from a script, but third-party
packages like cx_Freeze and PyInstaller can be used to accomplish that.

Data science and machine learning with Python
Sophisticated data analysis has become one of the fastest-moving areas of IT
and one of Python's star use cases. The vast majority of the libraries used for
data science or machine learning have Python interfaces, making the language
the most popular high-level command interface for machine learning libraries
and other numerical algorithms.
Web services and RESTful APIs in Python
Python's native libraries and third-party web frameworks provide fast and
convenient ways to create everything from simple REST APIs in a few lines of
code to full-blown, data-driven sites. Python's latest versions have strong
support for asynchronous operations, letting sites handle tens of thousands of
requests per second with the right libraries.
Metaprogramming and code generation in Python
In Python, everything in the language is an object, including Python modules
and libraries themselves. This lets Python work as a highly efficient code
generator, making it possible to write applications that manipulate their own
functions and have the kind of extensibility that would be difficult or
impossible to pull off in other languages.
Python can also be used to drive code-generation systems, such as LLVM, to
efficiently create code in other languages.
"Glue code" in Python
Python is often described as a "glue language," meaning it can let disparate
code (typically libraries with C language interfaces) interoperate. Its use in
data science and machine learning is in this vein, but that's just one
incarnation of the general idea. If you have applications or program domains
that you would like to hitch up but that cannot talk to each other directly,
you can use Python to connect them.

Python 2 vs. Python 3
Python is available in two versions, which are different enough to trip up many
new users. Python 2.x, the older "legacy" branch, will continue to be supported
(that is, receive official updates) through 2020, and it might persist
unofficially after that. Python 3.x, the current and future incarnation of the
language, has many useful and important features not found in Python 2.x, such
as new syntax features (e.g., the "walrus operator"), better concurrency
controls, and a more efficient interpreter.
Python 3 adoption was slowed for the longest time by the relative lack of
third-party library support. Many Python libraries supported only Python 2,
making it difficult to switch.
But over the last couple of years, the number of libraries supporting only
Python 2 has dwindled; all of the most popular libraries are now compatible
with both Python 2 and Python 3. Today, Python 3 is the best choice for new
projects; there is no reason to pick Python 2 unless you have no choice.
Python's libraries
The success of Python rests on a rich ecosystem of first- and third-party
software. Python benefits from both a strong standard library and a generous
assortment of easily obtained and readily used libraries from third-party
developers. Python has been enriched by decades of expansion and contribution.
Python's standard library provides modules for common programming tasks: math,
string handling, file and directory access, networking, asynchronous
operations, threading, multiprocess management, and so on. But it also includes
modules that manage common, high-level programming tasks needed by modern
applications: reading and writing structured file formats like JSON and XML,
manipulating compressed files, and working with internet protocols and data
formats (web pages, URLs, email). Most any external code that exposes a
C-compatible foreign function interface can be accessed with Python's ctypes
module.

The default Python distribution also provides a rudimentary, but useful,
cross-platform GUI library via Tkinter, and an embedded copy of the SQLite 3
database. The thousands of third-party libraries, available through the Python
Package Index (PyPI), constitute the strongest showcase for Python's popularity
and versatility.
For example:
The Beautiful Soup library provides an all-in-one toolbox for scraping HTML,
even tricky, broken HTML, and extracting data from it.
Requests makes working with HTTP requests at scale painless and simple.
Frameworks like Flask and Django allow rapid development of web services that
encompass both simple and advanced use cases.
Like C#, Java, and Go, Python has garbage-collected memory management, meaning
the programmer doesn't have to implement code to track and release objects.
Normally, garbage collection happens automatically in the background, but if
that poses a performance problem, you can trigger it manually, disable it
entirely, or declare whole regions of objects exempt from garbage collection as
a performance enhancement.
An important aspect of Python is its dynamism. Everything in the language,
including functions and modules themselves, is handled as an object. This comes
at the expense of speed (more on that later), but makes it far easier to write
high-level code.
Developers can perform complex object manipulations with only a few
instructions, and can even treat parts of an application as abstractions that
can be altered if needed.
Python's use of significant whitespace has been cited as both one of Python's
best and worst attributes. The indentation on the second line below isn't just
for readability; it is part of Python's syntax.

Python interpreters will reject programs that don't use proper indentation to
indicate control flow.

with open("myfile.txt") as my_file:
    file_lines = [x.strip("\n") for x in my_file]
Syntactical whitespace might cause noses to wrinkle, and some people do reject
Python for this reason. But strict indentation rules are far less obtrusive in
practice than they might seem in theory, even with the most minimal of code
editors, and the result is code that is cleaner and more readable.
Another potential turnoff, especially for those coming from languages like C or
Java, is how Python handles variable typing. By default, Python uses dynamic or
"duck" typing, which is great for quick coding but potentially problematic in
large code bases. That said, Python has recently added support for optional
compile-time type hinting, so projects that might benefit from static typing
can use it.
4.3.2 MySQL
What is MySQL? – An Introduction to Database Management Systems
Database management is the most important part when you have humongous data
around you. MySQL is one of the most famous relational databases used to store
and handle your data. This section goes through the following topics:
 What are Data & Database?
 Database Management System & Types of DBMS
 Structured Query Language (SQL)
 MySQL & its features
 MySQL Data Types
What are Data & Database?
Suppose a company needs to store the names of hundreds of employees working in
the company in such a way that all the employees can be individually
identified. Then, the company collects the data of all those employees. Now,
when I say data, I mean that the company collects distinct pieces of
information about an object. So, that object could be a real-world entity such
as a person, or any object such as a mouse, laptop, etc.
Database Management System & Types of DBMS
A Database Management System (DBMS) is a software application that interacts
with the user, applications and the database itself to capture and analyze data. The
data stored in the database can be modified, retrieved and deleted, and can be of
any type like strings, numbers, images etc.
Types of DBMS
There are mainly 4 types of DBMS, which are Hierarchical, Relational, Network,
and Object-Oriented DBMS.
 Hierarchical DBMS: As the name suggests, this type of DBMS has a
style of predecessor-successor type of relationship. So, it has a structure
similar to that of a tree, wherein the nodes represent records and the
branches of the tree represent fields.
 Relational DBMS (RDBMS): This type of DBMS uses a structure that
allows the users to identify and access data in relation to another piece of
data in the database.
 Network DBMS: This type of DBMS supports many to many relations
wherein multiple member records can be linked.
 Object-oriented DBMS: This type of DBMS uses
small individual software called objects. Each object contains a piece of
data, and the instructions for the actions to be done with the data.
Structured Query Language (SQL)
SQL is the core of a relational database which is used for accessing and managing
the database. By using SQL, you can add, update or delete rows of data, retrieve
subsets of information, modify databases and perform many actions. The different
subsets of SQL are as follows:
 DDL (Data Definition Language) – It allows you to perform various
operations on the database, such as CREATE, ALTER, and DROP
objects.
 DML (Data Manipulation Language) – It allows you to access and
manipulate data. It helps you to insert, update, delete and retrieve data
from the database.
 DCL (Data Control Language) – It allows you to control access to the
database. Example – Grant or Revoke access permissions.
 TCL (Transaction Control Language) – It allows you to deal with the
transaction of the database. Example – Commit, Rollback, save point, Set
Transaction.
4.3.3 Using MySQL
There's not a lot of point in being able to change HTML output dynamically
unless you also have a means to track the changes that users make as they use
your website. In the early days of the web, many sites used "flat" text files
to store data such as usernames and passwords. But this approach could cause
problems if the file wasn't correctly locked against corruption from multiple
simultaneous accesses. Also, a flat file can get only so big before it becomes
unwieldy to manage, not to mention the difficulty of trying to merge files and
perform complex searches in any kind of reasonable time. That's where
relational databases with structured querying become essential. And MySQL,
being free to use and installed on vast numbers of Internet web servers, rises
superbly to the occasion.
The highest level of MySQL structure is a database, within which you can have
one or more tables that contain your data. For example, let's suppose you are
working on a table called users, within which you have created columns for
surname, firstname, and email, and you now wish to add another user. One
command that you might use to do this is:

INSERT INTO users VALUES ('Smith', 'John', 'jsmith@mysite.com');

Of course, as mentioned earlier, you will have issued other commands to create
the database and table and to set up all the correct fields, but the INSERT
command here shows how simple it can be to add new data to a database. The
INSERT command is an example of SQL (which stands for Structured Query
Language), a language designed in the early 1970s and reminiscent of one of the
oldest programming languages, COBOL. It is well suited, however, to database
queries, which is why it is still in use after all this time. It's equally easy
to look up data. Let's assume that you have an email address for a user and you
need to look up that person's name. To do this, you could issue a MySQL query
such as:

SELECT surname, firstname FROM users WHERE email='jsmith@mysite.com';

MySQL will then return Smith, John, and any other pairs of names that may be
associated with that email address in the database. As you'd expect, there's
quite a bit more that you can do with MySQL than just simple INSERT and SELECT
commands. For example, you can join multiple tables according to various
criteria, ask for results in a variety of different orders, make partial
matches when you know only part of the string that you are searching for,
return only the nth result, and a lot more.
The Apache Web Server
In addition to PHP, MySQL, JavaScript, and CSS, there's actually a fifth hero
in the dynamic web: the web server. In this case, that means the Apache web
server. We've discussed a little of what a web server does during the HTTP
server/client exchange, but it actually does much more behind the scenes.
But these objects don't have to be static files, such as GIF images. They can
all be generated by programs such as PHP scripts. That's right: PHP can even
create images and other files for you, either on the fly or in advance to serve
up later. To do this, you normally have modules either precompiled into Apache
or PHP or called up at runtime. One such module is the GD library (short for
Graphics Draw), which PHP uses to create and handle graphics.
Apache also supports a huge range of modules of its own. In addition to the PHP
module, the most important for your purposes as a web programmer are the
modules that handle security. Other examples are the Rewrite module, which
enables the web server to handle a varying range of URL types and rewrite them
to its own internal requirements, and the Proxy module, which you can use to
serve up often-requested pages from a cache to ease the load on the server.
Some of these modules can be used to enhance the features provided by the core
technologies covered here.
About Open Source
Whether or not being open source is the reason these technologies are so
popular has often been debated, but PHP, MySQL, and Apache are the three most
commonly used tools in their categories. What can be said, though, is that
being open source means that they have been developed in the community by teams
of programmers writing the features they themselves want and need, with the
original code available for all to see and change.
What Is a WAMP, MAMP, or LAMP?
WAMP, MAMP, and LAMP are abbreviations for "Windows, Apache, MySQL, and PHP,"
"Mac, Apache, MySQL, and PHP," and "Linux, Apache, MySQL, and PHP,"
respectively. These abbreviations describe a fully functioning setup used for
developing dynamic web pages. WAMPs, MAMPs, and LAMPs come in the form of a
package that binds the bundled programs together so that you don't have to
install and set them up separately. This means you can simply download and
install a single program and follow a few easy prompts to get your web
development server up and running in the quickest time with the minimum hassle.
During installation, several default settings are created for you. The security
configuration of such an installation will not be as tight as on a production
web server, because it is optimized for local use. For these reasons, you
should never use such a setup as a production server. However, for developing
and testing websites and applications, one of these installations should be
entirely sufficient.

Using an IDE
As good as dedicated program editors can be for your programming productivity,
their utility pales into insignificance when compared to Integrated Development
Environments (IDEs), which offer many additional features such as in-editor
debugging and program testing, as well as function descriptions and much more.

Web Framework

A Web Application Framework, or simply Web Framework, represents a collection
of libraries and modules that enables a web application developer to write
applications without having to bother about low-level details such as
protocols, thread management, etc.
Flask
Flask is a web framework. This means Flask provides you with tools, libraries,
and technologies that allow you to build a web application. This web
application can be some web pages, a blog, a wiki, or something as big as a
web-based calendar application or website. Flask is often referred to as a
micro framework. It aims to keep the core of an application simple yet
extensible. Flask does not have a built-in abstraction layer for database
handling, nor does it have form validation support. Instead, Flask supports
extensions to add such functionality to the application. Let's take a closer
look into Flask, the so-called "micro" framework for Python.
 WSGI
The Web Server Gateway Interface (WSGI) has been adopted as a standard for
Python web application development. WSGI is a specification for a universal
interface between the web server and the web applications.
 Werkzeug
Werkzeug enables building a web framework on top of it; the Flask framework
uses Werkzeug as one of its bases. Plus, Flask gives you much more control over
the development stage of your project. It follows the principles of minimalism
and lets you decide how you will build your application.
 Flask has a lightweight and modular design, so it is easy to transform it
into the web framework you need with a few extensions, without weighing it
down.
 Flask documentation is comprehensive, full of examples, and well structured.
You can even try out some sample applications to really get a feel for Flask;
a minimal application is sketched after this list.
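The following is a minimal sketch of a Flask application (itself a WSGI
application under the hood), assuming Flask is installed; the route and page
text are illustrative, not the project's actual code.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder landing page for the upload workflow.
    return "Upload a video to generate its transcript."

if __name__ == "__main__":
    app.run(debug=True)  # development server only, not for production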

CHAPTER 5
SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE
The video to speech to text system architecture consists of several components
working together to extract and convert speech from a video recording into a
text document.
Fig 5.1 SYSTEM ARCHITECTURE
The first component is the video processing module, which is responsible for
extracting the audio from the video recording file. This module typically uses a
library such as MoviePy or OpenCV to extract the audio from the video file. The
second component is the speech recognition module, which converts the extracted
audio into text. This module typically uses a speech recognition library such as
Google's Speech-to-Text API or IBM's Watson Speech-to-Text API to perform the
speech recognition. The third component is the pre-processing module, which
prepares the audio for speech recognition by applying techniques such as noise
reduction, normalization, and segmentation. These techniques help to improve
the accuracy of the speech recognition and ensure that the text generated is of
high quality.
Finally, the output of the speech recognition module is a text document, which can
be further processed by natural language processing (NLP) techniques to extract
key phrases, identify entities, and perform other text analysis tasks.
Overall, the video to speech to text system architecture is designed to automate the
process of speech recognition and text extraction from video recordings. By
integrating these components into a single system, the architecture enables the
efficient and scalable conversion of speech to text, making video content more
accessible and convenient for a wide range of users.
5.2 DATA FLOW DIAGRAM
A data flow diagram is a two-dimensional diagram that explains how data is
processed and transferred in a system. The graphical depiction identifies each
source of data and how it interacts with other data sources to reach a common
output. Individuals seeking to draft a data flow diagram must identify external
inputs and outputs, determine how the inputs and outputs relate to each other,
and explain with graphics how these connections relate and what they result in.
Data flow diagrams can be divided into logical and physical. The logical data
flow diagram describes the flow of data through a system to perform certain
functionality of a business. The physical data flow diagram describes the
implementation of the logical data flow. This type of diagram helps business
development and design teams visualize how data is processed and identify or
improve certain aspects.

DATA FLOW SYMBOLS:

Symbol – Description
Entity – a source of data or a destination for data.
Process – a process or task that is performed by the system.
Data store – a place where data is held between processes.
Data flow – a flow of data between the other elements.

Fig 5.2 DFD Symbols

Fig 5.3 Data Flow Diagram

5.3 ER-DIAGRAM

5.3.1 INTRODUCTION:

An entity–relationship model (ER model) describes inter-related things of
interest in a specific domain of knowledge. An ER model is composed of entity
types (which classify the things of interest) and specifies relationships that
can exist between instances of those entity types. In software engineering, an
ER model is commonly formed to represent things that a business needs to
remember in order to perform business processes. Consequently, the ER model
becomes an abstract data model that defines a data or information structure
that can be implemented in a database.

5.3.2 CONCEPTUAL DATA MODEL:

This is the highest-level ER model in that it contains the least granular
detail but establishes the overall scope of what is to be included within the
model set. The conceptual ER model normally defines the master reference data
entities that are commonly used by the organization. Developing an
enterprise-wide conceptual ER model is useful to support documenting the data
architecture for an organization.

5.3.3 LOGICAL DATA MODEL:
A logical ER model does not require a conceptual ER model, especially if the
scope of the logical ER model includes only the development of a distinct
information system. The logical ER model contains more detail than the
conceptual ER model.

5.3.4 PHYSICAL DATA MODEL:

One or more physical ER models may be developed from each logical ER model.
The physical ER model is normally developed to be instantiated as a database.
Therefore, each physical ER model must contain enough detail to produce a
database, and each physical ER model is technology dependent, since each
database management system is somewhat different.

5.4 UML DIAGRAMS

5.4.1 USE CASE DIAGRAM


A use case diagram at its simplest is a representation of a user's interaction
with the system that shows the relationship between the user and the different
use cases in which the user is involved. A use case diagram can identify the
different types of users of a system and the different use cases, and will
often be accompanied by other types of diagrams as well.

Fig 5.4.1 Use Case Diagram

CHAPTER 6
SYSTEM IMPLEMENTATION
6.1 VIDEO PROCESSING
MoviePy is a Python module for video editing, which can be used for basic
operations on videos and GIFs. A video is formed from frames: each frame is an
individual image, and the combination of frames creates the video. An audio
file format is a file format for storing digital audio data on a computer
system. The bit layout of the audio data is called the audio coding format and
can be uncompressed, or compressed to reduce the file size, often using lossy
compression. The audio file can be loaded with the help of the AudioFileClip
method.
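As a brief illustration, the following sketch loads the audio track with
AudioFileClip and writes it out as WAV, assuming the moviepy package is
installed; the file names are placeholders.

from moviepy.editor import AudioFileClip

# AudioFileClip reads the audio track of an audio or video file.
audio = AudioFileClip("recording.mp4")
audio.write_audiofile("recording.wav")  # save the track as a WAV file
audio.close()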
6.2 SPEECH RECOGNITION
There are several APIs available to convert text to speech in Python. One such
API is the Google Text-to-Speech API, commonly known as the gTTS API. gTTS is a
very easy-to-use tool which converts the text entered into audio, which can be
saved as an MP3 file. The gTTS API supports several languages, including
English, Hindi, Tamil, French, German, and many more. The speech can be
delivered at either of the two available audio speeds, fast or slow. However,
as of the latest update, it is not possible to change the voice of the
generated audio.
The technology works by analyzing written text and converting it into an audio
file that can be played back through a speaker or headphones. The process
involves several steps, including natural language processing, linguistic
analysis, and audio synthesis. First, the text is analyzed using natural
language processing algorithms to identify the words and their meaning. The
software then applies linguistic analysis to the text, determining the
pronunciation of each word and the appropriate intonation, stress, and rhythm
for the sentence.
Finally, the software uses audio synthesis techniques to create a digital audio
file that replicates the human voice, which can then be played back through a
speaker or headphones.
The Google Text-to-Speech (gTTS) library in Python is a Python wrapper for the
Google Text-to-Speech API. It allows developers to easily convert written text
into natural-sounding audio speech in a variety of languages and voices using
the power of Google's neural network. With the gTTS library, developers can
simply import the package, pass in the text to be spoken and the language, and
then save the audio output to a file or play it directly. The library also
supports customizable settings such as the speech rate and volume, as well as
the option to save the audio output in various audio formats.
To use the gTTS library, developers need an internet connection, as the library
requires access to the Google Text-to-Speech API to convert the text to speech.
The library is compatible with Python 2 and 3, and can be installed using pip,
the Python package manager.
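A minimal usage sketch, assuming the gTTS package has been installed with pip;
the sentence and file name are placeholders.

from gtts import gTTS

# Convert a sentence to spoken audio and save it as an MP3 file.
tts = gTTS("Welcome to the text-from-video system.", lang="en", slow=False)
tts.save("welcome.mp3")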

Fig 6.2 Speech to Text Process


Overall, the Google Text-to-Speech library in Python provides developers with a
simple and convenient way to incorporate text-to-speech functionality into their
applications, enabling them to enhance the user experience and accessibility of
their software.
6.3 PRE-PROCESSING
The pre-processing module in the video to speech to text system is an important
component that prepares the audio extracted from the video for speech
recognition. The pre-processing module applies various techniques to improve
the quality and accuracy of the speech recognition output.
One of the most important techniques used in the pre-processing module is noise
reduction. Noise can often be present in video recordings due to environmental
factors, microphone limitations, or other sources. This noise can interfere with the
accuracy of the speech recognition system, making it difficult to accurately
transcribe the speech. The pre-processing module uses algorithms to reduce the
amount of noise in the audio signal, which improves the accuracy of the speech
recognition system. Another important technique used in the pre-processing
module is normalization.
Normalization involves adjusting the audio signal's volume levels to ensure that
the speech recognition system can accurately capture the speech. This technique is
particularly important when dealing with audio recordings with varying levels of
volume.
Segmentation is another technique used in the pre-processing module.
Segmentation involves dividing the audio signal into smaller segments or frames,
making it easier for the speech recognition system to analyze the speech accurately.
Overall, the pre-processing module plays a crucial role in the video to speech to
text system architecture, ensuring that the speech recognition system can accurately
transcribe the speech from the video recording. The combination of noise
reduction, normalization, and segmentation techniques significantly improves the
accuracy and quality of the system's output, making it a reliable and efficient tool
for converting video speech to text.
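As an illustration, the following sketch applies normalization, a crude noise
reduction, and segmentation with the pydub package. This is an assumed
implementation choice for demonstration only; the filter cutoff and chunk
length are example values, not tuned settings from this project.

from pydub import AudioSegment
from pydub.effects import normalize

audio = AudioSegment.from_wav("lecture.wav")
audio = normalize(audio)             # normalization: even out volume levels
audio = audio.high_pass_filter(100)  # crude noise reduction: cut low-frequency hum

# Segmentation: split the signal into 30-second chunks for recognition.
chunk_ms = 30 * 1000
chunks = [audio[i:i + chunk_ms] for i in range(0, len(audio), chunk_ms)]
for n, chunk in enumerate(chunks):
    chunk.export(f"chunk_{n}.wav", format="wav")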

6.4 VIDEO SPEECH TO TEXT SYSTEM


The final component in the video to speech to text system architecture is the
post-processing module, which is responsible for processing the text output
generated by the speech recognition module. The post-processing module typically
involves the application of natural language processing (NLP) techniques to
analyze the text and extract useful information.

One of the primary tasks of the post-processing module is to perform text
normalization, which involves standardizing the text to ensure consistency in
spelling, grammar, and punctuation. This step is particularly important when
dealing with large amounts of text generated by the system, as it helps to improve
the readability and accuracy of the text output.
Another important task of the post-processing module is named entity
recognition (NER), which involves identifying and categorizing named entities in
the text, such as people, organizations, and locations. This task is useful in
applications such as automatic subtitling or video indexing, where the identification
of important keywords and phrases can be used to improve the searchability and
accessibility of the video content.
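A short hedged sketch of NER using the spaCy library (an assumed tool choice
for illustration; it requires spaCy and its small English model, installable
with "python -m spacy download en_core_web_sm"; the input sentence is a
placeholder):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The lecture was recorded at Stanford University in California.")
for ent in doc.ents:
    # Prints each entity with its category, e.g. ORG or GPE.
    print(ent.text, ent.label_)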
In addition to text normalization and NER, the post-processing module may
also perform other NLP tasks, such as sentiment analysis or topic modeling, to
extract further insights from the text generated by the system. These techniques can
be particularly useful in applications such as market research or social media
analysis, where the extraction of meaningful information from large amounts of
text data is essential.
Overall, the post-processing module plays a crucial role in the video to speech
to text system architecture, enabling the extraction of useful information from the
text generated by the system.

CHAPTER 7
SYSTEM TESTING
Application testing is a software testing technique exclusively adopted to test
applications that are hosted on the web, in which the application interfaces
and other functionalities are tested.

With all the impending issues, web app testing holds more importance than ever.
However, testing a web application is not just an ordinary task; it depends on
several factors such as compatibility across various browsers, application
performance, user experience, user acceptance, and ensuring proper security.

Enterprises must deploy skilled testers to assess all aspects of the website
across platforms, browsers, and devices. The testers must always implement web
application testing best practices in order to produce accurate and reliable
test results without increasing testing times.

The most common types of testing involved in the development process are:

 Functionality Test
 Usability Test
 Interface Test
 Compatibility Test
 Performance Test
7.1 FUNCTIONALITY TESTING
This step ensures that the functionalities of an application are working
properly. Functional testing takes place at the source code level; a sample
functional test is sketched after the list below.
Functionality testing includes:
 Determining the data input and entry

 Test case execution
 Functions need to be properly identified because the software runs
effectively through the integration of functions
 Actual results must be analyzed.
 Verify there is no dead page or invalid redirects.
 First check all the validations on each field.
 Wrong inputs to perform negative testing.
 Verify the workflow of the system.
 Verify the data integrity.
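As a hedged illustration of such a functional test, the sketch below uses
pytest; the video_to_text function and sample file are hypothetical
placeholders standing in for the project's actual transcription code.

def video_to_text(video_path):
    # Hypothetical stand-in for the real pipeline (extract audio, recognize
    # speech, return the transcript); replace with the actual implementation.
    return "example transcript"

def test_generates_nonempty_text():
    # Functional check: a valid input must yield a non-empty transcript.
    text = video_to_text("sample.mp4")
    assert isinstance(text, str)
    assert len(text) > 0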

7.2 USABILITY TESTING


This testing type focuses on how the user experiences a particular application.
Efforts are put in to ensure that the application is built in line with user
needs. This testing method makes it a point to see that a user is able to
easily navigate through the various functions of an application. The content
that is displayed in the web application must also be clearly visible.
 Test the navigation and controls.
 Content checking.
 Check for user intuition.

7.3 INTERFACE TESTING


This testing type focuses on the interfaces through which the web application,
the web server, and the database communicate, and verifies that these
interactions behave as expected.
 Verify that communication between the systems is done correctly.
 Verify that all linked documents are supported/opened on all
platforms.
 Verify the security requirements or encryption while communication
happens between systems.

7.4 COMPATIBILITY TESTING


This testing methodology ensures that a particular web application is
compatible with all browsers. Compatibility testing takes place at three
levels: browser compatibility, operating system compatibility, and device
compatibility.

 Operating system compatibility testing - Linux, Mac OS, Windows

 Database compatibility testing - Oracle, SQL Server

 Browser compatibility testing - IE, Chrome, Firefox

 Other system software - web server, networking/messaging tools, etc.

7.5 PERFORMANCE TESTING


A specific application is tested in terms of how well it can perform under
stress conditions and heavy load. How the application performs under different
internet speeds, networks, and browsers is also worked upon, as is how the
application performs when running on various hardware configurations and what
mechanisms need to be strategized in order to prevent the application from
crashing. All the above-mentioned aspects are thoroughly scrutinized and tested
under the ambit of performance testing.

7.6 SECURITY TESTING

The final and most important step of testing an application is security
testing. When an application is being built, there is a lot of data that is
being used and stored. Some of this data can be sensitive and needs to be
protected at any cost, failure of which can cause a lot of technical issues and
prevent the application from functioning properly. In order to fully secure
these types of mission-critical data, security testing is implemented.

CHAPTER 8
EXPERIMENTAL RESULT

[Bar chart comparing the existing system and the proposed system]

Fig 8.1 Experimental Result

Based on the research that has been done, it can be concluded that to extract
audio from raw video using the MoviePy library, the first stage is the audio
extraction process. The system's core components are a video processing module
and a speech recognition module. The video processing module is responsible for
extracting the audio from the video recording file, while the speech
recognition module converts the extracted audio into a text document. The
system uses the MoviePy library to extract the audio portion of the video
recording file and Google's speech recognition library for text conversion.

CHAPTER 9
CONCLUSION AND FUTURE ENHANCEMENT

9.1 CONCLUSION
In conclusion, the text-from-video system we have proposed can significantly
improve the accessibility and convenience of video content for everyone,
regardless of their hearing ability or preference for reading over watching
videos. The system can automate the process of speech recognition and text
extraction from video recordings, eliminating the need for manual transcription
or captioning and making the process more efficient and scalable. The system's
core components are a video processing module and a speech recognition module
that use the MoviePy library to extract the audio portion of the video
recording file and Google's speech recognition library for text conversion. We
have also implemented pre-processing techniques such as noise reduction and
normalization to improve the accuracy of the speech recognition. The system
could also be extended to generate captions for live lectures and other live
broadcasts. This could significantly improve the accessibility and inclusivity
of such events, as it would enable individuals who are deaf or hard of hearing
to follow along with the spoken content.

In addition, the system could be further improved by incorporating natural
language processing (NLP) techniques to analyze the text generated by the
system. NLP techniques could be used to identify important keywords and phrases
in the text and provide a summary of the content. This could be particularly
useful in situations where users need to quickly understand the key points of a
video, such as news broadcasts or educational lectures.

Overall, the text-from-video system we have proposed has the potential to
significantly improve the accessibility and convenience of video content for a
wide range of users.

The system's efficiency and scalability make it suitable for a variety of
applications, such as creating subtitles for videos or providing a convenient
way to extract speeches from recorded lectures or meetings. With further
development and enhancement, this system could become a valuable tool for
improving the accessibility of video content and making it more inclusive for
all users.

9.2 FUTURE ENHANCEMENT

There are several potential future enhancements for this system. Firstly, we
could explore different speech recognition libraries and compare their
performance to Google's speech recognition library. This could potentially
improve the accuracy of the system and make it more robust. Secondly, we could
explore different pre-processing techniques and language model adaptation
techniques to improve the system's accuracy further. Thirdly, we could develop
a more advanced user interface that allows users to interact with the system
and edit the generated text. This could improve the usability of the system and
make it more accessible to a broader range of users.

CHAPTER 10
APPENDICES
10.1 SOURCE CODE

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<style>

@import url('https://fonts.googleapis.com/css?family=Montserrat:400,800');

*{
box-sizing: border-box;
}

body {
background: #f6f5f7;
display: flex;
justify-content: center;
align-items: center;
flex-direction: column;
font-family: 'Montserrat', sans-serif;
height: 100vh;
margin: -20px 0 50px;
}
h1 {
font-weight: bold;
margin: 0;
}

h2 {
text-align: center;
}

p{
font-size: 14px;
font-weight: 100;
line-height: 20px;
letter-spacing: 0.5px;
margin: 20px 0 30px;
}

span {
font-size: 12px;
}
a{
color: #333;
font-size: 14px;
text-decoration: none;
margin: 15px 0;
}
.button {
border-radius: 20px;

border: 1px solid #FF4B2B;
background-color: #FF4B2B;
color: #FFFFFF;
font-size: 12px;
font-weight: bold;
padding: 12px 45px;
letter-spacing: 1px;
text-transform: uppercase;
transition: transform 80ms ease-in;
}

.button:active {
transform: scale(0.95);
}

.button:focus {
outline: none;
}

.button.ghost {
background-color: transparent;
border-color: #FFFFFF;
}

form {
background-color: #FFFFFF;
display: flex;
align-items: center;

justify-content: center;
flex-direction: column;
padding: 0 50px;
height: 100%;
text-align: center;
}
input {
background-color: #eee;
border: none;
padding: 12px 15px;
margin: 8px 0;
width: 100%;
}
.container {
background-color: #fff;
border-radius: 10px;
box-shadow: 0 14px 28px rgba(0,0,0,0.25),
0 10px 10px rgba(0,0,0,0.22);
position: relative;
overflow: hidden;
width: 768px;
max-width: 100%;
min-height: 480px;
}

.form-container {
position: absolute;
top: 0;
height: 100%;

transition: all 0.6s ease-in-out;
}

.sign-in-container {
left: 0;
width: 50%;
z-index: 2;
}

.container.right-panel-active .sign-in-container {
transform: translateX(100%);
}

.sign-up-container {
left: 0;
width: 50%;
opacity: 0;
z-index: 1;
}

.container.right-panel-active .sign-up-container {
transform: translateX(100%);
opacity: 1;
z-index: 5;
animation: show 0.6s;
}

@keyframes show {

0%, 49.99% {
opacity: 0;
z-index: 1;
}

50%, 100% {
opacity: 1;
z-index: 5;
}
}

.overlay-container {
position: absolute;
top: 0;
left: 50%;
width: 50%;
height: 100%;
overflow: hidden;
transition: transform 0.6s ease-in-out;
z-index: 100;
}

.container.right-panel-active .overlay-container{
transform: translateX(-100%);
}

.overlay {
background: #FF416C;
background: -webkit-linear-gradient(to right, #FF4B2B, #FF416C);
background: linear-gradient(to right, #FF4B2B, #FF416C);
background-repeat: no-repeat;
background-size: cover;
background-position: 0 0;
color: #FFFFFF;
position: relative;
left: -100%;
height: 100%;
width: 200%;
transform: translateX(0);
transition: transform 0.6s ease-in-out;
}
.container.right-panel-active .overlay {
transform: translateX(50%);
}

.overlay-panel {
position: absolute;
display: flex;
align-items: center;
justify-content: center;
flex-direction: column;
padding: 0 40px;
text-align: center;
top: 0;
height: 100%;
width: 50%;

transform: translateX(0);
transition: transform 0.6s ease-in-out;
}

.overlay-left {
transform: translateX(-20%);
}

.container.right-panel-active .overlay-left {
transform: translateX(0);
}

.overlay-right {
right: 0;
transform: translateX(0);
}

.container.right-panel-active .overlay-right {
transform: translateX(20%);
}

.social-container {
margin: 20px 0;
}

.social-container a {
border: 1px solid #DDDDDD;
border-radius: 50%;
display: inline-flex;

justify-content: center;
align-items: center;
margin: 0 5px;
height: 40px;
width: 40px;
}

footer {
background-color: #222;
color: #fff;
font-size: 14px;
bottom: 0;
position: fixed;
left: 0;
right: 0;
text-align: center;
z-index: 999;
}

footer p {
margin: 10px 0;
}

footer i {
color: red;
}

footer a {
color: #3c97bf;

text-decoration: none;
}
</style>
</head>
<body>
<h2>KALAIARASAN & TEAM PROJECT</h2>
<div class="container" id="container">
<div class="form-container sign-up-container">
<form action="#">
<h1>Create Account</h1>
<div class="social-container">
<a href="#" class="social"><i class="fab fa-facebook-
f"></i></a>
<a href="#" class="social"><i class="fab fa-google-
plus-g"></i></a>
<a href="#" class="social"><i class="fab fa-linkedin-
in"></i></a>
</div>
<span>or use your email for registration</span>
<input type="text" placeholder="Name" />
<input type="email" placeholder="Email" />
<input type="password" placeholder="Password" />
<button>Sign Up</button>
</form>
</div>
<div class="form-container sign-in-container">
<form action="#">
<h1>Sign in</h1>
<input type="email" placeholder="Email" />

<input type="password" placeholder="Password" />
<a href="#">Forgot your password?</a>
<a class="button" href="/upload">Login</a>
</form>
</div>
<div class="overlay-container">
<div class="overlay">
<div class="overlay-panel overlay-left">
<h1>Welcome Back!</h1>
<p>To keep connected with us please login with your
personal info</p>
<button class="ghost" id="signIn">Sign In</button>
</div>
<div class="overlay-panel overlay-right">
<h1>Hello!</h1>
<p>Click Login</p>

<!-- <a><button
class="ghost" id="signUp">Login</button></a>-->
</div>
</div>
</div>
</div>
</body>
</html>

10.2 SCREENSHOTS

10.2.1 SIGN IN PAGE

10.2.2 FILE UPLOAD PAGE

10.2.3 TEXT GENERATOR MESSAGE

