Abstract: Digital healthcare has become a prominent and fast-growing platform for treatment. One initiative in this
area is a doctor-friendly digital system that, unlike traditional methods, lets doctors store patient details, consultations,
surgeries performed and other related information for every patient. This work builds a prototype digital medical
transcription platform that helps surgeons and physicians document patient consultations and surgery summaries by
recording them at the click of a button. The prototype uses open source network technologies: UniMRCP, the scalable
open source EPBX FreeSWITCH, the standard Voice over IP protocols (SIP for signalling and RTP for audio media), and
speech recognition engines that work with UniMRCP, such as Google Speech Recognition or CMU's PocketSphinx. The main
idea is to transform a voice recording into a text document that is presented as part of an Electronic Medical Record
system, using speech recognition and synthesis technologies.
Key Words: Cloud, FreeSWITCH, UniMRCP, Google Speech Recognition plugin, PocketSphinx.
analysis has been introduced in medical and healthcare organizations. The main idea was to link events to types of
hazards with efficient, engineer-centric solutions for safety during adverse situations. A healthcare hazard here means
losing sensitive patient records when awareness within healthcare is low. This may affect patient safety through
erroneous output of a Medical Information System (MIS) such as the Electronic Health Record. Richard W. Jones et al. [2]
highlight the prevalence of indirect hazards and the regulatory standard measures implemented during deployment. Their
research focuses not only on identifying and removing direct risks but also on indirect ones that arise after
deployment. The problem is addressed in three ways: (1) a modified MIL-STD-882E addresses existing deficiencies when
the user makes executable decisions. It defines the risks associated with erroneous Information System (IS) failure
modes and uses Software Control Categories (SCC) and a hazard severity table to obtain the Software Criticality Index
(SwCI), whose outputs are given to the Level of Rigor (LOR); (2) risk assessment of health applications (mHealth apps);
(3) a generic 8-step IS safety management process adapted to applications. A hazard-free transcription system is very
important; hence, patient data safety is always a priority.

Voice over Internet Protocol (VoIP) and Electronic Private Branch Exchange (EPBX) are cost-effective methods compared
with traditional ones. Asterisk is an open source, Linux-based server and Private Branch Exchange framework that lets
users build a phone system of their choice because its modules are flexible to customize. Mohammed Abdul Qadeer et
al. [3] implement an Asterisk server within a local Wi-Fi network and the Public Switched Telephone Network (PSTN) for
registered devices used within a university. The application architecture involves an Asterisk server, a client and a
PSTN exchange for placing voice and video calls over a private Wi-Fi cloud.

Real-time online interactive applications (ROIA) are distributed applications emerging at large scale in cloud
environments, but they face issues such as scalability and network latency. Previous research focused on mixed
deployment of ROIA and extension of ROIA. LIU Dong et al. [4] propose a system in which a new technology, MRCP, is
deployed to overcome the network latency and scalability issues of ROIA in cloud computing. The solution centres on the
MRCP architecture and an external balance strategy to cope with fluctuations in concurrent users and with network
latency requirements. The MRCP architecture has ROIA Servers (RS) and one MRCP Local Controller (MLC) for each data
centre distributed across the world. The MLC and RS are responsible for load balancing, storage, zoning and instancing.

FreeSWITCH acts as a PBX (Private Branch Exchange) server and is an open source, scalable soft-switch. It follows a
client-server architecture. Abdullah Mohammad Ansari et al. [5] implemented an Interactive Voice Response System (IVRS)
model for Session Initiation Protocol (SIP) based phones. The backbone of this application is the scalable FreeSWITCH
server, which connects to SIP-based soft phones running as desktop or mobile clients. The SIP phones register
themselves as clients with FreeSWITCH servers, which in turn hold information about all registered clients and other
connected FreeSWITCH servers. Accessing information in the phone's web browser while on a call could save time and make
this a more reliable approach.

Computer speech recognition is the process of making a computer system understand what we speak, that is, interpreting
voice in the form of text. Many speech recognition software packages exist for speech-to-text conversion. Aditya
Amberkar et al. [6] propose a speech-to-text model based on speech recognition and Recurrent Neural Networks (RNN).
Initially the speech, which is an analog signal, is digitized by sampling in accordance with the Nyquist theorem, and
the signal is pre-processed into 20-millisecond chunks. This pre-processed data is fed to the RNN. Applying an RNN
increases accuracy in many speech-to-text conversion engines, such as the Java- and Python-based Snowboy hotword
detection and the C-based CMU PocketSphinx. Amazon's Alexa and Google's STT are online speech-to-text engines, whereas
CMU PocketSphinx is an offline conversion engine, although training on the dataset is done online. Although training
the RNN is complex, it proves to be the best algorithm for speech processing and voice-controlled technologies.

Worldwide, commercial applications of Automatic Speech Recognition (ASR) are in high demand, but in India the field is
still evolving. Chadalavada Sai Manasa et al. [7] developed an acoustic model for speech recognition in Hindi using
CMU's PocketSphinx, with a database of 177 words and a cross-language dictionary adapted for speech recognition from a
language such as English. The ASR model uses Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based acoustic
modelling, with LPC and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. PocketSphinx is a
lightweight, free, real-time, continuous, medium-vocabulary speech recognition system developed for handheld devices.

FreeSWITCH is a highly scalable engine for routing and interconnecting communication protocols for any type of media,
namely audio, video or text, and is a cross-platform telephone exchange that bridges gaps in business solutions. It
supports embedded languages such as Lua and JavaScript, which make it more flexible. Wei Tang et al. [8] introduce a
soft-switch solution, FreeSWITCH, for efficient communication dispatching and accurate information, using the IMS
architecture and an SG-UAP based application.
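
As a rough illustration of the pre-processing described for the RNN-based recogniser [6], the sketch below splits an already digitized signal into 20-millisecond chunks. The 16 kHz sampling rate, the function name `split_into_frames` and the choice to drop a trailing partial chunk are assumptions made for this example; they are not taken from [6].

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate_hz: int,
                      frame_ms: int = 20) -> List[List[float]]:
    """Split a sampled signal into consecutive, non-overlapping frames."""
    if sample_rate_hz <= 0 or frame_ms <= 0:
        raise ValueError("sample_rate_hz and frame_ms must be positive")
    # Samples per frame, e.g. 16000 Hz * 20 ms / 1000 = 320 samples.
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len == 0:
        raise ValueError("frame is shorter than one sample")
    # Assumption: a trailing partial frame is dropped rather than padded.
    usable = len(samples) - len(samples) % frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, usable, frame_len)]


if __name__ == "__main__":
    # Hypothetical input: one second of silence sampled at 16 kHz. By the
    # Nyquist theorem this rate can represent content up to 8 kHz.
    one_second = [0.0] * 16000
    frames = split_into_frames(one_second, 16000)
    print(len(frames), len(frames[0]))
```

With these assumed values, one second of audio yields 50 frames of 320 samples each. Each frame would then be turned into features before being fed to the network; that step is outside this sketch.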
Sila Chunwijitra et al. [9] propose a cloud-based framework for speech recognition in the Thai language. They deploy
the Docker (lightweight Linux container) platform to migrate a baseline Distributed Speech Recognition (DSR) system.
The main idea is to improve real-time response using cloud computing. Furthermore, the workflow is modified to run
multiple Speech Recognition (SR) engines in parallel with the help of utterance decoding. The Word Error Rate (WER) is
then computed, and the results appear scalable and reliable, with no significant difference between the proposed and
baseline approaches. Hence, overall performance is boosted by the benefits of cloud computing and by improved response
time in terms of the real-time factor (RTF).

Resource sharing is a benefit of using cloud-based web services. Sila Chunwijitra et al. [10] focus on distributing and
sharing resources for Automatic Speech Recognition (ASR) applications. For transcription, ASR needs more resources
because many utterances must be handled in real time. The key solution is to scale ASR by multithreaded processing, by
multiplexing and demultiplexing over a network socket, by distributing ASR in real-time streaming, or by distributing
engines (load balancing). Compared with the baseline system architecture, the improved framework reduces RTF by 15% and
uses fewer resources such as working memory.

"Google Cloud Speech API" is a Google service for speech-to-text and text-to-speech conversion, whose speech
recognition accuracy is high owing to its deep learning neural network algorithms. The algorithms do not require
high-performance local processors because everything is processed in the cloud. Gustavo Boza-Quispe et al. [11] propose
a user-friendly speech interface to access semantic tourist information based on Google Cloud Platform. The flow has
stages such as a Text-to-Speech (TTS) and Speech-to-Text (STT) converter, a web interface, a SPARQL generator and a
semantic representation.

With the increased adoption of smartphones and other consumer devices, speech has become one of the modes of
interaction. Yanzhang He et al. [12] look beyond the separate acoustic (AM), pronunciation (PM) and language (LM) models
of earlier large vocabulary continuous speech recognition (LVCSR) systems, while satisfying computational and memory
constraints. Their model achieves a 20% improvement in WER over a

the conversation with patient ID and then starts speaking as he or she normally does over a regular phone call. The
FreeSWITCH IVR system records the doctor's speech, which is essentially the document that would traditionally be typed.
The recorded audio is then processed by the server application, which communicates over the SIP and MRCP signalling
protocols. The voice is transmitted between FreeSWITCH and the MRCP server via RTP packets. The MRCP server uses either
the Google Speech Recognition (GSR) cloud-based API or the PocketSphinx module to transcribe the doctor's speech to
text. The transcribed text returned by GSR or PocketSphinx is stored in the respective patient record database for
presentation as part of the patient's Electronic Medical Record flows during clinical visits or reviews conducted by
the doctor or surgeons in subsequent follow-ups. The transcribed documentation can be made available and viewable as
plain text at any time once the dictated document has been transcribed, on a near real-time basis. The proposed system
is shown in Fig-1. Steps involved in the back-end:

1. Call IVR-Internal extension (e.g. 1000)
2. Announce patient ID
3. Start audio recording or voice mail option
4. Store audio file (as patient_id.wav)
5. Notify/send command to UniMRCP server
6. Initiate speech-to-text / enable speech-to-text API on Google Cloud Platform
7. Store text file

[Fig-1: proposed system. Recoverable labels: Open Source Scalable PBX, UNI-MRCP, MRCP 2.0, Speech Recognition, Google Cloud.]
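
The storage side of the back-end steps listed above (steps 4 to 7) can be sketched as follows. This is only an illustration under stated assumptions: it does not call FreeSWITCH, UniMRCP or Google Cloud. The function names `store_dictation` and `fake_recognizer`, the `.txt` file name and the alphanumeric patient ID rule are hypothetical. The recogniser is passed in as a plain function so that a real speech recognition client could be substituted for the placeholder.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical type: any function that turns recorded audio bytes into text.
# In the described system this role is played by the UniMRCP server backed by
# Google Speech Recognition or PocketSphinx; here it is only a placeholder.
Recognizer = Callable[[bytes], str]


def store_dictation(patient_id: str, audio: bytes, recognize: Recognizer,
                    records_dir: Path) -> Path:
    """Steps 4-7: save the audio, request a transcript, save the text."""
    if not patient_id.isalnum():
        # Assumption: IDs are alphanumeric, which also keeps file names safe.
        raise ValueError("patient_id must be alphanumeric")
    records_dir.mkdir(parents=True, exist_ok=True)

    audio_path = records_dir / f"{patient_id}.wav"    # step 4
    audio_path.write_bytes(audio)

    transcript = recognize(audio_path.read_bytes())   # steps 5-6 (stand-in)

    text_path = records_dir / f"{patient_id}.txt"     # step 7 (assumed name)
    text_path.write_text(transcript, encoding="utf-8")
    return text_path


def fake_recognizer(audio: bytes) -> str:
    """Placeholder recogniser used only to exercise the flow."""
    return f"[transcript of {len(audio)} bytes of audio]"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The bytes below are dummy data, not a valid WAV recording.
        out = store_dictation("P1000", b"\x00" * 320, fake_recognizer, Path(tmp))
        print(out.name, "->", out.read_text(encoding="utf-8"))
```

Keeping the audio and the transcript under the same patient ID makes it simple to attach both to the patient record later. A production system would store them in the patient record database rather than in flat files.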
