Automation of Answer Script Evaluation
Abstract:- The goal of this study, "Automation of Answer Scripts Evaluation," is to create an end-to-end automated process that can quickly and fairly evaluate answer scripts and grade students. Optical Character Recognition (OCR), Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP) are brought together to build a workflow for automating this tedious, time-consuming, subjective activity. The paper discusses the failures and successes of the various models applied in our endeavour.
Keywords:- OCR Model, BERT Model, NLP, GPT Model, Optimization, Cosine Similarity, Vectorization, Rubric Model, Evaluating Model, Datasets, Ensemble, Majority Voting, Gradient Descent.
I. INTRODUCTION
In the world of education, the persistent struggle to understand messy handwriting and meet strict grading deadlines remains a challenge. Traditional grading systems currently face problems such as subjective grading, handling varied student answers, and coping with large volumes of evaluations. These issues, involving personal opinions and the difficulty of handling large numbers of papers, highlight the need for new ways to make grading simpler and better. Essentially, we need to find creative solutions using AI, NLP, and ML.
What are we Expecting the Proposed Process to Accomplish? The Process Needs to:
Reduce the time and effort required for grading.
Apply predefined criteria consistently, minimizing the potential for subjective grading biases.
Handle a large volume of answer scripts for specific subjects.
Utilize predefined algorithms and models to evaluate answers objectively.
Issues Identified in Our Exploration:
The OCR model struggles with accurately deciphering unclear or unconventional handwriting.
Struggles with subjective questions that require nuanced understanding and context.
Handling the diverse ways in which students express their answers.
Evaluating open-ended answers, those requiring creative thinking.
Models like Similarity, BERT, and GPT face limitations due to a lack of real-world datasets.
Accommodating various subjects and exams without compromising accuracy.
The first hurdle is the ability to translate handwritten answers into computer-analysable text. OCR models are explored for this purpose. Current OCR models [Ref] have given limited success in our experiments. Once manageable text is available, interpreting and evaluating the answers has proven even more difficult. Similarity, BERT, and GPT models have been tried to assess the feasibility of automated evaluation.
II. LITERATURE REVIEW
Previous models for automating the evaluation of answer scripts, such as those by Ravikumar et al., del Gobbo et al., and Rahman and Siddiqui, have shown notable limitations in terms of time complexity, correctness in assessing student responses, and handling the variability of subjective content in educational contexts. While efforts have been made to create frameworks and apply machine learning and NLP-based techniques, these models struggled to replicate the nuanced judgment of human graders, particularly in processing diverse handwriting styles and contextual understanding. Consequently, this paper aims to address these shortcomings by developing a new model from scratch that focuses on enhancing the speed, accuracy, and adaptability of automated grading systems, building upon the insights and limitations highlighted in prior research.
IJISRT24OCT205 www.ijisrt.com 27
Volume 9, Issue 10, October – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24OCT205
III. METHODOLOGY
Challenges Encountered During Real-Time Exam Answer Sheet to CSV File Conversion:
Starting with the conversion of handwriting into text that can be analysed by computers, the following OCR models have been tried.
OCR Model 1
The first model was built following the instructions given by PythonLessons on YouTube [https://www.youtube.com/watch?v=WhRC31SlXzA&t=5s]. It was trained on the IAM dataset (sample images in fig- in section 3), which consists of words, sentences, and forms. The model used TensorFlow, with layers consisting of CNN and LSTM and the CTC loss function. Despite showing high accuracy for inputs from the dataset it was trained on, the model gives output that does not match the expected output for inputs outside of that dataset. So, the model was discontinued.
OCR Model 2
This model, built on an idea taken from GitHub, was trained on the EMNIST dataset, which consists of alphabets, using a sequential model with layers such as Conv2D, MaxPooling2D, Dense, Flatten, and Dropout, and metrics such as the confusion matrix and classification report. Optimizers such as Adam and SGD were also used. This model, too, gave output that did not match the expected output for inputs outside of the dataset it was trained on. It was suspended due to its significant demand for resources, and we started experimenting with free and trial versions of paid OCR models.
Open-Source OCR Models
We experimented with readily available OCR models such as the Tesseract open-source OCR model [https://tesseract-ocr.github.io/tessdoc/Installation.html] and Google Lens [https://lens.google/]. However, these models were limited to recognizing computer-generated text and struggled with handwritten text.
Models Offered by Entrepreneurial Companies
For a more advanced approach, we explored paid OCR models such as Google Cloud Vision, Nanonets, and Vision Studio (Azure). While these models performed better, especially with handwritten text, their consistency varied across different styles of handwriting.
Using the output of Azure's Vision Studio OCR model, we attempted interpretation and analysis with Similarity models, BERT models, and GPT models. A discussion of these models follows below.
Since this output is accurate, we decided to use the outputs of this OCR for developing the NLP models.
B. Similarity Model
document1 = "Technology is rapidly advancing, and this progress is undoubtedly affecting traditional values. I believe that in today's technological era, traditional values are likely to fade away. To begin with, there are several reasons why these age-old values are diminishing in the modern world. Firstly, in our fast-paced society, mobile phones have become everyone's companion for staying connected with family and friends. In contrast, in the past, people used to send letters and wait in long lines for a single telephone call using STD and ISD services. The evolution of communication methods highlights that traditional practices hold little value today. Secondly, technology has revolutionized the realm of fashion. People used to engage in manual activities like knitting, stitching, and designing, but now machines have simplified every task. As an illustration, fashion design students now swiftly assess color compatibility using advanced software instead of manually portraying themselves as models. Furthermore, the advent of refrigerators has drastically reduced the need for traditional water pitchers. In conclusion, as modern technology continues to evolve, traditional methods that are time-consuming will struggle to keep up with the latest trends. Thus, preserving these methods becomes futile and a waste of time."
query_doc = "Technology is flourishing by leaps and bounds and undoubtedly this advancement is taking a toll on the traditional values. I do believe that in this technological world traditional values are bound to disappear. To initiate with, there are many reasons why these conventional values have no existence in this modern world. First, in this fast-paced world, everyone is assisted with the mobile phone to stay connected with their family and friends. However, in olden times people used to send letters and stand in the long queues on S.T.D and I.S.D just for a maturity of one call. This advancement in the modes of communication has proved that traditional skills are worth for nothing. Secondly, technology has transformed the world of fashion. Earlier people used to do knitting, stitching, and designing manually but now machines have made every task easier and comfortable. To substantiate my view, many of fashion designing students
TF-IDF: Highlights term importance.
Cosine Similarity with TF-IDF: Considers weighted term importance.
Hence, we provided multiple best answers in order to find the best matching model answer among them. The best matching model answer is then used to improve accuracy further.
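The TF-IDF and cosine-similarity steps above can be sketched in a few lines of Python. This is a minimal illustration with a smoothed idf and toy sentences; it is not the exact weighting scheme or data used in our experiments:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build simple TF-IDF vectors over a shared vocabulary (smoothed idf)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    idf = {w: 1 + math.log((1 + n) / (1 + sum(w in t for t in tokenized)))
           for w in vocab}
    return [[Counter(t)[w] / len(t) * idf[w] for w in vocab] for t in tokenized]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical model answer and two student answers, for illustration.
model_answer = "technology is rapidly advancing and affecting traditional values"
student_a = "technology is advancing and traditional values are affected"
student_b = "the recipe needs flour sugar and butter"

vecs = tfidf_matrix([model_answer, student_a, student_b])
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

A close paraphrase of the model answer scores higher than an unrelated answer, which is the property the grading step relies on.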
Preprocessing: Converts text to lowercase, tokenizes, removes stop words, and stems.
Word Embedding Generation: Uses Word2Vec for semantic understanding.
Document Vectorization: Computes average vector for document representation.
Similarity Computation: Calculates cosine similarity between document vectors.
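The Word2Vec averaging pipeline described above can be sketched as follows. A small hand-made embedding table stands in for a real trained Word2Vec model, so the vectors and their dimensionality are purely illustrative:

```python
import math

# Toy 3-d embeddings standing in for a real Word2Vec model
# (a real model would give 100-300 dimensional vectors).
EMB = {
    "technology": [0.9, 0.1, 0.0], "tech": [0.85, 0.15, 0.05],
    "advancing": [0.2, 0.8, 0.1], "progressing": [0.25, 0.75, 0.1],
    "cake": [0.0, 0.1, 0.9],
}

def doc_vector(text):
    """Average the embeddings of known words (zero vector if none known)."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# A paraphrase scores higher than an unrelated phrase.
sim_close = cosine(doc_vector("technology advancing"), doc_vector("tech progressing"))
sim_far = cosine(doc_vector("technology advancing"), doc_vector("cake"))
print(sim_close > sim_far)  # True
```

Unlike TF-IDF, this captures that "tech" and "technology" are related even though the surface strings differ, which is exactly the out-of-vocabulary weakness of pure corpus-based similarity.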
Analysis:
After extensive exploration, we concluded that none of the similarity models were suitable for evaluating answer scripts. Despite our efforts in incorporating various NLP methods such as stemming, lemmatization, Word2Vec, N-Gram, different types of vectorization, and different ways of checking similarity, the results were not feasible for our intended application. For example, a student writing "Hi" instead of "Hello" means the same thing, but if "Hello" is not in our corpus document, the similarity score is significantly affected, as can be seen in Model 1.8. The complexity of evaluating subjective content in answer scripts posed challenges beyond the capabilities of the similarity models we researched.
Our exploration led us to BERT models; a discussion follows here.
BERT Model-2.2: Text Processing and Feature Engineering
For the Text Processing and Feature Engineering phase, we fine-tuned the way our model reads and understands the text. It is like sharpening the tools before starting the work: cleaning the data and picking out the key parts of the text that would help the model learn better and make more accurate predictions.
BERT Model-2.3: Data Augmentation and Ensemble
Facing limitations with the amount and variety of text our model had seen, we used data augmentation, specifically back translation (translating text to another language and back to the original). This expanded our model's exposure, making it akin to reading more books to learn more about the world, thus improving its ability to understand and process text.
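Back translation can be sketched as a round trip through a pivot language: translate out, translate back, and keep the (usually slightly reworded) result as a new training sample. The translate() function below is a stub lookup standing in for a real machine-translation model or API:

```python
# Back-translation augmentation sketch. translate() is a hypothetical stub;
# in practice it would call a translation model or service.
def translate(text, src, dst):
    table = {  # tiny illustrative "translation memory"
        ("hello world", "en", "fr"): "bonjour le monde",
        ("bonjour le monde", "fr", "en"): "hello, world",
    }
    return table.get((text, src, dst), text)

def back_translate(text, pivot="fr"):
    """Translate to a pivot language and back to obtain a paraphrase."""
    return translate(translate(text, "en", pivot), pivot, "en")

print(back_translate("hello world"))  # -> hello, world
```

The round trip yields a paraphrase of the original sentence, so each labelled answer can contribute several slightly different training examples.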
BERT Model-2.4: BERT Model Optimization
During the BERT Model Optimization phase, we adjusted the model's settings, such as tuning hyperparameters for better performance, so that it could learn faster and more effectively from the OCR tasks we gave it. We wanted to ensure the model was running at its best to handle the complex task of reading handwriting from our images.
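Hyperparameter tuning of the kind described can be sketched as a simple grid search over candidate settings. The score() function below is a hypothetical validation-accuracy surface, not a real training run, and the grid values are illustrative:

```python
import itertools

def score(lr, batch_size):
    # Hypothetical validation accuracy, peaking at lr=3e-5, batch_size=16.
    return 1.0 - abs(lr - 3e-5) * 1e4 - abs(batch_size - 16) * 0.01

# Candidate hyperparameters (illustrative values).
grid = {"lr": [1e-5, 3e-5, 5e-5], "batch_size": [8, 16, 32]}

# Evaluate every combination and keep the best-scoring one.
best = max(itertools.product(grid["lr"], grid["batch_size"]),
           key=lambda p: score(*p))
print(best)  # -> (3e-05, 16)
```

In a real run, score() would fine-tune the model with those settings and return held-out accuracy; the selection logic stays the same.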
BERT Model-2.5: Finalizing BERT Model Features
After trying out various adjustments, we picked the best features that helped our BERT model recognize text most accurately, like choosing the best ingredients for a recipe. We combined text processing, feature engineering, and methods to handle large amounts of data at once, and used the ensemble approach, where all the 'expert' models we created worked together to give their best prediction.
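The ensemble step, where the 'expert' models work together, can be sketched as plain majority voting over per-model predictions. The labels and votes shown are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label most models agreed on."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three expert models for one answer.
votes = ["correct", "correct", "partial"]
print(majority_vote(votes))  # -> correct
```

With an odd number of voters, ties on two-way decisions cannot occur, which is one reason ensembles often use three or five members.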
BERT Model-2.6: Testing with Custom Dataset
Finally, we put our BERT model to the test with a custom dataset: essentially a tailored exam including content that we expected the model to recognize. This was the ultimate test to see if all our fine-tuning paid off, and whether our model could indeed understand and process the variety of handwriting styles and texts that students might use in their answers.
GPT Model-3.1: Tailored the GPT model specifically for single-answer scripts, aiming to enhance its performance in handling this type of data. Strength: enhanced performance in handling single-answer scripts. Limitation: limited applicability to other types of data.
GPT Model-3.2: Implemented the GPT model using a custom dataset to improve its adaptation to specific research needs. Strength: improved adaptation to specific research needs. Limitation: dependency on the availability and quality of the custom dataset.
Despite our efforts, the lack of a sufficiently diverse real-world dataset for training hindered the success of the BERT model. Obtaining more varied and representative data could be crucial for future improvements in model performance.
Further on, GPT models have been explored and the outcomes are discussed below.
GPT Model 3.1
Purpose and Application: Tailored specifically for single-answer scripts, this model can understand and generate human-like text, making it suitable for complex language tasks.
Special Features: The model is built on a transformer architecture that prioritizes context and coherence, which allows it to perform a wide range of text-based tasks effectively.
GPT Model 3.2
Customization: Implemented using a custom dataset.
Purpose and Application: This iteration of the GPT model was customized with a particular dataset tailored to the specific needs of our research. Using a custom dataset allows the model to better understand and generate text that is more aligned with the thematic elements of the research.
Special Features: The ability to train on specific data enhances the model's relevance to the user's needs, potentially improving both the quality and applicability of its outputs in tailored scenarios.
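Preparing a custom dataset for such fine-tuning can be sketched as building prompt/completion records from already-graded answers. The JSONL layout and the sample triples below are illustrative assumptions; the exact record format depends on the fine-tuning API being used:

```python
import json

# Hypothetical (question, student answer, awarded marks) triples taken
# from graded scripts; not real exam data.
graded = [
    ("Define OCR.", "OCR converts images of text into machine-readable text.", 5),
    ("Define OCR.", "It is a type of camera.", 1),
]

# One JSON record per line (JSONL), pairing a grading prompt with the marks.
lines = [json.dumps({"prompt": f"Q: {q}\nA: {a}\nMarks:",
                     "completion": f" {m}"})
         for q, a, m in graded]

print(len(lines))  # -> 2
```

Each record teaches the model to emit a mark given a question and a student answer, mirroring how a teacher's graded script pairs an answer with its score.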
Annotation Process: Each word in the handwritten scripts was manually annotated to match its corresponding text in typed form. This step is crucial as it serves as the ground truth for training the OCR model.
Annotation Quality Control: Specific attention was paid to ensure high-quality annotations. Any samples with illegible handwriting or unclear annotations were omitted to maintain the integrity of the training data.
D. Challenges with Font and Annotation
Font Issues: Initially, some of the handwritten samples used fonts or styles that were not conducive to accurate OCR recognition (e.g., cursive or highly stylized handwriting). This led to complications in training the OCR model effectively.
Annotation Standards: It was observed that inconsistent annotations could potentially skew the model's learning process. To counter this, we set strict guidelines for how annotations should be formatted, focusing on clarity and uniformity in the text.
For the BERT and GPT models, we took our college mid-term papers as the input dataset, so that the models can train from the student's point of view and also learn how teachers correct answers and on what criteria marks are allocated. The aim is for the models to be trained to replicate the teacher's corrections when a student answer script is fed in as input for grading.
F. BERT Model and GPT Model Datasets Gathering Process
For the BERT and GPT models, we chose our college's mid-term exam papers as the training material. Think of these models like students learning to grade papers just like teachers do. The mid-term papers are full of varied answers from students; these are the 'lessons' for our models. They study how teachers check these answers and what reasons they give for the marks they award. This way, our models learn to grade by understanding the 'teacher's way' of scoring. The goal is for them to get so good that they can look at new answers they have never seen before and grade them just like a teacher would, using the same logic and attention to detail that a real teacher applies when marking a student's work.
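The annotation and quality-control steps described above can be sketched as writing (image, transcription) pairs to a CSV file while skipping samples flagged as illegible. The filenames and rows below are hypothetical:

```python
import csv
import io

# Hypothetical (image filename, transcription, legible?) annotation rows.
annotations = [
    ("w001.png", "technology", True),
    ("w002.png", "???", False),   # illegible sample -> omitted (QC step)
    ("w003.png", "values", True),
]

# Write only the legible samples as ground truth for OCR training.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["image", "text"])
for name, text, legible in annotations:
    if legible:
        writer.writerow([name, text])

print(buf.getvalue().count("\n"))  # header + 2 rows -> 3
```

Filtering at annotation time keeps bad ground truth out of training entirely, rather than hoping the model learns around it.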
Handwritten Text Image (Words):
V. CONCLUSION
The comparison of OCR, Similarity, BERT, and GPT models as depicted in the provided summary offers an insightful overview of the strengths, limitations, and applicability of these diverse approaches to text analysis and generation.
A. Data Considerations
Our research indicated that the quality of the dataset is paramount across all models. For OCR, the granularity of data at the character level is critical, while for BERT and GPT, context and coherence of text play a significant role. The effectiveness of the Similarity Models hinges on the richness and subjectivity of the text data, underscoring the need for a diverse set of benchmarks.
B. OCR Models
The OCR models were primarily evaluated for their ability to recognize characters within images. Here, the success was largely dependent on the clarity and consistency of handwriting in the datasets provided. Limitations arose when models faced cursive or highly stylized handwriting, leading to inconsistencies in recognition. Despite trials with various models, including homemade, free, and paid services, the challenge of achieving consistent accuracy with diverse handwriting remained.
E. GPT Models
The GPT models showed potential in generating human-
like text responses, with a primary focus on customizing for
single-answer scripts. The models were adaptable to specific
research needs, achieving a 60% accuracy rate. Although this
was a significant milestone, there was a consensus that with
unique and larger datasets, further improvement in
performance could be achieved.
F. Overall Conclusion
The comparative analysis of these models in our project
underscores a recurrent theme: the success of machine
learning models is intricately tied to the data they are trained
on. Real-world educational data presents unique challenges
due to its variability and complexity. While OCR models
require clear and consistent data, Similarity Models need
subjective understanding, and BERT and GPT models
necessitate large datasets with varied contextual information
to train effectively.