Automation of Answer Script Evaluation
Abstract:- The goal of this study, "Automation of Answer Scripts Evaluation," is to create an end-to-end automated process that can quickly and fairly evaluate answer scripts and grade students. Optical Character Recognition (OCR), Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP) are brought together to build a workflow for automating this tedious, time-consuming, subjective activity. The paper discusses the failures and successes of the various models applied in our endeavour.
Keywords:- OCR Model, BERT Model, NLP, GPT Model, Optimization, Cosine Similarity, Vectorization, Rubric Model, Evaluating Model, Datasets, Ensemble, Majority Voting, Gradient Descent.
I. INTRODUCTION
In the world of education, the persistent struggle to understand messy handwriting and meet strict grading deadlines remains a challenge. Traditional grading systems currently face problems such as subjective grading, handling varied student answers, and coping with large volumes of evaluations. These issues, involving personal opinions and the difficulty of handling large numbers of papers, highlight the need for new ways to make grading simpler and better. Essentially, we need to find creative solutions using AI, NLP, and ML.
What are we Expecting the Proposed Process to Accomplish? The Process Needs to:
Reduce the time and effort required for grading.
Apply predefined criteria consistently, minimizing the potential for subjective grading biases.
Handle a large volume of answer scripts for specific subjects.
Utilize predefined algorithms and models to evaluate answers objectively.
Issues Identified in Our Exploration:
The OCR model struggles with accurately deciphering unclear or unconventional handwriting.
Struggles with subjective questions that require nuanced understanding and context.
Handling the diverse ways in which students express their answers.
Evaluating open-ended answers, those requiring creative thinking.
Models like Similarity, BERT, and GPT face limitations due to a lack of real-world datasets.
Accommodating various subjects and exams without compromising accuracy.
The first hurdle is the ability to translate handwritten answers into computer-analysable text. OCR models are explored for this purpose. Current OCR models [Ref] have given limited success in our experiments. Once manageable text is available, interpreting and evaluating the answers has proven even more difficult. Similarity, BERT, and GPT models have been tried to assess the feasibility of automated evaluation.
II. LITERATURE REVIEW
Previous models for automating the evaluation of answer scripts, such as those by Ravikumar et al., del Gobbo et al., and Rahman and Siddiqui, have shown notable limitations in terms of time complexity, correctness in assessing student responses, and handling the variability of subjective content in educational contexts. While efforts have been made to create frameworks and apply machine learning and NLP-based techniques, these models struggled to replicate the nuanced judgment of human graders, particularly in processing diverse handwriting styles and contextual understanding. Consequently, this paper aims to address these shortcomings by developing a new model from scratch that focuses on enhancing the speed, accuracy, and adaptability of automated grading systems, building upon the insights and limitations highlighted in prior research.
IJISRT24OCT205 www.ijisrt.com 27
Volume 9, Issue 10, October – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24OCT205
III. METHODOLOGY
Challenges Encountered During Real-Time Exam Answer Sheet to CSV File Conversion:
Starting with the conversion of handwriting into text that can be analysed by computers, the following OCR models have been tried.
OCR Model 1
The first model was built following the instructions given by PythonLessons on YouTube [https://www.youtube.com/watch?v=WhRC31SlXzA&t=5s]. It was trained on the IAM dataset (sample images in fig- in section 3), which consists of words, sentences, and forms. The model used TensorFlow, with layers consisting of CNN and LSTM and the CTC loss function. Despite showing high accuracy for inputs from the dataset it was trained on, the model gives output that does not match the expected output for inputs outside of that dataset. So, the model was discontinued.
OCR Model 2
This model, built on an idea taken from GitHub, was trained on the EMNIST dataset, which consists of alphabets, using a sequential model with layers such as Conv2D, MaxPooling2D, Dense, Flatten, and Dropout, and metrics such as the confusion matrix and classification report. Optimizers such as Adam and SGD were also used. This model, too, gave output that did not match the expected output for inputs outside of the dataset it was trained on. It was suspended due to its significant demand for resources, and we started experimenting with free and trial versions of paid OCR models.
Open-Source OCR Models
We experimented with readily available OCR models such as the Tesseract open-source OCR model [https://tesseract-ocr.github.io/tessdoc/Installation.html] and Google Lens [https://lens.google/]. However, these models were limited to recognizing computer-generated text and struggled with handwritten text.
Models Offered by Entrepreneurial Companies
For a more advanced approach, we explored paid OCR models such as Google Cloud Vision, Nanonets, and Vision Studio (Azure). While these models performed better, especially with handwritten text, their consistency varied across different styles of handwriting.
Using the output of Azure's Vision Studio OCR model, we attempted interpretation and analysis with Similarity models, BERT models, and GPT models. A discussion of these models follows below.
Since this output is accurate, we decided to use the outputs of this OCR for developing the NLP models.
B. Similarity Model
document1 = "Technology is rapidly advancing, and this progress is undoubtedly affecting traditional values. I believe that in today's technological era, traditional values are likely to fade away. To begin with, there are several reasons why these age-old values are diminishing in the modern world. Firstly, in our fast-paced society, mobile phones have become everyone's companion for staying connected with family and friends. In contrast, in the past, people used to send letters and wait in long lines for a single telephone call using STD and ISD services. The evolution of communication methods highlights that traditional practices hold little value today. Secondly, technology has revolutionized the realm of fashion. People used to engage in manual activities like knitting, stitching, and designing, but now machines have simplified every task. As an illustration, fashion design students now swiftly assess color compatibility using advanced software instead of manually portraying themselves as models. Furthermore, the advent of refrigerators has drastically reduced the need for traditional water pitchers. In conclusion, as modern technology continues to evolve, traditional methods that are time-consuming will struggle to keep up with the latest trends. Thus, preserving these methods becomes futile and a waste of time."
query_doc = "Technology is flourishing by leaps and bounds and undoubtedly this advancement is taking a toll on the traditional values. I do believe that in this technological world traditional values are bound to disappear. To initiate with, there are many reasons why these conventional values have no existence in this modern world. First, in this fast-paced world, everyone is assisted with the mobile phone to stay connected with their family and friends. However, in olden times people used to send letters and stand in the long queues on S.T.D and I.S.D just for a maturity of one call. This advancement in the modes of communication has proved that traditional skills are worth for nothing. Secondly, technology has transformed the world of fashion. Earlier people used to do knitting, stitching, and designing manually but now machines have made every task easier and comfortable. To substantiate my view, many of fashion designing students
TF-IDF: Highlights term importance.
Cosine Similarity with TF-IDF: Considers weighted term importance.
Hence, we provided multiple best answers in order to find the best matching model answer among them. The best matching model answer is then used to improve accuracy further.
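The TF-IDF and cosine-similarity steps above can be sketched in a few lines of Python. This is a minimal illustration with a smoothed idf and toy sentences; it is not the exact weighting scheme or data used in our experiments:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build simple TF-IDF vectors over a shared vocabulary (smoothed idf)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    idf = {w: 1 + math.log((1 + n) / (1 + sum(w in t for t in tokenized)))
           for w in vocab}
    return [[Counter(t)[w] / len(t) * idf[w] for w in vocab] for t in tokenized]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical model answer and two student answers, for illustration.
model_answer = "technology is rapidly advancing and affecting traditional values"
student_a = "technology is advancing and traditional values are affected"
student_b = "the recipe needs flour sugar and butter"

vecs = tfidf_matrix([model_answer, student_a, student_b])
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

A close paraphrase of the model answer scores higher than an unrelated answer, which is the property the grading step relies on.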
Preprocessing: Converts text to lowercase, tokenizes, removes stop words, and stems.
Word Embedding Generation: Uses Word2Vec for semantic understanding.
Document Vectorization: Computes average vector for document representation.
Similarity Computation: Calculates cosine similarity between document vectors.
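The Word2Vec averaging pipeline described above can be sketched as follows. A small hand-made embedding table stands in for a real trained Word2Vec model, so the vectors and their dimensionality are purely illustrative:

```python
import math

# Toy 3-d embeddings standing in for a real Word2Vec model
# (a real model would give 100-300 dimensional vectors).
EMB = {
    "technology": [0.9, 0.1, 0.0], "tech": [0.85, 0.15, 0.05],
    "advancing": [0.2, 0.8, 0.1], "progressing": [0.25, 0.75, 0.1],
    "cake": [0.0, 0.1, 0.9],
}

def doc_vector(text):
    """Average the embeddings of known words (zero vector if none known)."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# A paraphrase scores higher than an unrelated phrase.
sim_close = cosine(doc_vector("technology advancing"), doc_vector("tech progressing"))
sim_far = cosine(doc_vector("technology advancing"), doc_vector("cake"))
print(sim_close > sim_far)  # True
```

Unlike TF-IDF, this captures that "tech" and "technology" are related even though the surface strings differ, which is exactly the out-of-vocabulary weakness of pure corpus-based similarity.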
Analysis:
After extensive exploration, we concluded that none of the similarity models were suitable for evaluating answer scripts. Despite our efforts in incorporating various NLP methods such as stemming, lemmatization, Word2Vec, N-Gram, different types of vectorization, and different ways of checking similarity, the results were not feasible for our intended application. For example, a student writing "Hi" instead of "Hello" means the same thing, but if "Hello" is not in our corpus document, the similarity score is significantly affected, as can be seen in Model 1.8. The complexity of evaluating subjective content in answer scripts posed challenges beyond the capabilities of the similarity models we researched.
Our exploration led us to BERT models; a discussion follows here.
BERT Model-2.2: Text Processing and Feature Engineering
For the Text Processing and Feature Engineering phase, we fine-tuned the way our model reads and understands the text. It is like sharpening the tools before starting the work: cleaning the data and picking out the key parts of the text that would help the model learn better and make more accurate predictions.
BERT Model-2.3: Data Augmentation and Ensemble
Facing limitations with the amount and variety of text our model had seen, we used data augmentation, specifically back translation (translating text to another language and back to the original). This expanded our model's exposure, making it akin to reading more books to learn more about the world, thus improving its ability to understand and process text.
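Back translation can be sketched as a round trip through a pivot language: translate out, translate back, and keep the (usually slightly reworded) result as a new training sample. The translate() function below is a stub lookup standing in for a real machine-translation model or API:

```python
# Back-translation augmentation sketch. translate() is a hypothetical stub;
# in practice it would call a translation model or service.
def translate(text, src, dst):
    table = {  # tiny illustrative "translation memory"
        ("hello world", "en", "fr"): "bonjour le monde",
        ("bonjour le monde", "fr", "en"): "hello, world",
    }
    return table.get((text, src, dst), text)

def back_translate(text, pivot="fr"):
    """Translate to a pivot language and back to obtain a paraphrase."""
    return translate(translate(text, "en", pivot), pivot, "en")

print(back_translate("hello world"))  # -> hello, world
```

The round trip yields a paraphrase of the original sentence, so each labelled answer can contribute several slightly different training examples.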
BERT Model-2.4: BERT Model Optimization
During the BERT Model Optimization phase, we adjusted the model's settings, such as tuning hyperparameters for better performance, so that it could learn faster and more effectively from the OCR tasks we gave it. We wanted to ensure the model was running at its best to handle the complex task of reading handwriting from our images.
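Hyperparameter tuning of the kind described can be sketched as a simple grid search over candidate settings. The score() function below is a hypothetical validation-accuracy surface, not a real training run, and the grid values are illustrative:

```python
import itertools

def score(lr, batch_size):
    # Hypothetical validation accuracy, peaking at lr=3e-5, batch_size=16.
    return 1.0 - abs(lr - 3e-5) * 1e4 - abs(batch_size - 16) * 0.01

# Candidate hyperparameters (illustrative values).
grid = {"lr": [1e-5, 3e-5, 5e-5], "batch_size": [8, 16, 32]}

# Evaluate every combination and keep the best-scoring one.
best = max(itertools.product(grid["lr"], grid["batch_size"]),
           key=lambda p: score(*p))
print(best)  # -> (3e-05, 16)
```

In a real run, score() would fine-tune the model with those settings and return held-out accuracy; the selection logic stays the same.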
BERT Model-2.5: Finalizing BERT Model Features
After trying out various adjustments, we picked the best features that helped our BERT model recognize text most accurately, like choosing the best ingredients for a recipe. We combined text processing, feature engineering, and methods to handle large amounts of data at once, and used the ensemble approach, where all the 'expert' models we created worked together to give their best prediction.
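The ensemble step, where the 'expert' models work together, can be sketched as plain majority voting over per-model predictions. The labels and votes shown are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label most models agreed on."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three expert models for one answer.
votes = ["correct", "correct", "partial"]
print(majority_vote(votes))  # -> correct
```

With an odd number of voters, ties on two-way decisions cannot occur, which is one reason ensembles often use three or five members.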
BERT Model-2.6: Testing with Custom Dataset
Finally, we put our BERT model to the test with a custom dataset: essentially a tailored exam including content that we expected the model to recognize. This was the ultimate test to see if all our fine-tuning paid off, and whether our model could indeed understand and process the variety of handwriting styles and texts that students might use in their answers.
GPT Model-3.1: Tailored the GPT model specifically for single-answer scripts, aiming to enhance its performance in handling this type of data. Strength: enhanced performance in handling single-answer scripts. Limitation: limited applicability to other types of data.
GPT Model-3.2: Implemented the GPT model using a custom dataset to improve its adaptation to specific research needs. Strength: improved adaptation to specific research needs. Limitation: dependency on the availability and quality of the custom dataset.
Despite our efforts, the lack of a sufficiently diverse real-world dataset for training hindered the success of the BERT model. Obtaining more varied and representative data could be crucial for future improvements in model performance.
Further on, GPT models have been explored and the outcomes are discussed below.
GPT Model 3.1
Purpose and Application: Tailored specifically for single-answer scripts, this model can understand and generate human-like text, making it suitable for complex language tasks.
Special Features: The model is built on a transformer architecture that prioritizes context and coherence, which allows it to perform a wide range of text-based tasks effectively.
GPT Model 3.2
Customization: Implemented using a custom dataset.
Purpose and Application: This iteration of the GPT model was customized with a particular dataset tailored to the specific needs of our research. Using a custom dataset allows the model to better understand and generate text that is more aligned with the thematic elements of the research.
Special Features: The ability to train on specific data enhances the model's relevance to the user's needs, potentially improving both the quality and applicability of its outputs in tailored scenarios.
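Preparing a custom dataset for such fine-tuning can be sketched as building prompt/completion records from already-graded answers. The JSONL layout and the sample triples below are illustrative assumptions; the exact record format depends on the fine-tuning API being used:

```python
import json

# Hypothetical (question, student answer, awarded marks) triples taken
# from graded scripts; not real exam data.
graded = [
    ("Define OCR.", "OCR converts images of text into machine-readable text.", 5),
    ("Define OCR.", "It is a type of camera.", 1),
]

# One JSON record per line (JSONL), pairing a grading prompt with the marks.
lines = [json.dumps({"prompt": f"Q: {q}\nA: {a}\nMarks:",
                     "completion": f" {m}"})
         for q, a, m in graded]

print(len(lines))  # -> 2
```

Each record teaches the model to emit a mark given a question and a student answer, mirroring how a teacher's graded script pairs an answer with its score.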
Annotation Process: Each word in the handwritten scripts was manually annotated to match its corresponding text in typed form. This step is crucial as it serves as the ground truth for training the OCR model.
Annotation Quality Control: Specific attention was paid to ensure high-quality annotations. Any samples with illegible handwriting or unclear annotations were omitted to maintain the integrity of the training data.
D. Challenges with Font and Annotation
Font Issues: Initially, some of the handwritten samples used fonts or styles that were not conducive to accurate OCR recognition (e.g., cursive or highly stylized handwriting). This led to complications in training the OCR model effectively.
Annotation Standards: It was observed that inconsistent annotations could potentially skew the model's learning process. To counter this, we set strict guidelines for how annotations should be formatted, focusing on clarity and uniformity in the text.
For the BERT and GPT models, we took our college mid-term papers as the input dataset, so that the models can train from the student's point of view and also learn how teachers correct answers and on what criteria marks are allocated. The aim is for the models to be trained to replicate the teacher's corrections when a student answer script is fed in as input for grading.
F. BERT Model and GPT Model Datasets Gathering Process
For the BERT and GPT models, we chose our college's mid-term exam papers as the training material. Think of these models like students learning to grade papers just like teachers do. The mid-term papers are full of varied answers from students; these are the 'lessons' for our models. They study how teachers check these answers and what reasons they give for the marks they award. This way, our models learn to grade by understanding the 'teacher's way' of scoring. The goal is for them to get so good that they can look at new answers they have never seen before and grade them just like a teacher would, using the same logic and attention to detail that a real teacher applies when marking a student's work.
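The annotation and quality-control steps described above can be sketched as writing (image, transcription) pairs to a CSV file while skipping samples flagged as illegible. The filenames and rows below are hypothetical:

```python
import csv
import io

# Hypothetical (image filename, transcription, legible?) annotation rows.
annotations = [
    ("w001.png", "technology", True),
    ("w002.png", "???", False),   # illegible sample -> omitted (QC step)
    ("w003.png", "values", True),
]

# Write only the legible samples as ground truth for OCR training.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["image", "text"])
for name, text, legible in annotations:
    if legible:
        writer.writerow([name, text])

print(buf.getvalue().count("\n"))  # header + 2 rows -> 3
```

Filtering at annotation time keeps bad ground truth out of training entirely, rather than hoping the model learns around it.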
Handwritten Text Image (Words):
V. CONCLUSION
The comparison of OCR, Similarity, BERT, and GPT models as depicted in the provided summary offers an insightful overview of the strengths, limitations, and applicability of these diverse approaches to text analysis and generation.
A. Data Considerations
Our research indicated that the quality of the dataset is paramount across all models. For OCR, the granularity of data at the character level is critical, while for BERT and GPT, context and coherence of text play a significant role. The effectiveness of the Similarity Models hinges on the richness and subjectivity of the text data, underscoring the need for a diverse set of benchmarks.
B. OCR Models
The OCR models were primarily evaluated for their ability to recognize characters within images. Here, the success was largely dependent on the clarity and consistency of handwriting in the datasets provided. Limitations arose when models faced cursive or highly stylized handwriting, leading to inconsistencies in recognition. Despite trials with various models, including homemade, free, and paid services, the challenge of achieving consistent accuracy with diverse handwriting remained.
E. GPT Models
The GPT models showed potential in generating human-
like text responses, with a primary focus on customizing for
single-answer scripts. The models were adaptable to specific
research needs, achieving a 60% accuracy rate. Although this
was a significant milestone, there was a consensus that with
unique and larger datasets, further improvement in
performance could be achieved.
F. Overall Conclusion
The comparative analysis of these models in our project
underscores a recurrent theme: the success of machine
learning models is intricately tied to the data they are trained
on. Real-world educational data presents unique challenges
due to its variability and complexity. While OCR models
require clear and consistent data, Similarity Models need
subjective understanding, and BERT and GPT models
necessitate large datasets with varied contextual information
to train effectively.