Expert Systems With Applications: Manika Garg, Anita Goel

Expert Systems With Applications 225 (2023) 120111
Contents lists available at ScienceDirect
Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa
Preserving integrity in online assessment using feature engineering and

machine learning
Manika Garg a, *, 1, Anita Goel b, 2
a
Department of Computer Science, University of Delhi, New Delhi 110007, India
b
Department of Computer Science, Dyal Singh College, University of Delhi, Lodhi Road, New Delhi 110003, India
A R T I C L E I N F O A B S T R A C T
Keywords: Preserving integrity in online assessments is a matter of concern, worldwide. There exist several ways for
E-learning cheating in online assessments. Exploiting the available Internet for finding solutions is one of the popular ways
Academic dishonesty of cheating. Several researchers have proposed solutions to handle Internet cheating. Some of them suggest the
Cheating
use of secure browsers, however, these are vulnerable to hacking and prone to technical errors. Many works
Machine learning
E-Assessments
propose e-proctoring, but it is a resource-intensive method and has significant privacy concerns. Other works
Feature engineering propose preventive measures like paraphrasing of questions, but smart google search algorithms defeat the
purpose.
In this work, we use machine learning to create a model by analysing the assessment log files, for the detection
of cheaters who indulge in Internet cheating. Additionally, to address the persistent problem of the identification
of ground truth in academic dishonesty, we modify an online quiz tool to collect labelled data. We transform the
raw dataset using feature engineering methods, to derive thirteen features from the student and question-related
attributes of the assessment log files. Our objective is to obtain the best predictive model for the classification of
honest and dishonest students. We create models using two feature selection algorithms (ANOVA and Mutual
Information) and five machine learning classifiers (Logistic regression, Support vector machines, Naïve Bayes, K
nearest neighbour and Random forest) and evaluate them. From among all the models, Random forest classifier
with top features selected by the MI method obtains the best performance with an accuracy of about 85%. We
discuss the features that are most influential for the automated detection of cheaters. We also give insights into
the critical aspects of a cheat-proof assessment design.
1. Introduction more in online assessments because of the limited control over the test
environment and readily available opportunities that facilitate cheating
Online education has seen significant growth in the last two decades (Rogerson & McCarthy, 2017; von Grunigen et al., 2018). The high
and much more during the COVID-19 pandemic. The evolution of in frequency of cheating incidents has, therefore, stimulated research into
formation technology has given rise to new learning modalities such as the security of online assessment (Amigud et al., 2018; Noorbehbahani
Massive Open Online Courses (MOOC) and Small Private Open Online et al., 2022).
Courses (SPOC); with many reputed institutions incorporating these This work builds on top of an Academic Dishonesty Mitigation Plan
learning models to transform traditional offline courses into blended and (ADMP) for effective handling of cheating in online assessments that we
online programs (Ahmad Muhammad et al., 2022; Butler-Henderson & proposed in our previous study (Garg & Goel, 2022). The ADMP pro
Crawford, 2020; Gomez et al., 2022). However, much like traditional vides strategies for both prevention and detection for effective mitiga
face-to-face courses, online courses consist of assessments that raise tion of different forms of cheating behaviours, such as collusion (Crook
concerns about the risk of academic cheating (Costley, 2019; McCabe & Nixon, 2019), impersonation (Mungai & Huang, 2017), forbidden aids
et al., 2001). The prevalence of academic dishonesty or cheating is much (Hylton et al., 2016), and plagiarism (Awasthi, 2019). We observed that
* Corresponding author.
E-mail address: manikagarg2007@gmail.com (M. Garg).
1
ORCID ID - https://orcid.org/0000-0002-7053-1144.
2
ORCID ID - https://orcid.org/0000-0002-9460-1557.
https://doi.org/10.1016/j.eswa.2023.120111
Received 5 September 2022; Received in revised form 11 March 2023; Accepted 8 April 2023
Available online 13 April 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
M. Garg and A. Goel Expert Systems With Applications 225 (2023) 120111
the use of forbidden aids like, the Internet, is very common in online • We apply the proposed model to a real-world dataset, in addition to a
assessments (Li et al., 2021; Nguyen et al., 2020; Okada et al., 2019). controlled dataset. The experimental results demonstrate the effec
The absence of invigilation (Mungai & Huang, 2017) and the availability tiveness and generality of our approach (accuracy of 82.35% and
of the Internet is often perceived as a cheating opportunity by the stu 85% respectively).
dents as they obtain direct answers or solving hints from online re • We analyse the composition of influential features and discuss the
sources such as Google search engines and ChatGPT (King, 2023). usage pattern of features by the cheaters. The findings help to un
The exploitation of the Internet for finding solutions is a key chal derstand the test-taking pattern of cheaters and give insights into the
lenge in maintaining the integrity of online assessments. Several re critical aspects of a cheat-proof assessment design.
searchers have proposed solutions to prevent Internet cheating. Some
works suggest the use of third-party tools like secure browsers to lock the The rest of the paper is organized as follows: Section 2 presents the
assessment screen, however, these tools are vulnerable to hacking and review of the literature. Section 3 describes the datasets and Section 4
prone to technical errors (Ullah et al., 2019; von Grunigen et al., 2018). presents the feature engineering process. Section 5 describes the
Many researchers propose paraphrasing of questions, but the smart methods and classifiers used in the study. Section 6 provides a detailed
google search algorithms fail the motive of obscuring the cheaters evaluation of experimental results and Section 7 discusses the results.
(Golden & Kohlbeck, 2020). Other works recommend the use of integ Finally, Section 8 concludes the paper.
rity policies, but due to lack of implementation and mild penalties, these
policies remain ineffective (Nguyen et al., 2020; Salhofer, 2017). 2. Literature review
In this paper, we use a data analysis approach for the detection of
Internet cheating, as proposed in the ADMP plan of our previous study 2.1. Research on Internet cheating
(Garg & Goel, 2022). Specifically, we use Machine Learning (ML) and
feature engineering methods for the detection of cheaters, by analysing Internet is a prerequisite for the conduct of online assessments, but it
the log files. A typical challenge for cheating-related ML studies is the is widely exploited by students for cheating purposes (Rogerson &
collection of labelled datasets. For this, we improvise a self-developed McCarthy, 2017; von Grunigen et al., 2018). It is observed that due to
online quiz tool, namely iQuiz3 and use it to collect labelled datasets the Internet availability (Li et al., 2021; Nguyen et al., 2020) and the
from two lab-based online assessments. To increase confidence in our absence of invigilation (Mungai & Huang, 2017) online students
collected datasets, we further validated the ground truth using the perceive opportunities to consult online resources to find answers during
existing methods of simulation and self-reports. the assessment (Okada et al., 2019).
The data in the assessment log file is recorded in the question-wise Previous studies have explored this problem, for example, Alessio
format that describes patterns of how students attempt a question. et al. (2018) and Ghizlane et al. (2019) propose the use of lockdown
However, we assume that even though a student’s attempt at a question browsers to refrain students from accessing unauthorized resources on
may appear genuine, the collective attempt can be used to identify ir the Internet. They suggest locking the test environment to the online
regularities in the test-taking pattern. Therefore, to consider the col assessment window until the submission of the assessment and, also
lective view over the assessment behaviour, we use feature engineering disabling keyboard shortcuts like copy, paste and screenshots. However,
methods to derive thirteen indicators from raw data of the assessment it is observed that these tools can be easily compromised by students and
log file. The derived indicators are used as features in the machine are prone to many technical errors like frequent crashes.
learning model. Many studies (Cramp et al., 2019; Hylton et al., 2016) suggest
Extensive experiments are conducted with different feature sets and traditional methods of handling cheating like proctoring which involves
classifiers. First, we employ two feature selection algorithms, Mutual watching real-time or recorded videos of exam sessions or conducting
Information (MI) and ANOVA, to select the relevant set of features. Next, lab-based online assessments. But proctoring is a resource-intensive
we create models using five machine learning classifiers, namely Lo method, especially if there are large number of students. It also raises
gistic Regression (LR), Support Vector Machines (SVM), K Nearest significant privacy concerns, like granting access to students’ personal
Neighbour (KNN), Naïve Bayes (NB) and Random Forest (RF) classifier computers and monitoring them during the exam (Kharbat & Abu
with all the thirteen features, and with the subset of selected features. Daabes, 2021).
We also use k-fold cross-validation with hyperparameter tuning to Another recommended approach is the formulation of integrity
enhance the performance of the classifiers. We compare the perfor policies and honour codes for online assessments that outlines about
mance of all models to obtain the best predictive model. By comparing what constitutes cheating, how it will be monitored and the corre
the accuracy of all models, we find that the RF classifier applied to the sponding consequences. However, it is observed that the poor imple
top features selected by the MI method yields the best performance. The mentation and students’ lack of awareness regarding the integrity policy
best model is further validated on a real-world dataset to demonstrate yield unsatisfactory results (El-Nakla et al., 2019; Mellar et al., 2018).
the generality of the approach. In summary, the main contributions of Several assessment design techniques are proposed by researchers to
our work are: minimize opportunities for Internet cheating. For example, Golden &
Kohlbeck (2020) propose paraphrasing questions to complicate the
• We capture labelled data for the determination of the ground truth process of searching for answers on the Internet because the question is
and validate it using the existing methods of simulation and self- not available in an exact form online, however, smart google search
reports. For this, we improvise an online quiz tool. algorithms often defeat the purpose. In addition, Nguyen et al. (2020)
• We use feature engineering to derive thirteen features from the raw suggest designing new and different questions to obscure dishonest
data of the assessment log file. Two feature selection algorithms - students from finding solutions on the Internet.
ANOVA and MI have been used for the selection of the most influ In our study, we used Bloom’s Taxonomy (BT) to design different
ential features. We create models using five ML classifiers (LR, SVM, questions assessing varying levels of students’ cognitive skills. BT has
NB, KNN and RF) applied to all the thirteen features, and to the been successfully applied in the education domain for course design and
subset of selected features. We evaluate the models to obtain the best evaluation purposes (Thompson et al., 2008). The categories in BT are
predictive model for the detection of cheaters. hierarchically ordered; the first three categories - knowledge, compre
hension and application measure the students’ lower order thinking
skills (LOTS), and the other three categories - analysis, synthesis and
evaluation measure the higher order thinking skills (HOTS) (Chang &
3
https://github.com/anitagoel/iQuiz. Mao, 1999; Pappas et al., 2013). The LOTS questions are simple and
2
their solutions can be easily searched online whereas HOTS questions students. For our approach, we utilized the following entries:
are complex and relatively difficult to search online (Gagné, 2020).
• Username: Name of the student. This has been anonymized for
2.2. Research in Machine learning and online cheating privacy purposes.
• Question: Numeric variable describing the question number.
Machine learning can be used to discover useful patterns in data that • BT level: Categorical variable describing the Bloom’s Taxonomy
can be analysed to draw valuable conclusions. There are research works level (HOTS/LOTS) of the question.
that apply ML methods using different combinations of student, problem • Question-Visit-Timestamp: List of timestamps when the question
and submission features, for the detection of cheating in online assess appears on the student’s screen. Please note that the student can visit
ments. But there remains a major challenge to collect labelled data for the question multiple times.
proving that the detected students are actual cheaters. Previous studies • Response-Submit-Timestamp: List of submitted responses and the
(Chuang et al., 2017) mostly rely on self-reports where students are respective timestamps of submission to the question. Please note that
interviewed post-assessment about whether they cheated or not. How the students can revise the previous response or resubmit the same
ever, these methods may lead to under-reporting or overreporting as response (after proofread) any number of times.
they are highly dependent on students’ willingness to admit their • Response: Final response (option 1/2/3/4) submitted to the
misconduct. Ruipérez-Valiente et al. (2021) and Jaramillo-Morillo et al. question
(2020) used implications where they associated the observed behaviour • Score: Numeric score (0/1) to the question
with general dishonest behaviour to determine the ground truth. This • Time: The total time spent (seconds) on the question
method is again prone to bias as there could be multiple explanations for • Tab Switch: Total number of times the student leaves the online
the results. Ranger et al. (2020) used the simulation method where some assessment tab while answering the question. For this work, we
students were instructed to cheat so that the cheaters were known be convert this variable into a binary variable (if the count of tab switch
forehand. However, their results largely depend on the artificial simu for a question is greater than zero, then we assign tab switch a value
lation conditions and also, there is no way to validate if there exist any of 1, otherwise zero).
cheaters other than the selected students.
Nonetheless, several authors have applied techniques from educa Fig. 1 presents a sample of the assessment log file with some columns
tional data mining and machine learning for the detection of different relevant to this study. Apart from the above entries, various other entries
forms of cheating (Ranger et al., 2020; Steger et al., 2021; Kamalov like IP address, location of the question etc. are also recorded but are
et al., 2021). Much of it has been focused on the detection of students beyond the scope of this work. To further increase the usability and
who unethically collaborate and help each other by sharing answers reproducibility of iQuiz tool, we have open-sourced the tool on GitHub.
during the assessment, for example, many studies (Balderas & Caballero-
Hernández, 2020; Ruipérez-Valiente et al., 2021; Salhofer, 2017; San 3.2. Assessment design and participants
galli et al., 2020) detected cheaters based on their response submission
timestamps, Jaramillo-Morillo et al. (2020) used text mining along with Our approach for cheating detection involved two field experiments.
event log analysis, and Alexandron et al. (2019) proposed person-fit Both the experiments were performed in lab-based online assessment
statistics for discriminating cheaters. Other studies have applied ma conducted on the iQuiz tool integrated with LMS Moodle. The partici
chine learning analytics for the detection of plagiarism; for example, pants of the experiments were undergraduate students enrolled in a
Amigud et al. (2017), Ljubovic & Pajic (2020), Opgen-Rhein et al. software engineering course in the University of Delhi, during the aca
(2018) and Ramnial et al. (2016) applied supervised ML algorithms demic year 2021–2022. The assessments were designed using Bloom’s
using the textual features of the documents and Trezise et al. (2019) Taxonomy consisting of both HOTS and LOTS questions. Both assess
used an unsupervised clustering approach to identify writing patterns to ments had a maximum time limit of 25 min and consisted of fifteen
verify authorship claims. Likewise, ML algorithms have also been used questions from the software engineering course. The questions were
to detect impersonation by analysing keystroke dynamics (Mungai & multiple choice questions consisting of four response options from
Huang, 2017) and face recognition methods (Asep & Bandung, 2019). which one correct response had to be selected. The students may select a
As we see, the aforementioned studies have demonstrated the use of response option or may decide to abort responding to it. Moreover, the
data mining and machine learning for detecting different forms of online students can come back to any previously visited question and can revise
cheating behaviours. However, even though Internet cheating is one of the response option. The questions were presented in randomized order
the most common forms of cheating in online assessments, we could not to all students and only one question was shown per page. The assess
find any study that applies machine learning to detect Internet cheating; ment was proctored so that the students are not able to use resources
our research study is focused in this direction. such as mobile phones, books and notes. Also, the students were not
permitted to talk or interact with each other. However, the Internet was
3. Dataset allowed on the machine in which they were taking the online assessment
and the students were not proctored for their Internet browsing activ
3.1. Data collection tool ities. This kind of test environment creates opportunities that facilitate
Internet cheating and at the same time deter other forms of cheating. For
We modified a self-developed quiz tool, namely, iQuiz integrated example, the use of objective questions mitigates the chances of
with Moodle Learning Management System (LMS) platform for data plagiarism, proctoring and randomization of questions help control
collection purposes. iQuiz is a Learning Tools Interoperability (LTI) collusion and lab-based testing does not allow impersonation.
compliant web application that can be integrated with any LTI-
compliant LMS or Content Management System (CMS). iQuiz allows to 3.3. Data collection procedure
create and administer quizzes, and also capture all events performed by
students while interacting with the online assessment. We included a Data collection took place in two separate experiments. The first
tab-switch recording functionality in the tool for the purpose of deter experiment was conducted in a cheating-induced environment within a
mining ground truth required for ML classification purposes. All the laboratory-based setting. In total, sixty students (42 males, 18 females,
interactions (e.g., responses, response times, scores, response revisions, mean age = 19.25) participated in the first experiment. The students
and question visit timestamps) are captured in a downloadable log file of were informed that the experiment was conducted as part of a research
CSV format. The log entries are recorded question-wise for all the study. They were also informed that the participation was voluntary and
3
Fig. 1. Sample of assessment log file.
that the results will not contribute to the final credits. All participants the test-taking pattern. Furthermore, the raw features do not consider
provided their informed consent. Of the sixty students, twenty-one the instilled question dependencies between the student attempts. For
students were randomly selected and recruited to induce cheating. example, a common cheating pattern consists of small response times
The selected students were encouraged to answer all questions and were and higher scores on high difficulty questions. The question-wise
even allowed to search the Internet for answers. On the other hand, the recorded features cannot account for such patterns as the model treats
remaining students were strictly prohibited from using the Internet for all the questions equally irrespective of their individual attributes like
answer searching purposes. Post assessment we analysed the tab switch BT level or difficulty. Moreover, we assume that in the event of cheating,
records obtained from the assessment log file to confirm that the it is expected that the test-taking pattern of the student deviates from the
recruited students were the only cheaters in the assessment. The tab commonly observed pattern. This is a classic anomaly detection
switch records verified that only the recruited twenty-one students approach where the events or observations are measured against the
switched tabs on multiple questions of the assessment. These twenty-one dataset’s normal behaviour. The question-wise records cannot consider
students were labelled as cheaters in the experimental dataset to be later this deviation as all the questions are treated similarly, while in practice,
used in machine learning algorithms. the common behaviour differs with the type of question. Therefore, in
The second experiment was conducted in a typical lab-based online our study, we decide to transform the raw question-wise dataset into
assessment setting. However, unlike the first experiment, no cheating student-wise dataset using feature engineering methods. This allows
was induced and the dataset obtained was purely empirical. In total, detecting cheaters based on a different view of the dataset. We process
fifty-one students (35 males, 16 females, mean age = 19.29) participated the raw data in the log files to extract the features for building our
in the second experiment. The students were informed that the results of predictive model.
the experiments would contribute to the final credits. We further warned We decided to extract features related to two different dimensions:
all the students against the unethical use of the Internet during the question attributes and student attributes. The rationale for the selection
assessment and that any violation will be considered as a cheating of these dimensions was based on considering different aspects of the
attempt. However, based on the previous research findings, we pre detection of Internet cheating. The initial selection of features was based
sumed that the offered academic credits and the Internet availability on the numerous research studies related to the detection of online ac
could be significant motivating factors for the students to cheat. Post ademic cheating (as discussed in Section 2) and also the personal
assessment, the tab switch records from the obtained log file were teaching and research experience of co-authors. We consider two types
analysed to identify probable cheaters. Moreover, all the students were of question attributes – BT level and the perceived difficulty of the
personally interviewed by the experimenter. The interview required question. We map these attributes on the scores and the response times
students to provide a self-report of whether they cheated or not. The of the students to derive eight new features, namely Score-HOTS, Score-
experimenter confronted all the students with their tab switch records LOTS, Time-HOTS, Time-LOTS, Score-HighDiff, Score-LowDiff, Time-
and then asked if the students had cheated or not. All the students with HighDiff and Time-LowDiff. In the case of student attributes, we
zero and one tab-switch records refused to cheat. However, fourteen consider their individual attributes as well as attributes relative to the
students, with multiple tab switch records, confessed to cheating. These class. In terms of their individual attributes, we derive two features,
fourteen students were labelled as cheaters in this experimental dataset. namely Revisions and Visit that describe their revision and question
A total of 1665 student interactions were obtained from 111 students on viewing frequency, respectively. In terms of their relative attributes, we
two online assessments with 15 questions each. Both datasets can be derive three features, namely Flag, Kendall’s Tau (KT) and Cook’s Dis
downloaded from https://github.com/anitagoel/Dataset. tance (CD). Flag and KT determine outlying patterns in the response
times of the students. CD is based on the regression of the total score and
4. Feature engineering total testing time of students. KT and CD have been previously used for
cheating detection by (Ranger et al., 2020).
In applied ML, the performance of any predictive model largely de The thirteen features that have been engineered are identified into
pends upon the quality of data representation. Feature engineering is the four categories – Question Type, Question Difficulty, Student Attempt,
main task in the preparation of data for ML models (Nargesian et al., and Student-Class Attempt, as shown in Table 1. The feature engineering
2017). It is a process of transforming raw data into features that better has been applied to the datasets of our first and second experiment to
characterize the problem and lead to increased model performance on obtain Dataset1 and Dataset2, respectively. Each dataset is a student-
unseen data (Pham et al., 2016). These features are developed based on feature matrix that consists of thirteen independent variables as fea
the domain experience of the data scientists. The relevant features are tures and one dependent variable as the class label (1 if cheater and 0 if
selected via iterative trials and model evaluations and irrelevant fea non-cheater; as obtained from Section 3.2) that is used for machine
tures are discarded. Developing and incorporating various features learning classification. In the following, section 4.1 describes all the
based on the test-taking pattern of students is one of the main contri engineered features along with formulas specifying their calculation and
butions of this study. section 4.2 describe the feature selection methods.
Feature engineering is critical to the detection of cheating in online
assessments. Some authors only use raw features, such as response
submitted, response time, and score, on each question for cheating 4.1. Engineered features
detection (Alexandron et al., 2017; Chuang et al., 2017). Such features
do not consider the aggregated view over the assessment behaviour. We 4.1.1. Features based on question type
assume that even though a student’s attempt at a question may appear This section describes the features that have been determined based
genuine, the collective attempt can be used to identify irregularities in on the type of question. The following notation is used. Let tij and sij
denote the time taken and score earned (1 if correct and 0 if incorrect/
4
Table 1 {
Feature information. 1, if dj ≥ 0.5
Dj = (6)
0, if dj < 0.5
S.no Feature Type Feature name Code
1 Question Type Score-HOTS SHOTS

2 Score-LOTS SLOTS • Score-HighDiff - This is the total score of any student i on high
3 Time-HOTS THOTS difficulty questions of the assessment. It is calculated as follows:
4 Time-LOTS TLOTS ∑Q [ ]
SHD (i) = j=1
sij Dj = 1 (7)
5 Question Difficulty Score-HighDiff SHD
6 Score-LowDiff SLD
7 Time-HighDiff THD • Score-LowDiff - This is the total score of any student i on low dif
8 Time-LowDiff TLD ficulty questions of the assessment. It is calculated as follows:
∑Q [ ]
9 Student Attempt Revisions R SLD (i) = j=1
sij Dj = 0 (8)
10 Visit V
11 Student-Class Attempt Flag F • Time-HighDiff - This is the total time spent by any student i on the
12 KT K high difficulty questions of the assessment. It is calculated as follows:
13 CD D ∑Q [ ]
THD (i) = j=1
tij Dj = 1 (9)
unanswered) respectively by student i(i = 1⋯N) on question j(j =

1⋯Q). Let cj denote the level of cognitive skill (1 if HOTS and 0 if LOTS) • Time-LowDiff - This is the total time spent by any student i on the
required to solve question j based on Bloom’s Taxonomy. low difficulty questions of the assessment. It is calculated as follows:
∑Q [ ]
• Score-HOTS - This is the total score of any student i in the higher- THD (i) = j=1
tij Dj = 0 (10)
order Bloom’s Taxonomy questions of the assessment. Here, we use
the Iversion bracket, i.e., [statement] = 1 if the statement is true and 4.1.3. Features based on student attempt
[statement] = 0 otherwise. The Score-HOTS is calculated as follows:
∑Q [ ] • Revisions - This feature represents the total number of revisions
SHOTS (i) = sij cj = 1 (1)
j=1 done by a student i on all Q questions. Here, revision essentially
means a change in response and does not include proofreading the
• Score-LOTS - This is the total score of any student i on the lower- question. Let Aij = (a1 , a2 ⋯ak ) denote the list of responses submitted
order Bloom’s Taxonomy questions of the assessment. It is calcu to question j by a student i in k attempts. We first calculate rij (1 in
lated as follows: case of revision, 0 otherwise) that determines if the student i has ever
∑Q [ revised his/her answer on question j.
]
SLOTS (i) = sij cj = 0 (2) {
j=1 1, ∃ak ∈ Aij , ak− 1 ∕= ak
rij = (11)
0, Otherwise
• Time-HOTS - This is the total time spent by any student i on the Finally, the summation of revisions for every question gives the value
higher-order Bloom’s Taxonomy questions of the assessment. It is of the feature (min = 0, max = Q), as calculated below:
calculated as follows: ∑Q
∑Q [ ] R(i) = rij (12)
(3)
j=1
THOTS (i) = j=1
tij cj = 1
• Visit – This feature represents the average number of views per

• Time-LOTS - This is the total time spent by any student i on the question by any student i. Let T ij = (t1 , t2 ⋯tk ) be the set of Question-
lower-order Bloom’s Taxonomy questions of the assessment. It is Visit-Timestamps on question j by a student i. Here, we use the no
calculated as follows: tation |A| to denote the cardinality of a set A. The feature is calcu
∑Q [ ] lated as follows:
TLOTS (i) = j=1
tij cj = 0 (4)
1 ∑Q ⃒⃒ ⃒⃒
V(i) = Tij (13)
Q j=1
4.1.2. Features based on question difficulty
This section describes the features that have been determined based
4.1.4. Features based on student-class attempt
on the difficulty of the question. We calculate the difficulty of any
Flag – This feature records the occurrences of students’ conspicu
question j on basis of the Proportion Correct (PC) method as described
ously small and long outlying response times as compared to the median
by Singh et al. (2021). Let, the difficulty dj of any question j be stated as
response time of the class for the respective question. It uses Inter
the ratio of the number of wrong answers (and unanswered questions)
Quartile Range (IQR) to calculate the outlying response times for every
and the total number of students, as shown below:
question separately. Note that a similar measure was used by (Steger
1 ∑N [ ] et al., 2021) for cheating detection. Let IQRj = (Qrtl3 )j − (Qrtl1 )j repre
dj = sij = 0 (5)
N i=1
sent the IQR of time spent by all students on question j where (Qrtl1 )j and
The calculated difficulty can be further categorized into two levels (1 (Qrtl3 )j represent the first and third quartiles respectively. The response
( )
for High and 0 for Low) using a threshold of 50%. The categorized dif times which fall below (Qrtl1 )j − 1.5 IQRj or above (Qrtl3 )j +1.5(IQRj )
ficulty is represented as Dj and is calculated as follows: are considered outliers (fij = 1) as shown below:
5
{
1, if (tij > Uj ) |(tij < Lj ) independent relation is given a value of zero. ANOVA is a statistical
fij = (14)
0, if (Lj ≤ tij ≤ Uj ) technique used to test for significant differences in class means. The test
can also be used to see the impact of numerical independent variables on
where Uj = (Qrtl3 )j + 1.5(IQRj ) and Lj = (Qrtl1 )j − 1.5(IQRj ). The sum the categorical dependent variable. The features having higher weights
mation of outlying response times for every student represents the final are used in the model and the remaining features with small weights are
feature Flag (min = 0, max = Q), as calculated below: ignored.
∑Q
F(i) = j=1
fij (15) 5. Methodology
The purpose of our study is to develop a machine learning model to

effectively detect cheaters in online assessments. The target variable is
• KT – The KT statistic (Ranger et al., 2020) assesses whether the in defined as a binary variable and thus, we treat the cheating detection
dividual time spent on the questions by the students has the same problem as a binary classification problem. Two datasets, as described in
order as the median time spent by the class. It is calculated as Ken Section 4, were used to build our predictive model. Firstly, Dataset1 is
dall’s tau correlation between the individual response times of the used for both training and testing the machine learning algorithms.
student and the median response times in the question, as shown Additionally, Dataset2 is used as an unseen dataset to further validate
below: the model trained with Dataset1. The framework of our model is given in
(→ →) Fig. 2. The data is pre-processed which includes removing the missing
K(i) = τKen T i , M (16) values and scaling the data between 0 and 1 using the MinMax Scalar for
effective representation for machine learning algorithms. In this section,
→
where M = (m1 , ⋯mj ⋯mQ ) is a list of median response times of all we describe the employed machine learning classifiers and their evalu
→ ation metrics.
questions and T i = (ti,1 ⋯ti,j ⋯ti,Q ) is a list of individual response times of
any student i in all questions. τKen or Kendall’s tau is a measure of the
5.1. Machine learning classifiers
correspondence between two rankings. Values close to 1 indicate strong
agreement whereas values close to − 1 indicate strong disagreement. For
In this work, we employ five ML classifiers, which are Logistic
calculations, we used the Kendall Tau statistic from the inbuilt python
Regression, Support Vector Machine, Naïve Bayes, K Nearest Neighbor
library called scipy.
and Random Forest. These classifiers are widely used for binary classi
fication (Russell & Norvig, 2009). It is to be noted that deep learning
• CD – The CD statistic (Ranger et al., 2020) assesses whether a stu
techniques have shown better performance over traditional machine
dent’s data pattern is consistent with the general relation between
learning techniques in various applications, especially when dealing
the total exam score and the total exam duration in the sample.
with large datasets (Lecun et al., 2015). However, in our case, we have a
Cook’s Distance is a measure of an observation’s influence on linear
relatively small dataset (around hundred observations). The small
regression (Cook, 1977). Observations with a large influence may be
dataset makes it challenging to train the deep learning models effec
treated as outliers. Cook’s distance was determined for all students
tively and often results in overfitting. The five ML classification algo
after regressing the total exam duration on the total exam score. The
rithms were applied to both the full feature set as well as the selected
regression can be expressed as:
feature subset. The ML algorithms are briefly explained in this section.
Y = aX + b (17) Logistic Regression is a binary classification algorithm that is used
to predict the probability of the categorical dependent variable from one
→
where Y = (y1 ⋯yi ⋯yN ) be the list representing the total exam duration or more independent variables. The logistic function converts the linear
→ combination of independent variables into a probability score between
of all students and X = (x1 ⋯xi ⋯xN ) be the list representing the total
0 and 1, which is used to classify the dependent variable into one of the
score of all students. For calculations, we used the inbuilt python library
two possible outcomes. The model is trained by minimizing the error
called statsmodels.
between the predicted and actual outcomes using a cost function. The
coefficients of the independent variables are adjusted during the
4.2. Feature selection training process to maximize the likelihood of observing the actual
outcomes given the input data.
Feature selection algorithms are used to determine the relevant Support Vector Machine is a supervised machine learning algo
features in the dataset and to potentially avoid overfitting the predictive rithm that works by finding the maximum margin hyperplane, which is
model. The cheating detection problem is a classification problem with the hyperplane that separates the two classes of data with the maximum
numerical input variables and categorical output variable (1 if cheater margin, i.e., the distance between the hyperplane and the closest data
and 0 if non-cheater). In this work, two feature selection algorithms – points of each class. To find the maximum margin hyperplane, SVM
Mutual Information (MI) and the Analysis of Variance (ANOVA), are solves a constrained optimization problem by maximizing the margin
used. We used MI and ANOVA because they are commonly used methods subject to the constraint that all data points are correctly classified. If the
for feature selection in datasets with numerical inputs and categorical data is not linearly separable, SVM uses a kernel function to map the
output. These methods have been shown to be effective in selecting data into a higher dimensional space where it becomes linearly sepa
relevant features for classification tasks (Wang et al., 2017). While there rable. Once the optimal hyperplane is found, new data points can be
are other feature selection methods available, such as Recursive Feature classified by projecting them onto the hyperplane and checking which
Elimination and Principal Component Analysis, they may not be as side of the hyperplane they belong to.
suitable for datasets with a categorical output (Hira & Gillies, 2015). Naïve Bayes classifier is a probabilistic classifier that is based on
Additionally, both MI and ANOVA methods are simple and interpret Bayes’ theorem, which describes the probability of a hypothesis based
able, as they allow to identify the most important features in the model on prior knowledge and evidence. The algorithm calculates the condi
and explain the results more easily. Mutual information is a measure tional probability of each feature given the class and multiplies them to
between two random variables that quantify the amount of information obtain the joint probability of the feature set given the class. Then it uses
obtained about one random variable, through the other random vari Bayes’ theorem to calculate the probability of the class given the feature
able. A higher value of MI means more information, and a completely set. The class with the highest probability is chosen as the predicted
6
Fig. 2. The framework of building the cheating detection model.
class. used for the experimentation was Python 3.7.13 and all the algorithm
K Nearest Neighbor is a non-parametric classification algorithm. implementations are from the scikit-learn 1.1.1 framework. We have
Given a new observation, KNN searches for the k-nearest training ob used the mutual_info_classif() and f_classif() function in the sklearn.fea
servations in terms of distance to the new observation. The class label of ture_selection module to implement feature selection algorithms. The
the new observation is then assigned based on the majority class among machine learning classifiers were implemented using the Python classes,
these k neighbors. The distance measure can be Euclidean distance, namely, LogisticRegression (), SVC (), KNeighborsClassifier (), GaussianNB
Manhattan distance, or any other distance metric. The value of k is a () and RandomForestClassifier ().
hyperparameter that needs to be determined beforehand.
Random Forest is an ensemble learning method that combines 6.2. Results of feature selection
multiple decision trees to improve the performance of binary classifi
cation. The algorithm creates a forest of decision trees, where each tree In our study, two feature selection methods – MI and ANOVA, were
is trained on a random subset of the data and a random subset of fea applied to select the important features from the dataset. Fig. 3 and
tures. During prediction, each tree in the forest independently predicts Fig. 4 show the importance of features based on the MI and ANOVA
the class of the input data and the final prediction is made by taking the methods, respectively. We select the top five features from both
majority vote of the individual tree predictions. The number of trees in methods. According to Fig. 3, the important five features which have the
the forest and the number of features used for each tree are hyper highest scores and were determined by the MI method are – Score-LOTS,
parameters that can be tuned to optimize the performance of the model. Score-HighDiff, Time-LOTS, Time-LowDiff and Time-HighDiff. It can be
noticed that Score-LOTS has the highest score at 0.169 with a small
5.2. Evaluating machine learning classifiers difference compared to the second feature, which is Score-HighDiff at
0.166. Both features have considerable influence on model training and
Stratefied K -Fold Cross-validation and hyperparameter tuning are testing. The figure also displays that Revisions, KT, Score-HOTS and
used to obtain the best performance of the models. In this work, accu Time-HOTS have no importance in our model. On the other hand, the
racy, precision, recall and F-score are employed to compare the per important five features having the highest scores, determined by
formances of different classifiers. ANOVA method are - Score-LOTS, Score-HighDiff, Time-LowDiff, Time-
Accuracy is a ratio of correctly predicted observations to the total LOTS and Flag. We can notice that Score-LOTS has achieved the highest
observations: score at 27.507, followed by the second feature i.e., Score-HighDiff at
16.957. The graph also shows that there is no significant difference
Number of Correct Predictions
Accuracy = (18) between Time-LowDiff at 11.125 and Time-LOTS at 10.133. At the
Total number of Predictions
bottom comes both CD at 0.128 and Revisions at 0.051, having the least
F-score is defined as the harmonic mean of precision and recall: influence on model training and testing.
We notice that among the top five important features selected by MI
2*Precision*Recall
F-score = (19) and ANOVA, four features are common, namely Score-LOTS, Socre-
Precision + Recall
HighDiff, Time-LOTS, Time-LowDiff. The fifth feature selected by MI is
where precision is the ratio of true positives to all the predicted positive Time-HighDiff; on the contrary, ANOVA selects Flag as the fifth most
observations and recall is the proportion of true positives in all the important feature for model training and testing.
actual positive observations. In our scenario, we would rather have
models with high precision, reducing the number of honest students 6.3. Results of ML classifiers
wrongly classified as cheaters.
In this section, we train the five classification algorithms – LR, SVM,
6. Experimental results KNN, NB and RF, with the full feature set and also with the top five
features selected by MI and ANOVA, and compare their performance. To
6.1. Experimental setup improve the performance of the models, we performed hyperparameter
tuning using GridSearchCV. The best hyperparameters were selected for
The proposed system was implemented in Google Colab, a free GPU- each algorithm using stratified 5-fold cross-validation in which 80% of
based cloud service offered by Google Inc. The programming language the data is used for training purpose and the remaining is used for testing
7
Fig. 3. Score of features selected by MI.
Fig. 4. Score of features selected by ANOVA.
purpose. F-score was used as the metric to measure the performance of ANOVA methods. The results of K-fold cross-validation for all five
each hyperparameter combination. Table 2 shows the best values of classifiers are shown in Fig. 5 to Fig. 7.
parameters that were passed to the classifiers that achieved the highest From Fig. 5, we observe that with the full feature set, the RF classifier
F-score for full features, and for the top five features selected by MI and yields the highest accuracy at 83.33%, F-score at 74.81%, precision at
83% and recall at 71%. On the other hand, the NB classifier yields the
lowest accuracy at 71.67%, F-score at 58.71%, precision at 57% and
Table 2 recall at 47%.
Best value of hyperparameters for all features and selected features by MI and Fig. 6 shows classifier performance using the top five features
ANOVA. selected by the MI method. It is observed that RF again achieves the best
Model Parameters (All Parameters (MI) Parameters accuracy at 85%, F-score at 76.48%, precision at 86% and recall at 71%.
features) (ANOVA) The LR and SVM perform quite similarly with an approximate accuracy
LR C: 10, penalty: ‘l1’ C: 10, penalty: ‘l2’ C: 10, penalty: ‘l1’
of 82% and an F-score of 72%.
SVM C: 10, kernel: ‘inear’ C: 5, kernel: ‘linear’ C: 1, kernel: ‘rbf’ In Fig. 7, the classifier performance with the top five features
KNN n_neighbors: 5 n_neighbors: 5 n_neighbors: 5 selected by the ANOVA method is compared. It is observed that even in
NB var_smooothing:1e-09 var_smoothing: 1e-09 var_smoothing: 1e-09 this case, the RF classifier achieves the highest accuracy at 83.33% and
RF bootstrap: False, bootstrap: False, bootstrap: True,
F-score at 73.78%; while LR recorded the worst accuracy at 78.33% and
max_depth: 5, max_depth: 3, max_depth: 4,
min_samples_leaf: 3, min_samples_leaf: 3, min_samples_leaf: 1, F-score at 64.57%. It is further noted that similar accuracies and F-scores
n_estimators: 50 n_estimators: 200 n_estimators: 100 are reported by SVM, KNN and NB classifiers.
8
Fig. 5. Performance of different classifiers applied to Full Features.
Fig. 6. Performance of different classifiers applied to top five features selected by MI method.
The above findings show that the RF classifier, in general, gives as there will be cohort effects that introduce several different patterns.
better performance for our research problem as compared to the other Nonetheless, our model was successful in the classification of about 83%
four classifiers. Overall, the best performance is shown when the RF of the students correctly as cheaters and non-cheaters.
classifier was trained with the top five features selected by the MI
method. 7. Discussion
Cheating is a severe problem in both online and offline assessments

6.4. Unseen dataset results that depreciates the value of academic achievement. The problem is
more difficult to handle in online assessments due to the limited control
So far, we have tested our results on the same dataset as they have of the test environment and readily available opportunities that facili
been trained upon by the machine learning classifiers. To discover the tate cheating. In this paper, we focus on the use of the Internet, a specific
usefulness of our model, we apply our trained model to the new dataset form of forbidden aid, for cheating purposes. It is observed that students
i.e., Dataset2. From the above sections, we can see that the best model often exploit the available Internet to find answers during the assess
was achieved by the Random Forest classifier using the MI-selected ment. Based on the ADMP plan of our previous study (Garg & Goel,
features and the corresponding hyperparameters. We use the best 2022), we proposed a data analysis approach to preserve the integrity of
configuration that achieved the highest accuracy to train our model online assessments by detecting students indulging in Internet cheating.
using Dataset1 and test it on Dataset2. Fig. 8 shows the performance of We used machine learning methods for the detection of cheaters, by
the best model on both datasets. We observe that the accuracy of the analysing the assessment log data. Specifically, we first used feature
model on the unseen data was slightly reduced to 82.35% from 85% engineering to derive thirteen features from the raw datasets of the log
which was achieved from the train/test on the same dataset. Similarly, file. Each feature belongs to one of the four categories, namely,
the F-score also reduced to 66.66% from 76.48%. This is understandable
9
Fig. 7. Performance of different classifiers applied to top five features selected by ANOVA method.
Fig. 8. Performance on both datasets.
Question-type features, Question-difficulty features, Student attempt the datasets used in this study.
features and Student-Class attempt features. We created different ML Besides the performance of predictive models, the engineered fea
models using two feature selection algorithms (MI and ANOVA) and five tures also present some useful insights. We find that among the top five
classifiers - LR, SVM, NB, KNN and RF and evaluated them. In general, influencing features selected by MI and ANOVA, four features are
the best performance was obtained by the RF classifier that scored the common, namely Score-LOTS, Score-HighDiff, Time-LOTS and Time-
highest accuracy at 85% and an F-score at 76.48% with the features LowDiff. The fifth feature selected by MI is Time-HighDiff and by
selected by the MI feature selection method. On the other hand, we ANOVA is Flag. We also see that the Score-LOTS feature has been
observe that the other four classifiers behave differently from RF but are identified as the most influencing feature, by both MI and ANOVA. From
quite similar to each other with slight differences in their performance. Figs. 9a and Fig. 10a, it was observed that cheaters were able to solve
We also examined precision and recall values to estimate the cost of false much more LOTS questions as compared to honest students. We arrive at
positive and false negative. For our problem, we would rather consider the hypothesis that the cheaters were performing better as they might be
models with high precision as compared to high recall. The rationale for obtaining answers from the Internet considering that answers to LOTS
this consideration is that we do not wrongly classify an honest student as questions are easy to find on the Internet. This hypothesis is plausible as
a cheater. We find that the RF classifier with features selected by the MI cheaters were switching tabs on browsers while attempting these
method shows the best performance with precision at 86% and recall at questions. These tab switches and high scores represent suspicious and
71%. Therefore, we consider the RF classifier applied to the top five possibly illicit behaviour. Another influencing feature is Time-LOTS. It
features selected by the MI method as our best model. Next, the best was observed (Figs. 9b and 10b) that cheaters took approx. 20 s more to
model was further validated on a real-world dataset, where we achieved solve a LOTS question as compared to honest students. We hypothesize
an accuracy of 82.5%. Since we obtained good performance for both that the cheaters might be taking relatively more time because searching
datasets, we consider that the findings can be generalized well beyond the Internet for correct solutions takes time. We also found that features
10
Fig. 9. Scores (a) and Time taken (b) across different question categories in Dataset1.
Fig. 10. Scores (a) and Time taken (b) across different question categories in Dataset2.
such as Score-HOTS and Time-HOTS do not appear in the top influential browsers can only limit the use of the Internet on the assessment device;
features. We can assume that as HOTS questions are not easy to find on while there is always a possibility that a student may own multiple
the Internet, both cheaters and honest students showed similar perfor Internet-enabled devices for taking the assessment and performing
mances (Figs. 9a and 10a) and response times (Fig. 9b) on these ques cheating separately. Our method can be implemented in scenarios where
tions. This is an important finding that suggests that a good combination cheating takes place on a separate device as it is based on the test-taking
of both HOTS and LOTS questions is ideal for designing cheat-proof pattern irrespective of the utilized Internet resource. Another typical
assessments. pitfall of many research studies on academic cheating is the absence of
Another influencing feature is found to be Score-HighDiff, which ground truth that students are performing cheating behaviours. How
represents the score of any student on the high difficulty questions i.e., ever, this major drawback was diminished in our study where we used
the questions that most of the class could not solve. From Figs. 9a and three methods - simulation methods, self-reports and tab switch records
10a, it was observed that cheaters solved much more high difficulty to determine the ground truth and assess the true detection rate. The
questions when compared with honest students. This shows that proposed approach is cost-effective, easy to implement and performs
cheaters were solving questions that were deemed challenging for the automatic detection of cheaters in online assessments.
rest of the cohort. Considering the tab switch count of cheaters, it can be The current study and its findings have some limitations. The first
assumed that they may have achieved this by cheating from the Internet. one is the limited size of the dataset. In educational data mining and
Furthermore, on low difficulty questions, we see a significant difference learning analytics research, small datasets are commonly used as it is
between the average response times of honest students and cheaters i.e., challenging to reach the volume of educational data, which would fully
53 sec and 73 sec, respectively, in both datasets (Figs. 9b and 10b). satisfy the machine learning requirements besides using logs (Kabathova
Overall, we witness that cheaters tend to have higher response times due & Drlik, 2021). Moreover, unlike other application domains of machine
to their cheating activities. Implementation of timers on individual learning models, we could not combine different datasets as the indi
questions of the assessments can be a possible solution to restrict the vidual records are student and assessment-dependent. The dataset size
time for cheating activities. used in this study represents a typical classroom size of traditional off
Another influencing feature is found to be Flag that represents the line courses and small private online courses. Nonetheless, the results
outlying response times of any student compared to the class average. show that with special feature engineering methods, machine learning
We hypothesise that these outlying response times are linked to cheating with small datasets has huge potential for detecting dishonest behav
behaviours. We cannot necessarily say that the student is cheating when iours in online assessments. To further achieve confidence in our results,
showing prolonged or shortened response times; however, frequent we tested our algorithm on a real-world dataset that showed similar
occurrences of such behaviours can be deemed as suspicious for further performance, suggesting the findings can be generalized beyond the
scrutiny. We also found that except for Flag, all other features based on datasets used in this study. In future, we will explore deep learning
student attempts and student-class attempts showed limited contribu methods when the dataset size is larger. Another limitation of this study
tion to the model. is the existence of a few false alarms. Our algorithm lies on the conser
The existing cheating mitigation methods like the use of secure vative side with high precision, reducing the number of honest students
11
falsely classified as cheaters. Because of the false alarms, it is suggested Alexandron, G., Ruipérez-Valiente, J. A., Chen, Z., Muñoz-Merino, P. J., &
Pritchard, D. E. (2017). Copying@Scale: Using harvesting accounts for collecting
that this algorithm may be best implemented as a filter for human re
correct answers in a MOOC. Computers and Education, 108(March), 96–114. https://
view. Future researchers can also look into the combination of present doi.org/10.1016/j.compedu.2017.01.015
work with techniques like facial expression recognition, keystroke dy Alexandron, G., Ruipérez-Valiente, J. A., & Pritchard, D. E. (2019). Towards a general
namics, etc, to further improve the performance. The third limitation of purpose anomaly detection method to identify cheaters in massive open online
courses. In EDM 2019 - Proceedings of the 12th International Conference on Educational
this study is the fact that the approach is applicable for assessments that Data Mining (pp. 480–483). https://doi.org/10.35542/osf.io/wuqv5
are designed using Bloom’s Taxonomy. However, past research and our Amigud, A., Arnedo-Moreno, J., Daradoumis, T., & Guerrero-Roldan, A. E. (2017). Using
findings show that assessments designed using different levels of BT are learning analytics for preserving academic integrity. International Review of Research
in Open and Distance Learning, 18(5), 192–210. https://doi.org/10.19173/irrodl.
beneficial in not only holistic evaluation of the learning outcomes but v18i5.3103
also in controlling cheating. Amigud, A., Arnedo-Moreno, J., Daradoumis, T., & Guerrero-Roldan, A. E. (2018). An
integrative review of security and integrity strategies in an academic environment:
Current understanding and emerging perspectives. Computers and Security, 76,
8. Conclusion 50–70. https://doi.org/10.1016/j.cose.2018.02.021
Asep, H. S. G., & Bandung, Y. (2019). A design of continuous user verification for online
In this study, we used machine learning and feature engineering exam proctoring on M-learning. In Proceedings of the international conference on
electrical engineering and informatics, 2019-July(July), 284–289. https://doi.org/
methods for the detection of Internet cheating in online assessments. We 10.1109/ICEEI47359.2019.8988786.
derived thirteen features from the assessment log files and categorized Awasthi, S. (2019). Plagiarism and academic misconduct: A systematic review. DESIDOC
them into four categories, namely, Question-type features, Question- Journal of Library and Information Technology, 39(2), 94–100.
Balderas, A., & Caballero-Hernández, J. A. (2020). Analysis of learning records to detect
difficulty features, Student-Class attempt features and Student attempt student cheating on online exams: Case study during COVID-19 pandemic. ACM
features. We created different models using two feature selection International Conference Proceeding Series, 752–757. https://doi.org/10.1145/
methods (MI and ANOVA) and five ML classifiers (LR, SVM, KNN, NB 3434780.3436662
Butler-Henderson, K., & Crawford, J. (2020). A systematic review of online
and RF) and compared their classification performance. The experi examinations: A pedagogical innovation for scalable authentication and integrity.
mental results indicate that (1) Features based on question type and Computers and Education, 159(September). https://doi.org/10.1016/j.
question difficulty led to better classification performance; (2) Except compedu.2020.104024
Chang, C.-Y., & Mao, S.-L. (1999). The effects on students’ cognitive achievement when
for Flag, Student-class attempt features and Student-attempt features
using the cooperative learning method in earth science classrooms. School Science
showed limited contribution in the model and (3) Among all the and Mathematics, 99(7), 374–379. https://doi.org/10.1111/j.1949-8594.1999.
considered classifiers, the RF classifier outperforms with the top five tb17497.x
Chuang, C. Y., Craig, S. D., & Femiani, J. (2017). Detecting probable cheating during
features selected by MI method with an accuracy of 85%. The methods
online assessments based on time delay and head pose. Higher Education Research and
and models were tested on two online assessment datasets collected in Development, 36(6), 1123–1137. https://doi.org/10.1080/07294360.2017.1303456
different time periods. The findings offer a significant contribution to the Cook, R. D. (1977). Detection of influential observation in linear regression.
literature, as the study confirms the possibility of building an automated Technometrics, 19(1), 15–18. https://doi.org/10.1080/00401706.1977.10489493
Costley, J. (2019). Student perceptions of academic dishonesty at a cyber-University in
system that could detect probable cheaters in online assessments, as South Korea. Journal of Academic Ethics, 17(2), 205–217. https://doi.org/10.1007/
proposed in ADMP (Garg & Goel, 2022). The proposed approach helped s10805-018-9318-1
discover various patterns that are required to understand the behaviour Cramp, J., Medlin, J. F., Lake, P., & Sharp, C. (2019). Lessons learned from implementing
remotely invigilated online exams. Journal of University Teaching and Learning
of cheating students. Implications for online instructors and practi Practice, 16(1).
tioners for designing cheat-proof assessments are also provided. Our Crook, C., & Nixon, E. (2019). The social anatomy of ‘collusion’. British Educational
work also opens the possibility of more research built on the proposed Research Journal, 45(2), 388–406. https://doi.org/10.1002/berj.3504
El-Nakla, D., McNally, B., & El-Nakla, S. (2019). The importance of institutional support
methodology, thus opening new avenues for future researchers who are in maintaining academic rigor in e-learning assessment. In 2019 2nd International
focused on finding machine learning solutions for establishing integrity conference on new trends in computing sciences - Proceedings, 1–5. https://doi.org/
in online learning environments. 10.1109/ICTCS.2019.8923111
Gagné, A. (2020). Reflections on academic integrity and educational development during
COVID-19. Canadian Perspectives on Academic Integrity, 3(2), 16–17.
Funding Garg, M., & Goel, A. (2022). A systematic literature review on online assessment security:
Current challenges and integrity strategies. Computers and Security, 113, Article
102544. https://doi.org/10.1016/j.cose.2021.102544
This research did not receive any specific grant from funding Ghizlane, M., Hicham, B., & Reda, F. H. (2019). A new model of automatic and
agencies in the public, commercial, or not-for-profit sectors. continuous online exam monitoring. In Proceedings - 2019 4th International conference
on systems of collaboration, big data, internet of things and security, SysCoBIoTS 2019,
1–5. https://doi.org/10.1109/SysCoBIoTS48768.2019.9028027.
CRediT authorship contribution statement Golden, J., & Kohlbeck, M. (2020). Addressing cheating when using test bank questions
in online classes. Journal of Accounting Education, 52(xxxx), Article 100671. https://
Manika Garg: Conceptualization, Methodology, Writing – original doi.org/10.1016/j.jaccedu.2020.100671
Gomez, M. J., Calderón, M., Sánchez, V., Clemente, F. J. G., & Ruipérez-Valiente, J. A.
draft, Writing – review & editing. Anita Goel: Supervision, Writing – (2022). Large scale analysis of open MOOC reviews to support learners’ course
review & editing. selection. Expert Systems with Applications, 210(July), Article 118400. https://doi.
org/10.1016/j.eswa.2022.118400
Hira, Z. M., & Gillies, D. F. (2015). A review of feature selection and feature extraction
Declaration of Competing Interest methods applied on microarray data. https://doi.org/10.1155/2015/198363.
Hylton, K., Levy, Y., & Dringus, L. P. (2016). Utilizing webcam-based proctoring to deter
misconduct in online exams. Computers and Education, 92–93, 53–63. https://doi.
The authors declare that they have no known competing financial org/10.1016/j.compedu.2015.10.002
interests or personal relationships that could have appeared to influence Jaramillo-Morillo, D., Ruipérez-Valiente, J., Sarasty, M. F., & Ramírez-Gonzalez, G.
(2020). Identifying and characterizing students suspected of academic dishonesty in
the work reported in this paper.
SPOCs for credit through learning analytics. International Journal of Educational
Technology in Higher Education, 17(1), 45. https://doi.org/10.1186/s41239-020-
References 00221-2
Kabathova, J., & Drlik, M. (2021). Towards predicting student’s dropout in university
courses using different machine learning techniques. Applied Sciences (Switzerland),
Ahmad Muhammad, B., Qi, C., Wu, Z., & Kabir Ahmad, H. (2022). GRL-LS: A learning
11(7). https://doi.org/10.3390/app11073130
style detection in online education using graph representation learning. Expert
Kamalov, F., Sulieman, H., & Calonge, D. S. (2021). Machine learning based approach to
Systems with Applications, 201(June 2021), 117138. https://doi.org/10.1016/j.
exam cheating detection. PLoS ONE, 16(8 August). https://doi.org/10.1371/journal.
eswa.2022.117138.
pone.0254340
Alessio, H. M., Malay, N., Maurer, K., Bailer, A. J., & Rubin, B. (2018). Interaction of
King, M. R. (2023). A Conversation on artificial intelligence, chatbots, and plagiarism in
proctoring and student major on online test performance. International Review of
higher education. Cellular and Molecular Bioengineering, 16(1), 1–2. https://doi.org/
Research in Open and Distance Learning, 19(5), 166–185. https://doi.org/10.19173/
10.1007/s12195-022-00754-8
irrodl.v19i5.3698
12
Kharbat, F. F., & Abu Daabes, A. S. (2021). E-proctored exams during the COVID-19 Ramnial, H., Panchoo, S., & Pudaruth, S. (2016). Authorship attribution using stylometry
pandemic: A close understanding. Education and Information Technologies. https:// and machine learning techniques. In Advances in intelligent systems and computing
doi.org/10.1007/s10639-021-10458-7 (Vol. 384, pp. 113–125). https://doi.org/10.1007/978-3-319-23036-8_10.
Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. Ranger, J., Schmidt, N., & Wolgast, A. (2020). The detection of cheating on E-exams in
https://doi.org/10.1038/nature14539 higher education—The performance of several old and some new indicators.
Li, H., Xu, M., Wang, Y., Wei, H., & Qu, H. (2021). A visual analytics approach to facilitate Frontiers in Psychology, 11(October), 1–16. https://doi.org/10.3389/
the proctoring of online exams. https://doi.org/10.1145/3411764.3445294. fpsyg.2020.568825
Ljubovic, V., & Pajic, E. (2020). Plagiarism detection in computer programming using Rogerson, A. M., & McCarthy, G. (2017). Using Internet based paraphrasing tools:
feature extraction from ultra-fine-grained repositories. IEEE Access, 8, 96505–96514. Original work, patchwriting or facilitated plagiarism? International Journal for
https://doi.org/10.1109/ACCESS.2020.2996146 Educational Integrity, 13(1). https://doi.org/10.1007/s40979-016-0013-y
McCabe, D. L., Treviño, L. K., & Butterfield, K. D. (2001). Cheating in academic Ruipérez-Valiente, J. A., Jaramillo-Morillo, D., Joksimović, S., Kovanović, V., Muñoz-
institutions: A decade of research. Ethics and Behavior, 11(3), 219–232. https://doi. Merino, P. J., & Gašević, D. (2021). Data-driven detection and characterization of
org/10.1207/S15327019EB1103_2 communities of accounts collaborating in MOOCs. Future Generation Computer
Mellar, H., Peytcheva-Forsyth, R., Kocdar, S., Karadeniz, A., & Yovkova, B. (2018). Systems, 125, 590–603. https://doi.org/10.1016/j.future.2021.07.003
Addressing cheating in e-assessment using student authentication and authorship Russell, S., & Norvig, P. (2009). Artificial intelligence: A modern approach. Prentice Hall
checking systems: Teachers’ perspectives. International Journal for Educational Press.
Integrity, 14(1). https://doi.org/10.1007/s40979-018-0025-x Salhofer, P. (2017). Analysing student behavior in CS courses. 2017 IEEE global
Mungai, P. K., & Huang, R. (2017). Using keystroke dynamics in a multi-level engineering education conference (EDUCON), April, 1426–1431.
architecture to protect online examinations from impersonation. 2017 IEEE 2nd Sangalli, V. A., Martinez-Munoz, G., & Canabate, E. P. (2020). Identifying cheating users
International conference on big data analysis, ICBDA 2017, 622–627. https://doi.org/ in online courses. IEEE global engineering education conference, EDUCON, 2020-April,
10.1109/ICBDA.2017.8078710. 1168–1175. https://doi.org/10.1109/EDUCON45650.2020.9125252.
Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E. B., & Turaga, D. (2017). Learning Singh, R., Timbadia, D., Kapoor, V., Reddy, R., Churi, P., & Pimple, O. (2021). Question
feature engineering for classification. IJCAI International Joint Conference on Artificial paper generation through progressive model and difficulty calculation on the
Intelligence, (August), 2529–2535. https://doi.org/10.24963/ijcai.2017/352 Promexa Mobile Application. Education and Information Technologies. https://doi.
Nguyen, J. G., Keuseman, K. J., & Humston, J. J. (2020). Minimize online cheating for org/10.1007/s10639-021-10461-y
online assessments during covid-19 pandemic. Journal of Chemical Education, 97(9), Steger, D., Schroeders, U., & Wilhelm, O. (2021). Caught in the act: Predicting cheating
3429–3435. https://doi.org/10.1021/acs.jchemed.0c00790 in unproctored knowledge assessment. Assessment, 28(3), 1004–1017. https://doi.
Noorbehbahani, F., Mohammadi, A., & Aminazadeh, M. (2022). A systematic review of org/10.1177/1073191120914970
research on cheating in online exams from 2010 to 2021. In Education and Thompson, E., Grove, H., Luxton-reilly, A., Whalley, J. L., & Robbins, P. (2008). Bloom’s
information technologies (Issue 0123456789). Springer US. https://doi.org/10.1007/ Taxonomy for CS assessment. Tenth Australasian computing education conference
s10639-022-10927-7. (ACE2008), 78(January), 1–8.
Okada, A., Whitelock, D., Holmes, W., & Edwards, C. (2019). e-Authentication for online Trezise, K., Ryan, T., de Barba, P., & Kennedy, G. (2019). Detecting contract cheating
assessment: a mixed-method study. British Journal of Educational Technology, 50(2), using learning analytics. Journal of Learning Analytics, 6(3), 90–104. https://doi.org/
861–875. https://doi.org/10.1111/bjet.12608 10.18608/jla.2019.63.11
Opgen-Rhein, J., Küppers, B., & Schroeder, U. (2018). An application to discover Ullah, A., Xiao, H., & Barker, T. (2019). A dynamic profile questions approach to mitigate
cheating in digital exams. ACM International Conference Proceeding Series, 10, impersonation in online examinations. Journal of Grid Computing, 17(2), 209–223.
3279740. https://doi.org/10.1007/s10723-018-9442-6
Pappas, E., Pierrakos, O., & Nagel, R. (2013). Using Bloom’s taxonomy to teach von Grunigen, D., de Azevedo e Souza, F. B., Pradarelli, B., Magid, A., & Cieliebak, M.
sustainability in multiple contexts. Journal of Cleaner Production, 48, 54–64. https:// (2018). Best practices in e-assessments with a special focus on cheating prevention.
doi.org/10.1016/j.jclepro.2012.09.039 N 2018 IEEE global engineering education conference (EDUCON), 893–899. https://doi.
Pham, T. T., Nguyen, D. N., Dutkiewicz, E., McEwan, A. L., Thamrin, C., Robinson, P. D., org/10.1109/EDUCON.2018.8363325.
& Leong, P. H. W. (2016). Feature engineering and supervised learning classifiers for Wang, X., Wang, W., He, Y., Liu, J., Han, Z., & Zhang, X. (2017). Characterizing Android
respiratory artefact removal in lung function tests. 2016 IEEE global communications apps’ behavior for effective detection of malapps at large scale. Future Generation
conference, GLOBECOM 2016 - Proceedings. https://doi.org/10.1109/ Computer Systems, 75, 30–45. https://doi.org/10.1016/j.future.2017.04.041
GLOCOM.2016.7841839.
13

Expert Systems With Applications: Manika Garg, Anita Goel

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Expert Systems With Applications: Manika Garg, Anita Goel

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Expert Systems With Applications: Manika Garg, Anita Goel

Uploaded by

Copyright:

Available Formats

Expert Systems With Applications 225 (2023) 120111

Contents lists available at ScienceDirect

Expert Systems With Applications

Preserving integrity in online assessment using feature engineering and

Fig. 1. Sample of assessment log file.

1 Question Type Score-HOTS SHOTS

unanswered) respectively by student i(i = 1⋯N) on question j(j =

• Visit – This feature represents the average number of views per

The purpose of our study is to develop a machine learning model to

Fig. 2. The framework of building the cheating detection model.

Fig. 3. Score of features selected by MI.

Fig. 4. Score of features selected by ANOVA.

Fig. 5. Performance of different classifiers applied to Full Features.

Cheating is a severe problem in both online and offline assessments

Fig. 8. Performance on both datasets.

You might also like