
Data-Driven FGCS

Method for identifying fraud in online courses


Future Generation Computer Systems 125 (2021) 590–603

Contents lists available at ScienceDirect

Future Generation Computer Systems


journal homepage: www.elsevier.com/locate/fgcs

Data-driven detection and characterization of communities of accounts collaborating in MOOCs

José A. Ruipérez-Valiente a,∗, Daniel Jaramillo-Morillo b, Srećko Joksimović c, Vitomir Kovanović c, Pedro J. Muñoz-Merino d, Dragan Gašević e

a Department of Software Engineering and Artificial Intelligence, Complutense University of Madrid, Spain
b Departamento de Telemática, Universidad del Cauca, Popayán, Colombia
c Education Futures, University of South Australia, Australia
d Department of Telematics Engineering, Universidad Carlos III de Madrid, Spain
e Faculty of Information Technology, Monash University, Australia

Article info

Article history:
Received 2 December 2020
Received in revised form 7 June 2021
Accepted 4 July 2021
Available online 13 July 2021

Keywords:
Learning analytics
Educational data mining
Collaborative learning
Massive open online courses
Artificial intelligence

Abstract

Collaboration is considered one of the main drivers of learning and it has been broadly studied across numerous contexts, including Massive Open Online Courses (MOOCs). The research on MOOCs has risen exponentially during the last years and there have been a number of works focused on studying collaboration. However, these previous studies have been restricted to the analysis of collaboration based on the forum and social interactions, without taking into account other possibilities such as the synchronicity in the interactions with the platform. Therefore, in this work we performed a case study with the goal of implementing a data-driven approach to detect and characterize collaboration in MOOCs. We applied an algorithm that detects synchronicity links between accounts, based on their submission times to quizzes, as an indicator of collaboration, and ran it on data from two large Coursera MOOCs. We found three different profiles of user accounts, which were grouped in couples and larger communities exhibiting different types of associations between user accounts. The characterization of these user accounts suggested that some of them might represent genuine online learning collaborative associations, but that in other cases dishonest behaviors such as free-riding or multiple account cheating might be present. These findings call for additional research on the kinds of collaboration that can emerge in online settings.

© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

∗ Corresponding author.
E-mail addresses: jruipere@ucm.es (J.A. Ruipérez-Valiente), dajaramillo@unicauca.edu.co (D. Jaramillo-Morillo), Srecko.Joksimovic@unisa.edu.au (S. Joksimović), vitomir.kovanovic@unisa.edu.au (V. Kovanović), pedmume@it.uc3m.es (P.J. Muñoz-Merino), dragan.gasevic@monash.edu (D. Gašević).

https://doi.org/10.1016/j.future.2021.07.003
0167-739X/© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Massive Open Online Courses (MOOCs) are online courses that cater to large numbers of students, are designed for open participation and can be accessed by anyone via the Internet [1,2]. MOOCs have become a promising worldwide educational medium that has attracted much attention from different stakeholders, and many institutions have chosen to incorporate them into their educational programs, including for academic credit [3,4]. The entrance of MOOCs in the higher education sector has also facilitated the collection of large amounts of data from students distributed around the globe, which in turn has helped data analytics in education thrive. The analysis of educational data can help improve the quality and effectiveness of teaching and learning, and transform the current infrastructure into a modernized data-driven higher education. Many studies have focused on analyzing and characterizing student behaviors in these courses and have thus generated inputs that can help improve the learning process in digitally-mediated educational environments [5,6].

On the other hand, MOOCs support the social constructivism theory of learning that enables group interaction, mutual work, discussion, and collaborative knowledge formation. In this way, collaboration is considered one of the main drivers of learning [7], and many learning theories promote the benefits of collaborative learning, both in face-to-face and online courses. It is therefore no surprise that numerous researchers have studied collaboration in MOOCs through the use of communication tools such as forums, or collaborative projects [8,9]. Teachers encourage student participation in the course through the technology and often use third-party tools and plugins to provide additional collaboration functionalities to students, such as social networks, messaging, or video conferencing tools [10]. Collaboration can also emerge through different student activities

such as commenting, responding, updating and sharing through discussion forums, increasing student participation [10,11]. In this way, learning can also arise from the connections between students in a spontaneous way and not only from the interaction with content. However, few authors have studied students’ collaborations beyond what is visible online, for example, through the analysis of the interaction with the courseware, such as the course navigation, content visualization, or the submission of the scheduled exams. Therefore, we find that this approach, which tries to reveal traces of collaboration that happen in the background, is currently missing in the literature.

Studies on collaboration typically take place in controlled environments or online classrooms where there are small numbers of students. However, MOOCs are a unique playground for examining how students collaborate at a larger scale [12]. Apart from having large amounts of data to analyze, MOOC students have very heterogeneous profiles, beliefs, and reasons to participate in the courses [13]. Previous work studied how students behave in an online course and much of the work highlights the benefits of collaboration in learning environments. However, it has also been found that not all collaborative student behaviors are good. Numerous unethical behaviors have been found, such as helping friends to pass their exams, or even using different accounts to obtain feedback through multiple attempts to questions [6]. For example, Hellas et al. [14], Lan et al. [15], and Waters et al. [16] identified potential unethical collaborations through the analysis of similarities in the scheduling of the activities taken, and the start and end times of take-home exams. Therefore, it is important to better understand how students are actually collaborating in MOOCs.

Besides, we found several previous studies on collaborative learning in MOOC environments focused on analyzing tools for course collaboration and the behavior of students in the discussion forums [10,17–19]. However, we did not find any work that performed a data-driven detection and characterization of collaborations based on students’ interaction data. This refers to ‘invisible collaborations’ that cannot be detected by simply looking at online social interaction in forums or similar collaborative tools. In this paper, we present a novel data-driven approach to characterize students’ collaborations in MOOCs. This work builds on top of an algorithm to detect collaborations that we developed in previous work [20], and that operationalizes collaboration as the synchronization of students when they submitted their quizzes to the MOOC platform. With respect to our previous study [20], this work takes place within the same context of Coursera MOOCs, using the same data set, and considering similar variables (Sections 3.1, 3.2, and 3.3). Then, we re-use the algorithm to detect collaborators from previous work [20] in the same context where it was previously applied and using the same parameters that we previously validated (Section 3.4). The new methodological contribution comes afterwards, by proposing a data-driven characterization of those accounts that were detected as collaborators (Section 3.5), which is completely novel in the literature. We present insights about the types of accounts and characterize the different emerging associations, while also connecting these findings with the current literature and theory. The methodology presented in this paper provides new ways to use digital trace data to understand e-learning collaboration and potentially provide feedback to instructors and students. Furthermore, because collaborations are not unique to MOOC courses, the depicted methodology can be re-used in new research applied to other online learning contexts. Specifically, we have the following Research Questions (RQs):

RQ1 What are the types of students’ accounts based on their interaction with the MOOC platform?
RQ2 What are the behavioral characteristics of the detected associations of accounts?

The remainder of the paper is organized as follows. Section 2 reviews the related work in the area of student behavioral modeling, collaboration in MOOCs, and academic dishonesty. Section 3 presents the methodology applied to conduct this research, while Section 4 describes the results regarding behavioral characterization of the different accounts and associations. Section 5 discusses the results in comparison with the literature and, finally, Section 6 concludes the paper.

2. Background

In this background, we focus on presenting an overview of the three research directions that are more closely related to our work. First, in Section 2.1 we review studies that have applied techniques from educational data mining and learning analytics to model student behavior. Then, in Section 2.2 we focus on the studies that have analyzed collaboration behavior in MOOCs. Finally, in Section 2.3 we examine studies that tackled academic dishonesty behaviors in MOOCs.

2.1. Analysis of student behavior in MOOCs

There is a high diversity in the kind of work published within the context of student modeling in MOOCs. Much of it has been focused on modeling students’ motivations to participate in these courses and their preferences [21–23]. A number of studies have specifically focused on students’ motivation with gamification features, for example, to analyze their perceptions toward earning badges in a gamified MOOC [24] or to propose metrics to infer which students are earning badges intentionally [25]. These studies aim to better understand the motivations of MOOC learners in order to adapt the materials and better cater to learners’ needs and interests.

Another predominant purpose of modeling students’ behavior has been to predict learners’ attrition in MOOCs. For example, Halawa et al. [26] presented a dropout predictor based on the interaction activity of students with the MOOC platform that can provide a trustworthy dropout risk factor. Ramesh et al. [22] also presented a framework for modeling and understanding student engagement in online courses based on trace data, using a probabilistic model to connect student behavior with course completion. These studies have sought the possibility to implement systems that can help improve MOOC completion.

Moreover, another key research line in MOOCs has been the investigation of which behaviors affect learning outcomes. For example, Al-Shabandar et al. [27] conducted two experiments to analyze which behavioral features were related to engagement levels and positive learning outcomes. In addition, Ruipérez-Valiente et al. [28] conducted a study on a Khan Academy instance building a prediction model of learning gains that included different activity indicators and behavioral data. They found a number of behaviors positively correlated with learning gains (e.g., students who follow recommendations made by course instructors), while others were negatively correlated (e.g., unreflective behaviors). Results from these kinds of studies can help instructors and researchers understand which behaviors can have a positive or negative impact on learning outcomes, and thus enable the possibility of promoting or discouraging certain behaviors.

A large body of clustering studies in MOOCs have applied these techniques to find different behavioral profiles of students based on how they interacted with the activities [25,29–31]; there are nuances between these studies, for example [25] aimed to infer profiles of engagement with respect to the gamification features

of Khan Academy, Chen et al. [31] focused on extracting self-regulated learning strategies patterns, and both [29] and [30] focused on extracting different subpopulations of learners based on how they engaged with the activities. Other studies have applied clustering for alternative purposes; for example, Li and Li [32] used clustering approaches to provide personalized recommendations of MOOCs to users based on their characteristics, and [33] applied it to study different profiles of participation in MOOC discussion forums. Moreover, clustering has also been used within MOOC studies for group formation purposes. For example, Lynda et al. [34] used it to group learners with similar profiles for the peer-review process and Sanz-Martínez et al. [35] used it to group alike learners for collaborative learning activities. As we see, the majority of the studies have used clustering either to find profiles of students in MOOCs, for recommendation purposes, or for group formation in order to develop some sort of activity between peers. However, to the best of our knowledge, clustering has not been applied within MOOCs for the purpose of characterizing collaborations.

The studies mentioned in this subsection have demonstrated diverse purposes to perform behavioral modeling of MOOC learners. However, even though student collaboration is one of the outstanding opportunities in MOOCs, few papers reported results regarding behavioral modeling that is performed to detect or characterize collaboration in MOOCs; our research study is focused in this direction.

2.2. Collaboration in MOOCs

Although numerous studies have focused on analyzing how students behave in MOOC environments, only a few of them have delved into students’ collaborations. In this direction, Claros et al. [36] presented several reflections about monitoring and assessment processes from two collaborative learning systems: The first one was defined with the aim of engaging students in a social process around the composition of interactive multimedia learning objects, while the second one sought to help the instructors in the design of collaborative learning scenarios with a set of services embedded into Moodle. By experimenting with these two collaborative learning approaches, the authors provided recommendations on how to apply these approaches to MOOCs in order to reduce instructors’ workload. However, they did not analyze the collaborations and interactions between students that took place in the courses.

On the other hand, the majority of MOOC platforms offer limited technical functionality for collaborative work. After examining the collaboration support across Coursera, edX, Udacity, and MiriadaX MOOC platforms, Staubitz et al. [17] encouraged future work to improve features to support collaborative learning in MOOCs. Based on the analysis, the authors implemented a set of tools that can support collaboration on the OpenHPI MOOC platform. This set of tools consisted of a general virtual space for collaborative online learning, which supports study groups, topic-centered learning, and teams in both public and private working groups. For online communication, a combination of synchronous and asynchronous tools was added, such as a lab collaboration space that provides learning groups with the opportunity to share artifacts. Staubitz and Meinel [37] continued this line of work by examining the practical implications of some forms of collaborative learning that were implemented in the OpenHPI platform. The most important conclusion of their study was that the number of participants contributing to the forum increased considerably when instructors participated in the collaborative process. Their results also confirmed that forum participation in MOOCs actually works better with a large number of participants, as both students and instructors are more active because there are more interactions in the forums.

Several studies have looked into the effects that collaboration may have on different learning outcomes in MOOCs. In this sense, Brooks et al. [38] investigated whether participating in a MOOC with friends or colleagues can improve both course completion and student social interaction during the course. In this study, they sent surveys to students to analyze those who enrolled with friends, and the results suggested that enrolling in a MOOC with peers correlated positively with course completion rate, level of achievement, and use of the discussion forum. They demonstrated that there was a positive effect on student academic achievement and an increased online interaction when students enrolled with friends or colleagues. Li et al. [39] investigated the benefits of collaborations in MOOCs through an inverted classroom case study. Their results suggested that students in MOOCs prefer to study in groups, and that social facilitation within study groups can make learning difficult concepts a more enjoyable experience. The students reported a high overall satisfaction with this study group learning approach and the research revealed that students liked to be in sync with the group while watching the MOOC videos and completing the assessments. However, neither of these two studies analyzed the actual behaviors that these students performed in the MOOC platforms while collaborating together.

Collaboration in MOOC discussion forums has also been a common topic in the literature [18,40]. For example, Cohen et al. [18] used learning analytics methods to retrieve and analyze data of students’ interaction with the course forums. The authors showed that 20% of the students were collaborating in the forums throughout the course and they were responsible for 50% of the total posts. Similarly, Ezen-Can et al. [40] presented a study of MOOC discussion forums with the aim of automatically extracting the structure of discussion posts to understand how students collaborate with each other.

Most studies on collaboration in MOOCs explored how students interacted through a collaboration tool or what benefits are gained from these collaborations. However, our approach is very different from these studies, as we use a data-driven algorithm to detect and characterize students’ accounts that are collaborating when there is no specific encouragement to collaborate or additional tools to do so. We seek to know how students collaborate and whether these collaborations are learning-oriented or geared towards effortlessly obtaining a certificate; no approaches like this one have been reported in the literature thus far.

2.3. Academic dishonesty in MOOCs

While collaboration has been depicted as a great opportunity to improve online learning [41], there is also a delicate line between healthy collaborations and academic dishonesty. Previous work has been exploring this issue, for example [16] presented a framework for detecting collaboration between students in online or take-home tests, which depending on the course rules could be labeled as academic dishonesty. The authors developed a method to detect collaborations by making use of the SPARFA (SPARse Factor Analysis) framework. With this, Lan et al. [15] proposed two Bayesian hypothesis tests to detect collaboration in educational data sets. The first test examines the number of matches between couples of students given by SPARFA and uses this information to infer the probability of collaboration. The second test examines the sequence of joint responses by couples of students using a specific model of collaboration and assesses the probability that such patterns will emerge independently. However, this method has not been tested in MOOCs.

Academic dishonesty in MOOCs has received much attention in the literature, where several authors have proposed algorithms for the detection of CAMEO (Copying Answers using Multiple

Existences Online) behaviors [6,42–44]. CAMEO is one of the reported methods of cheating in MOOCs, where harvester (fake) accounts are used to get correct answers using the automatic feedback of the system, which are then used by a master account to achieve the grade that allows the student to get a certificate. Bao [42], Northcutt et al. [43], and Alexandron et al. [6] presented algorithms for identifying student submissions that were performed applying this CAMEO method; the algorithms were based on several heuristics that make use (among other things) of the IP addresses of the students and the timestamps of the submissions. Moreover, Ruipérez-Valiente et al. [44] presented a supervised machine learning algorithm that detected CAMEO without using IP addresses by using a previously labeled sample of CAMEO submissions. This algorithm used as input several features about the submissions, students, and the design of the problem to predict the likelihood of a submission being completed using CAMEO.

Following the line of data-driven detection of academic dishonesty, Ruipérez-Valiente et al. [20] proposed an algorithm that detects collaboration links between students in online learning environments, which is the one that we use in this study. Specifically, the study presented a method developed to detect links between students based on the students’ temporal closeness or synchronization when submitting their quizzes [20]. The study found that the detected students needed significantly less activity with the courseware to get a certificate of completion. However, the authors concluded the paper indicating that more work was needed in the future to characterize students’ behaviors based on the interaction data with the platform to determine whether students were involved in any behaviors that can be characterized as dishonest, which is our goal in this study.

Overall, we have detected a consistent gap in the literature that warrants a need to propose a data-driven method to characterize collaborations in MOOCs. This can be particularly important to differentiate between fruitful collaborations and dishonest behaviors that can lead to free-riding [45,46]. In this manuscript, we address this gap by implementing the aforementioned method to detect collaborations [20], and then we perform a novel data-driven characterization of the different associations that we have detected.

3. Methodology

3.1. Context of the study

The data used in the study comes from two MOOCs offered on the Coursera platform by a large research university in the United Kingdom. First, Introduction to Philosophy (PHIL), which presents the main areas of research in contemporary philosophy, and Fundamentals of Music Theory (MUSIC), which introduces students to the theory of music providing basic skills to read and write Western music notation.

From an instructional design perspective, both courses implemented auto-graded quizzes every week, lasting seven and five weeks respectively. Both courses had one graded quiz per week, with around 6–12 (PHIL) and 10–14 (MUSIC) questions per quiz. Since our algorithm relies on finding synchronous submissions to quizzes, the fact that both MOOCs have weekly quizzes and large numbers of students were our primary reasons to select them. The passing grade of PHIL was 50 points and the one for MUSIC 65 points, over a total of 100 possible points in both cases. The students did not receive any specific instructions to encourage collaboration, and therefore we assume that students either knew each other beforehand or met while taking the course.

3.2. Data collection

We used Coursera’s raw student interaction data, which included actions and clicks performed by the student while interacting with the MOOCs. Coursera provides raw SQL exports, clickstream logs, and demographic data for session-based courses. The SQL exports of the course can be imported into a relational database and queried via traditional SQL statements.

A total of 53,831 and 89,896 students enrolled in the PHIL and MUSIC MOOCs, respectively. Since the focus of the study is to detect collaboration across the course, we filtered out those students that did not persist through it. We operationalized this by selecting a sub-sample of only those students that submitted all the quizzes in a course. The final number of students that passed this criterion and are included in the study is 2359 (4.38% from total) and 5159 (5.73% from total) students from the PHIL and MUSIC courses, respectively.

3.3. Considered variables

We implemented scripts to perform feature engineering based on the raw data provided by Coursera. We decided to implement metrics related to different dimensions: the academic engagement (grades and submissions) and behavioral engagement with the platform (general activity levels, interaction with videos and discussion forums). The rationale to select these dimensions was based on having different aspects to characterize the collaborations. The initial selection of features was based on the experience of the co-authors in MOOC research. For the academic engagement we implemented the following features:

• FinalGrade: The final numeric course grade (between 0 and 100).
• GotCertificate: Boolean variable indicating whether a given student obtained a certificate in a given course or not.
• SubmissionCount: The total number of submissions to graded assignments that a particular student attempted.
• SubmissionUnique: The number of submissions to different graded assignments that a particular student attempted.
• SubmissionAverage: The average number of submissions per graded assignment attempted.

Then, for the behavioral engagement, we implemented the following features for the general activity levels, videos, and discussion forums:

• ActiveDaysCount: The total number of days that a particular student was active in the course.
• ActiveWeeksCount: The total number of weeks that a particular student was active in the course.
• DistinctVideoCount: The total number of unique lecture videos accessed or downloaded by a given student.
• VideoSeekCount: The total number of video seek events generated by a given student.
• VideoPauseCount: The total number of pause events generated by a given student.
• DistinctThreadCount: The total number of unique discussion topics accessed by a given student.
• DistinctThreadsPosted: The total number of threads of discussion posted in the forum.
• DistinctCommentsPosted: The total number of comments posted in threads of discussion.

Fig. 1 shows a boxplot visualization with the distribution of all these features per course and divided for those that acquired a certificate or not. Moreover, we also computed the variables SubmissionTimes for the detection algorithm, and Order for the community characterization. These variables are defined as follows:

Fig. 1. Boxplot visualization of the continuous variables considered for this study separated by GotCertificate and computed for each course separately.

• SubmissionTimes: The list of timestamps of all submis- 3.4.2. Algorithm


sions to course graded problems by a given student. The algorithm that we implemented is based on the previous
• Order: For a pair of collaborator accounts, this variable work by Ruipérez-Valiente et al. [20] and consists on identifying
ranges from −1 to 1 indicating the order in which the user accounts on the MOOC platform that always submit their
submissions were done. A value of 1 signals that the first assignments very close in time. The algorithm provides a system-
account always submitted the quizzes before the second atic approach to detect synchronicity between students, which
account and analogously, a value of −1 indicates that the can be an indicator of collaboration, and can be easily applied to
first account always submitted the quizzes after the second any online environment where students have to complete certain
account. The values in between indicate relative difference learning activities.
between these two extremes. The algorithm is based on the comparison of the timestamps
of all quiz submissions done by a student with respect to the rest
3.4. Overview of the detection of collaborators

3.4.1. Definition of collaboration in this study

As we have seen in the related work, collaboration and collaborative learning have been defined and operationalized in many different ways. In this study, we focused on the previously reported notion of temporal synchronicity as a state in which the activities of a collaborating group are synchronized across time, that is, when group members are working on the same activity at the same time, we have that a collaboration is emerging [47]. A systematic literature review concluded that the temporal analysis in collaborative learning can help increase scholarly understanding in terms of theory and potential methodologies [48], which presents a strong alignment with our work. In our case scenario, we detected this synchronicity via students’ timestamps when they submitted their quizzes. The rationale is that the statistical likelihood of two or more accounts submitting their quizzes at almost the same time every week is very low, especially given that these courses do not have due dates. For example, given that the MUSIC course had seven quizzes and no due dates, the probability of finding by chance a community of four students that always submitted their quizzes within the same five-minute time window is extremely low. We refer to this as ‘invisible collaborations’ that cannot be detected by simply looking at online social interaction in forums or other social tools. These are the underlying conceptual foundations of the algorithm that we detail next and the rationale why we selected it.

... of the students of the course and calculating how close they are in time, thus obtaining a distance matrix DS. The algorithm uses a dissimilarity matrix $DS \in \mathbb{R}^{N \times N}$ as follows:

$$
DS = \begin{pmatrix}
ds_{1,1} & ds_{1,2} & ds_{1,3} & \cdots & ds_{1,N} \\
ds_{2,1} & ds_{2,2} & ds_{2,3} & \cdots & ds_{2,N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
ds_{N,1} & ds_{N,2} & ds_{N,3} & \cdots & ds_{N,N}
\end{pmatrix} \tag{1}
$$

Each entry $ds_{i,j}$ is a real number representing the dissimilarity between students i and j based on the differences in their assignment submission times, where N is the number of students in the course. Each element of matrix DS is calculated by a chosen dissimilarity function $diss(\vec{sp}_i, \vec{sp}_j) \in \mathbb{R}$ which operates on vectors of student submission timestamps, with $\vec{sp}_i$ defined as:

$$
\vec{sp}_i = [sp_{i,1} \;\; sp_{i,2} \;\; \cdots \;\; sp_{i,M}], \quad i \in \{1, \ldots, N\} \tag{2}
$$

where $sp_{i,1}$ would be the timestamp of the submission to quiz 1 of student i, computed based on the variable SubmissionTimes, and M is the number of quizzes in that course. After the DS matrix with the distances between all course participants is computed, we establish a threshold to classify a couple of students as collaborators. Then, we extract from matrix DS all unique entries $ds_{i,j}$ where the value of the cell is below said threshold.

3.4.3. Collaborators detected

In this study, the dissimilarity measure is the mean absolute deviation (MAD), since it provides a comprehensive value to understand how closely two students submit their exams.
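As a rough illustration of the detection procedure, the sketch below computes the pairwise dissimilarity between submission-time vectors as the mean of the absolute differences per quiz and keeps the unique pairs that fall below a threshold. The function names, the toy timestamps (minutes since course start), and the assumption that every account has one timestamp for every quiz are illustrative choices, not the authors' implementation.

```python
from itertools import combinations


def mad_dissimilarity(sp_i, sp_j):
    """Mean absolute difference between two equal-length timestamp vectors."""
    if len(sp_i) != len(sp_j):
        raise ValueError("both accounts need one timestamp per quiz")
    return sum(abs(a - b) for a, b in zip(sp_i, sp_j)) / len(sp_i)


def detect_pairs(submissions, threshold):
    """Return {(i, j): dissimilarity} for the unique pairs below the threshold.

    `submissions` maps an account id to its list of M submission timestamps.
    Only pairs with i < j are visited, i.e. the upper triangle of DS.
    """
    pairs = {}
    for i, j in combinations(sorted(submissions), 2):
        ds = mad_dissimilarity(submissions[i], submissions[j])
        if ds < threshold:
            pairs[(i, j)] = ds
    return pairs


# Toy data (made-up values): minutes since course start for three quizzes.
toy = {
    "a": [100.0, 5000.0, 9000.0],
    "b": [102.0, 5003.0, 9001.0],
    "c": [400.0, 7000.0, 12000.0],
}
collaborating = detect_pairs(toy, threshold=30.0)
```

In this toy example only accounts "a" and "b" are flagged, because their submissions are a few minutes apart on every quiz. Visiting only the upper triangle is enough because the dissimilarity is symmetric.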
The MAD measure is defined as follows:

$$
diss_{MAD}(\vec{sp}_i, \vec{sp}_j) = \frac{1}{M} \sum_{k=1}^{M} |sp_{i,k} - sp_{j,k}| \tag{3}
$$

We used a MAD threshold of 30 min, which is based on experimenting with different thresholds and dissimilarity measures in our previous study [20]. Based on this procedure, we detected the following collaborators:

• MUSIC: 30 couples, two three-member communities, one four-member community, three five-member communities, and one 14-member community. Overall, 99 different student accounts.
• PHIL: 11 couples and one four-member community. Overall, 26 different student accounts.

Fig. 2 shows a comparison of the selected indicators between those accounts detected as collaborators and those that are not detected. The differences between the two types of accounts are statistically significant, therefore confirming that we are detecting a different subpopulation of accounts.

Fig. 2. Boxplot visualization that shows differences in the selected indicators for those accounts detected as collaborators and the rest, separated by GotCertificate and computed for each course separately.

3.5. Overview of the community characterization

3.5.1. Clustering method and metrics

We used the IBM SPSS Statistics Two-Step clustering method [49]. As part of the options of the algorithm, we selected the Euclidean distance as distance measure, we let the algorithm decide the optimum number of clusters automatically (range 2–15), and we used as clustering criterion the Bayesian Information Criterion (BIC). We pre-scaled the input variables by computing the z-scores of each variable (i.e., $z = \frac{x - \mu}{\sigma}$, where µ is the mean and σ the standard deviation of x). The algorithm automatically performs the following two steps:

• First, it identifies the appropriate number of clusters through agglomerative hierarchical clustering. In order to select the appropriate number of clusters, it will maximize the silhouette coefficient value, as described by Kaufman and Rousseeuw [50].
• Second, it applies k-means with the identified optimal number of clusters and Euclidean distance as dissimilarity metric to assign each one of the students to a cluster.

We used the relative variable importance as provided by IBM SPSS Statistics Two-Step to evaluate the importance of each predictor, where for a certain variable i we have that:

$$
VI_i = \frac{-\log_{10}(sig_i)}{\max_{j \in \Omega}\left(-\log_{10}(sig_j)\right)} \tag{4}
$$

where Ω denotes the set of features introduced to the clustering algorithm, and $sig_i$ is the significance or p-value computed from applying a t-test or ANOVA when appropriate [49].

3.5.2. Selected variables

To avoid over-fitting of the relatively small data set, we decided to perform a feature selection to optimize the modeling. We made an initial run of the IBM SPSS Statistics Two-Step [49] with all of the continuous considered variables in Section 3.3; the algorithm was run separately for each one of the MOOCs. Then, we plotted their relative variable importance as shown in Fig. 3. We decided to keep the variable with the highest importance for each one of the dimensions that we indicated before; based on what we see in Fig. 3 and our own judgment as experts in this area, we selected FinalGrade, SubmissionCount, ActiveDaysCount, and DistinctVideoCount. We did not select any of the variables related to forum activity because all of them have low importance, and as we see in Fig. 1, the majority of learners did not interact with the forum.

Fig. 3. Bar plot of the relative variable importance after running the clustering method with all the continuous considered variables. Blue denotes that the variable was selected for the final clustering analysis.

3.5.3. Characterization of the couples and communities

Once we have detected those accounts that are collaborators, our first RQ is to characterize these accounts. To solve this problem, clustering techniques are normally applied when we do not have a clear idea of the underlying groups in a population, and subjects are then clustered on the basis of some inherent similarity among them [51]. Therefore, we apply the clustering methodology in Section 3.5.1 to find different types of student accounts based on their engagement with the learning platform. This clustering process is applied separately to PHIL and MUSIC collaborators. The silhouette coefficient value for PHIL is 0.7 and for MUSIC 0.6, which can be considered as good values [50], and thus we conclude that the final clusters are valid. This kind of clustering approach to find different profiles of students in MOOCs has been used successfully in previous studies [25,29–31].
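The quantities used in the clustering methodology (z-score pre-scaling, the relative variable importance ratio, and the silhouette coefficient used to judge cluster validity) can be sketched in plain Python as follows. This is only an illustrative sketch and not the IBM SPSS Two-Step implementation: the function names are hypothetical, the population standard deviation is an assumption, and the p-values passed to `relative_importance` are assumed to lie strictly between 0 and 1.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values as (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def relative_importance(p_values):
    """-log10(p) of each variable divided by the largest -log10(p)."""
    scores = {name: -math.log10(p) for name, p in p_values.items()}
    top = max(scores.values())
    return {name: s / top for name, s in scores.items()}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels):
    """Average silhouette width; requires at least two clusters."""
    widths = []
    for idx, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for jdx, (q, lab) in enumerate(zip(points, labels)):
            if jdx != idx:
                by_cluster.setdefault(lab, []).append(euclidean(p, q))
        if own not in by_cluster:  # singleton cluster: width 0 by convention
            widths.append(0.0)
            continue
        a = mean(by_cluster[own])  # mean distance within the own cluster
        b = min(mean(d) for lab, d in by_cluster.items() if lab != own)
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(widths)


# Toy example (made-up values): two well-separated one-dimensional clusters.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_width = silhouette(toy_points, toy_labels)
```

Values of the average silhouette width close to 1 indicate compact, well-separated clusters, which is the sense in which the reported 0.7 and 0.6 are read as good values.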
Then, we represent the student collaborations on a network graphic, where the nodes represent students, the edges link two students detected as collaborators, and the color of the node codifies the cluster assignment. This way, we represent collaborators in communities depending on how many accounts they were collaborating with. Finally, we analyze the indicators and clusters of the detected associations, connecting them with previously reported literature in order to perform a theory-driven validation of our findings.

Fig. 4. Clustering results showing a boxplot visualization of the input variables separated by cluster and course.

4. Results

4.1. RQ1. Types of accounts based on the clustering analysis

We applied the clustering methodology as described in Section 3.5 to classify student accounts based on their interactions with the MOOC platform. Fig. 4 shows a boxplot with the clustering results where each input indicator is separated by cluster (on the x-axis) and by course (top row for MUSIC and bottom row for PHIL). The highest relative variable importance for clustering lay in the variables FinalGrade and DistinctVideoCount. SubmissionCount had the lowest importance. As shown in the plot, the variance of SubmissionCount was the highest of all, and thus it was not the one defining the clusters. The three clusters obtained are described below:

• Cluster 1: This group is composed of 34.6% of the PHIL course accounts and 41.41% of the MUSIC course accounts. The accounts that belong to this cluster had a high FinalGrade and the highest median values for the ActiveDaysCount and DistinctVideoCount variables. Additionally, the variable SubmissionCount had a very high variance, thus there were different types of accounts regarding the amount of submissions. Overall, since this cluster had the highest values for the two activity variables (ActiveDaysCount and DistinctVideoCount), and also a high value of the FinalGrade variable, these accounts put effort and invested time on the course, achieving high grades and obtaining certificates of completion.
• Cluster 2: A total of 42.3% of the accounts of the PHIL course and 42.42% of the accounts of the MUSIC course belonged to this cluster. This cluster contains accounts that also had a high value of the FinalGrade variable. However, there were important differences in comparison to cluster 1 regarding the rest of the variables. Most importantly, we found that in terms of DistinctVideoCount, accounts in cluster 1 had a very high use of videos (most of the videos were seen by the users of the accounts in this cluster), whereas in cluster 2 this was quite the opposite case, where the users of most accounts watched very few videos. Additionally, the values of the SubmissionCount and ActiveDaysCount variables were also lower than in cluster 1. Therefore, the users of the accounts in this cluster achieved high grades and obtained certificates, and they were able to accomplish this by watching very few videos, being active fewer days and with fewer submissions than the users of the accounts in cluster 1. Our hypothesis is that either the students running these accounts already had prior knowledge regarding the topic of the course and they just solved the required activities to get the certificates, or they might have been performing some illicit actions as part of the collaboration that had facilitated their way into obtaining a certificate without much effort.
• Cluster 3: This group is composed of 23.1% of the PHIL course accounts and 16.16% of the MUSIC course accounts. The last cluster of user accounts is clearly distinguishable from the other two clusters by its FinalGrade, which was much lower than in the other two, with a median value of 50%. This means that most accounts in this cluster did not achieve a certificate of completion. The value of ActiveDaysCount was also the lowest one of all clusters, with very few days active. It is also interesting to see that the median value of SubmissionCount was higher than those of the other clusters in PHIL and higher than that of cluster 2 in the case of MUSIC. Therefore, although these accounts did not receive certificates and were active only very few days, they did make many submissions; in fact, this cluster has the highest median value of submissions in the PHIL course. Finally, for the DistinctVideoCount variable, in the case of PHIL, the median value was 0 and none of those accounts watched any videos; in the case of MUSIC the variable had a high variance and the median value was higher than that
for cluster 2. Our hypothesis is that this cluster of accounts represents the harvesting accounts that have been reported in previous research about CAMEO [6,43]; these accounts were created for the mere purpose of harvesting correct solutions by using exhaustive search (i.e., each quiz item has several attempts available and students receive feedback on the correctness after the submission). The correct solutions can be used later in the main account that would receive a certificate. This hypothesis is plausible since the accounts in cluster 3 did not achieve a certificate and were not very active in the course, but still made many attempts at the quizzes.

Finally, Fig. 5 shows networks of the couples and bigger communities that were detected by the algorithm. In these networks, the circle (node) represents each one of the accounts, and the line (edge) linking the accounts indicates that those two accounts were detected as collaborators. Additionally, the color of each circle represents the cluster assignment. For example, on the top-left network of the PHIL course, we see four accounts collaborating together, three from cluster 2 and one from cluster 1, and all of them are connected with each other. This way, we are able to see the different cluster associations in the couples and communities. Next, Sections 4.2.1 and 4.2.2 report the findings for the couples and communities detected, respectively, based on their cluster assignment and associations between accounts.

Fig. 5. Network graph of the couples and bigger communities detected by the algorithm and colored based on their cluster assignment. Each node represents an account, and the edge between two of them indicates the collaborating relationship.

Table 1
Examples of couples for each of the cluster associations found.

Association             Cluster  MAD    Order  Final Grade  Sub. Count  Act. Days Count  Dist. Video Count
Fruitful collaboration  1        2.65   +0.14  100          92          26               35
                        1                      100          12          16               32
Free-riding             1        17.07  +1     81           74          7                37
                        2                      98.6         16          19               1
Illicit collaboration   2        2.64   +0.71  97.1         7           5                0
                        2                      91.4         18          5                1
CAMEO helper            1        1.21   −1     94           28          11               38
                        3                      49           58          5                0
CAMEO premeditated      2        1.27   −1     96.4         7           14               0
                        3                      48.5         32          4                0

4.2. RQ2. Behavioral characteristics of the detected associations of accounts

4.2.1. Couples of accounts

This subsection describes the associations between the couples of accounts regarding their cluster assignment. Table 1 exemplifies each cluster association with the variables of one of the detected couples per association:

• Association 1 “Fruitful collaboration” (cluster 1 and cluster 1; PHIL 3/11 and MUSIC 5/30): This association represents two students from cluster 1 working together. As we reported in the previous subsection, the users of the accounts from cluster 1 put considerable amounts of effort on the platform to achieve certificates, with high values of ActiveDaysCount and DistinctVideoCount. Therefore, this association might represent two students that were taking the course seriously, and were collaborating reciprocally with each other in order to achieve better grades. In the example of this association in Table 1, the two accounts obtained the highest possible grade.
• Association 2 “Free-riding collaboration” (cluster 1 and cluster 2; PHIL 1/11 and MUSIC 11/30): This association represents one student of cluster 1 and one of cluster 2, which might be a genuine association between two real students; however, this relationship is not equitable. According to the chosen clustering variables, cluster 1 has a higher platform interaction than cluster 2, but in both clusters, high grades are achieved. In this association, the student of cluster 1 would put effort in their work on the platform, whereas
the student of cluster 2 did not make much effort but still would get a certificate with the help of the student of cluster 1. The value of the Order variable for the “Free-riding” collaboration was close to 1. That means that the account of cluster 1 almost always submitted the assignments before the peer, and we can see exactly that in the example of Table 1.
• Association 3 “Illicit collaboration” (cluster 2 and cluster 2; PHIL 1/11 and MUSIC 5/30): In this association both accounts belong to cluster 2, therefore this case represents two accounts that did not demonstrate much effort in the course in terms of videos watched or active days, but still were able to receive certificates of accomplishment.
• Association 4 “CAMEO helper” (cluster 1 and cluster 3; PHIL 1/11 and MUSIC 6/30): This association represents one account from cluster 1 and one from cluster 3. In this case, we have one account that achieved a certificate investing a significant effort, and a second one that could potentially be a harvesting account based on previous literature [43,44], since it did not achieve a certificate, watched only few videos, and made many submission attempts.
• Association 5 “CAMEO premeditated” (cluster 2 and cluster 3; PHIL 5/11 and MUSIC 3/30): This association represents one account from cluster 2 that was able to achieve a certificate with little effort and one from cluster 3 that could potentially be a harvesting account [6]. In both “CAMEO helper” and “CAMEO premeditated”, the Order variable tended to be close to −1, meaning that the account from cluster 3 almost always submitted the quiz first to get the correct responses. We can see this in both examples of Table 1.
• Association 6 (cluster 3 and cluster 3; PHIL 0/11 and MUSIC 0/30): We found no associations of two accounts of cluster 3. This makes sense as we generally label accounts from cluster 3 as harvesting accounts and hence it would not make much sense to find two of them coupled (unless the student dropped the course).

4.2.2. Communities of more than two accounts

In the case of the communities of accounts, it was harder to present an overall view, since the size and associations between the different members of the community varied from one case to another. Therefore, it was difficult to provide a systematic general approach to describe all communities. Instead, we delve into the specifics of two community examples. The extracted indicators for each member of the selected communities can be seen in Table 2:

Table 2
Description of the extracted indicators for each member of the two selected communities of accounts.

Community  Cluster  Final Grade  Sub. Count  Act. Days Count  Dist. Video Count
1          2        92.14        14          8                15
           1        91.79        14          8                38
           2        91.55        16          7                7
           2        88.57        14          12               13
2          1        56.47        71          8                20
           2        69.86        21          6                2
           2        79           5           10               0
           3        38.55        19          1                0
           2        80           27          25               4

• Community 1: The first community in Table 2 belongs to PHIL and is composed of three accounts from cluster 2 and one account from cluster 1. The account of cluster 1 watched all the videos in the course, whereas the rest of the accounts watched fewer videos. They had similar values for FinalGrade, ActiveDaysCount and SubmissionCount. Additionally, we can support our hypothesis with Fig. 6, where each quiz is represented on the x-axis and the time difference between the submissions of the accounts for that quiz on the y-axis. The plot shows that for Community 1, the submissions of all accounts for each quiz were always done within a 5 min timeframe (except for the submission of account 1 to Quiz 1). They always met one day each week (either a Monday or a Tuesday) and solved the weekly quiz together.
• Community 2: The second community represented in Table 2 is more complex than community 1 and belongs to MUSIC. There is one account from cluster 1, three from cluster 2 and one from cluster 3. With the exception of the account from cluster 1 that watched 20 videos, the rest of the accounts watched none or very few of them. For this community we found that all 25 submissions made by the 5 accounts were done in an interval of only 68 min during the same day.

5. Discussion

This section is divided into two parts: Section 5.1 discusses the results of the different types of associations that have been found, and Section 5.2 the potential implications.

5.1. Different types of associations

In the current study, we detected different collaboration behaviors among accounts, and we hypothesized that some of them could be strongly related to academic dishonesty in MOOCs, while others might be beneficial for students. We first applied the algorithm described in Section 3.4 to detect collaborators, and then implemented the clustering approach in Section 3.5 to characterize the collaborations. We remark that the main underlying idea for this characterization was that couples or communities of students detected by our method had always submitted their assignments very close in time to each other; therefore, this time closeness represents a suspicious and possibly an illicit behavior. One important finding was that despite the fact that we applied the cluster analysis to both MUSIC and PHIL courses separately, we obtained the same cluster types for courses of different topics, suggesting that this finding could generalize beyond the data set used in the current study. However, the transferability of the clustering results to other courses should be analyzed more deeply since just two courses were used in this study. In addition, we can observe some differences in the values of variables in both courses for some clusters, e.g., regarding the grade values. This could be due to the fact that the course difficulty in MUSIC and PHIL is different. Therefore, the course characteristics such as the difficulty of the topic should be taken into account when changing the context.

The clustering method detected three different clusters with different characteristics. The accounts in cluster 1 received a certificate by investing a great effort, they watched most videos, and were active many days. The accounts in cluster 2 made a small effort, they almost did not watch videos, were active a moderate amount of days, and made few submissions. Still, they managed to get high scores and received certificates. Finally, the accounts in cluster 3 were active few days and did not watch any videos, but they still made many quiz submissions and did not receive certificates. Thus, we were able to identify three different types of students’ collaborations in the form of couples (associations 1, 2, and 3 presented in Table 1).
We did not consider associations 4 and 5 as real collaborations; instead, we hypothesized these to be CAMEO [6,42,43], and hence both accounts in associations of type 4 and 5 were likely run by the same student. The discussion delves into these findings now.

Fig. 6. Time difference between the submissions of each one of the accounts of the community for each quiz. For example, in the case of Community 2, Accounts 3 and 4 submitted Quiz 1 first, and Accounts 1 and 2 submitted 10 and 12 min later, respectively.

Associations 1 “Fruitful collaboration” are composed of accounts that potentially worked together and had high degrees of commitment according to the variables ActiveDaysCount and DistinctVideoCount. These associations can potentially represent two students who made an effort in the course by watching videos, tried to learn and understand the contents, and met to submit their assignments together, potentially solving the quizzes together in a collaborative way in a sort of equitable relationship. The motivation here can be the ambition to improve the grades, and we might argue that this relationship does not represent a severe problem for the learning process of these students. The two accounts would work together to achieve high grades and, in the same way, they showed an effort on the platform. This is the only type of association that demonstrated a behavior that has the more positive characteristics of collaborative learning [52].

Associations 2 “Free-riding” might represent an inequitable collaboration. In this case, there was a less balanced interaction where students from cluster 1 potentially have a passive attitude and pass the answers to the students from cluster 2 (potentially a friend or acquaintance), so that this latter account could obtain a certificate without investing much effort in the course, practicing the behavior known as free-riding [45,46]. Indeed, the literature has reported that one typical behavior toward cheating is that one copies from the other (‘active’), and the other allows the others to copy (‘passive’) [53]. This definition resembles quite well this situation where the student of cluster 1 would usually submit an assignment before the student in cluster 2, as the variable Order has values close to 1. Additionally, letting others copy from you is regarded as less severe than actually copying from others [54]. In the case of this specific association, the impact on the learning process of the students from cluster 2 is obviously more severe.

In a similar way, in associations 3 “Illicit collaboration”, the accounts had very limited interaction with the platform but both obtained a certificate. Thus, we can assume that these associations performed some kind of strategy that allowed them to get answers to exam questions without studying the contents of the course. In this case, students might have been applying “gaming the system” strategies, where a learner attempts to succeed in an educational environment by exploiting properties of the system’s help and feedback rather than by attempting to learn the material in order to accomplish a passing grade without investing the necessary effort [55]. This kind of behavior can be severe for the learning process, since in several studies authors found gaming the system behaviors to be associated with poor learning outcomes [56,57]. This can also affect the future beliefs and attitudes of these students, as they might come to think that they are able to accomplish goals without putting in much effort.

Finally, we have associations 4 “CAMEO helper” (cluster 1 and cluster 3) and associations 5 “CAMEO premeditated” (cluster 2 and cluster 3). As cluster 3 had a low level of interaction with the content on the platform, many submission attempts and low scores without receiving certificates, the hypothesis that we described was that these were harvesting accounts as described extensively in the CAMEO literature [6,42,43]. Since accounts from cluster 3 were present in both associations, most probably these were CAMEO associations. Therefore, we infer that in these two association types, both accounts were likely managed by the same student. First, the association between a cluster 1 account and a cluster 3 account might represent a slightly less severe situation, because the cluster 1 account invested an effort to study on the platform and might be using the harvesting account to secure and achieve a passing grade without struggle. This association is closer to the idea of applying CAMEO as a “helper-mode” that was reported by Alexandron et al. [6]. The second scenario is an association of cluster 2 and cluster 3 accounts, which could represent a more severe situation, since the student is managing to receive a certificate without investing any effort and seems to be closer to the “premeditated-mode” that was reported by Alexandron et al. [6].

We also detected a number of communities with more than two accounts collaborating together. However, it is difficult to systematically characterize associations in each community because there were numerous accounts. Hence, we described two examples in Table 2 from the set of communities that we found. While in community 1 there were some associations that could
represent genuine collaborative behaviors, the accounts in community 2 exhibited elements of explicit dishonest behavior. Therefore, this is an analysis that needs to be performed for each community separately.

More work will be needed to assess if students knew each other prior to starting the course or if they met online in study groups [58], and then decided to engage in an ‘unethical collaboration.’

5.2. Implications

While this work started with the aim of detecting and characterizing collaborations that may arise in MOOCs, we have found numerous behaviors that can be considered dishonest where students exhibit a deliberate behavior with no intention of learning the course contents. This study can be a good complement to previous work that focused on CAMEO [6,42,43]. These kinds of dishonest collaborations probably have high prevalence due to the certification provided by MOOC-based online programs. For example, the literature has shown that students performed CAMEO more frequently on those questions that had higher weight towards the final grade of a MOOC [6].

Our work has focused on characterizing a number of collaborations following a data-driven approach. However, we believe that there are some limitations in the findings reported in this paper. We did not have a clear threshold value in our detection methodology, and there might be other types of collaborations that have not been captured by the algorithmic approach used in this study; therefore, having a different threshold for the accounts that we categorized as collaborators would impact the precision and recall of the algorithm. Additionally, although the collaborators discovered provided solid evidence since the differences are statistically significant, we do not have a ground truth that can help us refine the algorithm and evaluate its real quality. In fact, there could be other potential explanations for the results that we reported. Furthermore, the context may be a strong determinant for the existence of different collaborations and behaviors [59]. The subject matter and design of the course [60], the platform where it took place [61], and the audience to whom the course is addressed [62] could have an important influence that could lead to collaborations with different behaviors than those presented here. Therefore, a wider study with different contexts would be necessary in order to generalize the findings that we report.

In the future we plan to add new data sources in our analysis in order to improve the insights and characterization. For example, forum interactions could bring information about how students interact in the forum and we could contrast this information with these results. Additionally, we could detect healthy interactions from a text mining analysis among a group of students that belong to cluster 1, giving more insights about a healthy and fruitful collaboration. Moreover, mixed-methods studies that involve collection of qualitative data such as interviews and focus groups could help validate some of the inferences made in this paper. While students who were involved in dishonest behaviors might be reluctant to disclose details of their behavior, qualitative studies at least could be beneficial to corroborate findings about healthy collaboration links identified in this study.

The aim of this work was to shed some new light on the understanding of students’ collaborations online, behaviors, motivations, and needs. This can also help to better understand the role that online collaboration can have in learning outcomes and provide teachers with tools that allow them to improve the design and development of MOOCs to promote more collaboration. However, the high prevalence of collaborations where students were clearly passing a course thanks to some kind of ‘free-riding’ limits the positive conclusions that we can draw from the study. Previous work [6] found that some course design aspects, such as randomization, could greatly help to deter academic dishonesty. Another easy possibility to control CAMEO would be to link each student account to a physical person, instead of allowing the registration of multiple accounts [5]. There are numerous design options that can help minimize these issues and future work should invest time on creating helpful guidelines for online course designers and practitioners.

The findings offer a significant novel contribution to the literature, as for the first time, this study shows that it is possible to characterize collaborations emerging in MOOC environments without having previous knowledge about the existence of such collaborations. All previous work has been centered on studying collaborations that were self-reported, controlled or visible through collaborative tools such as forums. Our study also suggests that the majority of the students in the detected associations have used this anonymous environment to collaborate in dishonest ways, a practice that has facilitated their way into a completion certificate. The results that we have reported could have implications on the design of dashboards for teachers to help them understand the types of collaborations that are taking place. These dashboards can provide opportunities to teachers to assess the quality of the collaborations, intervene, and provide feedback as appropriate in each case scenario. These findings also open the possibility of new research built on top of the methodology proposed in the paper. While we have applied it to MOOCs, the methodology could easily be adapted to other online learning environments taking into account the specific contextual characteristics, thus opening new research horizons on collaborative learning.

6. Conclusions

Nowadays, we frequently find that the design of MOOCs is no longer focused on having students collaborate together to construct knowledge. Still, many social or collaborative tools such as forums or peer-review activities are maintained. In addition, teachers encourage student participation in the course through technology platforms and resources or tools available on the platforms [10,63]. Moreover, collaborations in MOOCs might emerge spontaneously because people can meet in the forums or in virtual working groups, or because friends decide to take a course together. However, while collaborations in MOOCs are generally considered positive for the learning process, this work has revealed that not all students’ collaborations can be considered as good or beneficial. This phenomenon is not new, and in traditional classroom courses, researchers and practitioners have frequently reported inequitable or dishonest collaborations [14–16]. This study has extended the state of the art by implementing a data-driven characterization of different collaboration types in MOOCs.

Collaboration can be an important factor in students’ outcomes in any type of course, since learning can arise from the spontaneous connections between students, and many of the works we found highlight the advantages of collaboration. However, the majority of the associations that we detected have shown a low interest in learning the courseware and explicit dishonest behaviors. Therefore, we argue that there is still the need to more profoundly study the types of collaborations that can emerge in MOOCs and other types of online courses, to really understand which of those can be positive for the learning outcomes of students.
J.A. Ruipérez-Valiente, D. Jaramillo-Morillo, S. Joksimović et al. Future Generation Computer Systems 125 (2021) 590–603

CRediT authorship contribution statement

José A. Ruipérez-Valiente: Conceptualization, Methodology, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Supervision, Project administration. Daniel Jaramillo-Morillo: Formal analysis, Writing – original draft, Writing – review & editing. Srećko Joksimović: Conceptualization, Writing – original draft, Writing – review & editing, Supervision. Vitomir Kovanović: Conceptualization, Writing – original draft, Writing – review & editing, Supervision. Pedro J. Muñoz-Merino: Conceptualization, Writing – original draft, Writing – review & editing, Supervision. Dragan Gašević: Conceptualization, Writing – original draft, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Authors want to acknowledge support from PROF-XXI project, Spain (609767-EPP-1-ES-EPPKA2-CBHE-JP), the European Commission and the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva Formación program (FJCI-2017-34926).

References

[1] S. Downes, Connectivism and Connective Knowledge: Essays on Meaning and Learning Networks, National Research Council Canada, 2012, pp. 1–616.
[2] T. Liyanagunawardena, S. Williams, A. Adams, The impact and reach of MOOCs: a developing countries’ perspective, ELearning Pap. 33 (2013) 38–46.
[3] M. Perez-Sanagustin, J. Maldonado, N. Morales, Estado del arte de adopcion de MOOCs en la Educacion Superior en America Latina y Europa, Techreport WPD1.1, MOOC-Maker Constr. Manag. Capacit. MOOCs High. Education, 2016.
[4] C. Impey, Higher education online and the developing world, J. Educ. Human Develop. 9 (2) (2020).
[5] D. Jaramillo-Morillo, J. Ruipérez-Valiente, M.F. Sarasty, G. Ramírez-Gonzalez, Identifying and characterizing students suspected of academic dishonesty in SPOCs for credit through learning analytics, Int. J. Educ. Technol. Higher Educ. 17 (1) (2020) 45.
[6] G. Alexandron, J.A. Ruipérez-Valiente, Z. Chen, P.J. Muñoz Merino, D.E. Pritchard, Copying@Scale: Using harvesting accounts for collecting correct answers in a MOOC, Comput. Educ. 108 (2017) 96–114.
[7] J. van der Linden, G. Erkens, H. Schmidt, P. Renshaw, Collaborative learning, in: New Learning, Springer, 2000, pp. 37–54.
[8] A.-M. Nortvig, R.B. Christiansen, Institutional collaboration on MOOCs in education—A literature review, Int. Rev. Res. Open Distrib. Learn. 18 (6) (2017).
[9] L. Bacon, L. MacKinnon, The challenges of creating successful collaborative working and learning activities in online engineering courses, in: Proceedings of the 14th LACCEI International Multi-Conference for Engineering, Education, and Technology: “Engineering Innovations for Global Sustainability”, Latin American and Caribbean Consortium of Engineering Institutions, 2016.
[10] J. Chauhan, An insight to collaboration in MOOC, Int. J. Adv. Eng. Res. Develop. 4 (7) (2017).
[11] J. Chauhan, S. Taneja, A. Goel, Enhancing MOOC with augmented reality, adaptive learning and gamification, in: 2015 IEEE 3rd International Conference on MOOCs, Innovation and Technology in Education (MITE), 2015, pp. 348–353.
[12] J. Reich, Rebooting MOOC research, Science 347 (6217) (2015) 34–35, Publisher: American Association for the Advancement of Science, Section: Education Forum.
[13] V. Kovanović, S. Joksimović, O. Poquet, T. Hennis, P. de Vries, M. Hatala, S. Dawson, G. Siemens, D. Gašević, Examining communities of inquiry in massive open online courses: The role of study strategies, Internet Higher Educ. 40 (2019) 20–43.
[14] A. Hellas, J. Leinonen, P. Ihantola, Plagiarism in take-home exams: Help-seeking, collaboration, and systematic cheating, in: Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education - ITiCSE ’17, ACM Press, 2017, pp. 238–243.
[15] A.S. Lan, A.E. Waters, C. Studer, R.G. Baraniuk, Sparse factor analysis for learning and content analytics, J. Mach. Learn. Res. (2013) 1959–2008, arXiv:1303.5685.
[16] A.E. Waters, C. Studer, R.G. Baraniuk, Bayesian pairwise collaboration detection in educational datasets, in: 2013 IEEE Global Conference on Signal and Information Processing, 2013, pp. 989–992.
[17] T. Staubitz, T. Pfeiffer, J. Renz, C. Willems, C. Meinel, Collaborative learning in a MOOC environment, ICERI2015 Proc. (2015) 8237–8246.
[18] A. Cohen, U. Shimony, R. Nachmias, T. Soffer, Active learners’ characterization in MOOC forums and their generated knowledge, Br. J. Educ. Technol. 50 (1) (2019) 177–198.
[19] A.C.A. Holanda, P.A. Tedesco, E.H.T. Oliveira, T.C.S. Gomes, MOOCOLAB - a customized collaboration framework in massive open online courses, in: V. Kumar, C. Troussas (Eds.), Intelligent Tutoring Systems, in: Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 125–131.
[20] J.A. Ruipérez-Valiente, S. Joksimović, V. Kovanović, D. Gašević, P.J. Muñoz Merino, C. Delgado Kloos, A data-driven method for the detection of close submitters in online learning environments, in: Proceedings of the 26th International Conference on World Wide Web Companion, in: WWW ’17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 361–368.
[21] S. Zheng, M.B. Rosson, P.C. Shih, J.M. Carroll, Understanding student motivation, behaviors and perceptions in MOOCs, in: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, in: CSCW ’15, ACM, New York, NY, USA, 2015, pp. 1882–1895.
[22] A. Ramesh, D. Goldwasser, B. Huang, H. Daumé III, L. Getoor, Learning latent engagement patterns of students in online courses, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, in: AAAI’14, AAAI Press, Québec City, Québec, Canada, 2014, pp. 1272–1278.
[23] C. Alario-Hoyos, I. Estévez-Ayres, M. Pérez-Sanagustín, C.D. Kloos, C. Fernández-Panadero, Understanding learners’ motivation and learning strategies in MOOCs, The International Review of Research in Open and Distributed Learning 18 (3) (2017).
[24] A. Ortega-Arranz, E. Er, A. Martínez-Monés, M.L. Bote-Lorenzo, J.I. Asensio-Pérez, J.A. Muñoz Cristóbal, Understanding student behavior and perceptions toward earning badges in a gamified MOOC, Univ. Access Inform. Soc. 18 (3) (2019) 533–549.
[25] J.A. Ruipérez-Valiente, P.J. Muñoz-Merino, C. Delgado Kloos, Detecting and clustering students by their gamification behavior with badges: A case study in engineering education, Int. J. Eng. Educ. 33 (2-B) (2017) 816–830.
[26] S. Halawa, D. Greene, J. Mitchell, Dropout prediction in MOOCs using learner activity features, Proc. Second Eur. MOOC Stakeholder Summit 37 (1) (2014) 58–65.
[27] R. Al-Shabandar, A.J. Hussain, P. Liatsis, R. Keight, Analyzing learners behavior in MOOCs: An examination of performance and motivation using a data-driven approach, IEEE Access 6 (2018) 73669–73685.
[28] J.A. Ruipérez-Valiente, P.J. Muñoz Merino, C.D. Kloos, Improving the prediction of learning outcomes in educational platforms including higher level interaction indicators, Expert Syst. 35 (6) (2018) e12298.
[29] R. Ferguson, D. Clow, Examining engagement: analysing learner subpopulations in massive open online courses (MOOCs), in: Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, 2015, pp. 51–58.
[30] M. Khalil, M. Ebner, Clustering patterns of engagement in massive open online courses (MOOCs): the use of learning analytics to reveal student categories, J. Comput. Higher Educ. 29 (1) (2017) 114–132.
[31] B. Chen, Y. Fan, G. Zhang, Q. Wang, Examining motivations and self-regulated learning strategies of returning MOOCs learners, in: Proceedings of the Seventh International Learning Analytics & Knowledge Conference, 2017, pp. 542–543.
[32] Y. Li, H. Li, MOOC-FRS: A new fusion recommender system for MOOCs, in: 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), IEEE, 2017, pp. 1481–1488.
[33] H. Tang, W. Xing, B. Pei, Exploring the temporal dimension of forum participation in MOOCs, Distance Educ. 39 (3) (2018) 353–372.
[34] H. Lynda, B.-D. Farida, B. Tassadit, L. Samia, Peer assessment in MOOCs based on learners’ profiles clustering, in: 2017 8th International Conference on Information Technology (ICIT), IEEE, 2017, pp. 532–536.


[35] L. Sanz-Martínez, A. Martínez-Monés, M.L. Bote-Lorenzo, J.A. Munoz-Cristóbal, Y. Dimitriadis, Automatic group formation in a MOOC based on students’ activity criteria, in: European Conference on Technology Enhanced Learning, Springer, 2017, pp. 179–193.
[36] I. Claros, A. Garmendía, L. Echeverría, R. Cobos, Towards a collaborative pedagogical model in MOOCs, in: 2014 IEEE Global Engineering Education Conference (EDUCON), IEEE, 2014.
[37] T. Staubitz, C. Meinel, Collaborative learning in MOOCs approaches and experiments, in: 2018 IEEE Frontiers in Education Conference (FIE), (ISSN: 2377-634X) 2018, pp. 1–9.
[38] C. Brooks, C. Stalburg, T. Dillahunt, L. Robert, Learn with friends: The effects of student face-to-face collaborations on massive open online course activities, in: Proceedings of the Second (2015) ACM Conference on Learning @ Scale - L@S ’15, ACM Press, 2015, pp. 241–244.
[39] N. Li, H. Verma, A. Skevi, G. Zufferey, J. Blom, P. Dillenbourg, Watching MOOCs together: investigating co-located MOOC study groups, Distance Educ. 35 (2) (2014) 217–233.
[40] A. Ezen-Can, K.E. Boyer, S. Kellogg, S. Booth, Unsupervised modeling for understanding MOOC discussion forums: A learning analytics approach, in: Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, in: LAK ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 146–150.
[41] C. Haythornthwaite, Facilitating collaboration in online learning, J. Asynchronous Learn. Netw. 10 (1) (2006) 7–24.
[42] Y. Bao, Detecting Multiple-Accounts Cheating in MOOCs (Ph.D. thesis), TU Delft, 2017, URL: http://resolver.tudelft.nl/uuid:64ee5526-8c9e-4013-9019-c63a63413ca2.
[43] C.G. Northcutt, A.D. Ho, I.L. Chuang, Detecting and preventing “multiple-account” cheating in massive open online courses, Comput. Educ. 100 (2016) 71–80.
[44] J.A. Ruipérez-Valiente, P.J. Munoz-Merino, G. Alexandron, D.E. Pritchard, Using machine learning to detect ‘multiple-account’ cheating and analyze the influence of student and problem features, IEEE Trans. Learn. Technol. 12 (1) (2017) 112–122.
[45] R. Swaray, An evaluation of a group project designed to reduce free-riding and promote active learning, Assess. Eval. Higher Educ. 37 (3) (2012) 285–292.
[46] O. Viberg, A. Mavroudi, Y. Fernaeus, C. Bogdan, J. Laaksolahti, Reducing free riding: CLASS – a system for collaborative learning assessment, in: E. Popescu, A. Belén Gil, L. Lancia, L. Simona Sica, A. Mavroudi (Eds.), Methodologies and Intelligent Systems for Technology Enhanced Learning, 9th International Conference, Workshops, Springer International Publishing, 2020, pp. 132–138.
[47] V. Popov, A.v. Leeuwen, S.C.A. Buis, Are you with me or not? Temporal synchronicity and transactivity during CSCL, J. Comput. Assisted Learn. 33 (5) (2017) 424–442.
[48] J. Lämsä, R. Hämäläinen, P. Koskinen, J. Viiri, E. Lampi, What do we do when we analyse the temporal aspects of computer-supported collaborative learning? A systematic literature review, Educ. Res. Rev. (2021) 100387.
[49] SPSS Statistics IBM, Twostep cluster analysis, 2021, Online; accessed 21 May 2021, https://www.ibm.com/docs/en/spss-statistics/27.0.0?topic=features-twostep-cluster-analysis.
[50] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344, John Wiley & Sons, 2009.
[51] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O.P. Patel, A. Tiwari, E.M. Joo, W. Ding, C.-T. Lin, A review of clustering techniques and developments, Neurocomputing 267 (2017) 664–681.
[52] M. Laal, M. Laal, Collaborative learning: what is it?, Procedia - Social and Behavioral Sciences 31 (2012) 491–495.
[53] J. Eisenberg, To cheat or not to cheat: effects of moral perspective and situational variables on students’ attitudes, J. Moral Educ. 33 (2) (2004) 163–178.
[54] J. Yardley, M.D.R. Ph.D, S.C. Bates, J. Nelson, True confessions?: Alumni’s retrospective reports on undergraduate cheating behaviors, Ethics Behav. 19 (1) (2009) 1–14.
[55] R. Baker, J. Walonoski, N. Heffernan, I. Roll, A. Corbett, K. Koedinger, Why students engage in "gaming the system" behavior in interactive learning environments, J. Interactive Learn. Res. 19 (2) (2008) 185–224.
[56] M. Cocea, A. Hershkovitz, R.S.J.d. Baker, The impact of off-task and gaming behaviors on learning: Immediate or aggregate? in: Proceedings of the 2009 Conference on Artificial Intelligence in Education: Building Learning Systems that Care: from Knowledge Representation to Affective Modelling, IOS Press, NLD, 2009, pp. 507–514.
[57] R.S.J.d. Baker, A.T. Corbett, I. Roll, K.R. Koedinger, Developing a generalizable detector of when students game the system, User Model. User-Adapted Interact. 18 (3) (2008) 287–314.
[58] C. Lampe, D.Y. Wohn, J. Vitak, N.B. Ellison, R. Wash, Student use of facebook for organizing collaborative classroom activities, Int. J. Comput. Supported Collabor. Learn. 6 (3) (2011) 329–347.
[59] S. Joksimović, O. Poquet, V. Kovanović, N. Dowell, C. Mills, D. Gašević, S. Dawson, A.C. Graesser, C. Brooks, How do we model learning at scale? A systematic review of research on MOOCs, Rev. Educ. Res. 88 (1) (2018) 43–86.
[60] D. Gašević, S. Dawson, T. Rogers, D. Gasevic, Learning analytics should not promote one size fits all: The effects of instructional conditions in predicting academic success, Internet Higher Educ. 28 (2016) 68–84.
[61] B. Thoms, E. Eryilmaz, How media choice affects learner interactions in distance learning classes, Comput. Educ. 75 (2014) 112–126.
[62] S. Joksimović, A. Manataki, D. Gašević, S. Dawson, V. Kovanović, I.F. De Kereki, Translating network position into performance: importance of centrality in different network configurations, in: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, 2016, pp. 314–323.
[63] M. Zapata-Ros, El diseño instruccional de los MOOC y el de los nuevos cursos abiertos personalizados, Rev. Educ. Dist. (RED) (45) (2015).

José A. Ruipérez-Valiente completed his B.Eng. and M.Eng. in Telecommunications at Universidad Católica de San Antonio de Murcia (UCAM) and Universidad Carlos III of Madrid (UC3M) respectively, graduating in both cases with the best academic transcript of the class. Afterwards, he completed his M.Sc. and Ph.D. in Telematics at UC3M while conducting research at Institute IMDEA Networks in the area of learning analytics and educational data mining. He completed two postdoctoral periods, one at MIT and a second one at the University of Murcia with the prestigious Spanish fellowship Juan de la Cierva. He is currently an Associate Professor of Software Engineering and Artificial Intelligence at Complutense University of Madrid.

Daniel Jaramillo-Morillo completed his B.Eng. in Electronic and Telecommunications and his M.Eng. in Telematic Engineering at the Universidad del Cauca in 2017. He was a Young Researcher with a scholarship from Colciencias (Colombia) in 2017 and is currently a Ph.D. student in Telematic Engineering. He is a researcher and administrator of a learning platform at the Universidad del Cauca.

Srecko Joksimovic completed in 2017 a Ph.D. in Learning Analysis and Information Technology from the University of Edinburgh and Simon Fraser University respectively. He is a Senior Lecturer in Data Science at the Education Futures, University of South Australia. His research is centered around augmenting the abilities of individuals to solve complex problems in collaborative settings. Srecko is particularly interested in evaluating the influence of contextual, social, cognitive, and affective factors on groups and individuals as they solve complex real-world problems.

Vitomir Kovanovic is a Research Fellow at the School of Education, University of South Australia and a Data Scientist at the Teaching Innovation Unit, University of South Australia. His research focuses on the development of novel learning analytics systems using learners’ trace data records collected by learning management systems with the goal of understanding and improving student learning. He obtained his Ph.D. in Informatics at the University of Edinburgh, United Kingdom in 2017.


Dr. Pedro J. Muñoz-Merino is an Associate Professor at Universidad Carlos III de Madrid. His main areas of expertise are data analysis, educational data mining, learning analytics and adaptive systems. He teaches data science topics at his university and at other institutions such as INAP (National Institute of Public Administration). Pedro is a Telecommunications and Telematics Engineer from the Universidad Politecnica de Valencia and holds a Ph.D. in Telematics Engineering from the Universidad Carlos III de Madrid.

Dragan Gašević is Professor of Learning Analytics in the Faculty of Information Technology and Director of the Centre for Learning Analytics at Monash University. Before the current post, he was Professor and Chair in Learning Analytics and Informatics in the Moray House School of Education and the School of Informatics and Co-Director of the Centre for Research in Digital Education at the University of Edinburgh. He holds a B.S. in Computer engineering and informatics from the Military Technical Academy, and an M.S. in Software systems and electrical engineering and a Ph.D. in Information systems from the University of Belgrade.
