
REU Blogpost: Final Report


This is a summary of the work I’ve done for the summer. It is a work in progress.

MOOCs (Massive Open Online Courses) are educational resources that teach anyone with an internet connection a wide variety of topics, from Statistics to Modern Poetry. In 2013, the President’s Council of Advisors on Science and Technology stated that although many questions and challenges remain about MOOCs, this innovation has the potential to increase access to high-quality higher education at low cost. MOOCs are thus a potentially transformative technology that gives broader access to education than ever before.

However, a major problem discussed in the MOOC literature is the extremely high dropout rate of these courses. A commonly cited figure is 90%, with reported rates ranging from 65% to 97%. In this research project, I look at factors leading to dropout, using K-Means clustering to group students into dropouts and non-dropouts.

EDA (Exploratory Data Analysis)

How many unique students are there in the course?

  • We counted the students who appear in all four datasets: 7659
  • Keeping only these students gives us the maximum number of features to use for clustering.
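
The step above amounts to intersecting the student IDs of the four datasets. A minimal sketch in Python, used here purely for illustration; the dataset names and IDs below are hypothetical stand-ins, not the actual course data:

```python
def common_students(*id_collections):
    """Return the set of student IDs present in every collection."""
    id_sets = [set(ids) for ids in id_collections]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Hypothetical toy data standing in for the four datasets.
grades = ["s1", "s2", "s3", "s4"]
video_events = ["s2", "s3", "s4", "s5"]
problem_events = ["s1", "s2", "s3"]
effort_logs = ["s2", "s3", "s6"]

shared = common_students(grades, video_events, problem_events, effort_logs)
print(sorted(shared))  # ['s2', 's3']
```

Only students in the intersection have a value for every feature, which is why the intersection is used rather than the union.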

What is the total number of students who dropped the course?

Definition of dropout student

(Borrowed from Halawa et al., “Dropout prediction in MOOCs using learner activity features”)

  • Absent from the course for more than one month
  • Viewed fewer than 50 percent of the videos in the course

In our dataset, no student was absent from the course for more than one month. Therefore, we only look at the second condition to find dropout students.

  • 4998 Dropout Students
  • Dropout Rate: 65%
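
As a rough illustration, the second condition and the resulting rate can be written as below. The function names and the `threshold` parameter are hypothetical; only the counts 4998 and 7659 come from the figures above.

```python
def is_dropout(videos_viewed, total_videos, threshold=0.5):
    """Flag a student who viewed fewer than `threshold` of the course videos."""
    return videos_viewed / total_videos < threshold


def dropout_rate(num_dropouts, num_students):
    """Fraction of students labelled as dropouts."""
    return num_dropouts / num_students


# Counts reported above: 4998 dropouts among 7659 students.
rate = dropout_rate(4998, 7659)
print(f"{rate:.1%}")  # 65.3%
```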

After how many days did the dropouts’ effort fall off?

  • K-Means Clustering
  • We set the number of clusters to k = 2 (supported by silhouette analysis)
  • 682 students in Cluster 1
  • 4314 students in Cluster 2
  • The effort level of Cluster 1 students falls off after approximately 2 weeks.
  • The effort level of Cluster 2 students is consistently low.
  • A table of the mean and standard deviation of final grades and effort shows that more effort correlates with higher final grades.
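
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) together with the silhouette score used to judge a choice of k. It is not the code used in the project, which would normally rely on a library implementation; the function names and the toy effort vectors are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k sampled data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient (assumes at least two non-empty clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical weekly effort vectors (4 weeks each) for six students.
effort = [
    (9.0, 8.0, 1.0, 0.5),  # high early effort that falls off
    (8.5, 7.5, 1.5, 0.0),
    (9.5, 8.5, 0.5, 0.5),
    (1.0, 0.5, 0.5, 0.0),  # consistently low effort
    (0.5, 1.0, 0.0, 0.5),
    (1.0, 1.0, 0.5, 0.5),
]
labels = kmeans(effort, k=2)
score = silhouette_score(effort, labels)
print(labels, round(score, 2))
```

A silhouette score close to 1 means the clusters are tight and well separated; comparing the score across several values of k is the usual way to support a particular choice.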

What are the types of modules that the users are interacting with?

  • We removed the course module since its count is 1.0, meaning that every other module falls under the umbrella of a course module.
  • The problem module is the most popular: many students focused on solving questions.
  • Video was the least popular, perhaps because students already knew the material and did not need to learn from the video lectures.

Is there a discrepancy in Mean effort per Week for Dropouts vs Non-Dropouts?

  • As expected, non-dropout students put in significantly higher average effort than dropout students.
  • There are many outliers in this dataset, suggesting high variability in weekly effort level.

Proportional bar chart of different effort levels per week

  • Categorized effort level into High, Medium, and Low Effort using Five-Number Summary:
  • 3rd Quartile < High Effort <= Maximum
  • 1st Quartile <= Medium Effort <= 3rd Quartile
  • Minimum <= Low Effort < 1st Quartile

  • The plot aligns with our expectations: dropout students show a smaller share of “High Effort” and a larger share of “Low Effort” levels.
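
The quartile rule above can be sketched as follows. The effort values are made up, and the quartile convention (Python's "inclusive" method) is an assumption, since different tools compute quartiles slightly differently.

```python
import statistics
from collections import Counter


def effort_category(value, q1, q3):
    """Map an effort value to High / Medium / Low using the quartile rule."""
    if value > q3:
        return "High"
    if value < q1:
        return "Low"
    return "Medium"  # q1 <= value <= q3


# Hypothetical weekly effort values for twelve students.
weekly_effort = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]
q1, _median, q3 = statistics.quantiles(weekly_effort, n=4, method="inclusive")

counts = Counter(effort_category(v, q1, q3) for v in weekly_effort)
# Shares per category, as they would be drawn in a proportional bar chart.
shares = {level: n / len(weekly_effort) for level, n in counts.items()}
print(q1, q3, shares)
```

Computing these shares separately for dropouts and non-dropouts gives the bars that the chart compares.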

Clustering in Stages

  • Looked at two weekly features: effort level and the number of times a student played a video.
  • These features are measured over three cumulative “stages”:
  1. First Stage: Two features measured during Weeks 1-3
  2. Second Stage: Two features measured during Weeks 1-6
  3. Third Stage: Two features measured during Weeks 1-10 (all available data)

  • Purpose of the similarity matrix: find the proportion of overlapping students among the three groups, or stages (each group has six clusters of students).
  • Results: We found very high overlap between First Group (Cluster 6), Second Group (Cluster 3), and Third Group (Cluster 4).
  • Now we move on to comparing the students in these 18 clusters to our ground truth of dropout and non-dropout students.
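
One way to build such a similarity matrix is sketched below. The post does not say how the "proportion of overlapping students" was defined, so the Jaccard index (intersection over union) is an assumption here, and the cluster contents are toy data.

```python
def jaccard(a, b):
    """Share of students two clusters have in common (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_matrix(stage_a, stage_b):
    """Compare every cluster of one stage with every cluster of another.

    Each stage is a dict mapping cluster id -> set of student ids.
    """
    return {
        ca: {cb: jaccard(sa, sb) for cb, sb in stage_b.items()}
        for ca, sa in stage_a.items()
    }


# Hypothetical clusters for two stages (two clusters each, for brevity).
stage1 = {1: {"s1", "s2", "s3"}, 2: {"s4", "s5"}}
stage2 = {1: {"s3", "s4", "s5"}, 2: {"s1", "s2"}}

matrix = overlap_matrix(stage1, stage2)
for cluster_id, row in matrix.items():
    print(cluster_id, {other: round(value, 2) for other, value in row.items()})
```

A high value in a cell means the two clusters contain largely the same students, even though their cluster numbers differ between stages.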

Bar chart of dropouts among different clusters

  • Overall outcome of Clustering in Stages: The K-Means clustering algorithm found one cluster in each group with an extremely high proportion of dropout students.
  • We assume that the algorithm grouped the dropout students into one cluster (Cluster 6 in Group 1, Cluster 3 in Group 2, and Cluster 4 in Group 3).
  • Future Work: Further analyze the above-mentioned clusters to discover characteristics of dropouts.
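
The comparison against the ground truth boils down to the share of dropouts inside each cluster. A small sketch with hypothetical student IDs and cluster assignments:

```python
from collections import defaultdict


def dropout_share_by_cluster(cluster_of, dropouts):
    """Return, for each cluster, the fraction of its students who are dropouts.

    cluster_of: dict mapping student id -> cluster id
    dropouts:   set of student ids labelled as dropouts (ground truth)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for student, cluster in cluster_of.items():
        totals[cluster] += 1
        if student in dropouts:
            hits[cluster] += 1
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}


# Hypothetical assignments: cluster 1 is almost entirely dropouts.
cluster_of = {"s1": 1, "s2": 1, "s3": 1, "s4": 2, "s5": 2, "s6": 2, "s7": 2}
dropouts = {"s1", "s2", "s3", "s6"}

shares = dropout_share_by_cluster(cluster_of, dropouts)
print(shares)  # {1: 1.0, 2: 0.25}
```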

Clustering of Different Averages

  • Performed K-Means clustering (k = 2) with two distinct features: effort level and the number of times a student played a video.
  1. First Step (“Full”): Get two features (10-week average of effort level + 10-week average of number of times student played video)
  2. Second Step (“Halves”): Get four features (two 5-week averages of effort level + two 5-week averages of number of times student played video)
  3. Third Step (“Quarterly”): Get eight features (four 2.5-week averages of effort level + four 2.5-week averages of number of times student played video)

  • Key Takeaway: Negligible difference between Quarterly (8 features) and Halves (4 features), so we don’t necessarily need to divide the time period into quarters.
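
The three feature sets can be built from the same weekly series by averaging over 1, 2, or 4 segments. The sketch below is an assumption about the mechanics: the post does not say how the 2.5-week windows were handled, so here the 10 weeks are split into contiguous chunks of 2, 3, 2, and 3 weeks. Function names and data are hypothetical.

```python
def segment_means(weekly_values, n_segments):
    """Average a weekly series over n near-equal contiguous segments."""
    n = len(weekly_values)
    # Integer boundaries; for 10 weeks and 4 segments: 0, 2, 5, 7, 10.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [
        sum(weekly_values[lo:hi]) / (hi - lo)
        for lo, hi in zip(bounds, bounds[1:])
    ]


def build_features(effort, plays, n_segments):
    """Concatenate segment averages of effort and of video plays."""
    return segment_means(effort, n_segments) + segment_means(plays, n_segments)


# Hypothetical 10-week series for one student.
effort = [10, 8, 6, 4, 2, 2, 2, 2, 0, 0]
plays = [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]

full = build_features(effort, plays, 1)       # 2 features
halves = build_features(effort, plays, 2)     # 4 features
quarterly = build_features(effort, plays, 4)  # 8 features
print(full, halves, len(quarterly))
```

Each student's feature vector is then passed to K-Means, so the three steps differ only in how finely the time axis is resolved.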

Instructor Dashboard

  • Introduces the user to the app.
  • Gives details on each tab so that the user knows what to expect.
  • Shows the number of students in the class and the number of dropout students.
  • Displays the final grades table, the final grades distribution, and the normalized module usage chart.
  • Allows the user to select a dataset and apply the filters corresponding to that dataset.
  • The purpose is to let the user customize the search query.
  • Displays plots related to effort level.
  • An interactive plot highlights a selected student’s effort level in red (each line represents one student).


All in all, I have performed exploratory data analysis that confirms my expectations (dropout students put in less effort than non-dropout students), conducted K-Means Clustering, and built an interactive web application in R that shows all my summer work. My future work includes investigating the results of the clustering method and incorporating them into my interactive web application. I hope to write a paper on this entire process by the end of the year! Thank you for a great summer.