
REU Blogpost: What I have learned so far in R

Introduction: Seven Weeks of R

It has been seven weeks since I started my summer research in Educational Data Mining. That means seven weeks of using R to import, tidy, transform, and visualize data (Grolemund and Wickham, R for Data Science). Although I was eager to work in R, I ran into many obstacles, both major and minor, and I had to solve a number of data manipulation and visualization problems before I could extract insights from the data. In this blog post, I want to share some of the tricks I learned for manipulating and visualizing data inside the tidyverse.

Data Manipulation and Visualization Tricks

  1. Throughout the summer, the most important skill I gained was validating my code. For example, I checked whether group_by and summarise did what I intended them to do, as in this code chunk:

Test Code

library(tidyverse)
test <- data.frame(anon_screen_name = c("jason", "baik", "joon", "woo"),
                   cluster = c(1, 1, 2, 2))

test_dropout <- c("jason", "joon")

test %>% 
     group_by(cluster) %>% 
     summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))
## # A tibble: 2 x 2
##   cluster dropout_prop
##     <dbl>        <dbl>
## 1       1          0.5
## 2       2          0.5

  • I wanted to test whether my dplyr code actually calculated the proportion of values in a data frame column (anon_screen_name) that match a separate vector (test_dropout).
  • The test revealed it did! Each cluster has two users, one of whom is in test_dropout, so 0.5 is the expected proportion for both clusters.

Original Code

bar_first <- new_clust_first_kmeans %>% 
  mutate(cluster = as.character(cluster)) %>% 
  group_by(cluster) %>% 
  summarise(dropout_prop = mean(anon_screen_name %in% total_dropout))

When in doubt about my code, I learned to run the same code on a dummy dataset and check whether the output matches my expectations. That way, I can be confident that the code will give me accurate output on my original dataset.
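One limitation of the dummy dataset above is that both clusters have the same expected answer (0.5), so a grouping mistake could go unnoticed. Here is a minimal sketch of a slightly stricter check, using made-up names (test2, test2_dropout) and groups with different expected proportions. It is an illustration, not the code I ran on the real data.

```r
library(dplyr)

# Hypothetical dummy data: cluster 1 has 3 users, cluster 2 has 2 users
test2 <- data.frame(
  anon_screen_name = c("a", "b", "c", "d", "e"),
  cluster = c(1, 1, 1, 2, 2)
)

# Hypothetical dropout list: 2 of 3 users in cluster 1, 1 of 2 in cluster 2
test2_dropout <- c("a", "b", "d")

result <- test2 %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% test2_dropout))

# Fail loudly if the grouped proportions are not what we worked out by hand
stopifnot(isTRUE(all.equal(result$dropout_prop, c(2 / 3, 1 / 2))))
```

Writing the expectation as a stopifnot() call means the check fails with an error instead of relying on me to eyeball the printed tibble.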

  1. dplyr::distinct() is a super useful verb for selecting distinct / unique rows. According to the documentation, distinct retains only unique/distinct rows from an input tbl. This is similar to unique.data.frame, but considerably faster.

Table with Duplicate Rows

Cleaned Table

dplyr::distinct did the trick!
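Since the tables above were screenshots, here is a small sketch of the same idea on made-up data. The data frame dup is hypothetical and only there to show the behavior.

```r
library(dplyr)

# Hypothetical table in which the first two rows are exact duplicates
dup <- data.frame(
  anon_screen_name = c("jason", "jason", "baik"),
  cluster = c(1, 1, 2)
)

# Keep only unique rows (all columns are compared)
deduped <- distinct(dup)

# Keep the first row for each name while retaining the other columns
first_per_name <- distinct(dup, anon_screen_name, .keep_all = TRUE)
```

With no column arguments, distinct() compares entire rows. Naming a column and setting .keep_all = TRUE keeps the first row found for each value of that column.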


  1. tidyr::replace_na() lets you replace missing values. After I saw the table below, I immediately turned to replace_na() to replace the NA values in every column with 0.

# Used as one step of a pipeline, so the data frame is supplied by %>%
replace_na(list("1" = 0,
                "2" = 0,
                "3" = 0,
                "4" = 0,
                "5" = 0,
                "6" = 0,
                "7" = 0,
                "8" = 0,
                "9" = 0,
                "10" = 0))

Messy Table

Cleaned Table
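To make the replace_na() step concrete, here is a minimal, self-contained sketch. The tibble messy and its values are made up, and the across() variant is an alternative I am suggesting for when there are many columns, not what I originally wrote.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table with numeric column names and some missing values
messy <- tibble(
  anon_screen_name = c("jason", "baik"),
  `1` = c(3, NA),
  `2` = c(NA, 5)
)

# Replace NA with 0, naming each column explicitly
cleaned <- messy %>%
  replace_na(list(`1` = 0, `2` = 0))

# Alternative: apply the same replacement to every numeric column
cleaned_across <- messy %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
```

The second form avoids typing out ten nearly identical list entries, at the cost of touching every numeric column whether or not you meant to.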

  1. The col.names parameter in kable (defined in knitr and used alongside kableExtra) lets you change the column names shown in the table.

Table with Raw Column Names

Table with Customized Column Names
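A short sketch of col.names, assuming a small made-up summary table (clusters). Only knitr is needed for this part; any kableExtra styling would be added on top.

```r
library(knitr)

# Hypothetical summary table with raw, code-style column names
clusters <- data.frame(cluster = c(1, 2), dropout_prop = c(0.5, 0.5))

# col.names must supply one label per column, in column order
tbl <- kable(clusters, col.names = c("Cluster", "Dropout Proportion"))
tbl
```

The underlying data frame keeps its original column names. Only the rendered table changes.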

  1. scales::percent_format() lets you display the labels on your plots as percentages!
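Here is a minimal sketch of how this can look in a plot. The data frame props is made up, and accuracy = 1 (round to whole percentages) is just one reasonable choice.

```r
library(ggplot2)
library(scales)

# Hypothetical proportions stored on a 0-1 scale
props <- data.frame(cluster = c("1", "2"), dropout_prop = c(0.25, 0.5))

# percent_format() returns a labelling function for the axis
pct <- percent_format(accuracy = 1)

p_pct <- ggplot(props, aes(x = cluster, y = dropout_prop)) +
  geom_col() +
  scale_y_continuous(labels = pct)
```

Because percent_format() returns a function, you can also call it directly on a vector to preview the labels before plotting.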
  1. When drawing boxplots, I found too many outliers and wanted a way to tone down just those points. The outlier.alpha parameter in geom_boxplot changes the transparency of the outliers only.
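A small sketch of outlier.alpha, using simulated heavy-tailed data (scores) so that there are plenty of outliers to fade. The value 0.2 is arbitrary.

```r
library(ggplot2)

# Simulated heavy-tailed data, which tends to produce many outlier points
set.seed(1)
scores <- data.frame(
  cluster = rep(c("1", "2"), each = 100),
  value = rt(200, df = 2)
)

# Boxes and whiskers stay fully opaque; only the outlier points are faded
p_box <- ggplot(scores, aes(x = cluster, y = value)) +
  geom_boxplot(outlier.alpha = 0.2)
```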
  1. dplyr::case_when is a super useful function that replaces my old trick of nested ifelse statements.

An Incomprehensible Example of ifelse

# Code from another project (one step of a pipeline):
mutate(odd_man = ifelse(odd_man %in% c("1-0", "2-1", "3-2", "4-3"), "one_man",
                 ifelse(odd_man %in% c("2-0", "3-1", "4-2", "3-0"), "two_plus_man",
                        "all_other_shots")))

A Readable Example of case_when

case_when(
           between(first_submit, week_seq[1], week_seq[2]) ~ 1,
           first_submit <= week_seq[3] & first_submit >= as.Date("2014-07-02") ~ 2,
           first_submit <= week_seq[4] & first_submit >= as.Date("2014-07-09") ~ 3,
           first_submit <= week_seq[5] & first_submit >= as.Date("2014-07-16") ~ 4,
           first_submit <= week_seq[6] & first_submit >= as.Date("2014-07-23") ~ 5,
           first_submit <= week_seq[7] & first_submit >= as.Date("2014-07-30") ~ 6,
           first_submit <= week_seq[8] & first_submit >= as.Date("2014-08-06") ~ 7,
           first_submit <= week_seq[9] & first_submit >= as.Date("2014-08-13") ~ 8,
           first_submit <= week_seq[10] & first_submit >= as.Date("2014-08-20") ~ 9,
           first_submit <= week_seq[11] & first_submit >= as.Date("2014-08-27") ~ 10,
           TRUE ~ 0
)

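For a direct comparison, here is a sketch of the nested ifelse example rewritten with case_when. The tibble shots is made up for illustration. The conditions are copied from the ifelse version.

```r
library(dplyr)

# Hypothetical input covering each branch once
shots <- tibble(odd_man = c("1-0", "3-1", "0-0"))

# Conditions are checked top to bottom; TRUE acts as the fallback branch
recoded <- shots %>%
  mutate(odd_man = case_when(
    odd_man %in% c("1-0", "2-1", "3-2", "4-3") ~ "one_man",
    odd_man %in% c("2-0", "3-1", "4-2", "3-0") ~ "two_plus_man",
    TRUE ~ "all_other_shots"
  ))
```

Each condition sits on its own line next to its result, so adding a fourth category means adding one line instead of another level of nesting.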
  1. ggplot2::scale_fill_hue has a labels parameter that lets you change the labels of the legend.

scale_fill_hue(labels = c("High", "Low", "Medium"))
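Here is a minimal sketch of that call inside a full plot. The data frame engagement and its level codes ("h", "l", "m") are made up.

```r
library(ggplot2)

# Hypothetical data whose fill variable uses short codes
engagement <- data.frame(
  level = c("h", "l", "m"),
  n = c(10, 25, 15)
)

# Labels are matched to the fill levels by position: h, l, m (alphabetical)
p_fill <- ggplot(engagement, aes(x = level, y = n, fill = level)) +
  geom_col() +
  scale_fill_hue(labels = c("High", "Low", "Medium"))
```

Since the labels are positional, it is worth checking the order of the levels first. A character column is sorted alphabetically, which is why "Low" comes before "Medium" here.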


  1. broom::augment is a helpful function when tidying models. According to the documentation, augment adds columns to a dataset, containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have a . prefix to prevent existing columns from being overwritten.

In my case, I used augment after clustering to add a new column, cluster, holding each observation's cluster number.
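A minimal sketch of augment with k-means, using simulated data (features) in place of my real dataset. For a kmeans fit, the column that broom adds is named .cluster, which you can rename afterwards if you prefer.

```r
library(broom)

# Simulated data: two well-separated groups of 10 points each
set.seed(42)
features <- data.frame(
  x = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  y = c(rnorm(10, mean = 0), rnorm(10, mean = 5))
)

fit <- kmeans(features, centers = 2)

# Returns the original rows plus a .cluster column with each row's assignment
clustered <- augment(fit, features)
```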

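  1. Last but not least, the compound assignment pipe operator, %<>%, from the magrittr package is very convenient, as it saves you time and space. Less code means fewer errors. Note that library(tidyverse) does not attach it, so you need library(magrittr) as well.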

Instead of

df <- df %>% mutate(time = c(a,b,c))

You can do

df %<>% mutate(time = c(a,b,c))
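Here is a self-contained sketch showing that the two forms give the same result. The data frames and values are made up.

```r
library(dplyr)
library(magrittr)  # provides %<>%

df_long <- data.frame(id = 1:3)
df_short <- data.frame(id = 1:3)

# Ordinary pipe plus explicit reassignment
df_long <- df_long %>% mutate(time = c(10, 20, 30))

# Compound assignment pipe: pipes df_short in and overwrites it with the result
df_short %<>% mutate(time = c(10, 20, 30))
```

One thing to keep in mind is that %<>% overwrites the object in place, so it is easy to miss when skimming code. I would use it for short, obvious updates.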

Extras: #rstats Tip of the Day