Introduction: Seven Weeks of R
It has been seven weeks since I started my summer research in Educational Data Mining. That means seven weeks of using R to import, tidy, transform, and visualize data (Grolemund and Wickham, R for Data Science). While I really wanted to work in R, I ran into many obstacles, major and minor. I had to work through many data manipulation and visualization issues in order to extract insights from the data. My intent in this blog post is to share some of the tricks I learned in R to manipulate and visualize data inside the tidyverse.
Data Manipulation and Visualization Tricks
- Throughout the summer, the most important skill I gained was validating my code. For example, I checked whether summarise did what I intended it to do, as in this code chunk:
library(tidyverse)

test <- data.frame(
  anon_screen_name = c("jason", "baik", "joon", "woo"),
  cluster = c(1, 1, 2, 2)
)
test_dropout <- c("jason", "joon")

test %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))
## # A tibble: 2 x 2
##   cluster dropout_prop
##     <dbl>        <dbl>
## 1       1          0.5
## 2       2          0.5
- I wanted to test whether my dplyr code actually calculated the proportion of matches between values in a column inside my dataframe (anon_screen_name) and a separate vector.
- The test revealed it did!
# The same summary, run on the real data
bar_first <- new_clust_first_kmeans %>%
  mutate(cluster = as.character(cluster)) %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% total_dropout))
When in doubt about my code, I learned to run the same code on a dummy dataset and see whether the output matches my expectations. That way, I can be confident that my code will give me accurate output on my original dataset.
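The eyeball check above can also be turned into an assertion, so that a wrong result stops the script instead of slipping by. This is only a minimal sketch that reuses the dummy data from the test; the object name result is my own choice for illustration.

```r
library(dplyr)

# Dummy data where the right answer is known in advance
test <- data.frame(
  anon_screen_name = c("jason", "baik", "joon", "woo"),
  cluster = c(1, 1, 2, 2)
)
test_dropout <- c("jason", "joon")

# `result` is a hypothetical name used only for this sketch
result <- test %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))

# One of the two names in each cluster is a dropout,
# so both proportions should be 0.5; stop with an error otherwise
stopifnot(isTRUE(all.equal(result$dropout_prop, c(0.5, 0.5))))
```

If the expectation is ever violated, stopifnot() raises an error at that line, which is easier to notice than a wrong number in printed output.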
dplyr::distinct() is a super useful verb for selecting distinct / unique rows. According to the documentation, distinct retains only unique/distinct rows from an input tbl. This is similar to unique.data.frame, but considerably faster.
Table with Duplicate Rows
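As a small illustration of distinct(), here is a sketch on a made-up table with one duplicated row. The data frame df and its values are hypothetical and only loosely modeled on the columns used earlier in this post.

```r
library(dplyr)

# Hypothetical table in which the first row appears twice
df <- data.frame(
  anon_screen_name = c("jason", "jason", "baik"),
  cluster = c(1, 1, 2)
)

# Drop rows that are duplicated across all columns
deduped <- df %>% distinct()

# Deduplicate on a single column while keeping the other columns
by_name <- df %>% distinct(anon_screen_name, .keep_all = TRUE)
```

Without .keep_all = TRUE, distinct(anon_screen_name) would return only the anon_screen_name column.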
tidyr::replace_na() allows you to replace missing values. After I saw the NAs in my table, I immediately turned to replace_na() to replace all the NA values with 0.
replace_na(list("1" = 0, "2" = 0, "3" = 0, "4" = 0, "5" = 0, "6" = 0, "7" = 0, "8" = 0, "9" = 0, "10" = 0))
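To show the call above in context, here is a minimal sketch on a made-up wide table whose columns are named after week numbers. The table weekly and its values are assumptions for illustration, not my research data. The second version uses across() (available in dplyr 1.0.0 and later) to avoid listing every column by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per cluster, one column per week
weekly <- tibble(
  cluster = c("a", "b"),
  `1` = c(3, NA),
  `2` = c(NA, 5)
)

# Explicit version: name every column and its replacement value
filled_explicit <- weekly %>%
  replace_na(list(`1` = 0, `2` = 0))

# Shorter version: apply the vector form of replace_na() to every column except cluster
filled_across <- weekly %>%
  mutate(across(-cluster, ~ replace_na(.x, 0)))
```

Both versions leave the non-missing values untouched and differ only in how the columns are selected.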
kableExtra::kable allows you to change the column names through its col.names argument.
Table with Raw Column Names
Table with Customized Column Names
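A minimal sketch of renaming columns at display time is shown here. The table tab and the new column labels are made up for illustration. kable() is defined in the knitr package, so the sketch loads knitr directly.

```r
library(knitr)

# Hypothetical summary table with raw column names
tab <- data.frame(cluster = c(1, 2), dropout_prop = c(0.5, 0.5))

# col.names replaces the header text only; the data frame itself is unchanged
out <- kable(tab, col.names = c("Cluster", "Dropout Proportion"))
out
```

Because only the rendered header changes, the original names such as dropout_prop remain available for any later code.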
scales::percent_format allows you to change the labels on your plots to percentages!
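Here is a sketch of how percent_format() can be plugged into a ggplot2 axis. The data frame props is hypothetical, and accuracy = 1 (round to whole percentages) is my own choice rather than something the plot requires.

```r
library(ggplot2)
library(scales)

# Hypothetical proportions per cluster
props <- data.frame(cluster = c("1", "2"), dropout_prop = c(0.25, 0.5))

# percent_format() returns a function that turns 0.25 into "25%"
pct <- percent_format(accuracy = 1)

p <- ggplot(props, aes(x = cluster, y = dropout_prop)) +
  geom_col() +
  scale_y_continuous(labels = pct)
```

The underlying data stay as proportions; only the axis labels are formatted.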
- When drawing boxplots, I found too many outliers and would have loved to somehow jitter just these points. The outlier.alpha argument of geom_boxplot changes the transparency of only the outliers.
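The sketch below shows the transparency approach on the mpg dataset that ships with ggplot2, which stands in for my data. The second plot is a common alternative and not the same thing as jittering only the outliers: it hides the boxplot's own outlier points and overlays all observations with jitter.

```r
library(ggplot2)

# Fade only the outlier points drawn by the boxplot
p_alpha <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(outlier.alpha = 0.3)

# Alternative: suppress the boxplot's outliers and jitter every observation
# (this jitters all points, not just the outliers)
p_jitter <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.3)
```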
dplyr::case_when is a super useful function that replaces my old trick of nested ifelse calls.
An Incomprehensible Example of ifelse
# Code from another project:
odd_man = ifelse(odd_man %in% c("1-0", "2-1", "3-2", "4-3"), "one_man",
          ifelse(odd_man %in% c("2-0", "3-1", "4-2", "3-0"), "two_plus_man",
                 "all_other_shots"))
A Readable Example of case_when
case_when(
  between(first_submit, week_seq, week_seq) ~ 1,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-02") ~ 2,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-09") ~ 3,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-16") ~ 4,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-23") ~ 5,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-30") ~ 6,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-06") ~ 7,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-13") ~ 8,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-20") ~ 9,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-27") ~ 10,
  TRUE ~ 0
)
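For a direct comparison, the nested ifelse from the other project can be sketched as a case_when. The tibble shots, its three sample values, and the new column name odd_man_type are hypothetical and exist only to make the example runnable.

```r
library(dplyr)

# Hypothetical sample of odd-man situations
shots <- tibble(odd_man = c("1-0", "3-1", "5-5"))

shots <- shots %>%
  mutate(odd_man_type = case_when(
    odd_man %in% c("1-0", "2-1", "3-2", "4-3") ~ "one_man",
    odd_man %in% c("2-0", "3-1", "4-2", "3-0") ~ "two_plus_man",
    TRUE ~ "all_other_shots"  # fallback when no earlier condition matches
  ))
```

Conditions are evaluated in order and the first match wins, so the TRUE line plays the role of the final else.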
ggplot2's scale functions have a labels parameter that allows you to change the labels of the legend.
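As a sketch of relabeling a legend, the example below uses scale_fill_discrete() on the mpg dataset bundled with ggplot2. The choice of scale function, the dataset, and the label text are all assumptions for illustration; the same labels argument exists on the other discrete scales.

```r
library(ggplot2)

# drv takes the values "4", "f" and "r"; the labels are given in that sorted order
p <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_discrete(
    name = "Drive train",
    labels = c("Four-wheel", "Front-wheel", "Rear-wheel")
  )
```

Only the legend text changes; the values stored in drv are not modified.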
broom::augment is a helpful function when tidying models. According to the documentation, augment adds columns to a dataset, containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have a . prefix to prevent existing columns from being overwritten.
In my case, I used augment in clustering to add a new column called cluster that holds each observation's cluster number.
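A minimal sketch of this workflow with k-means is shown below. It uses the built-in iris data, two arbitrarily chosen feature columns, and three centers, none of which come from my project. For a kmeans fit, augment adds the assignment as .cluster, so the sketch renames it to cluster afterwards.

```r
library(dplyr)
library(broom)

set.seed(1)  # k-means starts from random centers

# Stand-in feature table; my real features were different
features <- iris %>% select(Sepal.Length, Sepal.Width)
fit <- kmeans(features, centers = 3)

# augment() appends the cluster assignment as .cluster
clustered <- augment(fit, features) %>%
  rename(cluster = .cluster)
```

The result has one row per original observation, with the original columns kept alongside the new cluster column.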
- Last but not least, the compound assignment pipe operator, %<>% (from the magrittr package), is very convenient as it saves you time and space. Less code means fewer errors. Instead of
df <- df %>% mutate(time = c(a, b, c))
you can do
df %<>% mutate(time = c(a, b, c))
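Here is a runnable sketch of the two forms side by side. The tibbles and the time values are made up. Note that library(tidyverse) does not attach %<>%, so magrittr has to be loaded explicitly.

```r
library(dplyr)
library(magrittr)  # provides %<>%; not attached by library(tidyverse)

# Long form: pipe, then assign back to the same name
df_long <- tibble(x = 1:3)
df_long <- df_long %>% mutate(time = c(10, 20, 30))

# Short form: the assignment pipe updates df_short in place
df_short <- tibble(x = 1:3)
df_short %<>% mutate(time = c(10, 20, 30))
```

Both objects end up identical; the assignment pipe just removes the repeated name on the left-hand side.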