
A Quick Jab at K-Means Clustering

Introduction

Motivated by #tidytuesday, I will be working with the entire trove of #rstats tweets from its inception on September 7th, 2008 to the most recent one on December 12th, 2018.

This tweet by Robert inspired me to use K-Means Clustering, an unsupervised learning method I know a little about but would like to learn more about. I grouped Twitter accounts by a “popularity measure”, a quick and easy way (of my own devising) to quantify the popularity of a Twitter account:

Popularity Measure = Likes + Retweets + Number of Followers
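
To make the formula concrete, here is a minimal sketch in base R. The column names match the ones used later in this post; the three accounts and their values are made up for illustration.

```r
# Toy data: column names follow the post, values are invented.
accounts <- data.frame(
  screen_name     = c("user_a", "user_b", "user_c"),
  favorite_count  = c(10, 3, 25),
  retweet_count   = c(4, 1, 12),
  followers_count = c(500, 1200, 450)
)

# Popularity Measure = Likes + Retweets + Number of Followers
accounts$popularity <- with(
  accounts,
  favorite_count + retweet_count + followers_count
)

# Rank accounts from most to least "popular"
accounts[order(-accounts$popularity), c("screen_name", "popularity")]
```

In this toy example the follower count dominates the sum. In the analysis below the three components are kept as separate, scaled variables rather than added together.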

Caution: Don’t expect my analysis to be perfect! It will have its flaws. Please feel free to leave any feedback in the comments below. Thank you.

Analysis

To start off, I used filter to keep only #rstats tweets from 2018 by accounts with more than 400 followers, and select to keep the following columns: screen_name, favorite_count (Likes), retweet_count, and followers_count. The result is saved as rstats_reduced.csv.

First, I load packages and import my dataset:

library(tidyverse)
library(cluster) # For Silhouette Method
library(broom)
library(plotly) # For Interactive Visualization
theme_set(theme_light()) # Set default theme

rstats <- read_csv("rstats_reduced.csv")

For the rest of this article, I follow a clear tutorial in the UC Business Analytics R Programming Guide by Bradley Boehmke.

  1. Make rows as observations and columns as variables
# Find the mean of Likes and Retweets
rstats_clustering <- rstats %>% 
  group_by(screen_name) %>% 
  mutate(favorite_count_avg = mean(favorite_count),
         retweet_count_avg = mean(retweet_count)) %>% 
  filter(row_number() == 1) %>% 
  select(screen_name, followers_count, favorite_count_avg, retweet_count_avg) %>% 
  filter(favorite_count_avg <= 20, retweet_count_avg <= 20)
  2. Check for missing values
rstats_clustering %>% 
  map_dbl(~sum(is.na(.)))
##        screen_name    followers_count favorite_count_avg 
##                  0                  0                  0 
##  retweet_count_avg 
##                  0

None of these columns have NA values, so I don’t need to worry about removing or estimating missing values.

  3. Scale the data (scale() turns our dataset into a matrix)
rstats_clustering <- rstats_clustering %>% 
  as.data.frame() %>% # as_data_frame() is deprecated in tibble
  remove_rownames() %>% 
  column_to_rownames(var = "screen_name") %>% 
  scale()
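
As a quick illustration of what scale() does, the sketch below standardizes a small made-up matrix and checks the result against a manual calculation. The values are invented; only the column names come from this post.

```r
# Toy matrix with two variables on very different scales (invented values).
toy <- cbind(
  followers_count    = c(500, 1200, 450, 3000),
  favorite_count_avg = c(2, 5, 1, 8)
)

scaled <- scale(toy)

# scale() subtracts each column's mean, then divides by its standard deviation.
centered <- sweep(toy, 2, colMeans(toy), FUN = "-")
manual   <- sweep(centered, 2, apply(toy, 2, sd), FUN = "/")

all.equal(as.vector(scaled), as.vector(manual))  # same numbers either way
apply(scaled, 2, sd)                             # each column now has sd 1
is.matrix(scaled)                                # the result is a matrix
```

Without this step, followers_count would dominate the Euclidean distances that K-Means relies on, simply because its values are much larger.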

Now I am done with data preparation and ready to run the K-Means Algorithm… Not so fast. I need to determine the optimal number of clusters beforehand.

Elbow & Silhouette Method (Code borrowed from DataCamp class: “Cluster Analysis in R”)

# Set seed
set.seed(42)

# Use map_dbl to run many models with varying value of k (centers)
rstats_tot_withinss <- map_dbl(1:10,  function(k){
  rstats_kmeans <- kmeans(rstats_clustering, centers = k)
  rstats_kmeans$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10 ,
  tot_withinss = rstats_tot_withinss
)

# Plot elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "K",
       y = "Total within-cluster sum of squares") +
  ggtitle("Elbow Method",
          subtitle = "Since there is no definite elbow, let's use Silhouette Method")
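
For readers wondering what tot.withinss actually measures, here is a small sketch on simulated data (an assumption, standing in for the scaled matrix). It recomputes the quantity by hand as the sum of squared distances from each point to the centre of its assigned cluster.

```r
set.seed(42)

# Simulated 2-D data: two loose groups of 50 points each.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)

km <- kmeans(toy, centers = 3, nstart = 25)

# Look up the centre assigned to every point, then sum the squared differences.
assigned_centers    <- km$centers[km$cluster, , drop = FALSE]
manual_tot_withinss <- sum((toy - assigned_centers)^2)

all.equal(manual_tot_withinss, km$tot.withinss)
```

Adding clusters generally lowers this number, which is why the elbow plot looks for the point where the decrease levels off rather than for the minimum.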

Silhouette Method

# Use map_dbl to fit pam() (k-medoids, from the cluster package) for k = 2 to 10
# and extract the average silhouette width of each model
sil_width <- map_dbl(2:10,  function(k){
  rstats_clustering_model <- pam(x = rstats_clustering, k = k)
  rstats_clustering_model$silinfo$avg.width
})

sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10) +
  labs(x = "K",
       y = "Average Silhouette Widths") +
  ggtitle("Optimal Number of Clusters",
          subtitle = "K = 2 has the highest Silhouette Score")
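
The code above scores pam() models, which fit k-medoids rather than k-means. As an alternative sketch, the average silhouette width can also be computed for kmeans() results directly with cluster::silhouette(). The data here are simulated, so this is an illustration rather than a rerun of the analysis.

```r
library(cluster)  # silhouette()

set.seed(42)

# Simulated data: two well-separated groups of 50 points each.
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances, reused for every k

# Average silhouette width of a k-means solution for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

avg_sil
```

Silhouette widths range from -1 to 1, and values closer to 1 indicate that points sit well inside their own cluster.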

Finally, let’s run the K-Means Algorithm with K = 2.

rstats_clustering_kmeans <- kmeans(rstats_clustering,
                                   centers = 2,
                                   nstart = 25)
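
Because the model is fitted on scaled data, its cluster centres are in standard-deviation units. The sketch below, on made-up data, shows one way to inspect cluster sizes and convert the centres back to the original units using the attributes that scale() stores on its result.

```r
set.seed(42)

# Invented data in original units, standing in for the real accounts.
raw <- cbind(
  followers_count    = c(rnorm(30, 600, 50), rnorm(30, 5000, 400)),
  favorite_count_avg = c(rnorm(30, 2, 0.5), rnorm(30, 10, 1))
)
scaled <- scale(raw)

km <- kmeans(scaled, centers = 2, nstart = 25)

km$size  # number of observations in each cluster

# Undo the scaling: multiply by the column SDs, then add back the column means.
centers_raw <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), FUN = "*")
centers_raw <- sweep(centers_raw, 2, attr(scaled, "scaled:center"), FUN = "+")
centers_raw
```

The nstart = 25 argument runs the algorithm from 25 random starting configurations and keeps the best one, which makes the result less sensitive to an unlucky start.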

Now let’s visualize our clustering analysis using a scatterplot of Retweets vs Likes with points colored by cluster. The plotly package allows users to hover over each point and view the underlying data. Let’s show the Twitter handle, average likes, average retweets and cluster assignments in this plotly graph. Note that rstats_clustering holds the scaled data, so the plotted values are standardized scores rather than raw counts.

clustering_plot <- rstats_clustering %>% 
  as_tibble() %>% 
  mutate(cluster = rstats_clustering_kmeans$cluster,
         screen_name = rownames(rstats_clustering)) %>% 
  ggplot(aes(x = favorite_count_avg,
             y = retweet_count_avg,
             color = factor(cluster),
             text = paste('Twitter Handle: ', screen_name,
                  '<br>Average Likes:', round(favorite_count_avg, 1), 
                  '<br>Average Retweets:', round(retweet_count_avg, 1),
                  '<br>Cluster:', cluster))) +
  geom_point(alpha = 0.3) +
  labs(x = "Average Likes",
       y = "Average Retweets",
       color = "Cluster"
       ) +
  ggtitle("K-Means (K=2) Clustering of Twitter Screen Names")
  
ggplotly(clustering_plot, tooltip = "text")

The End

In this article, I imported the #rstats tweets dataset, prepared it for cluster analysis, determined the optimal number of clusters, and visualized those clusters with plotly. The interactive graph shows two clusters of Twitter accounts formed from average retweets, average likes, and number of followers.

Even though the K-Means algorithm isn’t a perfect algorithm, it can be very useful for exploratory data analysis. It divides Twitter accounts into two groups according to the components of my “popularity measure” and allows for further analysis of one group (Cluster 1 or Cluster 2) at a time.
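
As a rough sketch of what analysing one group at a time could look like, the example below attaches cluster labels to a tiny invented data frame and summarizes each cluster separately. The account names and values are hypothetical.

```r
set.seed(42)

# Invented accounts: three with low engagement, three with high engagement.
accounts <- data.frame(
  screen_name        = paste0("user_", 1:6),
  favorite_count_avg = c(1, 2, 1.5, 9, 10, 11),
  retweet_count_avg  = c(0.5, 1, 1, 5, 6, 7)
)

# Cluster on the scaled numeric columns and keep the labels with the data.
km <- kmeans(scale(accounts[, -1]), centers = 2, nstart = 25)
accounts$cluster <- km$cluster

# One data frame per cluster, ready for separate analysis.
by_cluster <- split(accounts, accounts$cluster)

lapply(by_cluster, function(df) {
  colMeans(df[, c("favorite_count_avg", "retweet_count_avg")])
})
```

Which group ends up labelled 1 or 2 is arbitrary, so it is safer to describe clusters by their summaries than by their numbers.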