## Introduction

Motivated by #tidytuesday, I will be working with the entire trove of #rstats tweets, from the hashtag's inception on September 7th, 2008, to the most recent tweet on December 12th, 2018.

This tweet by Robert inspired me to use K-Means Clustering, an unsupervised learning method I know a little about but would like to learn more about. I grouped Twitter accounts by a “popularity measure”, a quick and easy way (that I came up with) to quantify the popularity of a Twitter account:

Popularity Measure = Likes + Retweets + Number of Followers

*Caution: Don’t expect my analysis to be perfect! It will have its flaws. Please feel free to leave any feedback in the comments below. Thank you.*
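The popularity measure above can be sketched in R. This is a hypothetical illustration: the column names match the dataset described below, but the toy data and the `popularity` column name are mine.

```r
# Hypothetical sketch: compute the popularity measure per tweet
library(dplyr)

tweets <- tibble::tibble(
  screen_name     = c("user_a", "user_b"),
  favorite_count  = c(10, 2),    # Likes
  retweet_count   = c(4, 1),     # Retweets
  followers_count = c(500, 450)  # Number of Followers
)

# Popularity Measure = Likes + Retweets + Number of Followers
tweets <- tweets %>%
  mutate(popularity = favorite_count + retweet_count + followers_count)
```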

## Analysis

To start off, I `filter`ed only for #rstats tweets in 2018 and tweets by accounts with more than 400 followers, and `select`ed the following columns: `screen_name`, `favorite_count` (Likes), `retweet_count`, and `followers_count`. Hence, I have `rstats_reduced.csv`.

First, I load the packages and import my dataset.

```
library(tidyverse)
library(cluster) # For Silhouette Method
library(broom)
library(plotly) # For Interactive Visualization
theme_set(theme_light()) # Set default theme
rstats <- read_csv("rstats_reduced.csv")
```

For the rest of this article, I follow a clear tutorial in the UC Business Analytics R Programming Guide by Bradley Boehmke.

- Make rows as observations and columns as variables

```
# Find the mean of Likes and Retweets per account
rstats_clustering <- rstats %>%
  group_by(screen_name) %>%
  mutate(favorite_count_avg = mean(favorite_count),
         retweet_count_avg = mean(retweet_count)) %>%
  filter(row_number() == 1) %>%  # keep one row per account
  select(screen_name, followers_count, favorite_count_avg, retweet_count_avg) %>%
  filter(favorite_count_avg <= 20, retweet_count_avg <= 20)
```

- Check for missing values

```
rstats_clustering %>%
  map_dbl(~ sum(is.na(.)))
```

```
##        screen_name    followers_count favorite_count_avg
##                  0                  0                  0
##  retweet_count_avg
##                  0
```

None of these columns have NA values, so I don’t need to worry about removing or estimating missing values.

- Scaling the data (`scale` turns our dataset into a matrix)

```
rstats_clustering <- rstats_clustering %>%
  as.data.frame() %>%  # as_data_frame() is deprecated; a plain data.frame supports row names
  remove_rownames() %>%
  column_to_rownames(var = "screen_name") %>%
  scale()
```

Now, I am done with data preparation and am ready to run the K-Means algorithm… Not so fast. I need to determine the optimal number of clusters beforehand.

Elbow & Silhouette Methods (code borrowed from the DataCamp course “Cluster Analysis in R”)

```
# Set seed for reproducibility
set.seed(42)

# Use map_dbl to run many models with varying values of k (centers)
rstats_tot_withinss <- map_dbl(1:10, function(k) {
  rstats_kmeans <- kmeans(rstats_clustering, centers = k)
  rstats_kmeans$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = rstats_tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "K",
       y = "Total within-cluster sum of squares") +
  ggtitle("Elbow Method",
          subtitle = "Since there is no definite elbow, let's use Silhouette Method")
```

Silhouette Method

```
# Use map_dbl to run many models with varying values of k (centers)
sil_width <- map_dbl(2:10, function(k) {
  rstats_clustering_model <- pam(x = rstats_clustering, k = k)
  rstats_clustering_model$silinfo$avg.width
})

sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10) +
  labs(x = "K",
       y = "Average Silhouette Width") +
  ggtitle("Optimal Number of Clusters",
          subtitle = "K = 2 has the highest Silhouette Score")
```

Finally, let’s run the K-Means Algorithm with K = 2.

```
rstats_clustering_kmeans <- kmeans(rstats_clustering,
                                   centers = 2,
                                   nstart = 25)
```
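The `broom` package loaded at the start can summarize this fit. As a quick sketch (assuming the `rstats_clustering_kmeans` object above), `tidy()` returns one row per cluster with its center coordinates and size, and `glance()` returns one-row model-level summaries such as `tot.withinss`:

```r
# Sketch: inspect the k-means fit with broom
library(broom)

tidy(rstats_clustering_kmeans)    # per-cluster centers and sizes
glance(rstats_clustering_kmeans)  # model-level fit statistics
```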

Now let’s visualize our clustering analysis using a scatterplot of Retweets vs. Likes, with points colored by cluster. The `plotly` package allows users to hover over each point and view the underlying data. Let’s show the Twitter handle, average likes, average retweets, and cluster assignment in this `plotly` graph.

```
clustering_plot <- rstats_clustering %>%
  as_tibble() %>%
  mutate(cluster = rstats_clustering_kmeans$cluster,
         screen_name = rownames(rstats_clustering)) %>%
  ggplot(aes(x = favorite_count_avg,
             y = retweet_count_avg,
             color = factor(cluster),
             text = paste('Twitter Handle: ', screen_name,
                          '<br>Average Likes:', round(favorite_count_avg, 1),
                          '<br>Average Retweets:', round(retweet_count_avg, 1),
                          '<br>Cluster:', cluster))) +
  geom_point(alpha = 0.3) +
  labs(x = "Average Likes",
       y = "Average Retweets",
       color = "Cluster") +
  ggtitle("K-Means (K=2) Clustering of Twitter Screen Names")

ggplotly(clustering_plot, tooltip = "text")
```

## The End

In this article, I imported the #rstats tweets dataset, prepared it for cluster analysis, determined the optimal number of clusters, and visualized those clusters with `plotly`. The interactive graph clearly shows two distinct clusters of Twitter accounts according to average retweets, average likes, and number of followers.

Even though K-Means isn’t a perfect algorithm, it can be very useful for **exploratory data analysis**. It divides Twitter accounts into two groups according to my “popularity measure” and allows for further analysis of one group (Cluster 1 or Cluster 2) at a time.
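That one-group-at-a-time follow-up can be sketched like this (assuming the objects built above; note the values are still on the scaled, not the original, scale):

```r
# Sketch: examine one cluster at a time (assumes the objects built above)
cluster_df <- rstats_clustering %>%
  as_tibble() %>%
  mutate(cluster = rstats_clustering_kmeans$cluster,
         screen_name = rownames(rstats_clustering))

# Accounts assigned to Cluster 2, sorted by (scaled) follower count
cluster_df %>%
  filter(cluster == 2) %>%
  arrange(desc(followers_count))
```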