Nilesh Ingle

Github Projects

4. Twitter API

Posted: May 2, 2016; Updated: May 2, 2016

In this project, 5000 tweets tagged #fridayreads were pulled with the Twitter API from R, using the 'twitteR' library. After download, the data was cleaned largely with regular expressions ('regex'); several special cases were removed with individual commands. The downloaded tweets and the code are uploaded to my GitHub repository.

Questions asked:
  1. Which are the top authors?
  2. Which are the top books read?
  3. Which are the top trending hashtags?
  4. Can you identify bots in the tweets?
  5. Who are the top retweeters?
  6. How does the frequency of tweets vary with time?

Possible answers:
  1. Top authors were: Laura McNeil, Tayeb Salih, P. S. Winn
  2. Top books read: I Sister Dear, Season of Migration to North, Mystic Valley
  3. Top hashtags: #MSGfeelsProud, #amreading (excluding #fridayreads, which was the most frequent)
  4. Any bots: Yes, at least two bots were identified
  5. Top retweeter: DeoMil_LLC
  6. Frequency of tweets: More tweets were posted between 3:00 pm and 4:00 pm on Friday than between 5:00 pm and 10:00 pm
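
Questions 3 and 5 come down to counting. The block below is a minimal sketch of one way to do it in base R, not necessarily how the results above were produced. The data frame is a made-up stand-in for the downloaded tweets; its column names (text, screenName, isRetweet) are assumed to match what twListToDF() from 'twitteR' returns, and the account names are hypothetical.

```r
# Hypothetical stand-in for the downloaded tweets (not real data).
df <- data.frame(
  text = c(
    "Loving this one #FridayReads #amreading",
    "RT @someone: Great book #fridayreads",
    "RT @other: Weekend plans #amreading #books",
    "RT @someone: Another pick #FridayReads"
  ),
  screenName = c("reader_a", "reader_b", "reader_b", "reader_c"),
  isRetweet = c(FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Question 3: pull out every hashtag and lower-case it, so that
# #FridayReads and #fridayreads count as the same tag.
tags <- tolower(unlist(regmatches(df$text, gregexpr("#\\w+", df$text))))

# Drop the search tag itself, then rank the rest by frequency.
tags <- tags[tags != "#fridayreads"]
top_tags <- sort(table(tags), decreasing = TRUE)

# Question 5: count retweets per account and rank the accounts.
retweeters <- df$screenName[df$isRetweet]
top_retweeters <- sort(table(retweeters), decreasing = TRUE)

print(top_tags)
print(top_retweeters)
```

Lower-casing before counting is a design choice: without it, differently capitalised spellings of the same hashtag would be ranked separately.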

Sample code:

# Code to download tweets
library(twitteR)
consumer_key <- 'xxxx'
consumer_secret <- 'xxxx'
access_token <- 'xxxx'
access_secret <- 'xxxx'
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# download tweets to a variable
tgif <- searchTwitter("#FridayReads", n=5000, since='2016-04-08')

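# Cleaning the data
# Assumed step (not shown in the original): convert the list of tweets to a data frame
df <- twListToDF(tgif)
df2 <- as.data.frame(df$text)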
df3 <- df2[grep(" [Bb][Yy] ", df2[,1]),]
df3 <- as.data.frame(df3)
df4 <- df3
colnames(df3) <- c('books')
df3$books <- as.character(df3$books)
df3$books <- gsub("^RT @.*:\\s+", "", df3$books)
df3$books <- gsub("https://.*?$", "", df3$books)

df3$books <- gsub("#(.*?):", "", df3$books)
df3$books <- gsub("#(.*?) ", "", df3$books)

df3$book_name <- gsub("by.*", "", df3$books)  # keep the text before the first "by" (book title)
df3$author <- gsub(".*by", "", df3$books)     # keep the text after the last "by" (author)
df3$author <- gsub("#.*", "", df3$author)
df3$author <- as.character(df3$author)
df3$author <- gsub("@", "", df3$author)
df3$book_name <- gsub("#.*", "", df3$book_name)

#n <-gsub("^(\\w+\\s\\w+).*$", "\\1", df3$author)
#df3$au <- gsub("((\\w+\\W+){0,9}\\w+).*", "\\1", df3$author)
df3$author <- gsub("((\\w+\\W+){1}\\w+).*$", "\\1", df3$author)  # keep the first two words; won't work with "^" (the field starts with a space)

df3$book_name <- gsub("&", "", df3$book_name)
df3$author <- gsub("&", "", df3$author)

df3$book_name <- gsub("@the_author_", "", df3$book_name)
df3$book_name <- gsub("@(\\w+|\\W+)\\s", "", df3$book_name)

#df3$book_name <- gsub("<.*", "", df3$book_name) #won't work as these are non-english characters
df3$book_name <- gsub("\\[.*?\\]\\s", "", df3$book_name) #remove '[video]' 

df3 <- df3[!grepl("RT", df3$book_name),] #delete rows with 'RT'
#df3 [df3 ==""] <- NA #fill empty cells with NA
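
To make the cleaning steps easier to follow, here is a small worked example that runs the main substitutions on one made-up tweet. It is a sketch under assumptions: split_book_author() is a hypothetical helper, not part of the original code; the sample tweet, user name and URL are invented; and the final trimws() call is an addition that removes the leftover leading and trailing spaces.

```r
# Hypothetical helper: applies the main cleaning steps to a character vector
# of tweet texts and returns one row per tweet that contains " by ".
split_book_author <- function(text) {
  # Keep only tweets of the form "<book> by <author>".
  books <- text[grepl(" [Bb][Yy] ", text)]

  # Strip the retweet prefix, the trailing URL and any hashtags.
  books <- gsub("^RT @.*:\\s+", "", books)
  books <- gsub("https://.*?$", "", books)
  books <- gsub("#(.*?) ", "", books)

  # Split on "by": the text before it is the title, the text after it the author.
  book_name <- gsub("by.*", "", books)
  author <- gsub(".*by", "", books)

  # Keep only the first two words of the author field.
  author <- gsub("((\\w+\\W+){1}\\w+).*$", "\\1", author)

  # trimws() is an addition here, to drop surrounding whitespace.
  data.frame(
    book_name = trimws(book_name),
    author = trimws(author),
    stringsAsFactors = FALSE
  )
}

# Invented sample input: one retweet naming a book, one tweet without " by ".
sample_tweets <- c(
  "RT @bookfan: Season of Migration to the North by Tayeb Salih #FridayReads https://t.co/abc123",
  "Good morning everyone #FridayReads"
)
result <- split_book_author(sample_tweets)
print(result)
```

Because the split pattern is a bare "by", a title or name that contains those two letters (for example "baby") would be cut in the wrong place, and the two-word limit truncates longer author names. Cases like these are probably among the special cases that need individual handling.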

Figure gallery:
Plot_1
Figure 1: Top authors.

Plot_2
Figure 2: Top books.

Plot_3
Figure 3: Top hashtags.

Plot_4
Figure 4: Author word cloud.

Plot_5
Figure 5: Book word cloud.

Plot_6
Figure 6: Hashtag word cloud.

Plot_7
Figure 7: Raw word cloud (excluding common words).

Plot_8
Figure 8: Retweeters > 200 tweets.

Plot_9
Figure 9: All tweets.

Plot_10
Figure 10: Top re-tweeters.

Plot_11
Figure 11: Frequency of tweets with time.
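
The hourly counts behind a plot like Figure 11 could be computed roughly as follows. This is only a sketch: the timestamps are invented stand-ins for the created column of the tweet data frame, and they are assumed to be in UTC, since the time zone of the times quoted above is not stated.

```r
# Hypothetical timestamps standing in for the `created` column (assumed UTC).
created <- as.POSIXct(
  c("2016-04-08 15:05:00", "2016-04-08 15:40:00",
    "2016-04-08 15:59:00", "2016-04-08 17:10:00",
    "2016-04-08 21:30:00"),
  tz = "UTC"
)

# Bucket each tweet by its hour of day, then count tweets per bucket.
hour <- format(created, "%H", tz = "UTC")
tweets_per_hour <- table(hour)
print(tweets_per_hour)
```

Passing tweets_per_hour to barplot() would give a simple bar chart of tweets per hour.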