Refresher: Basics of String Manipulation
We start by taking a piece of text and turning it into something that carries the meaning of the initial text but is less noisy and thus perhaps easier for a computer to “understand”.
text <- "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag. Yet the 29 year old says the performances come at a high price"
# Transforming to lower case
text %>% str_to_lower()
[1] "the eton-educated, non-binary british iraqi had always struggled with their identity, until they discovered drag. yet the 29 year old says the performances come at a high price"
# Split by '.' (=sentence)
text %>% str_split('\\.')
[[1]]
[1] "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag"
[2] " Yet the 29 year old says the performances come at a high price"
# Replace every 'o' with 'O'
text %>% str_replace_all('o', 'O')
[1] "The EtOn-educated, nOn-binary British Iraqi had always struggled with their identity, until they discOvered drag. Yet the 29 year Old says the perfOrmances cOme at a high price"
# Remove punctuation, then split by ' ' (=word)
text %>% str_remove_all('[[:punct:]]') %>% str_split(' ') %>% unlist()
[1] "The" "Etoneducated" "nonbinary" "British" "Iraqi" "had" "always" "struggled" "with" "their" "identity" "until"
[13] "they" "discovered" "drag" "Yet" "the" "29" "year" "old" "says" "the" "performances" "come"
[25] "at" "a" "high" "price"
# Combine the steps: lower case, remove punctuation, split into words
text %>% str_to_lower() %>% str_remove_all('[[:punct:]]') %>% str_split(' ')
[[1]]
[1] "the" "etoneducated" "nonbinary" "british" "iraqi" "had" "always" "struggled" "with" "their" "identity" "until"
[13] "they" "discovered" "drag" "yet" "the" "29" "year" "old" "says" "the" "performances" "come"
[25] "at" "a" "high" "price"
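The individual steps above can be bundled into one small helper. This is only a minimal sketch: `simple_tokens` is a hypothetical name that is not part of the original material, and the extra `str_squish()` call is our own addition so that double spaces do not produce empty tokens.

```r
library(stringr)

# Hypothetical helper (not from the original material):
# lower-case, strip punctuation, tidy whitespace, split into words.
simple_tokens <- function(x) {
  x <- str_to_lower(x)                   # "The" and "the" become one token
  x <- str_remove_all(x, "[[:punct:]]")  # drop commas, full stops, hyphens, ...
  x <- str_squish(x)                     # trim and collapse repeated spaces
  unlist(str_split(x, " "))              # one element per word
}

simple_tokens("Yet the 29 year old  says the performances come at a high price.")
```

Note that removing punctuation glues hyphenated words together ("etoneducated", "nonbinary"), as in the output above. Replacing punctuation with a space instead would keep the parts as separate tokens.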
Trump Tweets: Processing many short texts and simple stats
An introduction to NLP would not be the same without Donald’s tweets. Let’s use them for some more basic NLP and try to gather some insights… maybe.
Let’s apply some very simple statistics to Twitter data, courtesy of the Trump Twitter Archive.
Note: Here we use precompiled data. However, you could instead use the rtweet package to collect your own data on tweets of interest.
# We will load some JSON files
library(jsonlite)
library(tidyjson)
# download and open some Trump tweets from trump_tweet_data_archive
tmp <- tempfile()
download.file("https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_2018.json.zip", tmp)
trying URL 'https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_2018.json.zip'
Content type 'application/zip' length 384688 bytes (375 KB)
==================================================
downloaded 375 KB
trump_tweets <- stream_in(unz(tmp, "condensed_2018.json"))
Found 1 records...
Imported 1 records. Simplifying...
trump_tweets %>% glimpse()
Rows: 3,510
Columns: 8
$ source <chr> "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "T…
$ id_str <chr> "1079888205351145472", "1079830268708556800", "1079830267274108930", "1079763923845419009", "1079763419908243456", "1079762413589807104", "10797487300588707…
$ text <chr> "HAPPY NEW YEAR! https://t.co/bHoPDPQ7G6", "....Senator Schumer, more than a year longer than any other Administration in history. These are people who have…
$ created_at <chr> "Mon Dec 31 23:53:06 +0000 2018", "Mon Dec 31 20:02:52 +0000 2018", "Mon Dec 31 20:02:52 +0000 2018", "Mon Dec 31 15:39:15 +0000 2018", "Mon Dec 31 15:37:14…
$ retweet_count <int> 33548, 17456, 21030, 29610, 30957, 1123, 25463, 22079, 15152, 22119, 17467, 20873, 61837, 32084, 25782, 44918, 30800, 25809, 34547, 30224, 29381, 27736, 300…
$ in_reply_to_user_id_str <chr> NA, "25073877", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ favorite_count <int> 136012, 65069, 76721, 127485, 132439, 4217, 112735, 91523, 72758, 101470, 79534, 97178, 233722, 131013, 123780, 150249, 118323, 109368, 144416, 129861, 1207…
$ is_retweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
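library(lubridate) # For working with times
# Rearrange the Twitter timestamp into year, month, day, time order before parsing
trump_tweets %<>%
  mutate(created_at = paste(substr(created_at, 27, 30),      # year
                            substr(created_at, 5, 7),        # month (abbreviated name)
                            substr(created_at, 9, 10),       # day
                            substr(created_at, 12, 20)) %>%  # time (HH:MM:SS)
           as_datetime())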
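The `substr()` positions above depend on every timestamp having exactly the same width. As an alternative, the raw string can be parsed in one step with a format specification. This is a sketch in base R, not the approach used in this tutorial; `parse_twitter_time` is a hypothetical helper name. Because `%a` and `%b` are locale-dependent, the sketch temporarily switches the time locale to "C" (English names).

```r
# Hypothetical helper: parse timestamps such as "Mon Dec 31 23:53:06 +0000 2018".
parse_twitter_time <- function(x) {
  # %a (weekday) and %b (month) are matched against the current locale,
  # so switch to the C locale for the duration of this call only.
  old_locale <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old_locale), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  as.POSIXct(x, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
}

parse_twitter_time("Mon Dec 31 23:53:06 +0000 2018")
```

It would have to be applied to the original character column, for example `mutate(created_at = parse_twitter_time(created_at))`, in place of the `substr()` step rather than after it.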
Note: We will not use the tweet timestamps for now, but feel free to explore them, and maybe reconstruct something inspired by THIS AMAZING PAPER!!!
# Let's filter out retweets
trump_tweets %<>%
  filter(is_retweet == FALSE)
# Let's tokenize. Note that there is a special tokenizer for tweets, which keeps useful special characters (such as # and @)
trump_token <- trump_tweets %>%
  select(id_str, text) %>%
  unnest_tokens(word, text, token = "tweets")
# Remove stop words
trump_token %<>%
  anti_join(stop_words, by = 'word')
trump_token %>% count(word, sort = TRUE) %>% head(100)
Let’s see who Trump mentions
trump_token %>%
  filter(word %>% str_detect('@')) %>%
  count(word, sort = TRUE)
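Mentions can also be pulled straight from the raw text with a regular expression, which does not rely on a tweet-aware tokenizer (one may not be available in every tidytext version). The sketch below runs on made-up toy tweets, not on the archive; the pattern `@\\w+` (an `@` followed by letters, digits, or underscores) is a simplifying assumption about what a handle looks like.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy tweets, invented for illustration only
toy_tweets <- tibble(
  id_str = c("1", "2", "3"),
  text   = c("Great meeting with @FLOTUS and @VP today!",
             "Thank you @VP. #MAGA",
             "No mentions here.")
)

toy_mentions <- toy_tweets %>%
  mutate(mention = str_extract_all(text, "@\\w+")) %>%  # list column: all handles per tweet
  unnest(mention) %>%                                    # one row per mention; tweets without one drop out
  mutate(mention = str_to_lower(mention)) %>%            # "@VP" and "@vp" count as the same handle
  count(mention, sort = TRUE)

toy_mentions
```

The same pattern with `#` instead of `@` extracts hashtags.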
Your turn
The link below holds a dataset with ~10k #OKBoomer tweets from 10-21 Nov 2019.
https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/tweets_boomer.zip
What to do:
- Use elements from the above code to make a list of the most common hashtags (you have to get the hashtags from the text, not using the column containing them already)
- Also try to have a look at hashtags over time: Take out the 10 most common hashtags, excluding #OKBoomer, and plot their occurrence over the days in the data
Plan of attack:
- Convert the timestamp into a datetime
- Calculate the occurrence of the specific hashtags (identified by a leading #) in the chosen timespan (here: days)
- Plot (days on x, n on y)
Go!
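If you get stuck, the plan of attack can be sketched on a tiny made-up data frame. Everything here is an assumption for illustration: the toy tweets are invented, and the column names `created_at` and `text` may differ from those in the real file.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)

# Toy data, invented for illustration; the real column names may differ
toy <- tibble(
  created_at = as.POSIXct(c("2019-11-10 08:00:00", "2019-11-10 21:30:00",
                            "2019-11-11 09:15:00", "2019-11-12 12:00:00"),
                          tz = "UTC"),
  text = c("#OKBoomer says it all #GenZ",
           "ok then #okboomer #Millennials",
           "#GenZ strikes again #OKBoomer",
           "#genz #climate")
)

# One row per hashtag occurrence, with the day it was tweeted
hashtags_by_day <- toy %>%
  mutate(day = as.Date(created_at),
         hashtag = str_extract_all(str_to_lower(text), "#\\w+")) %>%
  unnest(hashtag) %>%
  filter(hashtag != "#okboomer")   # exclude the hashtag every tweet shares

# The (up to) 10 most common remaining hashtags
top_tags <- hashtags_by_day %>%
  count(hashtag, sort = TRUE) %>%
  head(10) %>%
  pull(hashtag)

# Occurrences per day for those hashtags
daily_counts <- hashtags_by_day %>%
  filter(hashtag %in% top_tags) %>%
  count(day, hashtag)

# Days on x, n on y, one colour per hashtag
p <- ggplot(daily_counts, aes(x = day, y = n, colour = hashtag)) +
  geom_line() +
  geom_point()
```

Call `print(p)` to draw the plot. Lower-casing before extraction is a design choice so that `#GenZ` and `#genz` are counted together.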
