### Load standard packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For extra piping operators (e.g., %<>%)

This session

In this applied session, you will:

  1. Refresh basic string manipulation skills
  2. Learn how to tokenize texts and analyze these tokens
  3. Apply these skills on twitter data

Refresher: Basics of String Manipulation

We start by taking a piece of text and turning it into something that carries the meaning of the initial text but is less noisy and thus perhaps easier for a computer to “understand”.

text <- "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag. Yet the 29 year old says the performances come at a high price"
# Transforming to lower case
text %>% str_to_lower()
[1] "the eton-educated, non-binary british iraqi had always struggled with their identity, until they discovered drag. yet the 29 year old says the performances come at a high price"
# Split by '.' (=sentence)
text %>% str_split('\\.')
[[1]]
[1] "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag"
[2] " Yet the 29 year old says the performances come at a high price"                                                 
# Replace every 'o' with 'O'
text %>% str_replace_all('o', 'O')
[1] "The EtOn-educated, nOn-binary British Iraqi had always struggled with their identity, until they discOvered drag. Yet the 29 year Old says the perfOrmances cOme at a high price"
# Split by ' ' (=word)
text %>% str_remove_all('[[:punct:]]') %>% str_split(' ') %>% unlist()
 [1] "The"          "Etoneducated" "nonbinary"    "British"      "Iraqi"        "had"          "always"       "struggled"    "with"         "their"        "identity"     "until"       
[13] "they"         "discovered"   "drag"         "Yet"          "the"          "29"           "year"         "old"          "says"         "the"          "performances" "come"        
[25] "at"           "a"            "high"         "price"       
# Combine the steps: lower case, remove punctuation, split into words
text %>% str_to_lower() %>% str_remove_all('[[:punct:]]') %>% str_split(' ')
[[1]]
 [1] "the"          "etoneducated" "nonbinary"    "british"      "iraqi"        "had"          "always"       "struggled"    "with"         "their"        "identity"     "until"       
[13] "they"         "discovered"   "drag"         "yet"          "the"          "29"           "year"         "old"          "says"         "the"          "performances" "come"        
[25] "at"           "a"            "high"         "price"       
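
Note that removing punctuation fuses hyphenated words ("Eton-educated" becomes "etoneducated"). A minimal sketch of one possible workaround, assuming we would rather treat the parts as separate words: replace hyphens with spaces before stripping the remaining punctuation. The shortened `snippet` and the choice to split on hyphens are illustrative assumptions, not part of the original session.

```r
library(stringr)
library(magrittr)

# Shortened version of the example text (assumption: illustration only)
snippet <- "The Eton-educated, non-binary British Iraqi had always struggled with their identity"

words <- snippet %>%
  str_to_lower() %>%
  str_replace_all("-", " ") %>%       # turn hyphens into spaces first
  str_remove_all("[[:punct:]]") %>%   # then remove the remaining punctuation
  str_squish() %>%                    # collapse repeated whitespace
  str_split(" ") %>%
  unlist()

words
```

Whether "non-binary" should become one token or two depends on the analysis; this sketch only shows that the order of the cleaning steps matters.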

The R NLP ecosystem

  • Most language analysis approaches are based on the analysis of texts word-by-word.
  • Here, their order might matter (word sequence models) or not (bag-of-words models), but the smallest unit of analysis is usually the word.
  • This is usually done in the context of the document the word appeared in. Therefore, at first glance, three types of data structures make sense:
  1. Tidy: Approach, where data is served in a 2-column document-word format (e.g., tidytext)
  2. Token lists: Creation of special objects, saved as document-token lists or corpus (e.g., tm, quanteda)
  3. Matrix: Wide approach, where data is served as a document-term matrix, term-frequency matrix, etc.
  • Different forms of analysis (and the packages used for them) favor different structures, so we need to be fluent in transferring raw text into these formats, as well as in switching between them (for more info, check here).
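
To make the three structures concrete, here is a small sketch on two made-up documents. The toy texts and object names are assumptions for illustration; `unnest_tokens()` and `cast_sparse()` come from tidytext, and the token list is built with base R `split()` rather than with tm or quanteda objects.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two toy documents (made up for illustration)
docs <- tibble(
  id   = c(1, 2),
  text = c("Drag is an art form", "Art is a form of identity")
)

# 1. Tidy: one row per document-word pair
docs_tokens <- docs %>% unnest_tokens(word, text)
docs_tidy   <- docs_tokens %>% count(id, word)

# 2. Token list: one character vector of tokens per document
docs_list <- split(docs_tokens$word, docs_tokens$id)

# 3. Matrix: documents in rows, terms in columns, counts as values
docs_matrix <- docs_tidy %>% cast_sparse(id, word, n)

dim(docs_matrix)
```

Going back from a matrix-like object to the tidy format is typically done with `tidy()`, for example on a document-term matrix created by tm.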

Tidy Text Formats

  • While other ecosystems for text analysis exist (e.g., tm, quanteda), I will here almost exclusively use tidytext, which is simple yet powerful, very well documented, and works neatly with tidymodels and the rest of the tidyverse ecosystem.
library(tidytext)
  • While we will use different formats for later applications, we will here limit ourselves to word tokens, which can do most of the simple jobs.
  • Here, we apply tidy principles to text, making the word token per document our unit of analysis.
  • Therefore, every row represents a word per document. This sounds like a lot of redundancy, but makes the data very easy to work with compared to more complex matrix and list formats. Here, we can do our usual summaries and visualizations pretty much out-of-the-box.
# Tidytext wants a tibble as point of departure
text_tbl <- tibble(id = 1, text = text)
# We now unnest the tokens. Notice that by default this deletes all punctuation and transforms the text to lower case.
text_tidy <- text_tbl %>% unnest_tokens(word, text, token = 'words')
  • Overall, in NLP we are trying to represent meaning structure.
  • That means that we want to focus on the most important and “meaning-bearing elements” in text, while reducing noise.
  • Words such as “and”, “have”, “the” may have central syntactic functions but are not particularly important from a semantic perspective.
# Tidytext comes with a stopword lexicon
stop_words
text_tidy %<>%
  anti_join(stop_words, by = 'word')
text_tidy
# We can also unnest into sentences instead of words. The text is again transformed to lower case by default.
sentences_tidy <- text_tbl %>% unnest_tokens(sentence, text, token = 'sentences')
sentences_tidy
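
Because tidy tokens are an ordinary tibble, the usual dplyr and ggplot2 verbs apply directly. The following is a minimal sketch of a word-frequency summary and plot; the two toy sentences and all object names are assumptions for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(ggplot2)

# Two toy documents (made up for illustration)
example_tbl <- tibble(
  id   = 1:2,
  text = c("The performances come at a high price",
           "The performances were a discovery")
)

# Tokenize, drop stopwords, and count how often each word occurs
word_counts <- example_tbl %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Horizontal bar chart of the counts, most frequent word on top
word_plot <- ggplot(word_counts, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL)

word_counts
```

Printing `word_plot` renders the chart.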

Your turn!

Take the following text and transform it into a list of lists, with each element being a tokenized sentence. Remove stopwords, lowercase all tokens, and keep only (1) alpha-numeric word tokens, (2) character tokens.

I’ve been called many things in my life, but never an optimist. That was fine by me. I believed pessimists lived in a constant state of pleasant surprise: if you always expected the worst, things generally turned out better than you imagined. The only real problem with pessimism, I figured, was that too much of it could accidentally turn you into an optimist.

source: https://www.theguardian.com/global/2019/nov/21/glass-half-full-how-i-learned-to-be-an-optimist-in-a-week

Trump Tweets: Processing many short texts and simple stats

An introduction to NLP would not be the same without Donald’s tweets. Let’s use these tweets for some more basic NLP and let’s try to gather some insights…maybe

Let’s try some very simple statistics on Twitter data, thanks to the Trump Twitter Archive.

Note: Here we already use precompiled data. However, you could use the rtweet package and instead work with your own data on tweets of interest.

# we will load some json files
library(jsonlite)
library(tidyjson)
# download and open some Trump tweets from trump_tweet_data_archive
tmp <- tempfile()
download.file("https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_2018.json.zip", tmp)
trying URL 'https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_2018.json.zip'
Content type 'application/zip' length 384688 bytes (375 KB)
==================================================
downloaded 375 KB
trump_tweets <- stream_in(unz(tmp, "condensed_2018.json"))

 Found 1 records...
 Imported 1 records. Simplifying...
trump_tweets %>% glimpse()
Rows: 3,510
Columns: 8
$ source                  <chr> "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "Twitter for iPhone", "T…
$ id_str                  <chr> "1079888205351145472", "1079830268708556800", "1079830267274108930", "1079763923845419009", "1079763419908243456", "1079762413589807104", "10797487300588707…
$ text                    <chr> "HAPPY NEW YEAR! https://t.co/bHoPDPQ7G6", "....Senator Schumer, more than a year longer than any other Administration in history. These are people who have…
$ created_at              <chr> "Mon Dec 31 23:53:06 +0000 2018", "Mon Dec 31 20:02:52 +0000 2018", "Mon Dec 31 20:02:52 +0000 2018", "Mon Dec 31 15:39:15 +0000 2018", "Mon Dec 31 15:37:14…
$ retweet_count           <int> 33548, 17456, 21030, 29610, 30957, 1123, 25463, 22079, 15152, 22119, 17467, 20873, 61837, 32084, 25782, 44918, 30800, 25809, 34547, 30224, 29381, 27736, 300…
$ in_reply_to_user_id_str <chr> NA, "25073877", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ favorite_count          <int> 136012, 65069, 76721, 127485, 132439, 4217, 112735, 91523, 72758, 101470, 79534, 97178, 233722, 131013, 123780, 150249, 118323, 109368, 144416, 129861, 1207…
$ is_retweet              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
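library(lubridate) # For working with times
# Rearrange "Mon Dec 31 23:53:06 +0000 2018" into "2018 Dec 31 23:53:06" so that as_datetime() can parse it
trump_tweets %<>%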
  mutate(created_at = paste(substr(created_at,27,30),
                      substr(created_at,5,7),
                      substr(created_at,9,10),
                      substr(created_at,12,20)) %>% 
           as_datetime())
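
As an alternative to cutting the string apart with substr(), the timestamp can also be parsed in one step with an explicit format string. This is a sketch in base R, not the approach used above; it assumes an English (or C) locale, since `%a` and `%b` match weekday and month abbreviations in the current locale. The variable names are hypothetical.

```r
# One timestamp in the format found in the created_at column
example_stamp <- "Mon Dec 31 23:53:06 +0000 2018"

# %a = weekday, %b = month abbreviation, %z = signed UTC offset, %Y = year
parsed <- as.POSIXct(example_stamp,
                     format = "%a %b %d %H:%M:%S %z %Y",
                     tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")
```

The same call works on a whole column, for example inside `mutate()`.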

Note: We will not use the tweet timestamps for now, but feel free to explore them, and maybe reconstruct something inspired by THIS AMAZING PAPER!!!

# Let's filter out retweets
trump_tweets %<>%
  filter(is_retweet == FALSE)
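# Let's tokenize. Notice that there is a special tokenizer for tweets which keeps useful special characters (e.g., # and @)
# Note: token = "tweets" works with tidytext 0.3.0 (see Session Info); it may be unavailable in newer tidytext releases
trump_token <- trump_tweets %>%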
  select(id_str, text) %>%
  unnest_tokens(word, text, token = "tweets")
# Remove stopwords
trump_token %<>%
  anti_join(stop_words, by = 'word')
trump_token %>% count(word, sort = TRUE) %>% head(100)

Let's see who Trump mentions

trump_token %>%
  filter(word %>% str_detect('@')) %>%
  count(word, sort = TRUE)
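
If the tweet tokenizer is not available, mentions can also be pulled out of the raw text with a regular expression. The sketch below uses made-up tweets and handles; the pattern `@\\w+` is an assumption about what counts as a mention (an @ followed by letters, digits, or underscores).

```r
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)

# Made-up tweets for illustration (not from the archive)
toy_tweets <- tibble(
  id_str = c("1", "2", "3"),
  text   = c("Thanks @alice and @bob!", "Great point, @Alice", "No mentions here")
)

mentions <- toy_tweets %>%
  mutate(mention = str_extract_all(text, "@\\w+")) %>%  # list column of matches
  select(id_str, mention) %>%
  unnest(cols = mention) %>%                             # one row per mention
  mutate(mention = str_to_lower(mention)) %>%
  count(mention, sort = TRUE)

mentions
```

Tweets without any mention drop out at the `unnest()` step, since their list entry is empty.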

Your turn

The link below holds a dataset with ~10k #OKBoomer tweets from 10-21 Nov 2019.

https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/tweets_boomer.zip

What to do:

  • Use elements from the above code to make a list of the most common hashtags (you have to get the hashtags from the text, not from the column that already contains them).
  • Also try to have a look at hashtags over time: take the 10 most common hashtags, excluding #OKBoomer, and plot their occurrence over the days in the data.

Plan of attack:

  • Convert the timestamp into a datetime
  • Calculate the occurrence of the specific hashtags (identified by a leading #) in the chosen timespan (here: days)
  • Plot (days on x, n on y)
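
The plan can be sketched on a tiny made-up data set before applying it to the real one. The column names (`created_at`, `text`), the timestamp format, and the hashtag pattern `#\\w+` are assumptions here; check them against the actual file.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
library(lubridate)
library(ggplot2)

# Made-up tweets for illustration (column names and formats are assumptions)
toy <- tibble(
  created_at = c("2019-11-10 08:15:00", "2019-11-10 21:40:00", "2019-11-11 09:05:00"),
  text       = c("#OKBoomer meets #GenZ", "so true #genz #OKBoomer", "#Millennials on #OKBoomer")
)

# Timestamp to day, then one row per hashtag found in the text
tags <- toy %>%
  mutate(day     = as_date(ymd_hms(created_at)),
         hashtag = str_extract_all(str_to_lower(text), "#\\w+")) %>%
  unnest(cols = hashtag) %>%
  filter(hashtag != "#okboomer")

# Keep the (up to) 10 most common hashtags and count them per day
top_tags <- tags %>% count(hashtag, sort = TRUE) %>% slice_head(n = 10)

hashtags_by_day <- tags %>%
  semi_join(top_tags, by = "hashtag") %>%
  count(day, hashtag)

# Days on x, n on y, one line per hashtag
hashtag_plot <- ggplot(hashtags_by_day, aes(x = day, y = n, colour = hashtag)) +
  geom_line() +
  geom_point()

hashtags_by_day
```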

Go!

Endnotes

Main reference

  • R for Data Science (Grolemund & Wickham)
    • Chapter 14: To refresh simple string manipulations
  • Julia Silge and David Robinson (2020). Text Mining with R: A Tidy Approach, O’Reilly. Online available here
    • Chapter 1: Introduction to the tidy text format

Packages & Ecosystem

Further:

  • rtweet: R interface to the Twitter API.

Suggestions for further study

Session Info

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 
 
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.7.10 tidyjson_0.3.1   jsonlite_1.7.2   tidytext_0.3.0   magrittr_2.0.1   forcats_0.5.1    stringr_1.4.0    dplyr_1.0.5      purrr_0.3.4      readr_1.4.0     
[11] tidyr_1.1.3      tibble_3.1.0     ggplot2_3.3.3    tidyverse_1.3.0  knitr_1.31      

loaded via a namespace (and not attached):
 [1] httr_1.4.2         splines_4.0.3      prodlim_2019.11.13 modelr_0.1.8       assertthat_0.2.1   cellranger_1.1.0   yaml_2.2.1         ipred_0.9-10       pillar_1.5.1      
[10] backports_1.2.1    lattice_0.20-41    glue_1.4.2         pROC_1.17.0.1      rvest_0.3.6        hardhat_0.1.5      colorspace_2.0-0   recipes_0.1.15     Matrix_1.3-2      
[19] plyr_1.8.6         timeDate_3043.102  pkgconfig_2.0.3    broom_0.7.5        DiceDesign_1.9     haven_2.3.1        scales_1.1.1       gower_0.2.2        lava_1.6.8.1      
[28] parsnip_0.1.5      generics_0.1.0     xgboost_1.3.2.1    ellipsis_0.3.1     withr_2.4.1        nnet_7.3-15        cli_2.3.1          survival_3.2-7     crayon_1.4.1      
[37] readxl_1.3.1       tokenizers_0.2.1   janeaustenr_0.1.5  fs_1.5.0           fansi_0.4.2        SnowballC_0.7.0    MASS_7.3-53.1      xml2_1.3.2         dials_0.0.9       
[46] class_7.3-18       tools_4.0.3        data.table_1.14.0  hms_1.0.0          lifecycle_1.0.0    munsell_0.5.0      reprex_1.0.0       compiler_4.0.3     rlang_0.4.10      
[55] debugme_1.1.0      grid_4.0.3         yardstick_0.0.7    rstudioapi_0.13    gtable_0.3.0       DBI_1.1.1          R6_2.5.0           utf8_1.1.4         stringi_1.5.3     
[64] Rcpp_1.0.6         vctrs_0.3.6        rpart_4.1-15       dbplyr_2.1.0       tidyselect_1.1.0   xfun_0.21         
