Learnings from the DataKind DataDive

2 minute read

I spent this weekend volunteering at a data hackathon with DataKind. The project was for a nonprofit whose mission is to counter vaccine misinformation and hesitancy. Our goal was to analyze public perception of the Covid-19 vaccine, and my particular focus was on how this is influenced by news media.

Given the limited time available during the project, I thought the highest-impact work would be to develop a web scraping tool that gathers news headline data from a variety of news outlets. That way, the nonprofit partner would be able to develop a database of information on which to apply the techniques developed by other participants (sentiment analysis, NER, topic modeling, etc.) over the course of the coming months.

I’ve done similar web scraping work before, but I built this script to be a bit more robust. In my experience, web scraping functions often break since websites change frequently. To avoid the nonprofit from having to debug when this happens, I created a simple function that wraps the scraping functions in a try except clause. That way, if something goes wrong with one outlet, an error notification is printed but the other five functions are executed normally. It’s not the fanciest implementation, but I’m hoping this will allow the non-technical nonprofit team to gather sufficient data over the coming months with as little frustration as possible!

Once this tool was done, I had a little bit of time for analysis. Given the tiny dataset (255 titles scraped just that day), the results weren’t super informative. But, while putting together the notebook, I learned a nifty trick about using R in Jupyter.

I wanted to just quickly use one function from the quanteda package in R, couldn’t find an equivalent in Python, and didn’t have time to develop it from scratch. Surprisingly, it’s actually simple to use R in notebooks using the rpy2 library.

Just load it:

%load_ext rpy2.ipython

Then begin each R code block with %%R -i data_frame_to_work_with – and voila, the code block can be used just like any R code block. For example:

%%R -i keyness_data

library(quanteda)
library(ggplot2)

keyness_corpus <- corpus(keyness_data$article_title)
docnames(keyness_corpus) <- keyness_data$publisher
docvars(keyness_corpus, "publisher") <- keyness_data$publisher
keyness_dfm <- dfm(keyness_corpus)

keyness_model <- textstat_keyness(keyness_dfm, target = "breitbart")
keyness_data <- textplot_keyness(keyness_model)$data

ggplot(keyness_data, 
       aes(x = reorder(feature, keyness),
           y = keyness)) +
  geom_bar(stat = "identity", 
           aes(fill = right)) +
  coord_flip() +
  theme_bw() +
  theme(axis.title.y = element_blank(),
        text = element_text(size = 8)) +
  labs(x = "Chi-Squared",
         title = "Top Terms Associated with Breitbart vs. Other Outlets",
         fill = "")

Upon reading the docs, it can get much more involved for complicated use cases – but for simple analyses, I thought this was super neat!