Twitter COVID Misinformation Network
Keywords: parallel computing, network analysis
Research Problem
In recent years, there has been a surge of literature on misinformation and disinformation spreading on social media and online forums such as Reddit and Twitter. For instance, Dr. Kathleen Carley at CMU, in her recent work on COVID-19-related misinformation on Twitter, found that the level of disinformation is unprecedented compared to “what we’ve seen in the past,” and that much of it is actually driven by bots. According to the news release, "Of the top 50 influential retweeters, 82% were bots. Of the top 1,000 retweeters, 62% were bots."
How is misinformation about the coronavirus spread on Twitter? In my research project, I aim to answer this question by taking advantage of large-scale computing tools such as PyWren and PySpark GraphFrames. The project can be broken down into two parts. The first part collects tweets with PyWren by mapping a scraper across a list of queries targeting misinformation. The second part runs large-scale network analysis, using GraphFrames, on the retweet network of users who retweet misinformation.
Team
I worked independently on this project.
Project Workflow
One of the difficulties of the project lies in how to define misinformation. To find a reasonable way to collect tweets that might contain misinformation, I eyeballed the search results of different queries on Twitter’s web interface and decided to use keywords such as “coronavirus lie” or “5G responsible corona” to identify tweets that might contain misinformation. This method seemed to work: I started with two queries, “coronavirus lie” and “COVID lie,” and as I added more queries to my code, the same names kept surfacing as influencers in the network (based on degree centrality and in-degree centrality). The query list is sketched below.
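As an illustration, the query list might look like the following. Only the first two queries are explicitly named above; the remaining entries are hypothetical examples of the kind of keyword combinations described, not necessarily the exact set used in the project.

    # Hypothetical list of search queries used to surface candidate misinformation tweets.
    MISINFO_QUERIES = [
        "coronavirus lie",
        "COVID lie",
        "5G responsible corona",
        # ... additional keyword queries can be appended here to scale up collection
    ]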
In the first part of the project, I used PyWren as a parallel solution to scrape large numbers of tweets efficiently (i.e., with reduced wall time). I used PyWren because the scraping process, as I laid it out, is embarrassingly parallel. I created a list of queries containing keywords for misinformation and, for each query in the list, used PyWren to map it to a scraper function. Since what I am interested in is the retweet network of Twitter users, I only need a few tweet fields, such as the retweeting user’s screen name and the screen name of the user being retweeted. In my simple retweet network, the source node is the user who retweets a tweet containing misinformation, and the destination node is the author of the original tweet matched by the queries. The scraper function I defined thus returns exactly two pieces of information: the nodes and the edges. The nodes contain the name of the user who retweets, the name of the user being retweeted, and the type of each user (retweeting user vs. user being retweeted). Each edge records the direction from the source node to the destination node. A sketch of this mapping step follows.
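A minimal sketch of the PyWren mapping step is shown below, using PyWren's standard executor API. The helper fetch_retweets is a hypothetical placeholder for whatever Twitter search client or scraper is actually used, and the return format is illustrative rather than the project's exact implementation.

    import pywren

    MAX_TWEETS = 200  # maximum tweets returned per query, as described above

    def fetch_retweets(query, max_tweets):
        """Placeholder for the actual Twitter search/scraper call.
        Should yield (retweeter_screen_name, retweeted_screen_name) pairs."""
        return []  # replace with real API or scraper calls

    def scrape_query(query):
        """Scrape retweets matching one query and return (nodes, edges) records."""
        nodes, edges = [], []
        for retweeter, retweeted in fetch_retweets(query, max_tweets=MAX_TWEETS):
            nodes.append((retweeter, "user"))
            nodes.append((retweeted, "user being retweeted"))
            edges.append((retweeter, retweeted))  # direction: source -> destination
        return nodes, edges

    # Map the scraper across the query list; each query runs in its own Lambda worker.
    pwex = pywren.default_executor()
    futures = pwex.map(scrape_query, MISINFO_QUERIES)
    results = pywren.get_all_results(futures)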
In the second part of the project, I used PySpark’s GraphFrames to process the two CSV files (one for nodes and the other for edges) that I uploaded to AWS S3. To find out who the influencers are in the misinformation retweet network, I computed degree centrality along with in-degree and out-degree counts. Given how I modeled the source and destination nodes, the top influencers (those with the highest in-degree) should be the users who get retweeted most frequently. I identify these users as perhaps the most responsible for the spread of misinformation on Twitter. I also computed PageRank scores to get a sense of these users’ popularity. Lastly, taking advantage of the scalability of PySpark, I ran a motif search to find patterns in which different users were retweeted by the same user. What I found was that the retweeting of misinformation (as I operationalized the term) was in fact carried out by a very small group of users (who may or may not be bots). A sketch of this analysis is shown below.
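The following is a minimal sketch of the GraphFrames analysis, assuming the node CSV has an id column (plus a user-type column) and the edge CSV has src and dst columns; the S3 paths and column names are illustrative placeholders, not the project's actual bucket layout.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("misinfo-retweet-network").getOrCreate()

    # Illustrative S3 paths; replace with the actual bucket/keys used in the project.
    vertices = spark.read.csv("s3://my-bucket/nodes.csv", header=True)  # columns: id, type
    edges = spark.read.csv("s3://my-bucket/edges.csv", header=True)     # columns: src, dst

    g = GraphFrame(vertices, edges)

    # Degree-based influence measures: users with the highest in-degree are the
    # ones retweeted most often, i.e., the likely sources of misinformation.
    g.degrees.orderBy("degree", ascending=False).show(10)
    g.inDegrees.orderBy("inDegree", ascending=False).show(10)
    g.outDegrees.orderBy("outDegree", ascending=False).show(10)

    # PageRank as a complementary popularity measure.
    pr = g.pageRank(resetProbability=0.15, maxIter=10)
    pr.vertices.orderBy("pagerank", ascending=False).show(10)

    # Motif search: one user (a) retweeting two different users (b and c).
    motifs = g.find("(a)-[e1]->(b); (a)-[e2]->(c)").filter("b.id != c.id")
    motifs.show(10)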
Scalability
In the way that I wrote my code, I have not fully exploited the scalability of Lambda functions with PyWren, and scaling up the project can be achieved easily. For instance, in the first part, I only used 8 queries to identify misinformation. Given more time, we could perhaps find a more precise way to query misinformation with keyword search, and we could easily include more queries to scale up the number of tweets returned. Another quick way to scale up the project is to raise the maximum number of tweets returned by each scraper function; in my current code, I set this parameter to 200 tweets, and it can be increased to return more results as needed. My current code ends up with 1438 nodes and 719 edges in total. At this size, it might still be possible to run the network analysis on a local machine (i.e., without PySpark). However, running graph analysis at larger scale would become increasingly difficult as we collect more data, and GraphFrames makes it much faster to run standard graph algorithms and queries on large data. As such, GraphFrames is the better solution for large-scale network analysis as the project grows.
Limitations
One bottleneck that might cause inefficiency in my current code is that I was not able to write the two CSV files directly to S3 from within PyWren. What I ended up doing was pulling the results from S3 back into local memory, writing the data into CSV files on my local machine, and then uploading those files to an S3 bucket for the PySpark network analysis. Since a variety of objects can be uploaded to S3 buckets, it should be possible to write the CSVs directly in the cloud and avoid the round trip through the local machine, as sketched below.
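One possible way to remove this round trip, offered as a sketch rather than tested code, is to serialize the CSV in memory and upload it straight to S3 with boto3; the bucket and key names below are placeholders, and results refers to the PyWren output from the earlier sketch.

    import csv
    import io

    import boto3

    def write_csv_to_s3(rows, header, bucket, key):
        """Serialize rows to CSV in memory and upload directly to S3, skipping local disk."""
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(header)
        writer.writerows(rows)
        boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                      Body=buffer.getvalue().encode("utf-8"))

    # Example usage with placeholder bucket/keys; flattens the per-query (nodes, edges) results.
    all_nodes = [node for nodes, _ in results for node in nodes]
    all_edges = [edge for _, edges in results for edge in edges]
    write_csv_to_s3(all_nodes, ["id", "type"], "my-bucket", "nodes.csv")
    write_csv_to_s3(all_edges, ["src", "dst"], "my-bucket", "edges.csv")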
Access the Github repository for this project here.