General Info
Individual Erasmus+ traineeship project at Eurecat - Technology Centre of Catalonia in Barcelona (ES).
- Repo: hydrateTweet-crunch
- Language: Python
- Start date: 2021/03/11
- End date: 2021/07/16
Description & Objectives
This repository was created to hydrate, filter (or crunch), and analyse tweets sent during the pandemic (the tweet IDs are available at echen102/COVID-19-TweetIDs), in order to study the emotional impact of COVID-19 on people.
The code supports several operations, including:
- sort tweets by language and month, while keeping only relevant information;
- analyse users’ emotions and standardise the results obtained;
- analyse unique users and their number of tweets across the entire dataset;
- download users’ profile images;
- infer users’ gender and age, and whether their Twitter profile belongs to an organization;
- filter inferred users based on certain constraints;
- aggregate tweets of the same language per week (instead of day);
- extract locations and send requests to perform geocoding;
- group users based on their location to perform further analysis.
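As an illustration of the weekly aggregation step listed above, here is a minimal sketch using only the standard library. It is not the repository's actual implementation: the function name `aggregate_per_week` and the `(date, language, count)` input shape are assumptions made for this example.

```python
from collections import Counter
from datetime import date


def aggregate_per_week(daily_counts):
    """Sum per-day tweet counts into ISO-week buckets for each language.

    daily_counts: iterable of (iso_date_string, language, count) tuples
    (a hypothetical input shape, chosen for illustration).
    Returns a Counter keyed by (iso_year, iso_week, language).
    """
    weekly = Counter()
    for day, lang, count in daily_counts:
        # isocalendar() yields (ISO year, ISO week number, ISO weekday)
        iso_year, iso_week, _ = date.fromisoformat(day).isocalendar()
        weekly[(iso_year, iso_week, lang)] += count
    return weekly


sample = [
    ("2021-03-01", "en", 10),  # Monday of ISO week 9
    ("2021-03-07", "en", 5),   # Sunday of the same ISO week
    ("2021-03-08", "en", 7),   # Monday of ISO week 10
    ("2021-03-08", "it", 3),
]
weekly = aggregate_per_week(sample)
```

Keying on the ISO year as well as the week number keeps weeks that straddle a year boundary from being merged with the wrong year.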
Results
- Retrieved and managed over 6 TB of data from Twitter (using twarc), amounting to more than 1 billion tweets.
- Inferred demographic and geographical data for 6 million users using state-of-the-art libraries (m3inference, geopy) and OpenStreetMap.
- Contributed on GitHub to m3inference, an open-source state-of-the-art library, by improving exception handling of corner cases (pull request).
- Developed a pipeline to efficiently extract relevant data about the users, perform lexicon-based emotion analysis, and save useful statistics.
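To give a rough idea of what lexicon-based emotion analysis means, the sketch below counts how many words of a text map to each emotion. The lexicon here is a toy dictionary invented for this example, and `emotion_counts` is a hypothetical helper; the lexicon and scoring used in the project may differ.

```python
import re
from collections import Counter

# Toy lexicon invented for illustration only; a real run would load a
# published emotion lexicon instead.
TOY_LEXICON = {
    "afraid": "fear",
    "worried": "fear",
    "happy": "joy",
    "glad": "joy",
    "angry": "anger",
}


def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count the emotion labels of all lexicon words found in the text."""
    # Lowercase and split into simple word tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(lexicon[t] for t in tokens if t in lexicon)


counts = emotion_counts("I am worried and afraid, but glad to be home.")
```

Here `counts` records two words for fear and one for joy. Dividing such counts by the number of tokens would be one simple way to make texts of different lengths comparable.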
Final Mark
This project was not graded. My work was supervised by Cristian Consonni and David Laniado, and the results were presented in my Bachelor’s thesis and discussed in my final dissertation at the University of Trento.