|
|
|
|
|
|
|
|
|
|
|
\subsection{What is Digital Ethnography?}
|
|
|
|
|
|
|
|
Digital Ethnography is the study of cultures and interactions in various online spaces, such as forums, posts and video comments. The goal is not only to describe high-level statistics such as the number of posts and posts per day, but also to analyse people's behaviour at an interactional and cultural level, delving into common phrases, interaction patterns, and recurring topics and entities.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
There are multiple methods of carrying out digital ethnography, such as online participant observation through automated or manual means, digital interviews via text or video, or tracing digital footprints \cite{dominguez2007virtual}.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Compared to traditional ethnography, digital ethnography is usually faster and more cost-effective, owing to the availability of large swathes of data across social media sites such as Reddit, YouTube, and Facebook, and the lack of any need to travel. Traditional ethnography, by contrast, often relied on in-person interviews and in-person observation of communities.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This distinction between active and passive participation (passive users are often referred to as "lurkers") is important in digital ethnography, because looking only at posts and comments can give a misleading picture of how large or engaged a community actually is.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This uneven distribution of participation is well documented in the literature. The "90-9-1" principle describes a consistent pattern across many online communities, whereby approximately 90\% of users only consume content, 9\% contribute occasionally, and just 1\% are responsible for the vast majority of content creation \cite{sun2014lurkers}.
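As a rough illustration, the 90-9-1 split can be estimated directly from post metadata. The sketch below is a minimal example, not part of the system itself; the function name is illustrative, and using a subscriber count as a proxy for the full audience size is an assumption.

\begin{verbatim}
from collections import Counter

def participation_tiers(post_authors, audience_size):
    # post_authors: one entry per post/comment.
    # audience_size: estimate of the whole community, lurkers
    # included (e.g. subscriber count -- itself only a proxy).
    counts = Counter(post_authors)
    lurkers = max(audience_size - len(counts), 0)
    top_n = max(int(audience_size * 0.01), 1)
    top_share = (sum(n for _, n in counts.most_common(top_n))
                 / len(post_authors))
    return {"lurker_fraction": lurkers / audience_size,
            "top_1pct_content_share": top_share}
\end{verbatim}

On a toy community of 100 users where one author wrote 90 of 100 posts, this reports a top-1\% content share of 0.9, matching the pattern described above.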
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Temporal Activity Patterns}
|
|
|
|
|
|
|
|
Looking at when a community is active can reveal a great deal about its nature and membership. A subreddit that peaks at 2am UTC might have a mostly American userbase, while one that is consistently active across all hours could suggest a more globally distributed community. Beyond timezones, temporal patterns can also capture how a community responds to external events: a sudden spike in posting activity often corresponds to something newsworthy happening that is relevant to the community.
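A simple way to surface such patterns is to bucket post timestamps by UTC hour of day. A minimal sketch (the function name is illustrative; the system's actual aggregation may differ):

\begin{verbatim}
from collections import Counter
from datetime import datetime, timezone

def hourly_activity(unix_timestamps):
    # Histogram of posts per UTC hour: a sharp peak hints at a
    # dominant timezone, a flat profile at a global userbase.
    hours = (datetime.fromtimestamp(t, tz=timezone.utc).hour
             for t in unix_timestamps)
    counts = Counter(hours)
    return {h: counts.get(h, 0) for h in range(24)}
\end{verbatim}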
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
One key limitation is that the models will likely find it difficult to interpret context-dependent language. Online communities often use sarcasm, irony and culturally specific references, all of which are challenging for NLP models to interpret correctly. For example, a sarcastic comment might be classified as positive, despite conveying negativity.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Emojis and emoticons are a common feature of online communication and can carry significant emotional meaning. However, NLP models may struggle to accurately interpret the sentiment conveyed by emojis, especially when they are used in combination with text or in a sarcastic manner \cite{ahmad2024sentiment}.
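One partial mitigation, sketched below under the assumption of a text-only downstream model, is to replace common emojis with textual cues before classification. The mapping here is purely illustrative and deliberately tiny, and it does nothing for sarcastic emoji use:

\begin{verbatim}
# Illustrative, hand-picked mapping -- a real system would use a
# full emoji lexicon rather than this toy table.
EMOJI_CUES = {
    "\U0001F602": " [laughing] ",   # face with tears of joy
    "\U0001F621": " [angry] ",      # pouting face
    "\U0001F644": " [eye-roll] ",   # face with rolling eyes
}

def textualise_emojis(text):
    for char, cue in EMOJI_CUES.items():
        text = text.replace(char, cue)
    return text
\end{verbatim}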
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In addition, the simplification of complex human interactions and emotions into discrete categories like "happy" or "sad" will more than likely overlook some nuance and ambiguity, even if the model is not inherently "wrong". As a result, the outputs of NLP models should be interpreted as indicative patterns rather than definitive representations of user meaning.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Computational Constraints}
|
|
|
|
|
|
|
|
|
|
|
|
The NLP module is responsible for adding new columns to the dataset that contain the NLP outputs. Three types of NLP analysis are performed: emotion classification, topic classification, and named entity recognition. The module is instantiated once per dataset during the enrichment phase and runs on the provided Pandas DataFrame.
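In outline, the enrichment step looks roughly like the following. This is a simplified sketch: the classifier here is a stub standing in for the real transformer pipelines, and only the \texttt{emotion\_} column convention is taken from the system.

\begin{verbatim}
import pandas as pd

def classify_emotion(text):
    # Stub standing in for the fine-tuned transformer model.
    return {"joy": 0.1, "sadness": 0.1, "anger": 0.1,
            "fear": 0.1, "disgust": 0.6}

def enrich(df):
    # One new column per emotion class, e.g. emotion_joy.
    scores = df["content"].apply(classify_emotion).apply(pd.Series)
    return df.join(scores.add_prefix("emotion_"))
\end{verbatim}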
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Emotion Classification}
|
|
|
|
|
|
|
|
\label{sec:emotion-classification}
|
|
|
|
|
|
|
|
For emotion classification, a pre-trained VADER sentiment analysis model was used initially, which provides a very simple way to classify text as positive, negative, or neutral. Ethnographic analysis, however, requires a richer emotional model that can capture more nuance, so the VADER model was later replaced with a fine-tuned transformer-based model that classifies text into a wider range of emotions.
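For reference, VADER reduces a text to a single compound score in $[-1, 1]$ (via \texttt{SentimentIntensityAnalyzer.polarity\_scores}), and the conventional mapping to the three coarse classes looks like the sketch below. The $\pm$0.05 cut-off is VADER's documented convention; the function name is illustrative.

\begin{verbatim}
def vader_label(compound, threshold=0.05):
    # Map VADER's compound score (-1..1) onto the three coarse
    # classes used before the transformer model replaced it.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
\end{verbatim}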
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
GoEmotions \cite{demszky2020goemotions} was considered as a potential model for emotional classification, as it is extremely nuanced and can capture a wide range of emotions. However, it has 27 emotion classes, which was too many for the purposes of this project: such a large number of classes would have been difficult to visualise and analyse.
|
|
|
|
|
|
|
|
|
|
|
|
Several suggestions were made for improving the system by the participants, which are discussed in more detail below.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\paragraph{Deeper Emotional Analysis}
|
|
|
|
|
|
|
|
The current five-emotion model was seen as a good starting point, but ultimately lacking in nuance. They noted that out of the five existing emotions (joy, sadness, anger, fear, disgust), four of them were negative emotions, and there was a lack of nuanced positive emotions such as hope, pride, relief, etc. In the beginning stages of the project, the GoEmotions model \cite{demszky2020goemotions}, which has 27 emotion classes, was considered but ultimately rejected due to timeline constraints and complexity. However given the feedback, it's worth reconsidering for a much more nuanced emotional analysis.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\paragraph{Improved Corpus Explorer}
|
|
|
|
|
|
|
|
The corpus explorer was seen as a useful feature; however, it was noted that it could be improved in a few ways:
|
|
|
|
|
|
|
|
|
|
|
|
\paragraph{Emotion Colour Grading}
|
|
|
|
|
|
|
|
Currently, in the corpus explorer and other areas where emotions are visualised, the posts are not coloured at all. It was suggested that some form of colour grading based on emotion would be useful, so that joyful posts might be yellow and angry posts red. This would allow users to quickly scan through the posts and get a sense of the emotional tone of the dataset. If the GoEmotions model is adopted, however, this might not be feasible, as its 27 emotion classes would require a considerably more complex colour scheme.
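A sketch of what such a grading could look like for the current five-class model; the palette and function name are purely illustrative:

\begin{verbatim}
# Illustrative palette -- the actual colours are a design choice.
EMOTION_COLOURS = {
    "joy": "#f5c518",      # yellow
    "sadness": "#4a7ebb",  # blue
    "anger": "#d64545",    # red
    "fear": "#8e5bb5",     # purple
    "disgust": "#5a9e4b",  # green
}

def post_colour(emotion_scores, default="#cccccc"):
    # Colour a post by its dominant emotion; grey if no scores.
    if not emotion_scores:
        return default
    dominant = max(emotion_scores, key=emotion_scores.get)
    return EMOTION_COLOURS.get(dominant, default)
\end{verbatim}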
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\paragraph{Popularity Indicators}
|
|
|
|
|
|
|
|
The dashboard currently provides no indication of how much engagement a post received. Adding reply counts or upvote scores alongside each post would allow researchers to distinguish between posts that generated significant discussion and those that did not, which is relevant for participation inequality analysis.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{NLP Accuracy}
|
|
|
|
|
|
|
|
The accuracy of the NLP models used in the system was evaluated using a small, manually annotated dataset: 50 random posts were taken from the Cork dataset and manually annotated with a topic and an emotion, and these annotations were then compared against the models' predictions. Keep in mind that this is a small sample size tied to a specific dataset with specific pre-defined topics, so it may not be representative of the models' overall accuracy across different datasets and topics.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To do this, the following command was run against the Docker database container to extract 50 random posts from the Cork dataset:
|
|
|
|
|
|
|
|
\begin{verbatim}
docker exec crosspost_db psql -U postgres -d mydatabase -x -c
"SELECT
    id,
    title,
    content,
    topic,
    topic_confidence,
    emotion_joy,
    emotion_sadness,
    emotion_anger,
    emotion_fear,
    emotion_disgust
FROM events
WHERE dataset_id = 1
ORDER BY RANDOM()
LIMIT 50;" > output.txt
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The \texttt{WHERE dataset\_id = 1} clause restricts the selection to posts from the Cork dataset. The NLP outputs (topic and emotion predictions) were then stripped from the output file using \texttt{grep}, so that the manual annotation could be carried out without being biased by the model's predictions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Then the output was manually annotated with topic and emotion labels, using the same topic list and emotion classes as the model. The model's predictions were compared against the manual annotations to calculate the accuracy of both topic classification and emotion classification, where accuracy is the number of correct predictions divided by the total number of predictions.
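The comparison itself is a one-line calculation; as a minimal sketch (the function name is illustrative):

\begin{verbatim}
def accuracy(predictions, annotations):
    # Fraction of posts where the model's label matches the
    # manual annotation.
    assert len(predictions) == len(annotations)
    correct = sum(p == a for p, a in zip(predictions, annotations))
    return correct / len(predictions)
\end{verbatim}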
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The results of this evaluation are as follows:
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
|
|
|
\item \textbf{Dominant Emotional Classification Accuracy}: 68\% (34 out of 50 posts were correctly classified with the dominant emotion).
|
|
|
|
|
|
|
|
\item \textbf{Topic Classification Accuracy}: 64\% (32 out of 50 posts were correctly classified with the correct topic).
|
|
|
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Emotional Classification Limitations}
|
|
|
|
|
|
|
|
The emotional classification was notably limited in some regards. The decision described in Section \ref{sec:emotion-classification} to remove the "neutral" and "surprise" emotion classes from the emotional analysis was made after observing that these two classes were dominating the dataset. However, removing the neutral class led to some posts being misclassified with another emotion that may not have been accurate. For example, take the content of the eleventh post in the output file (Record 11):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{quote}
|
|
|
|
|
|
|
|
\textit{26+7=1}
|
|
|
|
|
|
|
|
\end{quote}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This post was classified as "anger" with a confidence of 0.22; however, the post is arguably neutral, making the model's classification inaccurate. Because of this, the neutral class was reintegrated into the emotional analysis.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In addition, some confusion arose between the "disgust" and "anger" emotion classes, as they can be quite similar in some contexts. For instance, take the content of the third post in the output file:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{quote}
|
|
|
|
|
|
|
|
\textit{That's exactly what Ruanair do in Cork, they never introduce a new route, they come in on existing routes, eliminate the competition and then either close the route or move it. They have done this on two polish routes, Cork to Dublin, Newcastle etc.}
|
|
|
|
|
|
|
|
\end{quote}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The model classified this post as "disgust" with a confidence of 0.35 and "anger" with a confidence of 0.38. This is a borderline case, and even two different human annotators could disagree on whether this post is more "disgust" or "anger", so it's understandable that the model would struggle with this. This highlights the limitations of the emotional classification, as emotions can be quite nuanced and subjective, and a model may not always capture the true emotional tone of a post accurately.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Topical Classification Limitations}
|
|
|
|
|
|
|
|
The topic classification also had some limitations, particularly with posts that contained multiple topics. For example, take the content of the 26th post in the output file:
|
|
|
|
|
|
|
|
\begin{quote}
|
|
|
|
|
|
|
|
\textit{We're staying in the city centre so walkable to most places. I checked electrics website earlier. Looked nice. Ended up booking Joules for Thursday then for Friday, we will try a new place called "conways yard" that was recommended here. In hoping to watch the England match there so I'd imagine if have to get there well before kick off (8pm) to get a seat bear a TV.}
|
|
|
|
|
|
|
|
\end{quote}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This post was classified with the topic "Rugby" with a topic confidence of 0.47, which is quite high by most standards. However, it could arguably be classified as "City Center" or even "Pubs", given the mentions of the city centre and the pub "Conway's Yard". This highlights a limitation of the topic classification: it can struggle with posts that contain multiple topics, as it is only able to assign one dominant topic to each post.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To address this, it might be beneficial to make the topic classification more similar to the emotional classification. That is, instead of assigning one dominant topic to each post, the model could assign a confidence score to each topic class, allowing posts with high confidence scores for several topics to carry multiple topic labels.
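A minimal sketch of that change, assuming per-topic confidence scores are available (the function name and the 0.3 threshold are illustrative, not part of the system):

\begin{verbatim}
def assign_topics(topic_scores, threshold=0.3):
    # Keep every topic whose confidence clears the threshold,
    # falling back to the single best topic otherwise.
    kept = [t for t, s in topic_scores.items() if s >= threshold]
    return kept or [max(topic_scores, key=topic_scores.get)]
\end{verbatim}

Under this scheme, the Record 26 example above could carry both "Rugby" and "Pubs", provided both clear the threshold, rather than "Rugby" alone.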
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In addition, ensuring a well-curated topic list that is specific to the dataset can help improve the accuracy of the topic classification, as it reduces the chances of posts being misclassified into irrelevant topics and reduces possible overlap between topics.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Performance Benchmarks}
|
|
|
|
|
|
|
|
The performance of the system was benchmarked in terms of the time taken for each stage of the data pipeline, covering both fetching and NLP processing. The benchmarks were measured across several configurations, such as different dataset sizes, different numbers of fetch sources, and pre-gathered versus auto-fetched data.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{NLP Performance}
|
|
|
|
|
|
|
|
\subsubsection{Auto-fetching Performance}
|
|
|
|
|
|
|
|
\subsection{Limitations}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\newpage
|
|
|
|
|