This project presents the design and implementation of a web-based analytics engine for the exploration and analysis of online discussion data. Built using \textbf{Flask and Pandas}, and supplemented with \textbf{Natural Language Processing} (NLP) techniques, the system provides an API for extracting structural, temporal, linguistic, and emotional insights from social media posts. A React-based frontend delivers interactive visualisations and user controls, while the backend implements an analytical pipeline covering data parsing, manipulation, and analysis.
\vspace{0.5cm}
Beyond its technical objectives, the system is conceptually grounded in \textbf{digital ethnography} and computational social science. Traditional ethnography is the practice of studying an individual or group culture from the point of view of the subject of the study. Digital ethnography seeks to understand how social relations, topics and norms are constructed in online spaces.
\subsection{Motivation}
There are many beneficiaries of a digital ethnography analytic system: social scientists gain a deeper understanding of contemporary culture and online communities; businesses and marketers can better understand consumer behaviour and online engagement; educators and designers can improve digital learning environments and user experiences; and policymakers can make informed decisions regarding digital platforms, online safety, and community regulation.
\subsection{Goals \& Objectives}
\begin{itemize}
\item\textbf{Collect data ethically}: enable users to link or upload text and interaction data (messages, comments, etc.) from specified online communities. Potentially, an automated method for importing data (using APIs or scraping techniques) could be included as well.
\item\textbf{Organise content}: Store gathered material in a structured database with tagging for themes, dates, and sources.
\item\textbf{Analyse patterns}: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
\item\textbf{Visualise insights}: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
\end{itemize}
\subsection{The Cork Dataset}
A defining feature of this project is its focus on a geographically grounded dataset centred on \textbf{Cork, Ireland}. The system analyses publicly available discussions relating to Cork drawn from multiple online platforms:
\begin{itemize}
\item The \textbf{r/Cork} subreddit
\item The \textbf{r/Ireland} subreddit using a Cork-specific search filter
\item\textbf{YouTube} videos retrieved using Cork-related search queries
\item The \textbf{Boards.ie Cork section}
\end{itemize}
\newpage
\section{Background}
\subsection{What is Digital Ethnography?}
\textit{Digital Ethnography} is the study of cultures and interactions in various online spaces, such as forums, posts and video comments. The goal is not only to describe high-level statistics such as the number of posts and posts per day, but also to analyse people's behaviour at an interactional and cultural level, delving into common phrases, interaction patterns, and recurring topics and entities.
There are multiple methods of carrying out digital ethnography, such as online participant observation through automated or manual methods, digital interviews via text or video, or tracing digital footprints.
Compared to traditional ethnography, digital ethnography is usually faster and more cost-effective due to the availability of large swathes of data across social media sites such as Reddit, YouTube, and Facebook, and the lack of any need to travel. Traditional ethnography often relied on in-person interviews and in-person observation of communities. \cite{coleman2010ethnographic}
\subsubsection{Traditional Ethnography}
Ethnography originated in the late nineteenth and early twentieth centuries as a method for understanding cultures through long-term fieldwork. The goal was not just to describe behaviour, but to show how people made sense of their world. Over time, ethnography grew beyond anthropology into sociology, media studies, education, and human-computer interaction, becoming a broadly used qualitative research approach. Traditional ethnography was closely tied to physical locations: villages, workplaces or towns. However, as communication technologies developed and social life increasingly took place through technological mediums, it was no longer tied to a physical place.
\subsubsection{Transition to Digital Spaces}
The rise of the internet in the late twentieth century massively changed social interaction. Online forums, emails, SMS and social media platforms became central to human communication, and all types of groups and identities were constructed within them. As a result, ethnographic methods were adapted to study these emerging digital environments. Early work in this area was referred to as "virtual ethnography" or "digital ethnography", as online spaces began to mix and intertwine with traditional cultural spaces.
There are new challenges to overcome in comparison to traditional ethnography. The field is distributed across platforms, devices and online-offline interactions. For example, a digital ethnographer studying influencer culture might examine Instagram posts, comment sections, private messages, algorithms, and also conduct interviews or observe offline events. In some ways, however, digital ethnography is easier than traditional ethnography: cost is reduced, since there is no need to travel or spend long periods in the field; it is less invasive, as there is no need to interact with subjects directly; and there is a much larger amount of data available for analysis. \cite{cook2023ethnography}
\subsection{Online Communities}
There are many different types of online communities, often structured in various ways, with many different types of users, norms and power dynamics. These communities can range from large-scale social networking platforms and discussion forums to niche interest groups. Each type of community fosters different forms of interaction, participation, and identity construction.
Participation within these communities is usually not evenly distributed. The majority of users are passive consumers (lurkers) \cite{sun2014lurkers}, a smaller percentage contribute occasionally, and a very small core group produces most of the content. This uneven contribution structure has significant implications for digital ethnography, as visible discourse may disproportionately reflect the perspectives of highly active members rather than the broader community. This is particularly evident in some reputation-based systems such as Reddit, which allows for the opinions of a few to rise above the rest.
Examples of digital spaces include:
\begin{itemize}
\item\textbf{Social media platforms} (e.g., Facebook, Twitter, Instagram) where users create profiles, share content, and interact with others.
\item\textbf{Online forums and communities} (e.g., Reddit, Boards.ie) where users engage in threaded discussions around specific topics or interests.
\item\textbf{Video platforms} (e.g., YouTube) where users share and comment on video content, often fostering communities around specific channels or topics.
\item\textbf{Messaging apps} (e.g., WhatsApp, Discord) where users engage in private or group conversations, often with a more informal and intimate tone.
\end{itemize}
\subsection{Digital Ethnography Metrics}
This section describes common terms and metrics used to measure and quantify online communities in digital ethnography.
\subsubsection{Sentiment Analysis}
Sentiment Analysis involves capturing the emotions associated with a specific post, topic or entity. This type of analysis can be as simple as classifying a post as "positive" or "negative", or as detailed as classifying it into a set of pre-existing emotions such as anger, joy or sadness.
\subsubsection{Active vs Passive Participation}
\label{sec:passive_participation}
Not everyone in an online community participates in the same way. Some users post regularly and leave comments while others might simply read content without ever contributing anything themselves. Some might only contribute occasionally.
This distinction between active and passive participation (passive users are often referred to as "lurkers") is important in digital ethnography, because looking only at posts and comments can give a misleading picture of how large or engaged a community actually is.
This uneven distribution of participation is well documented in the literature. The "90-9-1" principle describes a consistent pattern across many online communities, whereby approximately 90\% of users only consume content, 9\% contribute occasionally, and just 1\% are responsible for the vast majority of content creation \cite{sun2014lurkers}.
\subsubsection{Temporal Activity Patterns}
Looking at when a community is active can reveal quite a lot about its nature and membership. A subreddit that peaks at 2am UTC might have a mostly American userbase, while one that is consistently active across all hours could suggest a more globally distributed community. Beyond timezones, temporal patterns can also capture how a community responds to external events: a sudden spike in posting activity often corresponds to something newsworthy happening that is relevant to the community.
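As a minimal illustration of how such a pattern can be computed, the weekday-by-hour activity counts described above can be derived from event timestamps using Pandas (here \texttt{events\_df} is a hypothetical DataFrame of events with a \texttt{timestamp} column):
\begin{Verbatim}[breaklines=true]
import pandas as pd

# Hypothetical events DataFrame with a "timestamp" column of UTC datetimes.
timestamps = pd.to_datetime(events_df["timestamp"], utc=True)

# Rows are weekdays, columns are hours of the day, values are event counts.
heatmap = (
    events_df.groupby([timestamps.dt.day_name(), timestamps.dt.hour])
    .size()
    .unstack(fill_value=0)
)
\end{Verbatim}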
\subsection{Natural Language Processing}
Digital ethnography traditionally relied on manual reading of texts and interviews.
NLP techniques can be used to automatically process and analyse large volumes of text, applying ethnographic methods at scale. For example, NLP can be used to identify common themes and topics in a subreddit, track how these themes evolve over time, and even detect the emotional tone of discussions. This allows researchers to gain insights into the dynamics of online communities that would be impossible to achieve through manual analysis alone.
\subsubsection{Sentiment Analysis}
\textbf{Sentiment Analysis} involves determining the emotional tone behind a piece of text. It is commonly used to classify text as positive, negative, or neutral. More advanced sentiment analysis models can detect nuanced emotions, such as frustration, satisfaction, or sarcasm, although accurately identifying these emotions remains a challenge \cite{giuffre2026sentiment}. For ethnographic analysis, sentiment analysis can provide insights into the emotional dynamics of a community, such as how users feel about certain topics or how the overall mood of discussions changes over time.
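As a minimal sketch of the technique, the HuggingFace \texttt{transformers} library provides a ready-made sentiment pipeline (the example sentence is illustrative):
\begin{Verbatim}[breaklines=true]
from transformers import pipeline

# Downloads a default sentiment model on first use.
classifier = pipeline("sentiment-analysis")

print(classifier("The new bus routes in Cork are actually great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
\end{Verbatim}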
\subsubsection{Named Entity Recognition}
\textbf{Named Entity Recognition (NER)} is the process of identifying and classifying key entities within a text into predefined categories like names of people, organisations, locations, or dates. NER is essential for structuring unstructured text data and is often used in information extraction, search engines, and question-answering systems. Despite its usefulness, NER can struggle with ambiguous entities or context-dependent meanings.
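A minimal sketch using the \texttt{dslim/bert-base-NER} model that the implementation chapter later adopts; the input sentence is illustrative:
\begin{Verbatim}[breaklines=true]
from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into whole entities.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("UCC is hosting a festival on Patrick Street in Cork."):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 2))
\end{Verbatim}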
\subsubsection{Topic Modelling}
\textbf{Topic Modelling} is a technique for discovering the abstract themes that occur across a collection of texts. This method is often used to organise large amounts of unstructured data, such as news articles or social media posts.
\subsubsection{Stop Words}
\textbf{Stop Words} are common words that are often filtered out in NLP tasks because they carry little meaningful information. Examples of stop words include "the", "is", "in", "and", etc. Removing stop words can help improve the performance of NLP models by reducing noise and focusing on more informative words. However, the choice of stop words can vary depending on the context and the specific task at hand.
For example, in a Cork-specific dataset, words like "ah", or "grand" might be considered stop words, as they are commonly used in everyday speech but do not carry significant meaning for analysis. \cite{mungalpara2022stemming}
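A minimal sketch of stop word filtering with NLTK, extended with the kind of domain-specific words mentioned above (the extra words are illustrative):
\begin{Verbatim}[breaklines=true]
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Standard English stop words plus illustrative domain-specific additions.
stop_words = set(stopwords.words("english"))
stop_words.update({"ah", "grand", "www", "http", "https"})

tokens = "ah sure the buses in cork are grand like".split()
print([t for t in tokens if t not in stop_words])
# ['sure', 'buses', 'cork', 'like']
\end{Verbatim}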
\subsection{Limits of Computational Analysis}
While computational methods enable large-scale observation and analysis of online communities, several limitations must be acknowledged. Many of these stem from the NLP techniques themselves and from the practical boundaries of computational resources.
NLP models will be central to many aspects of the virtual ethnography, such as emotional and topic classification. While these models are powerful and have shown strong results in many areas, they are imperfect and may produce inaccurate or misleading results.
One key limitation is that the models will likely find it difficult to interpret context-dependent language. Online communities often use sarcasm, irony or culturally specific references, all of which are challenging for NLP models to interpret correctly. For example, a sarcastic comment might be incorrectly classified as positive, despite conveying negativity.
Emojis and emoticons are a common feature of online communication and can carry significant emotional meaning. However, NLP models may struggle to accurately interpret the sentiment conveyed by emojis, especially when they are used in combination with text or in a sarcastic manner. \cite{ahmad2024sentiment}
In addition, because data is collected across multiple platforms, it must be normalised into a single unified format before analysis, which risks flattening away platform-specific structure and context.
\newpage
\section{Analysis}
\subsection{Goals \& Objectives}
The objective of this project is to provide a tool that assists social scientists, digital ethnographers, and researchers in observing and interpreting online communities and the interactions within them. Rather than replacing digital ethnography or its related fields, this tool aims to aid researchers in analysing communities.
Overall, while NLP provides powerful tools for analysing large datasets, its limitations must be kept in mind when interpreting analytical results.
\subsubsection{Data Normalisation}
Different social media platforms produce data in many different formats. For example, Reddit data has a very different reply structure to a forum-based platform like Boards.ie, where there are no nested replies. Therefore, a core design requirement of the system is to normalise all incoming data into a single unified internal data model. This allows the same analytical functions to be applied across all data sources, regardless of their original structure.
Both comments and posts represent user-generated content that contributes to the community discourse. Therefore, the system will normalise all posts and comments into a single "event" data model, which will allow the same analytical functions to be applied uniformly across all content. This also simplifies the data model and reduces the complexity of the analytical pipeline, since there is no need to maintain separate processing paths for posts and comments.
Though separate processing paths are not needed, the system will still retain metadata that indicates whether an event was originally a post or a comment, as well as any relevant structural information (e.g., parent-child relationships in Reddit threads).
A further practical consideration is performance: running NLP inference over tens of thousands of posts is computationally expensive and can take a long time on large datasets. To mitigate this, the system will:
\begin{itemize}
\item Utilise GPU acceleration where available for NLP inference.
\item Pre-compute some analytical results during data ingestion to speed up subsequent queries.
\item Store NLP outputs in the database to avoid redundant processing.
\item Implement asynchronous processing for long-running tasks.
\end{itemize}
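A minimal sketch of the model-caching point, assuming the HuggingFace \texttt{transformers} library; \texttt{get\_pipeline} is an illustrative helper, not the system's actual API:
\begin{Verbatim}[breaklines=true]
from functools import lru_cache
from transformers import pipeline

@lru_cache(maxsize=None)
def get_pipeline(task, model_name):
    # The first call loads the model; subsequent calls reuse the instance.
    return pipeline(task, model=model_name)

ner = get_pipeline("ner", "dslim/bert-base-NER")
\end{Verbatim}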
To ensure that data is collected ethically and responsibly, the system will:
\begin{itemize}
\item Provide user-agent headers that identify the system and its purposes
\item Allow users the option to upload their own datasets instead of automated collection.
\item For websites without an API, the \texttt{robots.txt} file will be examined to ensure compliance with platform guidelines.
\item Data volume limits of up to 1000 posts per source will be enforced server-side to prevent excessive data collection.
\end{itemize}
Some platforms provide APIs that allow for easy and ethical data collection, such as YouTube and Reddit. These APIs have clear guidelines and rate limits that the system will adhere to. \cite{chugani2025ethicalscraping}
\paragraph{Reddit (API)}
Reddit provides a public API that allows for the retrieval of posts, comments, and metadata from subreddits. The system will use the official Reddit API with proper authentication via OAuth2 and access tokens.
In November 2025, Reddit introduced a new approval process for API access, which requires developers to apply for access and specify their intended use case. While the public unauthenticated endpoints are still accessible, they have far stricter rate limits (100 requests every 10 minutes) compared to authenticated access (100 requests per minute). Therefore, the system shall allow for authenticated access to the Reddit API to speed up data retrieval.
Unauthenticated access will still be available as a fallback if client credentials are not provided on the backend, but this will massively slow the data retrieval process, and this will still only fetch public posts and comments.
From Reddit, the system will collect posts, comments and all replies to comments, as well as metadata such as the author name and timestamp.
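A minimal sketch of the unauthenticated fallback path, using Reddit's public JSON listing with a descriptive user-agent header (the header text is illustrative; the authenticated OAuth2 flow is not shown):
\begin{Verbatim}[breaklines=true]
import requests

headers = {"User-Agent": "ethnography-engine/0.1 (academic research tool)"}

# Newest posts from r/Cork via the public JSON listing endpoint.
response = requests.get(
    "https://www.reddit.com/r/Cork/new.json",
    params={"limit": 100},
    headers=headers,
    timeout=30,
)
response.raise_for_status()

for child in response.json()["data"]["children"]:
    post = child["data"]
    print(post["author"], post["created_utc"], post["title"])
\end{Verbatim}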
\paragraph{YouTube (API)}
YouTube is supported via the official YouTube Data API v3, provided by Google. The system uses this API to retrieve videos matching Cork-related search queries, along with their comments and metadata.
Authentication is handled through an API key issued via the Google Cloud Console. The API enforces a quota system rather than a traditional rate limit: each project is allocated 10,000 quota units per day by default, with different operations consuming different amounts.
In addition, comment retrieval can be disabled by the video uploader, so the system will handle this case by skipping videos where comments are not accessible.
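A minimal sketch of comment retrieval with the \texttt{google-api-python-client} library, skipping videos whose uploader has disabled comments (the API key placeholder is illustrative):
\begin{Verbatim}[breaklines=true]
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

def fetch_comments(video_id):
    try:
        response = youtube.commentThreads().list(
            part="snippet", videoId=video_id, maxResults=100
        ).execute()
        return response.get("items", [])
    except HttpError as error:
        # The API reports "commentsDisabled" when the uploader disables them.
        if "commentsDisabled" in str(error):
            return []
        raise
\end{Verbatim}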
\subsubsection{Data Storage \& Retention}
All data fetched from social media sites are stored locally in a PostgreSQL database. The system will not share or expose any of this data to third parties beyond the users of this application. Raw API responses are discarded once the relevant information is extracted.
All datasets are associated with one and only one user account, and the users that own a dataset are the only ones able to access it.
The system will not store any personally identifiable information except for what is necessary for the analysis, which includes only usernames and timestamps. The system will not attempt to de-anonymise content creators or link data across platforms.
\subsubsection{User Security}
Standard security practices will be followed to protect user data and prevent unauthorised access. This includes:
\begin{itemize}
\item The hashing of all user passwords and no storage of plaintext passwords.
\item The use of JWTs for session management, with secure signing and an expiration time of 24 hours (a sketch follows this list).
\item Access control on all analysis API endpoints to ensure that end-users can only access their own datasets and results.
\item Parameterised queries for all database interactions to prevent SQL injection attacks.
\end{itemize}
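A minimal sketch of the token handling described above, assuming the PyJWT library (the secret would be loaded from configuration, not hard-coded):
\begin{Verbatim}[breaklines=true]
from datetime import datetime, timedelta, timezone
import jwt  # PyJWT

SECRET_KEY = "change-me"  # illustrative placeholder

def issue_token(user_id):
    # Signed JWT with the 24-hour expiry described above.
    payload = {
        "sub": str(user_id),
        "exp": datetime.now(timezone.utc) + timedelta(hours=24),
    }
    return jwt.encode(payload, SECRET_KEY, algorithm="HS256")

def verify_token(token):
    # Raises jwt.ExpiredSignatureError or jwt.InvalidTokenError on failure.
    return jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
\end{Verbatim}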
\subsection{Requirements}
The following requirements are derived from the backend architecture, NLP processing pipeline, and the React-based frontend interface.
\subsubsection{Functional Requirements}
\paragraph{Data Ingestion and Preparation}
\begin{itemize}
\item The system shall accept social media data in \texttt{.jsonl} format containing posts and nested comments (an illustrative example follows this list).
\item The system shall validate uploaded files and return structured error responses for invalid formats or malformed data.
\item The system shall normalise posts and comments into a unified event-based dataset.
\item The system shall give the user the option to automatically fetch datasets from social media sites filtered for specific keywords or categories.
\item The system shall provide a loading screen with a progress bar after the dataset is uploaded.
\end{itemize}
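An illustrative example of the expected \texttt{.jsonl} structure, with one post object per line (the field names shown here are indicative only; the unified event fields are defined in the Design chapter):
\begin{Verbatim}[breaklines=true]
{"id": "p1", "title": "Best chipper in Cork?", "content": "Settle an argument for us.", "author": "user_a", "timestamp": "2025-01-14T18:32:00Z", "comments": [{"id": "c1", "author": "user_b", "content": "Jackie Lennox, no contest.", "timestamp": "2025-01-14T19:02:00Z", "replies": []}]}
\end{Verbatim}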
\paragraph{Dataset Management}
\begin{itemize}
\item The system shall utilise Natural Language Processing models to generate average emotions per event.
\item The system shall utilise Natural Language Processing models to classify each event into a topic.
\item The system shall utilise Natural Language Processing models to identify entities in each event.
\item The system shall allow the users to view the raw dataset.
\item The system shall provide detailed endpoints that return calculated statistics grouped into themes.
\end{itemize}
\paragraph{Filtering and Search}
\begin{itemize}
\item The system shall support keyword-based filtering across content, author, and optionally title fields.
\item The system shall support filtering by start and end date ranges.
\item The system shall support filtering by one or more data sources.
\item The system shall allow multiple filters to be applied simultaneously.
\item The system shall return a filtered dataset reflecting all active filters.
\end{itemize}
\paragraph{Ethnographic Analysis}
\begin{itemize}
\item The system shall provide endpoints for structural analysis, temporal analysis, linguistic analysis, and emotional analysis.
\item The system shall compute event frequency per day.
\item The system shall generate weekday--hour heatmap data representing activity distribution.
\item The system shall allow users to define custom topics for topic modelling and analysis.
\item The system shall return outputs that are suitable for visualisation in the frontend.
\end{itemize}
\paragraph{Linguistic Analysis}
\begin{itemize}
\item The system shall compute word frequency statistics excluding standard and domain-specific stopwords.
\item The system shall extract common bi-grams and tri-grams from textual content.
\item The system shall compute lexical diversity metrics for the dataset.
\end{itemize}
\paragraph{Emotional Analysis}
\begin{itemize}
\item The system shall compute average emotional distribution per topic.
\item The system shall compute overall average emotional distribution across the dataset.
\item The system shall determine dominant emotion distributions.
\item The system shall compute emotional distribution grouped by data source.
\end{itemize}
\paragraph{User Analysis}
\begin{itemize}
\item The system shall identify top users based on activity.
\item The system shall compute per-user activity and behavioural metrics.
\end{itemize}
\paragraph{Interaction Analysis}
\begin{itemize}
\item The system shall identify top interaction pairs between users.
\item The system shall generate an interaction graph based on user relationships.
\item The system shall compute conversation concentration metrics.
\end{itemize}
\paragraph{Cultural Analysis}
\begin{itemize}
\item The system shall identify identity-related linguistic markers.
\item The system shall detect stance-related linguistic markers.
\item The system shall compute average emotional expression per detected entity.
\end{itemize}
\paragraph{Frontend}
\begin{itemize}
\item The frontend shall provide interactive visualisations of the analytical outputs, including charts, timelines, and network diagrams.
\end{itemize}
\paragraph{Performance}
\begin{itemize}
\item NLP models shall be cached to prevent redundant loading.
\end{itemize}
\paragraph{Reliability and Robustness}
\begin{itemize}
\item The system shall implement structured exception handling.
\item The system shall return meaningful JSON error responses for invalid requests.
\item The dataset reset functionality shall preserve data integrity.
\end{itemize}
\newpage
\section{Design}
\subsection{System Architecture}
The system will follow a client-server architecture, with a Flask-based backend API and a React-based frontend interface. The backend will handle data processing, NLP analysis, and database interactions, while the frontend will provide an interactive user interface for data exploration and visualisation.
\subsubsection{Data Ingestion}
Originally, only file upload was supported, but the goal of the platform is to also allow automated fetching of data directly from social media platforms.
In addition to social media posts, the system will allow users to upload a list of topics that they want to track in the dataset. This allows the system to generate custom topic analysis based on user-defined topics, which can be more relevant and insightful for specific research questions. For example, a researcher studying discussions around local politics in Cork might upload a list of political parties, politicians, and policy issues as topics to track.
Below is a snippet of what a custom topic list might look like in \texttt{.json} format:
\begin{Verbatim}[breaklines=true]
{
"Public Transport": "buses, bus routes, bus eireann, public transport, late buses, bus delays, trains, commuting without a car, transport infrastructure in Cork",
"Parking": "parking spaces, parking fines, clamping, pay parking, parking permits, finding parking in the city",
"Cycling": "cycling in Cork, bike lanes, cyclists, cycle safety, bikes on roads, cycling infrastructure"
}
\end{Verbatim}
If a custom topic list is not provided by the user, the system will use a pre-defined generalised topic list that is designed to capture common themes across a wide range of online communities.
Each method of ingestion will format the raw data into a standardised structure, where each post will be represented as a "Post" object and each comment will be represented as a "Comment" object.
\subsubsection{Data Normalisation}
After a dataset is ingested, the system will normalise all posts and nested comments into a single unified "event" data model. This means that both posts and comments will be represented as the same type of object, with a common set of fields that capture the relevant information for analysis. The fields in this unified data model will include:
\begin{itemize}
\item\texttt{id} — a unique identifier for the post or comment.
\item\texttt{content} — the text content of the post or comment.
\item\texttt{author} — the username of the content creator.
\item\texttt{timestamp} — the date and time when the content was created.
\item\texttt{source} — the original platform from which the content was retrieved (e.g., Reddit, YouTube, Boards.ie).
\item\texttt{type} — a field indicating whether the event is a "post" or a "comment".
\item\texttt{parent\_id} — for comments, this field will reference the id of the post being commented on.
\item\texttt{reply\_to} — for comments, this field will reference the id of the comment being replied to. If the comment is a direct reply to a post, this field will be null.
\end{itemize}
The decision to normalise posts and comments into a single "event" data model allows the same analytical functions to be applied uniformly across all content, regardless of whether it was originally a post or a comment. This simplifies the data model and reduces the complexity of the analytical pipeline, since there is no need to maintain separate processing paths for posts and comments.
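A minimal sketch of this normalisation step (the \texttt{comments} and \texttt{replies} keys reflect the ingestion structure described above; exact key names are illustrative):
\begin{Verbatim}[breaklines=true]
def normalise(posts, source):
    """Flatten posts and nested comments into unified event dictionaries."""
    events = []
    for post in posts:
        events.append({
            "id": post["id"], "content": post["content"],
            "author": post["author"], "timestamp": post["timestamp"],
            "source": source, "type": "post",
            "parent_id": None, "reply_to": None,
        })
        # Walk the comment tree; parent_id always points at the post.
        stack = [(c, None) for c in post.get("comments", [])]
        while stack:
            comment, reply_to = stack.pop()
            events.append({
                "id": comment["id"], "content": comment["content"],
                "author": comment["author"], "timestamp": comment["timestamp"],
                "source": source, "type": "comment",
                "parent_id": post["id"], "reply_to": reply_to,
            })
            stack.extend((r, comment["id"]) for r in comment.get("replies", []))
    return events
\end{Verbatim}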
NLP processing lets us perform much richer analysis of the dataset, as it provides derived emotional, topical, and entity-level information for each event.
\subsubsection{Data Storage}
The enriched dataset is stored in a PostgreSQL database, with a schema similar to the unified data model defined in the normalisation section, with additional fields for the derived data, NLP outputs, and user ownership. Each dataset is associated with a specific user account, and the system supports multiple datasets per user.
The stored dataset can then be retrieved through the Flask API endpoints for analysis. The API supports filtering by keywords and date ranges, as well as grouping and aggregation for various analytical outputs.
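A minimal sketch of such an endpoint, combining Flask and Pandas (\texttt{load\_events\_for\_user} is a hypothetical helper returning the user's events as a DataFrame with a datetime \texttt{timestamp} column):
\begin{Verbatim}[breaklines=true]
from flask import Flask, jsonify, request
import pandas as pd

app = Flask(__name__)

@app.route("/api/events")
def get_events():
    df = load_events_for_user()  # hypothetical helper
    if keyword := request.args.get("keyword"):
        df = df[df["content"].str.contains(keyword, case=False, na=False)]
    if start := request.args.get("start"):
        df = df[df["timestamp"] >= pd.to_datetime(start)]
    if end := request.args.get("end"):
        df = df[df["timestamp"] <= pd.to_datetime(end)]
    if sources := request.args.getlist("source"):
        df = df[df["source"].isin(sources)]
    return jsonify(df.to_dict(orient="records"))
\end{Verbatim}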
\subsubsection{Linguistic Analysis}
Linguistic analysis allows researchers to understand the language and words used within a community.
In this system, linguistic analysis will include:
\begin{itemize}
\item Word frequency statistics excluding standard and domain-specific stopwords.
\item Common bi-grams and tri-grams from textual content. \cite{mungalpara2022stemming}
\item Lexical diversity metrics for the dataset.
\end{itemize}
The word frequency and n-gram metrics were chosen because they provide insights into the language and phrases commonly used in an online community, which is important for ethnographic analysis and understanding a community fully. Lexical diversity metrics, such as the total number of unique tokens versus the total number of tokens, can show whether a specific culture often repeats phrases (memes, slang, etc.) or tends towards structured, serious discussion without repeating itself.
Outlining a list of stopwords is essential for linguistic analysis, as it filters out common words that carry little analytical value. Stop word lists can be provided by a Python library such as NLTK. In addition to standard stop words, the system also excludes link tokens such as "www", "http", and "https" from the word frequency analysis, as social media users often include links in their posts and comments, and these tokens can become quite common and skew the word frequency results without adding meaningful insight.
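A minimal sketch of these three metrics, assuming a list of event texts (the contents of \texttt{texts} are illustrative):
\begin{Verbatim}[breaklines=true]
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english")) | {"www", "http", "https"}

texts = ["Ah the buses are grand like", "Grand stretch in the evenings"]
tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
content = [t for t in tokens if t not in stop_words]

word_freq = Counter(content).most_common(25)
bigram_freq = Counter(ngrams(content, 2)).most_common(25)
lexical_diversity = len(set(tokens)) / len(tokens)  # unique / total tokens
\end{Verbatim}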
\subsubsection{User Analysis}
User analysis allows researchers to understand the behaviour and activity of individual users within a community. For example, a researcher might want to see who the most active users are in a community, or how different users contribute to the overall emotional tone of the community.
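A minimal sketch of the activity side of this analysis with Pandas, assuming the unified event DataFrame described earlier:
\begin{Verbatim}[breaklines=true]
import pandas as pd

def top_user_activity(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    # Events per user for the n most active users, split by posts vs comments.
    top = df["author"].value_counts().head(n).index
    subset = df[df["author"].isin(top)]
    return subset.groupby(["author", "type"]).size().unstack(fill_value=0)
\end{Verbatim}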
\subsubsection{Interaction Analysis}
Interaction analysis allows researchers to understand how the users of a community interact with one another. In this system, interactional analysis will include:
\begin{itemize}
\item Top interaction pairs between users.
\item An interaction graph based on user relationships.
\item Conversation concentration metrics.
\end{itemize}
For simplicity, an interaction is defined as a reply from one user to another, which can be either a comment replying to a post or a comment replying to another comment. The system will not attempt to capture more complex interactions such as mentions or indirect references between users, as these would require more advanced NLP techniques.
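A minimal sketch of extracting interaction pairs under this definition, using the unified event fields defined earlier:
\begin{Verbatim}[breaklines=true]
from collections import Counter

def interaction_pairs(events, n=10):
    # Map each event id to its author, then count replier -> target pairs.
    author_of = {e["id"]: e["author"] for e in events}
    pairs = Counter()
    for e in events:
        if e["type"] != "comment":
            continue
        # reply_to is null for direct replies to the post itself.
        target = author_of.get(e["reply_to"] or e["parent_id"])
        if target and target != e["author"]:
            pairs[(e["author"], target)] += 1
    return pairs.most_common(n)
\end{Verbatim}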
\textbf{Average reply chain depth} was considered as a metric; however, forum-based social media sites such as Boards.ie do not have a way to reply to comments in the same way that Reddit does, so the concept of "reply chains" does not apply cleanly. One possible solution is to infer reply relationships from explicit user mentions embedded in the content of a post, but this is not a reliable method.
\subsubsection{Emotional Analysis}
Emotional analysis allows researchers to understand the emotional tone of a community, and how it varies across different topics and users.
In this system, emotional analysis will include:
\begin{itemize}
\item Average emotional distribution per topic and across the whole dataset.
\item Dominant emotion distributions.
\item Emotional distribution grouped by data source.
\end{itemize}
It is emphasised that emotional analysis is inaccurate at the individual post level, as the models cannot fully capture the nuance of human interaction and slang. Warnings will be presented to the user in the frontend that AI outputs can be misleading on an individual scale, and that accuracy only increases with more posts. Even then it will not be perfect. Ideally, the models would be accurate enough to capture general emotions on a macro scale.
\subsubsection{Cultural Analysis}
Cultural analysis allows researchers to understand the cultural markers and identity signals that are present in a community, such as slang, memes, and recurring references. While some of this is covered in the linguistic analysis, cultural analysis will focus more on the identity and stance-related markers that are present in the language of the community.
The project was developed using the following tools and libraries:
\begin{itemize}
\item\textbf{Flask} and \textbf{Pandas} for the backend API and data manipulation.
\item\textbf{PostgreSQL} for dataset storage.
\item\textbf{React} for the frontend interface.
\item\textbf{HuggingFace Transformers} for the NLP models and \textbf{NLTK} for stop word lists.
\item\textbf{react-chartjs-2} and \textbf{react-wordcloud} for data visualisation in the frontend.
\end{itemize}
The project was developed using Git for version control, with a branching strategy that included feature branches for new functionality and a main branch for stable code. Regular commits were made to document the development process and conventional commit messages were used to indicate the type of changes made. Occasionally, text bodies were included in commit messages to provide justification for design decisions or to explain changes that couldn't be easily understood from the diff alone.
\subsection{Social Media Connectors}
The first connectors implemented were the Reddit and Boards.ie connectors, as these were the original data sources for the Cork dataset. The YouTube connector was added later to improve the diversity of data sources. In addition, the decision was made to fetch only new posts, and only a fixed number of them, rather than fetching the top posts of all time, which are usually full of memes and jokes that would skew the dataset and not be relevant for ethnographic analysis. Fetching top posts of all time would also skew the temporal analysis, as the most popular posts are often from years ago and would not reflect the current state of the community.
\subsubsection{Boards.ie Connector}
Inspect element was used to poke around the structure of the Boards.ie website and identify the HTML elements containing posts and comments.
As not all comments on a thread are on one page, pagination was implemented by looking for the "Next" button on the page and following the link to the next page of comments until there are no more pages left. This allows for fetching of all comments for a given post, even if they span multiple pages.
A \texttt{ThreadPoolExecutor} was used to fetch posts in parallel, which improved the performance of the connector significantly, as fetching posts sequentially was very slow due to the need to fetch comments for each post, which often spanned multiple pages. There were diminishing returns after a certain number of threads, possibly due to site blocking or internet connection limits. Initially 20 threads were used, but this was later reduced to 5 threads to avoid potential issues with site blocking and to improve ethical considerations around web scraping.
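A minimal sketch of the parallel fetching, where \texttt{fetch\_post\_with\_comments} and \texttt{thread\_urls} are illustrative names:
\begin{Verbatim}[breaklines=true]
from concurrent.futures import ThreadPoolExecutor, as_completed

# Five workers balance speed against politeness towards the site.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_post_with_comments, url)
               for url in thread_urls]
    posts = [future.result() for future in as_completed(futures)]
\end{Verbatim}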
\subsubsection{Connecter Plugin System}
\subsubsection{Connecter Plugin System}
The connector plugin system was implemented to allow new data sources to be added easily in the future: adding one requires simply implementing a new connector class and dropping it into the connectors directory, without modifying any existing code. This was achieved using Python's \texttt{importlib} library, which allows modules to be imported dynamically at runtime.
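A sketch of how such discovery can work with \texttt{importlib} is shown below; the package layout and the assumption that each module exposes a \texttt{Connector} class are illustrative:
\begin{verbatim}
# Sketch of dynamic connector discovery; naming conventions are assumptions.
import importlib
import pkgutil
import connectors  # package with one module per data source

def load_connectors():
    loaded = {}
    for info in pkgutil.iter_modules(connectors.__path__):
        module = importlib.import_module("connectors." + info.name)
        loaded[info.name] = module.Connector()  # assumed class name
    return loaded
\end{verbatim}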
As the project progressed and more posts were classified, the "surprise" and "neutral" emotion classes were observed to dominate the dataset, so they were removed from the emotional analysis.
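The filtering step can be sketched as follows; the specific HuggingFace model shown here is an assumption for illustration, not necessarily the model used by the system:
\begin{verbatim}
# Illustrative sketch: drop the dominating classes and re-pick the top
# emotion. The model named here is an assumed stand-in.
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base",
                      top_k=None)  # return scores for every class

EXCLUDED = {"neutral", "surprise"}

def dominant_emotion(text):
    scores = classifier([text])[0]  # list of {"label": ..., "score": ...}
    kept = [s for s in scores if s["label"] not in EXCLUDED]
    return max(kept, key=lambda s: s["score"])
\end{verbatim}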
\subsubsection{Topic Classification}
For topic classification, a zero-shot classification approach was used, which allows for classification of text into arbitrary topic classes without needing to fine-tune a model for each specific set of topics. Initially, attempts were made to automatically generate topic classes based on the most common words in the dataset using TF-IDF, but this led to generic and strange classes that weren't useful for analysis. Therefore, it was decided that a topic list would be provided manually, either by the user or using a generic list of broad common topics.
Initially, the "all-mpnet-base-v2" \cite{all_mpnet_base_v2} was used as the base model for the zero-shot classification, which is a general-purpose sentence embedding model. While this worked well and produced good results, it was slow to run interference on large datasets, and would often take hours to classify a dataset of over 60,000 posts and comments.
Initially, the "all-mpnet-base-v2" \cite{all_mpnet_base_v2} was used as the base model for the zero-shot classification, which is a general-purpose sentence embedding model. While this worked well and produced good results, it was slow to run inference on large datasets, and would often take hours to classify a dataset of over 60,000 posts and comments.
Eventually, the "MiniLM-L6-v2 " \cite{minilm_l6_v2} was chosen as the base model for zero-shot classification, which is a smaller and faster sentence embedding model. While it may not produce quite as good results as the larger model, it still produces good results and is much faster to run inference on, which makes it more practical for use in this project.
Eventually, the "MiniLM-L6-v2 " \cite{minilm_l6_v2} was chosen as the base model for zero-shot classification, which is a smaller and faster sentence embedding model. While it may not produce quite as good results as the larger model, it still produces good results and is much faster to run inference on, which makes it more practical for use in this project.
\subsubsection{Entity Recognition}
At this point the NLP pipeline was taking a long time to run on large datasets (such as the Cork dataset), so any NER (Named Entity Recognition) model added needed to be small and fast at inference. The "dslim/bert-base-NER" model from HuggingFace \cite{dslim_bert_base_ner} was chosen: it is a fine-tuned BERT model that performs named entity recognition and is relatively small and fast compared to other NER models.
This model outputs a list of entities for each post; each entity has one of the following types:
\begin{itemize}
\item\textbf{PER}: people and personal names.
\item\textbf{ORG}: organisations, such as companies and institutions.
\item\textbf{LOC}: locations, such as cities and countries.
\item\textbf{MISC}: miscellaneous entities that do not fit the other three types.
\end{itemize}
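Extracting these entities with the HuggingFace pipeline API can be sketched as follows; the example sentence is invented:
\begin{verbatim}
# Minimal sketch of entity extraction with dslim/bert-base-NER.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

for e in ner("Conway's Yard in Cork was packed for the England match."):
    # each result carries an entity_group (PER/ORG/LOC/MISC), text and score
    print(e["entity_group"], e["word"], e["score"])
\end{verbatim}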
An accuracy evaluation was performed by manually annotating a sample of 50 posts and comparing the annotations with the model outputs. The results of this evaluation are as follows:
\begin{itemize}
\item\textbf{Topic Classification Accuracy}: 64\% (32 out of 50 posts were classified with the correct topic).
\end{itemize}
The emotional classification was notably limited in some regards. The decision described in Section \ref{sec:emotion-classification} to remove the "neutral" and "surprise" emotion classes was made after observing that the two classes were dominating the dataset. However, removing the neutral class led to some posts being misclassified as another emotion that may not have been accurate; for example, take the content of the eleventh post in the output file (Record 11):
\begin{quote}
\textit{[\ldots]}
\end{quote}
In addition, some confusion arose between the "disgust" and "anger" emotion classes, as in the following post:
\begin{quote}
\textit{[\ldots]}
\end{quote}
The model classified this post as "disgust" with a confidence of 0.35 and "anger" with a confidence of 0.38. This is a borderline case, and even two human annotators could disagree on whether the post is more "disgust" or "anger", so it is understandable that the model struggled. This highlights a limitation of emotional classification: emotions can be nuanced and subjective, and a model may not always capture the true emotional tone of a post.
A significant reason the accuracy sat around 60--70\% is the model's inability to represent the multi-dimensional nature of human emotion. Many posts express multiple emotions simultaneously (e.g., frustration mixed with humour), yet the model is constrained to selecting a single dominant class. This leads to misclassification in cases where no single emotion is clearly dominant.
In addition, the temporary exclusion of the "neutral" class forced inherently neutral posts into a specific category, artificially lowering accuracy. Borderline cases between closely related emotions (such as anger and disgust) also contributed to disagreement between the manual annotations and the model predictions, showing how subjective emotional expression can be.
\subsubsection{Topic Classification Discussion}
The topic classification also had some limitations, particularly with posts that contained multiple topics. For example, take the content of the 26th post in the output file:
\begin{quote}
\textit{We're staying in the city centre so walkable to most places. I checked electrics website earlier. Looked nice. Ended up booking Joules for Thursday then for Friday, we will try a new place called "conways yard" that was recommended here. In hoping to watch the England match there so I'd imagine if have to get there well before kick off (8pm) to get a seat bear a TV.}
\end{quote}
This post touches on several topics at once (travel, food, and sport), but the classifier must assign a single dominant topic, so any one label can only partially describe its content.
\subsubsection{Web Scraping Fragility}
The Boards.ie connector relies on web scraping, which is very fragile and prone to breaking whenever the site's HTML structure changes.
\subsubsection{English-Only Support}
Two of the three NLP models used in the system are trained exclusively on English-language data. This means the system cannot accurately analyse datasets in other languages, which limits its usefulness for researchers working with non-English communities. This was noted as a specific concern by participants in the user feedback session, who work with both English and Turkish datasets.
\subsubsection{Scalability}
While asynchronous processing via Celery and Redis mitigates blocking during NLP enrichment and data fetching, the system is not designed to scale horizontally. A single Celery worker handles all tasks sequentially, and the PostgreSQL database is not configured for high availability or replication. This is adequate for research use at small to medium scale, but supporting concurrent large-scale usage across many users would require significant infrastructure changes.
\newpage
\section{Conclusions}
\subsection{Reflection}
On a personal level, the project was a significant learning experience in terms of both technical skills and project management.
% Figure: Gantt chart of the project timeline (fig:gnatt_chart)
The project was maintained and developed using Git for version control, with the repository hosted on GitHub.
Starting in November, the project went through a few iterations of basic functionality such as data retrieval and storage. Research was done on digital ethnography, the traditional metrics used, and how they are implemented in code. The design of the system was also iterated on, evolving from a very simple frontend showing basic aggregates into a more complex, feature-rich dashboard with multiple analytical perspectives and NLP enrichments.
The majority of real development and implementation took place between January and April, with the final month of April being focused on testing, bug fixing, writing the report and preparation for the open day. The project was developed in an agile and iterative way, with new features being added and improved upon throughout the development process, rather than having a fixed plan for the entire project from the beginning.
Git was used as a changelog of decisions and rationale to aid in writing the report. If this project were to be done again, however, I would maintain the report alongside the implementation from the beginning, as it would have made writing the report much easier and less stressful at the end.
\subsection{Future Work}
This section discusses several potential areas for future work and improvements to the system.
\subsubsection{Improved Emotional Analysis}
As noted in the user feedback and accuracy evaluation sections, the emotional analysis could be improved by implementing a more nuanced emotion classification model, such as the GoEmotions model with 27 emotion classes \cite{demszky2020goemotions}.
This would require some changes to the database schema: currently, the "events" table contains a column for each of the five emotion classes, which would not be feasible with 27 classes. A more flexible schema would be needed, such as a separate "emotions" table containing the emotion classifications for each post, with columns for the post ID, emotion class, and confidence score.
Alternatively, emotions could be stored the way NER classifications are: as \texttt{JSONB} columns containing a list of all the classifications for each post, which allows a variable number of classifications and is more flexible for future changes to the emotion classification.
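A sketch of the separate-table option in SQLAlchemy is shown below; the table and column names are illustrative assumptions, not the system's actual schema:
\begin{verbatim}
# Sketch of a one-row-per-emotion schema; names are illustrative.
from sqlalchemy import Column, Integer, String, Float, ForeignKey
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class EmotionScore(Base):
    __tablename__ = "emotions"
    id = Column(Integer, primary_key=True)
    post_id = Column(Integer, ForeignKey("events.id"), nullable=False)
    emotion = Column(String, nullable=False)   # e.g. "remorse", "admiration"
    confidence = Column(Float, nullable=False)
\end{verbatim}
This keeps the number of emotion classes out of the schema entirely, so swapping in a 27-class model would not require a migration beyond repopulating the table.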
\subsubsection{Multilingual Support}
The project was largely built around English-language datasets, so the emotional and NER models are trained on English data and would not work with other languages. Beyond the NLP models, the stance and identity markers currently implemented use English-specific keywords such as "we", "us", "I", and "me".
To support multilingual datasets, multilingual NLP models could be used, with language detection performed automatically. However, as stance and identity markers are language-specific, a better solution would be for the user to specify the dataset's language at upload, after which the system could select the correct NLP models, stance/identity marker lists, and stop words for that language.
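Such per-language configuration could be as simple as a lookup table keyed by the user's selection; the model names and marker lists below are illustrative assumptions:
\begin{verbatim}
# Sketch of per-language configuration; entries are illustrative.
LANGUAGE_CONFIG = {
    "en": {
        "ner_model": "dslim/bert-base-NER",
        "identity_markers": ["we", "us", "our", "i", "me", "my"],
        "stop_words": "english",
    },
    "tr": {
        "ner_model": "a-turkish-ner-model",  # hypothetical placeholder
        "identity_markers": ["biz", "bize", "ben", "bana"],
        "stop_words": "turkish",
    },
}

def config_for(language):
    # the user selects the dataset language at upload time
    return LANGUAGE_CONFIG[language]
\end{verbatim}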
\subsubsection{Improved Corpus Explorer}
The corpus explorer could be improved by showing more metadata for each post, such as the full NLP classifications rather than just the top emotion and topic.
In addition, reconstructing reply chains and conversation structures in the corpus explorer would let users see the context of each post and how posts relate to one another, allowing researchers to gauge power dynamics between users and the structure of conversations.
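Reconstructing the chains is straightforward if each stored record keeps its own ID and an optional parent ID, as sketched below:
\begin{verbatim}
# Sketch: rebuild reply trees from flat records with id/parent_id fields.
from collections import defaultdict

def build_reply_tree(records):
    children = defaultdict(list)
    roots = []
    for rec in records:
        if rec.get("parent_id") is None:
            roots.append(rec)                        # top-level posts
        else:
            children[rec["parent_id"]].append(rec)   # replies by parent

    def attach(node):
        node["replies"] = [attach(c) for c in children[node["id"]]]
        return node

    return [attach(root) for root in roots]
\end{verbatim}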
Colour grading each post in the corpus explorer based on its emotional classification would be both aesthetically pleasing and useful, letting users quickly scan the posts and get a sense of the emotional tone of the dataset.