**crosspost** is a web-based analytics platform for exploring online communities. Built as a final-year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.

The motivating use case is *digital ethnography*: the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities. The included dataset is centred on Cork, Ireland.

The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
## What it does
- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
- Normalise everything into a unified schema regardless of source
- Run NLP analysis asynchronously in the background via Celery workers
- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
- Multi-user support — each user has their own datasets, isolated from everyone else
By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
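As an illustration, the normalisation step described above might look roughly like this. This is a minimal sketch, not the project's actual code: the unified field names are assumptions, though the platform-specific keys (`selftext`, `created_utc`, `snippet.textOriginal`) follow the Reddit and YouTube APIs.

```python
# Sketch of cross-platform normalisation into one unified schema.
# The unified field names here are illustrative assumptions.
def normalise(raw: dict, platform: str) -> dict:
    """Map a platform-specific post into a common post object."""
    if platform == "reddit":
        return {
            "platform": "reddit",
            "id": raw["id"],
            "author": raw.get("author", "[deleted]"),
            "text": raw.get("selftext") or raw.get("body", ""),
            "created_utc": raw["created_utc"],
        }
    if platform == "youtube":
        snippet = raw["snippet"]
        return {
            "platform": "youtube",
            "id": raw["id"],
            "author": snippet.get("authorDisplayName", ""),
            "text": snippet.get("textOriginal", ""),
            "created_utc": snippet["publishedAt"],  # ISO 8601 string
        }
    raise ValueError(f"unknown platform: {platform}")
```

Once every source is mapped into the same shape, the downstream NLP and dashboard code never needs to know where a post came from.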
## Goals for this project
- Collect data ethically: enable users to link or upload text, images, and interaction data (messages etc.) from specified online communities. An automated import method (using APIs or scraping techniques) could be included as well.
- Organise content: store gathered material in a structured database with tagging for themes, dates, and sources.
- Analyse patterns: use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
- Visualise insights: present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
- Have clearly stated and explained ethical and privacy guidelines for users.

The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard. Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research, turning raw online interactions into meaningful cultural insights.

## Scope
This project focuses on:
- Designing a modular data ingestion pipeline
- Implementing backend data processing and storage
- Integrating lightweight NLP-based analysis
- Building a simple, accessible frontend for exploration and visualisation

# Prerequisites
- Docker & Docker Compose
- A Reddit app (client ID & secret)
- A YouTube Data API v3 key

# Setup
1) **Clone the repo**
```
cd crosspost
```
2) **Configure environment variables**
```
cp example.env .env
```
Fill in each required empty variable. Some are already filled in; these are sensible defaults that usually don't need to be changed.
3) **Start everything**
```
docker compose up -d
```
This starts:
- `crosspost_db`: PostgreSQL on port 5432
- `crosspost_redis`: Redis on port 6379
- `crosspost_flask`: Flask API on port 5000
- `crosspost_worker`: Celery worker for background NLP/fetching tasks
- `crosspost_frontend`: Vite dev server on port 5173
# Requirements
- **Python packages** listed in `requirements.txt`
- **GPU support**: The Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the `deploy.resources` block in `docker-compose.yml` and make sure the NVIDIA Container Toolkit is installed.

# Data Format for Manual Uploads
If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (`.jsonl`), where each line is a post object.
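For example, two valid lines might look like this (the field names shown are illustrative assumptions; check the output of the built-in connectors for the authoritative schema):

```
{"id": "abc123", "platform": "reddit", "author": "user_a", "created_utc": 1700000000, "title": "Best coffee in Cork?", "text": "Any recommendations near the city centre?"}
{"id": "def456", "platform": "youtube", "author": "user_b", "created_utc": 1700003600, "title": "", "text": "Great video, the English Market looks class."}
```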
In signing this declaration, you are confirming, in writing, that the submitted work is entirely your own original work, except where clearly attributed otherwise, and that it has not been submitted partly or wholly for any other educational award.
I hereby declare that:
\begin{itemize}
\item this is all my own work unless clearly indicated otherwise, with full and proper accreditation;
\item with respect to my own work: none of it has been submitted at any education institution contributing in any way to an educational award;
\item with respect to another’s work: all text, diagrams, code, or ideas, whether verbatim, paraphrased, or otherwise modified or adapted, have been duly attributed to the source in a scholarly manner, whether from books, papers, lecture notes or any other student’s work, whether published or unpublished, electronically or in print.
\end{itemize}
I would like to thank my supervisor, Paolo Palmieri, for his guidance and support throughout this project.
I would also like to thank Mastoureh Fathi, Pooya Ghoddousi, and Martino Zibetti on the MIGDIS project for taking the time to provide valuable feedback on the project and suggestions for future work.
\newpage
\section*{Abstract}
Online communities generate vast volumes of discourse that traditional ethnographic methods cannot analyse at scale. This project presents \textbf{Crosspost}, a web-based platform that applies computational methods to the study of online communities, bridging quantitative data analysis and qualitative digital ethnography.
The system aggregates public discussion data from multiple social media platforms, enriching it with Natural Language Processing techniques including emotion classification, topic modelling, and named entity recognition. Six analytical perspectives (temporal, linguistic, emotional, user, interactional, and cultural) are presented through an interactive dashboard, allowing researchers to explore community behaviour, identity signals, and affective tone across large datasets without sacrificing access to the underlying posts.
The platform is evaluated against a Cork-specific dataset spanning Reddit, YouTube, and Boards.ie, demonstrating its ability to surface ethnographic insights relating to geographic identity, civic sentiment, and participation inequality across different online communities.
\newpage
\tableofcontents
\newpage
\pagenumbering{arabic}
\section{Introduction}
This project presents the design and implementation of a web-based analytics engine for the exploration and analysis of online discussion data. Built using \textbf{Flask and Pandas}, and supplemented with \textbf{Natural Language Processing} (NLP) techniques, the system provides an API for extracting structural, temporal, linguistic, and emotional insights from social media posts. A React-based frontend delivers interactive visualisations and user controls, while the backend implements an analytical pipeline for the data, including parsing, manipulation, and analysis.
\vspace{0.5cm}
Beyond its technical objectives, the system is conceptually informed by approaches from \textbf{digital ethnography} and computational social science. Traditional ethnography is the practice of studying an individual or group culture from the point of view of the subjects of the study. Digital ethnography seeks to understand how social relations, topics and norms are constructed in online spaces.
\subsection{Motivation}
There are many beneficiaries of a digital ethnography analytic system: social scientists gain a deeper understanding of contemporary culture and online communities; businesses and marketers can better understand consumer behaviour and online engagement; educators and designers can improve digital learning environments and user experiences; and policymakers can make informed decisions regarding digital platforms, online safety, and community regulation.
\subsection{Goals \& Objectives}
\begin{itemize}
\item\textbf{Collect data ethically}: enable users to link or upload text and interaction data (messages etc.) from specified online communities. An automated import method (using APIs or scraping techniques) could be included as well.
\item\textbf{Organise content}: Store gathered material in a structured database with tagging for themes, dates, and sources.
\item\textbf{Analyse patterns}: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
\item\textbf{Visualise insights}: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
\end{itemize}
\subsection{The Cork Dataset}
A defining feature of this project is its focus on a geographically grounded dataset centred on \textbf{Cork, Ireland}. The system analyses publicly available discussions relating to Cork drawn from multiple online platforms:
\begin{itemize}
\item The \textbf{r/Cork} subreddit
\item The \textbf{r/Ireland} subreddit using a Cork-specific search filter
\item\textbf{YouTube} videos retrieved using Cork-related search queries
\item The \textbf{Boards.ie Cork section}
\end{itemize}
\newpage
\section{Background}
\subsection{What is Digital Ethnography?}
\textit{Digital Ethnography} is the study of cultures and interactions in various online spaces, such as forums, posts and video comments. The goal is not only to describe high-level statistics such as the number of posts and posts per day, but also to analyse people's behaviour at an interactional and cultural level, delving into common phrases, interaction patterns, and common topics and entities.
There are multiple methods of carrying out digital ethnography, such as online participant observation through automated or manual means, digital interviews via text or video, or tracing digital footprints.
Compared to traditional ethnography, digital ethnography is usually faster and more cost-effective, due to the availability of large swathes of data across social media sites such as Reddit, YouTube, and Facebook, and the lack of need to travel. Traditional ethnography often relied on in-person interviews and in-person observation of communities \cite{coleman2010ethnographic}.
\subsubsection{Traditional Ethnography}
Ethnography originated in the late nineteenth and early twentieth centuries as a method for understanding cultures through long-term fieldwork. The goal was not just to describe behaviour, but to show how people made sense of their world. Over time, ethnography grew beyond anthropology into sociology, media studies, education, and human-computer interaction, becoming a broadly used qualitative research approach. Traditional ethnography was closely tied to physical locations: villages, workplaces or towns. However, as communication technologies developed and social life increasingly took place through technological mediums, it was no longer tied to a physical place.
\subsubsection{Transition to Digital Spaces}
The rise of the internet in the late twentieth century massively changed social interaction. Online forums, emails, SMS and social media platforms became central to human communication, and all types of groups and identities were constructed within them. As a result, ethnographic methods were adapted to study these emerging digital environments. Early work in this area was referred to as "virtual ethnography" or "digital ethnography", as online spaces began to mix and intertwine with traditional cultural spaces.
There are new challenges to overcome in comparison to traditional ethnography. Digital ethnography is distributed across platforms, devices and online-offline interactions. For example, a digital ethnographer studying influencer culture might examine Instagram posts, comment sections, private messages, and algorithms, and also conduct interviews or observe offline events. In some ways, however, digital ethnography is easier than traditional ethnography: cost is reduced, as there is no need to travel or spend long periods in the field; it is less invasive, as there is no need to interact with subjects directly; and there is a much larger amount of data available for analysis \cite{cook2023ethnography}.
\subsection{Online Communities}
There are many different types of online communities, often structured in various ways, with many different types of users, norms and power dynamics. These communities can range from large-scale social networking platforms and discussion forums to niche interest groups. Each type of community fosters different forms of interaction, participation, and identity construction.
Participation within these communities is usually not evenly distributed. The majority of users are passive consumers (lurkers) \cite{sun2014lurkers}, a smaller percentage contribute occasionally, and a very small core group produces most of the content. This uneven contribution structure has significant implications for digital ethnography, as visible discourse may disproportionately reflect the perspectives of highly active members rather than the broader community. This is particularly evident in reputation-based systems such as Reddit, which allow the opinions of a few to rise above the rest.
Examples of digital spaces include:
\begin{itemize}
\item\textbf{Social media platforms} (e.g., Facebook, Twitter, Instagram) where users create profiles, share content, and interact with others.
\item\textbf{Online forums and communities} (e.g., Reddit, Boards.ie) where users engage in threaded discussions around specific topics or interests.
\item\textbf{Video platforms} (e.g., YouTube) where users share and comment on video content, often fostering communities around specific channels or topics.
\item\textbf{Messaging apps} (e.g., WhatsApp, Discord) where users engage in private or group conversations, often with a more informal and intimate tone.
\end{itemize}
\subsection{Digital Ethnography Metrics}
This section describes common keywords and metrics used to measure and quantify online communities using digital ethnography.
\subsubsection{Sentiment Analysis}
Sentiment Analysis involves capturing the emotions associated with a specific post, topic or entity. This type of analysis can be as simple as classifying a post as "positive" or "negative", or classifying a post into a set of pre-existing emotions such as anger, joy or sadness.
\subsubsection{Active vs Passive Participation}
\label{sec:passive_participation}
Not everyone in an online community participates in the same way. Some users post and comment frequently, while many others simply read without contributing.
This distinction between active and passive participation (passive users are often referred to as "lurkers") is important in digital ethnography, because looking only at posts and comments can give a misleading picture of how large or engaged a community actually is.
This uneven distribution of participation is well documented in the literature. The "90-9-1" principle describes a consistent pattern across many online communities, whereby approximately 90\% of users only consume content, 9\% contribute occasionally, and just 1\% are responsible for the vast majority of content creation \cite{sun2014lurkers}.
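As a minimal illustration, the contribution distribution behind the 90-9-1 principle can be measured directly by counting posts per author. The data below is invented for the sketch; it is not drawn from the Cork dataset.

\begin{verbatim}
from collections import Counter

# Toy dataset: 100 posts from three authors (invented data).
posts = [{"author": a} for a in
         ["alice"] * 90 + ["bob"] * 9 + ["carol"]]

counts = Counter(p["author"] for p in posts)
total = sum(counts.values())

# Share of all content produced by the single most active user.
top_share = counts.most_common(1)[0][1] / total
\end{verbatim}

On real data, plotting the sorted per-author counts typically reveals the characteristic long-tail shape of participation inequality.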
\subsubsection{Temporal Activity Patterns}
Looking at when a community is active can reveal quite a lot about its nature and membership. A subreddit that peaks at 2am UTC might have a mostly American userbase, while one that is consistently active across all hours could suggest a more globally distributed community. Beyond timezones, temporal patterns can also capture how a community responds to external events; a sudden spike in posting activity often corresponds to something newsworthy happening that is relevant to the community.
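An hourly activity profile of this kind can be computed by bucketing post timestamps by hour of day. The sketch below uses invented Unix timestamps, one hour apart:

\begin{verbatim}
from collections import Counter
from datetime import datetime, timezone

# Unix timestamps of three posts (invented data).
timestamps = [1700000000, 1700003600, 1700007200]

hours = [datetime.fromtimestamp(t, tz=timezone.utc).hour
         for t in timestamps]
profile = Counter(hours)  # number of posts per UTC hour of day
\end{verbatim}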
Hedge and certainty markers are discussed in \cite{shen2021stance}.
\textbf{Deontic Patterns} contain phrases that imply obligation, such as "must, should, need, have to". In the context of online communities, these patterns are often used to assert authority or to reinforce communal norms and "unwritten rules."
\textbf{Permission Patterns} refer to phrases where someone is asking permission, like "can, allowed, ok, permitted". These patterns could serve as an indicator of a user's status within an online community.
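Detecting such markers can be as simple as matching curated phrase lists against each post. A minimal sketch follows; the lexicons shown are illustrative assumptions, not the system's actual lists:

\begin{verbatim}
import re

# Illustrative phrase lists (assumptions, not the real lexicons).
MARKERS = {
    "deontic": ["must", "should", "need to", "have to"],
    "permission": ["can i", "am i allowed", "is it ok", "permitted"],
}

def stance_markers(text: str) -> list[str]:
    """Return the marker categories whose phrases occur in the text."""
    lowered = text.lower()
    return [cat for cat, phrases in MARKERS.items()
            if any(re.search(r"\b" + re.escape(p) + r"\b", lowered)
                   for p in phrases)]
\end{verbatim}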
\subsection{Natural Language Processing}
\textbf{Natural Language Processing} is a branch of artificial intelligence that allows machines to interpret, analyse and generate human language. The aim of NLP models is not only to understand individual words, but to understand them in context.
NLP can carry out many different types of tasks, such as classifying sentences or paragraphs, generating text content, extracting answers from text or even speech recognition in audio. However, even with the advances in NLP models, many challenges and limitations remain. These include understanding ambiguity, cultural context, sarcasm, and humour.
Digital ethnography traditionally relied on manual reading of texts and interviews.
NLP techniques can be used to automatically process and analyse large volumes of text, applying ethnographic methods at scale. For example, NLP can be used to identify common themes and topics in a subreddit, track how these themes evolve over time, and even detect the emotional tone of discussions. This allows researchers to gain insights into the dynamics of online communities that would be impossible to achieve through manual analysis alone.
\subsubsection{Sentiment Analysis}
\textbf{Sentiment Analysis} involves determining the emotional tone behind a piece of text. It is commonly used to classify text as positive, negative, or neutral. More advanced sentiment analysis models can detect nuanced emotions, such as frustration, satisfaction, or sarcasm, although accurately identifying these emotions remains a challenge \cite{giuffre2026sentiment}. For ethnographic analysis, sentiment analysis can provide insights into the emotional dynamics of a community, such as how users feel about certain topics or how the overall mood of discussions changes over time.
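At its simplest, sentiment classification can be sketched as lexicon counting. Real systems use trained models rather than fixed word lists, but the toy version below illustrates the basic idea; the word lists are invented for illustration:

\begin{verbatim}
# Toy lexicons (invented for illustration only).
POSITIVE = {"love", "great", "grand", "brilliant"}
NEGATIVE = {"hate", "awful", "terrible", "broken"}

def sentiment(text: str) -> str:
    """Classify text by counting lexicon hits (toy illustration)."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
\end{verbatim}

The limitations discussed later in this chapter (sarcasm, irony, cultural context) apply even more strongly to such lexicon-based approaches than to trained models.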
\subsubsection{Named Entity Recognition}
\textbf{Named Entity Recognition (NER)} is the process of identifying and classifying key entities within a text into predefined categories like names of people, organisations, locations, or dates. NER is essential for structuring unstructured text data and is often used in information extraction, search engines, and question-answering systems. Despite its usefulness, NER can struggle with ambiguous entities or context-dependent meanings.
This method is often used to organise large amounts of unstructured data, such as news articles.
\subsubsection{Stop Words}
\textbf{Stop Words} are common words that are often filtered out in NLP tasks because they carry little meaningful information. Examples of stop words include "the", "is", "in", "and", etc. Removing stop words can help improve the performance of NLP models by reducing noise and focusing on more informative words. However, the choice of stop words can vary depending on the context and the specific task at hand.
For example, in a Cork-specific dataset, words like "ah" or "grand" might be considered stop words, as they are commonly used in everyday speech but do not carry significant meaning for analysis \cite{mungalpara2022stemming}.
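Stop-word filtering with such a domain-specific extension can be sketched in a few lines; both word lists below are illustrative assumptions rather than the system's actual lists:

\begin{verbatim}
# Illustrative stop-word lists (assumptions, not the real ones).
BASE_STOP_WORDS = {"the", "is", "in", "and", "a", "to"}
CORK_STOP_WORDS = {"ah", "grand", "like", "sure"}
STOP_WORDS = BASE_STOP_WORDS | CORK_STOP_WORDS

def content_words(text: str) -> list[str]:
    """Drop stop words, keeping the informative tokens."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]
\end{verbatim}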
\subsection{Limits of NLP}
\label{sec:nlp_limitations}
While computational methods enable large-scale observation and analysis of online communities, they have limitations that must be acknowledged, many of which stem from NLP techniques themselves and from the practical boundaries of computational resources.
Natural language processing models are central to many aspects of the virtual ethnography, such as emotional and topic classification. While these models have shown strong results in many areas, they are imperfect and may produce inaccurate or misleading results.
One key limitation is that models often find it difficult to interpret context-dependent language. Online communities often use sarcasm, irony or culturally specific references, all of which are challenging for NLP models to interpret correctly. For example, a sarcastic comment might be incorrectly classified as positive, despite conveying negativity.
Emojis and emoticons are a common feature of online communication and can carry significant emotional meaning. However, NLP models may struggle to accurately interpret the sentiment conveyed by emojis, especially when they are used in combination with text or in a sarcastic manner. \cite{ahmad2024sentiment}
In addition, the simplification of complex human interactions and emotions into discrete categories inevitably loses nuance.
The performance and speed of the system are influenced by the computational resources available during development and execution. While the system will attempt to use GPU acceleration during NLP inference, these resources may not always be available, or may not be particularly powerful where they do exist.
\subsection{Cork Dataset}
\label{sec:cork_dataset}
The Cork dataset serves as the foundation for this project, providing a geographically and culturally grounded corpus for analysis. Rather than examining a globally distributed or topic-neutral community, the dataset centres on a single city, Cork, Ireland, which allows the system's analytical outputs to be interpreted against a known social and cultural context.
The dataset is drawn from four distinct online platforms, each of which represents a structurally different mode of online community participation:
\newpage
\section{Analysis}
\subsection{Goals \& Objectives}
The objective of this project is to provide a tool that can assist social scientists, digital ethnographers, and researchers to observe and interpret online communities and the interactions between them. Rather than replacing the study of digital ethnography or the related fields, this tool aims to aid researchers in analysing communities.
Specifically, the system aims to:
\item\textbf{Track emotional and cultural signals}: Analyse sentiment and emotion distributions across posts, users, and topics to better understand the affective tone of discussions and how it evolves over time.
\item\textbf{Examine temporal evolution}: Provide time-series analysis of activity levels, topic trends, and emotional shifts, enabling longitudinal observation of community development.
\item\textbf{Promote ethical data practices}: Restrict analysis to publicly available data, provide opt-out mechanisms for computationally intensive processing, and ensure responsible handling of user-generated content.
\end{itemize}
\subsubsection{Data Normalisation}
Different social media platforms will produce data in many different formats. For example, Reddit data will have a very different reply structure from a forum-based platform like Boards.ie, where there are no nested replies. Therefore, a core design requirement of the system is to normalise all incoming data into a single unified internal data model. This allows the same analytical functions to be applied across all data sources, regardless of their original structure.
Both comments and posts represent user-generated content that contributes to the community discourse. Therefore, the system will normalise all posts and comments into a single "event" data model, which will allow the same analytical functions to be applied uniformly across all content. This also simplifies the data model and reduces the complexity of the analytical pipeline, since there is no need to maintain separate processing paths for posts and comments.
Though separate processing paths are not needed, the system will still retain metadata that indicates whether an event was originally a post or a comment, as well as any relevant structural information (e.g., parent-child relationships in Reddit threads).
\begin{itemize}
\item Utilise GPU acceleration where available for NLP inference.
\item Pre-compute some analytical results during data ingestion to speed up subsequent queries.
\item Store NLP outputs in the database to avoid redundant processing.
\item Implement asynchronous processing for long-running tasks.
\end{itemize}
\begin{itemize}
\item Respect rate limits by implementing an exponential backoff strategy for API requests.
\item Only collect data that is publicly available and does not require authentication or violate platform terms of service.
\item Provide user-agent headers that identify the system and its purposes.
\item Allow users the option to upload their own datasets instead of automated collection.
\item For websites without an API, the \texttt{robots.txt} file will be examined to ensure compliance with platform guidelines.
\item Data volume limits of up to 1000 posts per source will be enforced server-side to prevent excessive data collection.
\end{itemize}
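The exponential backoff strategy above could be sketched as follows. This is an illustrative sketch, not the system's actual implementation; \texttt{RateLimitError} is a hypothetical placeholder for whatever the HTTP client raises on a 429 response.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when a platform responds with HTTP 429."""


def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a rate-limited request with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            # Wait base, 2*base, 4*base, ... plus jitter so that concurrent
            # clients do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("rate limit persisted after all retries")
```

Jitter is included because synchronised retries from multiple workers would otherwise hit the rate limit again at the same instant.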
Some platforms provide APIs that allow for easy and ethical data collection, such as YouTube and Reddit. These APIs have clear guidelines and rate limits that the system will adhere to. \cite{chugani2025ethicalscraping}
\paragraph{Reddit (API)}
Reddit provides a public API that allows for the retrieval of posts, comments, and metadata from subreddits. The system will use the official Reddit API with proper authentication via OAuth2 and access tokens.
In November 2025, Reddit introduced a new approval process for API access, which requires developers to apply for access and specify their intended use case. While the public unauthenticated endpoints are still accessible, they have far stricter rate limits (100 requests every 10 minutes) compared to authenticated access (100 requests per minute). Therefore, the system shall allow for authenticated access to the Reddit API to speed up data retrieval.
Unauthenticated access will still be available as a fallback if client credentials are not provided on the backend, but this will massively slow the data retrieval process, and this will still only fetch public posts and comments.
From Reddit, the system will collect posts, comments and all replies to comments, as well as metadata such as the author name and timestamp.
\paragraph{Boards.ie (Web Scraping)}
Boards.ie is an Irish discussion forum with no public API, so the system will use web scraping instead. The platform's \texttt{robots.txt} will be used to ensure compliance with the site's guidelines for automated access. The Boards.ie \texttt{robots.txt} file contains the following information:
\begin{verbatim}
Sitemap: https://www.boards.ie/sitemapindex.xml
Authentication is handled through an API key issued via the Google Cloud Console. The API enforces a quota system rather than a traditional rate limit: each project is allocated 10,000 quota units per day by default, with different operations consuming different amounts.
In addition, comment retrieval can be disabled by the video uploader, so the system will handle this case by skipping videos where comments are not accessible.
\subsubsection{Data Storage \& Retention}
All data fetched from social media sites is stored locally in a PostgreSQL database. The system will not share or expose any of this data to third parties beyond the users of this application. Raw API responses are discarded once the relevant information is extracted.
The system will not store any personally identifiable information except for what is necessary for the analysis, which includes only usernames and timestamps. The system will not attempt to de-anonymise content creators or link data across platforms.
\subsubsection{User Security}
Standard security practices will be followed to protect user data and prevent unauthorised access. This includes:
\begin{itemize}
\item The hashing of all user passwords and no storage of plaintext passwords.
\item The use of JWTs for session management, with secure signing and an expiration time of 24 hours.
\item Access control on all analysis API endpoints to ensure that end-users can only access their own datasets and results.
\item Parameterised queries for all database interactions to prevent SQL injection attacks.
\end{itemize}
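To illustrate the first two practices, the following is a minimal standard-library sketch of salted password hashing and a signed, expiring session token. The real system would rely on Flask-Bcrypt and a proper JWT library rather than this hand-rolled version; the function names here are illustrative only.

```python
import base64
import hashlib
import hmac
import json
import os
import time
from typing import Optional


def hash_password(password: str, salt: Optional[bytes] = None) -> tuple:
    """Salted PBKDF2 hash; the plaintext password is never stored."""
    salt = salt or os.urandom(16)
    return salt, hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    # compare_digest avoids timing side-channels when checking the hash.
    return hmac.compare_digest(hash_password(password, salt)[1], digest)


SECRET = os.urandom(32)  # signing key; a real deployment reads this from config


def issue_token(user_id: int, ttl: int = 24 * 3600) -> str:
    """Signed token with a 24-hour expiry, mirroring the JWT approach above."""
    payload = json.dumps({"sub": user_id, "exp": time.time() + ttl}).encode()
    sig = hmac.new(SECRET, payload, "sha256").digest()
    return b".".join(map(base64.urlsafe_b64encode, (payload, sig))).decode()


def validate_token(token: str) -> Optional[int]:
    """Return the user id if the signature is valid and unexpired, else None."""
    payload_b64, sig_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    if not hmac.compare_digest(hmac.new(SECRET, payload, "sha256").digest(),
                               base64.urlsafe_b64decode(sig_b64)):
        return None
    claims = json.loads(payload)
    return claims["sub"] if time.time() <= claims["exp"] else None
```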
\subsection{Requirements}
The following requirements are derived from the backend architecture, NLP processing pipeline, and the React-based frontend interface.
\subsubsection{Functional Requirements}
\paragraph{Data Ingestion and Preparation}
\begin{itemize}
\item The system shall accept social media data in \texttt{.jsonl} format containing posts and nested comments.
\item The system shall validate uploaded files and return structured error responses for invalid formats or malformed data.
\item The system shall normalise posts and comments into a unified event-based dataset.
\item The system shall give the user the option to automatically fetch datasets from social media sites filtered for specific keywords or categories.
\item The system shall provide a loading screen with a progress bar after the dataset is uploaded.
\end{itemize}
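The validation requirement could be sketched as a line-by-line \texttt{.jsonl} parser that collects structured per-line errors instead of failing on the first malformed record. This is an illustrative sketch, not the system's actual upload handler.

```python
import json


def parse_jsonl(path: str) -> tuple:
    """Parse a .jsonl upload, returning valid records and per-line error messages."""
    records, errors = [], []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # skip blank lines rather than reporting them
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: {exc.msg}")
    return records, errors
```

Returning both lists lets the API respond with the structured error detail the requirement calls for, while still ingesting the valid portion of the file.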
\paragraph{Dataset Management}
\begin{itemize}
\item The system shall utilise Natural Language Processing models to generate analytical outputs such as sentiment analysis, topic modelling, and named entity recognition.
\item The system shall allow the users to view the raw dataset.
\item The system shall provide endpoints that return detailed statistics grouped into themes.
\end{itemize}
\paragraph{Filtering and Search}
\begin{itemize}
\item The system shall support keyword-based, date-based and source-based filtering of the dataset.
\item The system shall allow multiple filters to be applied simultaneously.
\item The system shall return a filtered dataset reflecting all active filters.
\end{itemize}
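Since analytics are aggregated server-side with Pandas, the combined filters could be applied roughly as follows. The column names (\texttt{content}, \texttt{timestamp}, \texttt{source}) assume the unified event model; this is a sketch, not the system's actual implementation.

```python
import pandas as pd


def filter_events(df, keyword=None, start=None, end=None, sources=None):
    """Apply keyword, date-range and source filters simultaneously."""
    mask = pd.Series(True, index=df.index)
    if keyword:
        # Case-insensitive substring match over the event content.
        mask &= df["content"].str.contains(keyword, case=False, na=False)
    if start:
        mask &= df["timestamp"] >= pd.Timestamp(start)
    if end:
        mask &= df["timestamp"] <= pd.Timestamp(end)
    if sources:
        mask &= df["source"].isin(sources)
    return df[mask]
```

Building one boolean mask per filter and AND-ing them together is what allows the filters to compose freely, satisfying the "applied simultaneously" requirement.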
\paragraph{Ethnographic Analysis}
\begin{itemize}
\item The system shall provide endpoints for structural analysis, temporal analysis, linguistic analysis, and emotional analysis.
\item The system shall allow users to define custom topics for topic modelling and analysis.
\item The system shall return outputs that are suitable for visualisation in the frontend.
\end{itemize}
\paragraph{Temporal Analysis}
\begin{itemize}
\item The system shall compute event frequency per day.
\item The system shall generate weekday--hour heatmap data representing activity distribution.
\end{itemize}
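The weekday--hour heatmap data could be computed along these lines, assuming ISO-8601 timestamps from the unified event model (an illustrative sketch only):

```python
from collections import Counter
from datetime import datetime


def weekday_hour_heatmap(timestamps) -> Counter:
    """Bucket ISO-8601 timestamps into (weekday, hour) cells for the heatmap."""
    cells = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        cells[(dt.strftime("%A"), dt.hour)] += 1
    return cells
```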
\paragraph{Linguistic Analysis}
\begin{itemize}
\item The system shall compute word frequency statistics excluding standard and domain-specific stopwords.
\item The system shall extract common bi-grams and tri-grams from textual content.
\item The system shall compute lexical diversity metrics for the dataset.
\end{itemize}
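These linguistic metrics could be computed along the following lines. The stopword list shown is abridged for illustration; the system would combine standard and domain-specific stopwords.

```python
import re
from collections import Counter

# Abridged stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}


def tokenise(text: str) -> list:
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def ngrams(tokens: list, n: int) -> Counter:
    """Count n-grams (bi-grams for n=2, tri-grams for n=3)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def lexical_diversity(tokens: list) -> float:
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```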
\paragraph{Emotional Analysis}
\begin{itemize}
\item The system shall compute average emotional distribution per topic.
\item The system shall compute overall average emotional distribution across the dataset.
\item The system shall determine dominant emotion distributions.
\item The system shall compute emotional distribution grouped by data source.
\end{itemize}
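Averaging emotional distributions per topic reduces to a grouped mean over the per-event scores. The sketch below assumes each event carries a \texttt{topic} label and an \texttt{emotions} dictionary of scores, as produced by the NLP enrichment step; field names are illustrative.

```python
from collections import defaultdict


def average_emotions(events) -> dict:
    """Average per-event emotion scores grouped by topic."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for event in events:
        counts[event["topic"]] += 1
        for emotion, score in event["emotions"].items():
            sums[event["topic"]][emotion] += score
    # Divide each accumulated score by the number of events in its topic.
    return {
        topic: {emotion: total / counts[topic] for emotion, total in emotion_sums.items()}
        for topic, emotion_sums in sums.items()
    }
```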
\paragraph{User Analysis}
\begin{itemize}
\item The system shall identify top users based on activity.
\item The system shall compute per-user activity and behavioural metrics.
\end{itemize}
\paragraph{Interaction Analysis}
\begin{itemize}
\item The system shall compute average conversation thread depth.
\item The system shall identify top interaction pairs between users.
\item The system shall generate an interaction graph based on user relationships.
\item The system shall compute conversation concentration metrics.
\end{itemize}
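Thread depth and interaction pairs both fall out of the unified event model's \texttt{parent\_id}/\texttt{reply\_to} references. A minimal sketch, assuming events are dictionaries with those fields:

```python
from collections import defaultdict


def thread_depths(events) -> dict:
    """Depth of each event in its conversation tree (posts are depth 0)."""
    by_id = {e["id"]: e for e in events}

    def depth(e):
        # A comment's immediate antecedent is the comment it replies to,
        # or, failing that, its parent post.
        ref = e.get("reply_to") or e.get("parent_id")
        return 0 if ref is None else 1 + depth(by_id[ref])

    return {e["id"]: depth(e) for e in events}


def interaction_pairs(events) -> dict:
    """Count directed (replier -> replied-to author) pairs for the interaction graph."""
    by_id = {e["id"]: e for e in events}
    pairs = defaultdict(int)
    for e in events:
        ref = e.get("reply_to") or e.get("parent_id")
        if ref in by_id:
            pairs[(e["author"], by_id[ref]["author"])] += 1
    return pairs
```

The directed pair counts can be fed straight into a graph library as weighted edges for the interaction graph.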
\paragraph{Cultural Analysis}
\begin{itemize}
\item The system shall identify identity-related linguistic markers.
\item The system shall detect stance-related linguistic markers.
\item The system shall compute average emotional expression per detected entity.
\end{itemize}
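Marker detection can be sketched as simple lexicon matching over tokens. The lexicons below are small, hypothetical examples; the system would use curated identity and stance marker lists.

```python
import re

# Hypothetical, abridged lexicons for illustration only.
IDENTITY_MARKERS = {"we", "us", "our", "corkonian"}
STANCE_MARKERS = {"obviously", "clearly", "honestly", "surely"}


def marker_count(text: str, markers: set) -> int:
    """Count how many tokens in the text belong to a marker lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in markers for token in tokens)
```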
\paragraph{Frontend}
\begin{itemize}
\item The system shall provide a frontend UI to accommodate all of the above functions.
\item The system shall provide a tab for each endpoint in the frontend.
\item The system shall provide a simple, user-friendly interface for uploading and viewing analytics and visualisations.
\end{itemize}
\subsubsection{Non-Functional Requirements}
\item NLP models shall be cached to prevent redundant loading.
\end{itemize}
\paragraph{Reliability and Robustness}
\begin{itemize}
\item The system shall implement structured exception handling.
\item The system shall return meaningful JSON error responses for invalid requests.
\item The dataset reset functionality shall preserve data integrity.
\end{itemize}
\newpage
\section{Design}
\subsection{System Architecture}
The system will follow a client-server architecture, with a Flask-based backend API and a React-based frontend interface. The backend will handle data processing, NLP analysis, and database interactions, while the frontend will provide an interactive user interface for data exploration and visualisation.
The reasoning behind this architecture is that it allows the analytics to be aggregated and computed on the server side using Pandas, which is much faster than doing it on the client frontend. The frontend will focus on rendering and visualising the data.
\subsubsection{API Design}
The Flask backend will expose a RESTful API with endpoints for dataset management, authentication and user management, and analytical queries. Flask will call on backend components for data parsing, normalisation, NLP processing and database interfacing.
Flask was chosen for its simplicity and speed of development. It also has many extensions that can be used for authentication (Flask-Bcrypt, Flask-Login).
The API is divided into three groups: \textbf{authentication}, \textbf{dataset management} and \textbf{analysis}.
\subsubsection{React Frontend}
React was chosen for the frontend due to its massive library of pre-built components with efficient rendering capabilities and ability to display many different types of data. The frontend will be structured around a tabbed interface, with each tab corresponding to a different analytical endpoint (e.g., temporal analysis, linguistic analysis, emotional analysis). Each tab will fetch data from the backend API and render it using appropriate visualisation libraries (react-wordcloud for word clouds, react-chartjs-2 for charts, etc.). The frontend will also include controls for filtering the dataset based on keywords, date ranges, and data sources.
\subsection{Data Pipeline}
As this project is focused on the collection and analysis of online community data, the primary component that must be well-designed is the data pipeline, which encompasses the processes of data ingestion, normalisation, enrichment, storage, and retrieval for analysis.
A unified data model is used to represent all incoming data, regardless of its original source or structure. This ensures that the same pipeline works across YouTube, Reddit and Boards.ie data, and can be easily extended to new sources in the future.
\begin{figure}
\centering
\item\textbf{Automated Fetching}: Users can trigger the system to automatically fetch data from supported social media platforms using specified keywords or filters.
\end{itemize}
Originally, only file upload was supported, but the goal of the platform is to aid researchers with ethnographic analysis, and many researchers will not have the technical expertise to fetch data from social media APIs or scrape websites. Therefore, the system was designed to support automated fetching of data from social media platforms, which allows users to easily obtain datasets without needing to manually collect and format data themselves.
In addition to social media posts, users can upload a list of topics that they want to track in the dataset. Custom topic lists allow the system to generate custom topic analysis based on user-defined topics, which can be more relevant and insightful for specific research questions. For example, a researcher studying discussions around local politics in Cork might upload a list of political parties, politicians, and policy issues as topics to track.
Below is a snippet of what a custom topic list might look like in \texttt{.json} format:
\begin{Verbatim}[breaklines=true]
{
"Public Transport": "buses, bus routes, bus eireann, public transport, late buses, bus delays, trains, commuting without a car, transport infrastructure in Cork",
"Parking": "parking spaces, parking fines, clamping, pay parking, parking permits, finding parking in the city",
"Cycling": "cycling in Cork, bike lanes, cyclists, cycle safety, bikes on roads, cycling infrastructure"
}
\end{Verbatim}
If a custom topic list is not provided by the user, the system will use a pre-defined generalised topic list that is designed to capture common themes across a wide range of online communities.
Each method of ingestion will format the raw data into a standardised structure, where each post will be represented as a "Post" object and each comment will be represented as a "Comment" object.
\subsubsection{Data Normalisation}
After a dataset is ingested, the system will normalise all posts and nested comments into a single unified "event" data model. This means that both posts and comments will be represented as the same type of object, with a common set of fields that capture the relevant information for analysis. The fields in this unified data model will include:
\begin{itemize}
\item\texttt{id} — a unique identifier for the post or comment.
\item\texttt{content} — the text content of the post or comment.
\item\texttt{author} — the username of the content creator.
\item\texttt{timestamp} — the date and time when the content was created.
\item\texttt{source} — the original platform from which the content was retrieved (e.g., Reddit, YouTube, Boards.ie).
\item\texttt{type} — a field indicating whether the event is a "post" or a "comment".
\item\texttt{parent\_id} — for comments, this field will reference the original id of the post it's commenting on.
\item\texttt{reply\_to} — for comments, this field will reference the original id of the comment it's replying to. If the comment is a direct reply to a post, this field will be null.
\end{itemize}
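The unified event model above can be sketched as a simple dataclass, together with a hypothetical mapping from a raw Reddit comment (the exact field names in a real API response may differ from those assumed here):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Event:
    """Unified representation of a post or comment, mirroring the fields above."""
    id: str
    content: str
    author: str
    timestamp: str              # ISO-8601 string
    source: str                 # e.g. "reddit", "youtube", "boards.ie"
    type: str                   # "post" or "comment"
    parent_id: Optional[str] = None
    reply_to: Optional[str] = None


def normalise_reddit_comment(raw: dict, post_id: str) -> Event:
    # Hypothetical mapping from a raw Reddit comment dict; real responses
    # may use different field names and shapes.
    return Event(
        id=raw["id"],
        content=raw["body"],
        author=raw["author"],
        timestamp=datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc).isoformat(),
        source="reddit",
        type="comment",
        parent_id=post_id,
        reply_to=raw.get("reply_to"),
    )
```

One such normaliser per source keeps all platform-specific knowledge at the ingestion boundary, so the rest of the pipeline only ever sees \texttt{Event} objects.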
The decision to normalise posts and comments into a single "event" data model allows the same analytical functions to be applied uniformly across all content, regardless of whether it was originally a post or a comment. This simplifies the data model and reduces the complexity of the analytical pipeline, since there is no need to maintain separate processing paths for posts and comments.
\subsubsection{Data Storage}
The enriched dataset is stored in a PostgreSQL database, with a schema similar to the unified data model defined in the normalisation section, with additional fields for the derived data, NLP outputs, and user ownership. Each dataset is associated with a specific user account, and the system supports multiple datasets per user.
The stored dataset can then be retrieved through the Flask API endpoints for analysis. The API supports filtering by keywords and date ranges, as well as grouping and aggregation for various analytical outputs.
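The kind of keyword and date-range filtering the API applies can be sketched in plain Python (a simplified illustration; the real system filters in SQL, and the function name here is hypothetical):

```python
from datetime import datetime

def filter_events(events, keyword=None, start=None, end=None):
    """Filter unified events by a keyword and an inclusive date range."""
    out = []
    for e in events:
        ts = datetime.fromisoformat(e["timestamp"])
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        if keyword is not None and keyword.lower() not in e["content"].lower():
            continue
        out.append(e)
    return out

events = [
    {"content": "Cork city traffic again", "timestamp": "2024-03-01T09:00:00"},
    {"content": "Best chipper in town?", "timestamp": "2024-03-05T12:00:00"},
]
print(filter_events(events, keyword="traffic"))
```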
\subsubsection{Emotional Classification}
Emotional Classification will be the bedrock of the ethnographic analysis, as it provides insight into the emotions of a community and how they relate to different topics and users. As mentioned in the feasibility analysis, the outputs of the emotion classification model should be interpreted as indicative patterns rather than definitive representations of user meaning, due to the limitations of NLP models.
Simple VADER-based models are usually too simplistic for the type of text data being analysed. Classifying posts into positive, negative, and neutral categories is not nuanced enough to truly capture the emotional tone of a community. Therefore, the system will use a more complex model that can classify text into a wider range of emotions, which will allow for richer emotional analysis of the community.
\subsubsection{Topic Classification}
Topic classification will allow the system to classify specific posts into specific topics, which can be used to understand what a community is talking about, and in conjunction with emotional classification, how they feel about these topics as well. The system will support both a generalised topic classification model that can classify posts into a set of pre-defined general topics, as well as a custom topic classification model that can classify posts into user-defined topics based on a list of topics and descriptions provided by the user.
Initially, the system generated a topic list by extracting common keywords and phrases from the dataset. However, this approach was noisy: topics were often singular random words with no overlap between them, making topic classification less insightful. Therefore, custom user-provided or pre-defined topic lists will be used instead.
\subsubsection{Named Entity Recognition}
Named Entity Recognition allows the system to identify specific entities mentioned in the text, such as people, places, and organisations. In combination with emotional classification, NER lets us see the general sentiment around specific places and people in a community, which can be very insightful for ethnographic analysis. For example, in a Cork-specific dataset, we might see that the city centre is often mentioned with negative emotions due to traffic and parking issues, while local parks are mentioned with positive emotions.
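Combining the two outputs is a small aggregation step. The sketch below assumes each enriched event carries an `entities` list and an `emotions` score dict from the NLP stage (field names illustrative):

```python
from collections import defaultdict

def emotion_by_entity(events):
    """Average each emotion score across all events mentioning an entity."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for e in events:
        for entity in e["entities"]:
            counts[entity] += 1
            for emotion, score in e["emotions"].items():
                totals[entity][emotion] += score
    return {
        entity: {em: s / counts[entity] for em, s in scores.items()}
        for entity, scores in totals.items()
    }

events = [
    {"entities": ["city centre"], "emotions": {"anger": 0.7, "joy": 0.1}},
    {"entities": ["city centre"], "emotions": {"anger": 0.3, "joy": 0.5}},
    {"entities": ["Fitzgerald Park"], "emotions": {"anger": 0.1, "joy": 0.8}},
]
print(emotion_by_entity(events))
```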
\subsection{Ethnographic Analysis}
The main goal of this project is to provide a tool that can assist researchers with ethnographic analysis of online communities. Therefore, ethnographic analysis will be a core component of the system.
The system is designed to support multiple types of analysis, such as:
\begin{itemize}
\item\textbf{Temporal Analysis}: how discussion and emotional tone change over time.
\item\textbf{Linguistic Analysis}: the language, words, and phrases commonly used in a community.
\item\textbf{User Analysis}: the behaviour and activity of individual users.
\item\textbf{Interactional Analysis}: the interactions between users, such as who replies to whom.
\item\textbf{Emotional Analysis}: the emotional tone of a community and how it varies across topics and users.
\item\textbf{Cultural Analysis}: the cultural markers and identity signals that are present in a community, such as slang, memes, and recurring references.
\end{itemize}
Each type of analysis is available at its own API endpoint for any given dataset, and the frontend is designed to allow users to easily switch between them and explore the data from different angles.
For types of analysis that involve inspecting the content of the posts themselves, the content will be split into tokens and stop words will be stripped from them, as these common words would not provide meaningful insight for the analysis.
\subsubsection{Temporal Analysis}
Temporal analysis allows researchers to understand what a community is talking about over time, and how the emotional tone of the community changes over time. For example, a researcher might want to see how discussions around a specific topic evolve over time, or how the emotional tone of a community changes in response to external events.
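The simplest temporal metric is event volume per time bucket. A minimal sketch, assuming the ISO-8601 timestamps of the unified schema:

```python
from collections import Counter

def events_per_day(events):
    """Count events per calendar day by slicing the ISO timestamp's date part."""
    return Counter(e["timestamp"][:10] for e in events)

events = [
    {"timestamp": "2024-03-01T09:00:00Z"},
    {"timestamp": "2024-03-01T18:30:00Z"},
    {"timestamp": "2024-03-02T10:00:00Z"},
]
print(events_per_day(events))  # → Counter({'2024-03-01': 2, '2024-03-02': 1})
```

The same grouping, keyed additionally by topic or emotion, yields the topic-over-time and tone-over-time views.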
\subsubsection{Linguistic Analysis}
Linguistic analysis allows researchers to understand the language and words used in a community.
In this system, linguistic analysis will include:
\begin{itemize}
\item Word frequency statistics excluding standard and domain-specific stopwords.
\item Common bi-grams and tri-grams from textual content.\cite{mungalpara2022stemming}
\item Lexical diversity metrics for the dataset.
\end{itemize}
The word frequencies and n-gram metrics were chosen because they can provide insights into the language and phrases used commonly in an online community, which is important for ethnographic analysis and understanding a community fully. Lexical diversity metrics such as the total number of unique tokens versus the total number of tokens can show if a specific culture often repeats phrases (like memes, slang etc.) or if they often have structured, serious discussion without repeating themselves.
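The unique-versus-total ratio just described is the type-token ratio, computed as follows (sample sentences are invented):

```python
def type_token_ratio(tokens):
    """Lexical diversity as unique tokens over total tokens (0 to 1)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# A community that repeats the same phrases scores lower than one
# with varied vocabulary.
repetitive = "up the rebels up the rebels up the rebels".split()
varied = "the council approved new cycling lanes along the quays".split()
print(type_token_ratio(repetitive), type_token_ratio(varied))
```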
Defining a list of stop words is essential for linguistic analysis, as it filters out common words that would not be useful for the analysis. Stop word lists can be provided by a Python library such as NLTK. In addition to standard stop words, the system also excludes link tokens such as "www", "http", and "https" from the word frequency analysis, as social media users often include links in their posts and comments, and these tokens can become common enough to skew the word frequency results without adding meaningful insight.
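The combined filtering can be sketched as follows. The tiny inline stop word set stands in for NLTK's full list, and the sample texts are invented:

```python
from collections import Counter
import re

# Small illustrative stop word list; the real system draws on NLTK's.
STOP_WORDS = {"the", "a", "an", "is", "in", "it", "and", "to", "of", "for", "at"}
LINK_TOKENS = {"www", "http", "https"}

def word_frequencies(texts):
    """Count tokens across texts, dropping stop words and link tokens."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in STOP_WORDS or token in LINK_TOKENS:
                continue
            counts[token] += 1
    return counts

texts = [
    "The parking in town is a nightmare",
    "More at https www example for the parking map",
]
print(word_frequencies(texts).most_common(2))
```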
\subsubsection{User Analysis}
User analysis allows researchers to understand the behaviour and activity of individual users within a community. For example, a researcher might want to see who the most active users are in a community, or how different users contribute to the overall emotional tone of the community.
Identifying top users allows us to see the most active and prolific posters in a community. These are often site-specific bots that comment on every post, or deleted users, which often show up simply as "[Deleted User]" and can aggregate together in statistics. An example is a moderator bot on Reddit, seen below.
While it's impossible to filter out all of these bots, deleted users can simply be filtered out using an exclusion list.
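An exclusion-list filter is a one-liner on top of an author count. The exact entries in the list below are illustrative:

```python
from collections import Counter

# Authors excluded from top-user statistics; "[Deleted User]" is how
# deleted accounts commonly surface in the collected data.
EXCLUDED_AUTHORS = {"[Deleted User]", "[deleted]"}

def top_users(events, n=3):
    """Most prolific authors, with excluded accounts filtered out."""
    counts = Counter(
        e["author"] for e in events if e["author"] not in EXCLUDED_AUTHORS
    )
    return counts.most_common(n)

events = [
    {"author": "corkonian"}, {"author": "corkonian"},
    {"author": "[Deleted User]"}, {"author": "visitor"},
]
print(top_users(events))  # → [('corkonian', 2), ('visitor', 1)]
```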
\subsubsection{Interactional Analysis}
Instead of per-user analysis, interactional analysis looks at the interactions between users, such as who replies to whom and who is contributing the most to the conversations.
In this system, interactional analysis will include:
\begin{itemize}
\item An interaction graph of who replies to whom.
\item Conversation concentration metrics such as who is contributing the most to the conversations and how much of the conversation is dominated by a small number of users.
\end{itemize}
For simplicity, an interaction is defined as a reply from one user to another, which can be either a comment replying to a post or a comment replying to another comment. The system will not attempt to capture more complex interactions such as mentions or indirect references between users, as these would require more advanced NLP techniques. Unfortunately, \texttt{Boards.ie} does not have a reply structure beyond mentions in linear threads, so interactional analysis is limited for that source.
\textbf{Average reply chain depth} was considered as a metric; however, forum-based social media sites such as \texttt{Boards.ie} do not have a way to reply to comments in the way that Reddit does, so the concept of "reply chains" does not apply cleanly to them. One possible solution is to infer reply relationships from explicit user mentions embedded in the content of a post, but this is not a reliable method.
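Under this definition of an interaction, edges for the interaction graph can be extracted directly from the unified schema (function name illustrative):

```python
from collections import Counter

def interaction_edges(events):
    """Count (replier, replied_to) author pairs from the unified events.

    A comment's target is the comment it replies to (`reply_to`) or,
    failing that, the post it belongs to (`parent_id`).
    """
    authors = {e["id"]: e["author"] for e in events}
    edges = Counter()
    for e in events:
        if e["type"] != "comment":
            continue
        target_id = e["reply_to"] or e["parent_id"]
        target = authors.get(target_id)
        if target is not None:
            edges[(e["author"], target)] += 1
    return edges

events = [
    {"id": "p1", "author": "alice", "type": "post", "reply_to": None, "parent_id": None},
    {"id": "c1", "author": "bob", "type": "comment", "reply_to": None, "parent_id": "p1"},
    {"id": "c2", "author": "alice", "type": "comment", "reply_to": "c1", "parent_id": "p1"},
]
print(interaction_edges(events))
```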
\subsubsection{Emotional Analysis}
Emotional analysis allows researchers to understand the emotional tone of a community, and how it varies across different topics and users.
In this system, emotional analysis will include:
\begin{itemize}
\item Average emotion by topic.
\item Overall average emotional distribution across the dataset.
\item Dominant emotion distributions for each event.
\item Average emotion by data source.
\end{itemize}
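The dataset-level average distribution is a straightforward mean over the per-event NLP scores; field names in this sketch are assumptions:

```python
def average_emotions(events):
    """Mean score per emotion across every event in the dataset.

    Assumes each event carries an `emotions` score dict from the NLP stage.
    """
    if not events:
        return {}
    totals = {}
    for e in events:
        for emotion, score in e["emotions"].items():
            totals[emotion] = totals.get(emotion, 0.0) + score
    return {em: total / len(events) for em, total in totals.items()}

events = [
    {"emotions": {"joy": 0.8, "anger": 0.1}},
    {"emotions": {"joy": 0.2, "anger": 0.5}},
]
print(average_emotions(events))
```

Grouping the same computation by topic or by \texttt{source} yields the other metrics in the list above.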
It is emphasised that emotional analysis is inaccurate on an individual post level, as the models cannot fully capture the nuance of human interaction and slang. Warnings will be presented to the user in the frontend that AI outputs can possibly be misleading on an individual scale, and accuracy only increases with more posts. NLP limitations are discussed further in Section \ref{sec:nlp_limitations}.
In an ideal world, the models are accurate enough to capture general emotions on a macro-scale.
\subsubsection{Cultural Analysis}
Cultural analysis allows researchers to understand the cultural markers and identity signals that are present in a community, such as slang, memes, and recurring references. While some of this is covered in the linguistic analysis, cultural analysis will focus more on the identity and stance-related markers that are present in the language of the community.
\subsection{Frontend Design}
The primary audience for this tool is researchers and social scientists, not software developers. Therefore the frontend needs to feel approachable and easy to use for non-technical users. At the same time it must support multi-dataset workflows and handle long-running background processes.
React was chosen as the UI framework primarily for its component-based architecture and the large ecosystem of pre-built visualisation components available for it. There are many different types of data being visualised in this system, such as word clouds, bar charts, line charts, heatmaps, and network graphs, and the React ecosystem offers pre-built components for all of these.
\subsubsection{Structure}
A persistent layout shell will wrap every page of the frontend, providing a consistent header for navigation and account management. This will also store login state and user information in a global way, such that no component has to manage authentication state on its own. The main content area will be reserved for the dataset management and analysis interface.
The visual design of the frontend will be clean and minimalistic.
\subsection{Automatic Data Collection}
Originally, the system was designed to only support manual dataset uploads, where users would collect their own data from social media platforms and format it into the required \texttt{.jsonl} format.
However, this approach is time-consuming, and since this system is designed to aid researchers rather than burden them, the system includes functionality to automatically fetch data from social media platforms. This allows users to easily obtain datasets without needing to manually collect and format data themselves, which is especially beneficial for researchers who may not have technical expertise in data analytics or programming.
The initial system will contain connectors for:
\begin{itemize}
\item Reddit
\item YouTube
\item Boards.ie
\end{itemize}
The connector registry is designed so that any new connector implementing \texttt{BaseConnector} is automatically discovered and registered at runtime, without requiring changes to any existing code. This allows for a modular and extensible architecture where new data sources can be integrated with minimal effort.
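One common way to implement such self-registration is Python's \texttt{\_\_init\_\_subclass\_\_} hook; the sketch below is illustrative, and the class and method names are not necessarily the system's actual API:

```python
# Registry populated automatically as connector classes are defined,
# so importing a new connector module is enough to register it.
CONNECTOR_REGISTRY = {}

class BaseConnector:
    source = "base"

    def __init_subclass__(cls, **kwargs):
        # Runs whenever a subclass of BaseConnector is defined.
        super().__init_subclass__(**kwargs)
        CONNECTOR_REGISTRY[cls.source] = cls

    def fetch(self, query):
        raise NotImplementedError

class RedditConnector(BaseConnector):
    source = "reddit"

    def fetch(self, query):
        # A real implementation would call the Reddit API here.
        return [{"source": self.source, "content": f"results for {query}"}]

connector = CONNECTOR_REGISTRY["reddit"]()
print(connector.fetch("cork"))
```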
\subsection{Asynchronous Processing}
The usage of NLP models for tasks such as sentiment analysis, topic classification, and entity recognition can be computationally intensive, especially for large datasets. In addition, fetching large datasets from sites like Reddit and YouTube takes a lot of time, due to the sequential nature of data fetching and severe rate limits on even authenticated Reddit accounts. To prevent the Flask API from blocking while these tasks are being processed, an asynchronous processing queue will be implemented using \textbf{Redis} and \textbf{Celery}.
\subsubsection{Dataset Enrichment}
A non-normalised dataset will be passed into Celery along with the dataset id and the user id of the dataset owner. At this point, the program is running separately to the main Flask thread. The program then calls on the \textbf{Normalisation \& Enrichment Module} to:
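Condensed, the flow driven by the worker looks roughly like the following. In the real system this body runs inside a Celery task; here the helper names, raw-record fields, and stub NLP output are illustrative stand-ins:

```python
def normalise(raw_event):
    """Map a raw, source-specific record into the unified event schema."""
    return {
        "id": raw_event["id"],
        "content": raw_event.get("body", ""),
        "type": raw_event.get("kind", "post"),
    }

def run_nlp(event):
    """Stub for the NLP stage: attach derived fields to the event."""
    event["emotions"] = {"neutral": 1.0}  # placeholder model output
    return event

def enrich_dataset(raw_events, dataset_id, user_id, store):
    """Normalise and enrich every event, then hand each to storage."""
    for raw in raw_events:
        event = run_nlp(normalise(raw))
        store(dataset_id, user_id, event)

stored = []
enrich_dataset(
    [{"id": "p1", "body": "hello", "kind": "post"}],
    dataset_id=1,
    user_id=42,
    store=lambda d, u, e: stored.append((d, u, e)),
)
print(stored)
```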
\subsubsection{Database vs On-Disk Storage}
Originally, the system was designed to store \texttt{json} datasets on disk and load them into memory for processing. This was simple and time-efficient for early development and testing. However, as the functionality of the system expanded, it became clear that a more persistent and scalable storage solution was needed.
Storing datasets in a database allows for more efficient querying, filtering, and updating of data without needing to reload entire datasets into memory. However, the primary benefit of using a database is support for \textbf{multiple users and multiple datasets per user}.
An additional benefit of using a database was that it allowed the NLP processing to be done once, with the NLP results stored alongside the original data in the database. This meant that the system could avoid redundant NLP processing on the same data, which was a significant performance improvement.
\texttt{PostgreSQL} was chosen as the database solution due to its robustness, support for complex queries, and compatibility with Python through \texttt{psycopg2}. PostgreSQL's support for JSONB fields allows for storage of unstructured NLP outputs, which alternatives like SQLite do not support.
\subsubsection{Unified Data Model vs Split Data Model}
The choice between a \textbf{Unified Data Model} and a \textbf{Split Data Model} led to several design changes in the early API.
A unified data model means both posts and comments are stored in the same data object, with a field to differentiate between the two. A split data model means posts and comments are stored in separate tables, with a foreign key relationship between them.
\paragraph{The Case for a Unified Data Model}
\begin{itemize}
\item\textbf{Simpler Schema}: One \texttt{events} table rather than split comments and posts tables.
\item\textbf{Simpler Pipeline}: The same pipeline works for both posts and comments.
\item\textbf{Differentiation Possible}: Through the \texttt{type} column, we can still differentiate between a post and a comment, though more awkwardly.
\end{itemize}
A unified data model does, however, simplify away some distinctions in the content; for example, a post title is very different from comment content. Reply chains must be reconstructed using the \texttt{reply\_to} and \texttt{parent\_id} fields, and some fields, like \texttt{reply\_to}, will be null depending on the data source. For example, \texttt{Boards.ie} does not support nested replies.
\paragraph{The Case for a Split Data Model}
\begin{itemize}
\item\textbf{Accurate Reply Relationship}: Reply relationships are naturally represented, comments have a foreign key to posts, no reconstruction needed.
\end{itemize}
However, each analytical query would either need to be post- or comment-specific, or require a table merge later in the pipeline. For ethnographic analysis, the distinction between a post and a comment is minimal: from a research point of view, both are just a user saying something at a point in time, and treating them uniformly reflects that.
The decision was made to \textbf{keep the unified data model}, since its downsides could be mitigated: reply chains can be reconstructed from the \texttt{reply\_to} and \texttt{parent\_id} fields, and a post can still be distinguished from a comment through the \texttt{type} field wherever that distinction matters (reply chains, interaction graphs).
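The reconstruction mentioned above amounts to following \texttt{reply\_to} (falling back to \texttt{parent\_id}) until a post is reached; a minimal sketch, with invented sample events:

```python
def reply_depth(event_id, events_by_id):
    """Depth of an event in its reply chain: a post is 0, a direct
    reply to a post is 1, a reply to that reply is 2, and so on."""
    depth = 0
    current = events_by_id[event_id]
    while current["type"] == "comment":
        depth += 1
        # Follow reply_to when present, otherwise fall back to the post.
        parent_id = current["reply_to"] or current["parent_id"]
        current = events_by_id[parent_id]
    return depth

events_by_id = {
    "p1": {"type": "post", "reply_to": None, "parent_id": None},
    "c1": {"type": "comment", "reply_to": None, "parent_id": "p1"},
    "c2": {"type": "comment", "reply_to": "c1", "parent_id": "p1"},
}
print(reply_depth("c2", events_by_id))  # → 2
```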
\subsection{Deployment}
Docker Compose is used to containerise the entire application.
During development, the source code for the backend and frontend will be mounted as volumes within the containers to allow live code updates, which speeds up the development process.
Environment variables, such as database credentials and social media API keys, will be managed through an \texttt{.env} file that is passed into the Docker containers through \texttt{docker-compose.yaml}.
\newpage
In this chapter, the details of how this was implemented will be discussed.
\subsection{Overview}
In the initial stages, the project was a small Python script that would fetch data from Reddit and aggregate simple statistics such as number of posts and a number of comments. Some early features like search and subreddit specific searches were added through hard-coded variables. The Reddit Connector code was extracted into its own \texttt{RedditConnector} module, though the connector abstraction had not yet been formalised.
As this was going to be a web-based tool, the Flask server was then set up. A rudimentary sentiment analysis endpoint was added as an initial test using the VADER Sentiment Python module. An endpoint to fetch from Reddit was added but temporarily shelved. Eventually more analysis endpoints were added, creating the many different analytical perspectives that are available in the final system, such as linguistic analysis and user analysis.
At this stage, datasets were simply files stored on the machine and loaded into memory globally, which made early development and testing easier; as the project progressed, the database was added to allow multiple datasets and users. Long-standing issues, such as the blocking nature of NLP and data fetching, were fixed through the addition of Redis and Celery for asynchronous processing, and multi-user support was completed through user accounts, with authentication and dataset ownership endpoints.
The project was developed using the following tools and libraries:
\item\textbf{react-chartjs-2} and \textbf{react-wordcloud} for data visualisation in the frontend.
\end{itemize}
The project was developed using Git for version control, with a branching strategy that included feature branches for new functionality and a main branch for stable code. Regular commits were made to document the development process, and conventional commit messages were used to indicate the type of changes made. Occasionally, text bodies were included in commit messages to justify design decisions or to explain changes that couldn't easily be understood from the diff alone.
\subsection{Social Media Connectors}
The first connectors implemented were the Reddit and Boards.ie connectors, as these were the original data sources for the Cork dataset. The YouTube connector was added later to improve the diversity of data sources. The connectors fetch a fixed number of the newest posts from their specified source rather than the top posts of all time: top posts tend to be dominated by memes and jokes that would skew the dataset and are not relevant for ethnographic analysis, and because the most popular posts are often years old, they would also distort temporal analysis of the community's current state. Fetching the newest posts limits long-term temporal analysis, but gives a more up-to-date picture of the community.
\subsubsection{Data Transfer Objects}
Data Transfer Objects are simple classes that represent the data structure of a post or comment as it is retrieved from the source platform. They are used to encapsulate the raw data and provide a consistent interface for the rest of the system to interact with, regardless of the source platform.
These are later replaced by the unified "event" data model during the normalisation process, but they are a useful abstraction for the connectors to work with. Two DTOs are defined: \texttt{PostDTO} and \texttt{CommentDTO}, which represent the structure of a post and a comment respectively as they are retrieved from the source platform. The \texttt{PostDTO} will contain a list of \texttt{CommentDTO} objects.
\subsubsection{Reddit Connector}
The initial implementation of the Reddit connector was a simple class that used the \texttt{requests} library to fetch data directly from the Reddit API. The connector follows the Reddit API specification\cite{reddit_api}. It uses the \texttt{reddit.com/r/\{subreddit\}/new} endpoint to fetch the most recent posts from a specified subreddit, and the \texttt{reddit.com/r/\{subreddit\}/\{post\_id\}/comments} endpoint to fetch comments for each post.
Its primary method is \texttt{get\_new\_posts\_by\_search}, which takes a search query, a subreddit, and a post limit, and returns a list of \texttt{PostDTO} objects.
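Based on the connector interface described later in this chapter, the signature plausibly looks like the following sketch (the DTO fields shown are illustrative, not the real schema):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class CommentDTO:          # illustrative fields only
    id: str
    author: str
    content: str

@dataclass
class PostDTO:             # illustrative fields only
    id: str
    title: str
    content: str
    comments: list = field(default_factory=list)

class BaseConnector(ABC):
    @abstractmethod
    def get_new_posts_by_search(self, query: str, category: str,
                                limit: int) -> list:
        """Return up to `limit` recent posts matching `query`."""

class DummyConnector(BaseConnector):
    """Stand-in showing how a concrete connector satisfies the interface."""
    def get_new_posts_by_search(self, query, category, limit):
        return [PostDTO(id="1", title=query, content="")][:limit]

sample = DummyConnector().get_new_posts_by_search("cork", "general", 1)
```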
The endpoint returns a maximum of 100 posts per request, so \textbf{pagination} was implemented, which is necessary for Reddit datasets larger than 100 posts. The connector keeps fetching posts until it reaches the specified number of posts, or until there are no more posts available.
The "after" parameter is a post id that tells the API to return posts that come after that post in the subreddit's listing, which enables pagination. The connector keeps track of the last post id fetched and uses it to fetch the next batch of posts until the desired number is reached or there are none left.
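The pagination loop can be sketched as follows. The \texttt{fake\_page} stub stands in for a real request to the \texttt{/new} endpoint, returning one batch of posts plus the next \texttt{after} cursor:

```python
def fetch_all(fetch_page, limit: int) -> list:
    """Accumulate posts batch by batch, passing the last seen id as the
    `after` cursor, until `limit` posts are collected or the listing is
    exhausted (after=None)."""
    posts, after = [], None
    while len(posts) < limit:
        batch, after = fetch_page(after)   # one API call per batch
        posts.extend(batch)
        if after is None:                  # no more pages available
            break
    return posts[:limit]

# Stub standing in for a real call to reddit.com/r/{subreddit}/new:
PAGES = {None: ([{"id": "a"}, {"id": "b"}], "b"),
         "b":  ([{"id": "c"}], None)}

def fake_page(after):
    return PAGES[after]
```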
It became apparent that when unauthenticated, the Reddit API has severe rate limits that make fetching large datasets take hours, therefore the connector was updated to support authentication using Reddit API client credentials, which are provided through environment variables. This was done using the \texttt{requests\_oauthlib} library, which provides a convenient way to handle OAuth2 authentication with the Reddit API. With authentication, the rate limits are increased, allowing for faster data fetching.
\subsubsection{YouTube Connector}
The YouTube connector was the simplest of the three initial connectors, as YouTube provides an official API that is well documented compared to the Reddit API. The Python library \texttt{google-api-python-client} was used to interact with the YouTube Data API; it provides methods such as \texttt{youtube.search().list()} to search for videos by keyword, and \texttt{youtube.commentThreads().list()} to fetch comments for a specific video.
Like the Reddit connector, it implements the \texttt{get\_new\_posts\_by\_search} method, which searches for videos based on a query and then fetches comments for those videos. A limit of 50 results per query is imposed by the YouTube API, so pagination was implemented to allow fetching more than 50 posts.
\subsubsection{Boards.ie Connector}
The Boards.ie connector was the most complex connector to implement, as Boards.ie does not provide an official API for data retrieval, which meant web scraping techniques were utilised to fetch data from the site. The \texttt{requests} library was used to make HTTP requests to the Boards.ie website, and the \texttt{BeautifulSoup} library was used to parse the HTML content and extract the relevant data.
Browser developer tools were used to inspect the HTML structure and find the relevant elements containing the post and comment data. \texttt{BeautifulSoup} was then used to extract the correct data from the \texttt{.Message.userContent} and \texttt{.PageTitle} elements, which contain the content and title of the posts. Each comment was contained in an element with the \texttt{ItemComment} class. These elements were collected and iterated through to build the lists of \texttt{PostDTO} and \texttt{CommentDTO} objects representing the data retrieved from the site.
As not all comments on a thread are on one page, pagination was implemented by looking for the "Next" button on the page and following the link to the next page of comments until there are no more pages left. This allows for fetching of all comments for a given post, even if they span multiple pages.
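A sketch of that "Next"-button-following loop, with canned HTML standing in for live \texttt{requests} responses (the \texttt{.Message.userContent} selector comes from the text; the surrounding markup and URLs are invented for illustration):

```python
from bs4 import BeautifulSoup

# Canned pages standing in for HTTP responses; a real connector would
# fetch each URL with requests.get().
PAGES = {
    "/thread?page=1": '<div class="Message userContent">first</div>'
                      '<a href="/thread?page=2">Next</a>',
    "/thread?page=2": '<div class="Message userContent">second</div>',
}

def scrape_thread(start_url: str) -> list:
    """Collect comment bodies, following the "Next" link until it disappears."""
    comments, url = [], start_url
    while url:
        soup = BeautifulSoup(PAGES[url], "html.parser")
        comments += [m.get_text() for m in soup.select(".Message.userContent")]
        nxt = soup.find("a", string="Next")   # follow the "Next" button, if any
        url = nxt["href"] if nxt else None
    return comments
```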
A \texttt{ThreadPoolExecutor} was used to fetch posts in parallel, which improved the performance of the connector significantly, as fetching posts sequentially was very slow due to the need to fetch comments for each post, often across multiple pages. There were diminishing returns after a certain number of threads, however, possibly due to site blocking or internet connection limits. Initially 20 threads were used, but this was later reduced to 5 threads to avoid potential issues with site blocking and to improve ethical considerations around web scraping.
\subsubsection{Connector Plugin System}
The connector plugin system was implemented to allow for easy addition of new data sources in the future. This requires simply implementing a new connector class and dropping it into the connectors directory, without needing to modify any existing code. This was achieved through the use of Python's \texttt{importlib} library, which allows for dynamic importing of modules at runtime.
To achieve this, the base class \texttt{BaseConnector} was defined, which provides a standard interface for all connectors to implement. Each connector implements the \texttt{get\_new\_posts\_by\_search} method, which takes in a search query, a category (which is the subreddit for Reddit, or the category for Boards.ie), and a limit on the number of posts to fetch. The method returns a list of \texttt{PostDTO} objects that represent the data retrieved from the source platform.
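A minimal sketch of how such \texttt{importlib}-based plugin discovery can work (not the project's actual loader): modules in a directory are imported at runtime and any \texttt{*Connector} class they define is registered by name.

```python
import importlib.util
import pathlib
import tempfile

def load_connectors(directory: pathlib.Path) -> dict:
    """Dynamically import every .py file in `directory` and register any
    class whose name ends in "Connector" (hypothetical convention)."""
    registry = {}
    for path in sorted(directory.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)       # dynamic import at runtime
        for name, obj in vars(module).items():
            if isinstance(obj, type) and name.endswith("Connector"):
                registry[name] = obj
    return registry

# Demo: drop a plugin file into a temporary "connectors" directory.
plugins = pathlib.Path(tempfile.mkdtemp())
(plugins / "demo.py").write_text(
    "class DemoConnector:\n"
    "    def get_new_posts_by_search(self, query, category, limit):\n"
    "        return []\n"
)
registry = load_connectors(plugins)
```

Adding a new source then means adding a file, not editing the loader.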
A low-level \texttt{PostgreConnector} module was implemented to handle the raw SQL interactions with the database.
This module provides a simple interface for executing SQL queries. It's used by higher level modules to interact with the database without needing to worry about the details of database connections and query execution.
\subsubsection{Dataset Manager}
The dataset manager is a higher-level module that provides an interface for managing datasets in the database. It uses the low-level \texttt{PostgreConnector} to execute SQL queries, but provides more specific methods for dataset management, such as creating a new dataset, fetching a dataset by id, and updating dataset metadata. Dependency injection is used to pass an instance of the \texttt{PostgreConnector}.
The \texttt{DatasetManager} class is responsible for all database interactions relating to datasets, and draws a deliberate distinction between two categories of data:
\begin{itemize}
The most important authentication methods implemented are as follows:
\item\texttt{get\_user\_by\_id(user\_id: int) -> None | dict}: Fetches a user's information from the database based on their user ID, returning a dictionary of user details if found or \texttt{None} if no such user exists.
\end{itemize}
Defensive programming is used in the authentication manager to handle edge cases like duplicate usernames or emails. This module is a simple interface that the higher-level Flask API can call for easy management of user authentication and registration.
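The following is a self-contained sketch of the kind of defensive checks performed in \texttt{register\_user()}. The helper names (\texttt{EMAIL\_REGEX}, \texttt{get\_user\_by\_email}, \texttt{get\_user\_by\_username}, \texttt{\_save\_user}) mirror those in the module, but their bodies here — the in-memory user store, the regex, and the hashing step — are illustrative stand-ins, not the real implementation:

```python
import hashlib
import re

# Illustrative email pattern; the real module's regex may differ.
EMAIL_REGEX = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

class AuthManager:
    def __init__(self):
        self._users = {}   # in-memory stand-in for the users table

    def get_user_by_email(self, email):
        return next((u for u in self._users.values() if u["email"] == email), None)

    def get_user_by_username(self, username):
        return self._users.get(username)

    def _save_user(self, username, email, hashed_password):
        self._users[username] = {"email": email, "password": hashed_password}

    def register_user(self, username: str, email: str, password: str) -> None:
        # Defensive checks: reject bad input before touching the database.
        if len(username) <= 3:
            raise ValueError("Username must be longer than 3 characters")
        if not EMAIL_REGEX.match(email):
            raise ValueError("Please enter a valid email address")
        if self.get_user_by_email(email):
            raise ValueError("Email already registered")
        if self.get_user_by_username(username):
            raise ValueError("Username already taken")
        hashed = hashlib.sha256(password.encode()).hexdigest()  # illustrative only
        self._save_user(username, email, hashed)

auth = AuthManager()
auth.register_user("alice_w", "alice@example.com", "hunter22")
```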
\subsection{Data Pipeline}
The data pipeline began with the data connectors mentioned in the previous section, which are responsible for fetching raw data from the source platforms. However they were not initially included as part of the data pipeline, as the initial system was designed to only support manual dataset uploads. The data connectors were used to fetch data for the Cork dataset, which was then uploaded through the API. Once the automatic data fetching functionality was added, the connectors were integrated into the data pipeline.
\subsubsection{Data Enrichment}
The data enrichment process is responsible for taking the raw data retrieved from the connectors and transforming it into a format that is suitable for analysis. This involves several steps, including normalisation, NLP processing, and storage in the database.
Data normalisation was originally intended to be a separate step in the data pipeline, but as it amounts to only a few lines of code, it was folded into the enrichment process. In normalisation, the list of \texttt{Post} objects retrieved from the connectors is flattened into a unified list of "events": a Pandas DataFrame that contains both posts and comments in a single table. The structure of the comment-expansion method is as follows:
\begin{itemize}
\item The method receives a DataFrame \texttt{df} where each row represents a post, and the \texttt{comments} column contains a list of comment dictionaries.
\item The \texttt{comments} column is exploded using \texttt{pandas.DataFrame.explode()}, so that each comment occupies its own row, paired with the \texttt{id} of its parent post.
\item Rows where the comment value is not a dictionary are filtered out, discarding any \texttt{None} or malformed entries that may have resulted from posts with no comments.
\item\texttt{pd.json\_normalize()} is applied to the remaining comment dictionaries, flattening them into a structured DataFrame with one column per field.
\item The original DataFrame is stripped of its \texttt{comments} column to form \texttt{posts\_df}, and a \texttt{type} column is added with the value \texttt{"post"}, along with a \texttt{parent\_id} column set to \texttt{None}, as posts have no parent.
\item The comments DataFrame is similarly tagged with \texttt{type = "comment"}, and its \texttt{parent\_id} is populated from the \texttt{post\_id} field, establishing the relationship back to the originating post.
\item Both DataFrames are concatenated using \texttt{pd.concat()}, and the now-redundant \texttt{post\_id} column is dropped, yielding a single unified events table containing both posts and comments with a consistent schema.
\end{itemize}
The \texttt{enrich()} method is the primary method for dataset enrichment in the module, and provides two main functionalities:
\begin{itemize}
The NLP module is responsible for adding new columns to the dataset that contain the results of each analysis.
\label{sec:emotion-classification}
For emotional classification, a pre-trained VADER sentiment analysis model was initially used, which provides a very simple way to classify text as positive, negative, or neutral. Ethnographic analysis, however, needs a more complex emotional model that can capture more nuance, so the VADER model was later replaced with a fine-tuned transformer-based model that can classify text into a wider range of emotions.
GoEMOTION \cite{demszky2020goemotions} was considered as a potential model for emotional classification, as it is extremely nuanced and can capture a wide range of emotions; however, it has 27 emotion classes, which was too many for the purposes of this project, as it would have been difficult to visualise and analyse such a large number of emotion classes.
A middle ground was found with the "Emotion English DistilRoBERTa-base" model from HuggingFace \cite{hartmann2022emotionenglish}, which is a fine-tuned transformer-based model that can classify text into seven emotion classes: anger, disgust, fear, joy, sadness, neutral and surprise.
As the project progressed and more posts were classified, the "surprise" and "neutral" emotions were found to be dominating the dataset, which made it difficult to analyse the other emotions. This may be because the model is not fine-tuned for internet slang, and usage of exclamation marks and emojis, which are common in social media posts, may be classified as "surprise" or "neutral" rather than the intended emotion. Therefore, the "surprise" and "neutral" emotion classes were removed from the dataset, and the confidence numbers were re-normalised to the remaining five emotions.
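The re-normalisation step itself is simple; a sketch with toy confidence scores (not real model output):

```python
# Toy per-class confidences summing to 1.0, as a classifier would emit.
scores = {"anger": 0.05, "disgust": 0.05, "fear": 0.10, "joy": 0.15,
          "sadness": 0.05, "neutral": 0.35, "surprise": 0.25}

# Drop the two dominating classes, then rescale so the rest sum to 1 again.
kept = {k: v for k, v in scores.items() if k not in ("neutral", "surprise")}
total = sum(kept.values())
renormalised = {k: v / total for k, v in kept.items()}
```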
\subsubsection{Topic Classification}
For topic classification, a zero-shot classification approach was used, which allows for classification of text into arbitrary topic classes without needing to fine-tune a model for each specific set of topics. Initially, attempts were made to automatically generate topic classes based on the most common words in the dataset using TF-IDF, but this led to generic and strange classes that weren't useful for analysis. Therefore, it was decided that a topic list would be provided manually, either by the user or using a generic list of broad common topics.
Initially, the "all-mpnet-base-v2" model \cite{all_mpnet_base_v2} was used as the base model for the zero-shot classification, which is a general-purpose sentence embedding model. While this worked well and produced good results, it was slow to run inference on large datasets, and would often take hours to classify a dataset of over 60,000 posts and comments.
Eventually, the "MiniLM-L6-v2" model \cite{minilm_l6_v2} was chosen as the base model for zero-shot classification, which is a smaller and faster sentence embedding model. While it may not produce results quite as good as the larger model, its output is still good and inference is much faster, which makes it more practical for use in this project.
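One way embedding-based zero-shot classification can work — embed the text and each candidate topic label, then pick the label with the highest cosine similarity — is sketched below. A toy bag-of-words function stands in for the sentence-transformer embeddings so the example runs offline; the real pipeline's mechanics may differ in detail:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real system uses a sentence
    # transformer (e.g. MiniLM) to produce dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot(text: str, topics: list) -> str:
    """Pick the topic label closest to the text in embedding space:
    no fine-tuning needed, so the topic list can change per dataset."""
    doc = embed(text)
    return max(topics, key=lambda t: cosine(doc, embed(t)))

label = zero_shot("the bus was late again", ["bus and transport", "food and drink"])
```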
\subsubsection{Entity Recognition}
At this point, the NLP pipeline was taking a long time to run on large datasets (such as the Cork dataset), therefore any NER (Named Entity Recognition) model that was added needed to be small and fast to run inference on large datasets. The "dslim/bert-base-NER" model from HuggingFace \cite{dslim_bert_base_ner} was chosen as it is a fine-tuned BERT model that can perform named entity recognition, and is relatively small and fast compared to other NER models.
This model outputs a list of entities for each post, and each entity has one of the following types:
\begin{itemize}
The \texttt{StatGen} (Statistics Generator) class is a higher level module that coordinates the generation of all statistics for a dataset.
Initially, all statistics were implemented within this class; however, as the class grew, it was refactored to delegate the different categories of statistics to the separate classes listed in the sections above. The class directly instantiates these analysis classes. Dependency injection of the analysis classes was considered for looser coupling, but since they were split purely for organisational and neatness purposes, the extra decoupling complexity wasn't needed.
Beyond improving the quality of the code, the other main function of this class is to provide a single centralised area to manage statistical filtering. Each statistical method of the class will take in a dictionary of filters as a parameter, then the private method \texttt{\_prepare\_filtered\_df} will apply the filters to the dataset and return the filtered dataset. Four types of filters are supported:
\begin{itemize}
\item\texttt{start\_date}: A date string that filters the dataset to only include events after the specified date.
\item\texttt{end\_date}: A date string that filters the dataset to only include events before the specified date.
\item\texttt{source}: A string that filters the dataset to only include events from the specified source platform.
\item\texttt{search\_query}: A string that filters the dataset to only include events that contain the search query in their content.
\end{itemize}
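The filter application described above can be sketched in pandas as follows. This is a minimal illustration of the idea rather than the actual implementation, and the column names (\texttt{timestamp}, \texttt{source}, \texttt{content}) are assumptions:

```python
import pandas as pd

def prepare_filtered_df(df: pd.DataFrame, filters: dict) -> pd.DataFrame:
    """Apply the four supported filter types to the event DataFrame."""
    out = df
    if filters.get("start_date"):
        out = out[out["timestamp"] >= pd.to_datetime(filters["start_date"])]
    if filters.get("end_date"):
        out = out[out["timestamp"] <= pd.to_datetime(filters["end_date"])]
    if filters.get("source"):
        out = out[out["source"] == filters["source"]]
    if filters.get("search_query"):
        # Case-insensitive substring match; treat missing content as no match
        out = out[out["content"].str.contains(filters["search_query"],
                                              case=False, na=False)]
    return out
```

Centralising this logic means every statistical method receives the same filter semantics for free.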
\end{figure}
\subsubsection{Analysis Page}
The Analysis page fires six API requests in parallel to fetch the six categories of statistics (temporal, linguistic, user, interactional, emotional and cultural), and each category is rendered in a separate section on the page with its own visualisation. The API requests are fired when the page loads, and also whenever the filters are updated. This allows the API calls to be centralised into a single component, such that any change in the filters will automatically update all of the statistics on the page. Applying filters re-fetches all six endpoints with new query parameters.
The majority of statistics are displayed using a custom KPI component that shows the name of the statistic, the value, and a secondary label for other information. An example of this can be seen in Figure \ref{fig:kpi_card}. The statistics that are not displayed as KPIs, such as the temporal analysis line chart and heatmap, will be discussed in the next sections.
\item\textbf{Redis Container}: This container runs Redis. It uses the official Redis image from Docker Hub.
\end{itemize}
To run the application, the user needs to have Docker and Docker Compose installed on their machine. They then need to fill in the necessary environment variables in the \texttt{.env} file, for which there is a template provided as \texttt{.env.example}. The example env file contains defaults for most variables, except for the Reddit and Google API credentials that will need to be sourced. In addition, they will need to set the JWT secret key to a random 128-bit string for security reasons.
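A suitable random 128-bit JWT secret can be generated with Python's standard library, for example:

```python
import secrets

# 16 random bytes = 128 bits, hex-encoded for pasting into the .env file
print(secrets.token_hex(16))
```

The \texttt{secrets} module is preferred over \texttt{random} here because it draws from a cryptographically secure source.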
Once the environment variables are set, the user can run the command \texttt{docker compose up -d} in the root directory of the project, which will build and start all of the containers. The application will then be accessible at \texttt{http://localhost:5173} in the user's web browser.
\subsection{NLP Accuracy}
The accuracy of the NLP models used in the system was evaluated using a small manually annotated dataset. By taking 50 random examples of posts from the Cork dataset and manually annotating their topic and emotion, then comparing these annotations to the model's predictions, the accuracy of the models can be estimated. Keep in mind that this is a small sample size and is tied to a specific dataset, with specific pre-defined topics, so it may not be representative of the overall accuracy of the models across different datasets and topics.
To do this, this command was run on the Docker database container to extract 50 random posts from the Cork dataset:
The emotional classification was notably limited in some regards. The decision described in Section \ref{sec:emotion-classification} to remove the "neutral" and "surprise" emotion classes from the emotional analysis was made after observing that the two classes were dominating the dataset. However, restricting the neutral class led to some posts being misclassified as another emotion which may not have been accurate. For example, take the content of the eleventh post in the output file (Record 11):
\begin{quote}
The model classified this post as "disgust" with a confidence of 0.35 and "anger" with a confidence of 0.38. This is a borderline case, and even two different human annotators could disagree on whether this post is more "disgust" or "anger", so it's understandable that the model would struggle with this. This highlights the limitations of the emotional classification, as emotions can be quite nuanced and subjective, and a model may not always capture the true emotional tone of a post accurately.
A significant reason that the accuracy sat around 60--70\% is the model's inability to represent the multi-dimensional nature of human emotion. Many posts express multiple emotions simultaneously (e.g., frustration mixed with humour), yet the model is constrained to selecting a single dominant class. This leads to misclassification in cases where no single emotion is clearly dominant.
In addition, the temporary exclusion of the “neutral” class forced inherently neutral posts into a specific category, artificially lowering accuracy. Borderline cases between closely related emotions (such as anger and disgust) also contributed to disagreement between manual annotations and model predictions, which shows how subjective emotional expressions can be.
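The effect of excluding classes can be seen in a small sketch. This is an illustration of the renormalise-and-pick-the-maximum idea, not the system's actual code, and the score values are invented:

```python
def dominant_emotion(scores, excluded=("neutral", "surprise")):
    """Drop excluded classes, renormalise the rest, return the top label."""
    kept = {k: v for k, v in scores.items() if k not in excluded}
    total = sum(kept.values())
    normed = {k: v / total for k, v in kept.items()}
    label = max(normed, key=normed.get)
    return label, normed[label]

# An essentially neutral post still receives a forced emotion label:
scores = {"neutral": 0.50, "joy": 0.20, "anger": 0.15,
          "sadness": 0.10, "surprise": 0.05}
```

With "neutral" removed, the half of the probability mass it held is redistributed, so a weak "joy" signal is promoted to the dominant emotion.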
\subsubsection{Topical Classification Discussion}
The topic classification also had some limitations, particularly with posts that contained multiple topics. For example, take the content of the 26th post in the output file:
\begin{quote}
\textit{We're staying in the city centre so walkable to most places. I checked electrics website earlier. Looked nice. Ended up booking Joules for Thursday then for Friday, we will try a new place called "conways yard" that was recommended here. In hoping to watch the England match there so I'd imagine if have to get there well before kick off (8pm) to get a seat bear a TV.}
This post was classified with the topic "Rugby" with a topic confidence of 0.47, which is quite high by most standards. However, this could arguably be classified as "City Center" or even "Pubs" due to the mention of the city centre and the pub "Conway's Yard". This highlights a limitation of the topic classification: it can struggle with posts that contain multiple topics, as it is only able to assign one dominant topic to each post.
To address this, making the topic classification more similar to the emotional classification might be beneficial. That is, instead of just assigning one dominant topic to each post, the model could assign a confidence score for each topic class, which would allow posts to be classified with multiple topics if they have high confidence scores for multiple topics.
In addition, ensuring a well-curated topic list that is specific to the dataset can help improve the accuracy of the topic classification, as it reduces the chances of posts being misclassified into irrelevant topics and reduces possible overlap between topics.
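The suggested multi-topic change can be sketched as a simple threshold over per-topic confidences. This assumes the classifier already returns a score per topic, and the threshold value is illustrative:

```python
def assign_topics(scores, threshold=0.3):
    """Keep every topic whose confidence clears the threshold,
    most confident first; fall back to the single best topic."""
    above = sorted((t for t, s in scores.items() if s >= threshold),
                   key=scores.get, reverse=True)
    return above or [max(scores, key=scores.get)]
```

Under this scheme, the post above could carry both "Rugby" and "Pubs" rather than being forced into a single category.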
Overall this tends to follow a linear trend, with the time taken increasing linearly with the number of posts. As noted above, the number of events the pipeline is processing is likely 10-20x the number of posts, due to comments, so the actual number of events being processed is likely around 1000 for the 100 post benchmark, and around 10,000 for the 1000 post benchmark.
The 1000 posts benchmark for \texttt{Boards.ie} took 312.83s for NLP processing, which is much higher than the other sources. This is likely due to the fact that \texttt{Boards.ie} is a forum site, with long running conversations that can last years, therefore the number of comments per thread is significantly higher than other sources. There is an average of around 900 comments per post in the \texttt{Boards.ie} dataset, compared to ~30 comments per post in the Reddit and YouTube datasets, which explains the significant increase in NLP processing time for the \texttt{Boards.ie} dataset.
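A rough back-of-the-envelope check using the figures above shows the per-event cost is actually small; the volume of comments is what dominates:

```python
# Estimate per-event NLP cost for the Boards.ie 1000-post benchmark
posts = 1000
avg_comments_per_post = 900          # observed average for Boards.ie
events = posts * (1 + avg_comments_per_post)
per_event_ms = 312.83 / events * 1000
print(events, round(per_event_ms, 3))
```

At roughly a third of a millisecond per event, the pipeline itself is fast; the 312.83s total is driven almost entirely by the ~900k events in the thread trees.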
\subsubsection{Auto-fetching Performance}
This section outlines the performance of the auto-fetching feature, which is the process of fetching data from the sources using the connectors. The performance of this feature is measured in terms of the time taken to fetch a certain number of posts from each source. The benchmarks are shown in Table \ref{tab:performance_benchmarks}.
One important thing to note is that the YouTube API does not return more than 500 posts for a given search query, so the 1000 posts benchmark for YouTube is not available.
\begin{table}[!h]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
 & Reddit & \texttt{Boards.ie} & YouTube \\
\hline
10 posts & 3.25s & 103.28s & 2.08s \\
100 posts & 37.46s & 1182.71s & 12.52s \\
1000 posts & 482.87s & 11196.19s & N/A \\
\hline
\end{tabular}
\caption{Performance Benchmarks for the Auto-Fetch Process}
\label{tab:performance_benchmarks}
\end{table}
\texttt{Boards.ie} is by far the slowest, likely due to a combination of two factors: web scraping is simply slower than using an API, as comments have to be fetched page by page, with the connector loading and parsing each page fully; and \texttt{Boards.ie} threads have a significantly higher number of comments per post due to the forum nature of the site. Though the rate of post-fetching from \texttt{Boards.ie} is poor, it scaled linearly with the number of posts.
Reddit was much faster than \texttt{Boards.ie}, likely because it uses an API. It is affected by rate limits, however: during the 1000-post benchmark, the API rate limit was hit once and the connector stalled for exactly 120 seconds. When this pause is taken into account, Reddit also scales linearly with the number of posts.
YouTube was the fastest source, likely due to the fact that it also uses an API and hit no rate limits. However, the YouTube API does not allow fetching more than 500 posts for a given search query. For 500 posts, the time taken was 74.80s. If we extrapolate the time taken for 1000 posts, it would be around 149.60s, which is still much faster than the other sources and scaled linearly with the number of posts.
\subsection{Cork Dataset Findings}
The Cork dataset, described in Section \ref{sec:cork_dataset}, was analysed using the system, and several findings were observed from the analysis.
\subsubsection{Temporal Findings}
Temporal activity patterns show that the most active hours for posting are weekdays between 1pm and 3pm, which suggests that users post during their lunch breaks in the early afternoon. There is also a smaller peak in activity in the evenings around 8pm, which is likely when users are off work. The least active hours are early mornings between 1am and 6am, which is expected as most users are likely asleep during these hours.
\caption{Activity Heatmap for the Cork Dataset, where the X axis represents the hour of day and the Y axis represents the day of the week. Blue areas indicate higher activity, lighter areas indicate lower activity.}
\caption{Bigrams and Trigrams for the Cork Dataset, showing the most common two-word and three-word combinations in the dataset.}
\label{fig:cork_linguistic}
\end{figure}
Figure \ref{fig:cork_linguistic} shows the most common bigrams and trigrams in the Cork dataset. Trigrams like "north main street" and "oliver plunkett street" suggest strong local geographic identity in the community, a shared physical space. Bigrams like "city centre" and "cork city" further reinforce this.
"anti social behaviour" with 85 mentions suggests that this is a common civic concern within the community. The use of "years ago" suggests that there is a sense of nostalgia or reflection of the past in the community's discourse.
\caption{Average Emotion Scores for the Cork Dataset}
\label{fig:cork_emotions}
\end{figure}
Figure \ref{fig:cork_emotions} shows the average emotion scores for the Cork dataset. The most dominant emotion in the dataset is "joy", which suggests that the community has a more positive tone than other online communities. However, this could also be due to the removal of the "neutral" emotion class from the analysis, which may have forced some neutral posts to be classified as "joy".
\caption{Topic-Emotion Analysis for Exams, Flooding and Food in the Cork Dataset}
\label{fig:cork_emotions_by_topic}
\end{figure}
In Figure \ref{fig:cork_emotions_by_topic}, we can see that the "Food" topic has a much higher average "joy" score than any other emotion, shown by it having a model confidence of 0.53 for "joy". In addition, this topic has a large sample size of 1390 posts, which reinforces the finding that the "Food" topic is associated with "joy" in the Cork dataset.
The "Exams" topic has a dominant emotion of "sadness", which makes sense given the stressful nature of exams.
Interestingly, the "Flooding" topic has a dominant emotion of "anger". This could mean that users are angry about the flooding situation or the government response to flooding, however further analysis shows in Figure \ref{fig:flooding_posts} that it is a mixture of posts being misclassified as "flooding" due to the presence of words like "flood" and "water", and posts that are angry about the government response and infrastructure issues related to flooding. This highlights the limitations of the topic classification, as it can struggle with posts that contain multiple topics or posts that are misclassified into a topic due to the presence of certain keywords.
\caption{Posts Classified with the "Flooding" Topic}
\label{fig:flooding_posts}
\end{figure}
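A topic-emotion breakdown of this kind reduces to a group-by over per-post emotion confidences. A toy pandas sketch, with hypothetical column names and made-up scores:

```python
import pandas as pd

# Toy events with per-post emotion confidences (invented values)
df = pd.DataFrame({
    "topic":   ["Food", "Food", "Exams", "Flooding"],
    "joy":     [0.60, 0.46, 0.10, 0.05],
    "sadness": [0.10, 0.20, 0.55, 0.25],
    "anger":   [0.05, 0.10, 0.15, 0.50],
})
by_topic = df.groupby("topic")[["joy", "sadness", "anger"]].mean()
dominant = by_topic.idxmax(axis=1)  # dominant emotion per topic
```

The mean-per-topic view is what the figure visualises; the sample size per topic should always be reported alongside it, as small topics produce unstable averages.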
\subsection{Limitations}
Several limitations of the system became apparent through development, evaluation and user testing.
\subsubsection{English-Only Support}
Two of the three NLP models used in the system are trained exclusively on English-language data. This means the system cannot accurately analyse datasets in other languages, which limits its usefulness for researchers working with non-English communities. This was noted as a specific concern by participants in the user feedback session, who work with both English and Turkish datasets.
\subsubsection{Scalability}
While asynchronous processing via Celery and Redis mitigates blocking during NLP enrichment and data fetching, the system is not designed to scale horizontally. A single Celery worker handles all tasks sequentially, and the PostgreSQL database is not configured for high-availability or replication. For research use at small to medium scale this is fine, but the system would require significant infrastructure changes to support concurrent large-scale usage across many users.
\newpage
\section{Conclusions}
\subsection{Reflection}
I have learned a lot through the process of building this system, both in terms of technical skills and personal growth. This project represented the most technically complex system I had built independently to date.
The analytical scope is the project's most visible limitation. Six analytical angles across many data sources sounds comprehensive, but with a single developer and a fixed timeline, the actual ethnographic depth achievable was modest. The decision between depth of ethnographic analysis and typical SaaS-type infrastructure and features was a tension throughout the project. Eventually a balance between the two was achieved, but there was some sacrifice in the analysis depth for the sake of building a more complete and polished system.
Planning the project was a challenge, as generally I tend to work iteratively. I jump in and start building straight away, and I find that the process of building helps me to figure out what I actually want to build. This led to some awkward parts in the report where design and implementation often overlapped and were made in a non-linear fashion. Creating the design section was difficult when implementation had already started, and design was still changed throughout the implementation process.
On a personal level, the project was a significant learning experience in terms of time management and project planning. The planning and implementation of the project was ambitious but easy to get carried away with, and I found myself spending a lot of time on features that were not essential to the core functionality of the system. The implementation felt productive and visible in a way that writing the report was not; I found myself spending more time on the implementation than the report, and the report was pushed to the sidelines until the end of the project.
The project was maintained and developed using Git for version control, with the repository being hosted on Github.
Starting in November, the project went through a few iterations of basic functionality such as data retrieval and storage. Research was done on digital ethnography, the traditional metrics used, and how they're implemented in code. The design of the system was also iterated on, evolving from a very simple frontend that showed basic aggregates into a more complex and feature-rich dashboard with multiple analytical perspectives and NLP enrichments.
The majority of real development and implementation took place between January and April, with the final month of April being focused on testing, bug fixing, writing the report and preparation for the open day. The project was developed in an agile and iterative way, with new features being added and improved upon throughout the development process, rather than having a fixed plan for the entire project from the beginning.
Git was used as a changelog of decisions and rationale, to aid writing the report. But if this project were to be done again, I would maintain the report alongside the implementation from the beginning, as it would have made writing the report much easier and less stressful at the end.
\subsection{Future Work}
This section discusses several potential areas for future work and improvements to the system.
\subsubsection{Improved Emotional Analysis}
As noted in the user feedback and accuracy evaluation sections, the emotional analysis could be improved by implementing a more nuanced emotion classification model, such as the GoEmotions model with 27 emotion classes \cite{demszky2020goemotions}.
This would require some changes to the database schema, as currently, the "events" table contains a column for each of the five emotion classes, which would not be feasible with 27 emotion classes. A more flexible schema would be needed, such as having a separate "emotions" table that contains the emotion classifications for each post, with columns for the post ID, emotion class, and confidence score.
Alternatively, the schema could mirror how NER classifications are stored: a \texttt{JSONB} column containing a list of all the classifications for each post, which allows for a variable number of classifications and is more flexible for future changes to the emotion classification.
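The separate-table option could look like the following; sketched here with SQLite for illustration (the system itself uses PostgreSQL), with hypothetical table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE emotions (
        event_id   INTEGER,
        emotion    TEXT,
        confidence REAL
    )
""")
# One row per (post, emotion) pair supports any number of classes
con.executemany("INSERT INTO emotions VALUES (?, ?, ?)",
                [(1, "joy", 0.62), (1, "amusement", 0.21), (1, "relief", 0.09)])
rows = con.execute(
    "SELECT emotion FROM emotions WHERE event_id = 1 ORDER BY confidence DESC"
).fetchall()
```

Because each classification is its own row, moving from 5 to 27 emotion classes requires no schema migration.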
\subsubsection{Multilingual Support}
The project was largely built around English language datasets, therefore the emotional and NER models are trained on English language data and would not work with other languages. Beyond the NLP models, the stances and identity markers currently implemented use English-specific keywords such as "we", "us", "I", "me".
To support multilingual datasets, multilingual NLP models could be adopted, allowing language detection to be automatic. However, as different languages require their own stance and identity markers, a better solution would be for the user to specify the language of their dataset upon uploading, and then the system could use the correct NLP models, stance/identity marker lists and stop words for that language.
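The language-specific marker lists could be as simple as a keyword table keyed by language code. A sketch with illustrative (unverified) Turkish entries:

```python
# Hypothetical per-language identity-marker lists; the Turkish entries
# are illustrative pronouns, not a vetted marker list.
IDENTITY_MARKERS = {
    "en": {"we", "us", "our", "i", "me", "my"},
    "tr": {"biz", "bize", "ben", "bana"},
}

def count_identity_markers(text, lang="en"):
    """Count identity-marker tokens in a post for the given language."""
    markers = IDENTITY_MARKERS.get(lang, set())
    return sum(1 for w in text.lower().split() if w in markers)
```

Selecting the list from the user-specified dataset language keeps the stance analysis meaningful without needing automatic language detection.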
\subsubsection{Improved Corpus Explorer}
The corpus explorer could be improved by allowing users to see more metadata for each post, such as the NLP classifications and possibly even more than just the top emotion and topic.
In addition, reconstructing the reply chains and conversation structures in the corpus explorer would allow users to see the context of each post and how they relate to each other. It would allow researchers to gauge the power dynamics between users and the conversational structures.
Colour grading each post in the corpus explorer based on its emotional classification would be both aesthetically pleasing and useful for users to quickly scan through the posts and get a sense of the emotional tone of the dataset.
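The colour grading could be a simple lookup from each post's dominant emotion to a colour; the hex values here are arbitrary placeholders:

```python
# Hypothetical emotion-to-colour mapping for the corpus explorer
EMOTION_COLOURS = {
    "joy": "#f5c518", "sadness": "#4a7fb5", "anger": "#c0392b",
    "fear": "#7d3c98", "disgust": "#27ae60",
}

def post_colour(top_emotion, default="#95a5a6"):
    """Colour for a post card; grey fallback for unknown labels."""
    return EMOTION_COLOURS.get(top_emotion.lower(), default)
```

The intensity of the colour could additionally be scaled by the model's confidence, so uncertain classifications appear muted.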