Compare commits


2 Commits

2 changed files with 92 additions and 5 deletions


@@ -1,4 +1,4 @@
\documentclass{article}
\usepackage{graphicx}
\usepackage{setspace}
\usepackage{hyperref}
@@ -6,6 +6,8 @@
\begin{document}
\bibliographystyle{plain}
\begin{titlepage}
\centering
@@ -444,11 +446,13 @@ The system will follow a client-server architecture, with a Flask-based backend
The reasoning behind this architecture is that it allows the analytics to be aggregated and computed on the server side using Pandas, which is much faster than doing it on the client frontend. The frontend will focus on rendering and visualising the data.
\subsubsection{API Design}
The Flask backend will expose a RESTful API with endpoints for dataset management, authentication and user management, and analytical queries. Flask will call on backend components for data parsing, normalisation, NLP processing and database interfacing.
Flask was chosen for its simplicity, familiarity and speed of development. It also has many extensions that can be used for authentication (Flask-Bcrypt, Flask-Login).
The API is organised into three groups: \textbf{authentication}, \textbf{dataset management} and \textbf{analysis}.
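The grouping described above maps naturally onto Flask blueprints. The following is a minimal sketch, not the project's actual code; the URL prefixes and the example endpoint are illustrative assumptions.

```python
# Hypothetical sketch of grouping the API into authentication,
# dataset-management and analysis blueprints.
from flask import Flask, Blueprint, jsonify

auth_bp = Blueprint("auth", __name__, url_prefix="/api/auth")
datasets_bp = Blueprint("datasets", __name__, url_prefix="/api/datasets")
analysis_bp = Blueprint("analysis", __name__, url_prefix="/api/analysis")

@analysis_bp.route("/temporal/<int:dataset_id>")
def temporal(dataset_id):
    # In the real system this would call the Pandas-based analysis layer.
    return jsonify({"dataset": dataset_id, "buckets": []})

def create_app():
    app = Flask(__name__)
    for bp in (auth_bp, datasets_bp, analysis_bp):
        app.register_blueprint(bp)
    return app
```

Keeping each group in its own blueprint lets the authentication extensions (Flask-Login, Flask-Bcrypt) be wired into one place without touching the analysis routes.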
\subsubsection{React Frontend}
React was chosen for the frontend due to its efficient rendering model and its large ecosystem of pre-built components capable of displaying many different types of data. The frontend will be structured around a tabbed interface, with each tab corresponding to a different analytical endpoint (e.g., temporal analysis, linguistic analysis, emotional analysis). Each tab will fetch data from the backend API and render it using appropriate visualisation libraries (react-wordcloud for word clouds, react-chartjs-2 for charts, etc.). The frontend will also include controls for filtering the dataset based on keywords, date ranges, and data sources.
@@ -656,6 +660,12 @@ In this system, cultural analysis will include:
\item Average emotions per entity
\end{itemize}
\subsection{Frontend Design}
The frontend is built with React and TypeScript, and the analysis sections are structured around a tabbed dashboard interface where each tab corresponds to a distinct analytical perspective: temporal, linguistic, emotional, user, and interaction analysis. This organisation mirrors the shape of the backend API and makes it straightforward for a researcher to navigate between different lenses on the same dataset without losing context.
React was chosen for its efficient rendering model and the breadth of its visualisation ecosystem.
\subsection{Automatic Data Collection}
Originally, the system was designed to only support manual dataset uploads, where users would collect their own data from social media platforms and format it into the required \texttt{.jsonl} format.
@@ -678,11 +688,21 @@ Creating a base interface for what a connector should look like allows for the e
The connector registry is designed so that any new connector implementing \texttt{BaseConnector} is automatically discovered and registered at runtime, without requiring changes to any existing code. This allows for a modular and extensible architecture where new data sources can be integrated with minimal effort.
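One lightweight way to achieve the automatic discovery described above is Python's \texttt{\_\_init\_subclass\_\_} hook, which registers every concrete subclass as soon as it is defined. This is a hedged sketch under that assumption; the project's actual \texttt{BaseConnector} interface and registry mechanism may differ.

```python
# Sketch: connectors self-register at class-definition time, so adding a
# new data source never requires editing existing registry code.
from abc import ABC, abstractmethod

CONNECTOR_REGISTRY = {}

class BaseConnector(ABC):
    source = None  # short identifier such as "reddit" (hypothetical field)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.source:  # only concrete connectors declare a source
            CONNECTOR_REGISTRY[cls.source] = cls

    @abstractmethod
    def fetch(self, query, limit):
        """Return a list of raw post dicts for the given query."""

class RedditConnector(BaseConnector):
    source = "reddit"

    def fetch(self, query, limit):
        return []  # the real connector would call the Reddit API here
```

Merely importing the module containing \texttt{RedditConnector} is enough for it to appear in the registry.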
\subsection{Asynchronous Processing}
The usage of NLP models for tasks such as sentiment analysis, topic classification, and entity recognition can be computationally intensive, especially for large datasets. In addition, fetching large datasets from sites like Reddit and YouTube takes a lot of time, due to the sequential nature of data fetching and severe rate limits on even authenticated Reddit accounts. To prevent the Flask API from blocking while these tasks are being processed, an asynchronous processing queue will be implemented using \textbf{Redis} and \textbf{Celery}.
When NLP processing is triggered or data is being fetched from social media APIs, a task will be added to the Redis queue. Celery workers will then pop tasks off the queue and process them in the background, which allows the API to remain responsive to user requests. This approach also improves scalability, as additional workers can be added to handle increased load.
\subsubsection{Dataset Enrichment}
A non-normalised dataset will be passed into Celery along with the dataset id and the user id of the dataset owner. At this point, the program is running separately from the main Flask thread. It then calls on the \textbf{Normalisation \& Enrichment Module} to:
\begin{itemize}
\item Flatten the dataset from posts with nested comments to a unified event data model.
\item Add derived timestamp columns to aid with temporal analysis
\item Add topic, emotional and entity NLP analysis as columns
\end{itemize}
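The first two steps above, flattening nested comments into unified events and deriving timestamp columns, might look roughly like the following Pandas sketch. The field names (\texttt{id}, \texttt{text}, \texttt{ts}, \texttt{comments}) are assumptions for illustration.

```python
# Illustrative flattening of nested post/comment JSON into the unified
# event model, plus derived timestamp columns for temporal analysis.
import pandas as pd

def flatten(posts):
    rows = []
    for post in posts:
        rows.append({"id": post["id"], "type": "post",
                     "text": post["text"], "ts": post["ts"],
                     "parent_id": None})
        for c in post.get("comments", []):
            rows.append({"id": c["id"], "type": "comment",
                         "text": c["text"], "ts": c["ts"],
                         "parent_id": post["id"]})
    df = pd.DataFrame(rows)
    ts = pd.to_datetime(df["ts"], unit="s", utc=True)
    df["hour"] = ts.dt.hour          # derived columns used by
    df["weekday"] = ts.dt.day_name() # the temporal analysis endpoints
    return df
```

The NLP columns (topic, emotion, entities) would be appended to the same frame by the later enrichment steps.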
\subsubsection{Data Fetching}
If the user triggers an automatic data fetch from a given social media site, a fetch task is likewise added to the Redis queue and picked up by a Celery worker, which invokes the relevant data connectors. Once data has been fetched from all sources, NLP processing begins and the pipeline proceeds exactly as for dataset enrichment.
Asynchronous processing is especially important for automatic data fetching, as large datasets can take hours to retrieve. Because these long-running tasks write progress updates to the database as they run, users can follow the status of their fetch through the frontend.
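The progress-reporting pattern can be reduced to a small sketch: each connector fetch is followed by a progress callback (which, in the real system, would be an \texttt{UPDATE} on the dataset row). All names here are hypothetical.

```python
# Sketch: a long-running multi-connector fetch that reports fractional
# progress after each source, so the frontend can poll the status.
def fetch_with_progress(connectors, report):
    """connectors: callables returning lists of raw items;
    report: callback taking a completion fraction in [0, 1]."""
    total = len(connectors)
    results = []
    for i, fetch in enumerate(connectors, start=1):
        results.extend(fetch())
        report(i / total)  # e.g. persist progress to the datasets table
    return results
```

With three connectors, the frontend would see progress jump to 1/3, 2/3 and finally 1.0 as each source completes.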
\subsection{Design Tradeoffs}
\subsubsection{Database vs On-Disk Storage}
@@ -695,6 +715,26 @@ An additional benefit of using a database was that it allowed the NLP processing
\texttt{PostgreSQL} was chosen as the database solution due to its robustness, support for complex queries, and compatibility with Python through \texttt{psycopg2}. PostgreSQL's support for JSONB fields allows for storage of unstructured NLP outputs, which alternatives such as SQLite do not support.
\subsubsection{Unified Data Model vs Split Data Model}
The choice between a \textbf{Unified Data Model} and a \textbf{Split Data Model} drove several redesigns of the API.
\paragraph{The Case for a Unified Data Model}
\begin{itemize}
\item \textbf{Simpler Schema}: One \texttt{events} table rather than split comments and posts tables
\item \textbf{Simpler Pipeline}: The same pipeline works for both types
\item \textbf{Differentiation Possible}: Through the \texttt{type} column, we can still differentiate between a post and a comment, though more awkwardly.
\end{itemize}
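To make the unified schema concrete, the single \texttt{events} table sketched above could correspond to a record shape like the following. The exact column set is an assumption for illustration, not the project's actual schema.

```python
# Hypothetical shape of one row in the unified events model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    id: str
    type: str                        # "post" or "comment"
    source: str                      # e.g. "reddit", "boards", "youtube"
    author: str
    text: str
    timestamp: int                   # Unix seconds
    parent_id: Optional[str] = None  # None for top-level posts
    reply_to: Optional[str] = None   # None where the source lacks nesting
```

The \texttt{type} column is what preserves the post/comment distinction within the single table.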
The unified model does flatten some distinctions: a post title, for example, has no natural counterpart in a comment. Reply chains must be reconstructed using the \texttt{reply\_to} and \texttt{parent\_id} fields, and some fields, such as \texttt{reply\_to}, will be null depending on the data source; boards.ie, for example, does not support nested replies.
\paragraph{The Case for a Split Data Model}
\begin{itemize}
\item \textbf{Per-Type Analysis}: A post has different attributes from a comment, so extending the analysis with post-specific metrics (such as title sentiment or title-to-post length ratio) is easier later on.
\item \textbf{Accurate Reply Relationships}: Reply relationships are represented naturally: comments hold a foreign key to posts, so no reconstruction is needed.
\end{itemize}
However, each analytical query would either need to be post- or comment-specific, or require a table merge later in the pipeline. For ethnographic analysis, the distinction between a post and a comment is minimal: from a research point of view, both are just a user saying something at a point in time, and treating them uniformly reflects that.
The decision was made to \textbf{stick with the unified data model}, since its downsides can be mitigated: reply chains can be reconstructed from the \texttt{reply\_to} and \texttt{parent\_id} fields, and the \texttt{type} column still distinguishes posts from comments. In the cases where the two do need to be treated differently (reply chains, interaction graphs), that distinction therefore remains available.
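The reply-chain reconstruction that the unified model relies on is straightforward in practice. This sketch assumes each event carries the \texttt{reply\_to} and \texttt{parent\_id} fields discussed above, with \texttt{reply\_to} taking precedence where the source supports nesting.

```python
# Sketch: rebuild reply relationships from the unified events table.
def build_thread(events):
    """Map each event id to the ids of the events replying to it."""
    children = {}
    for e in events:
        # Prefer the direct reply target; fall back to the parent post
        # for sources (like boards.ie) without nested replies.
        parent = e.get("reply_to") or e.get("parent_id")
        children.setdefault(parent, []).append(e["id"])
    return children
```

The resulting adjacency map is exactly what an interaction-graph analysis needs, so the split model's "natural" foreign keys buy relatively little.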
\subsection{Deployment}
Docker Compose is used to containerise the entire application, including:
@@ -713,6 +753,44 @@ Environment variables, such as database credentials and social media API keys, w
\newpage
\section{Implementation}
In the previous chapter, the architecture of the web-based ethnography tool was
outlined. This chapter discusses the details of how it was implemented.
\subsection{Overview}
In the initial stages, the project was a small Python script that would fetch data from Reddit and aggregate simple statistics such as the number of posts and the number of comments. Some early features, like search and subreddit-specific searches, were added through hard-coded variables. The Reddit connector code was extracted into its own \texttt{RedditConnector} module, though the connector abstraction had not yet been formalised.
As this was going to be a web-based tool, the Flask server was then set up. A rudimentary sentiment analysis endpoint was added as an initial test using the VADER Sentiment Python module. An endpoint to fetch from Reddit was added but temporarily scrapped. Eventually more analysis endpoints were added, creating the many different analytical perspectives that are available in the final system, such as linguistic analysis and user analysis.
At this stage, datasets were simply files stored on the machine and loaded into memory globally, which made early development and testing easier. As the project progressed, the database was added to allow multiple datasets and users. Alongside this, further infrastructure addressed long-standing issues such as the blocking nature of NLP processing and data fetching, which was solved by adding Redis and Celery for asynchronous processing. Multi-user support was added through user accounts, with authentication and dataset-ownership endpoints.
A very basic frontend was created with React, which was just a simple interface to call the API endpoints and display some basic summary stats such as number of posts, number of comments, and average sentiment. After the initial analysis endpoints were created and the API was fully functional, the frontend was expanded to include the full tabbed interface with visualisations for each analytical perspective.
\subsection{Project Tooling}
The project was developed using the following tools and libraries:
\begin{itemize}
\item \textbf{Python 3.13} for the backend API and data processing.
\item \textbf{Flask} for the web server and API development.
\item \textbf{BeautifulSoup} and \textbf{Requests} for web scraping (used in the Boards.ie connector).
\item \textbf{google-api-python-client} for interacting with the YouTube Data API.
\item \textbf{PostgreSQL} for the database.
\item \textbf{Redis} and \textbf{Celery} for asynchronous task processing.
\item \textbf{React} and \textbf{TypeScript} for the frontend interface.
\item \textbf{Docker} and \textbf{Docker Compose} for containerisation and deployment.
\item \textbf{Pandas} for data manipulation and analysis.
\item \textbf{NLTK} for basic stop word lists and tokenisation.
\item \textbf{Transformers} for NLP models used in emotion classification, topic classification, and named entity recognition.
\item \textbf{react-chartjs-2} and \textbf{react-wordcloud} for data visualisation in the frontend.
\end{itemize}
The project was developed using Git for version control, with a branching strategy that included feature branches for new functionality and a main branch for stable code. Regular commits were made to document the development process and conventional commit messages were used to indicate the type of changes made. Occasionally, text bodies were included in commit messages to provide justification for design decisions or to explain changes that couldn't be easily understood from the diff alone.
\subsection{Social Media Connectors}
The first connectors implemented were the Reddit and Boards.ie connectors, as these were the original data sources for the Cork dataset. The YouTube connector was added later to improve the diversity of data sources. The decision was also made to fetch only a fixed number of new posts, rather than the top posts of all time, which are usually full of memes and jokes that would skew the dataset and add little ethnographic value. Fetching all-time top posts would also skew the temporal analysis, since the most popular posts are often years old and say little about the current state of the community.
\subsubsection{Reddit Connector}
The initial implementation of the Reddit connector was a small class that used the \texttt{requests} library to fetch data directly from the Reddit API. The online Reddit API documentation was used as a reference for the implementation of the connector \cite{reddit_api}.
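An early version of such a connector might have looked like the following sketch, which reads Reddit's public JSON listing for a subreddit's new posts. This is illustrative, not the project's actual code; the real connector would also handle authentication and rate limiting.

```python
# Sketch: fetching new posts from Reddit's public JSON listing.
import requests

def parse_listing(payload):
    """Extract the post dicts from a Reddit listing response."""
    return [child["data"] for child in payload["data"]["children"]]

def fetch_new(subreddit, limit=25):
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    resp = requests.get(
        url,
        params={"limit": limit},
        # Reddit rejects requests without a descriptive User-Agent.
        headers={"User-Agent": "ethnography-tool/0.1"},
    )
    resp.raise_for_status()
    return parse_listing(resp.json())
```

Separating \texttt{parse\_listing} from the HTTP call keeps the response-format handling testable without network access.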
\newpage
\section{Evaluation}
@@ -720,4 +798,6 @@ Environment variables, such as database credentials and social media API keys, w
\newpage
\section{Conclusions}
\bibliography{references}
\end{document}

report/references.bib Normal file

@@ -0,0 +1,7 @@
@online{reddit_api,
author = {{Reddit Inc.}},
title = {Reddit API Documentation},
year = {2025},
url = {https://www.reddit.com/dev/api/},
urldate = {2026-04-08}
}