Compare commits


2 Commits

Author SHA1 Message Date
107dae0e95 docs(report): add data storage section 2026-04-06 19:26:10 +01:00
23833e2c5b docs(report): add custom topic section 2026-04-06 18:47:29 +01:00


@@ -2,6 +2,7 @@
\usepackage{graphicx}
\usepackage{setspace}
\usepackage{hyperref}
\usepackage{fvextra}
\begin{document}
@@ -418,7 +419,24 @@ The system will support two methods of data ingestion:
Originally, only file upload was supported, but the goal of the platform is to aid researchers with ethnographic analysis, and many researchers will not have the technical expertise to fetch data from social media APIs or scrape websites. Therefore, the system was designed to support automated fetching of data from social media platforms, allowing users to easily obtain datasets without needing to manually collect and format data themselves.
In addition to social media posts, the system will allow users to upload a list of topics that they want to track in the dataset. This allows the system to generate custom topic analysis based on user-defined topics, which can be more relevant and insightful for specific research questions. For example, a researcher studying discussions around local politics in Cork might upload a list of political parties, politicians, and policy issues as topics to track.
Below is a snippet of what a custom topic list might look like in \texttt{.json} format:
\begin{Verbatim}[breaklines=true]
{
  "Public Transport": "buses, bus routes, bus eireann, public transport, late buses, bus delays, trains, commuting without a car, transport infrastructure in Cork",
  "Traffic": "traffic jams, congestion, rush hour, cars backed up, gridlock, driving in Cork, road delays",
  "Parking": "parking spaces, parking fines, clamping, pay parking, parking permits, finding parking in the city",
  "Cycling": "cycling in Cork, bike lanes, cyclists, cycle safety, bikes on roads, cycling infrastructure"
}
\end{Verbatim}
If a custom topic list is not provided by the user, the system will use a pre-defined generalised topic list that is designed to capture common themes across a wide range of online communities.
Each method of ingestion will format the raw data into a standardised structure, where each post will be represented as a ``Post'' object and each comment will be represented as a ``Comment'' object.
\subsubsection{Data Normalisation}
After a dataset is ingested, the system will normalise all posts and nested comments into a single unified ``event'' data model. This means that both posts and comments will be represented as the same type of object, with a common set of fields that capture the relevant information for analysis. The fields in this unified data model will include:
\begin{itemize}
\item \texttt{id} — a unique identifier for the post or comment.
\item \texttt{content} — the text content of the post or comment.
@@ -432,15 +450,47 @@ Each method of ingestion will format the raw data into a standardised structure,
The decision to normalise posts and comments into a single ``event'' data model allows the same analytical functions to be applied uniformly across all content, regardless of whether it was originally a post or a comment. This simplifies the data model and reduces the complexity of the analytical pipeline, since there is no need to maintain separate processing paths for posts and comments.
As part of this normalisation process, the dataset is also \textbf{flattened}, so rather than comments being nested within their parent posts as they are in the raw source data, all events are stored as a flat sequence of records. The relationships between posts and comments are preserved through the \texttt{parent\_id} and \texttt{reply\_to} fields. This allows for more efficient querying and analysis of the data, as well as simplifying the data model.
Overall, the data normalisation process unifies the structure of the dataset and flattens the data into a format that makes analysis simpler and more efficient.
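As a sketch of the flattening step (the nested input shape is an assumption about the raw source format; the output field names follow the unified event model):

```python
def flatten(raw_posts: list[dict]) -> list[dict]:
    """Flatten nested posts/comments into a single flat sequence of event records."""
    events = []
    for post in raw_posts:
        events.append({"id": post["id"], "type": "post",
                       "content": post["content"],
                       "parent_id": None, "reply_to": None})
        for comment in post.get("comments", []):
            events.append({"id": comment["id"], "type": "comment",
                           "content": comment["content"],
                           "parent_id": post["id"],               # owning post
                           "reply_to": comment.get("reply_to")})  # None = direct reply to post
    return events
```

The tree structure is not lost: it can always be reconstructed by grouping on \texttt{parent\_id} and following \texttt{reply\_to} chains.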
\subsubsection{Data Enrichment}
After normalisation, the dataset is enriched with additional derived fields and NLP outputs. This includes:
\begin{itemize}
\item \textbf{Datetime Derivations}: Fields such as day of week, hour of day, and week of year are derived from the raw timestamp and stored alongside the event, so they do not need to be recomputed on every query.
\item \textbf{Emotion Classification}: Each event is run through an NLP model that assigns an emotional label to the text content, such as joy, anger, or sadness.
\item \textbf{Topic Classification}: Each event is classified into its most relevant topic using an NLP model, based on either the user-provided topic list or the system default.
\item \textbf{Named Entity Recognition}: Each event is processed to identify any named entities mentioned in the text, such as people, places, or organisations, which are stored as a list associated with the event.
\end{itemize}
NLP processing allows for much richer analysis of the dataset, as it provides additional layers of information beyond just the raw text content. After enrichment, the dataset is ready to be stored in the database and made available for analysis through the API endpoints.
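The datetime derivations in particular are cheap to precompute at ingestion; a minimal sketch, assuming timestamps are Unix epoch seconds interpreted as UTC (\texttt{derive\_datetime\_fields} is an illustrative name):

```python
from datetime import datetime, timezone

def derive_datetime_fields(timestamp: int) -> dict:
    """Precompute calendar fields from a Unix epoch timestamp (UTC assumed)."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {
        "date": dt.date().isoformat(),        # e.g. "2026-04-06"
        "hour": dt.hour,                      # 0-23
        "weekday": dt.strftime("%A"),         # e.g. "Monday"
        "week_of_year": dt.isocalendar().week,
    }
```

Storing these alongside each event trades a little disk space for much cheaper group-by-hour or group-by-weekday queries later.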
\subsubsection{Data Storage}
The enriched dataset is stored in a PostgreSQL database, with a schema that mirrors the unified data model defined in the normalisation section, extended with additional fields for the derived data, NLP outputs, and user ownership. Each dataset is associated with a specific user account, and the system supports multiple datasets per user.
The \texttt{events} table in PostgreSQL contains the following fields:
\begin{itemize}
\item \texttt{id}: a unique identifier for the event.
\item \texttt{dataset\_id}: a foreign key referencing the dataset this event belongs to. If the dataset is deleted, all of its events are deleted along with it.
\item \texttt{post\_id}: the original identifier of the post or comment as it appeared on the source platform.
\item \texttt{type}: whether the event is a post or a comment.
\item \texttt{author}: the username of the content creator.
\item \texttt{content}: the text content of the event.
\item \texttt{timestamp}: the Unix epoch time at which the content was created.
\item \texttt{date}, \texttt{dt}, \texttt{hour}, \texttt{weekday}: datetime fields derived from the timestamp at ingestion time.
\item \texttt{title}: the title of the post, if the event is a post. Null for comments.
\item \texttt{parent\_id}: for comments, the identifier of the post it belongs to. Null for posts.
\item \texttt{reply\_to}: for comments, the identifier of the comment it directly replies to. Null if the comment is a direct reply to a post.
\item \texttt{source}: the platform from which the content was retrieved.
\item \texttt{topic}, \texttt{topic\_confidence}: the topic assigned to the event by the NLP model, along with a confidence score.
\item \texttt{ner\_entities}: a list of named entities identified in the content, stored as a \texttt{JSONB} field.
\item \texttt{emotion\_anger}, \texttt{emotion\_disgust}, \texttt{emotion\_fear}, \texttt{emotion\_joy}, \texttt{emotion\_sadness}: emotion scores assigned to the event by the NLP model.
\end{itemize}
\subsection{Automatic Data Collection}
\subsubsection{Connector Abstractions}
While the system is designed around a Cork-based dataset, it is intentionally source-agnostic, meaning that additional data sources for data ingestion could be added in the future without changes to the core analytical pipeline.
\textbf{Data Connectors} are components responsible for fetching and normalising data from specific sources. Each connector implements a standard interface for data retrieval, such as:
\begin{itemize} \begin{itemize}
@@ -451,6 +501,9 @@ Creating a base interface for what a connector should look like allows for the e
The connector registry is designed so that any new connector implementing \texttt{BaseConnector} is automatically discovered and registered at runtime, without requiring changes to any existing code. This allows for a modular and extensible architecture where new data sources can be integrated with minimal effort.
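A common way to implement this kind of runtime discovery in Python is through subclass registration; the sketch below is hypothetical (the \texttt{source} attribute and \texttt{fetch} signature are assumptions, not the actual interface):

```python
from abc import ABC, abstractmethod

class BaseConnector(ABC):
    """Base interface that all data connectors implement."""
    source: str  # platform name, e.g. "reddit"

    @abstractmethod
    def fetch(self, query: str) -> list[dict]:
        """Fetch raw data for a query and return normalised event dicts."""

def connector_registry() -> dict[str, type[BaseConnector]]:
    """Discover every concrete BaseConnector subclass at runtime."""
    return {cls.source: cls for cls in BaseConnector.__subclasses__()}

# Adding a new source requires only defining a new subclass:
class RedditConnector(BaseConnector):
    source = "reddit"
    def fetch(self, query: str) -> list[dict]:
        return []  # placeholder: a real implementation would call the platform API
```

In practice, connector modules must be imported before discovery runs, since \texttt{\_\_subclasses\_\_()} only sees classes that have already been defined.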
\subsection{Ethnographic Analysis}
\subsection{Client-Server Architecture}
The system will follow a client-server architecture, with a Flask-based backend API and a React-based frontend interface. The backend will handle data processing, NLP analysis, and database interactions, while the frontend will provide an interactive user interface for data exploration and visualisation.
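As an illustration of this split, a minimal Flask endpoint of the kind the backend might expose (the route path and response shape are hypothetical, not the actual API):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical endpoint: the real backend would query PostgreSQL for the
# dataset's events (scoped to the authenticated user) instead of returning [].
@app.route("/api/datasets/<int:dataset_id>/events")
def list_events(dataset_id: int):
    events = []  # placeholder for a database query filtered by dataset_id
    return jsonify({"dataset_id": dataset_id, "events": events})
```

The React frontend would consume such JSON endpoints to drive its interactive visualisations, keeping all heavy processing on the server.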