% report/main.tex (compare 6efa75dfe6...b57a8d3c65)
\documentclass[12pt]{article}

\usepackage{graphicx}
\usepackage{setspace}
\usepackage{hyperref}

\begin{document}

\begin{titlepage}
There are many beneficiaries of a digital ethnography analytic system: social sc

\subsection{Goals \& Objectives}
\begin{itemize}
\item \textbf{Collect data ethically}: enable users to link or upload text and interaction data (messages, etc.) from specified online communities. Potentially, an automated import method (using APIs or scraping techniques) could be included as well.
\item \textbf{Organise content}: Store gathered material in a structured database with tagging for themes, dates, and sources.
\item \textbf{Analyse patterns}: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
\item \textbf{Visualise insights}: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.

Specifically, the system aims to:

Ultimately, the project seeks to demonstrate how computational systems can aid and augment the toolkits of social scientists and digital ethnographers.

\subsection{Feasibility Analysis}
\subsubsection{NLP Limitations}
Online communities often use sarcasm, irony, or context-specific references, all of which are challenging for NLP models, especially weaker ones, to interpret correctly. In a Cork-specific dataset, this will be especially apparent due to the use of regional slang and informal grammar.

Therefore, the model's output for any single event should not be considered definitive, but rather as an indicative pattern that is more likely to be correct when aggregated across the entire dataset. For example, while a single comment about a specific topic might be misclassified as positive, the overall sentiment of that topic across thousands of comments is more likely to reflect the true emotional tone of the community.

To account for these limitations, the system will:
\begin{itemize}
\item Rely on \textbf{aggregated results} rather than individual classifications.
\item Provide \textbf{context for outputs}, such as confidence scores where available.
\item Allow \textbf{access to the original text} behind each NLP result.
\end{itemize}
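The reliance on aggregated results can be sketched as follows; the input format, field names, and confidence threshold here are illustrative assumptions rather than the system's actual interface:

```python
from collections import Counter

def aggregate_sentiment(classifications, min_confidence=0.5):
    """Aggregate per-comment sentiment labels into a topic-level summary.

    Individual labels may be wrong; the aggregate over many comments is a
    more reliable indicator of the community's overall tone.
    """
    # Keep only classifications the model was reasonably confident about.
    confident = [c for c in classifications if c["confidence"] >= min_confidence]
    counts = Counter(c["label"] for c in confident)
    total = sum(counts.values())
    if total == 0:
        return {"label": None, "support": 0, "distribution": {}}
    label, _ = counts.most_common(1)[0]
    return {
        "label": label,                      # dominant sentiment
        "support": total,                    # how many events back this result
        "distribution": {k: v / total for k, v in counts.items()},
    }

# A couple of misclassified comments barely move the aggregate:
results = [{"label": "positive", "confidence": 0.9}] * 8 + \
          [{"label": "negative", "confidence": 0.7}] * 2
summary = aggregate_sentiment(results)
print(summary["label"], summary["support"])  # positive 10
```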

Overall, while NLP provides powerful tools for analysing large datasets, its limitations must be acknowledged and mitigated through careful design and interpretation of results.

\subsubsection{Data Normalisation}
Different social media platforms produce data in many different formats. For example, Reddit data has a very different reply structure from a forum-based platform like Boards.ie, where there are no nested replies. Therefore, a core design requirement of the system is to normalise all incoming data into a single, unified internal data model. This allows the same analytical functions to be applied across all data sources, regardless of their original structure.

Posts and comments are two different types of user-generated content; from an ethnographic perspective, however, the distinction between them is not particularly important, since both represent user-generated content that contributes to the community discourse. Therefore, the system will normalise all posts and comments into a single "event" data model, allowing the same analytical functions to be applied uniformly across all content. This also simplifies the data model and reduces the complexity of the analytical pipeline, since there is no need to maintain separate processing paths for posts and comments.

Though separate processing paths are not needed, the system will still retain metadata indicating whether an event was originally a post or a comment, as well as any relevant structural information (e.g., parent-child relationships in Reddit threads).

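A minimal sketch of what such a unified "event" model might look like; the field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    """Unified internal model for any piece of user-generated content."""
    source: str                      # e.g. "reddit", "boards", "youtube"
    kind: str                        # "post" or "comment" (retained as metadata)
    author: str
    timestamp: datetime
    text: str
    event_id: str
    parent_id: Optional[str] = None  # parent event for nested replies
    tags: list = field(default_factory=list)

# A Reddit comment and a Boards.ie post flow through the same model:
reply = Event(source="reddit", kind="comment", author="user1",
              timestamp=datetime(2025, 11, 1, 12, 0), text="Up Cork!",
              event_id="t1_abc", parent_id="t3_xyz")
print(reply.kind, reply.parent_id)  # comment t3_xyz
```

Structural information such as the Reddit parent-child relationship survives in `parent_id`, while flat Boards.ie posts would simply leave it unset.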
\subsubsection{Scalability Constraints}
The system should be scalable enough to handle large datasets, but there are practical limits to how much data can be processed within reasonable timeframes, especially given the computational demands of NLP models.

Some values can be precomputed during the data ingestion phase, such as datetime column derivations and NLP outputs, which makes subsequent queries faster and more efficient. However, the initial processing time for large datasets will still be significant, especially if a dataset contains hundreds of thousands of posts and comments.

To mitigate this, the system will:
\begin{itemize}
\item Utilise GPU acceleration where available for NLP inference.
\item Pre-compute some analytical results during data ingestion to speed up subsequent queries.
\item Store NLP outputs in the database to avoid redundant processing.
\item Implement asynchronous processing for long-running tasks.
\end{itemize}

Overall, while the system is designed to be scalable, it is important to set realistic expectations regarding performance and processing times, especially for very large datasets.

\subsection{Ethics}
The system will process only publicly available data and will not attempt to access private or restricted content.

\subsubsection{Automated Data Collection}
The system will provide an option for users to automatically fetch datasets from social media sites, filtered by keywords or categories. It is therefore important to ensure that this data collection is done ethically.

The system will:
\begin{itemize}
\item Respect rate limits by implementing an exponential backoff strategy for API requests.
\item Only collect data that is publicly available and does not require authentication or violate platform terms of service.
\item Provide user-agent headers that identify the system and its purpose.
\item Allow users to upload their own datasets as an alternative to automated collection.
\item Examine the \texttt{robots.txt} file of websites without an API to ensure compliance with platform guidelines.
\item Enforce server-side data volume limits of up to 1000 posts per source to prevent excessive data collection.
\end{itemize}
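The exponential backoff mentioned in the first item can be sketched as a delay schedule; the base delay, cap, and jitter fraction below are illustrative choices, not fixed system parameters:

```python
import random

def backoff_delays(retries, base=1.0, cap=60.0):
    """Exponential backoff schedule with jitter for repeated API requests."""
    delays = []
    for attempt in range(retries):
        # Double the wait on each attempt, capped, plus up to 10% random jitter
        # so that many clients retrying together do not synchronise.
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + random.uniform(0, delay * 0.1))
    return delays

# Roughly 1s, 2s, 4s, 8s, 16s between successive retries:
schedule = backoff_delays(5)
print([round(d) for d in schedule])
```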

Some platforms, such as YouTube and Reddit, provide APIs that allow for easy and ethical data collection. These APIs have clear guidelines and rate limits that the system will adhere to.

\paragraph{Reddit (API)}
Reddit provides a public API that allows for the retrieval of posts, comments, and metadata from subreddits. The system will use the official Reddit API with proper authentication via OAuth2 and access tokens.

In November 2025, Reddit introduced a new approval process for API access, which requires developers to apply for access and specify their intended use case. While the public unauthenticated endpoints are still accessible, they have far stricter rate limits (100 requests every 10 minutes) compared to authenticated access (100 requests per minute). Therefore, the system shall allow for authenticated access to the Reddit API to speed up data retrieval.

Unauthenticated access will remain available as a fallback when client credentials are not provided on the backend, but this will massively slow the data retrieval process, and it will still fetch only public posts and comments.

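A sketch of how the backend might select between authenticated and fallback access; the function, constant, and environment-variable names are illustrative, while the per-minute limits are taken from the figures above:

```python
import os

# Rate limits from the section above, expressed per minute.
AUTHENTICATED_RPM = 100        # 100 requests per minute with OAuth2
UNAUTHENTICATED_RPM = 10       # 100 requests every 10 minutes

def reddit_request_budget(client_id=None, client_secret=None):
    """Choose the access mode based on whether backend credentials exist."""
    if client_id and client_secret:
        return {"mode": "oauth2", "rpm": AUTHENTICATED_RPM}
    # Fallback: public endpoints only, with far stricter limits.
    return {"mode": "unauthenticated", "rpm": UNAUTHENTICATED_RPM}

# Credentials would normally come from backend configuration:
budget = reddit_request_budget(os.getenv("REDDIT_CLIENT_ID"),
                               os.getenv("REDDIT_CLIENT_SECRET"))
```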
From Reddit, the system will collect posts, comments, and all replies to comments, as well as metadata such as the author name and timestamp.

\paragraph{Boards.ie (Web Scraping)}
Boards.ie is an Irish discussion forum with no public API, so the system will use web scraping instead. The platform's \texttt{robots.txt} will be used to ensure compliance with the site's guidelines for automated access. The Boards.ie \texttt{robots.txt} file contains the following information:

\begin{verbatim}
Sitemap: https://www.boards.ie/sitemapindex.xml
User-agent: *
Disallow: /entry/
Disallow: /messages/
Disallow: /profile/comments/
Disallow: /profile/discussions/
Disallow: /search/
Disallow: /sso/
Disallow: /sso
\end{verbatim}

Public discussion threads may be crawled automatically, while user profiles, private messages, and authentication endpoints may not. The system will respect these boundaries and will not attempt to access any restricted path.

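These rules can be checked mechanically with Python's standard-library \texttt{urllib.robotparser}, here fed the rules quoted above (the Sitemap line is omitted for brevity; the URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The Boards.ie rules quoted above, parsed with the stdlib robots.txt parser.
ROBOTS_TXT = """\
User-agent: *
Disallow: /entry/
Disallow: /messages/
Disallow: /profile/comments/
Disallow: /profile/discussions/
Disallow: /search/
Disallow: /sso/
Disallow: /sso
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Public threads are crawlable; restricted paths are not.
print(rp.can_fetch("*", "https://www.boards.ie/discussion/123/example"))  # True
print(rp.can_fetch("*", "https://www.boards.ie/messages/inbox"))          # False
```

Gating every scraper request through a check like this keeps the crawler inside the boundaries the site operator has published.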
\paragraph{YouTube (Data API v3)}
YouTube is supported via the official YouTube Data API v3, provided by Google. The API exposes structured endpoints for querying videos, comments, channels, and playlists, making it well-suited for collecting public discourse around specific topics or keywords.

Authentication is handled through an API key issued via the Google Cloud Console. The API enforces a quota system rather than a traditional rate limit: each project is allocated 10,000 quota units per day by default, with different operations consuming different amounts.

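Under the default allocation, simple quota arithmetic determines daily collection capacity. The per-operation costs below reflect Google's documented pricing at the time of writing (search is expensive, most list calls cost one unit), but should be verified against the current API documentation:

```python
# Illustrative quota costs per operation, in quota units.
DAILY_QUOTA = 10_000
COST = {"search.list": 100, "commentThreads.list": 1}

def max_daily_calls(operation):
    """How many calls of one operation fit in the default daily quota."""
    return DAILY_QUOTA // COST[operation]

print(max_daily_calls("search.list"))          # 100
print(max_daily_calls("commentThreads.list"))  # 10000
```

In practice this means keyword searches, not comment retrieval, are the binding constraint on how much YouTube data can be collected per day.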
In addition, comment retrieval can be disabled by the video uploader, so the system will handle this case by skipping videos where comments are not accessible.

\subsubsection{Data Storage \& Retention}
All data fetched from social media sites is stored locally in a PostgreSQL database. The system will not share or expose any of this data to third parties beyond the users of this application. Raw API responses are discarded once the relevant information has been extracted.

Each dataset is associated with one and only one user account, and users themselves are responsible for uploading or fetching the data, analysing it, and deleting it when they are done. The system will not retain any data beyond what is necessary for the end-user to carry out their analysis, and users can delete their datasets at any time.

The system will not store any personally identifiable information beyond what is necessary for the analysis, which includes only usernames and timestamps. The system will not attempt to de-anonymise content creators or link data across platforms.

\subsubsection{User Security}
Standard security practices will be followed to protect user data and prevent unauthorised access. This includes:
\begin{itemize}
\item Hashing all user passwords; plaintext passwords are never stored.
\item Using JWTs for session management, with secure signing and a 24-hour expiration time.
\item Enforcing access control on all analysis API endpoints so that end-users can only access their own datasets and results.
\item Using parameterised queries for all database interactions to prevent SQL injection attacks.
\end{itemize}
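The password-hashing point can be sketched with Python's standard-library PBKDF2; the iteration count is an illustrative choice, and a production deployment might prefer a dedicated library:

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor

def hash_password(password, salt=None):
    """Derive a salted PBKDF2 hash; the plaintext is never stored."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, digest):
    """Re-derive and compare in constant time against the stored hash."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong-guess", salt, digest))                   # False
```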

\subsection{Requirements}

The following requirements are derived from the backend architecture, the NLP processing pipeline, and the React-based frontend interface.
\item The dataset reset functionality shall preserve data integrity.
\end{itemize}

\newpage
\section{Design}
\subsection{System Architecture}
\label{fig:schema}
\end{figure}

\subsection{Data Pipeline}
As this project is focused on the collection and analysis of online community data, the primary component that must be well-designed is the data pipeline, which encompasses data ingestion, normalisation, enrichment, storage, and retrieval for analysis.

A unified data model is used to represent all incoming data, regardless of its original source or structure. This ensures that the same pipeline works across YouTube, Reddit, and Boards.ie data, and can easily be extended to new sources in the future.

\subsection{Connector Abstraction}
While the system is designed around a Cork-based dataset, it is intentionally source-agnostic, meaning that additional data sources can be added in the future without changes to the core analytical pipeline.

\textbf{Data Connectors} are components responsible for fetching and normalising data from specific sources. Each connector implements a standard interface for data retrieval, such as:
\begin{itemize}
\item \texttt{get\_new\_posts()} --- retrieves raw data from the source, either through API calls or web scraping.
\end{itemize}

Creating a base interface for what a connector should look like allows for the easy addition of new data sources in the future. For example, if a new social media platform becomes popular, a new connector can be implemented to fetch data from that platform without needing to modify the existing data pipeline or analytical modules.
The connector registry is designed so that any new connector implementing \texttt{BaseConnector} is automatically discovered and registered at runtime, without requiring changes to any existing code. This allows for a modular and extensible architecture where new data sources can be integrated with minimal effort.

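A minimal sketch of such automatic registration, where simply defining a subclass of \texttt{BaseConnector} is enough to register it; beyond \texttt{BaseConnector} and \texttt{get\_new\_posts()}, the names here are assumptions:

```python
from abc import ABC, abstractmethod

# Hypothetical registry: populated as a side effect of subclass creation,
# so no existing code changes when a new source is added.
CONNECTOR_REGISTRY = {}

class BaseConnector(ABC):
    source_name = None  # each concrete connector declares its source

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        CONNECTOR_REGISTRY[cls.source_name] = cls  # auto-discovery at import time

    @abstractmethod
    def get_new_posts(self):
        """Retrieve raw data from the source (API call or scrape)."""

class RedditConnector(BaseConnector):
    source_name = "reddit"

    def get_new_posts(self):
        return []  # a real implementation would call the Reddit API

print("reddit" in CONNECTOR_REGISTRY)  # True
```

Because `__init_subclass__` runs when each subclass is defined, merely importing a new connector module is enough to make it available to the pipeline.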
\subsection{Client-Server Architecture}
The system will follow a client-server architecture, with a Flask-based backend API and a React-based frontend interface. The backend will handle data processing, NLP analysis, and database interactions, while the frontend will provide an interactive user interface for data exploration and visualisation.