Compare commits

Comparing c1e5482f55...main: 3 commits

| Author | SHA1 | Date |
|---|---|---|
| | 5970f555fa | |
| | 9b7a51ff33 | |
| | 2d39ea6e66 | |
.gitignore (vendored): 1 change
```diff
@@ -13,3 +13,4 @@ dist/
 helper
 db
 report/build
+.DS_Store
```
README.md: 60 changes
````diff
@@ -1,29 +1,49 @@
 # crosspost
 
-**crosspost** is a browser-based tool designed to support *digital ethnography*, the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities.
+A web-based analytics platform for exploring online communities. Built as a final-year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.
+
+The motivating use case is digital ethnography — studying how people talk, what they talk about, and how culture forms in online spaces. The included dataset is centred on Cork, Ireland.
 
-The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
+## What it does
+
+- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
+- Normalise everything into a unified schema regardless of source
+- Run NLP analysis asynchronously in the background via Celery workers
+- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
+- Multi-user support — each user has their own datasets, isolated from everyone else
 
-By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
+# Prerequisites
+
+- Docker & Docker Compose
+- A Reddit App (client id & secret)
+- A YouTube Data v3 API key
 
-## Goals for this project
+# Setup
 
-- Collect data ethically: enable users to link or upload text, images, and interaction data (messages etc.) from specified online communities. Potentially, an automated method for importing (using APIs or scraping techniques) could be included as well.
-- Organise content: store gathered material in a structured database with tagging for themes, dates, and sources.
-- Analyse patterns: use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
-- Visualise insights: present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
-- Have clearly stated and explained ethical and privacy guidelines for users. The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard.
+1) **Clone the repo**
+
+```
+git clone https://github.com/your-username/crosspost.git
+cd crosspost
+```
 
-Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research — turning raw online interactions into meaningful cultural insights.
+2) **Configure environment variables**
+
+```
+cp example.env .env
+```
+
+Fill in each required empty variable. Some are already filled in; these are sensible defaults that usually don't need to be changed.
 
-## Scope
+3) **Start everything**
+
+```
+docker compose up -d
+```
 
-This project focuses on:
+This starts:
 
-- Designing a modular data ingestion pipeline
-- Implementing backend data processing and storage
-- Integrating lightweight NLP-based analysis
-- Building a simple, accessible frontend for exploration and visualisation
+- `crosspost_db` — PostgreSQL on port 5432
+- `crosspost_redis` — Redis on port 6379
+- `crosspost_flask` — Flask API on port 5000
+- `crosspost_worker` — Celery worker for background NLP/fetching tasks
+- `crosspost_frontend` — Vite dev server on port 5173
 
-# Requirements
+# Data Format for Manual Uploads
+
+If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (.jsonl), where each line is a post object:
+
+```json
+{"id": "abc123", "author": "username", "title": "Post title", "content": "Post body", "url": "https://...", "timestamp": 1700000000.0, "source": "reddit", "comments": []}
+```
 
-- **Python** ≥ 3.9
-- **Python packages** listed in `requirements.txt`
-- npm ≥ version 11
+# Notes
+
+- **GPU support**: the Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the `deploy.resources` block in `docker-compose.yml` and make sure the NVIDIA Container Toolkit is installed.
````
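The new README's manual-upload format (one JSON post object per line) can be checked with a short parser. This is an illustrative sketch, not code from the repository: the required field set is taken from the README's example object, and the server may enforce a different one.

```python
import json

# Fields from the README's example post object; the set the server actually
# enforces may differ -- this validator is a hypothetical sketch.
REQUIRED_FIELDS = {"id", "author", "title", "content", "url",
                   "timestamp", "source", "comments"}

def load_jsonl_posts(text: str) -> list[dict]:
    """Parse newline-delimited JSON and check each post has the expected keys."""
    posts = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        post = json.loads(line)
        missing = REQUIRED_FIELDS - post.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        posts.append(post)
    return posts

sample = ('{"id": "abc123", "author": "username", "title": "Post title", '
          '"content": "Post body", "url": "https://...", '
          '"timestamp": 1700000000.0, "source": "reddit", "comments": []}')
posts = load_jsonl_posts(sample)
print(posts[0]["source"])  # reddit
```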
```diff
@@ -1,8 +0,0 @@
-# Generic User Data Transfer Object for social media platforms
-class User:
-    def __init__(self, username: str, created_utc: int):
-        self.username = username
-        self.created_utc = created_utc
-
-        # Optionals
-        self.karma = None
```
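For reference, the deleted User DTO can be written as an idiomatic dataclass. This is an editorial sketch of the removed code, not something the commit adds; `karma` stays optional, as in the original.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    """Generic user DTO for social media platforms (sketch of the deleted class)."""
    username: str
    created_utc: int
    karma: Optional[int] = None  # optional, platform-specific

u = User(username="alice", created_utc=1700000000)
print(u.karma is None)  # True
```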
report/img/signature.jpg: new binary file (152 KiB, not shown)
```diff
@@ -45,9 +45,42 @@
 
 \end{titlepage}
 
+\pagenumbering{roman}
+
+\section*{Declaration of Originality}
+In signing this declaration, you are confirming, in writing, that the submitted work is entirely your own original work, except where clearly attributed otherwise, and that it has not been submitted partly or wholly for any other educational award.
+
+I hereby declare that:
+\begin{itemize}
+\item this is all my own work unless clearly indicated otherwise, with full and proper accreditation;
+\item with respect to my own work: none of it has been submitted at any educational institution contributing in any way to an educational award;
+\item with respect to another's work: all text, diagrams, code, or ideas, whether verbatim, paraphrased, or otherwise modified or adapted, have been duly attributed to the source in a scholarly manner, whether from books, papers, lecture notes or any other student's work, whether published or unpublished, electronically or in print.
+\end{itemize}
+
+\vspace{0.5cm}
+\noindent Signed: \raisebox{-0.8cm}{\includegraphics[width=4cm]{img/signature.jpg}} \\[1.2cm]
+\noindent Date: 18 April 2026
+
+\newpage
+\section*{Acknowledgements}
+I would like to thank my supervisor, Paolo Palmieri, for his guidance and support throughout this project.
+
+I would also like to thank Mastoureh Fathi, Pooya Ghoddousi, and Martino Zibetti on the MIGDIS project for taking the time to provide valuable feedback on the project and suggestions for future work.
+
+\newpage
+\section*{Abstract}
+Online communities generate vast volumes of discourse that traditional ethnographic methods cannot analyse at scale. This project presents \textbf{Crosspost}, a web-based platform that applies computational methods to the study of online communities, bridging quantitative data analysis and qualitative digital ethnography.
+
+The system aggregates public discussion data from multiple social media platforms, enriching it with Natural Language Processing techniques including emotion classification, topic modelling, and named entity recognition. Six analytical perspectives (temporal, linguistic, emotional, user, interactional, and cultural) are analysed through an interactive dashboard, allowing researchers to explore community behaviour, identity signals, and affective tone across large datasets without sacrificing access to the underlying posts.
+
+The platform is evaluated against a Cork-specific dataset spanning Reddit, YouTube, and Boards.ie, showing the system's ability to generate ethnographic insights such as geographic identity, civic sentiment, and participation inequality across different online communities.
+
+\newpage
 \tableofcontents
 \newpage
 
+\pagenumbering{arabic}
+
 \section{Introduction}
 This project presents the design and implementation of a web-based analytics engine for the exploration and analysis of online discussion data. Built using \textbf{Flask and Pandas}, and supplemented with \textbf{Natural Language Processing} (NLP) techniques, the system provides an API for extracting structural, temporal, linguistic, and emotional insights from social media posts. A web-based frontend delivers interactive visualizations. The backend architecture implements an analytical pipeline for the data, including data parsing, manipulation and analysis.
 
```

```diff
@@ -1322,7 +1355,7 @@ The analytical scope is the project's most visible limitation. Six analytical an
 
 Planning the project was a challenge, as generally I tend to work iteratively. I jump in and start building straight away, and I find that the process of building helps me to figure out what I actually want to build. This led to some awkward parts in the report where design and implementation often overlapped and were made in a non-linear fashion. Creating the design section was difficult when implementation had already started, and design was still changed throughout the implementation process.
 
-On a personal level, the project was a significant learning experience in terms of time management and project planning. The planning and implementation of the project was ambitious but easy to get carried away with, and I found myself spending a lot of time on features that were not essential to the core functionality of the system. The implementation was felt productive and visible in a way that the writing of a report was not, I found myself spending more time on the implementation than the report, and the report was pushed to the sidelines until the end of the project.
+On a personal level, the project was a significant learning experience in terms of time management and project planning. The planning and implementation of the project were ambitious and easy to get carried away with, and I found myself spending a lot of time on features that were not essential to the core functionality of the system. The implementation felt productive and visible in a way that writing the report was not, so I found myself spending more time on the implementation than the report, and the report was pushed to the sidelines until the end of the project.
 
 \subsection{How the project was conducted}
 \begin{figure}[!h]
```
```diff
@@ -1,21 +1,18 @@
 from abc import ABC, abstractmethod
 from dto.post import Post
+import os
 
 
 class BaseConnector(ABC):
-    # Each subclass declares these at the class level
-    source_name: str  # machine readable
-    display_name: str  # human readablee
-    required_env: list[str] = []
+    source_name: str  # machine-readable: "reddit", "youtube"
+    display_name: str  # human-readable: "Reddit", "YouTube"
+    required_env: list[str] = []  # env vars needed to activate
 
     search_enabled: bool
     categories_enabled: bool
 
     @classmethod
     def is_available(cls) -> bool:
-        """Returns True if all required env vars are set."""
-        import os
-
         return all(os.getenv(var) for var in cls.required_env)
 
     @abstractmethod
```
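The availability check in `BaseConnector` simply verifies that every environment variable named in `required_env` is set and non-empty. A minimal runnable sketch of the pattern, with a hypothetical `RedditConnector` subclass (the real connector's env var names may differ):

```python
import os
from abc import ABC

class BaseConnector(ABC):
    # Each subclass declares these at the class level
    source_name: str
    display_name: str
    required_env: list[str] = []  # env vars needed to activate

    @classmethod
    def is_available(cls) -> bool:
        # Available only when every required env var is set and non-empty.
        return all(os.getenv(var) for var in cls.required_env)

# Hypothetical subclass for illustration only.
class RedditConnector(BaseConnector):
    source_name = "reddit"
    display_name = "Reddit"
    required_env = ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET"]

os.environ["REDDIT_CLIENT_ID"] = "example-id"
os.environ["REDDIT_CLIENT_SECRET"] = "example-secret"
print(RedditConnector.is_available())  # True
```

With no `required_env` entries, `all()` over an empty list is vacuously `True`, so a connector with no credentials is always available by default.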
|