Compare commits: ee9c7b4ab2...v1.0 (13 commits)
| SHA1 |
|---|
| 5970f555fa |
| 9b7a51ff33 |
| 2d39ea6e66 |
| c1e5482f55 |
| b2d7f6edaf |
| 10efa664df |
| 3db7c1d3ae |
| 72e17e900e |
| 7b9a17f395 |
| 0a396dd504 |
| c6e8144116 |
| 760d2daf7f |
| ca38b992eb |
1
.gitignore
vendored
@@ -13,3 +13,4 @@ dist/
|
|||||||
helper
|
helper
|
||||||
db
|
db
|
||||||
report/build
|
report/build
|
||||||
|
.DS_Store
|
||||||
60
README.md
@@ -1,29 +1,49 @@
|
|||||||
# crosspost
|
# crosspost
|
||||||
**crosspost** is a browser-based tool designed to support *digital ethnography*, the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities.
|
A web-based analytics platform for exploring online communities. Built as a final year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.
|
||||||
|
The motivating use case is digital ethnography — studying how people talk, what they talk about, and how culture forms in online spaces. The included dataset is centred on Cork, Ireland.
|
||||||
|
|
||||||
The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
|
## What it does
|
||||||
|
- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
|
||||||
|
- Normalise everything into a unified schema regardless of source
|
||||||
|
- Run NLP analysis asynchronously in the background via Celery workers
|
||||||
|
- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
|
||||||
|
- Multi-user support — each user has their own datasets, isolated from everyone else
|
||||||
|
|
||||||
By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
|
# Prerequisites
|
||||||
|
- Docker & Docker Compose
|
||||||
|
- A Reddit App (client id & secret)
|
||||||
|
- YouTube Data API v3 key
|
||||||
|
|
||||||
## Goals for this project
|
# Setup
|
||||||
- Collect data ethically: enable users to link/upload text, images, and interaction data (messages etc.) from specified online communities. Potentially, an automated method for importing (using APIs or scraping techniques) could be included as well.
|
1) **Clone the Repo**
|
||||||
- Organise content: Store gathered material in a structured database with tagging for themes, dates, and sources.
|
```
|
||||||
Analyse patterns: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
|
git clone https://github.com/your-username/crosspost.git
|
||||||
- Visualise insights: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
|
cd crosspost
|
||||||
- Have clearly stated and explained ethical and privacy guidelines for users. The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard.
|
```
|
||||||
|
|
||||||
Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research — turning raw online interactions into meaningful cultural insights.
|
2) **Configure Environment Vars**
|
||||||
|
```
|
||||||
|
cp example.env .env
|
||||||
|
```
|
||||||
|
Fill in each required empty variable. Some are already filled in; these are sensible defaults that usually don't need to be changed.
|
||||||
|
|
||||||
## Scope
|
3) **Start everything**
|
||||||
|
```
|
||||||
|
docker compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
This project focuses on:
|
This starts:
|
||||||
- Designing a modular data ingestion pipeline
|
- `crosspost_db` — PostgreSQL on port 5432
|
||||||
- Implementing backend data processing and storage
|
- `crosspost_redis` — Redis on port 6379
|
||||||
- Integrating lightweight NLP-based analysis
|
- `crosspost_flask` — Flask API on port 5000
|
||||||
- Building a simple, accessible frontend for exploration and visualisation
|
- `crosspost_worker` — Celery worker for background NLP/fetching tasks
|
||||||
|
- `crosspost_frontend` — Vite dev server on port 5173
|
||||||
|
|
||||||
# Requirements
|
# Data Format for Manual Uploads
|
||||||
|
If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (.jsonl) where each line is a post object:
|
||||||
|
```json
|
||||||
|
{"id": "abc123", "author": "username", "title": "Post title", "content": "Post body", "url": "https://...", "timestamp": 1700000000.0, "source": "reddit", "comments": []}
|
||||||
|
```
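A minimal sketch of how a file in this format could be produced and read back, assuming every field shown in the example line is required (the README does not say which fields are optional). The helper names `to_jsonl` and `from_jsonl` are hypothetical and are not part of crosspost.

```python
import json

# Field names copied from the README example. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = (
    "id", "author", "title", "content",
    "url", "timestamp", "source", "comments",
)


def to_jsonl(posts):
    """Serialise post dicts to newline-delimited JSON, one object per line."""
    lines = []
    for index, post in enumerate(posts):
        missing = [field for field in REQUIRED_FIELDS if field not in post]
        if missing:
            raise ValueError(f"post {index} is missing fields: {missing}")
        # json.dumps escapes newlines inside strings, so each post
        # always occupies exactly one line.
        lines.append(json.dumps(post))
    return "".join(line + "\n" for line in lines)


def from_jsonl(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Writing the returned string to a file with a `.jsonl` extension should give something the upload path can accept, provided the assumption about required fields holds.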
|
||||||
|
|
||||||
- **Python** ≥ 3.9
|
# Notes
|
||||||
- **Python packages** listed in `requirements.txt`
|
- **GPU support**: The Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the deploy.resources block in docker-compose.yml and make sure the NVIDIA Container Toolkit is installed.
|
||||||
- npm ≥ version 11
|
|
||||||
@@ -28,7 +28,7 @@ services:
|
|||||||
- .env
|
- .env
|
||||||
ports:
|
ports:
|
||||||
- "5000:5000"
|
- "5000:5000"
|
||||||
command: flask --app server.app run --host=0.0.0.0 --debug
|
command: gunicorn server.app:app --bind 0.0.0.0:5000 --workers 2 --threads 4
|
||||||
depends_on:
|
depends_on:
|
||||||
- postgres
|
- postgres
|
||||||
- redis
|
- redis
|
||||||
@@ -48,13 +48,13 @@ services:
|
|||||||
depends_on:
|
depends_on:
|
||||||
- postgres
|
- postgres
|
||||||
- redis
|
- redis
|
||||||
# deploy:
|
deploy:
|
||||||
# resources:
|
resources:
|
||||||
# reservations:
|
reservations:
|
||||||
# devices:
|
devices:
|
||||||
# - driver: nvidia
|
- driver: nvidia
|
||||||
# count: 1
|
count: 1
|
||||||
# capabilities: [gpu]
|
capabilities: [gpu]
|
||||||
|
|
||||||
frontend:
|
frontend:
|
||||||
build:
|
build:
|
||||||
|
|||||||
@@ -1,8 +0,0 @@
|
|||||||
# Generic User Data Transfer Object for social media platforms
|
|
||||||
class User:
|
|
||||||
def __init__(self, username: str, created_utc: int, ):
|
|
||||||
self.username = username
|
|
||||||
self.created_utc = created_utc
|
|
||||||
|
|
||||||
# Optionals
|
|
||||||
self.karma = None
|
|
||||||
BIN report/img/cork_temporal.png (normal file; size after: 274 KiB)
BIN report/img/flooding_posts.png (normal file; size after: 90 KiB)
BIN (file name not shown; size before: 50 KiB, size after: 50 KiB)
BIN report/img/moods.png (normal file; size after: 16 KiB)
BIN report/img/ngrams.png (normal file; size after: 38 KiB)
BIN report/img/signature.jpg (normal file; size after: 152 KiB)
BIN report/img/topic_emotions.png (normal file; size after: 17 KiB)
525
report/main.tex
@@ -104,3 +104,46 @@
|
|||||||
pages = {183--204}
|
pages = {183--204}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@misc{cook2023ethnography,
|
||||||
|
author = {Cook, Chloe},
|
||||||
|
title = {What is the Difference Between Ethnography and Digital Ethnography?},
|
||||||
|
year = {2023},
|
||||||
|
month = jan,
|
||||||
|
day = {19},
|
||||||
|
howpublished = {\url{https://ethosapp.com/blog/what-is-the-difference-between-ethnography-and-digital-ethnography/}},
|
||||||
|
note = {Accessed: 2026-04-16},
|
||||||
|
organization = {EthOS}
|
||||||
|
}
|
||||||
|
|
||||||
|
@misc{giuffre2026sentiment,
|
||||||
|
author = {Giuffre, Steven},
|
||||||
|
title = {What is Sentiment Analysis?},
|
||||||
|
year = {2026},
|
||||||
|
month = mar,
|
||||||
|
howpublished = {\url{https://www.vonage.com/resources/articles/sentiment-analysis/}},
|
||||||
|
note = {Accessed: 2026-04-16},
|
||||||
|
organization = {Vonage}
|
||||||
|
}
|
||||||
|
|
||||||
|
@misc{mungalpara2022stemming,
|
||||||
|
author = {Mungalpara, Jaimin},
|
||||||
|
title = {Stemming Lemmatization Stopwords and {N}-Grams in {NLP}},
|
||||||
|
year = {2022},
|
||||||
|
month = jul,
|
||||||
|
day = {26},
|
||||||
|
howpublished = {\url{https://jaimin-ml2001.medium.com/stemming-lemmatization-stopwords-and-n-grams-in-nlp-96f8e8b6aa6f}},
|
||||||
|
note = {Accessed: 2026-04-16},
|
||||||
|
organization = {Medium}
|
||||||
|
}
|
||||||
|
|
||||||
|
@misc{chugani2025ethicalscraping,
|
||||||
|
author = {Chugani, Vinod},
|
||||||
|
title = {Ethical Web Scraping: Principles and Practices},
|
||||||
|
year = {2025},
|
||||||
|
month = apr,
|
||||||
|
day = {21},
|
||||||
|
howpublished = {\url{https://www.datacamp.com/blog/ethical-web-scraping}},
|
||||||
|
note = {Accessed: 2026-04-16},
|
||||||
|
organization = {DataCamp}
|
||||||
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -16,3 +16,4 @@ Requests==2.32.5
|
|||||||
sentence_transformers==5.2.2
|
sentence_transformers==5.2.2
|
||||||
torch==2.10.0
|
torch==2.10.0
|
||||||
transformers==5.1.0
|
transformers==5.1.0
|
||||||
|
gunicorn==25.3.0
|
||||||
|
|||||||
@@ -1,21 +1,18 @@
|
|||||||
from abc import ABC, abstractmethod
|
from abc import ABC, abstractmethod
|
||||||
from dto.post import Post
|
from dto.post import Post
|
||||||
|
import os
|
||||||
|
|
||||||
|
|
||||||
class BaseConnector(ABC):
|
class BaseConnector(ABC):
|
||||||
# Each subclass declares these at the class level
|
source_name: str # machine readable
|
||||||
source_name: str # machine-readable: "reddit", "youtube"
|
    display_name: str # human readable
|
||||||
display_name: str # human-readable: "Reddit", "YouTube"
|
required_env: list[str] = []
|
||||||
required_env: list[str] = [] # env vars needed to activate
|
|
||||||
|
|
||||||
search_enabled: bool
|
search_enabled: bool
|
||||||
categories_enabled: bool
|
categories_enabled: bool
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def is_available(cls) -> bool:
|
def is_available(cls) -> bool:
|
||||||
"""Returns True if all required env vars are set."""
|
|
||||||
import os
|
|
||||||
|
|
||||||
return all(os.getenv(var) for var in cls.required_env)
|
return all(os.getenv(var) for var in cls.required_env)
|
||||||
|
|
||||||
@abstractmethod
|
@abstractmethod
|
||||||
|
|||||||
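For reviewers, a minimal sketch of how the `is_available` check in the hunk above behaves. `BaseConnector` here is a trimmed stand-in with the abstract methods left out, and `DemoConnector`, its environment variable names, and `available_connectors` are hypothetical names used only for illustration.

```python
import os
from abc import ABC


class BaseConnector(ABC):
    """Trimmed stand-in for the class in the diff; abstract methods omitted."""
    source_name: str
    display_name: str
    required_env: list[str] = []

    @classmethod
    def is_available(cls) -> bool:
        # all() over an empty list is True, so a connector that needs
        # no env vars is always reported as available.
        return all(os.getenv(var) for var in cls.required_env)


class DemoConnector(BaseConnector):
    # Hypothetical connector and variable names, for illustration only.
    source_name = "demo"
    display_name = "Demo"
    required_env = ["DEMO_CLIENT_ID", "DEMO_CLIENT_SECRET"]


def available_connectors(candidates):
    """Return the source_name of every connector whose env vars are set."""
    return [c.source_name for c in candidates if c.is_available()]
```

Note that a variable set to an empty string counts as missing, because `os.getenv` returns `""` and that is falsy. This matches the README's instruction to fill in each empty entry in `.env`.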
@@ -87,7 +87,7 @@ class BoardsAPI(BaseConnector):
|
|||||||
post = self._parse_thread(html, post_url)
|
post = self._parse_thread(html, post_url)
|
||||||
return post
|
return post
|
||||||
|
|
||||||
with ThreadPoolExecutor(max_workers=10) as executor:
|
with ThreadPoolExecutor(max_workers=5) as executor:
|
||||||
futures = {executor.submit(fetch_and_parse, url): url for url in urls}
|
futures = {executor.submit(fetch_and_parse, url): url for url in urls}
|
||||||
|
|
||||||
for i, future in enumerate(as_completed(futures)):
|
for i, future in enumerate(as_completed(futures)):
|
||||||
|
|||||||
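The hunk above lowers the pool size from 10 to 5 threads. A minimal sketch of the same bounded-concurrency pattern follows, with per-URL error handling added as an assumption (the diff does not show how failures are handled). `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the connector's real network call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_and_parse, max_workers=5):
    """Run fetch_and_parse over urls with at most max_workers threads."""
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so failures can be attributed.
        futures = {executor.submit(fetch_and_parse, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed fetch should not abort the whole batch.
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request and HTML parse.
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return url.upper()
```

A smaller `max_workers` value trades throughput for fewer simultaneous requests against the remote site, which is presumably the motivation for the change.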