Compare commits

...

22 Commits

Author SHA1 Message Date
5970f555fa docs(readme): update readme 2026-04-19 13:54:09 +01:00
9b7a51ff33 docs(report): add Declaration of Originality and Acknowledgements sections 2026-04-18 22:10:16 +01:00
2d39ea6e66 refactor(connector): clean up comments 2026-04-18 22:10:03 +01:00
c1e5482f55 docs(report): fix typos 2026-04-18 16:09:22 +01:00
b2d7f6edaf docs(report): add visualizations and emotional analysis for Cork dataset 2026-04-18 15:44:04 +01:00
10efa664df docs(report): fix typos and add more eval 2026-04-17 20:31:39 +01:00
3db7c1d3ae docs(report): add future work section 2026-04-16 16:54:18 +01:00
72e17e900e fix(report): correct typos 2026-04-16 16:41:27 +01:00
7b9a17f395 fix(connector): reduce ThreadPoolExecutor max_workers 2026-04-16 16:37:27 +01:00
0a396dd504 docs(report): add more citations 2026-04-16 16:23:36 +01:00
c6e8144116 docs(report): add traditionl vs digital ethnography reference 2026-04-16 16:08:59 +01:00
760d2daf7f docs(report): remove redundant phrasing 2026-04-16 15:59:24 +01:00
ca38b992eb build(docker): switch backend flask deployment to Gunicorn 2026-04-15 17:57:22 +01:00
ee9c7b4ab2 docs(report): finish evaluation & reflection 2026-04-15 17:52:54 +01:00
703a7c435c fix(youtube_api): video search capped at 50 2026-04-14 17:54:43 +01:00
02ba727d05 chore(connector): add buffer to ratelimit reset 2026-04-14 17:41:09 +01:00
76591bc89e feat(tasks): add fetch and NLP processing time logging to dataset status 2026-04-14 17:35:43 +01:00
e35e51d295 fix(reddit_api): handle rate limit wait time conversion error 2026-04-14 17:35:21 +01:00
d2fe637743 docs: update references for digital ethnography and further work on evaluation 2026-04-14 15:16:56 +01:00
e1831aab7d docs(report): add researcher feedback 2026-04-13 22:00:41 +01:00
a3ef5a5655 chore: add more defaults to example env 2026-04-13 22:00:19 +01:00
5f943ce733 Merge pull request 'Corpus Explorer Feature' (#11) from feat/corpus-explorer into main
Reviewed-on: #11
2026-04-13 19:02:45 +01:00
20 changed files with 625 additions and 348 deletions

.gitignore vendored

@@ -12,4 +12,5 @@ dist/
 helper
 db
 report/build
+.DS_Store

README.md

@@ -1,29 +1,49 @@
 # crosspost
-**crosspost** is a browser-based tool designed to support *digital ethnography*, the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities.
-The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
-By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
-## Goals for this project
-- Collect data ethically: enable users to link/upload text, images, and interaction data (messages etc.) from specified online communities. Potentially an automated method for importing (using APIs or scraping techniques) could be included as well.
-- Organise content: Store gathered material in a structured database with tagging for themes, dates, and sources.
-- Analyse patterns: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
-- Visualise insights: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
-- Have clearly stated and explained ethical and privacy guidelines for users.
-The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard.
-Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research — turning raw online interactions into meaningful cultural insights.
-## Scope
-This project focuses on:
-- Designing a modular data ingestion pipeline
-- Implementing backend data processing and storage
-- Integrating lightweight NLP-based analysis
-- Building a simple, accessible frontend for exploration and visualisation
-# Requirements
-- **Python** ≥ 3.9
-- **Python packages** listed in `requirements.txt`
-- npm ≥ version 11
+A web-based analytics platform for exploring online communities. Built as a final year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.
+The motivating use case is digital ethnography — studying how people talk, what they talk about, and how culture forms in online spaces. The included dataset is centred on Cork, Ireland.
+## What it does
+- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
+- Normalise everything into a unified schema regardless of source
+- Run NLP analysis asynchronously in the background via Celery workers
+- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
+- Multi-user support — each user has their own datasets, isolated from everyone else
+# Prerequisites
+- Docker & Docker Compose
+- A Reddit app (client ID & secret)
+- YouTube Data v3 API key
+# Setup
+1) **Clone the repo**
+```
+git clone https://github.com/your-username/crosspost.git
+cd crosspost
+```
+2) **Configure environment variables**
+```
+cp example.env .env
+```
+Fill in each required empty variable. Some are already filled in; these are sensible defaults that usually don't need to be changed.
+3) **Start everything**
+```
+docker compose up -d
+```
+This starts:
+- `crosspost_db` — PostgreSQL on port 5432
+- `crosspost_redis` — Redis on port 6379
+- `crosspost_flask` — Flask API on port 5000
+- `crosspost_worker` — Celery worker for background NLP/fetching tasks
+- `crosspost_frontend` — Vite dev server on port 5173
+# Data Format for Manual Uploads
+If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (.jsonl), where each line is a post object:
+```json
+{"id": "abc123", "author": "username", "title": "Post title", "content": "Post body", "url": "https://...", "timestamp": 1700000000.0, "source": "reddit", "comments": []}
+```
+# Notes
+- **GPU support**: The Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the `deploy.resources` block in `docker-compose.yml` and make sure the NVIDIA Container Toolkit is installed.
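The upload schema in the README can be exercised with a short Python sketch. The post values and the `upload.jsonl` filename here are invented for illustration; only the field names come from the example line above.

```python
import json

# Hypothetical posts following the upload schema; values are made up.
posts = [
    {"id": "abc123", "author": "username", "title": "Post title",
     "content": "Post body", "url": "https://example.com/abc123",
     "timestamp": 1700000000.0, "source": "reddit", "comments": []},
]

# Write one JSON object per line (.jsonl)
with open("upload.jsonl", "w") as f:
    for post in posts:
        f.write(json.dumps(post) + "\n")

# Read it back: each line parses independently
with open("upload.jsonl") as f:
    parsed = [json.loads(line) for line in f]

print(parsed[0]["id"])  # abc123
```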

docker-compose.yml

@@ -28,7 +28,7 @@ services:
       - .env
     ports:
       - "5000:5000"
-    command: flask --app server.app run --host=0.0.0.0 --debug
+    command: gunicorn server.app:app --bind 0.0.0.0:5000 --workers 2 --threads 4
     depends_on:
       - postgres
       - redis
@@ -48,13 +48,13 @@ services:
     depends_on:
       - postgres
       - redis
-    # deploy:
-    #   resources:
-    #     reservations:
-    #       devices:
-    #         - driver: nvidia
-    #           count: 1
-    #           capabilities: [gpu]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
   frontend:
     build:
@@ -69,4 +69,4 @@ services:
       - backend
 volumes:
   model_cache:


@@ -1,8 +0,0 @@
-# Generic User Data Transfer Object for social media platforms
-class User:
-    def __init__(self, username: str, created_utc: int, ):
-        self.username = username
-        self.created_utc = created_utc
-        # Optionals
-        self.karma = None

example.env

@@ -4,12 +4,13 @@ REDDIT_CLIENT_ID=
 REDDIT_CLIENT_SECRET=
 # Database
-POSTGRES_USER=
-POSTGRES_PASSWORD=
-POSTGRES_DB=
-POSTGRES_HOST=
+POSTGRES_USER=postgres
+POSTGRES_PASSWORD=postgres
+POSTGRES_DB=mydatabase
+POSTGRES_HOST=postgres
 POSTGRES_PORT=5432
-POSTGRES_DIR=
+POSTGRES_DIR=./db
 # JWT
 JWT_SECRET_KEY=
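The defaults above are plain KEY=VALUE pairs. A minimal sketch of how such a file is parsed (assumed rules: one KEY=VALUE per line, lines starting with `#` are comments; the project itself loads `.env` via Docker Compose, not this code):

```python
# Minimal .env-style parser, assuming KEY=VALUE lines and '#' comments.
# This only illustrates how the defaults in the diff become settings.
env_text = """# Database
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=mydatabase
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DIR=./db
"""

settings = {}
for line in env_text.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue  # skip blanks and comment lines
    key, _, value = line.partition("=")
    settings[key] = value

print(settings["POSTGRES_HOST"])  # postgres
```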

New image files (binary; previews not shown):
- report/img/gantt.png (50 KiB)
- report/img/moods.png (16 KiB)
- report/img/ngrams.png (38 KiB)
- report/img/signature.jpg (152 KiB)
- three further images whose filenames are not shown (274 KiB, 90 KiB, and 17 KiB)

File diff suppressed because it is too large.


@@ -40,3 +40,110 @@
title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
year = {2020}
}
@article{dominguez2007virtual,
author = {Domínguez, Daniel and Beaulieu, Anne and Estalella, Adolfo and Gómez, Edgar and Schnettler, Bernt and Read, Rosie},
title = {Virtual Ethnography},
journal = {Forum Qualitative Sozialforschung / Forum: Qualitative Social Research},
year = {2007},
volume = {8},
number = {3},
url = {http://nbn-resolving.de/urn:nbn:de:0114-fqs0703E19}
}
@article{sun2014lurkers,
author = {Sun, Na and Rau, Pei-Luen Patrick and Ma, Liang},
title = {Understanding Lurkers in Online Communities: A Literature Review},
journal = {Computers in Human Behavior},
year = {2014},
volume = {38},
pages = {110--117},
doi = {10.1016/j.chb.2014.05.022}
}
@article{ahmad2024sentiment,
author = {Ahmad, Waqar and others},
title = {Recent Advancements and Challenges of NLP-based Sentiment Analysis: A State-of-the-art Review},
journal = {Natural Language Processing Journal},
year = {2024},
doi = {10.1016/j.nlp.2024.100059}
}
@article{coleman2010ethnographic,
ISSN = {00846570},
URL = {http://www.jstor.org/stable/25735124},
abstract = {This review surveys and divides the ethnographic corpus on digital media into three broad but overlapping categories: the cultural politics of digital media, the vernacular cultures of digital media, and the prosaics of digital media. Engaging these three categories of scholarship on digital media, I consider how ethnographers are exploring the complex relationships between the local practices and global implications of digital media, their materiality and politics, and their banal, as well as profound, presence in cultural life and modes of communication. I consider the way these media have become central to the articulation of cherished beliefs, ritual practices, and modes of being in the world; the fact that digital media culturally matters is undeniable but showing how, where, and why it matters is necessary to push against peculiarly narrow presumptions about the universality of digital experience.},
author = {E. Gabriella Coleman},
journal = {Annual Review of Anthropology},
pages = {487--505},
publisher = {Annual Reviews},
title = {Ethnographic Approaches to Digital Media},
urldate = {2026-04-15},
volume = {39},
year = {2010}
}
@article{shen2021stance,
author = {Shen, Qian and Tao, Yating},
title = {Stance Markers in {English} Medical Research Articles and Newspaper Opinion Columns: A Comparative Corpus-Based Study},
journal = {PLOS ONE},
volume = {16},
number = {3},
pages = {e0247981},
year = {2021},
doi = {10.1371/journal.pone.0247981}
}
@incollection{medvedev2019anatomy,
author = {Medvedev, Alexey N. and Lambiotte, Renaud and Delvenne, Jean-Charles},
title = {The Anatomy of Reddit: An Overview of Academic Research},
booktitle = {Dynamics On and Of Complex Networks III},
series = {Springer Proceedings in Complexity},
publisher = {Springer},
year = {2019},
pages = {183--204}
}
@misc{cook2023ethnography,
author = {Cook, Chloe},
title = {What is the Difference Between Ethnography and Digital Ethnography?},
year = {2023},
month = jan,
day = {19},
howpublished = {\url{https://ethosapp.com/blog/what-is-the-difference-between-ethnography-and-digital-ethnography/}},
note = {Accessed: 2026-04-16},
organization = {EthOS}
}
@misc{giuffre2026sentiment,
author = {Giuffre, Steven},
title = {What is Sentiment Analysis?},
year = {2026},
month = mar,
howpublished = {\url{https://www.vonage.com/resources/articles/sentiment-analysis/}},
note = {Accessed: 2026-04-16},
organization = {Vonage}
}
@misc{mungalpara2022stemming,
author = {Mungalpara, Jaimin},
title = {Stemming Lemmatization Stopwords and {N}-Grams in {NLP}},
year = {2022},
month = jul,
day = {26},
howpublished = {\url{https://jaimin-ml2001.medium.com/stemming-lemmatization-stopwords-and-n-grams-in-nlp-96f8e8b6aa6f}},
note = {Accessed: 2026-04-16},
organization = {Medium}
}
@misc{chugani2025ethicalscraping,
author = {Chugani, Vinod},
title = {Ethical Web Scraping: Principles and Practices},
year = {2025},
month = apr,
day = {21},
howpublished = {\url{https://www.datacamp.com/blog/ethical-web-scraping}},
note = {Accessed: 2026-04-16},
organization = {DataCamp}
}

requirements.txt

@@ -16,3 +16,4 @@ Requests==2.32.5
 sentence_transformers==5.2.2
 torch==2.10.0
 transformers==5.1.0
+gunicorn==25.3.0

server/connectors/base.py

@@ -1,21 +1,18 @@
 from abc import ABC, abstractmethod
 from dto.post import Post
+import os
 class BaseConnector(ABC):
-    # Each subclass declares these at the class level
-    source_name: str  # machine-readable: "reddit", "youtube"
-    display_name: str  # human-readable: "Reddit", "YouTube"
-    required_env: list[str] = []  # env vars needed to activate
+    source_name: str  # machine readable
+    display_name: str  # human readable
+    required_env: list[str] = []
     search_enabled: bool
     categories_enabled: bool
     @classmethod
     def is_available(cls) -> bool:
-        """Returns True if all required env vars are set."""
-        import os
         return all(os.getenv(var) for var in cls.required_env)
     @abstractmethod
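The `is_available` check in this diff is self-contained enough to demo. In this sketch, `FakeConnector` and the `FAKE_API_KEY` variable are invented for illustration:

```python
import os
from abc import ABC

# A connector is considered available only when every env var it
# declares in required_env is set and non-empty.
class BaseConnector(ABC):
    source_name: str
    required_env: list[str] = []

    @classmethod
    def is_available(cls) -> bool:
        return all(os.getenv(var) for var in cls.required_env)

class FakeConnector(BaseConnector):  # hypothetical subclass
    source_name = "fake"
    required_env = ["FAKE_API_KEY"]

os.environ.pop("FAKE_API_KEY", None)
print(FakeConnector.is_available())  # False
os.environ["FAKE_API_KEY"] = "secret"
print(FakeConnector.is_available())  # True
```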


@@ -87,7 +87,7 @@ class BoardsAPI(BaseConnector):
             post = self._parse_thread(html, post_url)
             return post
-        with ThreadPoolExecutor(max_workers=10) as executor:
+        with ThreadPoolExecutor(max_workers=5) as executor:
             futures = {executor.submit(fetch_and_parse, url): url for url in urls}
             for i, future in enumerate(as_completed(futures)):
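The hunk above bounds concurrent page fetches with a worker pool. The same pattern runs standalone with a faked `fetch_and_parse` (the URLs and return values here are invented):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_and_parse(url):
    # Stand-in for the real HTTP fetch + HTML parse
    return f"parsed:{url}"

urls = [f"https://example.com/thread/{i}" for i in range(8)]

posts = []
# max_workers caps how many fetches run at once; the diff lowers
# it from 10 to 5 to put less load on the target site.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(fetch_and_parse, url): url for url in urls}
    for future in as_completed(futures):
        posts.append(future.result())

print(len(posts))  # 8
```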


@@ -232,7 +232,11 @@ class RedditAPI(BaseConnector):
             )
             if response.status_code == 429:
-                wait_time = response.headers.get("X-Ratelimit-Reset", backoff)
+                try:
+                    wait_time = int(response.headers.get("X-Ratelimit-Reset", backoff))
+                    wait_time += 1  # Add a small buffer to ensure the rate limit has reset
+                except ValueError:
+                    wait_time = backoff
                 logger.warning(
                     f"Rate limited by Reddit API. Retrying in {wait_time} seconds..."
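This fix guards against a non-numeric `X-Ratelimit-Reset` header. Pulled out into a standalone helper for clarity (the name `compute_wait` is invented; the logic mirrors the diff):

```python
def compute_wait(headers: dict, backoff: int) -> int:
    # Parse the reset header if it is numeric and add a one-second
    # buffer; otherwise fall back to the caller-supplied backoff.
    try:
        wait_time = int(headers.get("X-Ratelimit-Reset", backoff))
        wait_time += 1  # buffer so the window has definitely reset
    except ValueError:  # header present but not numeric
        wait_time = backoff
    return wait_time

print(compute_wait({"X-Ratelimit-Reset": "30"}, 5))    # 31
print(compute_wait({"X-Ratelimit-Reset": "soon"}, 5))  # 5
print(compute_wait({}, 5))                             # 6 (missing header: backoff + buffer)
```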


@@ -1,5 +1,6 @@
 import os
 import datetime
+import logging
 from dotenv import load_dotenv
 from googleapiclient.discovery import build
@@ -9,9 +10,11 @@ from dto.comment import Comment
 from server.connectors.base import BaseConnector
 load_dotenv()
 API_KEY = os.getenv("YOUTUBE_API_KEY")
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
 class YouTubeAPI(BaseConnector):
     source_name: str = "youtube"
@@ -77,11 +80,30 @@
         return True
     def _search_videos(self, query, limit):
-        request = self.youtube.search().list(
-            q=query, part="snippet", type="video", maxResults=limit
-        )
-        response = request.execute()
-        return response.get("items", [])
+        results = []
+        next_page_token = None
+        while len(results) < limit:
+            batch_size = min(50, limit - len(results))
+            request = self.youtube.search().list(
+                q=query,
+                part="snippet",
+                type="video",
+                maxResults=batch_size,
+                pageToken=next_page_token
+            )
+            response = request.execute()
+            results.extend(response.get("items", []))
+            logging.info(f"Fetched {len(results)} out of {limit} videos for query '{query}'")
+            next_page_token = response.get("nextPageToken")
+            if not next_page_token:
+                logging.warning(f"No more pages of results available for query '{query}'")
+                break
+        return results[:limit]
     def _get_video_comments(self, video_id):
         request = self.youtube.commentThreads().list(
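The rewrite works around the search endpoint's 50-results-per-page cap by following `nextPageToken`. The loop logic can be exercised against a fake paged source (everything below is invented for illustration; the real code calls `self.youtube.search().list`):

```python
def fetch_page(page_token, batch_size):
    # Fake API: 120 items total, integer page tokens.
    start = page_token or 0
    items = list(range(start, min(start + batch_size, 120)))
    next_token = start + batch_size if start + batch_size < 120 else None
    return items, next_token

def search(limit):
    results, token = [], None
    while len(results) < limit:
        batch = min(50, limit - len(results))  # per-page cap of 50
        items, token = fetch_page(token, batch)
        results.extend(items)
        if token is None:  # no more pages available
            break
    return results[:limit]

print(len(search(75)))   # 75  (two pages: 50 + 25)
print(len(search(200)))  # 120 (source exhausted early)
```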


@@ -1,3 +1,5 @@
+from time import time
 import pandas as pd
 import logging
@@ -46,6 +48,7 @@ def fetch_and_process_dataset(
     try:
         for metadata in source_info:
+            fetch_start = time()
             name = metadata["name"]
             search = metadata.get("search")
             category = metadata.get("category")
@@ -57,8 +60,11 @@
             )
             posts.extend(post.to_dict() for post in raw_posts)
+            fetch_time = time() - fetch_start
         df = pd.DataFrame(posts)
+        nlp_start = time()
         dataset_manager.set_dataset_status(
             dataset_id, "processing", "NLP Processing Started"
         )
@@ -66,9 +72,11 @@
         processor = DatasetEnrichment(df, topics)
         enriched_df = processor.enrich()
+        nlp_time = time() - nlp_start
         dataset_manager.save_dataset_content(dataset_id, enriched_df)
         dataset_manager.set_dataset_status(
-            dataset_id, "complete", "NLP Processing Completed Successfully"
+            dataset_id, "complete", f"Completed Successfully. Fetch time: {fetch_time:.2f}s, NLP time: {nlp_time:.2f}s"
         )
     except Exception as e:
         dataset_manager.set_dataset_status(
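The timing change is a plain record-and-subtract pattern around each stage: snapshot `time()` before a stage and subtract after. A standalone sketch (the `timed` helper and the toy stages are invented for illustration):

```python
from time import time

def timed(fn, *args):
    # Run fn and return (result, elapsed seconds): the same
    # start/stop bookkeeping the task adds around fetch and NLP.
    start = time()
    result = fn(*args)
    return result, time() - start

posts, fetch_time = timed(lambda: [{"id": 1}, {"id": 2}])                # stand-in fetch
enriched, nlp_time = timed(lambda xs: [x["id"] * 2 for x in xs], posts)  # stand-in NLP
status = f"Completed Successfully. Fetch time: {fetch_time:.2f}s, NLP time: {nlp_time:.2f}s"
print(status)
```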