Compare commits: 37d08c63b8...v1.0 (60 commits)
| SHA1 |
|---|
| 5970f555fa |
| 9b7a51ff33 |
| 2d39ea6e66 |
| c1e5482f55 |
| b2d7f6edaf |
| 10efa664df |
| 3db7c1d3ae |
| 72e17e900e |
| 7b9a17f395 |
| 0a396dd504 |
| c6e8144116 |
| 760d2daf7f |
| ca38b992eb |
| ee9c7b4ab2 |
| 703a7c435c |
| 02ba727d05 |
| 76591bc89e |
| e35e51d295 |
| d2fe637743 |
| e1831aab7d |
| a3ef5a5655 |
| 5f943ce733 |
| 9964a919c3 |
| c11434344a |
| bc356848ef |
| 047427432f |
| d0d02e9ebf |
| 68342606e3 |
| afae7f42a1 |
| 4dd2721e98 |
| 99afe82464 |
| 8c44df94c0 |
| 42905cc547 |
| ec64551881 |
| e274b8295a |
| 3df6776111 |
| a347869353 |
| 8b4e13702e |
| 8fa4f3fbdf |
| c6cae040f0 |
| addc1d4087 |
| 225133a074 |
| e903e1b738 |
| 0c4dc02852 |
| 33e4291def |
| cedbce128e |
| 107dae0e95 |
| 23833e2c5b |
| f2b6917f1f |
| b57a8d3c65 |
| ac65e26eab |
| 6efa75dfe6 |
| de61e7653f |
| 98aa04256b |
| 5f81c51979 |
| 361b532766 |
| 9ef96661fc |
| 9375abded5 |
| 74ecdf238a |
| b85987e179 |
.gitignore (vendored, 1 changed line)

```diff
@@ -13,3 +13,4 @@ dist/
 helper
 db
 report/build
+.DS_Store
```
README.md (60 changed lines)

````diff
@@ -1,29 +1,49 @@
 # crosspost
 
-**crosspost** is a browser-based tool designed to support *digital ethnography*, the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities.
-
-The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
-
-By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
-
-## Goals for this project
-- Collect data ethically: enable users to link/upload text, images, and interaction data (messages etc) from specified online communities. Potentially and automated method for importing (using APIs or scraping techniques) could be included as well.
-- Organise content: Store gathered material in a structured database with tagging for themes, dates, and sources.
-Analyse patterns: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
-- Visualise insights: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
-- Have clearly stated and explained ethical and privacy guidelines for users. The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard.
-
-Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research — turning raw online interactions into meaningful cultural insights.
-
-## Scope
-This project focuses on:
-- Designing a modular data ingestion pipeline
-- Implementing backend data processing and storage
-- Integrating lightweight NLP-based analysis
-- Building a simple, accessible frontend for exploration and visualisation
-
-# Requirements
-
-- **Python** ≥ 3.9
-- **Python packages** listed in `requirements.txt`
-- npm ≥ version 11
+A web-based analytics platform for exploring online communities. Built as a final year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.
+
+The motivating use case is digital ethnography — studying how people talk, what they talk about, and how culture forms in online spaces. The included dataset is centred on Cork, Ireland.
+
+## What it does
+
+- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
+- Normalise everything into a unified schema regardless of source
+- Run NLP analysis asynchronously in the background via Celery workers
+- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
+- Multi-user support — each user has their own datasets, isolated from everyone else
+
+# Prerequisites
+
+- Docker & Docker Compose
+- A Reddit App (client id & secret)
+- YouTube Data v3 API Key
+
+# Setup
+
+1) **Clone the Repo**
+```
+git clone https://github.com/your-username/crosspost.git
+cd crosspost
+```
+
+2) **Configure Environment Vars**
+```
+cp example.env .env
+```
+Fill in each required empty env var. Some are already filled in; these are sensible defaults that usually don't need to be changed.
+
+3) **Start everything**
+```
+docker compose up -d
+```
+
+This starts:
+- `crosspost_db` — PostgreSQL on port 5432
+- `crosspost_redis` — Redis on port 6379
+- `crosspost_flask` — Flask API on port 5000
+- `crosspost_worker` — Celery worker for background NLP/fetching tasks
+- `crosspost_frontend` — Vite dev server on port 5173
+
+# Data Format for Manual Uploads
+
+If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (.jsonl) where each line is a post object:
+```json
+{"id": "abc123", "author": "username", "title": "Post title", "content": "Post body", "url": "https://...", "timestamp": 1700000000.0, "source": "reddit", "comments": []}
+```
+
+# Notes
+
+- **GPU support**: The Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the deploy.resources block in docker-compose.yml and make sure the NVIDIA Container Toolkit is installed.
````
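The manual-upload .jsonl format described above is easy to get subtly wrong (a stray trailing comma, a missing field). A minimal pre-upload check, as a sketch: the field names come from the README's example object, but treating all of them as required is an assumption of this snippet.

```python
import json

# Fields shown in the README's example post object; requiring all of
# them is an assumption made for this sketch.
EXPECTED_FIELDS = {"id", "author", "title", "content", "url",
                   "timestamp", "source", "comments"}

def validate_jsonl(lines):
    """Return (ok, errors) for an iterable of .jsonl lines."""
    errors = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, not flagged
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"line {n}: invalid JSON ({e.msg})")
            continue
        missing = EXPECTED_FIELDS - obj.keys()
        if missing:
            errors.append(f"line {n}: missing {sorted(missing)}")
    return (not errors, errors)
```

Running this over a file before uploading surfaces the line number of any malformed record instead of a generic server-side rejection.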
docker-compose.yml

```diff
@@ -28,7 +28,7 @@ services:
       - .env
     ports:
       - "5000:5000"
-    command: flask --app server.app run --host=0.0.0.0 --debug
+    command: gunicorn server.app:app --bind 0.0.0.0:5000 --workers 2 --threads 4
     depends_on:
       - postgres
       - redis
@@ -48,13 +48,13 @@ services:
     depends_on:
      - postgres
      - redis
-    # deploy:
-    #   resources:
-    #     reservations:
-    #       devices:
-    #         - driver: nvidia
-    #           count: 1
-    #           capabilities: [gpu]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
 
   frontend:
     build:
```
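The switch from the Flask dev server to gunicorn hard-codes 2 workers and 4 threads. Gunicorn's own docs suggest roughly (2 × CPU cores) + 1 workers as a starting point, and gunicorn honours the `WEB_CONCURRENCY` environment variable as a default worker count. A sketch of making the count follow the host instead of pinning it:

```python
import os

def suggested_workers(cpu_count: int) -> int:
    # Gunicorn documentation's rule of thumb: (2 * cores) + 1.
    return 2 * cpu_count + 1

# WEB_CONCURRENCY is a convention gunicorn itself reads; fall back to
# the rule of thumb when it is unset.
workers = int(os.environ.get("WEB_CONCURRENCY",
                             suggested_workers(os.cpu_count() or 1)))
```

Whether the fixed 2/4 split or a host-derived count is better depends on how memory-hungry the app is; this is only a tuning sketch, not the project's configuration.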
(deleted file: a generic User DTO, removed in full)

```diff
@@ -1,8 +0,0 @@
-# Generic User Data Transfer Object for social media platforms
-class User:
-    def __init__(self, username: str, created_utc: int, ):
-        self.username = username
-        self.created_utc = created_utc
-
-        # Optionals
-        self.karma = None
```
example.env (13 changed lines)

```diff
@@ -1,13 +1,16 @@
 # API Keys
 YOUTUBE_API_KEY=
+
+REDDIT_CLIENT_ID=
+REDDIT_CLIENT_SECRET=
 
 # Database
-POSTGRES_USER=
-POSTGRES_PASSWORD=
-POSTGRES_DB=
-POSTGRES_HOST=
+POSTGRES_USER=postgres
+POSTGRES_PASSWORD=postgres
+POSTGRES_DB=mydatabase
+POSTGRES_HOST=postgres
 POSTGRES_PORT=5432
-POSTGRES_DIR=
+POSTGRES_DIR=./db
 
 # JWT
 JWT_SECRET_KEY=
```
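The defaults added to example.env follow the usual KEY=value dotenv shape. A minimal reader for that shape, as a sketch (real deployments typically let Docker Compose or python-dotenv do this):

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; '#' comments and blanks are skipped."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# A few lines in the shape of the updated example.env:
defaults = parse_env("""\
# Database
POSTGRES_USER=postgres
POSTGRES_PORT=5432
POSTGRES_DIR=./db
YOUTUBE_API_KEY=
""")
```

Note that an empty value (like `YOUTUBE_API_KEY=`) parses to an empty string, which is why "fill in each required empty env var" matters: the key exists either way.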
```diff
@@ -1,4 +1,4 @@
-import { useEffect, useMemo, useState } from "react";
+import { useEffect, useState } from "react";
 import { Dialog, DialogPanel, DialogTitle } from "@headlessui/react";
 
 import StatsStyling from "../styles/stats_styling";
@@ -103,11 +103,6 @@ const CorpusExplorer = ({
     }
   }, [open, title, records.length]);
 
-  const visibleRecords = useMemo(
-    () => records.slice(0, visibleCount),
-    [records, visibleCount],
-  );
-
   const hasMoreRecords = visibleCount < records.length;
 
   return (
@@ -158,7 +153,7 @@ const CorpusExplorer = ({
             paddingRight: 4,
           }}
         >
-          {visibleRecords.map((record, index) => {
+          {records.slice(0, visibleCount).map((record, index) => {
             const recordKey = getRecordKey(record, index);
             const titleText = getRecordTitle(record);
             const content = cleanText(record.content);
```
```diff
@@ -8,11 +8,11 @@ import {
   buildHedgeSpec,
   buildIdentityBucketSpec,
   buildPermissionSpec,
-  getExplorerButtonStyle,
   type CorpusExplorerSpec,
 } from "../utils/corpusExplorer";
 
 const styles = StatsStyling;
+const exploreButtonStyle = { padding: "4px 8px", fontSize: 12 };
 
 type CulturalStatsProps = {
   data: CulturalAnalysisResponse;
@@ -22,7 +22,7 @@ type CulturalStatsProps = {
 const renderExploreButton = (onClick: () => void) => (
   <button
     onClick={onClick}
-    style={{ ...styles.buttonSecondary, ...getExplorerButtonStyle() }}
+    style={{ ...styles.buttonSecondary, ...exploreButtonStyle }}
   >
     Explore
   </button>
```
```diff
@@ -26,12 +26,12 @@ import {
   buildDateBucketSpec,
   buildOneTimeUsersSpec,
   buildUserSpec,
-  getExplorerButtonStyle,
   type CorpusExplorerSpec,
 } from "../utils/corpusExplorer";
 
 const styles = StatsStyling;
 const MAX_WORDCLOUD_WORDS = 250;
+const exploreButtonStyle = { padding: "4px 8px", fontSize: 12 };
 
 const WORDCLOUD_OPTIONS = {
   rotations: 2,
@@ -80,7 +80,7 @@ function convertFrequencyData(data: FrequencyWord[]) {
 const renderExploreButton = (onClick: () => void) => (
   <button
     onClick={onClick}
-    style={{ ...styles.buttonSecondary, ...getExplorerButtonStyle() }}
+    style={{ ...styles.buttonSecondary, ...exploreButtonStyle }}
   >
     Explore
   </button>
```
```diff
@@ -88,6 +88,15 @@ export default function UserModal({
               </div>
             </div>
           ) : null}
+
+          {userData.dominant_topic ? (
+            <div style={styles.topUserItem}>
+              <div style={styles.topUserName}>Most Common Topic</div>
+              <div style={styles.topUserMeta}>
+                {userData.dominant_topic.topic} ({userData.dominant_topic.count} events)
+              </div>
+            </div>
+          ) : null}
         </div>
       )}
     </DialogPanel>
```
```diff
@@ -20,7 +20,7 @@ type GraphLink = {
   value: number;
 };
 
-function ApiToGraphData(apiData: InteractionGraph) {
+function toGraphData(apiData: InteractionGraph) {
   const links: GraphLink[] = [];
   const connectedNodeIds = new Set<string>();
 
@@ -56,7 +56,7 @@ const UserStats = ({
   onExplore,
 }: UserStatsProps) => {
   const graphData = useMemo(
-    () => ApiToGraphData(interactionGraph),
+    () => toGraphData(interactionGraph),
     [interactionGraph],
   );
   const graphContainerRef = useRef<HTMLDivElement | null>(null);
```
```diff
@@ -66,6 +66,26 @@ const EMPTY_EXPLORER_STATE: ExplorerState = {
   error: "",
 };
 
+const createExplorerState = (
+  spec: CorpusExplorerSpec,
+  patch: Partial<ExplorerState> = {},
+): ExplorerState => ({
+  open: true,
+  title: spec.title,
+  description: spec.description,
+  emptyMessage: spec.emptyMessage ?? "No matching records found.",
+  records: [],
+  loading: false,
+  error: "",
+  ...patch,
+});
+
+const compareRecordsByNewest = (a: DatasetRecord, b: DatasetRecord) => {
+  const aValue = String(a.dt ?? a.date ?? a.timestamp ?? "");
+  const bValue = String(b.dt ?? b.date ?? b.timestamp ?? "");
+  return bValue.localeCompare(aValue);
+};
+
 const parseJsonLikePayload = (value: string): unknown => {
   const normalized = value
     .replace(/\uFEFF/g, "")
@@ -86,16 +106,23 @@ const parseJsonLikePayload = (value: string): unknown => {
   return JSON.parse(normalized);
 };
 
+const tryParseRecords = (value: string) => {
+  try {
+    return normalizeRecordPayload(parseJsonLikePayload(value));
+  } catch {
+    return null;
+  }
+};
+
 const parseRecordStringPayload = (payload: string): DatasetRecord[] | null => {
   const trimmed = payload.trim();
   if (!trimmed) {
     return [];
   }
 
-  try {
-    return normalizeRecordPayload(parseJsonLikePayload(trimmed));
-  } catch {
-    // Continue with additional fallback formats below.
-  }
+  const direct = tryParseRecords(trimmed);
+  if (direct) {
+    return direct;
+  }
 
   const ndjsonLines = trimmed
@@ -106,29 +133,24 @@ const parseRecordStringPayload = (payload: string): DatasetRecord[] | null => {
     try {
       return ndjsonLines.map((line) => parseJsonLikePayload(line)) as DatasetRecord[];
     } catch {
-      // Continue with wrapped JSON extraction.
     }
   }
 
   const bracketStart = trimmed.indexOf("[");
   const bracketEnd = trimmed.lastIndexOf("]");
   if (bracketStart !== -1 && bracketEnd > bracketStart) {
-    const candidate = trimmed.slice(bracketStart, bracketEnd + 1);
-    try {
-      return normalizeRecordPayload(parseJsonLikePayload(candidate));
-    } catch {
-      // Continue with object extraction.
-    }
+    const parsed = tryParseRecords(trimmed.slice(bracketStart, bracketEnd + 1));
+    if (parsed) {
+      return parsed;
+    }
   }
 
   const braceStart = trimmed.indexOf("{");
   const braceEnd = trimmed.lastIndexOf("}");
   if (braceStart !== -1 && braceEnd > braceStart) {
-    const candidate = trimmed.slice(braceStart, braceEnd + 1);
-    try {
-      return normalizeRecordPayload(parseJsonLikePayload(candidate));
-    } catch {
-      return null;
-    }
+    const parsed = tryParseRecords(trimmed.slice(braceStart, braceEnd + 1));
+    if (parsed) {
+      return parsed;
+    }
   }
 
@@ -316,45 +338,22 @@ const StatPage = () => {
   };
 
   const openExplorer = async (spec: CorpusExplorerSpec) => {
-    setExplorerState({
-      open: true,
-      title: spec.title,
-      description: spec.description,
-      emptyMessage: spec.emptyMessage ?? "No matching records found.",
-      records: [],
-      loading: true,
-      error: "",
-    });
+    setExplorerState(createExplorerState(spec, { loading: true }));
 
     try {
       const records = await ensureFilteredRecords();
       const context = buildExplorerContext(records);
-      const matched = records.filter((record) => spec.matcher(record, context));
-      matched.sort((a, b) => {
-        const aValue = String(a.dt ?? a.date ?? a.timestamp ?? "");
-        const bValue = String(b.dt ?? b.date ?? b.timestamp ?? "");
-        return bValue.localeCompare(aValue);
-      });
-
-      setExplorerState({
-        open: true,
-        title: spec.title,
-        description: spec.description,
-        emptyMessage: spec.emptyMessage ?? "No matching records found.",
-        records: matched,
-        loading: false,
-        error: "",
-      });
+      const matched = records
+        .filter((record) => spec.matcher(record, context))
+        .sort(compareRecordsByNewest);
+
+      setExplorerState(createExplorerState(spec, { records: matched }));
     } catch (e) {
-      setExplorerState({
-        open: true,
-        title: spec.title,
-        description: spec.description,
-        emptyMessage: spec.emptyMessage ?? "No matching records found.",
-        records: [],
-        loading: false,
-        error: `Failed to load corpus records: ${String(e)}`,
-      });
+      setExplorerState(
+        createExplorerState(spec, {
+          error: `Failed to load corpus records: ${String(e)}`,
+        }),
+      );
     }
   };
 
```
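The refactor above centralises the repeated try/catch into tryParseRecords while keeping the same fallback order: direct parse, then NDJSON lines, then the widest [...] slice, then the widest {...} slice. The cascade translated into a Python sketch (normalizeRecordPayload is stubbed as a wrap-in-list passthrough for illustration, and the NDJSON branch is simplified):

```python
import json

def try_parse(value):
    """Mirror of tryParseRecords: parse or return None."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return None
    # Stand-in for normalizeRecordPayload: wrap single objects in a list.
    return parsed if isinstance(parsed, list) else [parsed]

def parse_record_payload(payload: str):
    trimmed = payload.strip()
    if not trimmed:
        return []

    # 1) Direct parse of the whole payload.
    direct = try_parse(trimmed)
    if direct is not None:
        return direct

    # 2) NDJSON: one JSON object per non-empty line.
    lines = [ln for ln in trimmed.splitlines() if ln.strip()]
    if len(lines) > 1:
        try:
            return [json.loads(ln) for ln in lines]
        except json.JSONDecodeError:
            pass

    # 3) and 4) Last resorts: widest bracket slice, then brace slice.
    for open_ch, close_ch in (("[", "]"), ("{", "}")):
        start, end = trimmed.find(open_ch), trimmed.rfind(close_ch)
        if start != -1 and end > start:
            parsed = try_parse(trimmed[start:end + 1])
            if parsed is not None:
                return parsed

    return None
```

The payoff of the refactor is visible even in the sketch: each fallback is one guarded call instead of its own try/catch with a comment explaining why the catch is empty.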
```diff
@@ -34,6 +34,11 @@ type Vocab = {
   top_words: FrequencyWord[];
 };
 
+type DominantTopic = {
+  topic: string;
+  count: number;
+};
+
 type User = {
   author: string;
   post: number;
@@ -41,6 +46,7 @@ type User = {
   comment_post_ratio: number;
   comment_share: number;
   avg_emotions?: Record<string, number>;
+  dominant_topic?: DominantTopic | null;
   vocab?: Vocab | null;
 };
 
@@ -162,6 +168,10 @@ type StanceMarkers = {
   certainty_per_1k_tokens: number;
   deontic_per_1k_tokens: number;
   permission_per_1k_tokens: number;
+  hedge_emotion_avg?: Record<string, number>;
+  certainty_emotion_avg?: Record<string, number>;
+  deontic_emotion_avg?: Record<string, number>;
+  permission_emotion_avg?: Record<string, number>;
 };
 
 type EntityEmotionAggregate = {
@@ -202,6 +212,7 @@ type FilterResponse = {
 
 export type {
   TopUser,
+  DominantTopic,
   Vocab,
   User,
   InteractionGraph,
```
utils/corpusExplorer (diff truncated at the end of the extract)

```diff
@@ -1,5 +1,3 @@
-import type { CSSProperties } from "react";
-
 type EntityRecord = {
   text?: string;
   [key: string]: unknown;
@@ -58,11 +56,6 @@ const EMOTION_KEYS = [
   "emotion_sadness",
 ] as const;
 
-const shrinkButtonStyle: CSSProperties = {
-  padding: "4px 8px",
-  fontSize: 12,
-};
-
 const toText = (value: unknown) => {
   if (typeof value === "string") {
     return value;
@@ -83,6 +76,7 @@ const toText = (value: unknown) => {
 };
 
 const normalize = (value: unknown) => toText(value).trim().toLowerCase();
+const getAuthor = (record: DatasetRecord) => toText(record.author).trim();
 
 const getRecordText = (record: DatasetRecord) =>
   `${record.title ?? ""} ${record.content ?? ""}`.trim();
@@ -152,11 +146,11 @@ const matchesPhrase = (record: DatasetRecord, phrase: string) => {
     return false;
   }
 
-  return pattern.test(getRecordText(record).toLowerCase());
+  return pattern.test(getRecordText(record));
 };
 
 const recordIdentityBucket = (record: DatasetRecord) => {
-  const text = getRecordText(record).toLowerCase();
+  const text = getRecordText(record);
   const inHits = countMatches(IN_GROUP_PATTERN, text);
   const outHits = countMatches(OUT_GROUP_PATTERN, text);
 
@@ -171,48 +165,30 @@ const recordIdentityBucket = (record: DatasetRecord) => {
   return "tie";
 };
 
-const createAuthorEventCounts = (records: DatasetRecord[]) => {
-  const counts = new Map<string, number>();
-  for (const record of records) {
-    const author = toText(record.author).trim();
-    if (!author) {
-      continue;
-    }
-    counts.set(author, (counts.get(author) ?? 0) + 1);
-  }
-  return counts;
-};
-
-const createAuthorCommentCounts = (records: DatasetRecord[]) => {
-  const counts = new Map<string, number>();
-  for (const record of records) {
-    const author = toText(record.author).trim();
-    if (!author || record.type !== "comment") {
-      continue;
-    }
-    counts.set(author, (counts.get(author) ?? 0) + 1);
-  }
-  return counts;
-};
-
-const createAuthorByPostId = (records: DatasetRecord[]) => {
-  const map = new Map<string, string>();
-  for (const record of records) {
-    const postId = record.post_id;
-    const author = toText(record.author).trim();
-    if (postId === null || postId === undefined || !author) {
-      continue;
-    }
-    map.set(String(postId), author);
-  }
-  return map;
-};
-
-const buildExplorerContext = (records: DatasetRecord[]): CorpusExplorerContext => ({
-  authorByPostId: createAuthorByPostId(records),
-  authorEventCounts: createAuthorEventCounts(records),
-  authorCommentCounts: createAuthorCommentCounts(records),
-});
+const buildExplorerContext = (records: DatasetRecord[]): CorpusExplorerContext => {
+  const authorByPostId = new Map<string, string>();
+  const authorEventCounts = new Map<string, number>();
+  const authorCommentCounts = new Map<string, number>();
+
+  for (const record of records) {
+    const author = getAuthor(record);
+    if (!author) {
+      continue;
+    }
+
+    authorEventCounts.set(author, (authorEventCounts.get(author) ?? 0) + 1);
+
+    if (record.type === "comment") {
+      authorCommentCounts.set(author, (authorCommentCounts.get(author) ?? 0) + 1);
+    }
+
+    if (record.post_id !== null && record.post_id !== undefined) {
+      authorByPostId.set(String(record.post_id), author);
+    }
+  }
+
+  return { authorByPostId, authorEventCounts, authorCommentCounts };
+};
 
 const buildAllRecordsSpec = (): CorpusExplorerSpec => ({
   title: "Corpus Explorer",
@@ -221,19 +197,27 @@ const buildAllRecordsSpec = (): CorpusExplorerSpec => ({
   matcher: () => true,
 });
 
-const buildUserSpec = (author: string): CorpusExplorerSpec => ({
-  title: `User: ${author}`,
-  description: `All records authored by ${author}.`,
-  emptyMessage: `No records found for ${author}.`,
-  matcher: (record) => normalize(record.author) === normalize(author),
-});
-
-const buildTopicSpec = (topic: string): CorpusExplorerSpec => ({
-  title: `Topic: ${topic}`,
-  description: `Records assigned to the ${topic} topic bucket.`,
-  emptyMessage: `No records found in the ${topic} topic bucket.`,
-  matcher: (record) => normalize(record.topic) === normalize(topic),
-});
+const buildUserSpec = (author: string): CorpusExplorerSpec => {
+  const target = normalize(author);
+
+  return {
+    title: `User: ${author}`,
+    description: `All records authored by ${author}.`,
+    emptyMessage: `No records found for ${author}.`,
+    matcher: (record) => normalize(record.author) === target,
+  };
+};
+
+const buildTopicSpec = (topic: string): CorpusExplorerSpec => {
+  const target = normalize(topic);
+
+  return {
+    title: `Topic: ${topic}`,
+    description: `Records assigned to the ${topic} topic bucket.`,
+    emptyMessage: `No records found in the ${topic} topic bucket.`,
+    matcher: (record) => normalize(record.topic) === target,
+  };
+};
 
 const buildDateBucketSpec = (date: string): CorpusExplorerSpec => ({
   title: `Date Bucket: ${date}`,
@@ -256,88 +240,75 @@ const buildNgramSpec = (ngram: string): CorpusExplorerSpec => ({
   matcher: (record) => matchesPhrase(record, ngram),
 });
 
-const buildEntitySpec = (entity: string): CorpusExplorerSpec => ({
-  title: `Entity: ${entity}`,
-  description: `Records mentioning the ${entity} entity.`,
-  emptyMessage: `No records found for the ${entity} entity.`,
-  matcher: (record) => {
-    const target = normalize(entity);
-    const entities = Array.isArray(record.ner_entities) ? record.ner_entities : [];
-    return entities.some((item) => normalize(item?.text) === target) || matchesPhrase(record, entity);
-  },
-});
-
-const buildSourceSpec = (source: string): CorpusExplorerSpec => ({
-  title: `Source: ${source}`,
-  description: `Records from the ${source} source.`,
-  emptyMessage: `No records found for ${source}.`,
-  matcher: (record) => normalize(record.source) === normalize(source),
-});
-
-const buildDominantEmotionSpec = (emotion: string): CorpusExplorerSpec => ({
-  title: `Dominant Emotion: ${emotion}`,
-  description: `Records where ${emotion} is the strongest emotion score.`,
-  emptyMessage: `No records found with dominant emotion ${emotion}.`,
-  matcher: (record) => getDominantEmotion(record) === normalize(emotion),
-});
-
-const buildReplyPairSpec = (source: string, target: string): CorpusExplorerSpec => ({
-  title: `Reply Path: ${source} -> ${target}`,
-  description: `Reply records authored by ${source} in response to ${target}.`,
-  emptyMessage: `No reply records found for ${source} -> ${target}.`,
-  matcher: (record, context) => {
-    if (normalize(record.author) !== normalize(source)) {
-      return false;
-    }
-
-    const replyTo = record.reply_to;
-    if (replyTo === null || replyTo === undefined || replyTo === "") {
-      return false;
-    }
-
-    const replyTarget = context.authorByPostId.get(String(replyTo));
-    return normalize(replyTarget) === normalize(target);
-  },
-});
+const buildEntitySpec = (entity: string): CorpusExplorerSpec => {
+  const target = normalize(entity);
+
+  return {
+    title: `Entity: ${entity}`,
+    description: `Records mentioning the ${entity} entity.`,
+    emptyMessage: `No records found for the ${entity} entity.`,
+    matcher: (record) => {
+      const entities = Array.isArray(record.ner_entities) ? record.ner_entities : [];
+      return entities.some((item) => normalize(item?.text) === target) || matchesPhrase(record, entity);
+    },
+  };
+};
+
+const buildSourceSpec = (source: string): CorpusExplorerSpec => {
+  const target = normalize(source);
+
+  return {
+    title: `Source: ${source}`,
+    description: `Records from the ${source} source.`,
+    emptyMessage: `No records found for ${source}.`,
+    matcher: (record) => normalize(record.source) === target,
+  };
+};
+
+const buildDominantEmotionSpec = (emotion: string): CorpusExplorerSpec => {
+  const target = normalize(emotion);
+
+  return {
+    title: `Dominant Emotion: ${emotion}`,
+    description: `Records where ${emotion} is the strongest emotion score.`,
+    emptyMessage: `No records found with dominant emotion ${emotion}.`,
+    matcher: (record) => getDominantEmotion(record) === target,
+  };
+};
+
+const buildReplyPairSpec = (source: string, target: string): CorpusExplorerSpec => {
+  const sourceName = normalize(source);
+  const targetName = normalize(target);
+
+  return {
+    title: `Reply Path: ${source} -> ${target}`,
+    description: `Reply records authored by ${source} in response to ${target}.`,
+    emptyMessage: `No reply records found for ${source} -> ${target}.`,
+    matcher: (record, context) => {
+      if (normalize(record.author) !== sourceName) {
+        return false;
+      }
+
+      const replyTo = record.reply_to;
+      if (replyTo === null || replyTo === undefined || replyTo === "") {
+        return false;
+      }
+
+      return normalize(context.authorByPostId.get(String(replyTo))) === targetName;
+    },
+  };
+};
 
 const buildOneTimeUsersSpec = (): CorpusExplorerSpec => ({
   title: "One-Time Users",
   description: "Records written by authors who appear exactly once in the filtered corpus.",
   emptyMessage: "No one-time-user records found.",
   matcher: (record, context) => {
-    const author = toText(record.author).trim();
+    const author = getAuthor(record);
    return !!author && context.authorEventCounts.get(author) === 1;
   },
 });
 
-const buildTopCommentersSpec = (topAuthorCount: number): CorpusExplorerSpec => ({
-  title: "Top Commenters",
-  description: `Comment records from the top ${topAuthorCount} commenters in the filtered corpus.`,
-  emptyMessage: "No top-commenter records found.",
-  matcher: (record, context) => {
-    if (record.type !== "comment") {
-      return false;
-    }
-
-    const rankedAuthors = Array.from(context.authorCommentCounts.entries())
```
|
|
||||||
.sort((a, b) => b[1] - a[1])
|
|
||||||
.slice(0, topAuthorCount)
|
|
||||||
.map(([author]) => author);
|
|
||||||
|
|
||||||
return rankedAuthors.includes(toText(record.author).trim());
|
|
||||||
},
|
|
||||||
});
|
|
||||||
|
|
||||||
const buildSingleCommentAuthorsSpec = (): CorpusExplorerSpec => ({
|
|
||||||
title: "Single-Comment Authors",
|
|
||||||
description: "Comment records from authors who commented exactly once.",
|
|
||||||
emptyMessage: "No single-comment-author records found.",
|
|
||||||
matcher: (record, context) => {
|
|
||||||
const author = toText(record.author).trim();
|
|
||||||
return record.type === "comment" && !!author && context.authorCommentCounts.get(author) === 1;
|
|
||||||
},
|
|
||||||
});
|
|
||||||
|
|
||||||
const buildIdentityBucketSpec = (bucket: "in" | "out" | "tie"): CorpusExplorerSpec => {
|
const buildIdentityBucketSpec = (bucket: "in" | "out" | "tie"): CorpusExplorerSpec => {
|
||||||
const labels = {
|
const labels = {
|
||||||
in: "In-Group Posts",
|
in: "In-Group Posts",
|
||||||
@@ -376,9 +347,7 @@ const buildDeonticSpec = () =>
|
|||||||
const buildPermissionSpec = () =>
|
const buildPermissionSpec = () =>
|
||||||
buildPatternSpec("Permission Words", "Records containing permission language.", PERMISSION_PATTERN);
|
buildPatternSpec("Permission Words", "Records containing permission language.", PERMISSION_PATTERN);
|
||||||
|
|
||||||
const getExplorerButtonStyle = () => shrinkButtonStyle;
|
export type { DatasetRecord, CorpusExplorerSpec };
|
||||||
|
|
||||||
export type { DatasetRecord, CorpusExplorerContext, CorpusExplorerSpec };
|
|
||||||
export {
|
export {
|
||||||
buildAllRecordsSpec,
|
buildAllRecordsSpec,
|
||||||
buildCertaintySpec,
|
buildCertaintySpec,
|
||||||
@@ -393,13 +362,10 @@ export {
|
|||||||
buildOneTimeUsersSpec,
|
buildOneTimeUsersSpec,
|
||||||
buildPermissionSpec,
|
buildPermissionSpec,
|
||||||
buildReplyPairSpec,
|
buildReplyPairSpec,
|
||||||
buildSingleCommentAuthorsSpec,
|
|
||||||
buildSourceSpec,
|
buildSourceSpec,
|
||||||
buildTopicSpec,
|
buildTopicSpec,
|
||||||
buildTopCommentersSpec,
|
|
||||||
buildUserSpec,
|
buildUserSpec,
|
||||||
buildWordSpec,
|
buildWordSpec,
|
||||||
getDateBucket,
|
getDateBucket,
|
||||||
getExplorerButtonStyle,
|
|
||||||
toText,
|
toText,
|
||||||
};
|
};
|
||||||
|
|||||||
BIN report/img/analysis_bar.png (new file, 26 KiB)
BIN report/img/architecture.png (new file, 70 KiB)
BIN report/img/cork_temporal.png (new file, 274 KiB)
BIN report/img/flooding_posts.png (new file, 90 KiB)
BIN report/img/frontend.png (new file, 302 KiB)
BIN report/img/gantt.png (new file, 50 KiB)
BIN report/img/heatmap.png (new file, 86 KiB)
BIN report/img/interaction_graph.png (new file, 114 KiB)
BIN report/img/kpi_card.png (new file, 8.7 KiB)
BIN report/img/moods.png (new file, 16 KiB)
BIN report/img/navbar.png (new file, 14 KiB)
BIN report/img/ngrams.png (new file, 38 KiB)
BIN report/img/nlp_backoff.png (new file, 143 KiB)
BIN report/img/pipeline.png (new file, 26 KiB)
BIN report/img/reddit_bot.png (new file, 232 KiB)
BIN report/img/schema.png (new file, 64 KiB)
BIN report/img/signature.jpg (new file, 152 KiB)
BIN report/img/stance_markers.png (new file, 111 KiB)
BIN report/img/topic_emotions.png (new file, 17 KiB)
BIN report/img/ucc_crest.png (new file, 27 KiB)
1371 report/main.tex
149 report/references.bib (new file)
@@ -0,0 +1,149 @@
+@online{reddit_api,
+  author  = {{Reddit Inc.}},
+  title   = {Reddit API Documentation},
+  year    = {2025},
+  url     = {https://www.reddit.com/dev/api/},
+  urldate = {2026-04-08}
+}
+
+@misc{hartmann2022emotionenglish,
+  author       = {Hartmann, Jochen},
+  title        = {Emotion English DistilRoBERTa-base},
+  year         = {2022},
+  howpublished = {\url{https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/}}
+}
+
+@misc{all_mpnet_base_v2,
+  author       = {{Microsoft Research}},
+  title        = {All-MPNet-Base-V2},
+  year         = {2021},
+  howpublished = {\url{https://huggingface.co/sentence-transformers/all-mpnet-base-v2}}
+}
+
+@misc{minilm_l6_v2,
+  author       = {{Microsoft Research}},
+  title        = {MiniLM-L6-V2},
+  year         = {2021},
+  howpublished = {\url{https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2}}
+}
+
+@misc{dslim_bert_base_ner,
+  author       = {deepset},
+  title        = {dslim/bert-base-NER},
+  year         = {2018},
+  howpublished = {\url{https://huggingface.co/dslim/bert-base-NER}}
+}
+
+@inproceedings{demszky2020goemotions,
+  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
+  title     = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
+  booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
+  year      = {2020}
+}
+
+@article{dominguez2007virtual,
+  author  = {Domínguez, Daniel and Beaulieu, Anne and Estalella, Adolfo and Gómez, Edgar and Schnettler, Bernt and Read, Rosie},
+  title   = {Virtual Ethnography},
+  journal = {Forum Qualitative Sozialforschung / Forum: Qualitative Social Research},
+  year    = {2007},
+  volume  = {8},
+  number  = {3},
+  url     = {http://nbn-resolving.de/urn:nbn:de:0114-fqs0703E19}
+}
+
+@article{sun2014lurkers,
+  author  = {Sun, Na and Rau, Pei-Luen Patrick and Ma, Liang},
+  title   = {Understanding Lurkers in Online Communities: A Literature Review},
+  journal = {Computers in Human Behavior},
+  year    = {2014},
+  volume  = {38},
+  pages   = {110--117},
+  doi     = {10.1016/j.chb.2014.05.022}
+}
+
+@article{ahmad2024sentiment,
+  author  = {Ahmad, Waqar and others},
+  title   = {Recent Advancements and Challenges of NLP-based Sentiment Analysis: A State-of-the-art Review},
+  journal = {Natural Language Processing Journal},
+  year    = {2024},
+  doi     = {10.1016/j.nlp.2024.100059}
+}
+
+@article{coleman2010ethnographic,
+  author    = {E. Gabriella Coleman},
+  title     = {Ethnographic Approaches to Digital Media},
+  journal   = {Annual Review of Anthropology},
+  year      = {2010},
+  volume    = {39},
+  pages     = {487--505},
+  publisher = {Annual Reviews},
+  issn      = {00846570},
+  url       = {http://www.jstor.org/stable/25735124},
+  urldate   = {2026-04-15},
+  abstract  = {This review surveys and divides the ethnographic corpus on digital media into three broad but overlapping categories: the cultural politics of digital media, the vernacular cultures of digital media, and the prosaics of digital media. Engaging these three categories of scholarship on digital media, I consider how ethnographers are exploring the complex relationships between the local practices and global implications of digital media, their materiality and politics, and their banal, as well as profound, presence in cultural life and modes of communication. I consider the way these media have become central to the articulation of cherished beliefs, ritual practices, and modes of being in the world; the fact that digital media culturally matters is undeniable but showing how, where, and why it matters is necessary to push against peculiarly narrow presumptions about the universality of digital experience.}
+}
+
+@article{shen2021stance,
+  author  = {Shen, Qian and Tao, Yating},
+  title   = {Stance Markers in {English} Medical Research Articles and Newspaper Opinion Columns: A Comparative Corpus-Based Study},
+  journal = {PLOS ONE},
+  year    = {2021},
+  volume  = {16},
+  number  = {3},
+  pages   = {e0247981},
+  doi     = {10.1371/journal.pone.0247981}
+}
+
+@incollection{medvedev2019anatomy,
+  author    = {Medvedev, Alexey N. and Lambiotte, Renaud and Delvenne, Jean-Charles},
+  title     = {The Anatomy of Reddit: An Overview of Academic Research},
+  booktitle = {Dynamics On and Of Complex Networks III},
+  series    = {Springer Proceedings in Complexity},
+  publisher = {Springer},
+  year      = {2019},
+  pages     = {183--204}
+}
+
+@misc{cook2023ethnography,
+  author       = {Cook, Chloe},
+  title        = {What is the Difference Between Ethnography and Digital Ethnography?},
+  year         = {2023},
+  month        = jan,
+  day          = {19},
+  howpublished = {\url{https://ethosapp.com/blog/what-is-the-difference-between-ethnography-and-digital-ethnography/}},
+  note         = {Accessed: 2026-04-16},
+  organization = {EthOS}
+}
+
+@misc{giuffre2026sentiment,
+  author       = {Giuffre, Steven},
+  title        = {What is Sentiment Analysis?},
+  year         = {2026},
+  month        = mar,
+  howpublished = {\url{https://www.vonage.com/resources/articles/sentiment-analysis/}},
+  note         = {Accessed: 2026-04-16},
+  organization = {Vonage}
+}
+
+@misc{mungalpara2022stemming,
+  author       = {Mungalpara, Jaimin},
+  title        = {Stemming Lemmatization Stopwords and {N}-Grams in {NLP}},
+  year         = {2022},
+  month        = jul,
+  day          = {26},
+  howpublished = {\url{https://jaimin-ml2001.medium.com/stemming-lemmatization-stopwords-and-n-grams-in-nlp-96f8e8b6aa6f}},
+  note         = {Accessed: 2026-04-16},
+  organization = {Medium}
+}
+
+@misc{chugani2025ethicalscraping,
+  author       = {Chugani, Vinod},
+  title        = {Ethical Web Scraping: Principles and Practices},
+  year         = {2025},
+  month        = apr,
+  day          = {21},
+  howpublished = {\url{https://www.datacamp.com/blog/ethical-web-scraping}},
+  note         = {Accessed: 2026-04-16},
+  organization = {DataCamp}
+}
@@ -16,3 +16,4 @@ Requests==2.32.5
 sentence_transformers==5.2.2
 torch==2.10.0
 transformers==5.1.0
+gunicorn==25.3.0
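The only dependency added here is gunicorn, which suggests the Python web server is now run behind a production WSGI server rather than the framework's development server. A typical invocation might look like the following sketch; the module path `server.app:app`, port, and worker count are assumptions, not taken from the repository:

```shell
# Hypothetical production launch: 4 worker processes bound to port 8000
gunicorn --workers 4 --bind 0.0.0.0:8000 server.app:app
```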
@@ -67,6 +67,12 @@ class CulturalAnalysis:
 
     def get_stance_markers(self, df: pd.DataFrame) -> dict[str, Any]:
         s = df[self.content_col].fillna("").astype(str)
+        emotion_exclusions = {"emotion_neutral", "emotion_surprise"}
+        emotion_cols = [
+            c
+            for c in df.columns
+            if c.startswith("emotion_") and c not in emotion_exclusions
+        ]
 
         hedge_pattern = re.compile(
             r"\b(maybe|perhaps|possibly|probably|likely|seems|seem|i think|i feel|i guess|kind of|sort of|somewhat)\b"
@@ -88,7 +94,7 @@ class CulturalAnalysis:
             0, 1
         )
 
-        return {
+        result = {
             "hedge_total": int(hedge_counts.sum()),
             "certainty_total": int(certainty_counts.sum()),
             "deontic_total": int(deontic_counts.sum()),
@@ -107,6 +113,32 @@ class CulturalAnalysis:
             ),
         }
 
+        if emotion_cols:
+            emo = df[emotion_cols].apply(pd.to_numeric, errors="coerce").fillna(0.0)
+
+            result["hedge_emotion_avg"] = (
+                emo.loc[hedge_counts > 0].mean()
+                if (hedge_counts > 0).any()
+                else pd.Series(0.0, index=emotion_cols)
+            ).to_dict()
+            result["certainty_emotion_avg"] = (
+                emo.loc[certainty_counts > 0].mean()
+                if (certainty_counts > 0).any()
+                else pd.Series(0.0, index=emotion_cols)
+            ).to_dict()
+            result["deontic_emotion_avg"] = (
+                emo.loc[deontic_counts > 0].mean()
+                if (deontic_counts > 0).any()
+                else pd.Series(0.0, index=emotion_cols)
+            ).to_dict()
+            result["permission_emotion_avg"] = (
+                emo.loc[perm_counts > 0].mean()
+                if (perm_counts > 0).any()
+                else pd.Series(0.0, index=emotion_cols)
+            ).to_dict()
+
+        return result
+
     def get_avg_emotions_per_entity(
         self, df: pd.DataFrame, top_n: int = 25, min_posts: int = 10
     ) -> dict[str, Any]:
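The new `*_emotion_avg` fields average each emotion column over only those records that matched a stance-marker pattern, falling back to zeros when nothing matched. The same masking idiom in isolation, on a made-up mini-corpus (column names are invented for the example):

```python
import pandas as pd

# Toy stand-in for the enriched corpus: per-record emotion scores plus a
# hedge-marker hit count
df = pd.DataFrame({
    "emotion_joy": [0.9, 0.1, 0.5, 0.2],
    "emotion_anger": [0.0, 0.8, 0.3, 0.6],
    "hedge_count": [1, 0, 2, 0],
})

emotion_cols = ["emotion_joy", "emotion_anger"]
mask = df["hedge_count"] > 0  # records containing at least one hedge word

# Average emotion profile over the hedged records only, with a zero fallback
hedge_emotion_avg = (
    df[emotion_cols].loc[mask].mean()
    if mask.any()
    else pd.Series(0.0, index=emotion_cols)
).to_dict()
print(hedge_emotion_avg)  # averages rows 0 and 2 only
```

Without the `mask.any()` guard, `mean()` on an empty selection would produce `NaN` values rather than zeros.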
@@ -71,6 +71,7 @@ class UserAnalysis:
         per_user = df.groupby(["author", "type"]).size().unstack(fill_value=0)
 
         emotion_cols = [col for col in df.columns if col.startswith("emotion_")]
+        dominant_topic_by_author = {}
 
         avg_emotions_by_author = {}
         if emotion_cols:
@@ -80,6 +81,31 @@ class UserAnalysis:
             for author, row in avg_emotions.iterrows()
         }
 
+        if "topic" in df.columns:
+            topic_df = df[
+                df["topic"].notna()
+                & (df["topic"] != "")
+                & (df["topic"] != "Misc")
+            ]
+            if not topic_df.empty:
+                topic_counts = (
+                    topic_df.groupby(["author", "topic"])
+                    .size()
+                    .reset_index(name="count")
+                    .sort_values(
+                        ["author", "count", "topic"],
+                        ascending=[True, False, True],
+                    )
+                    .drop_duplicates(subset=["author"])
+                )
+                dominant_topic_by_author = {
+                    row["author"]: {
+                        "topic": row["topic"],
+                        "count": int(row["count"]),
+                    }
+                    for _, row in topic_counts.iterrows()
+                }
+
         # ensure columns always exist
         for col in ("post", "comment"):
             if col not in per_user.columns:
@@ -109,6 +135,7 @@ class UserAnalysis:
             "comment_post_ratio": float(row.get("comment_post_ratio", 0)),
             "comment_share": float(row.get("comment_share", 0)),
             "avg_emotions": avg_emotions_by_author.get(author, {}),
+            "dominant_topic": dominant_topic_by_author.get(author),
             "vocab": vocab_by_author.get(
                 author,
                 {
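The dominant-topic block above picks, per author, the topic with the highest record count, breaking ties alphabetically. The sort-then-`drop_duplicates` idiom it relies on, run on a toy frame:

```python
import pandas as pd

# Toy records: each row is one post with an author and an assigned topic
df = pd.DataFrame({
    "author": ["ann", "ann", "ann", "bob", "bob"],
    "topic": ["housing", "housing", "weather", "sport", "sport"],
})

dominant = (
    df.groupby(["author", "topic"])
    .size()
    .reset_index(name="count")
    # Highest count first per author; alphabetical topic order breaks ties
    .sort_values(["author", "count", "topic"], ascending=[True, False, True])
    .drop_duplicates(subset=["author"])  # keep only each author's top row
)
print(dict(zip(dominant["author"], dominant["topic"])))  # {'ann': 'housing', 'bob': 'sport'}
```

Because `drop_duplicates` keeps the first occurrence, the sort order alone determines which (author, topic) row survives.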
@@ -1,21 +1,18 @@
 from abc import ABC, abstractmethod
 from dto.post import Post
+import os
 
 
 class BaseConnector(ABC):
-    # Each subclass declares these at the class level
-    source_name: str  # machine-readable: "reddit", "youtube"
-    display_name: str  # human-readable: "Reddit", "YouTube"
-    required_env: list[str] = []  # env vars needed to activate
+    source_name: str  # machine readable
+    display_name: str  # human readable
+    required_env: list[str] = []
 
     search_enabled: bool
     categories_enabled: bool
 
     @classmethod
     def is_available(cls) -> bool:
-        """Returns True if all required env vars are set."""
-        import os
-
         return all(os.getenv(var) for var in cls.required_env)
 
     @abstractmethod
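The `is_available` classmethod gates each connector on its declared environment variables, so connectors with missing credentials simply disappear from the UI. A minimal self-contained demonstration of the pattern (`FakeConnector` and its variable name are invented):

```python
import os
from abc import ABC


class BaseConnector(ABC):
    # Trimmed restatement of the pattern for demonstration purposes
    required_env: list[str] = []

    @classmethod
    def is_available(cls) -> bool:
        # True only when every declared credential is present and non-empty
        return all(os.getenv(var) for var in cls.required_env)


class FakeConnector(BaseConnector):
    required_env = ["FAKE_API_KEY"]  # hypothetical credential


os.environ.pop("FAKE_API_KEY", None)
print(FakeConnector.is_available())  # False until the variable is set
os.environ["FAKE_API_KEY"] = "secret"
print(FakeConnector.is_available())  # True
```

Moving `import os` to module level, as this commit does, avoids re-importing on every availability check.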
@@ -11,8 +11,7 @@ from server.connectors.base import BaseConnector
 
 logger = logging.getLogger(__name__)
 
-HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ForumFetcher/1.0)"}
-
+HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; Digital-Ethnography-Aid/1.0)"}
 
 class BoardsAPI(BaseConnector):
     source_name: str = "boards.ie"
@@ -88,7 +87,7 @@ class BoardsAPI(BaseConnector):
             post = self._parse_thread(html, post_url)
             return post
 
-        with ThreadPoolExecutor(max_workers=30) as executor:
+        with ThreadPoolExecutor(max_workers=5) as executor:
             futures = {executor.submit(fetch_and_parse, url): url for url in urls}
 
             for i, future in enumerate(as_completed(futures)):
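Dropping `max_workers` from 30 to 5 bounds how many concurrent requests the scraper makes against boards.ie, which is the polite-scraping behaviour the report's ethics discussion calls for. The executor pattern in isolation, with the HTTP fetch replaced by a stub:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_and_parse(url: str) -> str:
    # Stub standing in for an HTTP fetch plus HTML parse
    return f"parsed:{url}"


urls = [f"https://example.invalid/thread/{i}" for i in range(8)]

results = []
# At most 5 threads in flight, so the target site sees bounded concurrency
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(fetch_and_parse, url): url for url in urls}
    for future in as_completed(futures):
        results.append(future.result())

print(sorted(results)[0])  # parsed:https://example.invalid/thread/0
```

`as_completed` yields futures in completion order, so `results` is unordered; the mapping from future back to URL is what lets the real connector attribute failures to specific threads.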
@@ -1,6 +1,10 @@
 import requests
 import logging
 import time
+import os
+
+from dotenv import load_dotenv
+from requests.auth import HTTPBasicAuth
 
 from dto.post import Post
 from dto.user import User
@@ -9,6 +13,8 @@ from server.connectors.base import BaseConnector
 
 logger = logging.getLogger(__name__)
 
+CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
+CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
 
 class RedditAPI(BaseConnector):
     source_name: str = "reddit"
@@ -18,6 +24,8 @@ class RedditAPI(BaseConnector):
 
     def __init__(self):
         self.url = "https://www.reddit.com/"
+        self.token = None
+        self.token_expiry = 0
 
     # Public Methods #
     def get_new_posts_by_search(
@@ -172,8 +180,43 @@ class RedditAPI(BaseConnector):
         user.karma = user_data["total_karma"]
         return user
 
+    def _get_token(self):
+        if self.token and time.time() < self.token_expiry:
+            return self.token
+
+        logger.info("Fetching new Reddit access token...")
+
+        auth = HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)
+
+        data = {
+            "grant_type": "client_credentials"
+        }
+
+        headers = {
+            "User-Agent": "python:ethnography-college-project:0.1 (by /u/ThisBirchWood)"
+        }
+
+        response = requests.post(
+            "https://www.reddit.com/api/v1/access_token",
+            auth=auth,
+            data=data,
+            headers=headers,
+        )
+
+        response.raise_for_status()
+        token_json = response.json()
+
+        self.token = token_json["access_token"]
+        self.token_expiry = time.time() + token_json["expires_in"] - 60
+
+        logger.info(
+            f"Obtained new Reddit access token (expires in {token_json['expires_in']}s)"
+        )
+
+        return self.token
+
     def _fetch_post_overviews(self, endpoint: str, params: dict) -> dict:
-        url = f"{self.url}{endpoint}"
+        url = f"https://oauth.reddit.com/{endpoint.lstrip('/')}"
         max_retries = 15
         backoff = 1  # seconds
 
@@ -182,13 +225,18 @@ class RedditAPI(BaseConnector):
             response = requests.get(
                 url,
                 headers={
-                    "User-agent": "python:ethnography-college-project:0.1 (by /u/ThisBirchWood)"
+                    "User-agent": "python:ethnography-college-project:0.1 (by /u/ThisBirchWood)",
+                    "Authorization": f"Bearer {self._get_token()}",
                 },
                 params=params,
            )
 
             if response.status_code == 429:
-                wait_time = response.headers.get("Retry-After", backoff)
+                try:
+                    wait_time = int(response.headers.get("X-Ratelimit-Reset", backoff))
+                    wait_time += 1  # Add a small buffer to ensure the rate limit has reset
+                except ValueError:
+                    wait_time = backoff
 
                 logger.warning(
                     f"Rate limited by Reddit API. Retrying in {wait_time} seconds..."
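The 429 branch now parses Reddit's `X-Ratelimit-Reset` header, pads it by one second, and falls back to the local backoff when the header is malformed. That parsing rule as a standalone helper (the function name is invented):

```python
def rate_limit_wait(headers: dict, backoff: int) -> int:
    """Seconds to sleep after a 429, mirroring the connector's rule."""
    try:
        # Prefer the server's reset hint, padded by 1s; a missing header
        # falls back to the integer backoff (which still gets the +1)
        return int(headers.get("X-Ratelimit-Reset", backoff)) + 1
    except ValueError:
        # Malformed header value: fall back to our own backoff, unpadded
        return backoff


print(rate_limit_wait({"X-Ratelimit-Reset": "12"}, 4))  # 13
print(rate_limit_wait({}, 4))                           # 5
print(rate_limit_wait({"X-Ratelimit-Reset": "soon"}, 4))  # 4
```

Note the asymmetry inherited from the original code: a missing header still receives the one-second pad, while an unparseable one does not.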
@@ -1,5 +1,6 @@
 import os
 import datetime
+import logging
 
 from dotenv import load_dotenv
 from googleapiclient.discovery import build
@@ -9,9 +10,11 @@ from dto.comment import Comment
 from server.connectors.base import BaseConnector
 
 load_dotenv()
 
 API_KEY = os.getenv("YOUTUBE_API_KEY")
 
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
 
 
 class YouTubeAPI(BaseConnector):
     source_name: str = "youtube"
@@ -77,11 +80,30 @@ class YouTubeAPI(BaseConnector):
         return True
 
     def _search_videos(self, query, limit):
-        request = self.youtube.search().list(
-            q=query, part="snippet", type="video", maxResults=limit
-        )
-        response = request.execute()
-        return response.get("items", [])
+        results = []
+        next_page_token = None
+
+        while len(results) < limit:
+            batch_size = min(50, limit - len(results))
+
+            request = self.youtube.search().list(
+                q=query,
+                part="snippet",
+                type="video",
+                maxResults=batch_size,
+                pageToken=next_page_token,
+            )
+
+            response = request.execute()
+            results.extend(response.get("items", []))
+            logger.info(f"Fetched {len(results)} out of {limit} videos for query '{query}'")
+
+            next_page_token = response.get("nextPageToken")
+            if not next_page_token:
+                logger.warning(f"No more pages of results available for query '{query}'")
+                break
+
+        return results[:limit]
 
     def _get_video_comments(self, video_id):
         request = self.youtube.commentThreads().list(
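The rewritten `_search_videos` accumulates pages until `limit` items are collected or `nextPageToken` runs out, which lifts the old single-request cap of 50 results. The same loop shape against a faked paged API (all names here are stand-ins, not the Google client):

```python
def fake_search(page_token=None, max_results=2):
    # Stand-in for youtube.search().list(...).execute(): 5 items over 3 pages
    items = ["v1", "v2", "v3", "v4", "v5"]
    start = int(page_token or 0)
    resp = {"items": items[start:start + max_results]}
    if start + max_results < len(items):
        resp["nextPageToken"] = str(start + max_results)
    return resp


def search_videos(limit):
    results, token = [], None
    while len(results) < limit:
        # Never ask for more than the remaining quota of items
        resp = fake_search(page_token=token, max_results=min(2, limit - len(results)))
        results.extend(resp["items"])
        token = resp.get("nextPageToken")
        if not token:  # exhausted the result set early
            break
    return results[:limit]


print(search_videos(3))  # ['v1', 'v2', 'v3']
```

The final slice guards against a page returning slightly more items than requested.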
@@ -1,3 +1,5 @@
+from time import time
+
 import pandas as pd
 import logging
 
@@ -46,6 +48,7 @@ def fetch_and_process_dataset(
 
     try:
         for metadata in source_info:
+            fetch_start = time()
            name = metadata["name"]
            search = metadata.get("search")
            category = metadata.get("category")
@@ -57,8 +60,11 @@ def fetch_and_process_dataset(
             )
             posts.extend(post.to_dict() for post in raw_posts)
 
+        fetch_time = time() - fetch_start
         df = pd.DataFrame(posts)
 
+        nlp_start = time()
+
         dataset_manager.set_dataset_status(
             dataset_id, "processing", "NLP Processing Started"
         )
@@ -66,9 +72,11 @@ def fetch_and_process_dataset(
         processor = DatasetEnrichment(df, topics)
         enriched_df = processor.enrich()
 
+        nlp_time = time() - nlp_start
+
         dataset_manager.save_dataset_content(dataset_id, enriched_df)
         dataset_manager.set_dataset_status(
-            dataset_id, "complete", "NLP Processing Completed Successfully"
+            dataset_id, "complete", f"Completed Successfully. Fetch time: {fetch_time:.2f}s, NLP time: {nlp_time:.2f}s"
         )
     except Exception as e:
         dataset_manager.set_dataset_status(