Compare commits: de61e7653f...v1.0 (58 commits)
| SHA1 |
|---|
| 5970f555fa |
| 9b7a51ff33 |
| 2d39ea6e66 |
| c1e5482f55 |
| b2d7f6edaf |
| 10efa664df |
| 3db7c1d3ae |
| 72e17e900e |
| 7b9a17f395 |
| 0a396dd504 |
| c6e8144116 |
| 760d2daf7f |
| ca38b992eb |
| ee9c7b4ab2 |
| 703a7c435c |
| 02ba727d05 |
| 76591bc89e |
| e35e51d295 |
| d2fe637743 |
| e1831aab7d |
| a3ef5a5655 |
| 5f943ce733 |
| 9964a919c3 |
| c11434344a |
| bc356848ef |
| 047427432f |
| d0d02e9ebf |
| 68342606e3 |
| afae7f42a1 |
| 4dd2721e98 |
| 99afe82464 |
| 8c44df94c0 |
| 42905cc547 |
| ec64551881 |
| e274b8295a |
| 3df6776111 |
| a347869353 |
| 8b4e13702e |
| 8fa4f3fbdf |
| c6cae040f0 |
| addc1d4087 |
| 225133a074 |
| e903e1b738 |
| 0c4dc02852 |
| 33e4291def |
| cedbce128e |
| 107dae0e95 |
| 23833e2c5b |
| f2b6917f1f |
| b57a8d3c65 |
| ac65e26eab |
| 6efa75dfe6 |
| 37d08c63b8 |
| 1482e96051 |
| cd6030a760 |
| 6378015726 |
| 430793cd09 |
| b270ed03ae |
.gitignore (vendored): 3 changes
@@ -12,4 +12,5 @@ dist/
helper
db
report/build
report/build
.DS_Store
README.md: 60 changes

@@ -1,29 +1,49 @@
# crosspost

**crosspost** is a browser-based tool designed to support *digital ethnography*, the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities.

A web-based analytics platform for exploring online communities. Built as a final-year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.

The motivating use case is digital ethnography: studying how people talk, what they talk about, and how culture forms in online spaces. The included dataset is centred on Cork, Ireland.

The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
## What it does

- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
- Normalise everything into a unified schema regardless of source
- Run NLP analysis asynchronously in the background via Celery workers
- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
- Multi-user support: each user has their own datasets, isolated from everyone else

By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
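The unified schema itself isn't spelled out at this point, but a rough sketch can be inferred from the field names in the "Data Format for Manual Uploads" section. The class below is purely illustrative, not the project's actual model code:

```python
# Hypothetical sketch of the unified record schema; field names follow the
# manual-upload .jsonl format, not any real module in this repository.
from dataclasses import dataclass, field


@dataclass
class Post:
    id: str
    author: str
    title: str
    content: str
    url: str
    timestamp: float  # Unix seconds
    source: str       # e.g. "reddit", "youtube", "boards"
    comments: list = field(default_factory=list)
```

Whatever the real implementation looks like, the point is that every connector emits records with the same fields, so the NLP and dashboard layers never need source-specific handling.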
# Prerequisites

- Docker & Docker Compose
- A Reddit app (client ID & secret)
- A YouTube Data API v3 key
## Goals for this project

- Collect data ethically: enable users to link or upload text, images, and interaction data (messages etc.) from specified online communities. Potentially, an automated method for importing (using APIs or scraping techniques) could be included as well.
- Organise content: store gathered material in a structured database with tagging for themes, dates, and sources.
- Analyse patterns: use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
- Visualise insights: present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
- Have clearly stated and explained ethical and privacy guidelines for users. The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard.
# Setup

1) **Clone the Repo**

```
git clone https://github.com/your-username/crosspost.git
cd crosspost
```

Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research, turning raw online interactions into meaningful cultural insights.

2) **Configure Environment Vars**

```
cp example.env .env
```

Fill in each required empty variable. Some are already filled in; these are sensible defaults that usually don't need to be changed.
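One of the variables that must be filled in is `JWT_SECRET_KEY` (see `example.env`). Any sufficiently random string works; one way to generate one with the standard library, assuming nothing about the app itself:

```python
# Generate a random hex string suitable for use as a JWT signing secret.
import secrets

print(secrets.token_hex(32))  # 64 hex characters
```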
## Scope

3) **Start everything**

```
docker compose up -d
```

This project focuses on:

- Designing a modular data ingestion pipeline
- Implementing backend data processing and storage
- Integrating lightweight NLP-based analysis
- Building a simple, accessible frontend for exploration and visualisation

This starts:

- `crosspost_db` — PostgreSQL on port 5432
- `crosspost_redis` — Redis on port 6379
- `crosspost_flask` — Flask API on port 5000
- `crosspost_worker` — Celery worker for background NLP/fetching tasks
- `crosspost_frontend` — Vite dev server on port 5173
# Requirements

# Data Format for Manual Uploads

If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (.jsonl), where each line is a post object:

```json
{"id": "abc123", "author": "username", "title": "Post title", "content": "Post body", "url": "https://...", "timestamp": 1700000000.0, "source": "reddit", "comments": []}
```
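A minimal sketch of writing and re-reading one such line with the standard library (the record values are placeholders, not real data):

```python
import json

# One post object per line; a .jsonl file is just these lines concatenated,
# separated by newlines.
record = {
    "id": "abc123", "author": "username", "title": "Post title",
    "content": "Post body", "url": "https://example.com",
    "timestamp": 1700000000.0, "source": "reddit", "comments": [],
}
line = json.dumps(record)
assert json.loads(line)["source"] == "reddit"  # round-trips cleanly
```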
- **Python** ≥ 3.9
- **Python packages** listed in `requirements.txt`
- **npm** ≥ 11

# Notes

- **GPU support**: The Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the `deploy.resources` block in `docker-compose.yml` and make sure the NVIDIA Container Toolkit is installed.
@@ -28,7 +28,7 @@ services:
      - .env
    ports:
      - "5000:5000"
    command: flask --app server.app run --host=0.0.0.0 --debug
    command: gunicorn server.app:app --bind 0.0.0.0:5000 --workers 2 --threads 4
    depends_on:
      - postgres
      - redis
@@ -48,13 +48,13 @@ services:
    depends_on:
      - postgres
      - redis
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  frontend:
    build:

@@ -69,4 +69,4 @@ services:
      - backend

volumes:
  model_cache:
  model_cache:
@@ -1,8 +0,0 @@
# Generic User Data Transfer Object for social media platforms
class User:
    def __init__(self, username: str, created_utc: int, ):
        self.username = username
        self.created_utc = created_utc

        # Optionals
        self.karma = None
example.env: 11 changes

@@ -4,12 +4,13 @@ REDDIT_CLIENT_ID=
REDDIT_CLIENT_SECRET=

# Database
POSTGRES_USER=
POSTGRES_PASSWORD=
POSTGRES_DB=
POSTGRES_HOST=
# Database
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=mydatabase
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DIR=
POSTGRES_DIR=./db

# JWT
JWT_SECRET_KEY=
@@ -5,7 +5,7 @@ import DatasetsPage from "./pages/Datasets";
import DatasetStatusPage from "./pages/DatasetStatus";
import LoginPage from "./pages/Login";
import UploadPage from "./pages/Upload";
import AutoScrapePage from "./pages/AutoScrape";
import AutoFetchPage from "./pages/AutoFetch";
import StatPage from "./pages/Stats";
import { getDocumentTitle } from "./utils/documentTitle";
import DatasetEditPage from "./pages/DatasetEdit";

@@ -23,7 +23,7 @@ function App() {
        <Route path="/" element={<Navigate to="/login" replace />} />
        <Route path="/login" element={<LoginPage />} />
        <Route path="/upload" element={<UploadPage />} />
        <Route path="/auto-scrape" element={<AutoScrapePage />} />
        <Route path="/auto-fetch" element={<AutoFetchPage />} />
        <Route path="/datasets" element={<DatasetsPage />} />
        <Route path="/dataset/:datasetId/status" element={<DatasetStatusPage />} />
        <Route path="/dataset/:datasetId/stats" element={<StatPage />} />
frontend/src/components/CorpusExplorer.tsx (new file): 247 additions

@@ -0,0 +1,247 @@
import { useEffect, useState } from "react";
import { Dialog, DialogPanel, DialogTitle } from "@headlessui/react";

import StatsStyling from "../styles/stats_styling";
import type { DatasetRecord } from "../utils/corpusExplorer";

const styles = StatsStyling;
const INITIAL_RECORD_COUNT = 60;
const RECORD_BATCH_SIZE = 60;
const EXCERPT_LENGTH = 320;

const cleanText = (value: unknown) => {
  if (typeof value !== "string") {
    return "";
  }

  const trimmed = value.trim();
  if (!trimmed) {
    return "";
  }

  const lowered = trimmed.toLowerCase();
  if (lowered === "nan" || lowered === "null" || lowered === "undefined") {
    return "";
  }

  return trimmed;
};

const displayText = (value: unknown, fallback: string) => {
  const cleaned = cleanText(value);
  return cleaned || fallback;
};

type CorpusExplorerProps = {
  open: boolean;
  onClose: () => void;
  title: string;
  description: string;
  records: DatasetRecord[];
  loading: boolean;
  error: string;
  emptyMessage: string;
};

const formatRecordDate = (record: DatasetRecord) => {
  if (typeof record.dt === "string" && record.dt) {
    const date = new Date(record.dt);
    if (!Number.isNaN(date.getTime())) {
      return date.toLocaleString();
    }
  }

  if (typeof record.date === "string" && record.date) {
    return record.date;
  }

  if (typeof record.timestamp === "number") {
    return new Date(record.timestamp * 1000).toLocaleString();
  }

  return "Unknown time";
};

const getRecordKey = (record: DatasetRecord, index: number) =>
  String(record.id ?? record.post_id ?? `${record.author ?? "record"}-${index}`);

const getRecordTitle = (record: DatasetRecord) => {
  if (record.type === "comment") {
    return "";
  }

  const title = cleanText(record.title);
  if (title) {
    return title;
  }

  const content = cleanText(record.content);
  if (!content) {
    return "Untitled record";
  }

  return content.length > 120 ? `${content.slice(0, 117)}...` : content;
};

const CorpusExplorer = ({
  open,
  onClose,
  title,
  description,
  records,
  loading,
  error,
  emptyMessage,
}: CorpusExplorerProps) => {
  const [visibleCount, setVisibleCount] = useState(INITIAL_RECORD_COUNT);
  const [expandedKeys, setExpandedKeys] = useState<Record<string, boolean>>({});

  useEffect(() => {
    if (open) {
      setVisibleCount(INITIAL_RECORD_COUNT);
      setExpandedKeys({});
    }
  }, [open, title, records.length]);

  const hasMoreRecords = visibleCount < records.length;

  return (
    <Dialog open={open} onClose={onClose} style={styles.modalRoot}>
      <div style={styles.modalBackdrop} />

      <div style={styles.modalContainer}>
        <DialogPanel
          style={{
            ...styles.card,
            ...styles.modalPanel,
            width: "min(960px, 96vw)",
            maxHeight: "88vh",
            display: "flex",
            flexDirection: "column",
            gap: 12,
            overflow: "hidden",
          }}
        >
          <div style={styles.headerBar}>
            <div style={{ minWidth: 0 }}>
              <DialogTitle style={styles.sectionTitle}>{title}</DialogTitle>
              <p style={styles.sectionSubtitle}>
                {description} {loading ? "Loading records..." : `${records.length.toLocaleString()} records.`}
              </p>
            </div>

            <button onClick={onClose} style={styles.buttonSecondary}>
              Close
            </button>
          </div>

          {error ? <p style={styles.sectionSubtitle}>{error}</p> : null}

          {!loading && !error && !records.length ? (
            <p style={styles.sectionSubtitle}>{emptyMessage}</p>
          ) : null}

          {loading ? <div style={styles.topUserMeta}>Preparing corpus slice...</div> : null}

          {!loading && !error && records.length ? (
            <>
              <div
                style={{
                  ...styles.topUsersList,
                  overflowY: "auto",
                  overflowX: "hidden",
                  paddingRight: 4,
                }}
              >
                {records.slice(0, visibleCount).map((record, index) => {
                  const recordKey = getRecordKey(record, index);
                  const titleText = getRecordTitle(record);
                  const content = cleanText(record.content);
                  const isExpanded = !!expandedKeys[recordKey];
                  const canExpand = content.length > EXCERPT_LENGTH;
                  const excerpt =
                    canExpand && !isExpanded
                      ? `${content.slice(0, EXCERPT_LENGTH - 3)}...`
                      : content || "No content available.";

                  return (
                    <div key={recordKey} style={styles.topUserItem}>
                      <div style={{ ...styles.headerBar, alignItems: "flex-start" }}>
                        <div style={{ minWidth: 0, flex: 1 }}>
                          {titleText ? <div style={styles.topUserName}>{titleText}</div> : null}
                          <div
                            style={{
                              ...styles.topUserMeta,
                              overflowWrap: "anywhere",
                              wordBreak: "break-word",
                            }}
                          >
                            {displayText(record.author, "Unknown author")} • {displayText(record.source, "Unknown source")} • {displayText(record.type, "record")} • {formatRecordDate(record)}
                          </div>
                        </div>
                        <div
                          style={{
                            ...styles.topUserMeta,
                            marginLeft: 12,
                            textAlign: "right",
                            overflowWrap: "anywhere",
                            wordBreak: "break-word",
                          }}
                        >
                          {cleanText(record.topic) ? `Topic: ${cleanText(record.topic)}` : ""}
                        </div>
                      </div>

                      <div
                        style={{
                          ...styles.topUserMeta,
                          marginTop: 8,
                          whiteSpace: "pre-wrap",
                          overflowWrap: "anywhere",
                          wordBreak: "break-word",
                        }}
                      >
                        {excerpt}
                      </div>

                      {canExpand ? (
                        <div style={{ marginTop: 10 }}>
                          <button
                            onClick={() =>
                              setExpandedKeys((current) => ({
                                ...current,
                                [recordKey]: !current[recordKey],
                              }))
                            }
                            style={styles.buttonSecondary}
                          >
                            {isExpanded ? "Show Less" : "Show More"}
                          </button>
                        </div>
                      ) : null}
                    </div>
                  );
                })}
              </div>

              {hasMoreRecords ? (
                <div style={{ display: "flex", justifyContent: "center" }}>
                  <button
                    onClick={() =>
                      setVisibleCount((current) => current + RECORD_BATCH_SIZE)
                    }
                    style={styles.buttonSecondary}
                  >
                    Show More Records
                  </button>
                </div>
              ) : null}
            </>
          ) : null}
        </DialogPanel>
      </div>
    </Dialog>
  );
};

export default CorpusExplorer;
@@ -1,14 +1,34 @@
import Card from "./Card";
import StatsStyling from "../styles/stats_styling";
import type { CulturalAnalysisResponse } from "../types/ApiTypes";
import {
  buildCertaintySpec,
  buildDeonticSpec,
  buildEntitySpec,
  buildHedgeSpec,
  buildIdentityBucketSpec,
  buildPermissionSpec,
  type CorpusExplorerSpec,
} from "../utils/corpusExplorer";

const styles = StatsStyling;
const exploreButtonStyle = { padding: "4px 8px", fontSize: 12 };

type CulturalStatsProps = {
  data: CulturalAnalysisResponse;
  onExplore: (spec: CorpusExplorerSpec) => void;
};

const CulturalStats = ({ data }: CulturalStatsProps) => {
const renderExploreButton = (onClick: () => void) => (
  <button
    onClick={onClick}
    style={{ ...styles.buttonSecondary, ...exploreButtonStyle }}
  >
    Explore
  </button>
);

const CulturalStats = ({ data, onExplore }: CulturalStatsProps) => {
  const identity = data.identity_markers;
  const stance = data.stance_markers;
  const inGroupWords = identity?.in_group_usage ?? 0;

@@ -30,7 +50,7 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
  const topEmotion = (emotionAvg: Record<string, number> | undefined) => {
    const entries = Object.entries(emotionAvg ?? {});
    if (!entries.length) {
      return "—";
      return "-";
    }

    entries.sort((a, b) => b[1] - a[1]);

@@ -64,21 +84,30 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
      />
      <Card
        label="In-Group Posts"
        value={identity?.in_group_posts?.toLocaleString() ?? "—"}
        value={identity?.in_group_posts?.toLocaleString() ?? "-"}
        sublabel='Posts leaning toward "us" language'
        rightSlot={renderExploreButton(() =>
          onExplore(buildIdentityBucketSpec("in")),
        )}
        style={{ gridColumn: "span 3" }}
      />
      <Card
        label="Out-Group Posts"
        value={identity?.out_group_posts?.toLocaleString() ?? "—"}
        value={identity?.out_group_posts?.toLocaleString() ?? "-"}
        sublabel='Posts leaning toward "them" language'
        rightSlot={renderExploreButton(() =>
          onExplore(buildIdentityBucketSpec("out")),
        )}
        style={{ gridColumn: "span 3" }}
      />

      <Card
        label="Balanced Posts"
        value={identity?.tie_posts?.toLocaleString() ?? "—"}
        value={identity?.tie_posts?.toLocaleString() ?? "-"}
        sublabel="Posts with equal us/them signals"
        rightSlot={renderExploreButton(() =>
          onExplore(buildIdentityBucketSpec("tie")),
        )}
        style={{ gridColumn: "span 3" }}
      />
      <Card
@@ -90,7 +119,7 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
      <Card
        label="In-Group Share"
        value={
          inGroupWordRate === null ? "—" : `${inGroupWordRate.toFixed(2)}%`
          inGroupWordRate === null ? "-" : `${inGroupWordRate.toFixed(2)}%`
        }
        sublabel="Share of all words"
        style={{ gridColumn: "span 3" }}
@@ -98,7 +127,7 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
      <Card
        label="Out-Group Share"
        value={
          outGroupWordRate === null ? "—" : `${outGroupWordRate.toFixed(2)}%`
          outGroupWordRate === null ? "-" : `${outGroupWordRate.toFixed(2)}%`
        }
        sublabel="Share of all words"
        style={{ gridColumn: "span 3" }}
@@ -106,42 +135,46 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {

      <Card
        label="Hedging Words"
        value={stance?.hedge_total?.toLocaleString() ?? "—"}
        value={stance?.hedge_total?.toLocaleString() ?? "-"}
        sublabel={
          typeof stance?.hedge_per_1k_tokens === "number"
            ? `${stance.hedge_per_1k_tokens.toFixed(1)} per 1k words`
            : "Word frequency"
        }
        rightSlot={renderExploreButton(() => onExplore(buildHedgeSpec()))}
        style={{ gridColumn: "span 3" }}
      />
      <Card
        label="Certainty Words"
        value={stance?.certainty_total?.toLocaleString() ?? "—"}
        value={stance?.certainty_total?.toLocaleString() ?? "-"}
        sublabel={
          typeof stance?.certainty_per_1k_tokens === "number"
            ? `${stance.certainty_per_1k_tokens.toFixed(1)} per 1k words`
            : "Word frequency"
        }
        rightSlot={renderExploreButton(() => onExplore(buildCertaintySpec()))}
        style={{ gridColumn: "span 3" }}
      />
      <Card
        label="Need/Should Words"
        value={stance?.deontic_total?.toLocaleString() ?? "—"}
        value={stance?.deontic_total?.toLocaleString() ?? "-"}
        sublabel={
          typeof stance?.deontic_per_1k_tokens === "number"
            ? `${stance.deontic_per_1k_tokens.toFixed(1)} per 1k words`
            : "Word frequency"
        }
        rightSlot={renderExploreButton(() => onExplore(buildDeonticSpec()))}
        style={{ gridColumn: "span 3" }}
      />
      <Card
        label="Permission Words"
        value={stance?.permission_total?.toLocaleString() ?? "—"}
        value={stance?.permission_total?.toLocaleString() ?? "-"}
        sublabel={
          typeof stance?.permission_per_1k_tokens === "number"
            ? `${stance.permission_per_1k_tokens.toFixed(1)} per 1k words`
            : "Word frequency"
        }
        rightSlot={renderExploreButton(() => onExplore(buildPermissionSpec()))}
        style={{ gridColumn: "span 3" }}
      />

@@ -150,8 +183,14 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
        <p style={styles.sectionSubtitle}>
          Most likely emotion when in-group wording is stronger.
        </p>
        <div style={styles.topUserName}>
          {topEmotion(identity?.in_group_emotion_avg)}
        <div style={styles.topUserName}>{topEmotion(identity?.in_group_emotion_avg)}</div>
        <div style={{ marginTop: 12 }}>
          <button
            onClick={() => onExplore(buildIdentityBucketSpec("in"))}
            style={styles.buttonSecondary}
          >
            Explore records
          </button>
        </div>
      </div>

@@ -160,8 +199,14 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
        <p style={styles.sectionSubtitle}>
          Most likely emotion when out-group wording is stronger.
        </p>
        <div style={styles.topUserName}>
          {topEmotion(identity?.out_group_emotion_avg)}
        <div style={styles.topUserName}>{topEmotion(identity?.out_group_emotion_avg)}</div>
        <div style={{ marginTop: 12 }}>
          <button
            onClick={() => onExplore(buildIdentityBucketSpec("out"))}
            style={styles.buttonSecondary}
          >
            Explore records
          </button>
        </div>
      </div>

@@ -171,9 +216,7 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
          Most mentioned entities and the mood that appears most with each.
        </p>
        {!entities.length ? (
          <div style={styles.topUserMeta}>
            No entity-level cultural data available.
          </div>
          <div style={styles.topUserMeta}>No entity-level cultural data available.</div>
        ) : (
          <div
            style={{
@@ -183,7 +226,11 @@ const CulturalStats = ({ data }: CulturalStatsProps) => {
            }}
          >
            {entities.map(([entity, aggregate]) => (
              <div key={entity} style={styles.topUserItem}>
              <div
                key={entity}
                style={{ ...styles.topUserItem, cursor: "pointer" }}
                onClick={() => onExplore(buildEntitySpec(entity))}
              >
                <div style={styles.topUserName}>{entity}</div>
                <div style={styles.topUserMeta}>
                  {aggregate.post_count.toLocaleString()} posts • Likely mood:{" "}
@@ -1,13 +1,20 @@
import type { EmotionalAnalysisResponse } from "../types/ApiTypes";
import StatsStyling from "../styles/stats_styling";
import {
  buildDominantEmotionSpec,
  buildSourceSpec,
  buildTopicSpec,
  type CorpusExplorerSpec,
} from "../utils/corpusExplorer";

const styles = StatsStyling;

type EmotionalStatsProps = {
  emotionalData: EmotionalAnalysisResponse;
  onExplore: (spec: CorpusExplorerSpec) => void;
};

const EmotionalStats = ({ emotionalData }: EmotionalStatsProps) => {
const EmotionalStats = ({ emotionalData, onExplore }: EmotionalStatsProps) => {
  const rows = emotionalData.average_emotion_by_topic ?? [];
  const overallEmotionAverage = emotionalData.overall_emotion_average ?? [];
  const dominantEmotionDistribution =

@@ -126,7 +133,11 @@ const EmotionalStats = ({ emotionalData }: EmotionalStatsProps) => {
            {[...overallEmotionAverage]
              .sort((a, b) => b.score - a.score)
              .map((row) => (
                <div key={row.emotion} style={styles.topUserItem}>
                <div
                  key={row.emotion}
                  style={{ ...styles.topUserItem, cursor: "pointer" }}
                  onClick={() => onExplore(buildDominantEmotionSpec(row.emotion))}
                >
                  <div style={styles.topUserName}>
                    {formatEmotion(row.emotion)}
                  </div>

@@ -157,7 +168,11 @@ const EmotionalStats = ({ emotionalData }: EmotionalStatsProps) => {
            {[...dominantEmotionDistribution]
              .sort((a, b) => b.ratio - a.ratio)
              .map((row) => (
                <div key={row.emotion} style={styles.topUserItem}>
                <div
                  key={row.emotion}
                  style={{ ...styles.topUserItem, cursor: "pointer" }}
                  onClick={() => onExplore(buildDominantEmotionSpec(row.emotion))}
                >
                  <div style={styles.topUserName}>
                    {formatEmotion(row.emotion)}
                  </div>

@@ -189,7 +204,11 @@ const EmotionalStats = ({ emotionalData }: EmotionalStatsProps) => {
            {[...emotionBySource]
              .sort((a, b) => b.event_count - a.event_count)
              .map((row) => (
                <div key={row.source} style={styles.topUserItem}>
                <div
                  key={row.source}
                  style={{ ...styles.topUserItem, cursor: "pointer" }}
                  onClick={() => onExplore(buildSourceSpec(row.source))}
                >
                  <div style={styles.topUserName}>{row.source}</div>
                  <div style={styles.topUserMeta}>
                    {formatEmotion(row.dominant_emotion)} •{" "}

@@ -211,7 +230,8 @@ const EmotionalStats = ({ emotionalData }: EmotionalStatsProps) => {
          {strongestPerTopic.map((topic) => (
            <div
              key={topic.topic}
              style={{ ...styles.cardBase, gridColumn: "span 4" }}
              style={{ ...styles.cardBase, gridColumn: "span 4", cursor: "pointer" }}
              onClick={() => onExplore(buildTopicSpec(topic.topic))}
            >
              <h3 style={{ ...styles.sectionTitle, marginBottom: 6 }}>
                {topic.topic}
@@ -1,14 +1,20 @@
import Card from "./Card";
import StatsStyling from "../styles/stats_styling";
import type { LinguisticAnalysisResponse } from "../types/ApiTypes";
import {
  buildNgramSpec,
  buildWordSpec,
  type CorpusExplorerSpec,
} from "../utils/corpusExplorer";

const styles = StatsStyling;

type LinguisticStatsProps = {
  data: LinguisticAnalysisResponse;
  onExplore: (spec: CorpusExplorerSpec) => void;
};

const LinguisticStats = ({ data }: LinguisticStatsProps) => {
const LinguisticStats = ({ data, onExplore }: LinguisticStatsProps) => {
  const lexical = data.lexical_diversity;
  const words = data.word_frequencies ?? [];
  const bigrams = data.common_two_phrases ?? [];

@@ -60,7 +66,11 @@ const LinguisticStats = ({ data }: LinguisticStatsProps) => {
            }}
          >
            {topWords.map((item) => (
              <div key={item.word} style={styles.topUserItem}>
              <div
                key={item.word}
                style={{ ...styles.topUserItem, cursor: "pointer" }}
                onClick={() => onExplore(buildWordSpec(item.word))}
              >
                <div style={styles.topUserName}>{item.word}</div>
                <div style={styles.topUserMeta}>
                  {item.count.toLocaleString()} uses

@@ -81,7 +91,11 @@ const LinguisticStats = ({ data }: LinguisticStatsProps) => {
            }}
          >
            {topBigrams.map((item) => (
              <div key={item.ngram} style={styles.topUserItem}>
              <div
                key={item.ngram}
                style={{ ...styles.topUserItem, cursor: "pointer" }}
                onClick={() => onExplore(buildNgramSpec(item.ngram))}
              >
                <div style={styles.topUserName}>{item.ngram}</div>
                <div style={styles.topUserMeta}>
                  {item.count.toLocaleString()} uses

@@ -102,7 +116,11 @@ const LinguisticStats = ({ data }: LinguisticStatsProps) => {
            }}
          >
            {topTrigrams.map((item) => (
              <div key={item.ngram} style={styles.topUserItem}>
              <div
                key={item.ngram}
                style={{ ...styles.topUserItem, cursor: "pointer" }}
                onClick={() => onExplore(buildNgramSpec(item.ngram))}
              >
                <div style={styles.topUserName}>{item.ngram}</div>
                <div style={styles.topUserMeta}>
                  {item.count.toLocaleString()} uses
@@ -1,4 +1,4 @@
import { memo, useMemo, useState } from "react";
import { memo, useMemo } from "react";
import {
  LineChart,
  Line,
@@ -13,7 +13,6 @@ import ActivityHeatmap from "../stats/ActivityHeatmap";
import { ReactWordcloud } from "@cp949/react-wordcloud";
import StatsStyling from "../styles/stats_styling";
import Card from "../components/Card";
import UserModal from "../components/UserModal";

import {
  type SummaryResponse,
@@ -21,11 +20,18 @@ import {
  type UserEndpointResponse,
  type TimeAnalysisResponse,
  type LinguisticAnalysisResponse,
  type User,
} from "../types/ApiTypes";
import {
  buildAllRecordsSpec,
  buildDateBucketSpec,
  buildOneTimeUsersSpec,
  buildUserSpec,
  type CorpusExplorerSpec,
} from "../utils/corpusExplorer";

const styles = StatsStyling;
const MAX_WORDCLOUD_WORDS = 250;
const exploreButtonStyle = { padding: "4px 8px", fontSize: 12 };

const WORDCLOUD_OPTIONS = {
  rotations: 2,
@@ -39,6 +45,7 @@ type SummaryStatsProps = {
  timeData: TimeAnalysisResponse | null;
  linguisticData: LinguisticAnalysisResponse | null;
  summary: SummaryResponse | null;
  onExplore: (spec: CorpusExplorerSpec) => void;
};

type WordCloudPanelProps = {
@@ -60,7 +67,7 @@ function formatDateRange(startUnix: number, endUnix: number) {
    day: "2-digit",
  });

  return `${fmt(start)} → ${fmt(end)}`;
  return `${fmt(start)} -> ${fmt(end)}`;
}

function convertFrequencyData(data: FrequencyWord[]) {
@@ -70,25 +77,22 @@ function convertFrequencyData(data: FrequencyWord[]) {
  }));
}

const renderExploreButton = (onClick: () => void) => (
  <button
    onClick={onClick}
    style={{ ...styles.buttonSecondary, ...exploreButtonStyle }}
  >
    Explore
  </button>
);

const SummaryStats = ({
  userData,
  timeData,
  linguisticData,
  summary,
  onExplore,
}: SummaryStatsProps) => {
  const [selectedUser, setSelectedUser] = useState<string | null>(null);
  const usersByAuthor = useMemo(() => {
    const nextMap = new Map<string, User>();
    for (const user of userData?.users ?? []) {
      nextMap.set(user.author, user);
    }
    return nextMap;
  }, [userData?.users]);

  const selectedUserData: User | null = selectedUser
    ? usersByAuthor.get(selectedUser) ?? null
    : null;

  const wordCloudWords = useMemo(
    () =>
      convertFrequencyData(
@@ -104,49 +108,41 @@ const SummaryStats = ({

  return (
    <div style={styles.page}>
      {/* main grid*/}
      <div style={{ ...styles.container, ...styles.grid }}>
        <Card
          label="Total Activity"
          value={summary?.total_events ?? "—"}
          value={summary?.total_events ?? "-"}
          sublabel="Posts + comments"
          style={{
            gridColumn: "span 4",
          }}
          rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
          style={{ gridColumn: "span 4" }}
        />
        <Card
          label="Active People"
          value={summary?.unique_users ?? "—"}
          value={summary?.unique_users ?? "-"}
          sublabel="Distinct users"
          style={{
            gridColumn: "span 4",
          }}
          rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
          style={{ gridColumn: "span 4" }}
        />
        <Card
          label="Posts vs Comments"
          value={
            summary ? `${summary.total_posts} / ${summary.total_comments}` : "—"
            summary ? `${summary.total_posts} / ${summary.total_comments}` : "-"
          }
          sublabel={`Comments per post: ${summary?.comments_per_post ?? "—"}`}
          style={{
            gridColumn: "span 4",
          }}
          sublabel={`Comments per post: ${summary?.comments_per_post ?? "-"}`}
          rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
          style={{ gridColumn: "span 4" }}
        />

        <Card
          label="Time Range"
          value={
            summary?.time_range
              ? formatDateRange(
                  summary.time_range.start,
                  summary.time_range.end,
                )
              : "—"
              ? formatDateRange(summary.time_range.start, summary.time_range.end)
              : "-"
          }
          sublabel="Based on dataset timestamps"
          style={{
            gridColumn: "span 4",
          }}
          rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
          style={{ gridColumn: "span 4" }}
|
||||
/>
|
||||
|
||||
<Card
|
||||
@@ -154,38 +150,44 @@ const SummaryStats = ({
|
||||
value={
|
||||
typeof summary?.lurker_ratio === "number"
|
||||
? `${Math.round(summary.lurker_ratio * 100)}%`
|
||||
: "—"
|
||||
: "-"
|
||||
}
|
||||
sublabel="Users with only one event"
|
||||
style={{
|
||||
gridColumn: "span 4",
|
||||
}}
|
||||
rightSlot={renderExploreButton(() => onExplore(buildOneTimeUsersSpec()))}
|
||||
style={{ gridColumn: "span 4" }}
|
||||
/>
|
||||
|
||||
<Card
|
||||
label="Sources"
|
||||
value={summary?.sources?.length ?? "—"}
|
||||
value={summary?.sources?.length ?? "-"}
|
||||
sublabel={
|
||||
summary?.sources?.length
|
||||
? summary.sources.slice(0, 3).join(", ") +
|
||||
(summary.sources.length > 3 ? "…" : "")
|
||||
: "—"
|
||||
(summary.sources.length > 3 ? "..." : "")
|
||||
: "-"
|
||||
}
|
||||
style={{
|
||||
gridColumn: "span 4",
|
||||
}}
|
||||
rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
|
||||
style={{ gridColumn: "span 4" }}
|
||||
/>
|
||||
|
||||
{/* events per day */}
|
||||
<div style={{ ...styles.card, gridColumn: "span 5" }}>
|
||||
<h2 style={styles.sectionTitle}>Activity Over Time</h2>
|
||||
<p style={styles.sectionSubtitle}>
|
||||
How much posting happened each day.
|
||||
</p>
|
||||
<p style={styles.sectionSubtitle}>How much posting happened each day.</p>
|
||||
|
||||
<div style={styles.chartWrapper}>
|
||||
<ResponsiveContainer width="100%" height="100%">
|
||||
<LineChart data={timeData?.events_per_day ?? []}>
|
||||
<LineChart
|
||||
data={timeData?.events_per_day ?? []}
|
||||
onClick={(state: unknown) => {
|
||||
const payload = (state as { activePayload?: Array<{ payload?: { date?: string } }> })
|
||||
?.activePayload?.[0]?.payload as
|
||||
| { date?: string }
|
||||
| undefined;
|
||||
if (payload?.date) {
|
||||
onExplore(buildDateBucketSpec(String(payload.date)));
|
||||
}
|
||||
}}
|
||||
>
|
||||
<CartesianGrid strokeDasharray="3 3" />
|
||||
<XAxis dataKey="date" />
|
||||
<YAxis />
|
||||
@@ -201,7 +203,6 @@ const SummaryStats = ({
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{/* Word Cloud */}
|
||||
<div style={{ ...styles.card, gridColumn: "span 4" }}>
|
||||
<h2 style={styles.sectionTitle}>Common Words</h2>
|
||||
<p style={styles.sectionSubtitle}>
|
||||
@@ -213,7 +214,6 @@ const SummaryStats = ({
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{/* Top Users */}
|
||||
<div
|
||||
style={{ ...styles.card, ...styles.scrollArea, gridColumn: "span 3" }}
|
||||
>
|
||||
@@ -225,7 +225,7 @@ const SummaryStats = ({
|
||||
<div
|
||||
key={`${item.author}-${item.source}`}
|
||||
style={{ ...styles.topUserItem, cursor: "pointer" }}
|
||||
onClick={() => setSelectedUser(item.author)}
|
||||
onClick={() => onExplore(buildUserSpec(item.author))}
|
||||
>
|
||||
<div style={styles.topUserName}>{item.author}</div>
|
||||
<div style={styles.topUserMeta}>
|
||||
@@ -236,7 +236,6 @@ const SummaryStats = ({
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{/* Heatmap */}
|
||||
<div style={{ ...styles.card, gridColumn: "span 12" }}>
|
||||
<h2 style={styles.sectionTitle}>Weekly Activity Pattern</h2>
|
||||
<p style={styles.sectionSubtitle}>
|
||||
@@ -248,13 +247,6 @@ const SummaryStats = ({
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<UserModal
|
||||
open={!!selectedUser}
|
||||
onClose={() => setSelectedUser(null)}
|
||||
username={selectedUser ?? ""}
|
||||
userData={selectedUserData}
|
||||
/>
|
||||
</div>
|
||||
);
|
||||
};
|
||||
|
||||
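The spec builders imported above (buildAllRecordsSpec, buildUserSpec, and friends) live in ../utils/corpusExplorer, which is not part of this diff. Based on how StatPage consumes a spec later in this changeset (a title, description, emptyMessage, and a matcher predicate run over dataset records), a builder could plausibly look like this — a hypothetical sketch, not the repository's actual implementation:

```typescript
// Hypothetical sketch of a corpus-explorer spec builder; the real helpers
// live in ../utils/corpusExplorer and are not shown in this diff.
type DatasetRecord = Record<string, unknown>;

type CorpusExplorerSpec = {
  title: string;
  description: string;
  emptyMessage?: string;
  // context is whatever buildExplorerContext derives from the full record set
  matcher: (record: DatasetRecord, context: unknown) => boolean;
};

function buildUserSpec(author: string): CorpusExplorerSpec {
  return {
    title: `Records by ${author}`,
    description: `All posts and comments authored by ${author}.`,
    emptyMessage: `No records found for ${author}.`,
    // Case-insensitive author match, mirroring isDeletedUser's normalization
    matcher: (record) =>
      String(record.author ?? "").toLowerCase() === author.toLowerCase(),
  };
}

const spec = buildUserSpec("corkfan");
console.log(spec.matcher({ author: "CorkFan" }, null)); // → true
```

The matcher-as-predicate shape is what lets StatPage filter one cached copy of the full record set client-side instead of adding a server endpoint per card.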
@@ -88,6 +88,15 @@ export default function UserModal({
             </div>
           </div>
         ) : null}
+
+        {userData.dominant_topic ? (
+          <div style={styles.topUserItem}>
+            <div style={styles.topUserName}>Most Common Topic</div>
+            <div style={styles.topUserMeta}>
+              {userData.dominant_topic.topic} ({userData.dominant_topic.count} events)
+            </div>
+          </div>
+        ) : null}
       </div>
     )}
   </DialogPanel>

@@ -5,6 +5,12 @@ import { type TopUser, type InteractionGraph } from "../types/ApiTypes";

 import StatsStyling from "../styles/stats_styling";
 import Card from "./Card";
+import {
+  buildReplyPairSpec,
+  toText,
+  buildUserSpec,
+  type CorpusExplorerSpec,
+} from "../utils/corpusExplorer";

 const styles = StatsStyling;

@@ -14,7 +20,7 @@ type GraphLink = {
   value: number;
 };

-function ApiToGraphData(apiData: InteractionGraph) {
+function toGraphData(apiData: InteractionGraph) {
   const links: GraphLink[] = [];
   const connectedNodeIds = new Set<string>();

@@ -39,6 +45,7 @@ type UserStatsProps = {
   interactionGraph: InteractionGraph;
   totalUsers: number;
   mostCommentHeavyUser: { author: string; commentShare: number } | null;
+  onExplore: (spec: CorpusExplorerSpec) => void;
 };

 const UserStats = ({
@@ -46,9 +53,10 @@ const UserStats = ({
   interactionGraph,
   totalUsers,
   mostCommentHeavyUser,
+  onExplore,
 }: UserStatsProps) => {
   const graphData = useMemo(
-    () => ApiToGraphData(interactionGraph),
+    () => toGraphData(interactionGraph),
     [interactionGraph],
   );
   const graphContainerRef = useRef<HTMLDivElement | null>(null);
@@ -87,9 +95,9 @@ const UserStats = ({
     null,
   );

-  const mostActiveUser = topUsers.find(
-    (u) => u.author !== "[deleted]",
-  );
+  const mostActiveUser = topUsers.find((u) => u.author !== "[deleted]");
+  const strongestLinkSource = strongestLink ? toText(strongestLink.source) : "";
+  const strongestLinkTarget = strongestLink ? toText(strongestLink.target) : "";

   return (
     <div style={styles.page}>
@@ -114,37 +122,69 @@ const UserStats = ({
         />
         <Card
           label="Most Active User"
-          value={mostActiveUser?.author ?? "—"}
+          value={mostActiveUser?.author ?? "-"}
           sublabel={
             mostActiveUser
               ? `${mostActiveUser.count.toLocaleString()} events`
               : "No user activity found"
           }
+          rightSlot={
+            mostActiveUser ? (
+              <button
+                onClick={() => onExplore(buildUserSpec(mostActiveUser.author))}
+                style={styles.buttonSecondary}
+              >
+                Explore
+              </button>
+            ) : null
+          }
           style={{ gridColumn: "span 3" }}
         />

         <Card
           label="Strongest User Link"
           value={
-            strongestLink
-              ? `${strongestLink.source} -> ${strongestLink.target}`
-              : "—"
+            strongestLinkSource && strongestLinkTarget
+              ? `${strongestLinkSource} -> ${strongestLinkTarget}`
+              : "-"
           }
           sublabel={
             strongestLink
               ? `${strongestLink.value.toLocaleString()} replies`
               : "No graph links after filtering"
           }
+          rightSlot={
+            strongestLinkSource && strongestLinkTarget ? (
+              <button
+                onClick={() =>
+                  onExplore(buildReplyPairSpec(strongestLinkSource, strongestLinkTarget))
+                }
+                style={styles.buttonSecondary}
+              >
+                Explore
+              </button>
+            ) : null
+          }
           style={{ gridColumn: "span 6" }}
         />
         <Card
           label="Most Comment-Heavy User"
-          value={mostCommentHeavyUser?.author ?? "—"}
+          value={mostCommentHeavyUser?.author ?? "-"}
           sublabel={
             mostCommentHeavyUser
               ? `${Math.round(mostCommentHeavyUser.commentShare * 100)}% comments`
               : "No user distribution available"
           }
+          rightSlot={
+            mostCommentHeavyUser ? (
+              <button
+                onClick={() => onExplore(buildUserSpec(mostCommentHeavyUser.author))}
+                style={styles.buttonSecondary}
+              >
+                Explore
+              </button>
+            ) : null
+          }
           style={{ gridColumn: "span 6" }}
         />

@@ -166,6 +206,19 @@ const UserStats = ({
             linkDirectionalParticleSpeed={0.004}
             linkWidth={(link) => Math.sqrt(Number(link.value))}
             nodeLabel={(node) => `${node.id}`}
+            onNodeClick={(node) => {
+              const userId = toText(node.id);
+              if (userId) {
+                onExplore(buildUserSpec(userId));
+              }
+            }}
+            onLinkClick={(link) => {
+              const source = toText(link.source);
+              const target = toText(link.target);
+              if (source && target) {
+                onExplore(buildReplyPairSpec(source, target));
+              }
+            }}
           />
         </div>
       </div>

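toText is imported from ../utils/corpusExplorer above but its body is outside this diff. It is applied both to raw node ids and to force-graph link endpoints, and force-graph replaces link.source/link.target id strings with node objects once the simulation runs, so the helper presumably has to accept both shapes. A plausible sketch (an assumption, not the repository's code):

```typescript
// Hypothetical sketch of toText; the real helper lives in
// ../utils/corpusExplorer. force-graph link endpoints start out as id
// strings and are mutated into node objects ({ id, ... }), so both
// shapes are handled; anything unrecognised collapses to "".
function toText(value: unknown): string {
  if (typeof value === "string") return value;
  if (typeof value === "number") return String(value);
  if (value && typeof value === "object" && "id" in value) {
    return String((value as { id: unknown }).id ?? "");
  }
  return "";
}

console.log(toText("alice"));       // → "alice"
console.log(toText({ id: "bob" })); // → "bob"
console.log(toText(undefined));     // → ""
```

Returning "" for unusable values is what lets the click handlers above gate on `if (source && target)` without a separate null check.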
@@ -37,7 +37,7 @@ const supportsSearch = (source?: SourceOption): boolean =>
 const supportsCategories = (source?: SourceOption): boolean =>
   Boolean(source?.categories_enabled ?? source?.categoriesEnabled);

-const AutoScrapePage = () => {
+const AutoFetchPage = () => {
   const navigate = useNavigate();
   const [datasetName, setDatasetName] = useState("");
   const [sourceOptions, setSourceOptions] = useState<SourceOption[]>([]);
@@ -106,11 +106,11 @@ const AutoScrapePage = () => {
     );
   };

-  const autoScrape = async () => {
+  const autoFetch = async () => {
     const token = localStorage.getItem("access_token");
     if (!token) {
       setHasError(true);
-      setReturnMessage("You must be signed in to auto scrape a dataset.");
+      setReturnMessage("You must be signed in to auto fetch a dataset.");
       return;
     }

@@ -243,7 +243,7 @@ const AutoScrapePage = () => {
       setReturnMessage("");

       const response = await axios.post(
-        `${API_BASE_URL}/datasets/scrape`,
+        `${API_BASE_URL}/datasets/fetch`,
         requestBody,
         {
           headers: {
@@ -255,7 +255,7 @@ const AutoScrapePage = () => {
       const datasetId = Number(response.data.dataset_id);

       setReturnMessage(
-        `Auto scrape queued successfully (dataset #${datasetId}). Redirecting to processing status...`,
+        `Auto fetch queued successfully (dataset #${datasetId}). Redirecting to processing status...`,
       );

       setTimeout(() => {
@@ -267,11 +267,11 @@ const AutoScrapePage = () => {
         const message = String(
           requestError.response?.data?.error ||
             requestError.message ||
-            "Auto scrape failed.",
+            "Auto fetch failed.",
         );
-        setReturnMessage(`Auto scrape failed: ${message}`);
+        setReturnMessage(`Auto fetch failed: ${message}`);
       } else {
-        setReturnMessage("Auto scrape failed due to an unexpected error.");
+        setReturnMessage("Auto fetch failed due to an unexpected error.");
       }
     } finally {
       setIsSubmitting(false);
@@ -283,9 +283,9 @@ const AutoScrapePage = () => {
     <div style={styles.containerWide}>
       <div style={{ ...styles.card, ...styles.headerBar }}>
         <div>
-          <h1 style={styles.sectionHeaderTitle}>Auto Scrape Dataset</h1>
+          <h1 style={styles.sectionHeaderTitle}>Auto Fetch Dataset</h1>
           <p style={styles.sectionHeaderSubtitle}>
-            Select sources and scrape settings, then queue processing
+            Select sources and fetch settings, then queue processing
             automatically.
           </p>
           <p
@@ -295,7 +295,7 @@ const AutoScrapePage = () => {
               color: "#9a6700",
             }}
           >
-            Warning: Scraping more than 250 posts from any single site can
+            Warning: Fetching more than 250 posts from any single site can
             take hours due to rate limits.
           </p>
         </div>
@@ -305,10 +305,10 @@ const AutoScrapePage = () => {
               ...styles.buttonPrimary,
               opacity: isSubmitting || isLoadingSources ? 0.75 : 1,
             }}
-            onClick={autoScrape}
+            onClick={autoFetch}
             disabled={isSubmitting || isLoadingSources}
           >
-            {isSubmitting ? "Queueing..." : "Auto Scrape and Analyze"}
+            {isSubmitting ? "Queueing..." : "Auto Fetch and Analyze"}
           </button>
         </div>

@@ -527,4 +527,4 @@ const AutoScrapePage = () => {
   );
 };

-export default AutoScrapePage;
+export default AutoFetchPage;

@@ -22,12 +22,10 @@ const DatasetEditPage = () => {
   const [isSaving, setIsSaving] = useState(false);
   const [isDeleting, setIsDeleting] = useState(false);
   const [isDeleteModalOpen, setIsDeleteModalOpen] = useState(false);
-  const [hasError, setHasError] = useState(false);

   const [datasetName, setDatasetName] = useState("");
   useEffect(() => {
     if (!Number.isInteger(parsedDatasetId) || parsedDatasetId <= 0) {
-      setHasError(true);
       setStatusMessage("Invalid dataset id.");
       setLoading(false);
       return;
@@ -35,7 +33,6 @@ const DatasetEditPage = () => {

     const token = localStorage.getItem("access_token");
     if (!token) {
-      setHasError(true);
       setStatusMessage("You must be signed in to edit datasets.");
       setLoading(false);
       return;
@@ -49,7 +46,6 @@ const DatasetEditPage = () => {
         setDatasetName(response.data.name || "");
       })
       .catch((error: unknown) => {
-        setHasError(true);
         if (axios.isAxiosError(error)) {
           setStatusMessage(
             String(error.response?.data?.error || error.message),
@@ -68,21 +64,18 @@ const DatasetEditPage = () => {

     const trimmedName = datasetName.trim();
     if (!trimmedName) {
-      setHasError(true);
       setStatusMessage("Please enter a valid dataset name.");
       return;
     }

     const token = localStorage.getItem("access_token");
     if (!token) {
-      setHasError(true);
       setStatusMessage("You must be signed in to save changes.");
       return;
     }

     try {
       setIsSaving(true);
-      setHasError(false);
       setStatusMessage("");

       await axios.patch(
@@ -93,7 +86,6 @@ const DatasetEditPage = () => {

       navigate("/datasets", { replace: true });
     } catch (error: unknown) {
-      setHasError(true);
       if (axios.isAxiosError(error)) {
         setStatusMessage(
           String(
@@ -111,7 +103,6 @@ const DatasetEditPage = () => {
   const deleteDataset = async () => {
     const deleteToken = localStorage.getItem("access_token");
     if (!deleteToken) {
-      setHasError(true);
       setStatusMessage("You must be signed in to delete datasets.");
       setIsDeleteModalOpen(false);
       return;
@@ -119,7 +110,6 @@ const DatasetEditPage = () => {

     try {
       setIsDeleting(true);
-      setHasError(false);
       setStatusMessage("");

       await axios.delete(`${API_BASE_URL}/dataset/${parsedDatasetId}`, {
@@ -129,7 +119,6 @@ const DatasetEditPage = () => {
       setIsDeleteModalOpen(false);
       navigate("/datasets", { replace: true });
     } catch (error: unknown) {
-      setHasError(true);
       if (axios.isAxiosError(error)) {
         setStatusMessage(
           String(

@@ -108,9 +108,9 @@ const DatasetsPage = () => {
             <button
               type="button"
               style={styles.buttonSecondary}
-              onClick={() => navigate("/auto-scrape")}
+              onClick={() => navigate("/auto-fetch")}
             >
-              Auto Scrape Dataset
+              Auto Fetch Dataset
             </button>
           </div>
         </div>

@@ -1,4 +1,4 @@
-import { useEffect, useState, useRef } from "react";
+import { useEffect, useRef, useState } from "react";
 import axios from "axios";
 import { useParams } from "react-router-dom";
 import StatsStyling from "../styles/stats_styling";
@@ -8,6 +8,7 @@ import UserStats from "../components/UserStats";
 import LinguisticStats from "../components/LinguisticStats";
 import InteractionalStats from "../components/InteractionalStats";
 import CulturalStats from "../components/CulturalStats";
+import CorpusExplorer from "../components/CorpusExplorer";

 import {
   type SummaryResponse,
@@ -19,10 +20,15 @@ import {
   type InteractionAnalysisResponse,
   type CulturalAnalysisResponse,
 } from "../types/ApiTypes";
+import {
+  buildExplorerContext,
+  type CorpusExplorerSpec,
+  type DatasetRecord,
+} from "../utils/corpusExplorer";

 const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
 const styles = StatsStyling;
-const DELETED_USERS = ["[deleted]"];
+const DELETED_USERS = ["[deleted]", "automoderator"];

 const isDeletedUser = (value: string | null | undefined) =>
   DELETED_USERS.includes((value ?? "").trim().toLowerCase());
@@ -40,6 +46,194 @@ type UserStatsMeta = {
   mostCommentHeavyUser: { author: string; commentShare: number } | null;
 };

+type ExplorerState = {
+  open: boolean;
+  title: string;
+  description: string;
+  emptyMessage: string;
+  records: DatasetRecord[];
+  loading: boolean;
+  error: string;
+};
+
+const EMPTY_EXPLORER_STATE: ExplorerState = {
+  open: false,
+  title: "Corpus Explorer",
+  description: "",
+  emptyMessage: "No records found.",
+  records: [],
+  loading: false,
+  error: "",
+};
+
+const createExplorerState = (
+  spec: CorpusExplorerSpec,
+  patch: Partial<ExplorerState> = {},
+): ExplorerState => ({
+  open: true,
+  title: spec.title,
+  description: spec.description,
+  emptyMessage: spec.emptyMessage ?? "No matching records found.",
+  records: [],
+  loading: false,
+  error: "",
+  ...patch,
+});
+
+const compareRecordsByNewest = (a: DatasetRecord, b: DatasetRecord) => {
+  const aValue = String(a.dt ?? a.date ?? a.timestamp ?? "");
+  const bValue = String(b.dt ?? b.date ?? b.timestamp ?? "");
+  return bValue.localeCompare(aValue);
+};
+
+const parseJsonLikePayload = (value: string): unknown => {
+  const normalized = value
+    .replace(/\uFEFF/g, "")
+    .replace(/,\s*([}\]])/g, "$1")
+    .replace(/(:\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
+    .replace(/(\[\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
+    .replace(/(,\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
+    .replace(/(:\s*)None\b/g, "$1null")
+    .replace(/(:\s*)True\b/g, "$1true")
+    .replace(/(:\s*)False\b/g, "$1false")
+    .replace(/(\[\s*)None\b/g, "$1null")
+    .replace(/(\[\s*)True\b/g, "$1true")
+    .replace(/(\[\s*)False\b/g, "$1false")
+    .replace(/(,\s*)None\b/g, "$1null")
+    .replace(/(,\s*)True\b/g, "$1true")
+    .replace(/(,\s*)False\b/g, "$1false");
+
+  return JSON.parse(normalized);
+};
+
+const tryParseRecords = (value: string) => {
+  try {
+    return normalizeRecordPayload(parseJsonLikePayload(value));
+  } catch {
+    return null;
+  }
+};
+
+const parseRecordStringPayload = (payload: string): DatasetRecord[] | null => {
+  const trimmed = payload.trim();
+  if (!trimmed) {
+    return [];
+  }
+
+  const direct = tryParseRecords(trimmed);
+  if (direct) {
+    return direct;
+  }
+
+  const ndjsonLines = trimmed
+    .split(/\r?\n/)
+    .map((line) => line.trim())
+    .filter(Boolean);
+  if (ndjsonLines.length > 0) {
+    try {
+      return ndjsonLines.map((line) => parseJsonLikePayload(line)) as DatasetRecord[];
+    } catch {
+    }
+  }
+
+  const bracketStart = trimmed.indexOf("[");
+  const bracketEnd = trimmed.lastIndexOf("]");
+  if (bracketStart !== -1 && bracketEnd > bracketStart) {
+    const parsed = tryParseRecords(trimmed.slice(bracketStart, bracketEnd + 1));
+    if (parsed) {
+      return parsed;
+    }
+  }
+
+  const braceStart = trimmed.indexOf("{");
+  const braceEnd = trimmed.lastIndexOf("}");
+  if (braceStart !== -1 && braceEnd > braceStart) {
+    const parsed = tryParseRecords(trimmed.slice(braceStart, braceEnd + 1));
+    if (parsed) {
+      return parsed;
+    }
+  }
+
+  return null;
+};
+
+const normalizeRecordPayload = (payload: unknown): DatasetRecord[] => {
+  if (typeof payload === "string") {
+    const parsed = parseRecordStringPayload(payload);
+    if (parsed) {
+      return parsed;
+    }
+
+    const preview = payload.trim().slice(0, 120).replace(/\s+/g, " ");
+    throw new Error(
+      `Corpus endpoint returned a non-JSON string payload.${
+        preview ? ` Response preview: ${preview}` : ""
+      }`,
+    );
+  }
+
+  if (
+    payload &&
+    typeof payload === "object" &&
+    "error" in payload &&
+    typeof (payload as { error?: unknown }).error === "string"
+  ) {
+    throw new Error((payload as { error: string }).error);
+  }
+
+  if (Array.isArray(payload)) {
+    return payload as DatasetRecord[];
+  }
+
+  if (
+    payload &&
+    typeof payload === "object" &&
+    "data" in payload &&
+    Array.isArray((payload as { data?: unknown }).data)
+  ) {
+    return (payload as { data: DatasetRecord[] }).data;
+  }
+
+  if (
+    payload &&
+    typeof payload === "object" &&
+    "records" in payload &&
+    Array.isArray((payload as { records?: unknown }).records)
+  ) {
+    return (payload as { records: DatasetRecord[] }).records;
+  }
+
+  if (
+    payload &&
+    typeof payload === "object" &&
+    "rows" in payload &&
+    Array.isArray((payload as { rows?: unknown }).rows)
+  ) {
+    return (payload as { rows: DatasetRecord[] }).rows;
+  }
+
+  if (
+    payload &&
+    typeof payload === "object" &&
+    "result" in payload &&
+    Array.isArray((payload as { result?: unknown }).result)
+  ) {
+    return (payload as { result: DatasetRecord[] }).result;
+  }
+
+  if (payload && typeof payload === "object") {
+    const values = Object.values(payload);
+    if (values.length === 1 && Array.isArray(values[0])) {
+      return values[0] as DatasetRecord[];
+    }
+    if (values.every((value) => value && typeof value === "object")) {
+      return values as DatasetRecord[];
+    }
+  }
+
+  throw new Error("Corpus endpoint returned an unexpected payload.");
+};
+
 const StatPage = () => {
   const { datasetId: routeDatasetId } = useParams<{ datasetId: string }>();
   const [error, setError] = useState("");
@@ -61,6 +255,12 @@ const StatPage = () => {
     totalUsers: 0,
     mostCommentHeavyUser: null,
   });
+  const [appliedFilters, setAppliedFilters] = useState<Record<string, string>>({});
+  const [allRecords, setAllRecords] = useState<DatasetRecord[] | null>(null);
+  const [allRecordsKey, setAllRecordsKey] = useState("");
+  const [explorerState, setExplorerState] = useState<ExplorerState>(
+    EMPTY_EXPLORER_STATE,
+  );

   const searchInputRef = useRef<HTMLInputElement>(null);
   const beforeDateRef = useRef<HTMLInputElement>(null);
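The parseJsonLikePayload helper added above papers over payloads serialized with Python repr semantics (None/True/False, NaN/Infinity, trailing commas) rather than strict JSON. The same normalization can be exercised standalone; the following is a condensed sketch of the idea, with the three positional regexes per literal collapsed into one character class. Like the component's version, it will also rewrite matching literals that happen to appear inside string values:

```typescript
// Condensed sketch of the Python-repr-to-JSON normalization used above.
// Rewrites Python literals, non-finite numbers, and trailing commas so
// that JSON.parse accepts the payload.
function parsePythonishJson(value: string): unknown {
  const normalized = value
    .replace(/\uFEFF/g, "")                                   // strip BOMs
    .replace(/,\s*([}\]])/g, "$1")                            // trailing commas
    .replace(/([:\[,]\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
    .replace(/([:\[,]\s*)None\b/g, "$1null")
    .replace(/([:\[,]\s*)True\b/g, "$1true")
    .replace(/([:\[,]\s*)False\b/g, "$1false");
  return JSON.parse(normalized);
}

const parsed = parsePythonishJson('{"ok": True, "score": NaN, "tags": [None, "a",]}');
console.log(JSON.stringify(parsed)); // → {"ok":true,"score":null,"tags":[null,"a"]}
```

Anchoring each literal to a preceding `:`, `[`, or `,` is what keeps keys and most string contents untouched while still catching values in objects and arrays.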
@@ -104,6 +304,59 @@ const StatPage = () => {
     };
   };

+  const getFilterKey = (params: Record<string, string>) =>
+    JSON.stringify(Object.entries(params).sort(([a], [b]) => a.localeCompare(b)));
+
+  const ensureFilteredRecords = async () => {
+    if (!datasetId) {
+      throw new Error("Missing dataset id.");
+    }
+
+    const authHeaders = getAuthHeaders();
+    if (!authHeaders) {
+      throw new Error("You must be signed in to load corpus records.");
+    }
+
+    const filterKey = getFilterKey(appliedFilters);
+    if (allRecords && allRecordsKey === filterKey) {
+      return allRecords;
+    }
+
+    const response = await axios.get<unknown>(
+      `${API_BASE_URL}/dataset/${datasetId}/all`,
+      {
+        params: appliedFilters,
+        headers: authHeaders,
+      },
+    );
+
+    const normalizedRecords = normalizeRecordPayload(response.data);
+
+    setAllRecords(normalizedRecords);
+    setAllRecordsKey(filterKey);
+    return normalizedRecords;
+  };
+
+  const openExplorer = async (spec: CorpusExplorerSpec) => {
+    setExplorerState(createExplorerState(spec, { loading: true }));
+
+    try {
+      const records = await ensureFilteredRecords();
+      const context = buildExplorerContext(records);
+      const matched = records
+        .filter((record) => spec.matcher(record, context))
+        .sort(compareRecordsByNewest);
+
+      setExplorerState(createExplorerState(spec, { records: matched }));
+    } catch (e) {
+      setExplorerState(
+        createExplorerState(spec, {
+          error: `Failed to load corpus records: ${String(e)}`,
+        }),
+      );
+    }
+  };
+
   const getStats = (params: Record<string, string> = {}) => {
     if (!datasetId) {
       setError("Missing dataset id. Open /dataset/<id>/stats.");
@@ -118,22 +371,20 @@ const StatPage = () => {

     setError("");
     setLoading(true);
+    setAppliedFilters(params);
+    setAllRecords(null);
+    setAllRecordsKey("");
+    setExplorerState((current) => ({ ...current, open: false }));

     Promise.all([
-      axios.get<TimeAnalysisResponse>(
-        `${API_BASE_URL}/dataset/${datasetId}/temporal`,
-        {
-          params,
-          headers: authHeaders,
-        },
-      ),
-      axios.get<UserEndpointResponse>(
-        `${API_BASE_URL}/dataset/${datasetId}/user`,
-        {
-          params,
-          headers: authHeaders,
-        },
-      ),
+      axios.get<TimeAnalysisResponse>(`${API_BASE_URL}/dataset/${datasetId}/temporal`, {
+        params,
+        headers: authHeaders,
+      }),
+      axios.get<UserEndpointResponse>(`${API_BASE_URL}/dataset/${datasetId}/user`, {
+        params,
+        headers: authHeaders,
+      }),
       axios.get<LinguisticAnalysisResponse>(
         `${API_BASE_URL}/dataset/${datasetId}/linguistic`,
         {
@@ -141,13 +392,10 @@ const StatPage = () => {
           headers: authHeaders,
         },
       ),
-      axios.get<EmotionalAnalysisResponse>(
-        `${API_BASE_URL}/dataset/${datasetId}/emotional`,
-        {
-          params,
-          headers: authHeaders,
-        },
-      ),
+      axios.get<EmotionalAnalysisResponse>(`${API_BASE_URL}/dataset/${datasetId}/emotional`, {
+        params,
+        headers: authHeaders,
+      }),
       axios.get<InteractionAnalysisResponse>(
         `${API_BASE_URL}/dataset/${datasetId}/interactional`,
         {
@@ -155,20 +403,14 @@ const StatPage = () => {
           headers: authHeaders,
         },
       ),
-      axios.get<SummaryResponse>(
-        `${API_BASE_URL}/dataset/${datasetId}/summary`,
-        {
-          params,
-          headers: authHeaders,
-        },
-      ),
-      axios.get<CulturalAnalysisResponse>(
-        `${API_BASE_URL}/dataset/${datasetId}/cultural`,
-        {
-          params,
-          headers: authHeaders,
-        },
-      ),
+      axios.get<SummaryResponse>(`${API_BASE_URL}/dataset/${datasetId}/summary`, {
+        params,
+        headers: authHeaders,
+      }),
+      axios.get<CulturalAnalysisResponse>(`${API_BASE_URL}/dataset/${datasetId}/cultural`, {
+        params,
+        headers: authHeaders,
+      }),
     ])
       .then(
         ([
@@ -182,8 +424,7 @@ const StatPage = () => {
         ]) => {
           const usersList = userRes.data.users ?? [];
           const topUsersList = userRes.data.top_users ?? [];
-          const interactionGraphRaw =
-            interactionRes.data?.interaction_graph ?? {};
+          const interactionGraphRaw = interactionRes.data?.interaction_graph ?? {};
           const topPairsRaw = interactionRes.data?.top_interaction_pairs ?? [];

           const filteredUsers: typeof usersList = [];
@@ -194,18 +435,14 @@ const StatPage = () => {

           const filteredTopUsers: typeof topUsersList = [];
           for (const user of topUsersList) {
-            if (isDeletedUser(user.author)) continue;
-            filteredTopUsers.push(user);
+            if (isDeletedUser(user.author)) continue;
+            filteredTopUsers.push(user);
           }

-          let mostCommentHeavyUser: UserStatsMeta["mostCommentHeavyUser"] =
-            null;
+          let mostCommentHeavyUser: UserStatsMeta["mostCommentHeavyUser"] = null;
           for (const user of filteredUsers) {
             const currentShare = user.comment_share ?? 0;
-            if (
-              !mostCommentHeavyUser ||
-              currentShare > mostCommentHeavyUser.commentShare
-            ) {
+            if (!mostCommentHeavyUser || currentShare > mostCommentHeavyUser.commentShare) {
               mostCommentHeavyUser = {
                 author: user.author,
                 commentShare: currentShare,
@@ -221,8 +458,7 @@ const StatPage = () => {
             }
           }

-          const filteredInteractionGraph: Record<string, Record<string, number>> =
-            {};
+          const filteredInteractionGraph: Record<string, Record<string, number>> = {};
           for (const [source, targets] of Object.entries(interactionGraphRaw)) {
             if (isDeletedUser(source)) {
               continue;
@@ -279,7 +515,7 @@ const StatPage = () => {
           setSummary(filteredSummary || null);
         },
       )
-      .catch((e) => setError("Failed to load statistics: " + String(e)))
+      .catch((e) => setError(`Failed to load statistics: ${String(e)}`))
       .finally(() => setLoading(false));
   };

@@ -302,6 +538,9 @@ const StatPage = () => {

   useEffect(() => {
     setError("");
+    setAllRecords(null);
+    setAllRecordsKey("");
+    setExplorerState(EMPTY_EXPLORER_STATE);
     if (!datasetId) {
       setError("Missing dataset id. Open /dataset/<id>/stats.");
       return;
@@ -398,9 +637,7 @@ const StatPage = () => {
         <button
           onClick={() => setActiveView("summary")}
           style={
-            activeView === "summary"
-              ? styles.buttonPrimary
-              : styles.buttonSecondary
+            activeView === "summary" ? styles.buttonPrimary : styles.buttonSecondary
           }
         >
           Summary
@@ -418,11 +655,7 @@ const StatPage = () => {

         <button
           onClick={() => setActiveView("user")}
-          style={
-            activeView === "user"
-              ? styles.buttonPrimary
-              : styles.buttonSecondary
-          }
+          style={activeView === "user" ? styles.buttonPrimary : styles.buttonSecondary}
         >
           Users
         </button>
@@ -449,9 +682,7 @@ const StatPage = () => {
         <button
           onClick={() => setActiveView("cultural")}
           style={
-            activeView === "cultural"
-              ? styles.buttonPrimary
-              : styles.buttonSecondary
+            activeView === "cultural" ? styles.buttonPrimary : styles.buttonSecondary
           }
         >
           Cultural
@@ -464,11 +695,12 @@ const StatPage = () => {
           timeData={timeData}
           linguisticData={linguisticData}
           summary={summary}
+          onExplore={openExplorer}
         />
       )}

       {activeView === "emotional" && emotionalData && (
-        <EmotionalStats emotionalData={emotionalData} />
+        <EmotionalStats emotionalData={emotionalData} onExplore={openExplorer} />
       )}

       {activeView === "emotional" && !emotionalData && (
@@ -483,6 +715,7 @@ const StatPage = () => {
           interactionGraph={interactionData.interaction_graph}
           totalUsers={userStatsMeta.totalUsers}
           mostCommentHeavyUser={userStatsMeta.mostCommentHeavyUser}
+          onExplore={openExplorer}
         />
       )}

@@ -493,7 +726,7 @@ const StatPage = () => {
       )}

       {activeView === "linguistic" && linguisticData && (
-        <LinguisticStats data={linguisticData} />
+        <LinguisticStats data={linguisticData} onExplore={openExplorer} />
       )}

       {activeView === "linguistic" && !linguisticData && (
|
||||
@@ -513,7 +746,7 @@ const StatPage = () => {
|
||||
)}
|
||||
|
||||
{activeView === "cultural" && culturalData && (
|
||||
<CulturalStats data={culturalData} />
|
||||
<CulturalStats data={culturalData} onExplore={openExplorer} />
|
||||
)}
|
||||
|
||||
{activeView === "cultural" && !culturalData && (
|
||||
@@ -521,6 +754,17 @@ const StatPage = () => {
|
||||
No cultural data available.
|
||||
</div>
|
||||
)}
|
||||
|
||||
<CorpusExplorer
|
||||
open={explorerState.open}
|
||||
onClose={() => setExplorerState((current) => ({ ...current, open: false }))}
|
||||
title={explorerState.title}
|
||||
description={explorerState.description}
|
||||
records={explorerState.records}
|
||||
loading={explorerState.loading}
|
||||
error={explorerState.error}
|
||||
emptyMessage={explorerState.emptyMessage}
|
||||
/>
|
||||
</div>
|
||||
);
|
||||
};
|
||||
|
||||
@@ -34,6 +34,11 @@ type Vocab = {
  top_words: FrequencyWord[];
};

type DominantTopic = {
  topic: string;
  count: number;
};

type User = {
  author: string;
  post: number;
@@ -41,6 +46,7 @@ type User = {
  comment_post_ratio: number;
  comment_share: number;
  avg_emotions?: Record<string, number>;
  dominant_topic?: DominantTopic | null;
  vocab?: Vocab | null;
};

@@ -162,6 +168,10 @@ type StanceMarkers = {
  certainty_per_1k_tokens: number;
  deontic_per_1k_tokens: number;
  permission_per_1k_tokens: number;
  hedge_emotion_avg?: Record<string, number>;
  certainty_emotion_avg?: Record<string, number>;
  deontic_emotion_avg?: Record<string, number>;
  permission_emotion_avg?: Record<string, number>;
};

type EntityEmotionAggregate = {
@@ -202,6 +212,7 @@ type FilterResponse = {

export type {
  TopUser,
  DominantTopic,
  Vocab,
  User,
  InteractionGraph,

371
frontend/src/utils/corpusExplorer.ts
Normal file
@@ -0,0 +1,371 @@
type EntityRecord = {
  text?: string;
  [key: string]: unknown;
};

type DatasetRecord = {
  id?: string | number;
  post_id?: string | number | null;
  parent_id?: string | number | null;
  author?: string | null;
  title?: string | null;
  content?: string | null;
  timestamp?: string | number | null;
  date?: string | null;
  dt?: string | null;
  hour?: number | null;
  weekday?: string | null;
  reply_to?: string | number | null;
  source?: string | null;
  topic?: string | null;
  topic_confidence?: number | null;
  type?: string | null;
  ner_entities?: EntityRecord[] | null;
  emotion_anger?: number | null;
  emotion_disgust?: number | null;
  emotion_fear?: number | null;
  emotion_joy?: number | null;
  emotion_sadness?: number | null;
  [key: string]: unknown;
};

type CorpusExplorerContext = {
  authorByPostId: Map<string, string>;
  authorEventCounts: Map<string, number>;
  authorCommentCounts: Map<string, number>;
};

type CorpusExplorerSpec = {
  title: string;
  description: string;
  emptyMessage?: string;
  matcher: (record: DatasetRecord, context: CorpusExplorerContext) => boolean;
};

const IN_GROUP_PATTERN = /\b(we|us|our|ourselves)\b/gi;
const OUT_GROUP_PATTERN = /\b(they|them|their|themselves)\b/gi;
const HEDGE_PATTERN = /\b(maybe|perhaps|possibly|probably|likely|seems|seem|i think|i feel|i guess|kind of|sort of|somewhat)\b/i;
const CERTAINTY_PATTERN = /\b(definitely|certainly|clearly|obviously|undeniably|always|never)\b/i;
const DEONTIC_PATTERN = /\b(must|should|need|needs|have to|has to|ought|required|require)\b/i;
const PERMISSION_PATTERN = /\b(can|allowed|okay|ok|permitted)\b/i;
const EMOTION_KEYS = [
  "emotion_anger",
  "emotion_disgust",
  "emotion_fear",
  "emotion_joy",
  "emotion_sadness",
] as const;

const toText = (value: unknown) => {
  if (typeof value === "string") {
    return value;
  }

  if (typeof value === "number" || typeof value === "boolean") {
    return String(value);
  }

  if (value && typeof value === "object" && "id" in value) {
    const id = (value as { id?: unknown }).id;
    if (typeof id === "string" || typeof id === "number") {
      return String(id);
    }
  }

  return "";
};

const normalize = (value: unknown) => toText(value).trim().toLowerCase();
const getAuthor = (record: DatasetRecord) => toText(record.author).trim();

const getRecordText = (record: DatasetRecord) =>
  `${record.title ?? ""} ${record.content ?? ""}`.trim();

const escapeRegExp = (value: string) =>
  value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const buildPhrasePattern = (phrase: string) => {
  const tokens = phrase
    .toLowerCase()
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map(escapeRegExp);

  if (!tokens.length) {
    return null;
  }

  return new RegExp(`\\b${tokens.join("\\s+")}\\b`, "i");
};

const countMatches = (pattern: RegExp, text: string) =>
  Array.from(text.matchAll(new RegExp(pattern.source, "gi"))).length;

const getDateBucket = (record: DatasetRecord) => {
  if (typeof record.date === "string" && record.date) {
    return record.date.slice(0, 10);
  }

  if (typeof record.dt === "string" && record.dt) {
    return record.dt.slice(0, 10);
  }

  if (typeof record.timestamp === "number") {
    return new Date(record.timestamp * 1000).toISOString().slice(0, 10);
  }

  if (typeof record.timestamp === "string" && record.timestamp) {
    const numeric = Number(record.timestamp);
    if (Number.isFinite(numeric)) {
      return new Date(numeric * 1000).toISOString().slice(0, 10);
    }
  }

  return "";
};

const getDominantEmotion = (record: DatasetRecord) => {
  let bestKey = "";
  let bestValue = Number.NEGATIVE_INFINITY;

  for (const key of EMOTION_KEYS) {
    const value = Number(record[key] ?? Number.NEGATIVE_INFINITY);
    if (value > bestValue) {
      bestValue = value;
      bestKey = key;
    }
  }

  return bestKey.replace("emotion_", "");
};

const matchesPhrase = (record: DatasetRecord, phrase: string) => {
  const pattern = buildPhrasePattern(phrase);
  if (!pattern) {
    return false;
  }

  return pattern.test(getRecordText(record));
};

const recordIdentityBucket = (record: DatasetRecord) => {
  const text = getRecordText(record);
  const inHits = countMatches(IN_GROUP_PATTERN, text);
  const outHits = countMatches(OUT_GROUP_PATTERN, text);

  if (inHits > outHits) {
    return "in";
  }

  if (outHits > inHits) {
    return "out";
  }

  return "tie";
};

const buildExplorerContext = (records: DatasetRecord[]): CorpusExplorerContext => {
  const authorByPostId = new Map<string, string>();
  const authorEventCounts = new Map<string, number>();
  const authorCommentCounts = new Map<string, number>();

  for (const record of records) {
    const author = getAuthor(record);
    if (!author) {
      continue;
    }

    authorEventCounts.set(author, (authorEventCounts.get(author) ?? 0) + 1);

    if (record.type === "comment") {
      authorCommentCounts.set(author, (authorCommentCounts.get(author) ?? 0) + 1);
    }

    if (record.post_id !== null && record.post_id !== undefined) {
      authorByPostId.set(String(record.post_id), author);
    }
  }

  return { authorByPostId, authorEventCounts, authorCommentCounts };
};

const buildAllRecordsSpec = (): CorpusExplorerSpec => ({
  title: "Corpus Explorer",
  description: "All records in the current filtered dataset.",
  emptyMessage: "No records match the current filters.",
  matcher: () => true,
});

const buildUserSpec = (author: string): CorpusExplorerSpec => {
  const target = normalize(author);

  return {
    title: `User: ${author}`,
    description: `All records authored by ${author}.`,
    emptyMessage: `No records found for ${author}.`,
    matcher: (record) => normalize(record.author) === target,
  };
};

const buildTopicSpec = (topic: string): CorpusExplorerSpec => {
  const target = normalize(topic);

  return {
    title: `Topic: ${topic}`,
    description: `Records assigned to the ${topic} topic bucket.`,
    emptyMessage: `No records found in the ${topic} topic bucket.`,
    matcher: (record) => normalize(record.topic) === target,
  };
};

const buildDateBucketSpec = (date: string): CorpusExplorerSpec => ({
  title: `Date Bucket: ${date}`,
  description: `Records from the ${date} activity bucket.`,
  emptyMessage: `No records found on ${date}.`,
  matcher: (record) => getDateBucket(record) === date,
});

const buildWordSpec = (word: string): CorpusExplorerSpec => ({
  title: `Word: ${word}`,
  description: `Records containing the word ${word}.`,
  emptyMessage: `No records mention ${word}.`,
  matcher: (record) => matchesPhrase(record, word),
});

const buildNgramSpec = (ngram: string): CorpusExplorerSpec => ({
  title: `N-gram: ${ngram}`,
  description: `Records containing the phrase ${ngram}.`,
  emptyMessage: `No records contain the phrase ${ngram}.`,
  matcher: (record) => matchesPhrase(record, ngram),
});

const buildEntitySpec = (entity: string): CorpusExplorerSpec => {
  const target = normalize(entity);

  return {
    title: `Entity: ${entity}`,
    description: `Records mentioning the ${entity} entity.`,
    emptyMessage: `No records found for the ${entity} entity.`,
    matcher: (record) => {
      const entities = Array.isArray(record.ner_entities) ? record.ner_entities : [];
      return entities.some((item) => normalize(item?.text) === target) || matchesPhrase(record, entity);
    },
  };
};

const buildSourceSpec = (source: string): CorpusExplorerSpec => {
  const target = normalize(source);

  return {
    title: `Source: ${source}`,
    description: `Records from the ${source} source.`,
    emptyMessage: `No records found for ${source}.`,
    matcher: (record) => normalize(record.source) === target,
  };
};

const buildDominantEmotionSpec = (emotion: string): CorpusExplorerSpec => {
  const target = normalize(emotion);

  return {
    title: `Dominant Emotion: ${emotion}`,
    description: `Records where ${emotion} is the strongest emotion score.`,
    emptyMessage: `No records found with dominant emotion ${emotion}.`,
    matcher: (record) => getDominantEmotion(record) === target,
  };
};

const buildReplyPairSpec = (source: string, target: string): CorpusExplorerSpec => {
  const sourceName = normalize(source);
  const targetName = normalize(target);

  return {
    title: `Reply Path: ${source} -> ${target}`,
    description: `Reply records authored by ${source} in response to ${target}.`,
    emptyMessage: `No reply records found for ${source} -> ${target}.`,
    matcher: (record, context) => {
      if (normalize(record.author) !== sourceName) {
        return false;
      }

      const replyTo = record.reply_to;
      if (replyTo === null || replyTo === undefined || replyTo === "") {
        return false;
      }

      return normalize(context.authorByPostId.get(String(replyTo))) === targetName;
    },
  };
};

const buildOneTimeUsersSpec = (): CorpusExplorerSpec => ({
  title: "One-Time Users",
  description: "Records written by authors who appear exactly once in the filtered corpus.",
  emptyMessage: "No one-time-user records found.",
  matcher: (record, context) => {
    const author = getAuthor(record);
    return !!author && context.authorEventCounts.get(author) === 1;
  },
});

const buildIdentityBucketSpec = (bucket: "in" | "out" | "tie"): CorpusExplorerSpec => {
  const labels = {
    in: "In-Group Posts",
    out: "Out-Group Posts",
    tie: "Balanced Posts",
  } as const;

  return {
    title: labels[bucket],
    description: `Records in the ${labels[bucket].toLowerCase()} cultural bucket.`,
    emptyMessage: `No records found for ${labels[bucket].toLowerCase()}.`,
    matcher: (record) => recordIdentityBucket(record) === bucket,
  };
};

const buildPatternSpec = (
  title: string,
  description: string,
  pattern: RegExp,
): CorpusExplorerSpec => ({
  title,
  description,
  emptyMessage: `No records found for ${title.toLowerCase()}.`,
  matcher: (record) => pattern.test(getRecordText(record)),
});

const buildHedgeSpec = () =>
  buildPatternSpec("Hedging Words", "Records containing hedging language.", HEDGE_PATTERN);

const buildCertaintySpec = () =>
  buildPatternSpec("Certainty Words", "Records containing certainty language.", CERTAINTY_PATTERN);

const buildDeonticSpec = () =>
  buildPatternSpec("Need/Should Words", "Records containing deontic language.", DEONTIC_PATTERN);

const buildPermissionSpec = () =>
  buildPatternSpec("Permission Words", "Records containing permission language.", PERMISSION_PATTERN);

export type { DatasetRecord, CorpusExplorerSpec };
export {
  buildAllRecordsSpec,
  buildCertaintySpec,
  buildDateBucketSpec,
  buildDeonticSpec,
  buildDominantEmotionSpec,
  buildEntitySpec,
  buildExplorerContext,
  buildHedgeSpec,
  buildIdentityBucketSpec,
  buildNgramSpec,
  buildOneTimeUsersSpec,
  buildPermissionSpec,
  buildReplyPairSpec,
  buildSourceSpec,
  buildTopicSpec,
  buildUserSpec,
  buildWordSpec,
  getDateBucket,
  toText,
};
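The `buildPhrasePattern` helper above escapes each token of a search phrase and joins the tokens with `\s+` inside word boundaries, so multi-word n-grams match across variable whitespace. The same logic rendered in Python for illustration (the project file itself is TypeScript; `build_phrase_pattern` is a hypothetical name, not part of the codebase):

```python
import re


def build_phrase_pattern(phrase):
    # Escape regex metacharacters in each token, then join the tokens with
    # \s+ and wrap the whole phrase in word boundaries, case-insensitively.
    tokens = [re.escape(token) for token in phrase.lower().split()]
    if not tokens:
        return None
    return re.compile(r"\b" + r"\s+".join(tokens) + r"\b", re.IGNORECASE)


# Extra whitespace in the query collapses to \s+, so it still matches.
pattern = build_phrase_pattern("cork  city")
```

The escaping step is what lets user-supplied phrases containing characters like `+` or `(` be matched literally rather than interpreted as regex syntax.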
@@ -3,7 +3,7 @@ const DEFAULT_TITLE = "Ethnograph View";
const STATIC_TITLES: Record<string, string> = {
  "/login": "Sign In",
  "/upload": "Upload Dataset",
  "/auto-scrape": "Auto Scrape Dataset",
  "/auto-fetch": "Auto Fetch Dataset",
  "/datasets": "My Datasets",
};

BIN
report/img/analysis_bar.png
Normal file
After Width: | Height: | Size: 26 KiB
BIN
report/img/cork_temporal.png
Normal file
After Width: | Height: | Size: 274 KiB
BIN
report/img/flooding_posts.png
Normal file
After Width: | Height: | Size: 90 KiB
BIN
report/img/frontend.png
Normal file
After Width: | Height: | Size: 302 KiB
BIN
report/img/gantt.png
Normal file
After Width: | Height: | Size: 50 KiB
BIN
report/img/heatmap.png
Normal file
After Width: | Height: | Size: 86 KiB
BIN
report/img/interaction_graph.png
Normal file
After Width: | Height: | Size: 114 KiB
BIN
report/img/kpi_card.png
Normal file
After Width: | Height: | Size: 8.7 KiB
BIN
report/img/moods.png
Normal file
After Width: | Height: | Size: 16 KiB
BIN
report/img/navbar.png
Normal file
After Width: | Height: | Size: 14 KiB
BIN
report/img/ngrams.png
Normal file
After Width: | Height: | Size: 38 KiB
BIN
report/img/nlp_backoff.png
Normal file
After Width: | Height: | Size: 143 KiB
BIN
report/img/pipeline.png
Normal file
After Width: | Height: | Size: 26 KiB
BIN
report/img/reddit_bot.png
Normal file
After Width: | Height: | Size: 232 KiB
BIN
report/img/signature.jpg
Normal file
After Width: | Height: | Size: 152 KiB
BIN
report/img/stance_markers.png
Normal file
After Width: | Height: | Size: 111 KiB
BIN
report/img/topic_emotions.png
Normal file
After Width: | Height: | Size: 17 KiB
BIN
report/img/ucc_crest.png
Normal file
After Width: | Height: | Size: 27 KiB
1399
report/main.tex
149
report/references.bib
Normal file
@@ -0,0 +1,149 @@
@online{reddit_api,
  author = {{Reddit Inc.}},
  title = {Reddit API Documentation},
  year = {2025},
  url = {https://www.reddit.com/dev/api/},
  urldate = {2026-04-08}
}

@misc{hartmann2022emotionenglish,
  author = {Hartmann, Jochen},
  title = {Emotion English DistilRoBERTa-base},
  year = {2022},
  howpublished = {\url{https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/}},
}

@misc{all_mpnet_base_v2,
  author = {Microsoft Research},
  title = {All-MPNet-Base-V2},
  year = {2021},
  howpublished = {\url{https://huggingface.co/sentence-transformers/all-mpnet-base-v2}},
}

@misc{minilm_l6_v2,
  author = {Microsoft Research},
  title = {MiniLM-L6-V2},
  year = {2021},
  howpublished = {\url{https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2}},
}

@misc{dslim_bert_base_ner,
  author = {deepset},
  title = {dslim/bert-base-NER},
  year = {2018},
  howpublished = {\url{https://huggingface.co/dslim/bert-base-NER}},
}

@inproceedings{demszky2020goemotions,
  author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
  title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
  year = {2020}
}

@article{dominguez2007virtual,
  author = {Domínguez, Daniel and Beaulieu, Anne and Estalella, Adolfo and Gómez, Edgar and Schnettler, Bernt and Read, Rosie},
  title = {Virtual Ethnography},
  journal = {Forum Qualitative Sozialforschung / Forum: Qualitative Social Research},
  year = {2007},
  volume = {8},
  number = {3},
  url = {http://nbn-resolving.de/urn:nbn:de:0114-fqs0703E19}
}

@article{sun2014lurkers,
  author = {Sun, Na and Rau, Pei-Luen Patrick and Ma, Liang},
  title = {Understanding Lurkers in Online Communities: A Literature Review},
  journal = {Computers in Human Behavior},
  year = {2014},
  volume = {38},
  pages = {110--117},
  doi = {10.1016/j.chb.2014.05.022}
}

@article{ahmad2024sentiment,
  author = {Ahmad, Waqar and others},
  title = {Recent Advancements and Challenges of NLP-based Sentiment Analysis: A State-of-the-art Review},
  journal = {Natural Language Processing Journal},
  year = {2024},
  doi = {10.1016/j.nlp.2024.100059}
}

@article{coleman2010ethnographic,
  ISSN = {00846570},
  URL = {http://www.jstor.org/stable/25735124},
  abstract = {This review surveys and divides the ethnographic corpus on digital media into three broad but overlapping categories: the cultural politics of digital media, the vernacular cultures of digital media, and the prosaics of digital media. Engaging these three categories of scholarship on digital media, I consider how ethnographers are exploring the complex relationships between the local practices and global implications of digital media, their materiality and politics, and their banal, as well as profound, presence in cultural life and modes of communication. I consider the way these media have become central to the articulation of cherished beliefs, ritual practices, and modes of being in the world; the fact that digital media culturally matters is undeniable but showing how, where, and why it matters is necessary to push against peculiarly narrow presumptions about the universality of digital experience.},
  author = {E. Gabriella Coleman},
  journal = {Annual Review of Anthropology},
  pages = {487--505},
  publisher = {Annual Reviews},
  title = {Ethnographic Approaches to Digital Media},
  urldate = {2026-04-15},
  volume = {39},
  year = {2010}
}

@article{shen2021stance,
  author = {Shen, Qian and Tao, Yating},
  title = {Stance Markers in {English} Medical Research Articles and Newspaper Opinion Columns: A Comparative Corpus-Based Study},
  journal = {PLOS ONE},
  volume = {16},
  number = {3},
  pages = {e0247981},
  year = {2021},
  doi = {10.1371/journal.pone.0247981}
}

@incollection{medvedev2019anatomy,
  author = {Medvedev, Alexey N. and Lambiotte, Renaud and Delvenne, Jean-Charles},
  title = {The Anatomy of Reddit: An Overview of Academic Research},
  booktitle = {Dynamics On and Of Complex Networks III},
  series = {Springer Proceedings in Complexity},
  publisher = {Springer},
  year = {2019},
  pages = {183--204}
}

@misc{cook2023ethnography,
  author = {Cook, Chloe},
  title = {What is the Difference Between Ethnography and Digital Ethnography?},
  year = {2023},
  month = jan,
  day = {19},
  howpublished = {\url{https://ethosapp.com/blog/what-is-the-difference-between-ethnography-and-digital-ethnography/}},
  note = {Accessed: 2026-04-16},
  organization = {EthOS}
}

@misc{giuffre2026sentiment,
  author = {Giuffre, Steven},
  title = {What is Sentiment Analysis?},
  year = {2026},
  month = mar,
  howpublished = {\url{https://www.vonage.com/resources/articles/sentiment-analysis/}},
  note = {Accessed: 2026-04-16},
  organization = {Vonage}
}

@misc{mungalpara2022stemming,
  author = {Mungalpara, Jaimin},
  title = {Stemming Lemmatization Stopwords and {N}-Grams in {NLP}},
  year = {2022},
  month = jul,
  day = {26},
  howpublished = {\url{https://jaimin-ml2001.medium.com/stemming-lemmatization-stopwords-and-n-grams-in-nlp-96f8e8b6aa6f}},
  note = {Accessed: 2026-04-16},
  organization = {Medium}
}

@misc{chugani2025ethicalscraping,
  author = {Chugani, Vinod},
  title = {Ethical Web Scraping: Principles and Practices},
  year = {2025},
  month = apr,
  day = {21},
  howpublished = {\url{https://www.datacamp.com/blog/ethical-web-scraping}},
  note = {Accessed: 2026-04-16},
  organization = {DataCamp}
}

@@ -16,3 +16,4 @@ Requests==2.32.5
sentence_transformers==5.2.2
torch==2.10.0
transformers==5.1.0
gunicorn==25.3.0

@@ -67,6 +67,12 @@ class CulturalAnalysis:

    def get_stance_markers(self, df: pd.DataFrame) -> dict[str, Any]:
        s = df[self.content_col].fillna("").astype(str)
        emotion_exclusions = {"emotion_neutral", "emotion_surprise"}
        emotion_cols = [
            c
            for c in df.columns
            if c.startswith("emotion_") and c not in emotion_exclusions
        ]

        hedge_pattern = re.compile(
            r"\b(maybe|perhaps|possibly|probably|likely|seems|seem|i think|i feel|i guess|kind of|sort of|somewhat)\b"
@@ -88,7 +94,7 @@ class CulturalAnalysis:
            0, 1
        )

        return {
        result = {
            "hedge_total": int(hedge_counts.sum()),
            "certainty_total": int(certainty_counts.sum()),
            "deontic_total": int(deontic_counts.sum()),
@@ -107,6 +113,32 @@ class CulturalAnalysis:
            ),
        }

        if emotion_cols:
            emo = df[emotion_cols].apply(pd.to_numeric, errors="coerce").fillna(0.0)

            result["hedge_emotion_avg"] = (
                emo.loc[hedge_counts > 0].mean()
                if (hedge_counts > 0).any()
                else pd.Series(0.0, index=emotion_cols)
            ).to_dict()
            result["certainty_emotion_avg"] = (
                emo.loc[certainty_counts > 0].mean()
                if (certainty_counts > 0).any()
                else pd.Series(0.0, index=emotion_cols)
            ).to_dict()
            result["deontic_emotion_avg"] = (
                emo.loc[deontic_counts > 0].mean()
                if (deontic_counts > 0).any()
                else pd.Series(0.0, index=emotion_cols)
            ).to_dict()
            result["permission_emotion_avg"] = (
                emo.loc[perm_counts > 0].mean()
                if (perm_counts > 0).any()
                else pd.Series(0.0, index=emotion_cols)
            ).to_dict()

        return result

    def get_avg_emotions_per_entity(
        self, df: pd.DataFrame, top_n: int = 25, min_posts: int = 10
    ) -> dict[str, Any]:

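The pattern added to `get_stance_markers` is: count regex hits per row, then average emotion scores only over the rows that contain at least one marker. A stdlib-only sketch of that same count-then-conditional-average idea (the hypothetical `rows` data and shortened word list are for illustration; the real code operates on pandas Series):

```python
import re

# Hypothetical rows standing in for the DataFrame columns used above.
rows = [
    {"content": "maybe it will flood", "emotion_fear": 0.8},
    {"content": "it will flood", "emotion_fear": 0.9},
    {"content": "i think perhaps not", "emotion_fear": 0.2},
]

# Shortened hedge list; the real pattern has many more alternatives.
hedge_pattern = re.compile(r"\b(maybe|perhaps|i think)\b")

# Count hedge hits per row, then average fear over hedging rows only.
hedge_counts = [len(hedge_pattern.findall(r["content"])) for r in rows]
hedging = [r["emotion_fear"] for r, c in zip(rows, hedge_counts) if c > 0]
hedge_fear_avg = sum(hedging) / len(hedging) if hedging else 0.0
```

The `else 0.0` branch mirrors the `pd.Series(0.0, ...)` fallback above: with no hedging rows the average is defined as zero rather than raising on an empty mean.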
@@ -1,17 +1,30 @@
import pandas as pd
import re

from collections import Counter
from itertools import islice
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class NGramConfig:
    min_token_length: int = 3
    min_count: int = 2
    max_results: int = 100


class LinguisticAnalysis:
    def __init__(self, word_exclusions: set[str]):
        self.word_exclusions = word_exclusions
        self.ngram_config = NGramConfig()

    def _tokenize(self, text: str):
        tokens = re.findall(r"\b[a-z]{3,}\b", text)
        return [t for t in tokens if t not in self.word_exclusions]
    def _tokenize(self, text: str, *, include_exclusions: bool = False) -> list[str]:
        pattern = rf"\b[a-z]{{{self.ngram_config.min_token_length},}}\b"
        tokens = re.findall(pattern, text)

        if include_exclusions:
            return tokens

        return [token for token in tokens if token not in self.word_exclusions]

    def _clean_text(self, text: str) -> str:
        text = re.sub(r"http\S+", "", text)  # remove URLs
@@ -21,13 +34,24 @@ class LinguisticAnalysis:
        text = re.sub(r"\S+\.(jpg|jpeg|png|webp|gif)", "", text)
        return text

    def _content_texts(self, df: pd.DataFrame) -> pd.Series:
        return df["content"].dropna().astype(str).apply(self._clean_text).str.lower()

    def _valid_ngram(self, tokens: tuple[str, ...]) -> bool:
        if any(token in self.word_exclusions for token in tokens):
            return False

        if len(set(tokens)) == 1:
            return False

        return True

    def word_frequencies(self, df: pd.DataFrame, limit: int = 100) -> list[dict]:
        texts = df["content"].dropna().astype(str).str.lower()
        texts = self._content_texts(df)

        words = []
        for text in texts:
            tokens = re.findall(r"\b[a-z]{3,}\b", text)
            words.extend(w for w in tokens if w not in self.word_exclusions)
            words.extend(self._tokenize(text))

        counts = Counter(words)

@@ -40,25 +64,39 @@ class LinguisticAnalysis:

        return word_frequencies.to_dict(orient="records")

    def ngrams(self, df: pd.DataFrame, n=2, limit=100):
        texts = df["content"].dropna().astype(str).apply(self._clean_text).str.lower()
    def ngrams(self, df: pd.DataFrame, n: int = 2, limit: int | None = None) -> list[dict]:
        if n < 2:
            raise ValueError("n must be at least 2")

        texts = self._content_texts(df)
        all_ngrams = []
        result_limit = limit or self.ngram_config.max_results

        for text in texts:
            tokens = re.findall(r"\b[a-z]{3,}\b", text)
            tokens = self._tokenize(text, include_exclusions=True)

            # stop word removal causes strange behaviors in ngrams
            # tokens = [w for w in tokens if w not in self.word_exclusions]
            if len(tokens) < n:
                continue

            ngrams = zip(*(islice(tokens, i, None) for i in range(n)))
            all_ngrams.extend([" ".join(ng) for ng in ngrams])
            for index in range(len(tokens) - n + 1):
                ngram_tokens = tuple(tokens[index : index + n])
                if self._valid_ngram(ngram_tokens):
                    all_ngrams.append(" ".join(ngram_tokens))

        counts = Counter(all_ngrams)
        filtered_counts = [
            (ngram, count)
            for ngram, count in counts.items()
            if count >= self.ngram_config.min_count
        ]

        if not filtered_counts:
            return []

        return (
            pd.DataFrame(counts.items(), columns=["ngram", "count"])
            .sort_values("count", ascending=False)
            .head(limit)
            pd.DataFrame(filtered_counts, columns=["ngram", "count"])
            .sort_values(["count", "ngram"], ascending=[False, True])
            .head(result_limit)
            .to_dict(orient="records")
        )

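The rewritten `ngrams` method replaces the `islice`/`zip` trick with an explicit sliding window, drops n-grams that are one repeated token, and keeps only counts above a minimum. The windowing logic can be sketched standalone (stdlib only; `count_ngrams` is an illustrative function, not the project API, and it omits the `word_exclusions` check that `_valid_ngram` also performs):

```python
import re
from collections import Counter


def count_ngrams(texts, n=2, min_count=2):
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\b[a-z]{3,}\b", text.lower())
        # Slide an n-wide window over the token list.
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i : i + n])
            if len(set(window)) > 1:  # skip e.g. ("very", "very")
                counts[" ".join(window)] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}
```

With `n=2` and the default `min_count=2`, `count_ngrams(["cork city flooding", "cork city centre"])` keeps only the bigram that recurs across both texts.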
@@ -1,4 +1,5 @@
|
||||
import nltk
|
||||
import json
|
||||
import pandas as pd
|
||||
from nltk.corpus import stopwords
|
||||
|
||||
@@ -27,6 +28,8 @@ DOMAIN_STOPWORDS = {
|
||||
"one",
|
||||
}
|
||||
|
||||
EXCLUDED_AUTHORS = {"[deleted]", "automoderator"}
|
||||
|
||||
nltk.download("stopwords")
|
||||
EXCLUDE_WORDS = set(stopwords.words("english")) | DOMAIN_STOPWORDS
|
||||
|
||||
@@ -46,6 +49,12 @@ class StatGen:
         filters = filters or {}
         filtered_df = df.copy()
 
+        if "author" in filtered_df.columns:
+            normalized_authors = (
+                filtered_df["author"].fillna("").astype(str).str.strip().str.lower()
+            )
+            filtered_df = filtered_df[~normalized_authors.isin(EXCLUDED_AUTHORS)]
+
         search_query = filters.get("search_query", None)
         start_date_filter = filters.get("start_date", None)
         end_date_filter = filters.get("end_date", None)
@@ -75,11 +84,22 @@ class StatGen:
 
         return filtered_df
 
+    def _json_ready_records(self, df: pd.DataFrame) -> list[dict]:
+        return json.loads(
+            df.to_json(orient="records", date_format="iso", date_unit="s")
+        )
+
     ## Public Methods
     def filter_dataset(self, df: pd.DataFrame, filters: dict | None = None) -> list[dict]:
-        return self._prepare_filtered_df(df, filters).to_dict(orient="records")
+        filtered_df = self._prepare_filtered_df(df, filters)
+        return self._json_ready_records(filtered_df)
 
-    def temporal(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def temporal(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return {
@@ -87,7 +107,12 @@ class StatGen:
             "weekday_hour_heatmap": self.temporal_analysis.heatmap(filtered_df),
         }
 
-    def linguistic(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def linguistic(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return {
@@ -97,7 +122,12 @@ class StatGen:
             "lexical_diversity": self.linguistic_analysis.lexical_diversity(filtered_df)
         }
 
-    def emotional(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def emotional(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return {
@@ -107,7 +137,12 @@ class StatGen:
             "emotion_by_source": self.emotional_analysis.emotion_by_source(filtered_df)
         }
 
-    def user(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def user(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return {
@@ -115,7 +150,12 @@ class StatGen:
             "users": self.user_analysis.per_user_analysis(filtered_df)
         }
 
-    def interactional(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def interactional(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return {
@@ -124,7 +164,12 @@ class StatGen:
             "conversation_concentration": self.interaction_analysis.conversation_concentration(filtered_df)
         }
 
-    def cultural(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def cultural(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return {
@@ -133,7 +178,12 @@ class StatGen:
             "avg_emotion_per_entity": self.cultural_analysis.get_avg_emotions_per_entity(filtered_df)
         }
 
-    def summary(self, df: pd.DataFrame, filters: dict | None = None) -> dict:
+    def summary(
+        self,
+        df: pd.DataFrame,
+        filters: dict | None = None,
+        dataset_id: int | None = None,
+    ) -> dict:
         filtered_df = self._prepare_filtered_df(df, filters)
 
         return self.summary_analysis.summary(filtered_df)
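The new `_json_ready_records` helper round-trips records through `DataFrame.to_json` so timestamp columns become ISO-8601 strings before they reach `jsonify`, rather than raw `Timestamp` objects. A stdlib-only sketch of the same idea (the helper name is illustrative):

```python
import json
from datetime import datetime, timezone

def json_ready(records):
    # Round-trip through the json module with a default handler so
    # datetime values become ISO-8601 strings — the same effect as
    # DataFrame.to_json(date_format="iso") for timestamp columns.
    return json.loads(json.dumps(records, default=lambda v: v.isoformat()))

rows = [{"author": "alice", "dt": datetime(2024, 5, 1, tzinfo=timezone.utc)}]
print(json_ready(rows))
```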
@@ -71,6 +71,7 @@ class UserAnalysis:
         per_user = df.groupby(["author", "type"]).size().unstack(fill_value=0)
 
         emotion_cols = [col for col in df.columns if col.startswith("emotion_")]
+        dominant_topic_by_author = {}
 
         avg_emotions_by_author = {}
         if emotion_cols:
@@ -80,6 +81,31 @@ class UserAnalysis:
                 for author, row in avg_emotions.iterrows()
             }
 
+        if "topic" in df.columns:
+            topic_df = df[
+                df["topic"].notna()
+                & (df["topic"] != "")
+                & (df["topic"] != "Misc")
+            ]
+            if not topic_df.empty:
+                topic_counts = (
+                    topic_df.groupby(["author", "topic"])
+                    .size()
+                    .reset_index(name="count")
+                    .sort_values(
+                        ["author", "count", "topic"],
+                        ascending=[True, False, True],
+                    )
+                    .drop_duplicates(subset=["author"])
+                )
+                dominant_topic_by_author = {
+                    row["author"]: {
+                        "topic": row["topic"],
+                        "count": int(row["count"]),
+                    }
+                    for _, row in topic_counts.iterrows()
+                }
+
         # ensure columns always exist
         for col in ("post", "comment"):
             if col not in per_user.columns:
@@ -109,6 +135,7 @@ class UserAnalysis:
                 "comment_post_ratio": float(row.get("comment_post_ratio", 0)),
                 "comment_share": float(row.get("comment_share", 0)),
                 "avg_emotions": avg_emotions_by_author.get(author, {}),
+                "dominant_topic": dominant_topic_by_author.get(author),
                 "vocab": vocab_by_author.get(
                     author,
                     {
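The dominant-topic computation above groups by (author, topic), sorts by count descending with an alphabetical tie-break, and keeps the first row per author. Equivalent logic in plain Python (function and constant names are illustrative):

```python
from collections import Counter, defaultdict

SKIP_TOPICS = {None, "", "Misc"}

def dominant_topics(rows):
    # Tally topics per author, ignoring missing/empty/"Misc" topics.
    per_author = defaultdict(Counter)
    for row in rows:
        if row.get("topic") not in SKIP_TOPICS:
            per_author[row["author"]][row["topic"]] += 1
    # Highest count wins; ties break alphabetically on topic name,
    # so repeated runs give the same answer.
    return {
        author: min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        for author, counts in per_author.items()
    }

rows = [
    {"author": "a", "topic": "Housing"},
    {"author": "a", "topic": "Sport"},
    {"author": "a", "topic": "Housing"},
    {"author": "b", "topic": "Misc"},
]
print(dominant_topics(rows))
```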
@@ -152,9 +152,9 @@ def get_dataset_sources():
     return jsonify(list_metadata)
 
 
-@app.route("/datasets/scrape", methods=["POST"])
+@app.route("/datasets/fetch", methods=["POST"])
 @jwt_required()
-def scrape_data():
+def fetch_data():
     data = request.get_json()
     connector_metadata = get_connector_metadata()
 
@@ -424,7 +424,7 @@ def get_linguistic_analysis(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.linguistic(dataset_content, filters)), 200
+        return jsonify(stat_gen.linguistic(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -448,7 +448,7 @@ def get_emotional_analysis(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.emotional(dataset_content, filters)), 200
+        return jsonify(stat_gen.emotional(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -472,7 +472,7 @@ def get_summary(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.summary(dataset_content, filters)), 200
+        return jsonify(stat_gen.summary(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -496,7 +496,7 @@ def get_temporal_analysis(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.temporal(dataset_content, filters)), 200
+        return jsonify(stat_gen.temporal(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -520,7 +520,7 @@ def get_user_analysis(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.user(dataset_content, filters)), 200
+        return jsonify(stat_gen.user(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -544,7 +544,7 @@ def get_cultural_analysis(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.cultural(dataset_content, filters)), 200
+        return jsonify(stat_gen.cultural(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -568,7 +568,7 @@ def get_interaction_analysis(dataset_id):
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
         filters = get_request_filters()
-        return jsonify(stat_gen.interactional(dataset_content, filters)), 200
+        return jsonify(stat_gen.interactional(dataset_content, filters, dataset_id=dataset_id)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -591,7 +591,8 @@ def get_full_dataset(dataset_id: int):
         )
 
         dataset_content = dataset_manager.get_dataset_content(dataset_id)
-        return jsonify(dataset_content.to_dict(orient="records")), 200
+        filters = get_request_filters()
+        return jsonify(stat_gen.filter_dataset(dataset_content, filters)), 200
     except NotAuthorisedException:
         return jsonify({"error": "User is not authorised to access this content"}), 403
     except NonExistentDatasetException:
@@ -1,21 +1,18 @@
 from abc import ABC, abstractmethod
 from dto.post import Post
+import os
 
 
 class BaseConnector(ABC):
     # Each subclass declares these at the class level
-    source_name: str  # machine-readable: "reddit", "youtube"
-    display_name: str  # human-readable: "Reddit", "YouTube"
-    required_env: list[str] = []  # env vars needed to activate
+    source_name: str  # machine readable
+    display_name: str  # human readable
+    required_env: list[str] = []
 
     search_enabled: bool
     categories_enabled: bool
 
     @classmethod
     def is_available(cls) -> bool:
         """Returns True if all required env vars are set."""
-        import os
-
         return all(os.getenv(var) for var in cls.required_env)
 
     @abstractmethod
@@ -11,8 +11,7 @@ from server.connectors.base import BaseConnector
 
 logger = logging.getLogger(__name__)
 
-HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ForumScraper/1.0)"}
-
+HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; Digital-Ethnography-Aid/1.0)"}
 
 class BoardsAPI(BaseConnector):
     source_name: str = "boards.ie"
@@ -88,7 +87,7 @@ class BoardsAPI(BaseConnector):
             post = self._parse_thread(html, post_url)
             return post
 
-        with ThreadPoolExecutor(max_workers=30) as executor:
+        with ThreadPoolExecutor(max_workers=5) as executor:
             futures = {executor.submit(fetch_and_parse, url): url for url in urls}
 
             for i, future in enumerate(as_completed(futures)):
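Dropping `max_workers` from 30 to 5 throttles concurrent requests to Boards.ie. The surrounding code uses the standard `submit`/`as_completed` idiom from `concurrent.futures`, which looks roughly like this (the fetch function and URLs are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_and_parse(url):
    # Stand-in for the real HTTP fetch + HTML parse.
    return f"parsed:{url}"

urls = [f"https://example.com/thread/{i}" for i in range(8)]

results = {}
with ThreadPoolExecutor(max_workers=5) as executor:
    # Map each future back to its URL so failures can be attributed.
    futures = {executor.submit(fetch_and_parse, url): url for url in urls}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(len(results))  # 8
```

`as_completed` yields futures in completion order, so slow threads never block the processing of finished ones.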
@@ -232,7 +232,11 @@ class RedditAPI(BaseConnector):
             )
 
             if response.status_code == 429:
-                wait_time = response.headers.get("X-Ratelimit-Reset", backoff)
+                try:
+                    wait_time = int(response.headers.get("X-Ratelimit-Reset", backoff))
+                    wait_time += 1  # Add a small buffer to ensure the rate limit has reset
+                except ValueError:
+                    wait_time = backoff
 
                 logger.warning(
                     f"Rate limited by Reddit API. Retrying in {wait_time} seconds..."
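The new 429 handling parses `X-Ratelimit-Reset` defensively, falling back to the existing backoff when the header is missing or malformed. The same logic as a small testable helper (the function name is illustrative):

```python
def retry_wait(headers, backoff):
    # Reddit's X-Ratelimit-Reset header should be an integer number
    # of seconds, but defend against a malformed value; a missing
    # header falls through to the default backoff.
    try:
        wait = int(headers.get("X-Ratelimit-Reset", backoff))
        return wait + 1  # small buffer past the reset boundary
    except ValueError:
        return backoff

print(retry_wait({"X-Ratelimit-Reset": "30"}, backoff=5))  # 31
print(retry_wait({"X-Ratelimit-Reset": "soon"}, backoff=5))  # 5
print(retry_wait({}, backoff=5))  # 6
```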
@@ -1,5 +1,6 @@
 import os
+import datetime
 import logging
 
 from dotenv import load_dotenv
 from googleapiclient.discovery import build
@@ -9,9 +10,11 @@ from dto.comment import Comment
 from server.connectors.base import BaseConnector
 
 load_dotenv()
 
 API_KEY = os.getenv("YOUTUBE_API_KEY")
 
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+
 
 class YouTubeAPI(BaseConnector):
     source_name: str = "youtube"
@@ -77,11 +80,30 @@ class YouTubeAPI(BaseConnector):
         return True
 
     def _search_videos(self, query, limit):
-        request = self.youtube.search().list(
-            q=query, part="snippet", type="video", maxResults=limit
-        )
-        response = request.execute()
-        return response.get("items", [])
+        results = []
+        next_page_token = None
+
+        while len(results) < limit:
+            batch_size = min(50, limit - len(results))
+
+            request = self.youtube.search().list(
+                q=query,
+                part="snippet",
+                type="video",
+                maxResults=batch_size,
+                pageToken=next_page_token,
+            )
+
+            response = request.execute()
+            results.extend(response.get("items", []))
+            logger.info(f"Fetched {len(results)} out of {limit} videos for query '{query}'")
+
+            next_page_token = response.get("nextPageToken")
+            if not next_page_token:
+                logger.warning(f"No more pages of results available for query '{query}'")
+                break
+
+        return results[:limit]
 
     def _get_video_comments(self, video_id):
         request = self.youtube.commentThreads().list(
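The rewritten `_search_videos` works around the YouTube Data API's 50-results-per-page cap by following `nextPageToken` until the limit is reached or pages run out. The pagination pattern, sketched against a fake page-serving function (all names here are stand-ins, not the project's):

```python
def paginate(fetch_page, limit, page_size=50):
    # fetch_page(batch_size, token) -> (items, next_token)
    results, token = [], None
    while len(results) < limit:
        batch = min(page_size, limit - len(results))
        items, token = fetch_page(batch, token)
        results.extend(items)
        if not token:  # no more pages available
            break
    return results[:limit]

# Fake API: 120 items served in pages; the token is just an offset.
DATA = list(range(120))

def fake_fetch(batch, token):
    start = token or 0
    end = min(start + batch, len(DATA))
    return DATA[start:end], (end if end < len(DATA) else None)

print(len(paginate(fake_fetch, limit=75)))  # 75
```

Note the two exit conditions: the limit being reached, and the source running dry before the limit.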
@@ -26,7 +26,34 @@ class DatasetManager:
     def get_dataset_content(self, dataset_id: int) -> pd.DataFrame:
         query = "SELECT * FROM events WHERE dataset_id = %s"
         result = self.db.execute(query, (dataset_id,), fetch=True)
-        return pd.DataFrame(result)
+        df = pd.DataFrame(result)
+        if df.empty:
+            return df
+
+        dedupe_columns = [
+            column
+            for column in [
+                "post_id",
+                "parent_id",
+                "reply_to",
+                "author",
+                "type",
+                "timestamp",
+                "dt",
+                "title",
+                "content",
+                "source",
+                "topic",
+            ]
+            if column in df.columns
+        ]
+
+        if dedupe_columns:
+            df = df.drop_duplicates(subset=dedupe_columns, keep="first")
+        else:
+            df = df.drop_duplicates(keep="first")
+
+        return df.reset_index(drop=True)
 
     def get_dataset_info(self, dataset_id: int) -> dict:
         query = "SELECT * FROM datasets WHERE id = %s"
@@ -52,6 +79,16 @@ class DatasetManager:
         if event_data.empty:
             return
 
+        dedupe_columns = [
+            column for column in ["id", "type", "source"] if column in event_data.columns
+        ]
+        if dedupe_columns:
+            event_data = event_data.drop_duplicates(subset=dedupe_columns, keep="first")
+        else:
+            event_data = event_data.drop_duplicates(keep="first")
+
+        self.delete_dataset_content(dataset_id)
+
        query = """
            INSERT INTO events (
                dataset_id,
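Both dedupe blocks key on whichever of the candidate columns are actually present, keeping the first occurrence. The same keep-first semantics as `DataFrame.drop_duplicates(subset=..., keep="first")`, sketched in plain Python (function and variable names are illustrative):

```python
def drop_duplicates(rows, subset):
    # Keep the first row for each distinct combination of the
    # subset columns; only columns actually present are used as keys.
    keys_present = [k for k in subset if rows and k in rows[0]]
    seen, unique = set(), []
    for row in rows:
        key = tuple(row.get(k) for k in keys_present)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"id": 1, "type": "post", "source": "reddit"},
    {"id": 1, "type": "post", "source": "reddit"},
    {"id": 1, "type": "comment", "source": "reddit"},
]
print(len(drop_duplicates(rows, ["id", "type", "source"])))  # 2
```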
@@ -1,3 +1,5 @@
+from time import time
+
 import pandas as pd
 import logging
 
@@ -46,6 +48,7 @@ def fetch_and_process_dataset(
 
     try:
         for metadata in source_info:
+            fetch_start = time()
             name = metadata["name"]
             search = metadata.get("search")
             category = metadata.get("category")
@@ -57,8 +60,11 @@ def fetch_and_process_dataset(
             )
             posts.extend(post.to_dict() for post in raw_posts)
 
+        fetch_time = time() - fetch_start
         df = pd.DataFrame(posts)
 
+        nlp_start = time()
+
         dataset_manager.set_dataset_status(
             dataset_id, "processing", "NLP Processing Started"
         )
@@ -66,9 +72,11 @@ def fetch_and_process_dataset(
         processor = DatasetEnrichment(df, topics)
         enriched_df = processor.enrich()
 
+        nlp_time = time() - nlp_start
+
         dataset_manager.save_dataset_content(dataset_id, enriched_df)
         dataset_manager.set_dataset_status(
-            dataset_id, "complete", "NLP Processing Completed Successfully"
+            dataset_id, "complete", f"Completed Successfully. Fetch time: {fetch_time:.2f}s, NLP time: {nlp_time:.2f}s"
         )
     except Exception as e:
         dataset_manager.set_dataset_status(