Compare commits


239 Commits

Author SHA1 Message Date
5970f555fa docs(readme): update readme 2026-04-19 13:54:09 +01:00
9b7a51ff33 docs(report): add Declaration of Originality and Acknowledgements sections 2026-04-18 22:10:16 +01:00
2d39ea6e66 refactor(connector): clean up comments 2026-04-18 22:10:03 +01:00
c1e5482f55 docs(report): fix typos 2026-04-18 16:09:22 +01:00
b2d7f6edaf docs(report): add visualizations and emotional analysis for Cork dataset 2026-04-18 15:44:04 +01:00
10efa664df docs(report): fix typos and add more eval 2026-04-17 20:31:39 +01:00
3db7c1d3ae docs(report): add future work section 2026-04-16 16:54:18 +01:00
72e17e900e fix(report): correct typos 2026-04-16 16:41:27 +01:00
7b9a17f395 fix(connector): reduce ThreadPoolExecutor max_workers 2026-04-16 16:37:27 +01:00
0a396dd504 docs(report): add more citations 2026-04-16 16:23:36 +01:00
c6e8144116 docs(report): add traditionl vs digital ethnography reference 2026-04-16 16:08:59 +01:00
760d2daf7f docs(report): remove redundant phrasing 2026-04-16 15:59:24 +01:00
ca38b992eb build(docker): switch backend flask deployment to Gunicorn 2026-04-15 17:57:22 +01:00
ee9c7b4ab2 docs(report): finish evaluation & reflection 2026-04-15 17:52:54 +01:00
703a7c435c fix(youtube_api): video search capped at 50 2026-04-14 17:54:43 +01:00
02ba727d05 chore(connector): add buffer to ratelimit reset 2026-04-14 17:41:09 +01:00
76591bc89e feat(tasks): add fetch and NLP processing time logging to dataset status 2026-04-14 17:35:43 +01:00
e35e51d295 fix(reddit_api): handle rate limit wait time conversion error 2026-04-14 17:35:21 +01:00
d2fe637743 docs: update references for digital ethnography and further work on evaluation 2026-04-14 15:16:56 +01:00
e1831aab7d docs(report): add researcher feedback 2026-04-13 22:00:41 +01:00
a3ef5a5655 chore: add more defaults to example env 2026-04-13 22:00:19 +01:00
5f943ce733 Merge pull request 'Corpus Explorer Feature' (#11) from feat/corpus-explorer into main
Reviewed-on: #11
2026-04-13 19:02:45 +01:00
9964a919c3 docs(report): enhance frontend design section 2026-04-13 19:01:51 +01:00
c11434344a refactor: streamline CorpusExplorer components 2026-04-13 17:06:46 +01:00
bc356848ef docs(report): start frontend section 2026-04-13 16:43:20 +01:00
047427432f docs(report): add summary section for dataset overview and update authentication manager details 2026-04-13 12:24:43 +01:00
d0d02e9ebf docs(report): add stance markers image and update related sections 2026-04-12 16:15:18 +01:00
68342606e3 docs(report): add NLP backoff diagram and update references for NER model 2026-04-11 15:24:57 +01:00
afae7f42a1 docs(report): add data pipeline diagram and update references for embedding models 2026-04-11 15:03:24 +01:00
4dd2721e98 Merge remote-tracking branch 'origin/main' into feat/corpus-explorer 2026-04-10 13:19:17 +01:00
99afe82464 docs(report): refine emotional classification model details 2026-04-10 13:17:11 +01:00
8c44df94c0 docs(report): update references for emotion classification models and NLP techniques 2026-04-09 19:01:21 +01:00
42905cc547 docs(report): add connector implementation & design NLP docs 2026-04-08 20:39:51 +01:00
ec64551881 fix(connectors): update User-Agent header for BoardsAPI 2026-04-08 19:34:30 +01:00
e274b8295a docs(report): add citations and start implementation section 2026-04-08 17:28:41 +01:00
3df6776111 docs(report): add decision tradeoff decisions 2026-04-07 18:04:25 +01:00
a347869353 docs(report): add more justification for ethnographic endpoints 2026-04-07 15:22:47 +01:00
8b4e13702e docs(report): add ucc crest to title page 2026-04-07 12:55:01 +01:00
8fa4f3fbdf refactor(report): move data pipeline above ethnographic analysis 2026-04-07 12:52:48 +01:00
c6cae040f0 feat(analysis): add emotional averages to stance markers 2026-04-07 12:49:18 +01:00
addc1d4087 docs(report): add justification at each stage 2026-04-07 12:17:02 +01:00
225133a074 docs(report): add ethnographic analysis section 2026-04-07 11:54:57 +01:00
e903e1b738 feat(user): add dominant topic information to user data 2026-04-07 11:34:03 +01:00
0c4dc02852 docs(report): add ethnographic analysis section 2026-04-06 19:39:09 +01:00
33e4291def docs(report): add table of contents 2026-04-06 19:34:38 +01:00
cedbce128e docs(report): add auto-fetch section 2026-04-06 19:32:49 +01:00
107dae0e95 docs(report): add data storage section 2026-04-06 19:26:10 +01:00
23833e2c5b docs(report): add custom topic section 2026-04-06 18:47:29 +01:00
f2b6917f1f docs(report); add data ingestion section 2026-04-06 12:44:17 +01:00
b57a8d3c65 docs(report): add data pipeline and connector sections
Also moved requirements to the end of design, where it is more appropriately placed. Requirements can be specified after discussing potential pitfalls.
2026-04-04 14:36:52 +01:00
ac65e26eab docs(report): add ethics section 2026-04-04 13:52:56 +01:00
6efa75dfe6 chore(connectors): reduce aggressive parallel connections to boards.ie 2026-04-04 12:33:06 +01:00
de61e7653f perf(connector): add reddit API authentication to speed up fetching
This aligns better with ethics and massively increases rate limits.
2026-04-04 12:26:54 +01:00
98aa04256b fix(reddit_api): fix reddit ratelimit check 2026-04-04 10:20:48 +01:00
5f81c51979 docs(report): add scalability constraints 2026-04-03 20:06:19 +01:00
361b532766 docs(analysis): add feasability analysis 2026-04-03 20:02:22 +01:00
9ef96661fc report(analysis): update structure & add justifications 2026-04-03 18:35:08 +01:00
9375abded5 docs(design): add docker & async processing sections 2026-04-03 17:59:01 +01:00
74ecdf238a docs: add database schema diagram 2026-04-02 19:30:20 +01:00
b85987e179 docs: add system architecture diagram 2026-04-02 18:59:32 +01:00
37d08c63b8 chore: rename auto-scraper to auto-fetcher
Improves the perception of ethics
2026-04-01 09:50:53 +01:00
1482e96051 feat(datasets): implement deduplication of dataset records in get_dataset_content 2026-04-01 09:06:07 +01:00
cd6030a760 fix(ngrams): remove stop words from ngrams 2026-04-01 08:44:47 +01:00
6378015726 fix(stats): remove duplicated entries in corpus explorer 2026-04-01 00:22:29 +01:00
430793cd09 feat(frontend): add "show more" functionality to corpus explorer 2026-04-01 00:09:20 +01:00
b270ed03ae feat(frontend): implement corpus explorer
This allows you to view the posts & comments associated with a specific aggregate.
2026-04-01 00:04:25 +01:00
1dde5f7b08 fix(nlp): fix missing processing dataset status update 2026-03-31 20:59:09 +01:00
a841c6f6a1 perf(stats): memoize derived state and reduce intermediate allocations 2026-03-31 20:15:07 +01:00
2045ccebb5 build(docker): update CMD to include host binding 2026-03-31 19:31:58 +01:00
efb4c8384d chore(stats): remove average_thread_depth 2026-03-31 16:40:54 +01:00
75fd042d74 feat(api): add support for custom topic lists when autoscraping 2026-03-31 13:36:37 +01:00
e776ef53ac refactor(database): configurable database source 2026-03-29 21:30:18 +01:00
f996b38fa5 fix(report): remove unicode char 2026-03-25 19:46:29 +00:00
6d8ae3e811 docs: add section on Topic Modelling in NLP 2026-03-25 19:44:14 +00:00
376773a0cc style: run python linter & prettifier on backend code 2026-03-25 19:34:43 +00:00
aae10c4d9d style: run prettifier plugin on entire frontend 2026-03-25 19:30:21 +00:00
8730af146d chore: remove main.py
Not used anymore.
2026-03-22 14:41:47 +00:00
7716ee0bff build(env): extract Redis URL into env file
This could allow one to connect to a remote Redis instance with a powerful GPU, allowing one to offload the NLP work.
2026-03-22 14:41:15 +00:00
97e897c240 fix(analysis): broken entity handling in cultural endpoint 2026-03-22 14:34:05 +00:00
c3762f189c build(docker): comment out GPU deployment configuration from worker service
While this works for NVIDIA GPUs, it breaks on a MacBook or any non-NVIDIA machine. I commented it out because it's still useful on these machines.
2026-03-22 13:34:51 +00:00
078716754c feat(report): add main.tex for project documentation and analysis 2026-03-21 23:54:42 +00:00
e43eae5afd fix(frontend): missing "fetching" status from auto-scrape
When auto-scraping, the dataset status page would say "Dataset Ready" when it was still fetching.
2026-03-21 22:49:16 +00:00
b537b5ef16 docs: update .gitignore 2026-03-21 19:24:51 +00:00
acc591ff1e Merge pull request 'Finish off the links between frontend and backend' (#10) from feat/add-frontend-pages into main
Reviewed-on: #10
2026-03-18 20:30:19 +00:00
e054997bb1 feat(frontend): reword CulturalStats to improve understandability 2026-03-18 19:23:35 +00:00
e5414befa7 feat(frontend): add dominant emotion display to UserModal 2026-03-18 19:12:25 +00:00
86926898ce feat(frontend): improve labels to be more understandable 2026-03-18 19:12:11 +00:00
b1177540a1 feat(frontend): enhance EmotionalStats component with detailed mood analysis 2026-03-18 19:11:18 +00:00
f604fcc531 feat(frontend): add warning message for scraping limits 2026-03-18 19:02:11 +00:00
b7aec2b0ea feat(frontend): add favicon
Credit goes to `srip` on flaticon for the image.
2026-03-18 19:00:31 +00:00
1446dd176d feat(frontend): center page selection 2026-03-18 18:53:14 +00:00
c215024ef2 feat(frontend): add deleted user filter
Reddit often contains "[Deleted]" when a user is banned or deletes their post/comment. Keeping the backend faithful to the original dataset is important so the filtering is being done on the frontend.
2026-03-18 18:50:51 +00:00
17ef42e548 feat!(frontend): add cultural, interactional and linguistic stat pages 2026-03-18 18:43:49 +00:00
7e4a91bb5e style(frontend): style api types to be in order of the endpoint 2026-03-18 18:40:39 +00:00
436549641f chore(frontend): add api types for new backend data 2026-03-18 18:37:39 +00:00
3e78a54388 feat(stat): add conversation concentration metric
Remove old `initiator_ratio` metric, which wasn't working due to every event having a `reply_to` value.

This metric was suggested by AI, and is a surprisingly interesting one that gave interesting insights.
2026-03-18 18:36:09 +00:00
71998c450e fix(db): change title type to text
Occasionally a Reddit post would have a long title, and would break in the schema.
2026-03-17 19:49:03 +00:00
2a00384a55 feat(interaction): add top interaction pairs and initiator ratio methods 2026-03-17 19:03:56 +00:00
8372aa7278 feat(api): add endpoint to view entire dataset 2026-03-17 13:36:41 +00:00
7b5a939271 fix(stats): missing private methods in User obj 2026-03-17 13:36:10 +00:00
2fa1dff4b7 feat(stat): add lexical diversity stat 2026-03-17 13:27:49 +00:00
31fb275ee3 fix(db): incorrect NER column being inserted 2026-03-17 12:53:30 +00:00
8a0f6e71e8 chore(api): rename cultural entity emotion endpoint 2026-03-17 12:31:53 +00:00
9093059d05 refactor(stats): move user stats out of interactional into users 2026-03-17 12:23:03 +00:00
8a13444b16 chore(frontend): add new API types 2026-03-16 16:46:07 +00:00
3468fdc2ea feat(api): add new user and linguistic endpoints 2026-03-16 16:45:11 +00:00
09a4f9036f refactor(stats): add summary and user stat classes for consistency 2026-03-16 16:43:24 +00:00
97fccd073b feat(emotional): add average emotion & dominant emotion stats 2026-03-16 16:41:28 +00:00
94befb61c5 Merge pull request 'Automatic Scraping of dataset options' (#9) from feat/automatic-scraping-datasets into main
Reviewed-on: #9
2026-03-14 21:58:49 +00:00
12f5953146 fix(api): remove error exceptions in API responses
Mainly a security thing, we don't want actual code errors being given in the API response, as someone could find out how the inner workings of the code behaves.
2026-03-14 21:58:00 +00:00
5b0441c34b fix(connector): unnecessary comment limits
In addition, I made some methods private to better align with the BaseConnector parent class.
2026-03-14 21:53:13 +00:00
d2b919cd66 fix(api): enforce integer limit and cap at 1000 in scrape_data function 2026-03-14 17:35:05 +00:00
062937ec3c fix(api): incorrect validation on search 2026-03-14 17:12:02 +00:00
2a00795cc2 chore(connectors): implement category_exists for Boards API 2026-03-14 17:11:49 +00:00
c990f29645 fix(frontend): misaligned loading page for datasets 2026-03-14 17:05:46 +00:00
8a423b2a29 feat(connectors): implement category validation in scraping process 2026-03-14 16:59:43 +00:00
d96f459104 fix(connectors): update URL references to use base_url in BoardsAPI 2026-03-13 21:59:17 +00:00
162a4de64e fix(frontend): detects which sources support category or search 2026-03-12 10:07:28 +00:00
6684780d23 fix(connectors): add stronger validation to scrape endpoint
Strong validation needed, otherwise data goes to Celery and crashes silently. In addition it checks if that specific source supports search or category.
2026-03-12 09:59:07 +00:00
c12f1b4371 chore(connectors): add category and search validation fields 2026-03-12 09:56:34 +00:00
01d6bd0164 fix(connectors): category / search fields breaking
Ideally category and search are fully optional, however some sites break if one or the other is not provided.

Unfortunately, `boards.ie` has a different page type for searches and I'm not bothered to implement a scraper from scratch.

In addition, removed comment limit options.
2026-03-11 21:16:26 +00:00
12cbc24074 chore(utils): remove split_limit function 2026-03-11 19:47:44 +00:00
0658713f42 chore: remove unused dataset creation script 2026-03-11 19:44:38 +00:00
b2ae1a9f70 feat(frontend): add page for scraping endpoint 2026-03-11 19:41:34 +00:00
eff416c34e fix(connectors): hardcoded source name in Youtube connector 2026-03-10 23:36:09 +00:00
524c9c50a0 fix(api): incorrect dataset status update message 2026-03-10 23:28:21 +00:00
2ab74d922a feat(api): support per-source search, category and limit configuration 2026-03-10 23:15:33 +00:00
d520e2af98 fix(auth): missing email and username business rules 2026-03-10 22:48:04 +00:00
8fe84a30f6 fix: data leak when opening topics file 2026-03-10 22:45:07 +00:00
dc330b87b9 fix(celery): process dataset directly in fetch task
Calling the original `process_dataset` function led to issues with JSON serialisation.
2026-03-10 22:17:00 +00:00
7ccc934f71 build: change celery to debug mode 2026-03-10 22:14:45 +00:00
a3dbe04a57 fix(frontend): option to delete dataset not shown after fail 2026-03-10 19:23:48 +00:00
a65c4a461c fix(api): flask delegates dataset fetch to celery 2026-03-10 19:17:41 +00:00
15704a0782 chore(db): update db schema to include "fetching" status 2026-03-10 19:17:08 +00:00
6ec47256d0 feat(api): add database scraping endpoints 2026-03-10 19:04:33 +00:00
2572664e26 chore(utils): add env getter that fails if env not found 2026-03-10 18:50:53 +00:00
17bd4702b2 fix(connectors): connector detectors returning name of ID alongside connector obj 2026-03-10 18:36:40 +00:00
53cb5c2ea5 feat(topics): add generalised topic list
This is easier and quicker compared to deriving a topics list based on the dataset that has been scraped.

While using LLMs to create a personalised topic list based on the query, category or dataset itself would yield better results for most, it is beyond the scope of this project.
2026-03-10 18:36:08 +00:00
0866dda8b3 chore: add util to always split evenly 2026-03-10 18:25:05 +00:00
5ccb2e73cd fix(connectors): incorrect registry location
Registry paths were using the incorrect connector path locations.
2026-03-10 18:18:42 +00:00
2a8d7c7972 refactor(connectors): Youtube & Reddit connectors implement BaseConnector 2026-03-10 18:11:33 +00:00
e7a8c17be4 chore(connectors): add base connector inheritance 2026-03-10 18:08:01 +00:00
cc799f7368 feat(connectors): add base connector and registry for detection
Idea is to have a "plugin-type" system, where new connectors can extend the `BaseConnector` class and implement the fetch posts method.

These are automatically detected by the registry, and automatically used in new Flask endpoints that give a list of possible sources.

Allows for an open-ended system where new data scrapers / API consumers can be added dynamically.
2026-03-09 21:29:03 +00:00
262a70dbf3 refactor(api): rename /upload endpoint
Ensures consistency with the other dataset-based endpoints and follows the REST-API rules more cleanly.
2026-03-09 20:55:12 +00:00
ca444e9cb0 refactor: move connectors to backend dir
They will now be more used in the backend.
2026-03-09 20:53:13 +00:00
738af5415b Merge pull request 'Editable and removable datasets' (#8) from feat/editable-datasets into main
Reviewed-on: #8
2026-03-05 16:55:48 +00:00
2b14a8a417 feat(frontend): add deletion modal confirmation box 2026-03-05 12:29:53 +00:00
a154b25415 fix(db): missing rollback on execute_batch method
Arguably more important on a batch function to have rollback.
2026-03-05 10:09:14 +00:00
eb273efe61 Merge remote-tracking branch 'origin/main' into feat/editable-datasets 2026-03-04 22:34:55 +00:00
a9001c79e1 build: add frontend to main docker compose
Forgot to add this earlier
2026-03-04 22:34:32 +00:00
eec8f2417e feat(frontend): add ability to delete datasets 2026-03-04 22:32:19 +00:00
f5835b5a97 feat(frontend): add frontend option to change name 2026-03-04 22:17:31 +00:00
64e3f9eea8 feat: implement PATCH dataset route
At the moment only allows for the updating of the name. Which seems to be the only editable part of dataset metadata.
2026-03-04 21:38:06 +00:00
4f01bf0419 fix(db): incorrect SQL condition when deleting dataset content 2026-03-04 21:35:10 +00:00
6948891677 Merge remote-tracking branch 'origin/main' into feat/editable-datasets 2026-03-04 21:30:13 +00:00
f1f33e2fe4 feat: implement delete dataset route 2026-03-04 21:29:01 +00:00
e20d0689e8 fix(celery): adjust try-catch logic to improve error handling
Capturing the instantiation of the database and dataset manager objects inside the try-catch will cause errors if something else fails.

If an exception occurs and the dataset_manager is not initialised, the code inside the catch block will fail.
2026-03-04 21:18:59 +00:00
fcdac6f3bb Merge pull request 'Fix the frontend API calls and implement logins on frontend' (#7) from feat/update-frontend-api-calls into main
Reviewed-on: #7
2026-03-04 20:20:50 +00:00
5fc1f1532f feat(user stats): updated styling and stats in user page
Interaction graph was taking up too much space and was the only thing on the screen. Further statistics were added however these may be removed in favour of more informative statistics
2026-03-04 20:20:34 +00:00
24277e0104 fix(frontend): move loading card higher up
Looks weird lower down on the screen
2026-03-04 20:09:55 +00:00
4e99b77492 fix(db): missing post ID in db schema
Caused surprisingly few errors. It only broke the interaction graph.
2026-03-04 20:05:20 +00:00
b6815c490a feat: add loading page for when dataset is loading
Originally there was a simple "Loading" text, however this looked bad and might lead a user to think that the page had frozen.

There is now a more comprehensive loading animation which users might be happy to sit on for a few minutes.
2026-03-04 18:39:20 +00:00
29c90ddfff feat: update name on topbar
Crosspost Analysis Engine sounds far cooler than "Ethnograph View"
2026-03-04 18:37:48 +00:00
3fe08b9c67 fix(backend): buggy reply_time_by_emotion metric
This metric was never statistically significant and held no real value. It also happened to hold accidental NaN values in the dataframe, which broke the frontend.

Happy to remove.
2026-03-04 18:37:11 +00:00
f9bc9cf9c9 fix: remove Datasets tab when not logged in 2026-03-03 20:32:33 +00:00
249528bb5c feat(frontend): remove "Upload" and "Last Stats" page
These are redundant and clunky, everything can be accessed from the Dataset tab
2026-03-03 20:30:42 +00:00
bd0e1a9050 refactor(frontend): move stylings out of logic into centralized file 2026-03-03 20:28:23 +00:00
e2ac4495fd chore(frontend): add extra types to frontend 2026-03-03 20:13:13 +00:00
f3b48525e2 feat(backend): increase default JWT expiration 2026-03-03 20:09:57 +00:00
55319461e5 feat: add "My Datasets" page 2026-03-03 19:52:12 +00:00
531ddb0467 fix(frontend): incorrect URLs in stats page 2026-03-03 18:36:46 +00:00
d11c5acb77 refactor: calculation of document titles into another class 2026-03-03 18:18:05 +00:00
f63f4e5f10 feat(frontend): add status page for loading dataset 2026-03-03 18:13:51 +00:00
23c58e20ae feat(frontend): add title to pages 2026-03-03 17:59:50 +00:00
207c4b67da feat(frontend): add dataset name requirements to the upload page 2026-03-03 17:28:46 +00:00
772205d3df feat(api): add ability to fetch all datasets by a user 2026-03-03 17:25:00 +00:00
b6de100a17 feat: overhaul upload page styling 2026-03-03 17:18:09 +00:00
5310568631 feat: add React layout and a topbar allowing for easy logins 2026-03-03 17:17:57 +00:00
4b33f17b4b fix: inconsistent styling in login page 2026-03-03 17:07:31 +00:00
64783e764d fix: remove unnecessary styling in index.css 2026-03-03 16:57:10 +00:00
8ac5207a11 feat: add login page 2026-03-03 15:55:01 +00:00
090a57f4dd build: add frontend to docker 2026-03-03 15:29:21 +00:00
c1a0324a03 build: add example env 2026-03-03 15:21:28 +00:00
cade2b1866 build: fix directories in docker compose 2026-03-03 15:19:57 +00:00
6e263cf30b Merge pull request 'Implement job queue for asynchronous NLP' (#6) from feat/implement-job-queue into main
Reviewed-on: #6
2026-03-03 14:26:37 +00:00
9d1e8960fc perf: update cultural analysis to use regex instead of Counter 2026-03-03 14:25:25 +00:00
0ede7fe071 fix(compose): add GPU support to celery worker 2026-03-03 14:18:43 +00:00
eb4187c559 feat(api): add status returns for NLP processing 2026-03-03 13:46:37 +00:00
63cd465189 feat(db): add status and constraints to the schema 2026-03-03 13:46:06 +00:00
f93e45b827 fix(dataset): silent erros if dataset did not exist 2026-03-03 13:13:40 +00:00
075e1fba85 fix: typo in exception naming 2026-03-03 13:12:28 +00:00
a4c527ce5b fix(db): execute not committing if fetch flag was set 2026-03-03 13:10:50 +00:00
6d60820800 build: add persistent model caching 2026-03-03 13:00:19 +00:00
3772f83d11 fix: add title column to db
This was accidentally removed in a previous merge
2026-03-03 12:41:02 +00:00
f4894759d7 feat: add docker-compose dev 2026-03-03 12:34:51 +00:00
3a58705635 feat: add celery & redis for background data processing 2026-03-03 12:27:14 +00:00
2e0e842525 build: update reqs and docker compose 2026-03-03 12:09:50 +00:00
14b472ea60 build: add dockerfile for constructing backend 2026-03-03 12:09:27 +00:00
c767f59b26 feat: add redis to docker compose 2026-03-03 11:27:01 +00:00
cc71c80df7 Merge pull request 'Refactor DB classes and management' (#5) from refactor/db-class into main
Reviewed-on: #5
2026-03-03 11:17:50 +00:00
6248b32ce2 refactor: move app.py into main server dir 2026-03-03 11:14:51 +00:00
07a3b204bf fix: incorrect docker compose db dirs 2026-03-03 11:14:41 +00:00
87bdc0245a refactor: move core files into separate dirs 2026-03-03 11:13:33 +00:00
8b8462fd58 chore: add non-existent database error check 2026-03-03 11:11:10 +00:00
36bede42d9 style: clean up imports 2026-03-03 11:08:56 +00:00
4bec0dd32c refactor: extract dataset functionality out of db class 2026-03-02 19:18:05 +00:00
4961ddc349 refactor: move db dir into server 2026-03-02 19:05:56 +00:00
45229a3f04 Merge pull request 'Missing title field in database events column' (#4) from fix/missing-title-field-database into main
Reviewed-on: #4
2026-03-02 19:00:14 +00:00
c9151da643 feat: add custom error for non-existent dataset 2026-03-02 18:59:31 +00:00
18c8539646 fix: server error when attmepting to access non-existant dataset 2026-03-02 18:55:27 +00:00
6d8f2fa4e0 feat: add custom exceptions file 2026-03-02 18:54:11 +00:00
1f6d92b1a8 fix: include title in db schema 2026-03-02 18:42:03 +00:00
2ae9479943 Merge pull request 'Fix broken filtering endpoints' (#3) from fix/broken-filtering into main
Reviewed-on: #3
2026-03-02 18:32:00 +00:00
dd44fad294 fix(db): incorrect NER column name in database saving 2026-03-02 18:30:52 +00:00
5ea71023b5 refactor: move query parameter extraction function out of flask app 2026-03-02 18:29:09 +00:00
37cb2c9ff4 feat(querying): make filters stateless
Stateless filters are required as the server cannot store them in the StatGen object
2026-03-02 16:18:02 +00:00
82a98f84bd refactor: combine query results into one endpoint 2026-03-01 19:06:49 +00:00
8b4adf4a63 refactor: update filtering method names 2026-03-01 18:44:46 +00:00
a6adea5a7d fix: broken stat_gen filter methods 2026-03-01 18:28:08 +00:00
47be4d9586 Merge pull request 'Storage of user data and datasets in PostGreSQL' (#2) from feat/database-integration into main
Reviewed-on: #2
2026-03-01 16:47:25 +00:00
7ddd625bf8 fix: database schema missing type column 2026-03-01 16:40:00 +00:00
07ab7529a9 refactor: update analysis classes to accept DataFrame as parameter instead of instance variable 2026-03-01 16:25:39 +00:00
d20790ed4b fix: incorrect dataset authorisation check 2026-03-01 16:10:42 +00:00
d3c4d883be Merge branch 'auth-test' of gitea:dylan/crosspost into auth-test 2026-03-01 16:01:48 +00:00
a975f3bdba docs: update gitignore 2026-03-01 16:01:47 +00:00
5fb7710dc2 feat: dataset now persists to database 2026-03-01 16:01:15 +00:00
0be9ff4896 feat: add dataset processor class 2026-03-01 15:01:34 +00:00
2493c6d465 feat: add save_dataset method to db 2026-03-01 15:01:07 +00:00
d0a01c8d2f feat: unify posts and comments tables 2026-03-01 15:00:54 +00:00
d73f4f1c45 Merge branch 'main' into auth-test 2026-02-25 08:59:32 +00:00
41f10c18cf feat: add nlp cols to database schema 2026-02-24 14:13:41 +00:00
058f3ae702 feat: update schema to include posts and comments 2026-02-23 22:53:15 +00:00
be6ab1f929 feat: add profile endpoint to view user details 2026-02-23 22:43:55 +00:00
3165bf1aa9 feat: add login endpoint 2026-02-23 22:40:26 +00:00
29a4e5bb22 feat: add database schema 2026-02-23 22:36:07 +00:00
dc919681fd docs: update requirements.txt 2026-02-23 22:27:46 +00:00
0589b2c8a5 feat: add /register endpoint 2026-02-23 22:27:32 +00:00
96a5bcc9e8 feat: add database & auth manager classes 2026-02-23 22:27:15 +00:00
66f1b26cc8 build: add docker compose for postgres database 2026-02-23 22:26:58 +00:00
98 changed files with 9481 additions and 1789 deletions

7
.gitignore vendored

@@ -8,4 +8,9 @@ __pycache__/
# React App Vite
node_modules/
dist/
dist/
helper
db
report/build
.DS_Store

19
Dockerfile Normal file

@@ -0,0 +1,19 @@
# Use slim to reduce size
FROM python:3.13-slim
# Prevent Python from buffering stdout
ENV PYTHONUNBUFFERED=1
# System deps required for psycopg2 + torch
RUN apt-get update && apt-get install -y \
build-essential \
libpq-dev \
gcc \
curl \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]


@@ -1,29 +1,49 @@
# crosspost
**crosspost** is a browser-based tool designed to support *digital ethnography*, the study of how people interact, communicate, and form culture in online spaces such as forums, social media platforms, and comment-driven communities.
A web-based analytics platform for exploring online communities. Built as a final year CS project at UCC, crosspost ingests data from Reddit, YouTube, and Boards.ie, runs NLP analysis on it (emotion detection, topic classification, named entity recognition, stance markers), and surfaces the results through an interactive dashboard.
The motivating use case is digital ethnography — studying how people talk, what they talk about, and how culture forms in online spaces. The included dataset is centred on Cork, Ireland.
The project aims to make it easier for students, researchers, and journalists to collect, organise, and explore online discourse in a structured and ethical way, without requiring deep technical expertise.
## What it does
- Fetch posts and comments from Reddit, YouTube, and Boards.ie (or upload your own .jsonl file)
- Normalise everything into a unified schema regardless of source
- Run NLP analysis asynchronously in the background via Celery workers
- Explore results through a tabbed dashboard: temporal patterns, word clouds, emotion breakdowns, user activity, interaction graphs, topic clusters, and more
- Multi-user support — each user has their own datasets, isolated from everyone else
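The normalisation step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the unified schema has the same fields as the post object shown in the manual upload format, and the names `UnifiedPost` and `normalise_reddit_post` are hypothetical. The Reddit field names (`selftext`, `created_utc`) are the ones Reddit's listing JSON uses.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class UnifiedPost:
    """Hypothetical unified record; fields mirror the manual upload format."""
    id: str
    author: str
    title: str
    content: str
    url: str
    timestamp: float
    source: str
    comments: list = field(default_factory=list)


def normalise_reddit_post(raw: dict) -> UnifiedPost:
    """Map one raw Reddit post dict onto the unified schema."""
    return UnifiedPost(
        id=raw["id"],
        author=raw["author"],
        title=raw["title"],
        # Link posts have no body text, so fall back to an empty string
        content=raw.get("selftext", ""),
        url=raw["url"],
        timestamp=float(raw["created_utc"]),
        source="reddit",
    )
```

A connector for another source would provide its own mapping function onto the same dataclass, and `asdict(post)` gives a plain dict that can be serialised to JSON.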
By combining data ingestion, analysis, and visualisation in a single system, crosspost turns raw online interactions into meaningful insights about how conversations emerge, evolve, and spread across platforms.
# Prerequisites
- Docker & Docker Compose
- A Reddit App (client id & secret)
- YouTube Data v3 API Key
## Goals for this project
- Collect data ethically: enable users to link/upload text, images, and interaction data (messages etc.) from specified online communities. Potentially, an automated method for importing (using APIs or scraping techniques) could be included as well.
- Organise content: Store gathered material in a structured database with tagging for themes, dates, and sources.
- Analyse patterns: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks.
- Visualise insights: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve.
- Have clearly stated and explained ethical and privacy guidelines for users. The student will design the architecture, implement data pipelines, integrate basic NLP models, and create an interactive dashboard.
# Setup
1) **Clone the Repo**
```
git clone https://github.com/your-username/crosspost.git
cd crosspost
```
Beyond programming, the project involves applying ethical research principles, handling data responsibly, and designing for non-technical users. By the end, the project will demonstrate how computer science can bridge technology and social research — turning raw online interactions into meaningful cultural insights.
2) **Configure Environment Vars**
```
cp example.env .env
```
Fill in each required empty variable. Some are already filled in; these are sensible defaults that usually don't need to be changed.
## Scope
3) **Start everything**
```
docker compose up -d
```
This project focuses on:
- Designing a modular data ingestion pipeline
- Implementing backend data processing and storage
- Integrating lightweight NLP-based analysis
- Building a simple, accessible frontend for exploration and visualisation
This starts:
- `crosspost_db` — PostgreSQL on port 5432
- `crosspost_redis` — Redis on port 6379
- `crosspost_flask` — Flask API on port 5000
- `crosspost_worker` — Celery worker for background NLP/fetching tasks
- `crosspost_frontend` — Vite dev server on port 5173
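To confirm the containers came up, a quick port check along these lines may help. It is only a sketch: it assumes the ports listed above are published on `localhost`, and the helper names `is_port_open` and `check_services` are made up for illustration. The Celery worker is left out because it has no port listed.

```python
import socket

# Ports taken from the service list above; publishing them on localhost is an assumption
SERVICES = {
    "crosspost_db": 5432,
    "crosspost_redis": 6379,
    "crosspost_flask": 5000,
    "crosspost_frontend": 5173,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Check every listed service and return a name -> reachable mapping."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` after `docker compose up -d` returns a dict such as `{"crosspost_db": True, ...}`. An open port only shows that something is listening, not that the service is healthy.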
# Requirements
# Data Format for Manual Uploads
If you want to upload your own data rather than fetch it via the connectors, the expected format is newline-delimited JSON (.jsonl) where each line is a post object:
```json
{"id": "abc123", "author": "username", "title": "Post title", "content": "Post body", "url": "https://...", "timestamp": 1700000000.0, "source": "reddit", "comments": []}
```
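Before uploading, a file can be checked against this format with a short script like the one below. This is a sketch under assumptions: the README does not say which fields are mandatory, so it treats all eight fields from the example as required, and `validate_jsonl_lines` is a hypothetical helper, not part of crosspost.

```python
import json

# Assumption: every field in the README example is required
EXPECTED_FIELDS = {
    "id", "author", "title", "content",
    "url", "timestamp", "source", "comments",
}


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            problems.append((number, "missing fields: " + ", ".join(sorted(missing))))
    return problems
```

For a file on disk, `validate_jsonl_lines(open("posts.jsonl", encoding="utf-8"))` works because file objects iterate line by line; `posts.jsonl` is a placeholder name.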
- **Python** ≥ 3.9
- **Python packages** listed in `requirements.txt`
- npm ≥ version 11
# Notes
- **GPU support**: The Celery worker is configured with `--pool=solo` to avoid memory conflicts when multiple NLP models are loaded. If you have an NVIDIA GPU, uncomment the deploy.resources block in docker-compose.yml and make sure the NVIDIA Container Toolkit is installed.


@@ -1,178 +0,0 @@
import requests
import logging
import time
from dto.post import Post
from dto.user import User
from dto.comment import Comment
logger = logging.getLogger(__name__)
class RedditAPI:
def __init__(self):
self.url = "https://www.reddit.com/"
self.source_name = "Reddit"
# Public Methods #
def search_new_subreddit_posts(self, search: str, subreddit: str, limit: int) -> list[Post]:
params = {
'q': search,
'limit': limit,
'restrict_sr': 'on',
'sort': 'new'
}
logger.info(f"Searching subreddit '{subreddit}' for '{search}' with limit {limit}")
url = f"r/{subreddit}/search.json"
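posts = []
after = None
while len(posts) < limit:
batch_limit = min(100, limit - len(posts))
params['limit'] = batch_limit
params['after'] = after
data = self._fetch_post_overviews(url, params)
batch_posts = self._parse_posts(data)
logger.debug(f"Fetched {len(batch_posts)} posts from search in subreddit {subreddit}")
if not batch_posts:
break
posts.extend(batch_posts)
# Advance the listing cursor; without it every iteration re-fetches the first page.
after = data['data'].get('after')
if not after:
break
return posts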
def get_new_subreddit_posts(self, subreddit: str, limit: int = 10) -> list[Post]:
posts = []
after = None
url = f"r/{subreddit}/new.json"
logger.info(f"Fetching new posts from subreddit: {subreddit}")
while len(posts) < limit:
batch_limit = min(100, limit - len(posts))
params = {
'limit': batch_limit,
'after': after
}
data = self._fetch_post_overviews(url, params)
batch_posts = self._parse_posts(data)
logger.debug(f"Fetched {len(batch_posts)} new posts from subreddit {subreddit}")
if not batch_posts:
break
posts.extend(batch_posts)
after = data['data'].get('after')
if not after:
break
return posts
def get_user(self, username: str) -> User:
data = self._fetch_post_overviews(f"user/{username}/about.json", {})
return self._parse_user(data)
## Private Methods ##
def _parse_posts(self, data) -> list[Post]:
posts = []
total_num_posts = len(data['data']['children'])
current_index = 0
for item in data['data']['children']:
current_index += 1
logger.debug(f"Parsing post {current_index} of {total_num_posts}")
post_data = item['data']
post = Post(
id=post_data['id'],
author=post_data['author'],
title=post_data['title'],
content=post_data.get('selftext', ''),
url=post_data['url'],
timestamp=post_data['created_utc'],
source=self.source_name,
comments=self._get_post_comments(post_data['id']))
post.subreddit = post_data['subreddit']
post.upvotes = post_data['ups']
posts.append(post)
return posts
def _get_post_comments(self, post_id: str) -> list[Comment]:
comments: list[Comment] = []
url = f"comments/{post_id}.json"
data = self._fetch_post_overviews(url, {})
if len(data) < 2:
return comments
comment_data = data[1]['data']['children']
def _parse_comment_tree(items, parent_id=None):
for item in items:
if item['kind'] != 't1':
continue
comment_info = item['data']
comment = Comment(
id=comment_info['id'],
post_id=post_id,
author=comment_info['author'],
content=comment_info.get('body', ''),
timestamp=comment_info['created_utc'],
reply_to=parent_id or comment_info.get('parent_id', None),
source=self.source_name
)
comments.append(comment)
# Process replies recursively
replies = comment_info.get('replies')
if replies and isinstance(replies, dict):
reply_items = replies.get('data', {}).get('children', [])
_parse_comment_tree(reply_items, parent_id=comment.id)
_parse_comment_tree(comment_data)
return comments
def _parse_user(self, data) -> User:
user_data = data['data']
user = User(
username=user_data['name'],
created_utc=user_data['created_utc'])
user.karma = user_data['total_karma']
return user
def _fetch_post_overviews(self, endpoint: str, params: dict) -> dict:
url = f"{self.url}{endpoint}"
max_retries = 15
backoff = 1 # seconds
for attempt in range(max_retries):
try:
response = requests.get(url, headers={'User-agent': 'python:ethnography-college-project:0.1 (by /u/ThisBirchWood)'}, params=params)
if response.status_code == 429:
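# Retry-After arrives as a string; convert it before passing it to time.sleep().
try:
wait_time = float(response.headers.get("Retry-After", backoff))
except (TypeError, ValueError):
wait_time = backoff
logger.warning(f"Rate limited by Reddit API. Retrying in {wait_time} seconds...")
time.sleep(wait_time)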
backoff *= 2
continue
if response.status_code == 500:
logger.warning("Server error from Reddit API. Retrying...")
time.sleep(backoff)
backoff *= 2
continue
response.raise_for_status()
return response.json()
except requests.RequestException as e:
logger.error(f"Error fetching data from Reddit API: {e}")
return {}
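
The 429 branch above reads `Retry-After` from the response headers, where values arrive as strings and may not be numeric at all (the header can also carry an HTTP date). A minimal sketch of turning that header into a sleep duration is shown below. The helper name `parse_retry_after` and the extra `buffer` seconds are assumptions for illustration, not part of the connector.

```python
def parse_retry_after(header_value, default: float, buffer: float = 1.0) -> float:
    """Return the number of seconds to wait before retrying.

    header_value is the raw Retry-After header (a string, or None if absent).
    If it is missing or not a plain number, fall back to `default`.
    The `buffer` is a hypothetical safety margin added on top.
    """
    try:
        wait = float(header_value)
    except (TypeError, ValueError):
        wait = float(default)
    return max(wait, 0.0) + buffer


# Usage: a numeric header, a missing header, and an HTTP-date header.
numeric_wait = parse_retry_after("5", default=1)
missing_wait = parse_retry_after(None, default=2)
date_wait = parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", default=4)
```

Keeping the conversion in one small function means the retry loop always passes a float to `time.sleep`, whatever the server sends.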


@@ -1,84 +0,0 @@
import os
import datetime
from dotenv import load_dotenv
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from dto.post import Post
from dto.comment import Comment
load_dotenv()
API_KEY = os.getenv("YOUTUBE_API_KEY")
class YouTubeAPI:
def __init__(self):
self.youtube = build('youtube', 'v3', developerKey=API_KEY)
def search_videos(self, query, limit):
request = self.youtube.search().list(
q=query,
part='snippet',
type='video',
maxResults=min(limit, 50)  # search.list accepts at most 50 results per request
)
response = request.execute()
return response.get('items', [])
def get_video_comments(self, video_id, limit):
request = self.youtube.commentThreads().list(
part='snippet',
videoId=video_id,
maxResults=min(limit, 100),  # commentThreads.list accepts at most 100 per request
textFormat='plainText'
)
try:
response = request.execute()
except HttpError as e:
print(f"Error fetching comments for video {video_id}: {e}")
return []
return response.get('items', [])
def fetch_videos(self, query, video_limit, comment_limit) -> list[Post]:
videos = self.search_videos(query, video_limit)
posts = []
for video in videos:
video_id = video['id']['videoId']
snippet = video['snippet']
title = snippet['title']
description = snippet['description']
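# The trailing "Z" means UTC; attach the timezone so .timestamp() does not assume local time.
published_at = datetime.datetime.strptime(snippet['publishedAt'], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp()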
channel_title = snippet['channelTitle']
comments = []
comments_data = self.get_video_comments(video_id, comment_limit)
for comment_thread in comments_data:
comment_snippet = comment_thread['snippet']['topLevelComment']['snippet']
comment = Comment(
id=comment_thread['id'],
post_id=video_id,
content=comment_snippet['textDisplay'],
author=comment_snippet['authorDisplayName'],
timestamp=datetime.datetime.strptime(comment_snippet['publishedAt'], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=datetime.timezone.utc).timestamp(),
reply_to=None,
source="YouTube"
)
comments.append(comment)
post = Post(
id=video_id,
content=f"{title}\n\n{description}",
author=channel_title,
timestamp=published_at,
url=f"https://www.youtube.com/watch?v={video_id}",
title=title,
source="YouTube",
comments=comments
)
posts.append(post)
return posts
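
The YouTube search endpoint returns at most 50 results per request, so a `video_limit` larger than that has to be spread across several pages. The sketch below shows only the arithmetic of splitting a requested limit into per-page sizes. The helper name `page_sizes` is hypothetical, and a real implementation would also need to pass each response's `nextPageToken` as `pageToken` on the following request.

```python
from typing import Iterator


def page_sizes(limit: int, cap: int = 50) -> Iterator[int]:
    """Yield per-request page sizes that add up to `limit`, none above `cap`."""
    remaining = limit
    while remaining > 0:
        size = min(cap, remaining)
        yield size
        remaining -= size


# Usage: a request for 120 videos becomes three pages.
sizes = list(page_sizes(120))
print(sizes)  # [50, 50, 20]
```

The same helper could serve the comment endpoint by passing `cap=100`.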


@@ -1,43 +0,0 @@
import json
import logging
from connectors.reddit_api import RedditAPI
from connectors.boards_api import BoardsAPI
from connectors.youtube_api import YouTubeAPI
posts_file = 'posts_test.jsonl'
reddit_connector = RedditAPI()
boards_connector = BoardsAPI()
youtube_connector = YouTubeAPI()
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.WARNING)
def remove_empty_posts(posts):
return [post for post in posts if post.content.strip() != ""]
def save_to_jsonl(filename, posts):
with open(filename, 'a', encoding='utf-8') as f:
for post in posts:
# Convert post object to dict if it's a dataclass
data = post.to_dict()
f.write(json.dumps(data) + '\n')
def main():
boards_posts = boards_connector.get_new_category_posts('cork-city', 1200, 1200)
save_to_jsonl(posts_file, boards_posts)
reddit_posts = reddit_connector.get_new_subreddit_posts('cork', 1200)
reddit_posts = remove_empty_posts(reddit_posts)
save_to_jsonl(posts_file, reddit_posts)
ireland_posts = reddit_connector.search_new_subreddit_posts('cork', 'ireland', 1200)
ireland_posts = remove_empty_posts(ireland_posts)
save_to_jsonl(posts_file, ireland_posts)
youtube_videos = youtube_connector.fetch_videos('cork city', 1200, 1200)
save_to_jsonl(posts_file, youtube_videos)
if __name__ == "__main__":
main()

docker-compose.dev.yml Normal file

@@ -0,0 +1,72 @@
services:
postgres:
image: postgres:16
container_name: crosspost_db
restart: unless-stopped
env_file:
- .env
ports:
- "5432:5432"
volumes:
- ${POSTGRES_DIR}:/var/lib/postgresql/data
- ./server/db/schema.sql:/docker-entrypoint-initdb.d/schema.sql
redis:
image: redis:7
container_name: crosspost_redis
restart: unless-stopped
ports:
- "6379:6379"
backend:
build: .
container_name: crosspost_flask
volumes:
- .:/app
- model_cache:/models
env_file:
- .env
ports:
- "5000:5000"
command: gunicorn server.app:app --bind 0.0.0.0:5000 --workers 2 --threads 4
depends_on:
- postgres
- redis
worker:
build: .
volumes:
- .:/app
- model_cache:/models
container_name: crosspost_worker
env_file:
- .env
command: >
celery -A server.queue.celery_app.celery worker
--loglevel=debug
--pool=solo
depends_on:
- postgres
- redis
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
frontend:
build:
context: ./frontend
container_name: crosspost_frontend
volumes:
- ./frontend:/app
- /app/node_modules
ports:
- "5173:5173"
depends_on:
- backend
volumes:
model_cache:

docker-compose.yml Normal file

@@ -0,0 +1,69 @@
services:
postgres:
image: postgres:16
container_name: crosspost_db
restart: unless-stopped
env_file:
- .env
ports:
- "5432:5432"
volumes:
- ${POSTGRES_DIR}:/var/lib/postgresql/data
- ./server/db/schema.sql:/docker-entrypoint-initdb.d/schema.sql
redis:
image: redis:7
container_name: crosspost_redis
restart: unless-stopped
ports:
- "6379:6379"
backend:
build: .
container_name: crosspost_flask
volumes:
- model_cache:/models
env_file:
- .env
ports:
- "5000:5000"
command: flask --app server.app run --host=0.0.0.0
depends_on:
- postgres
- redis
worker:
build: .
volumes:
- model_cache:/models
container_name: crosspost_worker
env_file:
- .env
command: >
celery -A server.queue.celery_app.celery worker
--loglevel=warning
--pool=solo
depends_on:
- postgres
- redis
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
frontend:
build:
context: ./frontend
container_name: crosspost_frontend
volumes:
- /app/node_modules
ports:
- "5173:5173"
depends_on:
- backend
volumes:
model_cache:


@@ -1,8 +0,0 @@
# Generic User Data Transfer Object for social media platforms
class User:
def __init__(self, username: str, created_utc: int):
self.username = username
self.created_utc = created_utc
# Optionals
self.karma = None

example.env Normal file

@@ -0,0 +1,30 @@
# API Keys
YOUTUBE_API_KEY=
REDDIT_CLIENT_ID=
REDDIT_CLIENT_SECRET=
# Database
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=mydatabase
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DIR=./db
# JWT
JWT_SECRET_KEY=
JWT_ACCESS_TOKEN_EXPIRES=28800
# Models
HF_HOME=/models/huggingface
TRANSFORMERS_CACHE=/models/huggingface
TORCH_HOME=/models/torch
# URLs
FRONTEND_URL=http://localhost:5173
BACKEND_URL=http://backend:5000
REDIS_URL=redis://redis:6379/0
# API & Scraping
MAX_FETCH_LIMIT=1000

frontend/Dockerfile Normal file

@@ -0,0 +1,13 @@
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json* ./
RUN npm install
# Copy rest of the app
COPY . .
EXPOSE 5173
CMD ["npm", "run", "dev", "--", "--host", "0.0.0.0"]


@@ -2,7 +2,7 @@
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/vite.svg" />
<link rel="icon" type="image/png" href="/icon.png" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>frontend</title>
</head>

frontend/public/icon.png Normal file

Binary file not shown.


@@ -1,12 +1,34 @@
import { Routes, Route } from "react-router-dom";
import { useEffect } from "react";
import { Navigate, Route, Routes, useLocation } from "react-router-dom";
import AppLayout from "./components/AppLayout";
import DatasetsPage from "./pages/Datasets";
import DatasetStatusPage from "./pages/DatasetStatus";
import LoginPage from "./pages/Login";
import UploadPage from "./pages/Upload";
import AutoFetchPage from "./pages/AutoFetch";
import StatPage from "./pages/Stats";
import { getDocumentTitle } from "./utils/documentTitle";
import DatasetEditPage from "./pages/DatasetEdit";
function App() {
const location = useLocation();
useEffect(() => {
document.title = getDocumentTitle(location.pathname);
}, [location.pathname]);
return (
<Routes>
<Route path="/upload" element={<UploadPage />} />
<Route path="/stats" element={<StatPage />} />
<Route element={<AppLayout />}>
<Route path="/" element={<Navigate to="/login" replace />} />
<Route path="/login" element={<LoginPage />} />
<Route path="/upload" element={<UploadPage />} />
<Route path="/auto-fetch" element={<AutoFetchPage />} />
<Route path="/datasets" element={<DatasetsPage />} />
<Route path="/dataset/:datasetId/status" element={<DatasetStatusPage />} />
<Route path="/dataset/:datasetId/stats" element={<StatPage />} />
<Route path="/dataset/:datasetId/edit" element={<DatasetEditPage />} />
</Route>
</Routes>
);
}


@@ -0,0 +1,135 @@
import { useCallback, useEffect, useState } from "react";
import axios from "axios";
import { Outlet, useLocation, useNavigate } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
type ProfileResponse = {
user?: Record<string, unknown>;
};
const styles = StatsStyling;
const getUserLabel = (user: Record<string, unknown> | null) => {
if (!user) {
return "Signed in";
}
const username = user.username;
if (typeof username === "string" && username.length > 0) {
return username;
}
const email = user.email;
if (typeof email === "string" && email.length > 0) {
return email;
}
return "Signed in";
};
const AppLayout = () => {
const location = useLocation();
const navigate = useNavigate();
const [isSignedIn, setIsSignedIn] = useState(false);
const [currentUser, setCurrentUser] = useState<Record<
string,
unknown
> | null>(null);
const syncAuthState = useCallback(async () => {
const token = localStorage.getItem("access_token");
if (!token) {
setIsSignedIn(false);
setCurrentUser(null);
delete axios.defaults.headers.common.Authorization;
return;
}
axios.defaults.headers.common.Authorization = `Bearer ${token}`;
try {
const response = await axios.get<ProfileResponse>(
`${API_BASE_URL}/profile`,
);
setIsSignedIn(true);
setCurrentUser(response.data.user ?? null);
} catch {
localStorage.removeItem("access_token");
delete axios.defaults.headers.common.Authorization;
setIsSignedIn(false);
setCurrentUser(null);
}
}, []);
useEffect(() => {
void syncAuthState();
}, [location.pathname, syncAuthState]);
const onAuthButtonClick = () => {
if (isSignedIn) {
localStorage.removeItem("access_token");
delete axios.defaults.headers.common.Authorization;
setIsSignedIn(false);
setCurrentUser(null);
navigate("/login", { replace: true });
return;
}
navigate("/login");
};
return (
<div style={styles.appShell}>
<div style={{ ...styles.container, ...styles.appHeaderWrap }}>
<div style={{ ...styles.card, ...styles.headerBar }}>
<div style={styles.appHeaderBrandRow}>
<span style={styles.appTitle}>CrossPost Analysis Engine</span>
<span
style={{
...styles.authStatusBadge,
...(isSignedIn
? styles.authStatusSignedIn
: styles.authStatusSignedOut),
}}
>
{isSignedIn
? `Signed in: ${getUserLabel(currentUser)}`
: "Not signed in"}
</span>
</div>
<div style={styles.controlsWrapped}>
{isSignedIn && (
<button
type="button"
style={
location.pathname === "/datasets"
? styles.buttonPrimary
: styles.buttonSecondary
}
onClick={() => navigate("/datasets")}
>
My datasets
</button>
)}
<button
type="button"
style={isSignedIn ? styles.buttonSecondary : styles.buttonPrimary}
onClick={onAuthButtonClick}
>
{isSignedIn ? "Sign out" : "Sign in"}
</button>
</div>
</div>
</div>
<Outlet />
</div>
);
};
export default AppLayout;


@@ -1,52 +1,27 @@
import type { CSSProperties } from "react";
import StatsStyling from "../styles/stats_styling";
const styles = StatsStyling;
const Card = (props: {
label: string;
value: string | number;
sublabel?: string;
rightSlot?: React.ReactNode;
style?: CSSProperties
style?: CSSProperties;
}) => {
return (
<div style={{
background: "rgba(255,255,255,0.85)",
border: "1px solid rgba(15,23,42,0.08)",
borderRadius: 16,
padding: 14,
boxShadow: "0 12px 30px rgba(15,23,42,0.06)",
minHeight: 88,
...props.style
}}>
<div style={ {
display: "flex",
justifyContent: "space-between",
alignItems: "center",
gap: 10,
}}>
<div style={{
fontSize: 12,
fontWeight: 700,
color: "rgba(15, 23, 42, 0.65)",
letterSpacing: "0.02em",
textTransform: "uppercase"
}}>
{props.label}
</div>
<div style={{ ...styles.cardBase, ...props.style }}>
<div style={styles.cardTopRow}>
<div style={styles.cardLabel}>{props.label}</div>
{props.rightSlot ? <div>{props.rightSlot}</div> : null}
</div>
<div style={{
fontSize: 22,
fontWeight: 850,
marginTop: 6,
letterSpacing: "-0.02em",
}}>{props.value}</div>
{props.sublabel ? <div style={{
marginTop: 6,
fontSize: 12,
color: "rgba(15, 23, 42, 0.55)",
}}>{props.sublabel}</div> : null}
<div style={styles.cardValue}>{props.value}</div>
{props.sublabel ? (
<div style={styles.cardSubLabel}>{props.sublabel}</div>
) : null}
</div>
);
}
};
export default Card;
export default Card;


@@ -0,0 +1,58 @@
import { Dialog, DialogPanel, DialogTitle } from "@headlessui/react";
import StatsStyling from "../styles/stats_styling";
type Props = {
open: boolean;
title: string;
message: string;
confirmLabel?: string;
cancelLabel?: string;
loading?: boolean;
onConfirm: () => void;
onCancel: () => void;
};
const styles = StatsStyling;
export default function ConfirmationModal({
open,
title,
message,
confirmLabel = "Confirm",
cancelLabel = "Cancel",
loading = false,
onConfirm,
onCancel,
}: Props) {
return (
<Dialog open={open} onClose={onCancel} style={styles.modalRoot}>
<div style={styles.modalBackdrop} />
<div style={styles.modalContainer}>
<DialogPanel style={{ ...styles.card, ...styles.modalPanel }}>
<DialogTitle style={styles.sectionTitle}>{title}</DialogTitle>
<p style={styles.sectionSubtitle}>{message}</p>
<div style={{ display: "flex", justifyContent: "flex-end", gap: 8 }}>
<button
type="button"
onClick={onCancel}
style={styles.buttonSecondary}
disabled={loading}
>
{cancelLabel}
</button>
<button
type="button"
onClick={onConfirm}
style={styles.buttonDanger}
disabled={loading}
>
{loading ? "Deleting..." : confirmLabel}
</button>
</div>
</DialogPanel>
</div>
</Dialog>
);
}


@@ -0,0 +1,247 @@
import { useEffect, useState } from "react";
import { Dialog, DialogPanel, DialogTitle } from "@headlessui/react";
import StatsStyling from "../styles/stats_styling";
import type { DatasetRecord } from "../utils/corpusExplorer";
const styles = StatsStyling;
const INITIAL_RECORD_COUNT = 60;
const RECORD_BATCH_SIZE = 60;
const EXCERPT_LENGTH = 320;
const cleanText = (value: unknown) => {
if (typeof value !== "string") {
return "";
}
const trimmed = value.trim();
if (!trimmed) {
return "";
}
const lowered = trimmed.toLowerCase();
if (lowered === "nan" || lowered === "null" || lowered === "undefined") {
return "";
}
return trimmed;
};
const displayText = (value: unknown, fallback: string) => {
const cleaned = cleanText(value);
return cleaned || fallback;
};
type CorpusExplorerProps = {
open: boolean;
onClose: () => void;
title: string;
description: string;
records: DatasetRecord[];
loading: boolean;
error: string;
emptyMessage: string;
};
const formatRecordDate = (record: DatasetRecord) => {
if (typeof record.dt === "string" && record.dt) {
const date = new Date(record.dt);
if (!Number.isNaN(date.getTime())) {
return date.toLocaleString();
}
}
if (typeof record.date === "string" && record.date) {
return record.date;
}
if (typeof record.timestamp === "number") {
return new Date(record.timestamp * 1000).toLocaleString();
}
return "Unknown time";
};
const getRecordKey = (record: DatasetRecord, index: number) =>
String(record.id ?? record.post_id ?? `${record.author ?? "record"}-${index}`);
const getRecordTitle = (record: DatasetRecord) => {
if (record.type === "comment") {
return "";
}
const title = cleanText(record.title);
if (title) {
return title;
}
const content = cleanText(record.content);
if (!content) {
return "Untitled record";
}
return content.length > 120 ? `${content.slice(0, 117)}...` : content;
};
const CorpusExplorer = ({
open,
onClose,
title,
description,
records,
loading,
error,
emptyMessage,
}: CorpusExplorerProps) => {
const [visibleCount, setVisibleCount] = useState(INITIAL_RECORD_COUNT);
const [expandedKeys, setExpandedKeys] = useState<Record<string, boolean>>({});
useEffect(() => {
if (open) {
setVisibleCount(INITIAL_RECORD_COUNT);
setExpandedKeys({});
}
}, [open, title, records.length]);
const hasMoreRecords = visibleCount < records.length;
return (
<Dialog open={open} onClose={onClose} style={styles.modalRoot}>
<div style={styles.modalBackdrop} />
<div style={styles.modalContainer}>
<DialogPanel
style={{
...styles.card,
...styles.modalPanel,
width: "min(960px, 96vw)",
maxHeight: "88vh",
display: "flex",
flexDirection: "column",
gap: 12,
overflow: "hidden",
}}
>
<div style={styles.headerBar}>
<div style={{ minWidth: 0 }}>
<DialogTitle style={styles.sectionTitle}>{title}</DialogTitle>
<p style={styles.sectionSubtitle}>
{description} {loading ? "Loading records..." : `${records.length.toLocaleString()} records.`}
</p>
</div>
<button onClick={onClose} style={styles.buttonSecondary}>
Close
</button>
</div>
{error ? <p style={styles.sectionSubtitle}>{error}</p> : null}
{!loading && !error && !records.length ? (
<p style={styles.sectionSubtitle}>{emptyMessage}</p>
) : null}
{loading ? <div style={styles.topUserMeta}>Preparing corpus slice...</div> : null}
{!loading && !error && records.length ? (
<>
<div
style={{
...styles.topUsersList,
overflowY: "auto",
overflowX: "hidden",
paddingRight: 4,
}}
>
{records.slice(0, visibleCount).map((record, index) => {
const recordKey = getRecordKey(record, index);
const titleText = getRecordTitle(record);
const content = cleanText(record.content);
const isExpanded = !!expandedKeys[recordKey];
const canExpand = content.length > EXCERPT_LENGTH;
const excerpt =
canExpand && !isExpanded
? `${content.slice(0, EXCERPT_LENGTH - 3)}...`
: content || "No content available.";
return (
<div key={recordKey} style={styles.topUserItem}>
<div style={{ ...styles.headerBar, alignItems: "flex-start" }}>
<div style={{ minWidth: 0, flex: 1 }}>
{titleText ? <div style={styles.topUserName}>{titleText}</div> : null}
<div
style={{
...styles.topUserMeta,
overflowWrap: "anywhere",
wordBreak: "break-word",
}}
>
{displayText(record.author, "Unknown author")} {displayText(record.source, "Unknown source")} {displayText(record.type, "record")} {formatRecordDate(record)}
</div>
</div>
<div
style={{
...styles.topUserMeta,
marginLeft: 12,
textAlign: "right",
overflowWrap: "anywhere",
wordBreak: "break-word",
}}
>
{cleanText(record.topic) ? `Topic: ${cleanText(record.topic)}` : ""}
</div>
</div>
<div
style={{
...styles.topUserMeta,
marginTop: 8,
whiteSpace: "pre-wrap",
overflowWrap: "anywhere",
wordBreak: "break-word",
}}
>
{excerpt}
</div>
{canExpand ? (
<div style={{ marginTop: 10 }}>
<button
onClick={() =>
setExpandedKeys((current) => ({
...current,
[recordKey]: !current[recordKey],
}))
}
style={styles.buttonSecondary}
>
{isExpanded ? "Show Less" : "Show More"}
</button>
</div>
) : null}
</div>
);
})}
</div>
{hasMoreRecords ? (
<div style={{ display: "flex", justifyContent: "center" }}>
<button
onClick={() =>
setVisibleCount((current) => current + RECORD_BATCH_SIZE)
}
style={styles.buttonSecondary}
>
Show More Records
</button>
</div>
) : null}
</>
) : null}
</DialogPanel>
</div>
</Dialog>
);
};
export default CorpusExplorer;


@@ -0,0 +1,249 @@
import Card from "./Card";
import StatsStyling from "../styles/stats_styling";
import type { CulturalAnalysisResponse } from "../types/ApiTypes";
import {
buildCertaintySpec,
buildDeonticSpec,
buildEntitySpec,
buildHedgeSpec,
buildIdentityBucketSpec,
buildPermissionSpec,
type CorpusExplorerSpec,
} from "../utils/corpusExplorer";
const styles = StatsStyling;
const exploreButtonStyle = { padding: "4px 8px", fontSize: 12 };
type CulturalStatsProps = {
data: CulturalAnalysisResponse;
onExplore: (spec: CorpusExplorerSpec) => void;
};
const renderExploreButton = (onClick: () => void) => (
<button
onClick={onClick}
style={{ ...styles.buttonSecondary, ...exploreButtonStyle }}
>
Explore
</button>
);
const CulturalStats = ({ data, onExplore }: CulturalStatsProps) => {
const identity = data.identity_markers;
const stance = data.stance_markers;
const inGroupWords = identity?.in_group_usage ?? 0;
const outGroupWords = identity?.out_group_usage ?? 0;
const totalGroupWords = inGroupWords + outGroupWords;
const inGroupWordRate =
typeof identity?.in_group_ratio === "number"
? identity.in_group_ratio * 100
: null;
const outGroupWordRate =
typeof identity?.out_group_ratio === "number"
? identity.out_group_ratio * 100
: null;
const rawEntities = data.avg_emotion_per_entity?.entity_emotion_avg ?? {};
const entities = Object.entries(rawEntities)
.sort((a, b) => b[1].post_count - a[1].post_count)
.slice(0, 20);
const topEmotion = (emotionAvg: Record<string, number> | undefined) => {
const entries = Object.entries(emotionAvg ?? {});
if (!entries.length) {
return "-";
}
entries.sort((a, b) => b[1] - a[1]);
const dominant = entries[0] ?? ["emotion_unknown", 0];
const dominantLabel = dominant[0].replace("emotion_", "");
return `${dominantLabel} (${(dominant[1] * 100).toFixed(1)}%)`;
};
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.grid }}>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Community Framing Overview</h2>
<p style={styles.sectionSubtitle}>
Simple view of how often people use "us" words vs "them" words, and
the tone around that language.
</p>
</div>
<Card
label="In-Group Words"
value={inGroupWords.toLocaleString()}
sublabel="Times we/us/our appears"
style={{ gridColumn: "span 3" }}
/>
<Card
label="Out-Group Words"
value={outGroupWords.toLocaleString()}
sublabel="Times they/them/their appears"
style={{ gridColumn: "span 3" }}
/>
<Card
label="In-Group Posts"
value={identity?.in_group_posts?.toLocaleString() ?? "-"}
sublabel='Posts leaning toward "us" language'
rightSlot={renderExploreButton(() =>
onExplore(buildIdentityBucketSpec("in")),
)}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Out-Group Posts"
value={identity?.out_group_posts?.toLocaleString() ?? "-"}
sublabel='Posts leaning toward "them" language'
rightSlot={renderExploreButton(() =>
onExplore(buildIdentityBucketSpec("out")),
)}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Balanced Posts"
value={identity?.tie_posts?.toLocaleString() ?? "-"}
sublabel="Posts with equal us/them signals"
rightSlot={renderExploreButton(() =>
onExplore(buildIdentityBucketSpec("tie")),
)}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Total Group Words"
value={totalGroupWords.toLocaleString()}
sublabel="In-group + out-group words"
style={{ gridColumn: "span 3" }}
/>
<Card
label="In-Group Share"
value={
inGroupWordRate === null ? "-" : `${inGroupWordRate.toFixed(2)}%`
}
sublabel="Share of all words"
style={{ gridColumn: "span 3" }}
/>
<Card
label="Out-Group Share"
value={
outGroupWordRate === null ? "-" : `${outGroupWordRate.toFixed(2)}%`
}
sublabel="Share of all words"
style={{ gridColumn: "span 3" }}
/>
<Card
label="Hedging Words"
value={stance?.hedge_total?.toLocaleString() ?? "-"}
sublabel={
typeof stance?.hedge_per_1k_tokens === "number"
? `${stance.hedge_per_1k_tokens.toFixed(1)} per 1k words`
: "Word frequency"
}
rightSlot={renderExploreButton(() => onExplore(buildHedgeSpec()))}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Certainty Words"
value={stance?.certainty_total?.toLocaleString() ?? "-"}
sublabel={
typeof stance?.certainty_per_1k_tokens === "number"
? `${stance.certainty_per_1k_tokens.toFixed(1)} per 1k words`
: "Word frequency"
}
rightSlot={renderExploreButton(() => onExplore(buildCertaintySpec()))}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Need/Should Words"
value={stance?.deontic_total?.toLocaleString() ?? "-"}
sublabel={
typeof stance?.deontic_per_1k_tokens === "number"
? `${stance.deontic_per_1k_tokens.toFixed(1)} per 1k words`
: "Word frequency"
}
rightSlot={renderExploreButton(() => onExplore(buildDeonticSpec()))}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Permission Words"
value={stance?.permission_total?.toLocaleString() ?? "-"}
sublabel={
typeof stance?.permission_per_1k_tokens === "number"
? `${stance.permission_per_1k_tokens.toFixed(1)} per 1k words`
: "Word frequency"
}
rightSlot={renderExploreButton(() => onExplore(buildPermissionSpec()))}
style={{ gridColumn: "span 3" }}
/>
<div style={{ ...styles.card, gridColumn: "span 6" }}>
<h2 style={styles.sectionTitle}>Mood in "Us" Posts</h2>
<p style={styles.sectionSubtitle}>
Most likely emotion when in-group wording is stronger.
</p>
<div style={styles.topUserName}>{topEmotion(identity?.in_group_emotion_avg)}</div>
<div style={{ marginTop: 12 }}>
<button
onClick={() => onExplore(buildIdentityBucketSpec("in"))}
style={styles.buttonSecondary}
>
Explore records
</button>
</div>
</div>
<div style={{ ...styles.card, gridColumn: "span 6" }}>
<h2 style={styles.sectionTitle}>Mood in "Them" Posts</h2>
<p style={styles.sectionSubtitle}>
Most likely emotion when out-group wording is stronger.
</p>
<div style={styles.topUserName}>{topEmotion(identity?.out_group_emotion_avg)}</div>
<div style={{ marginTop: 12 }}>
<button
onClick={() => onExplore(buildIdentityBucketSpec("out"))}
style={styles.buttonSecondary}
>
Explore records
</button>
</div>
</div>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Entity Mood Snapshot</h2>
<p style={styles.sectionSubtitle}>
Most mentioned entities and the mood that appears most with each.
</p>
{!entities.length ? (
<div style={styles.topUserMeta}>No entity-level cultural data available.</div>
) : (
<div
style={{
...styles.topUsersList,
maxHeight: 420,
overflowY: "auto",
}}
>
{entities.map(([entity, aggregate]) => (
<div
key={entity}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildEntitySpec(entity))}
>
<div style={styles.topUserName}>{entity}</div>
<div style={styles.topUserMeta}>
{aggregate.post_count.toLocaleString()} posts Likely mood:{" "}
{topEmotion(aggregate.emotion_avg)}
</div>
</div>
))}
</div>
)}
</div>
</div>
</div>
);
};
export default CulturalStats;


@@ -1,14 +1,25 @@
import type { ContentAnalysisResponse } from "../types/ApiTypes"
import type { EmotionalAnalysisResponse } from "../types/ApiTypes";
import StatsStyling from "../styles/stats_styling";
import {
buildDominantEmotionSpec,
buildSourceSpec,
buildTopicSpec,
type CorpusExplorerSpec,
} from "../utils/corpusExplorer";
const styles = StatsStyling;
type EmotionalStatsProps = {
contentData: ContentAnalysisResponse;
}
emotionalData: EmotionalAnalysisResponse;
onExplore: (spec: CorpusExplorerSpec) => void;
};
const EmotionalStats = ({contentData}: EmotionalStatsProps) => {
const rows = contentData.average_emotion_by_topic ?? [];
const EmotionalStats = ({ emotionalData, onExplore }: EmotionalStatsProps) => {
const rows = emotionalData.average_emotion_by_topic ?? [];
const overallEmotionAverage = emotionalData.overall_emotion_average ?? [];
const dominantEmotionDistribution =
emotionalData.dominant_emotion_distribution ?? [];
const emotionBySource = emotionalData.emotion_by_source ?? [];
const lowSampleThreshold = 20;
const stableSampleThreshold = 50;
const emotionKeys = rows.length
@@ -31,7 +42,7 @@ const EmotionalStats = ({contentData}: EmotionalStatsProps) => {
topic: String(row.topic),
count: Number(row.n ?? 0),
emotion: maxKey.replace("emotion_", "") || "unknown",
value: maxValue > Number.NEGATIVE_INFINITY ? maxValue : 0
value: maxValue > Number.NEGATIVE_INFINITY ? maxValue : 0,
};
});
@@ -45,8 +56,12 @@ const EmotionalStats = ({contentData}: EmotionalStatsProps) => {
.filter((count) => Number.isFinite(count) && count > 0)
.sort((a, b) => a - b);
const lowSampleTopics = strongestPerTopic.filter((topic) => topic.count < lowSampleThreshold).length;
const stableSampleTopics = strongestPerTopic.filter((topic) => topic.count >= stableSampleThreshold).length;
const lowSampleTopics = strongestPerTopic.filter(
(topic) => topic.count < lowSampleThreshold,
).length;
const stableSampleTopics = strongestPerTopic.filter(
(topic) => topic.count >= stableSampleThreshold,
).length;
const medianSampleSize = sampleSizes.length
? sampleSizes[Math.floor(sampleSizes.length / 2)]
@@ -64,42 +79,184 @@ const EmotionalStats = ({contentData}: EmotionalStatsProps) => {
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
<h2 style={styles.sectionTitle}>Average Emotion by Topic</h2>
<p style={styles.sectionSubtitle}>Read confidence together with sample size. Topics with fewer than {lowSampleThreshold} events are usually noisy and less reliable.</p>
<div style={{ display: "flex", flexWrap: "wrap", gap: 10, fontSize: 13, color: "#4b5563", marginTop: 6 }}>
<span><strong style={{ color: "#111827" }}>Topics:</strong> {strongestPerTopic.length}</span>
<span><strong style={{ color: "#111827" }}>Median Sample:</strong> {medianSampleSize} events</span>
<span><strong style={{ color: "#111827" }}>Low Sample (&lt;{lowSampleThreshold}):</strong> {lowSampleTopics}</span>
<span><strong style={{ color: "#111827" }}>Stable Sample ({stableSampleThreshold}+):</strong> {stableSampleTopics}</span>
<h2 style={styles.sectionTitle}>Topic Mood Overview</h2>
<p style={styles.sectionSubtitle}>
Use the strength score together with post count. Topics with fewer
than {lowSampleThreshold} events are often noisy.
</p>
<div style={styles.emotionalSummaryRow}>
<span>
<strong style={{ color: "#24292f" }}>Topics:</strong>{" "}
{strongestPerTopic.length}
</span>
<span>
<strong style={{ color: "#24292f" }}>Median Posts:</strong>{" "}
{medianSampleSize}
</span>
<span>
<strong style={{ color: "#24292f" }}>
Small Topics (&lt;{lowSampleThreshold}):
</strong>{" "}
{lowSampleTopics}
</span>
<span>
<strong style={{ color: "#24292f" }}>
Stable Topics ({stableSampleThreshold}+):
</strong>{" "}
{stableSampleTopics}
</span>
</div>
<p style={{ ...styles.sectionSubtitle, marginTop: 10, marginBottom: 0 }}>
Confidence reflects how strongly one emotion leads within a topic, not model accuracy. Use larger samples for stronger conclusions.
<p
style={{ ...styles.sectionSubtitle, marginTop: 10, marginBottom: 0 }}
>
Strength means how far the top emotion is ahead in that topic. It does
not mean model accuracy.
</p>
</div>
<div style={{ ...styles.container, ...styles.grid }}>
{strongestPerTopic.map((topic) => (
<div key={topic.topic} style={{ ...styles.card, gridColumn: "span 4" }}>
<h3 style={{ ...styles.sectionTitle, marginBottom: 6 }}>{topic.topic}</h3>
<div style={{ fontSize: 12, fontWeight: 700, color: "#6b7280", letterSpacing: "0.02em", textTransform: "uppercase" }}>
Top Emotion
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Mood Averages</h2>
<p style={styles.sectionSubtitle}>Average score for each emotion.</p>
{!overallEmotionAverage.length ? (
<div style={styles.topUserMeta}>
No overall emotion averages available.
</div>
<div style={{ fontSize: 24, fontWeight: 800, marginTop: 4, lineHeight: 1.2 }}>
{formatEmotion(topic.emotion)}
) : (
<div
style={{
...styles.topUsersList,
maxHeight: 260,
overflowY: "auto",
}}
>
{[...overallEmotionAverage]
.sort((a, b) => b.score - a.score)
.map((row) => (
<div
key={row.emotion}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildDominantEmotionSpec(row.emotion))}
>
<div style={styles.topUserName}>
{formatEmotion(row.emotion)}
</div>
<div style={styles.topUserMeta}>{row.score.toFixed(3)}</div>
</div>
))}
</div>
<div style={{ display: "flex", justifyContent: "space-between", alignItems: "center", marginTop: 10, fontSize: 13, color: "#6b7280" }}>
<span>Confidence</span>
<span style={{ fontWeight: 700, color: "#111827" }}>{topic.value.toFixed(3)}</span>
)}
</div>
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Mood Split</h2>
<p style={styles.sectionSubtitle}>
How often each emotion is dominant.
</p>
{!dominantEmotionDistribution.length ? (
<div style={styles.topUserMeta}>
No dominant-emotion split available.
</div>
<div style={{ display: "flex", justifyContent: "space-between", alignItems: "center", marginTop: 4, fontSize: 13, color: "#6b7280" }}>
<span>Sample Size</span>
<span style={{ fontWeight: 700, color: "#111827" }}>{topic.count} events</span>
) : (
<div
style={{
...styles.topUsersList,
maxHeight: 260,
overflowY: "auto",
}}
>
{[...dominantEmotionDistribution]
.sort((a, b) => b.ratio - a.ratio)
.map((row) => (
<div
key={row.emotion}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildDominantEmotionSpec(row.emotion))}
>
<div style={styles.topUserName}>
{formatEmotion(row.emotion)}
</div>
<div style={styles.topUserMeta}>
{(row.ratio * 100).toFixed(1)}% {" "}
{row.count.toLocaleString()} events
</div>
</div>
))}
</div>
)}
</div>
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Mood by Source</h2>
<p style={styles.sectionSubtitle}>Leading emotion in each source.</p>
{!emotionBySource.length ? (
<div style={styles.topUserMeta}>
No source emotion profile available.
</div>
) : (
<div
style={{
...styles.topUsersList,
maxHeight: 260,
overflowY: "auto",
}}
>
{[...emotionBySource]
.sort((a, b) => b.event_count - a.event_count)
.map((row) => (
<div
key={row.source}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildSourceSpec(row.source))}
>
<div style={styles.topUserName}>{row.source}</div>
<div style={styles.topUserMeta}>
{formatEmotion(row.dominant_emotion)} {" "}
{row.dominant_score.toFixed(3)} {" "}
{row.event_count.toLocaleString()} events
</div>
</div>
))}
</div>
)}
</div>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Topic Snapshots</h2>
<p style={styles.sectionSubtitle}>
Per-topic mood with strength and post count.
</p>
<div style={{ ...styles.grid, marginTop: 10 }}>
{strongestPerTopic.map((topic) => (
<div
key={topic.topic}
style={{ ...styles.cardBase, gridColumn: "span 4", cursor: "pointer" }}
onClick={() => onExplore(buildTopicSpec(topic.topic))}
>
<h3 style={{ ...styles.sectionTitle, marginBottom: 6 }}>
{topic.topic}
</h3>
<div style={styles.emotionalTopicLabel}>Likely Mood</div>
<div style={styles.emotionalTopicValue}>
{formatEmotion(topic.emotion)}
</div>
<div style={styles.emotionalMetricRow}>
<span>Strength</span>
<span style={styles.emotionalMetricValue}>
{topic.value.toFixed(3)}
</span>
</div>
<div style={styles.emotionalMetricRowCompact}>
<span>Posts in Topic</span>
<span style={styles.emotionalMetricValue}>{topic.count}</span>
</div>
</div>
))}
</div>
))}
</div>
</div>
</div>
);
}
};
export default EmotionalStats;
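The per-topic "strength" and sample-size figures shown by this component can be sketched as a pair of plain functions. This is a minimal sketch and not the component's actual code: the `TopicEmotionRow` shape, the `strongestEmotion` and `sampleSummary` names, and the fallback of `0` for an empty median are assumptions, since the diff hunks elide those parts. The thresholds 20 and 50 come from the diff.

```typescript
// Hypothetical row shape for one entry of average_emotion_by_topic:
// a topic name, a sample size n, and any number of emotion_* scores.
type TopicEmotionRow = {
  topic: string;
  n?: number;
  [key: string]: string | number | undefined;
};

type TopicMood = { topic: string; count: number; emotion: string; value: number };

// Pick the highest-scoring emotion_* column for a single topic row.
function strongestEmotion(row: TopicEmotionRow): TopicMood {
  let maxKey = "";
  let maxValue = Number.NEGATIVE_INFINITY;
  for (const [key, raw] of Object.entries(row)) {
    if (!key.startsWith("emotion_")) continue;
    const score = Number(raw);
    if (score > maxValue) {
      maxKey = key;
      maxValue = score;
    }
  }
  return {
    topic: String(row.topic),
    count: Number(row.n ?? 0),
    emotion: maxKey.replace("emotion_", "") || "unknown",
    value: maxValue > Number.NEGATIVE_INFINITY ? maxValue : 0,
  };
}

// Count small and stable topics and take the median sample size.
// For an even number of topics this index picks the upper of the two middle values.
function sampleSummary(moods: TopicMood[], low = 20, stable = 50) {
  const sizes = moods
    .map((m) => m.count)
    .filter((c) => Number.isFinite(c) && c > 0)
    .sort((a, b) => a - b);
  return {
    lowSampleTopics: moods.filter((m) => m.count < low).length,
    stableSampleTopics: moods.filter((m) => m.count >= stable).length,
    // Assumption: 0 when no topic has a positive sample size.
    medianSampleSize: sizes.length ? sizes[Math.floor(sizes.length / 2)] : 0,
  };
}
```

Note that a topic with a sample size of 0 still counts as a small topic here but is excluded from the median, which matches the filter shown in the diff.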


@@ -0,0 +1,262 @@
import Card from "./Card";
import StatsStyling from "../styles/stats_styling";
import type { InteractionAnalysisResponse } from "../types/ApiTypes";
import {
ResponsiveContainer,
BarChart,
Bar,
XAxis,
YAxis,
CartesianGrid,
Tooltip,
PieChart,
Pie,
Cell,
Legend,
} from "recharts";
const styles = StatsStyling;
type InteractionalStatsProps = {
data: InteractionAnalysisResponse;
};
const InteractionalStats = ({ data }: InteractionalStatsProps) => {
const graph = data.interaction_graph ?? {};
const userCount = Object.keys(graph).length;
let edgeCount = 0;
let interactionVolume = 0;
for (const targets of Object.values(graph)) {
for (const value of Object.values(targets)) {
edgeCount += 1;
interactionVolume += value;
}
}
const concentration = data.conversation_concentration;
const topTenCommentShare =
typeof concentration?.top_10pct_comment_share === "number"
? concentration.top_10pct_comment_share
: null;
const topTenAuthorCount =
typeof concentration?.top_10pct_author_count === "number"
? concentration.top_10pct_author_count
: null;
const totalCommentingAuthors =
typeof concentration?.total_commenting_authors === "number"
? concentration.total_commenting_authors
: null;
const singleCommentAuthorRatio =
typeof concentration?.single_comment_author_ratio === "number"
? concentration.single_comment_author_ratio
: null;
const singleCommentAuthors =
typeof concentration?.single_comment_authors === "number"
? concentration.single_comment_authors
: null;
const topPairs = (data.top_interaction_pairs ?? [])
.filter((item): item is [[string, string], number] => {
if (!Array.isArray(item) || item.length !== 2) {
return false;
}
const pair = item[0];
const count = item[1];
return (
Array.isArray(pair) &&
pair.length === 2 &&
typeof pair[0] === "string" &&
typeof pair[1] === "string" &&
typeof count === "number"
);
})
.slice(0, 20);
const topPairChartData = topPairs
.slice(0, 8)
.map(([[source, target], value], index) => ({
pair: `${source} -> ${target}`,
replies: value,
rank: index + 1,
}));
const topTenSharePercent =
topTenCommentShare === null ? null : topTenCommentShare * 100;
const nonTopTenSharePercent =
topTenSharePercent === null ? null : Math.max(0, 100 - topTenSharePercent);
let concentrationPieData: { name: string; value: number }[] = [];
if (topTenSharePercent !== null && nonTopTenSharePercent !== null) {
concentrationPieData = [
{ name: "Top 10% authors", value: topTenSharePercent },
{ name: "Other authors", value: nonTopTenSharePercent },
];
}
const PIE_COLORS = ["#2b6777", "#c8d8e4"];
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.grid }}>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Conversation Overview</h2>
<p style={styles.sectionSubtitle}>
Who talks to whom, how much they interact, and how concentrated the replies are.
</p>
</div>
<Card
label="Users in Network"
value={userCount.toLocaleString()}
sublabel="Users in the reply graph"
style={{ gridColumn: "span 4" }}
/>
<Card
label="User-to-User Links"
value={edgeCount.toLocaleString()}
sublabel="Unique reply directions"
style={{ gridColumn: "span 4" }}
/>
<Card
label="Total Replies"
value={interactionVolume.toLocaleString()}
sublabel="All reply links combined"
style={{ gridColumn: "span 4" }}
/>
<Card
label="Concentrated Replies"
value={
topTenSharePercent === null
? "-"
: `${topTenSharePercent.toFixed(1)}%`
}
sublabel={
topTenAuthorCount === null || totalCommentingAuthors === null
? "Reply share from the top 10% commenters"
: `${topTenAuthorCount.toLocaleString()} of ${totalCommentingAuthors.toLocaleString()} authors`
}
style={{ gridColumn: "span 6" }}
/>
<Card
label="Single-Comment Authors"
value={
singleCommentAuthorRatio === null
? "-"
: `${(singleCommentAuthorRatio * 100).toFixed(1)}%`
}
sublabel={
singleCommentAuthors === null
? "Authors who commented exactly once"
: `${singleCommentAuthors.toLocaleString()} authors commented exactly once`
}
style={{ gridColumn: "span 6" }}
/>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Conversation Visuals</h2>
<p style={styles.sectionSubtitle}>
Main reply links and concentration split.
</p>
<div style={{ ...styles.grid, marginTop: 12 }}>
<div style={{ ...styles.cardBase, gridColumn: "span 6" }}>
<h3 style={{ ...styles.sectionTitle, fontSize: "1rem" }}>
Top Interaction Pairs
</h3>
<div style={{ width: "100%", height: 300 }}>
<ResponsiveContainer>
<BarChart
data={topPairChartData}
layout="vertical"
margin={{ top: 8, right: 16, left: 16, bottom: 8 }}
>
<CartesianGrid strokeDasharray="3 3" stroke="#d9e2ec" />
<XAxis type="number" allowDecimals={false} />
<YAxis
type="category"
dataKey="rank"
tickFormatter={(value) => `#${value}`}
width={36}
/>
<Tooltip />
<Bar
dataKey="replies"
fill="#2b6777"
radius={[0, 6, 6, 0]}
/>
</BarChart>
</ResponsiveContainer>
</div>
</div>
<div style={{ ...styles.cardBase, gridColumn: "span 6" }}>
<h3 style={{ ...styles.sectionTitle, fontSize: "1rem" }}>
Top 10% vs Other Comment Share
</h3>
<div style={{ width: "100%", height: 300 }}>
<ResponsiveContainer>
<PieChart>
<Pie
data={concentrationPieData}
dataKey="value"
nameKey="name"
innerRadius={56}
outerRadius={88}
paddingAngle={2}
>
{concentrationPieData.map((entry, index) => (
<Cell
key={`${entry.name}-${index}`}
fill={PIE_COLORS[index % PIE_COLORS.length]}
/>
))}
</Pie>
<Tooltip />
<Legend verticalAlign="bottom" height={36} />
</PieChart>
</ResponsiveContainer>
</div>
</div>
</div>
</div>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Frequent Reply Paths</h2>
<p style={styles.sectionSubtitle}>
Most common user-to-user reply paths.
</p>
{!topPairs.length ? (
<div style={styles.topUserMeta}>
No interaction pair data available.
</div>
) : (
<div
style={{
...styles.topUsersList,
maxHeight: 420,
overflowY: "auto",
}}
>
{topPairs.map(([[source, target], value], index) => (
<div
key={`${source}->${target}-${index}`}
style={styles.topUserItem}
>
<div style={styles.topUserName}>
{source} -&gt; {target}
</div>
<div style={styles.topUserMeta}>
{value.toLocaleString()} replies
</div>
</div>
))}
</div>
)}
</div>
</div>
</div>
);
};
export default InteractionalStats;
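The overview cards in this component reduce the nested reply graph to three counts and a two-slice share split. The following is a minimal sketch of that reduction as standalone functions. The `InteractionGraph` type here is an assumed shape (source user to target user to reply count), because the real type lives in `../types/ApiTypes` and is not shown in the diff; `summariseGraph` and `concentrationSplit` are hypothetical names.

```typescript
// Assumed shape: graph[source][target] = number of replies from source to target.
type InteractionGraph = Record<string, Record<string, number>>;

// Count users (top-level keys), directed links, and total replies.
function summariseGraph(graph: InteractionGraph) {
  let edgeCount = 0;
  let interactionVolume = 0;
  for (const targets of Object.values(graph)) {
    for (const value of Object.values(targets)) {
      edgeCount += 1;
      interactionVolume += value;
    }
  }
  return { userCount: Object.keys(graph).length, edgeCount, interactionVolume };
}

// Turn a 0..1 share into the two pie slices, clamping the remainder at 0.
function concentrationSplit(
  topTenShare: number | null,
): { name: string; value: number }[] {
  if (topTenShare === null) return [];
  const top = topTenShare * 100;
  return [
    { name: "Top 10% authors", value: top },
    { name: "Other authors", value: Math.max(0, 100 - top) },
  ];
}
```

One consequence worth noting: `userCount` counts only top-level keys, so a user who receives replies but never appears as a source is not counted unless the backend also emits an entry for them.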

View File

@@ -0,0 +1,137 @@
import Card from "./Card";
import StatsStyling from "../styles/stats_styling";
import type { LinguisticAnalysisResponse } from "../types/ApiTypes";
import {
buildNgramSpec,
buildWordSpec,
type CorpusExplorerSpec,
} from "../utils/corpusExplorer";
const styles = StatsStyling;
type LinguisticStatsProps = {
data: LinguisticAnalysisResponse;
onExplore: (spec: CorpusExplorerSpec) => void;
};
const LinguisticStats = ({ data, onExplore }: LinguisticStatsProps) => {
const lexical = data.lexical_diversity;
const words = data.word_frequencies ?? [];
const bigrams = data.common_two_phrases ?? [];
const trigrams = data.common_three_phrases ?? [];
const topWords = words.slice(0, 20);
const topBigrams = bigrams.slice(0, 10);
const topTrigrams = trigrams.slice(0, 10);
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.grid }}>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Language Overview</h2>
<p style={styles.sectionSubtitle}>
Quick read on how broad and repetitive the wording is.
</p>
</div>
<Card
label="Total Words"
value={lexical?.total_tokens?.toLocaleString() ?? "—"}
sublabel="Words after basic filtering"
style={{ gridColumn: "span 4" }}
/>
<Card
label="Unique Words"
value={lexical?.unique_tokens?.toLocaleString() ?? "—"}
sublabel="Different words used"
style={{ gridColumn: "span 4" }}
/>
<Card
label="Vocabulary Variety"
value={
typeof lexical?.ttr === "number" ? lexical.ttr.toFixed(4) : "—"
}
sublabel="Higher means less repetition"
style={{ gridColumn: "span 4" }}
/>
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Top Words</h2>
<p style={styles.sectionSubtitle}>Most used single words.</p>
<div
style={{
...styles.topUsersList,
maxHeight: 360,
overflowY: "auto",
}}
>
{topWords.map((item) => (
<div
key={item.word}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildWordSpec(item.word))}
>
<div style={styles.topUserName}>{item.word}</div>
<div style={styles.topUserMeta}>
{item.count.toLocaleString()} uses
</div>
</div>
))}
</div>
</div>
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Top Bigrams</h2>
<p style={styles.sectionSubtitle}>Most used 2-word phrases.</p>
<div
style={{
...styles.topUsersList,
maxHeight: 360,
overflowY: "auto",
}}
>
{topBigrams.map((item) => (
<div
key={item.ngram}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildNgramSpec(item.ngram))}
>
<div style={styles.topUserName}>{item.ngram}</div>
<div style={styles.topUserMeta}>
{item.count.toLocaleString()} uses
</div>
</div>
))}
</div>
</div>
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Top Trigrams</h2>
<p style={styles.sectionSubtitle}>Most used 3-word phrases.</p>
<div
style={{
...styles.topUsersList,
maxHeight: 360,
overflowY: "auto",
}}
>
{topTrigrams.map((item) => (
<div
key={item.ngram}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => onExplore(buildNgramSpec(item.ngram))}
>
<div style={styles.topUserName}>{item.ngram}</div>
<div style={styles.topUserMeta}>
{item.count.toLocaleString()} uses
</div>
</div>
))}
</div>
</div>
</div>
</div>
);
};
export default LinguisticStats;
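The "Vocabulary Variety" card displays `lexical.ttr`, which reads as a type-token ratio: unique words divided by total words. The backend computation is not part of this diff, so the following is only an illustrative sketch under stated assumptions. The `lexicalDiversity` function is hypothetical, and the whitespace tokenisation and lowercasing are assumptions that stand in for whatever "basic filtering" the backend applies.

```typescript
// Hypothetical helper: compute the three values shown on the language overview cards.
function lexicalDiversity(text: string) {
  // Assumption: lowercase and split on whitespace; real filtering may differ.
  const tokens = text
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 0);
  const unique = new Set(tokens);
  return {
    total_tokens: tokens.length,
    unique_tokens: unique.size,
    // Type-token ratio; 0 for empty input to avoid dividing by zero.
    ttr: tokens.length ? unique.size / tokens.length : 0,
  };
}
```

Because the ratio tends to fall as a corpus grows, values are most comparable between datasets of similar size.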

View File

@@ -1,4 +1,4 @@
import { useState } from "react";
import { memo, useMemo } from "react";
import {
LineChart,
Line,
@@ -6,32 +6,55 @@ import {
YAxis,
Tooltip,
CartesianGrid,
ResponsiveContainer
ResponsiveContainer,
} from "recharts";
import ActivityHeatmap from "../stats/ActivityHeatmap";
import { ReactWordcloud } from '@cp949/react-wordcloud';
import { ReactWordcloud } from "@cp949/react-wordcloud";
import StatsStyling from "../styles/stats_styling";
import Card from "../components/Card";
import UserModal from "../components/UserModal";
import {
type SummaryResponse,
type FrequencyWord,
type UserAnalysisResponse,
import {
type SummaryResponse,
type FrequencyWord,
type UserEndpointResponse,
type TimeAnalysisResponse,
type ContentAnalysisResponse,
type User
} from '../types/ApiTypes'
type LinguisticAnalysisResponse,
} from "../types/ApiTypes";
import {
buildAllRecordsSpec,
buildDateBucketSpec,
buildOneTimeUsersSpec,
buildUserSpec,
type CorpusExplorerSpec,
} from "../utils/corpusExplorer";
const styles = StatsStyling;
const MAX_WORDCLOUD_WORDS = 250;
const exploreButtonStyle = { padding: "4px 8px", fontSize: 12 };
const WORDCLOUD_OPTIONS = {
rotations: 2,
rotationAngles: [0, 90] as [number, number],
fontSizes: [14, 60] as [number, number],
enableTooltip: true,
};
type SummaryStatsProps = {
userData: UserAnalysisResponse | null;
timeData: TimeAnalysisResponse | null;
contentData: ContentAnalysisResponse | null;
summary: SummaryResponse | null;
}
userData: UserEndpointResponse | null;
timeData: TimeAnalysisResponse | null;
linguisticData: LinguisticAnalysisResponse | null;
summary: SummaryResponse | null;
onExplore: (spec: CorpusExplorerSpec) => void;
};
type WordCloudPanelProps = {
words: { text: string; value: number }[];
};
const WordCloudPanel = memo(({ words }: WordCloudPanelProps) => (
<ReactWordcloud words={words} options={WORDCLOUD_OPTIONS} />
));
function formatDateRange(startUnix: number, endUnix: number) {
const start = new Date(startUnix * 1000);
@@ -44,174 +67,188 @@ function formatDateRange(startUnix: number, endUnix: number) {
day: "2-digit",
});
return `${fmt(start)} ${fmt(end)}`;
return `${fmt(start)} -> ${fmt(end)}`;
}
function convertFrequencyData(data: FrequencyWord[]) {
return data.map((d: FrequencyWord) => ({
text: d.word,
value: d.count,
}))
return data.map((d: FrequencyWord) => ({
text: d.word,
value: d.count,
}));
}
const SummaryStats = ({userData, timeData, contentData, summary}: SummaryStatsProps) => {
const [selectedUser, setSelectedUser] = useState<string | null>(null);
const selectedUserData: User | null = userData?.users.find((u) => u.author === selectedUser) ?? null;
const renderExploreButton = (onClick: () => void) => (
<button
onClick={onClick}
style={{ ...styles.buttonSecondary, ...exploreButtonStyle }}
>
Explore
</button>
);
console.log(summary)
const SummaryStats = ({
userData,
timeData,
linguisticData,
summary,
onExplore,
}: SummaryStatsProps) => {
const wordCloudWords = useMemo(
() =>
convertFrequencyData(
(linguisticData?.word_frequencies ?? []).slice(0, MAX_WORDCLOUD_WORDS),
),
[linguisticData?.word_frequencies],
);
return (
const topUsersPreview = useMemo(
() => (userData?.top_users ?? []).slice(0, 100),
[userData?.top_users],
);
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.grid }}>
<Card
label="Total Activity"
value={summary?.total_events ?? "-"}
sublabel="Posts + comments"
rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
style={{ gridColumn: "span 4" }}
/>
<Card
label="Active People"
value={summary?.unique_users ?? "-"}
sublabel="Distinct users"
rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
style={{ gridColumn: "span 4" }}
/>
<Card
label="Posts vs Comments"
value={
summary ? `${summary.total_posts} / ${summary.total_comments}` : "-"
}
sublabel={`Comments per post: ${summary?.comments_per_post ?? "-"}`}
rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
style={{ gridColumn: "span 4" }}
/>
{/* main grid*/}
<div style={{ ...styles.container, ...styles.grid}}>
<Card
label="Total Events"
value={summary?.total_events ?? "—"}
sublabel="Posts + comments"
style={{
gridColumn: "span 4"
}}
/>
<Card
label="Unique Users"
value={summary?.unique_users ?? "—"}
sublabel="Distinct authors"
style={{
gridColumn: "span 4"
}}
/>
<Card
label="Posts / Comments"
value={
summary
? `${summary.total_posts} / ${summary.total_comments}`
: "—"
}
sublabel={`Comments per post: ${summary?.comments_per_post ?? "—"}`}
style={{
gridColumn: "span 4"
}}
/>
<Card
label="Time Range"
value={
summary?.time_range
? formatDateRange(summary.time_range.start, summary.time_range.end)
: "-"
}
sublabel="Based on dataset timestamps"
rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
style={{ gridColumn: "span 4" }}
/>
<Card
label="Time Range"
value={
summary?.time_range
? formatDateRange(summary.time_range.start, summary.time_range.end)
: ""
}
sublabel="Based on dataset timestamps"
style={{
gridColumn: "span 4"
}}
/>
<Card
label="One-Time Users"
value={
typeof summary?.lurker_ratio === "number"
? `${Math.round(summary.lurker_ratio * 100)}%`
: "-"
}
sublabel="Users with only one event"
rightSlot={renderExploreButton(() => onExplore(buildOneTimeUsersSpec()))}
style={{ gridColumn: "span 4" }}
/>
<Card
label="Lurker Ratio"
value={
typeof summary?.lurker_ratio === "number"
? `${Math.round(summary.lurker_ratio * 100)}%`
: "—"
}
sublabel="Users with only 1 event"
style={{
gridColumn: "span 4"
}}
/>
<Card
label="Sources"
value={summary?.sources?.length ?? "-"}
sublabel={
summary?.sources?.length
? summary.sources.slice(0, 3).join(", ") +
(summary.sources.length > 3 ? "..." : "")
: "-"
}
rightSlot={renderExploreButton(() => onExplore(buildAllRecordsSpec()))}
style={{ gridColumn: "span 4" }}
/>
<Card
label="Sources"
value={summary?.sources?.length ?? "—"}
sublabel={
summary?.sources?.length
? summary.sources.slice(0, 3).join(", ") +
(summary.sources.length > 3 ? "…" : "")
: "—"
}
style={{
gridColumn: "span 4"
}}
/>
{/* events per day */}
<div style={{ ...styles.card, gridColumn: "span 5" }}>
<h2 style={styles.sectionTitle}>Events per Day</h2>
<p style={styles.sectionSubtitle}>Trend of activity over time</p>
<h2 style={styles.sectionTitle}>Activity Over Time</h2>
<p style={styles.sectionSubtitle}>How much posting happened each day.</p>
<div style={styles.chartWrapper}>
<div style={styles.chartWrapper}>
<ResponsiveContainer width="100%" height="100%">
<LineChart data={timeData?.events_per_day.filter((d) => new Date(d.date) >= new Date('2026-01-10'))}>
<LineChart
data={timeData?.events_per_day ?? []}
onClick={(state: unknown) => {
const payload = (state as { activePayload?: Array<{ payload?: { date?: string } }> })
?.activePayload?.[0]?.payload as
| { date?: string }
| undefined;
if (payload?.date) {
onExplore(buildDateBucketSpec(String(payload.date)));
}
}}
>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="date" />
<YAxis />
<Tooltip />
<Line type="monotone" dataKey="count" name="Events" />
</LineChart>
<Line
type="monotone"
dataKey="count"
name="Events"
isAnimationActive={false}
/>
</LineChart>
</ResponsiveContainer>
</div>
</div>
</div>
{/* Word Cloud */}
<div style={{ ...styles.card, gridColumn: "span 4" }}>
<h2 style={styles.sectionTitle}>Word Cloud</h2>
<p style={styles.sectionSubtitle}>Most common terms across events</p>
<h2 style={styles.sectionTitle}>Common Words</h2>
<p style={styles.sectionSubtitle}>
Frequently used words across the dataset.
</p>
<div style={styles.chartWrapper}>
<ReactWordcloud
words={convertFrequencyData(contentData?.word_frequencies ?? [])}
options={{
rotations: 2,
rotationAngles: [0, 90],
fontSizes: [14, 60],
enableTooltip: true,
}}
/>
</div>
<div style={styles.chartWrapper}>
<WordCloudPanel words={wordCloudWords} />
</div>
</div>
{/* Top Users */}
<div style={{...styles.card, ...styles.scrollArea, gridColumn: "span 3",
}}
<div
style={{ ...styles.card, ...styles.scrollArea, gridColumn: "span 3" }}
>
<h2 style={styles.sectionTitle}>Top Users</h2>
<p style={styles.sectionSubtitle}>Most active authors</p>
<h2 style={styles.sectionTitle}>Most Active Users</h2>
<p style={styles.sectionSubtitle}>Who posted the most events.</p>
<div style={styles.topUsersList}>
{userData?.top_users.slice(0, 100).map((item) => (
<div
<div style={styles.topUsersList}>
{topUsersPreview.map((item) => (
<div
key={`${item.author}-${item.source}`}
style={{ ...styles.topUserItem, cursor: "pointer" }}
onClick={() => setSelectedUser(item.author)}
>
onClick={() => onExplore(buildUserSpec(item.author))}
>
<div style={styles.topUserName}>{item.author}</div>
<div style={styles.topUserMeta}>
{item.source} {item.count} events
</div>
{item.source} {item.count} events
</div>
</div>
))}
</div>
</div>
</div>
{/* Heatmap */}
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>Heatmap</h2>
<p style={styles.sectionSubtitle}>Activity density across time</p>
<h2 style={styles.sectionTitle}>Weekly Activity Pattern</h2>
<p style={styles.sectionSubtitle}>
When activity tends to happen by weekday and hour.
</p>
<div style={styles.heatmapWrapper}>
<div style={styles.heatmapWrapper}>
<ActivityHeatmap data={timeData?.weekday_hour_heatmap ?? []} />
</div>
</div>
</div>
</div>
<UserModal
open={!!selectedUser}
onClose={() => setSelectedUser(null)}
username={selectedUser ?? ""}
userData={selectedUserData}
/>
</div>
</div>
);
}
);
};
export default SummaryStats;
export default SummaryStats;
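The chart click handler above reads the clicked date through two type casts. A narrower alternative is to check the shape step by step, sketched here as a hypothetical `clickedDate` helper. It assumes the click state carries an `activePayload` array whose first entry has a `payload.date` field, which is the shape the diff itself relies on. Unlike the handler in the diff, this version accepts only non-empty string dates.

```typescript
// Hypothetical helper: pull the clicked date out of an untyped chart click state.
function clickedDate(state: unknown): string | null {
  if (typeof state !== "object" || state === null) return null;
  const activePayload = (state as { activePayload?: unknown }).activePayload;
  if (!Array.isArray(activePayload) || activePayload.length === 0) return null;
  const first = activePayload[0] as
    | { payload?: { date?: unknown } }
    | null
    | undefined;
  const date = first?.payload?.date;
  return typeof date === "string" && date.length > 0 ? date : null;
}
```

A handler could then call `onExplore(buildDateBucketSpec(date))` only when the helper returns a non-null value.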

View File

@@ -11,28 +11,22 @@ type Props = {
username: string;
};
export default function UserModal({ open, onClose, userData, username }: Props) {
return (
<Dialog open={open} onClose={onClose} style={{ position: "relative", zIndex: 50 }}>
<div
style={{
position: "fixed",
inset: 0,
background: "rgba(0,0,0,0.45)",
}}
/>
export default function UserModal({
open,
onClose,
userData,
username,
}: Props) {
const dominantEmotionEntry = Object.entries(
userData?.avg_emotions ?? {},
).sort((a, b) => b[1] - a[1])[0];
<div
style={{
position: "fixed",
inset: 0,
display: "flex",
alignItems: "center",
justifyContent: "center",
padding: 16,
}}
>
<DialogPanel style={{ ...styles.card, width: "min(520px, 95vw)" }}>
return (
<Dialog open={open} onClose={onClose} style={styles.modalRoot}>
<div style={styles.modalBackdrop} />
<div style={styles.modalContainer}>
<DialogPanel style={{ ...styles.card, ...styles.modalPanel }}>
<div style={styles.headerBar}>
<div>
<DialogTitle style={styles.sectionTitle}>{username}</DialogTitle>
@@ -48,7 +42,9 @@ export default function UserModal({ open, onClose, userData, username }: Props)
<p style={styles.sectionSubtitle}>No data for this user.</p>
) : (
<div style={styles.topUsersList}>
<div style={{...styles.topUserName, fontSize: 20}}>{userData.author}</div>
<div style={{ ...styles.topUserName, fontSize: 20 }}>
{userData.author}
</div>
<div style={styles.topUserItem}>
<div style={styles.topUserName}>Posts</div>
<div style={styles.topUserMeta}>{userData.post}</div>
@@ -77,7 +73,27 @@ export default function UserModal({ open, onClose, userData, username }: Props)
<div style={styles.topUserItem}>
<div style={styles.topUserName}>Vocab Richness</div>
<div style={styles.topUserMeta}>
{userData.vocab.vocab_richness} (avg {userData.vocab.avg_words_per_event} words/event)
{userData.vocab.vocab_richness} (avg{" "}
{userData.vocab.avg_words_per_event} words/event)
</div>
</div>
) : null}
{dominantEmotionEntry ? (
<div style={styles.topUserItem}>
<div style={styles.topUserName}>Dominant Avg Emotion</div>
<div style={styles.topUserMeta}>
{dominantEmotionEntry[0].replace("emotion_", "")} (
{dominantEmotionEntry[1].toFixed(3)})
</div>
</div>
) : null}
{userData.dominant_topic ? (
<div style={styles.topUserItem}>
<div style={styles.topUserName}>Most Common Topic</div>
<div style={styles.topUserMeta}>
{userData.dominant_topic.topic} ({userData.dominant_topic.count} events)
</div>
</div>
) : null}

View File

@@ -1,61 +1,230 @@
import { useEffect, useMemo, useRef, useState } from "react";
import ForceGraph3D from "react-force-graph-3d";
import {
type UserAnalysisResponse,
type InteractionGraph
} from '../types/ApiTypes';
import { type TopUser, type InteractionGraph } from "../types/ApiTypes";
import StatsStyling from "../styles/stats_styling";
import Card from "./Card";
import {
buildReplyPairSpec,
toText,
buildUserSpec,
type CorpusExplorerSpec,
} from "../utils/corpusExplorer";
const styles = StatsStyling;
function ApiToGraphData(apiData: InteractionGraph) {
const nodes = Object.keys(apiData).map(username => ({ id: username }));
const links = [];
for (const [source, targets] of Object.entries(apiData)) {
for (const [target, count] of Object.entries(targets)) {
links.push({ source, target, value: count });
}
type GraphLink = {
source: string;
target: string;
value: number;
};
function toGraphData(apiData: InteractionGraph) {
const links: GraphLink[] = [];
const connectedNodeIds = new Set<string>();
for (const [source, targets] of Object.entries(apiData)) {
for (const [target, count] of Object.entries(targets)) {
if (count < 2 || source === "[deleted]" || target === "[deleted]") {
continue;
}
links.push({ source, target, value: count });
connectedNodeIds.add(source);
connectedNodeIds.add(target);
}
// drop low-value and deleted interactions to reduce clutter
const filteredLinks = links.filter(link =>
link.value >= 2 &&
link.source !== "[deleted]" &&
link.target !== "[deleted]"
);
}
// also filter out nodes that are no longer connected after link filtering
const connectedNodeIds = new Set(filteredLinks.flatMap(link => [link.source, link.target]));
const filteredNodes = nodes.filter(node => connectedNodeIds.has(node.id));
const filteredNodes = Array.from(connectedNodeIds, (id) => ({ id }));
return { nodes: filteredNodes, links: filteredLinks};
return { nodes: filteredNodes, links };
}
type UserStatsProps = {
topUsers: TopUser[];
interactionGraph: InteractionGraph;
totalUsers: number;
mostCommentHeavyUser: { author: string; commentShare: number } | null;
onExplore: (spec: CorpusExplorerSpec) => void;
};
const UserStats = (props: { data: UserAnalysisResponse }) => {
const graphData = ApiToGraphData(props.data.interaction_graph);
const UserStats = ({
topUsers,
interactionGraph,
totalUsers,
mostCommentHeavyUser,
onExplore,
}: UserStatsProps) => {
const graphData = useMemo(
() => toGraphData(interactionGraph),
[interactionGraph],
);
const graphContainerRef = useRef<HTMLDivElement | null>(null);
const [graphSize, setGraphSize] = useState({ width: 720, height: 540 });
useEffect(() => {
const updateGraphSize = () => {
const containerWidth = graphContainerRef.current?.clientWidth ?? 720;
const nextWidth = Math.max(320, Math.floor(containerWidth));
const nextHeight = nextWidth < 700 ? 300 : 540;
setGraphSize({ width: nextWidth, height: nextHeight });
};
updateGraphSize();
window.addEventListener("resize", updateGraphSize);
return () => window.removeEventListener("resize", updateGraphSize);
}, []);
const connectedUsers = graphData.nodes.length;
const totalInteractions = graphData.links.reduce(
(sum, link) => sum + link.value,
0,
);
const avgInteractionsPerConnectedUser = connectedUsers
? totalInteractions / connectedUsers
: 0;
const strongestLink = graphData.links.reduce<GraphLink | null>(
(best, current) => {
if (!best || current.value > best.value) {
return current;
}
return best;
},
null,
);
const mostActiveUser = topUsers.find((u) => u.author !== "[deleted]");
const strongestLinkSource = strongestLink ? toText(strongestLink.source) : "";
const strongestLinkTarget = strongestLink ? toText(strongestLink.target) : "";
return (
<div style={styles.page}>
<h2 style={styles.sectionTitle}>User Interaction Graph</h2>
<p style={styles.sectionSubtitle}>
This graph visualizes interactions between users based on comments and replies.
Nodes represent users, and edges represent interactions (e.g., comments or replies) between them.
</p>
<div>
<div style={{ ...styles.container, ...styles.grid }}>
<Card
label="Users"
value={totalUsers.toLocaleString()}
sublabel={`${connectedUsers.toLocaleString()} users in filtered graph`}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Replies"
value={totalInteractions.toLocaleString()}
sublabel="Links with at least 2 replies"
style={{ gridColumn: "span 3" }}
/>
<Card
label="Replies per Connected User"
value={avgInteractionsPerConnectedUser.toFixed(1)}
sublabel="Average from visible graph links"
style={{ gridColumn: "span 3" }}
/>
<Card
label="Most Active User"
value={mostActiveUser?.author ?? "-"}
sublabel={
mostActiveUser
? `${mostActiveUser.count.toLocaleString()} events`
: "No user activity found"
}
rightSlot={
mostActiveUser ? (
<button
onClick={() => onExplore(buildUserSpec(mostActiveUser.author))}
style={styles.buttonSecondary}
>
Explore
</button>
) : null
}
style={{ gridColumn: "span 3" }}
/>
<Card
label="Strongest User Link"
value={
strongestLinkSource && strongestLinkTarget
? `${strongestLinkSource} -> ${strongestLinkTarget}`
: "-"
}
sublabel={
strongestLink
? `${strongestLink.value.toLocaleString()} replies`
: "No graph links after filtering"
}
rightSlot={
strongestLinkSource && strongestLinkTarget ? (
<button
onClick={() =>
onExplore(buildReplyPairSpec(strongestLinkSource, strongestLinkTarget))
}
style={styles.buttonSecondary}
>
Explore
</button>
) : null
}
style={{ gridColumn: "span 6" }}
/>
<Card
label="Most Comment-Heavy User"
value={mostCommentHeavyUser?.author ?? "-"}
sublabel={
mostCommentHeavyUser
? `${Math.round(mostCommentHeavyUser.commentShare * 100)}% comments`
: "No user distribution available"
}
rightSlot={
mostCommentHeavyUser ? (
<button
onClick={() => onExplore(buildUserSpec(mostCommentHeavyUser.author))}
style={styles.buttonSecondary}
>
Explore
</button>
) : null
}
style={{ gridColumn: "span 6" }}
/>
<div style={{ ...styles.card, gridColumn: "span 12" }}>
<h2 style={styles.sectionTitle}>User Interaction Graph</h2>
<p style={styles.sectionSubtitle}>
Each node is a user, and each link shows replies between them.
</p>
<div
ref={graphContainerRef}
style={{ width: "100%", height: graphSize.height }}
>
<ForceGraph3D
graphData={graphData}
nodeAutoColorBy="id"
linkDirectionalParticles={2}
linkDirectionalParticleSpeed={0.005}
linkWidth={(link) => Math.sqrt(link.value)}
nodeLabel={(node) => `${node.id}`}
width={graphSize.width}
height={graphSize.height}
graphData={graphData}
nodeAutoColorBy="id"
linkDirectionalParticles={1}
linkDirectionalParticleSpeed={0.004}
linkWidth={(link) => Math.sqrt(Number(link.value))}
nodeLabel={(node) => `${node.id}`}
onNodeClick={(node) => {
const userId = toText(node.id);
if (userId) {
onExplore(buildUserSpec(userId));
}
}}
onLinkClick={(link) => {
const source = toText(link.source);
const target = toText(link.target);
if (source && target) {
onExplore(buildReplyPairSpec(source, target));
}
}}
/>
</div>
</div>
</div>
</div>
);
}
};
export default UserStats;
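
The card values in the component above (total replies, average per connected user, strongest link) are all derived from the graph links. As a minimal sketch, the same derivation can be written as one pure function that is easy to unit test. `LinkSketch` and `summarizeLinks` are hypothetical names used only for illustration, and the sketch assumes link endpoints are plain strings, whereas the component normalizes them with `toText`.

```typescript
// Hypothetical, simplified link shape; the component's real GraphLink type is defined elsewhere.
type LinkSketch = { source: string; target: string; value: number };

type LinkSummary = {
  totalInteractions: number;
  avgPerConnectedUser: number;
  strongest: LinkSketch | null;
};

// Single pass over the links: sum the reply counts and track the heaviest link.
function summarizeLinks(links: LinkSketch[], connectedUsers: number): LinkSummary {
  let totalInteractions = 0;
  let strongest: LinkSketch | null = null;
  for (const link of links) {
    totalInteractions += link.value;
    if (strongest === null || link.value > strongest.value) {
      strongest = link;
    }
  }
  return {
    totalInteractions,
    // Guard against division by zero when the filtered graph has no nodes.
    avgPerConnectedUser: connectedUsers > 0 ? totalInteractions / connectedUsers : 0,
    strongest,
  };
}
```

On ties, the first link with the highest value is kept, which matches the strict `>` comparison in the component's reducer.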

@@ -1,68 +1,65 @@
:root {
font-family: system-ui, Avenir, Helvetica, Arial, sans-serif;
line-height: 1.5;
font-weight: 400;
color-scheme: light dark;
color: rgba(255, 255, 255, 0.87);
background-color: #242424;
--bg-default: #f6f8fa;
--text-default: #24292f;
--border-default: #d0d7de;
--focus-ring: rgba(9, 105, 218, 0.22);
font-synthesis: none;
text-rendering: optimizeLegibility;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}
a {
font-weight: 500;
color: #646cff;
text-decoration: inherit;
}
a:hover {
color: #535bf2;
html,
body,
#root {
width: 100%;
height: 100%;
}
body {
margin: 0;
display: flex;
place-items: center;
min-width: 320px;
min-height: 100vh;
background: var(--bg-default);
color: var(--text-default);
font-family: "IBM Plex Sans", "Noto Sans", "Liberation Sans", "Segoe UI", sans-serif;
}
h1 {
font-size: 3.2em;
line-height: 1.1;
* {
box-sizing: border-box;
}
button {
border-radius: 8px;
border: 1px solid transparent;
padding: 0.6em 1.2em;
font-size: 1em;
font-weight: 500;
font-family: inherit;
background-color: #1a1a1a;
cursor: pointer;
transition: border-color 0.25s;
}
button:hover {
border-color: #646cff;
}
button:focus,
button:focus-visible {
outline: 4px auto -webkit-focus-ring-color;
button,
input,
select,
textarea {
font: inherit;
}
@media (prefers-color-scheme: light) {
:root {
color: #213547;
background-color: #ffffff;
input:focus,
button:focus-visible,
select:focus,
textarea:focus {
border-color: #0969da;
box-shadow: 0 0 0 3px var(--focus-ring);
outline: none;
}
@keyframes stats-spin {
from {
transform: rotate(0deg);
}
a:hover {
color: #747bff;
}
button {
background-color: #f9f9f9;
to {
transform: rotate(360deg);
}
}
@keyframes stats-pulse {
0%,
100% {
opacity: 0.5;
}
50% {
opacity: 1;
}
}

@@ -0,0 +1,530 @@
import axios from "axios";
import { useEffect, useState } from "react";
import { useNavigate } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
const styles = StatsStyling;
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
type SourceOption = {
id: string;
label: string;
search_enabled?: boolean;
categories_enabled?: boolean;
searchEnabled?: boolean;
categoriesEnabled?: boolean;
};
type SourceConfig = {
sourceName: string;
limit: string;
search: string;
category: string;
};
type TopicMap = Record<string, string>;
const buildEmptySourceConfig = (sourceName = ""): SourceConfig => ({
sourceName,
limit: "100",
search: "",
category: "",
});
const supportsSearch = (source?: SourceOption): boolean =>
Boolean(source?.search_enabled ?? source?.searchEnabled);
const supportsCategories = (source?: SourceOption): boolean =>
Boolean(source?.categories_enabled ?? source?.categoriesEnabled);
const AutoFetchPage = () => {
const navigate = useNavigate();
const [datasetName, setDatasetName] = useState("");
const [sourceOptions, setSourceOptions] = useState<SourceOption[]>([]);
const [sourceConfigs, setSourceConfigs] = useState<SourceConfig[]>([]);
const [returnMessage, setReturnMessage] = useState("");
const [isLoadingSources, setIsLoadingSources] = useState(true);
const [isSubmitting, setIsSubmitting] = useState(false);
const [hasError, setHasError] = useState(false);
const [useCustomTopics, setUseCustomTopics] = useState(false);
const [customTopicsText, setCustomTopicsText] = useState("");
useEffect(() => {
axios
.get<SourceOption[]>(`${API_BASE_URL}/datasets/sources`)
.then((response) => {
const options = response.data || [];
setSourceOptions(options);
setSourceConfigs([buildEmptySourceConfig(options[0]?.id || "")]);
})
.catch((requestError: unknown) => {
setHasError(true);
if (axios.isAxiosError(requestError)) {
setReturnMessage(
`Failed to load available sources: ${String(
requestError.response?.data?.error || requestError.message,
)}`,
);
} else {
setReturnMessage("Failed to load available sources.");
}
})
.finally(() => {
setIsLoadingSources(false);
});
}, []);
const updateSourceConfig = (
index: number,
field: keyof SourceConfig,
value: string,
) => {
setSourceConfigs((previous) =>
previous.map((config, configIndex) =>
configIndex === index
? field === "sourceName"
? { ...config, sourceName: value, search: "", category: "" }
: { ...config, [field]: value }
: config,
),
);
};
const getSourceOption = (sourceName: string) =>
sourceOptions.find((option) => option.id === sourceName);
const addSourceConfig = () => {
setSourceConfigs((previous) => [
...previous,
buildEmptySourceConfig(sourceOptions[0]?.id || ""),
]);
};
const removeSourceConfig = (index: number) => {
setSourceConfigs((previous) =>
previous.filter((_, configIndex) => configIndex !== index),
);
};
const autoFetch = async () => {
const token = localStorage.getItem("access_token");
if (!token) {
setHasError(true);
setReturnMessage("You must be signed in to auto fetch a dataset.");
return;
}
const normalizedDatasetName = datasetName.trim();
if (!normalizedDatasetName) {
setHasError(true);
setReturnMessage("Please add a dataset name before continuing.");
return;
}
if (sourceConfigs.length === 0) {
setHasError(true);
setReturnMessage("Please add at least one source.");
return;
}
const normalizedSources = sourceConfigs.map((source) => {
const sourceOption = getSourceOption(source.sourceName);
return {
name: source.sourceName,
limit: Number(source.limit || 100),
search: supportsSearch(sourceOption)
? source.search.trim() || undefined
: undefined,
category: supportsCategories(sourceOption)
? source.category.trim() || undefined
: undefined,
};
});
const invalidSource = normalizedSources.find(
(source) =>
!source.name || !Number.isFinite(source.limit) || source.limit <= 0,
);
if (invalidSource) {
setHasError(true);
setReturnMessage(
"Every source needs a name and a limit greater than zero.",
);
return;
}
let normalizedTopics: TopicMap | undefined;
if (useCustomTopics) {
const customTopicsJson = customTopicsText.trim();
if (!customTopicsJson) {
setHasError(true);
setReturnMessage(
"Custom topics are enabled, so please provide a JSON topic map.",
);
return;
}
let parsedTopics: unknown;
try {
parsedTopics = JSON.parse(customTopicsJson);
} catch {
setHasError(true);
setReturnMessage("Custom topic list must be valid JSON.");
return;
}
if (
!parsedTopics ||
Array.isArray(parsedTopics) ||
typeof parsedTopics !== "object"
) {
setHasError(true);
setReturnMessage(
"Custom topic list must be a JSON object: {\"Topic\": \"keywords\"}.",
);
return;
}
const entries = Object.entries(parsedTopics);
if (entries.length === 0) {
setHasError(true);
setReturnMessage("Custom topic list cannot be empty.");
return;
}
const hasInvalidTopic = entries.some(
([topicName, keywords]) =>
!topicName.trim() ||
typeof keywords !== "string" ||
!keywords.trim(),
);
if (hasInvalidTopic) {
setHasError(true);
setReturnMessage(
"Every custom topic must have a non-empty name and keyword string.",
);
return;
}
normalizedTopics = Object.fromEntries(
entries.map(([topicName, keywords]) => [
topicName.trim(),
String(keywords).trim(),
]),
);
}
const requestBody: {
name: string;
sources: Array<{
name: string;
limit: number;
search?: string;
category?: string;
}>;
topics?: TopicMap;
} = {
name: normalizedDatasetName,
sources: normalizedSources,
};
if (normalizedTopics) {
requestBody.topics = normalizedTopics;
}
try {
setIsSubmitting(true);
setHasError(false);
setReturnMessage("");
const response = await axios.post(
`${API_BASE_URL}/datasets/fetch`,
requestBody,
{
headers: {
Authorization: `Bearer ${token}`,
},
},
);
const datasetId = Number(response.data.dataset_id);
setReturnMessage(
`Auto fetch queued successfully (dataset #${datasetId}). Redirecting to processing status...`,
);
setTimeout(() => {
navigate(`/dataset/${datasetId}/status`);
}, 400);
} catch (requestError: unknown) {
setHasError(true);
if (axios.isAxiosError(requestError)) {
const message = String(
requestError.response?.data?.error ||
requestError.message ||
"Auto fetch failed.",
);
setReturnMessage(`Auto fetch failed: ${message}`);
} else {
setReturnMessage("Auto fetch failed due to an unexpected error.");
}
} finally {
setIsSubmitting(false);
}
};
return (
<div style={styles.page}>
<div style={styles.containerWide}>
<div style={{ ...styles.card, ...styles.headerBar }}>
<div>
<h1 style={styles.sectionHeaderTitle}>Auto Fetch Dataset</h1>
<p style={styles.sectionHeaderSubtitle}>
Select sources and fetch settings, then queue processing
automatically.
</p>
<p
style={{
...styles.subtleBodyText,
marginTop: 6,
color: "#9a6700",
}}
>
Warning: Fetching more than 250 posts from any single site can
take hours due to rate limits.
</p>
</div>
<button
type="button"
style={{
...styles.buttonPrimary,
opacity: isSubmitting || isLoadingSources ? 0.75 : 1,
}}
onClick={autoFetch}
disabled={isSubmitting || isLoadingSources}
>
{isSubmitting ? "Queueing..." : "Auto Fetch and Analyze"}
</button>
</div>
<div
style={{
...styles.grid,
marginTop: 14,
gridTemplateColumns: "repeat(auto-fit, minmax(280px, 1fr))",
}}
>
<div style={{ ...styles.card, gridColumn: "auto" }}>
<h2 style={{ ...styles.sectionTitle, color: "#24292f" }}>
Dataset Name
</h2>
<p style={styles.sectionSubtitle}>
Use a clear label so you can identify this run later.
</p>
<input
style={{ ...styles.input, ...styles.inputFullWidth }}
type="text"
placeholder="Example: r/cork subreddit - Jan 2026"
value={datasetName}
onChange={(event) => setDatasetName(event.target.value)}
/>
</div>
<div style={{ ...styles.card, gridColumn: "auto" }}>
<h2 style={{ ...styles.sectionTitle, color: "#24292f" }}>
Sources
</h2>
<p style={styles.sectionSubtitle}>
Configure source, limit, optional search, and optional category.
</p>
{isLoadingSources && (
<p style={styles.subtleBodyText}>Loading sources...</p>
)}
{!isLoadingSources && sourceOptions.length === 0 && (
<p style={styles.subtleBodyText}>
No source connectors are currently available.
</p>
)}
{!isLoadingSources && sourceOptions.length > 0 && (
<div
style={{ display: "flex", flexDirection: "column", gap: 10 }}
>
{sourceConfigs.map((source, index) => {
const sourceOption = getSourceOption(source.sourceName);
const searchEnabled = supportsSearch(sourceOption);
const categoriesEnabled = supportsCategories(sourceOption);
return (
<div
key={`source-${index}`}
style={{
border: "1px solid #d0d7de",
borderRadius: 8,
padding: 12,
background: "#f6f8fa",
display: "grid",
gap: 8,
}}
>
<select
value={source.sourceName}
style={{ ...styles.input, ...styles.inputFullWidth }}
onChange={(event) =>
updateSourceConfig(
index,
"sourceName",
event.target.value,
)
}
>
{sourceOptions.map((option) => (
<option key={option.id} value={option.id}>
{option.label}
</option>
))}
</select>
<input
type="number"
min={1}
value={source.limit}
placeholder="Limit"
style={{ ...styles.input, ...styles.inputFullWidth }}
onChange={(event) =>
updateSourceConfig(index, "limit", event.target.value)
}
/>
<input
type="text"
value={source.search}
placeholder={
searchEnabled
? "Search term (optional)"
: "Search not supported for this source"
}
style={{ ...styles.input, ...styles.inputFullWidth }}
disabled={!searchEnabled}
onChange={(event) =>
updateSourceConfig(
index,
"search",
event.target.value,
)
}
/>
<input
type="text"
value={source.category}
placeholder={
categoriesEnabled
? "Category (optional)"
: "Categories not supported for this source"
}
style={{ ...styles.input, ...styles.inputFullWidth }}
disabled={!categoriesEnabled}
onChange={(event) =>
updateSourceConfig(
index,
"category",
event.target.value,
)
}
/>
{sourceConfigs.length > 1 && (
<button
type="button"
style={styles.buttonSecondary}
onClick={() => removeSourceConfig(index)}
>
Remove source
</button>
)}
</div>
);
})}
<button
type="button"
style={styles.buttonSecondary}
onClick={addSourceConfig}
>
Add another source
</button>
</div>
)}
</div>
<div style={{ ...styles.card, gridColumn: "auto" }}>
<h2 style={{ ...styles.sectionTitle, color: "#24292f" }}>
Topic List
</h2>
<p style={styles.sectionSubtitle}>
Use the default topic list, or provide your own JSON topic map.
</p>
<label
style={{
display: "flex",
alignItems: "center",
gap: 8,
fontSize: 14,
color: "#24292f",
marginBottom: 10,
}}
>
<input
type="checkbox"
checked={useCustomTopics}
onChange={(event) => setUseCustomTopics(event.target.checked)}
/>
Use custom topic list
</label>
<textarea
value={customTopicsText}
onChange={(event) => setCustomTopicsText(event.target.value)}
disabled={!useCustomTopics}
placeholder='{"Politics": "election, policy, government", "Housing": "rent, landlords, tenancy"}'
style={{
...styles.input,
...styles.inputFullWidth,
minHeight: 170,
resize: "vertical",
fontFamily:
'"IBM Plex Mono", "Fira Code", "JetBrains Mono", monospace',
}}
/>
<p style={styles.subtleBodyText}>
Format: JSON object where each key is a topic and each value is a
keyword string.
</p>
</div>
</div>
<div
style={{
...styles.card,
marginTop: 14,
...(hasError ? styles.alertCardError : styles.alertCardInfo),
}}
>
{returnMessage ||
"After queueing, your dataset is fetched and processed in the background automatically."}
</div>
</div>
</div>
);
};
export default AutoFetchPage;
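
The custom topic checks in `autoFetch` (valid JSON, a plain object, at least one entry, non-empty string keys and values) could also live in a standalone function. The following is a sketch under that assumption, not the page's actual code. `parseTopicMap`, `TopicParseResult`, and the error strings are hypothetical.

```typescript
type TopicMap = Record<string, string>;

// Hypothetical result type: either a normalized map or a human-readable error.
type TopicParseResult =
  | { ok: true; topics: TopicMap }
  | { ok: false; error: string };

function parseTopicMap(text: string): TopicParseResult {
  const trimmed = text.trim();
  if (!trimmed) {
    return { ok: false, error: "Topic map is required." };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    return { ok: false, error: "Topic map must be valid JSON." };
  }
  // Reject null, arrays, and primitives: only a plain JSON object is accepted.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "Topic map must be a JSON object." };
  }
  const topics: TopicMap = {};
  for (const [name, keywords] of Object.entries(parsed)) {
    if (!name.trim() || typeof keywords !== "string" || !keywords.trim()) {
      return { ok: false, error: "Every topic needs a non-empty name and keyword string." };
    }
    // Normalize by trimming both the topic name and its keyword string.
    topics[name.trim()] = keywords.trim();
  }
  if (Object.keys(topics).length === 0) {
    return { ok: false, error: "Topic map cannot be empty." };
  }
  return { ok: true, topics };
}
```

Returning a tagged result rather than calling `setHasError` and `setReturnMessage` directly keeps the validation independent of React state, so the page would only need to branch on `ok`.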

@@ -0,0 +1,217 @@
import StatsStyling from "../styles/stats_styling";
import { useNavigate, useParams } from "react-router-dom";
import { useEffect, useMemo, useState, type FormEvent } from "react";
import axios from "axios";
import ConfirmationModal from "../components/ConfirmationModal";
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
const styles = StatsStyling;
type DatasetInfoResponse = {
id: number;
name: string;
created_at: string;
};
const DatasetEditPage = () => {
const navigate = useNavigate();
const { datasetId } = useParams<{ datasetId: string }>();
const parsedDatasetId = useMemo(() => Number(datasetId), [datasetId]);
const [statusMessage, setStatusMessage] = useState("");
const [loading, setLoading] = useState(true);
const [isSaving, setIsSaving] = useState(false);
const [isDeleting, setIsDeleting] = useState(false);
const [isDeleteModalOpen, setIsDeleteModalOpen] = useState(false);
const [datasetName, setDatasetName] = useState("");
useEffect(() => {
if (!Number.isInteger(parsedDatasetId) || parsedDatasetId <= 0) {
setStatusMessage("Invalid dataset id.");
setLoading(false);
return;
}
const token = localStorage.getItem("access_token");
if (!token) {
setStatusMessage("You must be signed in to edit datasets.");
setLoading(false);
return;
}
axios
.get<DatasetInfoResponse>(`${API_BASE_URL}/dataset/${parsedDatasetId}`, {
headers: { Authorization: `Bearer ${token}` },
})
.then((response) => {
setDatasetName(response.data.name || "");
})
.catch((error: unknown) => {
if (axios.isAxiosError(error)) {
setStatusMessage(
String(error.response?.data?.error || error.message),
);
} else {
setStatusMessage("Could not get dataset info.");
}
})
.finally(() => {
setLoading(false);
});
}, [parsedDatasetId]);
const saveDatasetName = async (event: FormEvent<HTMLFormElement>) => {
event.preventDefault();
const trimmedName = datasetName.trim();
if (!trimmedName) {
setStatusMessage("Please enter a valid dataset name.");
return;
}
const token = localStorage.getItem("access_token");
if (!token) {
setStatusMessage("You must be signed in to save changes.");
return;
}
try {
setIsSaving(true);
setStatusMessage("");
await axios.patch(
`${API_BASE_URL}/dataset/${parsedDatasetId}`,
{ name: trimmedName },
{ headers: { Authorization: `Bearer ${token}` } },
);
navigate("/datasets", { replace: true });
} catch (error: unknown) {
if (axios.isAxiosError(error)) {
setStatusMessage(
String(
error.response?.data?.error || error.message || "Save failed.",
),
);
} else {
setStatusMessage("Save failed due to an unexpected error.");
}
} finally {
setIsSaving(false);
}
};
const deleteDataset = async () => {
const deleteToken = localStorage.getItem("access_token");
if (!deleteToken) {
setStatusMessage("You must be signed in to delete datasets.");
setIsDeleteModalOpen(false);
return;
}
try {
setIsDeleting(true);
setStatusMessage("");
await axios.delete(`${API_BASE_URL}/dataset/${parsedDatasetId}`, {
headers: { Authorization: `Bearer ${deleteToken}` },
});
setIsDeleteModalOpen(false);
navigate("/datasets", { replace: true });
} catch (error: unknown) {
if (axios.isAxiosError(error)) {
setStatusMessage(
String(
error.response?.data?.error || error.message || "Delete failed.",
),
);
} else {
setStatusMessage("Delete failed due to an unexpected error.");
}
} finally {
setIsDeleting(false);
}
};
return (
<div style={styles.page}>
<div style={styles.containerNarrow}>
<div style={{ ...styles.card, ...styles.headerBar }}>
<div>
<h1 style={styles.sectionHeaderTitle}>Edit Dataset</h1>
<p style={styles.sectionHeaderSubtitle}>
Update the dataset name shown in your datasets list.
</p>
</div>
</div>
<form
onSubmit={saveDatasetName}
style={{ ...styles.card, marginTop: 14, display: "grid", gap: 12 }}
>
<label
htmlFor="dataset-name"
style={{ fontSize: 13, color: "#374151", fontWeight: 600 }}
>
Dataset name
</label>
<input
id="dataset-name"
style={{ ...styles.input, ...styles.inputFullWidth }}
type="text"
placeholder="Example: Cork Discussions - Jan 2026"
value={datasetName}
onChange={(event) => setDatasetName(event.target.value)}
disabled={loading || isSaving}
/>
<div style={{ display: "flex", gap: 8, justifyContent: "flex-end" }}>
<button
type="button"
style={styles.buttonDanger}
onClick={() => setIsDeleteModalOpen(true)}
disabled={isSaving || isDeleting}
>
Delete Dataset
</button>
<button
type="button"
style={styles.buttonSecondary}
onClick={() => navigate("/datasets")}
disabled={isSaving || isDeleting}
>
Cancel
</button>
<button
type="submit"
style={{
...styles.buttonPrimary,
opacity: loading || isSaving ? 0.75 : 1,
}}
disabled={loading || isSaving || isDeleting}
>
{isSaving ? "Saving..." : "Save"}
</button>
{loading ? "Loading dataset details..." : statusMessage}
</div>
</form>
<ConfirmationModal
open={isDeleteModalOpen}
title="Delete Dataset"
message={`Are you sure you want to delete "${datasetName || "this dataset"}"? This action cannot be undone.`}
confirmLabel="Delete"
cancelLabel="Keep Dataset"
loading={isDeleting}
onCancel={() => setIsDeleteModalOpen(false)}
onConfirm={deleteDataset}
/>
</div>
</div>
);
};
export default DatasetEditPage;

@@ -0,0 +1,126 @@
import { useEffect, useMemo, useState } from "react";
import axios from "axios";
import { useNavigate, useParams } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
type DatasetStatusResponse = {
status?: "fetching" | "processing" | "complete" | "error";
status_message?: string | null;
completed_at?: string | null;
};
const styles = StatsStyling;
const DatasetStatusPage = () => {
const navigate = useNavigate();
const { datasetId } = useParams<{ datasetId: string }>();
const [loading, setLoading] = useState(true);
const [status, setStatus] =
useState<DatasetStatusResponse["status"]>("processing");
const [statusMessage, setStatusMessage] = useState("");
const parsedDatasetId = useMemo(() => Number(datasetId), [datasetId]);
useEffect(() => {
if (!Number.isInteger(parsedDatasetId) || parsedDatasetId <= 0) {
setLoading(false);
setStatus("error");
setStatusMessage("Invalid dataset id.");
return;
}
let pollTimer: number | undefined;
const pollStatus = async () => {
try {
const response = await axios.get<DatasetStatusResponse>(
`${API_BASE_URL}/dataset/${parsedDatasetId}/status`,
);
const nextStatus = response.data.status ?? "processing";
setStatus(nextStatus);
setStatusMessage(String(response.data.status_message ?? ""));
setLoading(false);
if (nextStatus === "complete") {
window.setTimeout(() => {
navigate(`/dataset/${parsedDatasetId}/stats`, { replace: true });
}, 800);
}
} catch (error: unknown) {
setLoading(false);
setStatus("error");
if (axios.isAxiosError(error)) {
const message = String(
error.response?.data?.error || error.message || "Request failed",
);
setStatusMessage(message);
} else {
setStatusMessage("Unable to fetch dataset status.");
}
}
};
void pollStatus();
pollTimer = window.setInterval(() => {
if (status !== "complete" && status !== "error") {
void pollStatus();
}
}, 2000);
return () => {
if (pollTimer) {
window.clearInterval(pollTimer);
}
};
}, [navigate, parsedDatasetId, status]);
const isProcessing =
loading || status === "fetching" || status === "processing";
const isError = status === "error";
return (
<div style={styles.page}>
<div style={styles.containerNarrow}>
<div style={{ ...styles.card, marginTop: 28 }}>
<h1 style={styles.sectionHeaderTitle}>
{isProcessing
? "Processing dataset..."
: isError
? "Dataset processing failed"
: "Dataset ready"}
</h1>
<p style={{ ...styles.sectionSubtitle, marginTop: 10 }}>
{isProcessing &&
"Your dataset is being analyzed. This page will redirect to stats automatically once complete."}
{isError &&
"There was an issue while processing your dataset. Please review the error details."}
{status === "complete" &&
"Processing complete. Redirecting to your stats now..."}
</p>
<div
style={{
...styles.card,
...styles.statusMessageCard,
borderColor: isError
? "rgba(185, 28, 28, 0.28)"
: "rgba(0,0,0,0.06)",
background: isError ? "#fff5f5" : "#ffffff",
color: isError ? "#991b1b" : "#374151",
}}
>
{statusMessage ||
(isProcessing
? "Waiting for updates from the worker queue..."
: "No details provided.")}
</div>
</div>
</div>
</div>
);
};
export default DatasetStatusPage;
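
The status page keeps polling while the dataset is `fetching` or `processing`, redirects to the stats view on `complete`, and stops on `error`. That decision can be sketched as a small pure function. `nextAction` and `NextAction` are hypothetical names. The 2000 ms delay, the stats path, and the fallback to `"processing"` for a missing status are taken from the component above.

```typescript
type DatasetStatus = "fetching" | "processing" | "complete" | "error";

// Hypothetical description of what the page should do after a status response.
type NextAction =
  | { kind: "poll"; delayMs: number }
  | { kind: "redirect"; path: string }
  | { kind: "stop" };

function nextAction(datasetId: number, status: DatasetStatus | undefined): NextAction {
  // Mirror the page's default: a missing status is treated as still processing.
  const current: DatasetStatus = status ?? "processing";
  if (current === "complete") {
    return { kind: "redirect", path: `/dataset/${datasetId}/stats` };
  }
  if (current === "error") {
    return { kind: "stop" };
  }
  return { kind: "poll", delayMs: 2000 };
}
```

Keeping the terminal-state check in one place makes it harder for the interval callback and the redirect logic to disagree about when polling should end.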

@@ -0,0 +1,207 @@
import { useEffect, useState } from "react";
import axios from "axios";
import { useNavigate } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
const styles = StatsStyling;
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
type DatasetItem = {
id: number;
name?: string;
status?: "processing" | "complete" | "error" | "fetching" | string;
status_message?: string | null;
completed_at?: string | null;
created_at?: string | null;
};
const DatasetsPage = () => {
const navigate = useNavigate();
const [datasets, setDatasets] = useState<DatasetItem[]>([]);
const [loading, setLoading] = useState(true);
const [error, setError] = useState("");
useEffect(() => {
const token = localStorage.getItem("access_token");
if (!token) {
setLoading(false);
setError("You must be signed in to view datasets.");
return;
}
axios
.get<DatasetItem[]>(`${API_BASE_URL}/user/datasets`, {
headers: { Authorization: `Bearer ${token}` },
})
.then((response) => {
const sorted = [...(response.data || [])].sort((a, b) => b.id - a.id);
setDatasets(sorted);
})
.catch((requestError: unknown) => {
if (axios.isAxiosError(requestError)) {
setError(
String(requestError.response?.data?.error || requestError.message),
);
} else {
setError("Failed to load datasets.");
}
})
.finally(() => {
setLoading(false);
});
}, []);
if (loading) {
return (
<div style={styles.loadingPage}>
<div style={{ ...styles.loadingCard, transform: "translateY(-100px)" }}>
<div style={styles.loadingHeader}>
<div style={styles.loadingSpinner} />
<div>
<h2 style={styles.loadingTitle}>Loading datasets</h2>
</div>
</div>
<div style={styles.loadingSkeleton}>
<div
style={{
...styles.loadingSkeletonLine,
...styles.loadingSkeletonLineLong,
}}
/>
<div
style={{
...styles.loadingSkeletonLine,
...styles.loadingSkeletonLineMed,
}}
/>
<div
style={{
...styles.loadingSkeletonLine,
...styles.loadingSkeletonLineShort,
}}
/>
</div>
</div>
</div>
);
}
return (
<div style={styles.page}>
<div style={styles.containerWide}>
<div style={{ ...styles.card, ...styles.headerBar }}>
<div>
<h1 style={styles.sectionHeaderTitle}>My Datasets</h1>
<p style={styles.sectionHeaderSubtitle}>
View and reopen datasets you previously uploaded.
</p>
</div>
<div style={styles.controlsWrapped}>
<button
type="button"
style={styles.buttonPrimary}
onClick={() => navigate("/upload")}
>
Upload New Dataset
</button>
<button
type="button"
style={styles.buttonSecondary}
onClick={() => navigate("/auto-fetch")}
>
Auto Fetch Dataset
</button>
</div>
</div>
{error && (
<div
style={{
...styles.card,
marginTop: 14,
borderColor: "rgba(185, 28, 28, 0.28)",
background: "#fff5f5",
color: "#991b1b",
fontSize: 14,
}}
>
{error}
</div>
)}
{!error && datasets.length === 0 && (
<div style={{ ...styles.card, marginTop: 14, color: "#374151" }}>
No datasets yet. Upload one to get started.
</div>
)}
{!error && datasets.length > 0 && (
<div
style={{
...styles.card,
marginTop: 14,
padding: 0,
overflow: "hidden",
}}
>
<ul style={styles.listNoBullets}>
{datasets.map((dataset) => {
const isComplete =
dataset.status === "complete" || dataset.status === "error";
const editPath = `/dataset/${dataset.id}/edit`;
const targetPath = isComplete
? `/dataset/${dataset.id}/stats`
: `/dataset/${dataset.id}/status`;
return (
<li key={dataset.id} style={styles.datasetListItem}>
<div style={{ minWidth: 0 }}>
<div style={styles.datasetName}>
{dataset.name || `Dataset #${dataset.id}`}
</div>
<div style={styles.datasetMeta}>
ID #{dataset.id} Status: {dataset.status || "unknown"}
</div>
{dataset.status_message && (
<div style={styles.datasetMetaSecondary}>
{dataset.status_message}
</div>
)}
</div>
<div>
{isComplete && (
<button
type="button"
style={{ ...styles.buttonSecondary, margin: "5px" }}
onClick={() => navigate(editPath)}
>
Edit Dataset
</button>
)}
<button
type="button"
style={
isComplete
? styles.buttonPrimary
: styles.buttonSecondary
}
onClick={() => navigate(targetPath)}
>
{isComplete ? "Open stats" : "View status"}
</button>
</div>
</li>
);
})}
</ul>
</div>
)}
</div>
</div>
);
};
export default DatasetsPage;
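
The list rendering above sorts datasets by descending id and sends each row either to its stats page or to its status page. A minimal sketch of that mapping as a pure function follows. `DatasetRow`, `ListEntry`, and `toListEntries` are hypothetical names. As in the component, both `complete` and `error` count as finished.

```typescript
// Hypothetical, trimmed-down version of the page's DatasetItem type.
type DatasetRow = { id: number; name?: string; status?: string };

type ListEntry = {
  id: number;
  label: string;
  canEdit: boolean;
  targetPath: string;
  actionLabel: string;
};

function toListEntries(datasets: DatasetRow[]): ListEntry[] {
  // Copy before sorting so the caller's array is not mutated.
  return [...datasets]
    .sort((a, b) => b.id - a.id)
    .map((dataset) => {
      const finished = dataset.status === "complete" || dataset.status === "error";
      return {
        id: dataset.id,
        label: dataset.name || `Dataset #${dataset.id}`,
        canEdit: finished,
        targetPath: finished
          ? `/dataset/${dataset.id}/stats`
          : `/dataset/${dataset.id}/status`,
        actionLabel: finished ? "Open stats" : "View status",
      };
    });
}
```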

@@ -0,0 +1,168 @@
import { useEffect, useState } from "react";
import axios from "axios";
import { useNavigate } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
const styles = StatsStyling;
const LoginPage = () => {
const navigate = useNavigate();
const [isRegisterMode, setIsRegisterMode] = useState(false);
const [username, setUsername] = useState("");
const [email, setEmail] = useState("");
const [password, setPassword] = useState("");
const [loading, setLoading] = useState(false);
const [error, setError] = useState("");
const [info, setInfo] = useState("");
useEffect(() => {
const token = localStorage.getItem("access_token");
if (!token) {
return;
}
axios.defaults.headers.common.Authorization = `Bearer ${token}`;
axios
.get(`${API_BASE_URL}/profile`)
.then(() => {
navigate("/upload", { replace: true });
})
.catch(() => {
localStorage.removeItem("access_token");
delete axios.defaults.headers.common.Authorization;
});
}, [navigate]);
const handleSubmit = async (event: React.FormEvent<HTMLFormElement>) => {
event.preventDefault();
setError("");
setInfo("");
setLoading(true);
try {
if (isRegisterMode) {
await axios.post(`${API_BASE_URL}/register`, {
username,
email,
password,
});
setInfo("Account created. You can now sign in.");
setIsRegisterMode(false);
} else {
const response = await axios.post<{ access_token: string }>(
`${API_BASE_URL}/login`,
{ username, password },
);
const token = response.data.access_token;
localStorage.setItem("access_token", token);
axios.defaults.headers.common.Authorization = `Bearer ${token}`;
navigate("/upload");
}
} catch (requestError: unknown) {
if (axios.isAxiosError(requestError)) {
setError(
String(
requestError.response?.data?.error ||
requestError.message ||
"Request failed",
),
);
} else {
setError("Unexpected error occurred.");
}
} finally {
setLoading(false);
}
};
return (
<div style={styles.containerAuth}>
<div style={{ ...styles.card, ...styles.authCard }}>
<div style={styles.headingBlock}>
<h1 style={styles.headingXl}>
{isRegisterMode ? "Create your account" : "Welcome back"}
</h1>
<p style={styles.mutedText}>
{isRegisterMode
? "Register to start uploading and exploring your dataset insights."
: "Sign in to continue to your analytics workspace."}
</p>
</div>
<form onSubmit={handleSubmit} style={styles.authForm}>
<input
type="text"
placeholder="Username"
style={{ ...styles.input, ...styles.authControl }}
value={username}
onChange={(event) => setUsername(event.target.value)}
required
/>
{isRegisterMode && (
<input
type="email"
placeholder="Email"
style={{ ...styles.input, ...styles.authControl }}
value={email}
onChange={(event) => setEmail(event.target.value)}
required
/>
)}
<input
type="password"
placeholder="Password"
style={{ ...styles.input, ...styles.authControl }}
value={password}
onChange={(event) => setPassword(event.target.value)}
required
/>
<button
type="submit"
style={{
...styles.buttonPrimary,
...styles.authControl,
marginTop: 2,
}}
disabled={loading}
>
{loading
? "Please wait..."
: isRegisterMode
? "Create account"
: "Sign in"}
</button>
</form>
{error && <p style={styles.authErrorText}>{error}</p>}
{info && <p style={styles.authInfoText}>{info}</p>}
<div style={styles.authSwitchRow}>
<span style={styles.authSwitchLabel}>
{isRegisterMode ? "Already have an account?" : "New here?"}
</span>
<button
type="button"
style={styles.authSwitchButton}
onClick={() => {
setError("");
setInfo("");
setIsRegisterMode((value) => !value);
}}
>
{isRegisterMode ? "Switch to sign in" : "Create account"}
</button>
</div>
</div>
</div>
);
};
export default LoginPage;

@@ -1,173 +1,772 @@
import { useEffect, useState, useRef } from "react";
import { useEffect, useRef, useState } from "react";
import axios from "axios";
import { useParams } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
import SummaryStats from "../components/SummaryStats";
import EmotionalStats from "../components/EmotionalStats";
import InteractionStats from "../components/UserStats";
import UserStats from "../components/UserStats";
import LinguisticStats from "../components/LinguisticStats";
import InteractionalStats from "../components/InteractionalStats";
import CulturalStats from "../components/CulturalStats";
import CorpusExplorer from "../components/CorpusExplorer";
import {
type SummaryResponse,
type UserAnalysisResponse,
import {
type SummaryResponse,
type TimeAnalysisResponse,
type ContentAnalysisResponse
} from '../types/ApiTypes'
type User,
type UserEndpointResponse,
type LinguisticAnalysisResponse,
type EmotionalAnalysisResponse,
type InteractionAnalysisResponse,
type CulturalAnalysisResponse,
} from "../types/ApiTypes";
import {
buildExplorerContext,
type CorpusExplorerSpec,
type DatasetRecord,
} from "../utils/corpusExplorer";
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
const styles = StatsStyling;
const DELETED_USERS = ["[deleted]", "automoderator"];
const isDeletedUser = (value: string | null | undefined) =>
DELETED_USERS.includes((value ?? "").trim().toLowerCase());
type ActiveView =
| "summary"
| "emotional"
| "user"
| "linguistic"
| "interactional"
| "cultural";
type UserStatsMeta = {
totalUsers: number;
mostCommentHeavyUser: { author: string; commentShare: number } | null;
};
type ExplorerState = {
open: boolean;
title: string;
description: string;
emptyMessage: string;
records: DatasetRecord[];
loading: boolean;
error: string;
};
const EMPTY_EXPLORER_STATE: ExplorerState = {
open: false,
title: "Corpus Explorer",
description: "",
emptyMessage: "No records found.",
records: [],
loading: false,
error: "",
};
const createExplorerState = (
spec: CorpusExplorerSpec,
patch: Partial<ExplorerState> = {},
): ExplorerState => ({
open: true,
title: spec.title,
description: spec.description,
emptyMessage: spec.emptyMessage ?? "No matching records found.",
records: [],
loading: false,
error: "",
...patch,
});
const compareRecordsByNewest = (a: DatasetRecord, b: DatasetRecord) => {
const aValue = String(a.dt ?? a.date ?? a.timestamp ?? "");
const bValue = String(b.dt ?? b.date ?? b.timestamp ?? "");
return bValue.localeCompare(aValue);
};
const parseJsonLikePayload = (value: string): unknown => {
const normalized = value
.replace(/\uFEFF/g, "")
.replace(/,\s*([}\]])/g, "$1")
.replace(/(:\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
.replace(/(\[\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
.replace(/(,\s*)(NaN|Infinity|-Infinity)\b/g, "$1null")
.replace(/(:\s*)None\b/g, "$1null")
.replace(/(:\s*)True\b/g, "$1true")
.replace(/(:\s*)False\b/g, "$1false")
.replace(/(\[\s*)None\b/g, "$1null")
.replace(/(\[\s*)True\b/g, "$1true")
.replace(/(\[\s*)False\b/g, "$1false")
.replace(/(,\s*)None\b/g, "$1null")
.replace(/(,\s*)True\b/g, "$1true")
.replace(/(,\s*)False\b/g, "$1false");
return JSON.parse(normalized);
};
const tryParseRecords = (value: string) => {
try {
return normalizeRecordPayload(parseJsonLikePayload(value));
} catch {
return null;
}
};
const parseRecordStringPayload = (payload: string): DatasetRecord[] | null => {
const trimmed = payload.trim();
if (!trimmed) {
return [];
}
const direct = tryParseRecords(trimmed);
if (direct) {
return direct;
}
const ndjsonLines = trimmed
.split(/\r?\n/)
.map((line) => line.trim())
.filter(Boolean);
if (ndjsonLines.length > 0) {
try {
return ndjsonLines.map((line) => parseJsonLikePayload(line)) as DatasetRecord[];
} catch {
// Not valid NDJSON; fall through to the bracket and brace extraction below.
}
}
const bracketStart = trimmed.indexOf("[");
const bracketEnd = trimmed.lastIndexOf("]");
if (bracketStart !== -1 && bracketEnd > bracketStart) {
const parsed = tryParseRecords(trimmed.slice(bracketStart, bracketEnd + 1));
if (parsed) {
return parsed;
}
}
const braceStart = trimmed.indexOf("{");
const braceEnd = trimmed.lastIndexOf("}");
if (braceStart !== -1 && braceEnd > braceStart) {
const parsed = tryParseRecords(trimmed.slice(braceStart, braceEnd + 1));
if (parsed) {
return parsed;
}
}
return null;
};
const normalizeRecordPayload = (payload: unknown): DatasetRecord[] => {
if (typeof payload === "string") {
const parsed = parseRecordStringPayload(payload);
if (parsed) {
return parsed;
}
const preview = payload.trim().slice(0, 120).replace(/\s+/g, " ");
throw new Error(
`Corpus endpoint returned a non-JSON string payload.${
preview ? ` Response preview: ${preview}` : ""
}`,
);
}
if (
payload &&
typeof payload === "object" &&
"error" in payload &&
typeof (payload as { error?: unknown }).error === "string"
) {
throw new Error((payload as { error: string }).error);
}
if (Array.isArray(payload)) {
return payload as DatasetRecord[];
}
if (
payload &&
typeof payload === "object" &&
"data" in payload &&
Array.isArray((payload as { data?: unknown }).data)
) {
return (payload as { data: DatasetRecord[] }).data;
}
if (
payload &&
typeof payload === "object" &&
"records" in payload &&
Array.isArray((payload as { records?: unknown }).records)
) {
return (payload as { records: DatasetRecord[] }).records;
}
if (
payload &&
typeof payload === "object" &&
"rows" in payload &&
Array.isArray((payload as { rows?: unknown }).rows)
) {
return (payload as { rows: DatasetRecord[] }).rows;
}
if (
payload &&
typeof payload === "object" &&
"result" in payload &&
Array.isArray((payload as { result?: unknown }).result)
) {
return (payload as { result: DatasetRecord[] }).result;
}
if (payload && typeof payload === "object") {
const values = Object.values(payload);
if (values.length === 1 && Array.isArray(values[0])) {
return values[0] as DatasetRecord[];
}
if (values.every((value) => value && typeof value === "object")) {
return values as DatasetRecord[];
}
}
throw new Error("Corpus endpoint returned an unexpected payload.");
};
const StatPage = () => {
const [error, setError] = useState('');
const { datasetId: routeDatasetId } = useParams<{ datasetId: string }>();
const [error, setError] = useState("");
const [loading, setLoading] = useState(false);
const [activeView, setActiveView] = useState<"summary" | "emotional" | "interaction">("summary");
const [activeView, setActiveView] = useState<ActiveView>("summary");
const [userData, setUserData] = useState<UserAnalysisResponse | null>(null);
const [userData, setUserData] = useState<UserEndpointResponse | null>(null);
const [timeData, setTimeData] = useState<TimeAnalysisResponse | null>(null);
const [contentData, setContentData] = useState<ContentAnalysisResponse | null>(null);
const [linguisticData, setLinguisticData] =
useState<LinguisticAnalysisResponse | null>(null);
const [emotionalData, setEmotionalData] =
useState<EmotionalAnalysisResponse | null>(null);
const [interactionData, setInteractionData] =
useState<InteractionAnalysisResponse | null>(null);
const [culturalData, setCulturalData] =
useState<CulturalAnalysisResponse | null>(null);
const [summary, setSummary] = useState<SummaryResponse | null>(null);
const [userStatsMeta, setUserStatsMeta] = useState<UserStatsMeta>({
totalUsers: 0,
mostCommentHeavyUser: null,
});
const [appliedFilters, setAppliedFilters] = useState<Record<string, string>>({});
const [allRecords, setAllRecords] = useState<DatasetRecord[] | null>(null);
const [allRecordsKey, setAllRecordsKey] = useState("");
const [explorerState, setExplorerState] = useState<ExplorerState>(
EMPTY_EXPLORER_STATE,
);
const searchInputRef = useRef<HTMLInputElement>(null);
const beforeDateRef = useRef<HTMLInputElement>(null);
const afterDateRef = useRef<HTMLInputElement>(null);
const getStats = () => {
const parsedDatasetId = Number(routeDatasetId ?? "");
const datasetId =
Number.isInteger(parsedDatasetId) && parsedDatasetId > 0
? parsedDatasetId
: null;
const getFilterParams = () => {
const params: Record<string, string> = {};
const query = (searchInputRef.current?.value ?? "").trim();
const start = (afterDateRef.current?.value ?? "").trim();
const end = (beforeDateRef.current?.value ?? "").trim();
if (query) {
params.search_query = query;
}
if (start) {
params.start_date = start;
}
if (end) {
params.end_date = end;
}
return params;
};
const getAuthHeaders = () => {
const token = localStorage.getItem("access_token");
if (!token) {
return null;
}
return {
Authorization: `Bearer ${token}`,
};
};
const getFilterKey = (params: Record<string, string>) =>
JSON.stringify(Object.entries(params).sort(([a], [b]) => a.localeCompare(b)));
const ensureFilteredRecords = async () => {
if (!datasetId) {
throw new Error("Missing dataset id.");
}
const authHeaders = getAuthHeaders();
if (!authHeaders) {
throw new Error("You must be signed in to load corpus records.");
}
const filterKey = getFilterKey(appliedFilters);
if (allRecords && allRecordsKey === filterKey) {
return allRecords;
}
const response = await axios.get<unknown>(
`${API_BASE_URL}/dataset/${datasetId}/all`,
{
params: appliedFilters,
headers: authHeaders,
},
);
const normalizedRecords = normalizeRecordPayload(response.data);
setAllRecords(normalizedRecords);
setAllRecordsKey(filterKey);
return normalizedRecords;
};
const openExplorer = async (spec: CorpusExplorerSpec) => {
setExplorerState(createExplorerState(spec, { loading: true }));
try {
const records = await ensureFilteredRecords();
const context = buildExplorerContext(records);
const matched = records
.filter((record) => spec.matcher(record, context))
.sort(compareRecordsByNewest);
setExplorerState(createExplorerState(spec, { records: matched }));
} catch (e) {
setExplorerState(
createExplorerState(spec, {
error: `Failed to load corpus records: ${String(e)}`,
}),
);
}
};
const getStats = (params: Record<string, string> = {}) => {
if (!datasetId) {
setError("Missing dataset id. Open /dataset/<id>/stats.");
return;
}
const authHeaders = getAuthHeaders();
if (!authHeaders) {
setError("You must be signed in to load stats.");
return;
}
setError("");
setLoading(true);
setAppliedFilters(params);
setAllRecords(null);
setAllRecordsKey("");
setExplorerState((current) => ({ ...current, open: false }));
Promise.all([
axios.get<TimeAnalysisResponse>("http://localhost:5000/stats/time"),
axios.get<UserAnalysisResponse>("http://localhost:5000/stats/user"),
axios.get<ContentAnalysisResponse>("http://localhost:5000/stats/content"),
axios.get<SummaryResponse>(`http://localhost:5000/stats/summary`),
])
.then(([timeRes, userRes, contentRes, summaryRes]) => {
setUserData(userRes.data || null);
setTimeData(timeRes.data || null);
setContentData(contentRes.data || null);
setSummary(summaryRes.data || null);
})
.catch((e) => setError("Failed to load statistics: " + String(e)))
axios.get<TimeAnalysisResponse>(`${API_BASE_URL}/dataset/${datasetId}/temporal`, {
params,
headers: authHeaders,
}),
axios.get<UserEndpointResponse>(`${API_BASE_URL}/dataset/${datasetId}/user`, {
params,
headers: authHeaders,
}),
axios.get<LinguisticAnalysisResponse>(
`${API_BASE_URL}/dataset/${datasetId}/linguistic`,
{
params,
headers: authHeaders,
},
),
axios.get<EmotionalAnalysisResponse>(`${API_BASE_URL}/dataset/${datasetId}/emotional`, {
params,
headers: authHeaders,
}),
axios.get<InteractionAnalysisResponse>(
`${API_BASE_URL}/dataset/${datasetId}/interactional`,
{
params,
headers: authHeaders,
},
),
axios.get<SummaryResponse>(`${API_BASE_URL}/dataset/${datasetId}/summary`, {
params,
headers: authHeaders,
}),
axios.get<CulturalAnalysisResponse>(`${API_BASE_URL}/dataset/${datasetId}/cultural`, {
params,
headers: authHeaders,
}),
])
.then(
([
timeRes,
userRes,
linguisticRes,
emotionalRes,
interactionRes,
summaryRes,
culturalRes,
]) => {
const usersList = userRes.data.users ?? [];
const topUsersList = userRes.data.top_users ?? [];
const interactionGraphRaw = interactionRes.data?.interaction_graph ?? {};
const topPairsRaw = interactionRes.data?.top_interaction_pairs ?? [];
const filteredUsers: typeof usersList = [];
for (const user of usersList) {
if (isDeletedUser(user.author)) continue;
filteredUsers.push(user);
}
const filteredTopUsers: typeof topUsersList = [];
for (const user of topUsersList) {
if (isDeletedUser(user.author)) continue;
filteredTopUsers.push(user);
}
let mostCommentHeavyUser: UserStatsMeta["mostCommentHeavyUser"] = null;
for (const user of filteredUsers) {
const currentShare = user.comment_share ?? 0;
if (!mostCommentHeavyUser || currentShare > mostCommentHeavyUser.commentShare) {
mostCommentHeavyUser = {
author: user.author,
commentShare: currentShare,
};
}
}
const topAuthors = new Set(filteredTopUsers.map((entry) => entry.author));
const summaryUsers: User[] = [];
for (const user of filteredUsers) {
if (topAuthors.has(user.author)) {
summaryUsers.push(user);
}
}
const filteredInteractionGraph: Record<string, Record<string, number>> = {};
for (const [source, targets] of Object.entries(interactionGraphRaw)) {
if (isDeletedUser(source)) {
continue;
}
const nextTargets: Record<string, number> = {};
for (const [target, count] of Object.entries(targets)) {
if (isDeletedUser(target)) {
continue;
}
nextTargets[target] = count;
}
filteredInteractionGraph[source] = nextTargets;
}
const filteredTopInteractionPairs: typeof topPairsRaw = [];
for (const pairEntry of topPairsRaw) {
const pair = pairEntry[0];
const source = pair[0];
const target = pair[1];
if (isDeletedUser(source) || isDeletedUser(target)) {
continue;
}
filteredTopInteractionPairs.push(pairEntry);
}
const filteredUserData: UserEndpointResponse = {
users: summaryUsers,
top_users: filteredTopUsers,
};
const filteredInteractionData: InteractionAnalysisResponse = {
...interactionRes.data,
interaction_graph: filteredInteractionGraph,
top_interaction_pairs: filteredTopInteractionPairs,
};
const filteredSummary: SummaryResponse = {
...summaryRes.data,
unique_users: filteredUsers.length,
};
setUserData(filteredUserData);
setUserStatsMeta({
totalUsers: filteredUsers.length,
mostCommentHeavyUser,
});
setTimeData(timeRes.data || null);
setLinguisticData(linguisticRes.data || null);
setEmotionalData(emotionalRes.data || null);
setInteractionData(filteredInteractionData || null);
setCulturalData(culturalRes.data || null);
setSummary(filteredSummary || null);
},
)
.catch((e) => setError(`Failed to load statistics: ${String(e)}`))
.finally(() => setLoading(false));
};
const onSubmitFilters = () => {
const query = searchInputRef.current?.value ?? "";
Promise.all([
axios.post("http://localhost:5000/filter/search", {
query: query
}),
])
.then(() => {
getStats();
})
.catch(e => {
setError("Failed to load filters: " + e.response);
})
getStats(getFilterParams());
};
const resetFilters = () => {
axios.get("http://localhost:5000/filter/reset")
.then(() => {
getStats();
})
.catch(e => {
setError(e);
})
if (searchInputRef.current) {
searchInputRef.current.value = "";
}
if (beforeDateRef.current) {
beforeDateRef.current.value = "";
}
if (afterDateRef.current) {
afterDateRef.current.value = "";
}
getStats();
};
useEffect(() => {
setError("");
setAllRecords(null);
setAllRecordsKey("");
setExplorerState(EMPTY_EXPLORER_STATE);
if (!datasetId) {
setError("Missing dataset id. Open /dataset/<id>/stats.");
return;
}
getStats();
}, [])
}, [datasetId]);
if (loading) return <p style={{...styles.page, minWidth: "100vh", minHeight: "100vh"}}>Loading insights</p>;
if (error) return <p style={{...styles.page}}>{error}</p>;
if (loading) {
return (
<div style={styles.loadingPage}>
<div style={{ ...styles.loadingCard, transform: "translateY(-100px)" }}>
<div style={styles.loadingHeader}>
<div style={styles.loadingSpinner} />
<div>
<h2 style={styles.loadingTitle}>Loading analytics</h2>
<p style={styles.loadingSubtitle}>
Fetching summary, temporal, user, linguistic, emotional, interactional, and cultural insights.
</p>
</div>
</div>
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.card, ...styles.headerBar }}>
<div style={styles.controls}>
<input
type="text"
id="query"
ref={searchInputRef}
placeholder="Search events..."
style={styles.input}
/>
<div style={styles.loadingSkeleton}>
<div
style={{
...styles.loadingSkeletonLine,
...styles.loadingSkeletonLineLong,
}}
/>
<div
style={{
...styles.loadingSkeletonLine,
...styles.loadingSkeletonLineMed,
}}
/>
<div
style={{
...styles.loadingSkeletonLine,
...styles.loadingSkeletonLineShort,
}}
/>
</div>
</div>
</div>
);
}
if (error) return <p style={{ ...styles.page }}>{error}</p>;
<input
type="date"
ref={beforeDateRef}
placeholder="Search before date"
style={styles.input}
/>
return (
<div style={styles.page}>
<div style={{ ...styles.container, ...styles.card, ...styles.headerBar }}>
<div style={styles.controls}>
<input
type="text"
id="query"
ref={searchInputRef}
placeholder="Search events..."
style={styles.input}
/>
<input
<input
type="date"
ref={beforeDateRef}
placeholder="Search before date"
style={styles.input}
/>
<input
type="date"
ref={afterDateRef}
placeholder="Search after date"
style={styles.input}
/>
<button onClick={onSubmitFilters} style={styles.buttonPrimary}>
Search
</button>
<button onClick={resetFilters} style={styles.buttonSecondary}>
Reset
</button>
</div>
<div style={styles.dashboardMeta}>Analytics Dashboard</div>
<div style={styles.dashboardMeta}>Dataset #{datasetId ?? "-"}</div>
</div>
<div
style={{
...styles.container,
...styles.tabsRow,
justifyContent: "center",
}}
>
<button
onClick={() => setActiveView("summary")}
style={
activeView === "summary" ? styles.buttonPrimary : styles.buttonSecondary
}
>
Summary
</button>
<button
onClick={() => setActiveView("emotional")}
style={
activeView === "emotional"
? styles.buttonPrimary
: styles.buttonSecondary
}
>
Emotional
</button>
<button
onClick={() => setActiveView("user")}
style={activeView === "user" ? styles.buttonPrimary : styles.buttonSecondary}
>
Users
</button>
<button
onClick={() => setActiveView("linguistic")}
style={
activeView === "linguistic"
? styles.buttonPrimary
: styles.buttonSecondary
}
>
Linguistic
</button>
<button
onClick={() => setActiveView("interactional")}
style={
activeView === "interactional"
? styles.buttonPrimary
: styles.buttonSecondary
}
>
Interactional
</button>
<button
onClick={() => setActiveView("cultural")}
style={
activeView === "cultural" ? styles.buttonPrimary : styles.buttonSecondary
}
>
Cultural
</button>
</div>
{activeView === "summary" && (
<SummaryStats
userData={userData}
timeData={timeData}
linguisticData={linguisticData}
summary={summary}
onExplore={openExplorer}
/>
)}
<button onClick={onSubmitFilters} style={styles.buttonPrimary}>
Search
</button>
{activeView === "emotional" && emotionalData && (
<EmotionalStats emotionalData={emotionalData} onExplore={openExplorer} />
)}
<button onClick={resetFilters} style={styles.buttonSecondary}>
Reset
</button>
</div>
{activeView === "emotional" && !emotionalData && (
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
No emotional data available.
</div>
)}
<div style={{ fontSize: 13, color: "#6b7280" }}>Analytics Dashboard</div>
</div>
{activeView === "user" && userData && interactionData && (
<UserStats
topUsers={userData.top_users}
interactionGraph={interactionData.interaction_graph}
totalUsers={userStatsMeta.totalUsers}
mostCommentHeavyUser={userStatsMeta.mostCommentHeavyUser}
onExplore={openExplorer}
/>
)}
<div style={{ ...styles.container, display: "flex", gap: 8, marginTop: 12 }}>
<button
onClick={() => setActiveView("summary")}
style={activeView === "summary" ? styles.buttonPrimary : styles.buttonSecondary}
>
Summary
</button>
<button
onClick={() => setActiveView("emotional")}
style={activeView === "emotional" ? styles.buttonPrimary : styles.buttonSecondary}
>
Emotional
</button>
{activeView === "user" && (!userData || !interactionData) && (
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
No user network data available.
</div>
)}
<button
onClick={() => setActiveView("interaction")}
style={activeView === "interaction" ? styles.buttonPrimary : styles.buttonSecondary}
>
Interaction
</button>
</div>
{activeView === "linguistic" && linguisticData && (
<LinguisticStats data={linguisticData} onExplore={openExplorer} />
)}
{activeView === "summary" && (
<SummaryStats
userData={userData}
timeData={timeData}
contentData={contentData}
summary={summary}
{activeView === "linguistic" && !linguisticData && (
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
No linguistic data available.
</div>
)}
{activeView === "interactional" && interactionData && (
<InteractionalStats data={interactionData} />
)}
{activeView === "interactional" && !interactionData && (
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
No interactional data available.
</div>
)}
{activeView === "cultural" && culturalData && (
<CulturalStats data={culturalData} onExplore={openExplorer} />
)}
{activeView === "cultural" && !culturalData && (
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
No cultural data available.
</div>
)}
<CorpusExplorer
open={explorerState.open}
onClose={() => setExplorerState((current) => ({ ...current, open: false }))}
title={explorerState.title}
description={explorerState.description}
records={explorerState.records}
loading={explorerState.loading}
error={explorerState.error}
emptyMessage={explorerState.emptyMessage}
/>
)}
{activeView === "emotional" && contentData && (
<EmotionalStats contentData={contentData} />
)}
{activeView === "emotional" && !contentData && (
<div style={{ ...styles.container, ...styles.card, marginTop: 16 }}>
No emotional data available.
</div>
)}
{activeView === "interaction" && userData && (
<InteractionStats data={userData} />
)}
</div>
);
}
</div>
);
};
export default StatPage;
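
The literal normalization performed by parseJsonLikePayload above can be illustrated in isolation. The following is a condensed sketch, not the code from this change: the name normalizeJsonLike and the sample inputs are assumptions made for illustration, and the three per-delimiter regexes are folded into one character class. It also shows one limitation of a regex-only approach, namely that a matching pattern inside a string value is rewritten too.

```typescript
// Condensed sketch of the Python-literal normalization (hypothetical helper name).
const normalizeJsonLike = (value: string): string =>
  value
    .replace(/\uFEFF/g, "") // strip byte-order marks
    .replace(/,\s*([}\]])/g, "$1") // drop trailing commas before } or ]
    .replace(/([:\[,]\s*)(NaN|Infinity|-Infinity)\b/g, "$1null") // non-finite numbers
    .replace(/([:\[,]\s*)None\b/g, "$1null") // Python None
    .replace(/([:\[,]\s*)True\b/g, "$1true") // Python True
    .replace(/([:\[,]\s*)False\b/g, "$1false"); // Python False

// Illustrative payload: Python-style literals plus a trailing comma.
const raw = '{"score": NaN, "flags": [True, False, None], "deleted": None,}';
const parsed = JSON.parse(normalizeJsonLike(raw));
console.log(parsed); // { score: null, flags: [ true, false, null ], deleted: null }

// Limitation: text inside a string value that looks like ": None" is also rewritten.
const caveat = JSON.parse(normalizeJsonLike('{"text": "status: None"}'));
console.log(caveat.text); // "status: null"
```

Because of that limitation, this kind of normalization is probably best treated as a fallback for malformed responses rather than as the primary parsing path.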


@@ -1,56 +1,180 @@
import axios from 'axios'
import './../App.css'
import { useState } from 'react'
import { useNavigate } from 'react-router-dom'
import axios from "axios";
import { useState } from "react";
import { useNavigate } from "react-router-dom";
import StatsStyling from "../styles/stats_styling";
const styles = StatsStyling;
const API_BASE_URL = import.meta.env.VITE_BACKEND_URL;
const UploadPage = () => {
let postFile: File | undefined;
let topicBucketFile: File | undefined;
const [returnMessage, setReturnMessage] = useState('')
const navigate = useNavigate()
const [datasetName, setDatasetName] = useState("");
const [postFile, setPostFile] = useState<File | null>(null);
const [topicBucketFile, setTopicBucketFile] = useState<File | null>(null);
const [returnMessage, setReturnMessage] = useState("");
const [isSubmitting, setIsSubmitting] = useState(false);
const [hasError, setHasError] = useState(false);
const navigate = useNavigate();
const uploadFiles = async () => {
if (!postFile || !topicBucketFile) {
alert('Please upload all files before uploading.')
return
const normalizedDatasetName = datasetName.trim();
if (!normalizedDatasetName) {
setHasError(true);
setReturnMessage("Please add a dataset name before continuing.");
return;
}
const formData = new FormData()
formData.append('posts', postFile)
formData.append('topics', topicBucketFile)
if (!postFile || !topicBucketFile) {
setHasError(true);
setReturnMessage("Please upload both files before continuing.");
return;
}
const formData = new FormData();
formData.append("name", normalizedDatasetName);
formData.append("posts", postFile);
formData.append("topics", topicBucketFile);
try {
const response = await axios.post('http://localhost:5000/upload', formData, {
headers: {
'Content-Type': 'multipart/form-data',
},
})
console.log('Files uploaded successfully:', response.data)
setReturnMessage(`Upload successful! Posts: ${response.data.posts_count}, Comments: ${response.data.comments_count}`)
navigate('/stats')
} catch (error) {
console.error('Error uploading files:', error)
setReturnMessage('Error uploading files. Error details: ' + error)
}
}
return (
<div style={{...styles.container, ...styles.grid, margin: "0"}}>
<div style={{ ...styles.card }}>
<h2 style={{color: "black" }}>Posts File</h2>
<input style={{color: "black" }} type="file" onChange={(e) => postFile = e.target.files?.[0]}></input>
</div>
<div style={{ ...styles.card }}>
<h2 style={{color: "black" }}>Topic Buckets File</h2>
<input style={{color: "black" }} type="file" onChange={(e) => topicBucketFile = e.target.files?.[0]}></input>
</div>
<button onClick={uploadFiles}>Upload</button>
setIsSubmitting(true);
setHasError(false);
setReturnMessage("");
<p>{returnMessage}</p>
const response = await axios.post(
`${API_BASE_URL}/datasets/upload`,
formData,
{
headers: {
"Content-Type": "multipart/form-data",
},
},
);
const datasetId = Number(response.data.dataset_id);
setReturnMessage(
`Upload queued successfully (dataset #${datasetId}). Redirecting to processing status...`,
);
setTimeout(() => {
navigate(`/dataset/${datasetId}/status`);
}, 400);
} catch (error: unknown) {
setHasError(true);
if (axios.isAxiosError(error)) {
const message = String(
error.response?.data?.error || error.message || "Upload failed.",
);
setReturnMessage(`Upload failed: ${message}`);
} else {
setReturnMessage("Upload failed due to an unexpected error.");
}
} finally {
setIsSubmitting(false);
}
};
return (
<div style={styles.page}>
<div style={styles.containerWide}>
<div style={{ ...styles.card, ...styles.headerBar }}>
<div>
<h1 style={styles.sectionHeaderTitle}>Upload Dataset</h1>
<p style={styles.sectionHeaderSubtitle}>
Name your dataset, then upload posts and topic map files to
generate analytics.
</p>
</div>
<button
type="button"
style={{
...styles.buttonPrimary,
opacity: isSubmitting ? 0.75 : 1,
}}
onClick={uploadFiles}
disabled={isSubmitting}
>
{isSubmitting ? "Uploading..." : "Upload and Analyze"}
</button>
</div>
<div
style={{
...styles.grid,
marginTop: 14,
gridTemplateColumns: "repeat(auto-fit, minmax(280px, 1fr))",
}}
>
<div style={{ ...styles.card, gridColumn: "auto" }}>
<h2 style={{ ...styles.sectionTitle, color: "#24292f" }}>
Dataset Name
</h2>
<p style={styles.sectionSubtitle}>
Use a clear label so you can identify this upload later.
</p>
<input
style={{ ...styles.input, ...styles.inputFullWidth }}
type="text"
placeholder="Example: Cork Discussions - Jan 2026"
value={datasetName}
onChange={(event) => setDatasetName(event.target.value)}
/>
</div>
<div style={{ ...styles.card, gridColumn: "auto" }}>
<h2 style={{ ...styles.sectionTitle, color: "#24292f" }}>
Posts File (.jsonl)
</h2>
<p style={styles.sectionSubtitle}>
Upload the raw post records export.
</p>
<input
style={{ ...styles.input, ...styles.inputFullWidth }}
type="file"
accept=".jsonl"
onChange={(event) => setPostFile(event.target.files?.[0] ?? null)}
/>
<p style={styles.subtleBodyText}>
{postFile ? `Selected: ${postFile.name}` : "No file selected"}
</p>
</div>
<div style={{ ...styles.card, gridColumn: "auto" }}>
<h2 style={{ ...styles.sectionTitle, color: "#24292f" }}>
Topics File (.json)
</h2>
<p style={styles.sectionSubtitle}>
Upload your topic bucket mapping file.
</p>
<input
style={{ ...styles.input, ...styles.inputFullWidth }}
type="file"
accept=".json"
onChange={(event) =>
setTopicBucketFile(event.target.files?.[0] ?? null)
}
/>
<p style={styles.subtleBodyText}>
{topicBucketFile
? `Selected: ${topicBucketFile.name}`
: "No file selected"}
</p>
</div>
</div>
<div
style={{
...styles.card,
marginTop: 14,
...(hasError ? styles.alertCardError : styles.alertCardInfo),
}}
>
{returnMessage ||
"After upload, your dataset is queued for processing and you'll be redirected to its status page."}
</div>
</div>
</div>
)
}
);
};
export default UploadPage;


@@ -1,4 +1,5 @@
import { ResponsiveHeatMap } from "@nivo/heatmap";
import { memo, useMemo } from "react";
type ApiRow = Record<number, number>;
type ActivityHeatmapProps = {
@@ -25,8 +26,7 @@ const DAYS = [
"Sunday",
];
const hourLabel = (h: number) =>
`${h.toString().padStart(2, "0")}:00`;
const hourLabel = (h: number) => `${h.toString().padStart(2, "0")}:00`;
const convertWeeklyData = (dataset: ApiRow[]): ChartSeries[] => {
return dataset.map((dayData, index) => ({
@@ -40,32 +40,37 @@ const convertWeeklyData = (dataset: ApiRow[]): ChartSeries[] => {
}));
};
const ActivityHeatmap = ({ data }: ActivityHeatmapProps) => {
const convertedData = convertWeeklyData(data);
const convertedData = useMemo(() => convertWeeklyData(data), [data]);
const maxValue = Math.max(
...convertedData.flatMap(day =>
day.data.map(point => point.y)
)
const maxValue = useMemo(() => {
let max = 0;
for (const day of convertedData) {
for (const point of day.data) {
if (point.y > max) {
max = point.y;
}
}
}
return max;
}, [convertedData]);
return (
<ResponsiveHeatMap
data={convertedData}
valueFormat=">-.2s"
axisTop={{ tickRotation: -90 }}
axisRight={{ legend: "Weekday", legendOffset: 70 }}
axisLeft={{ legend: "Weekday", legendOffset: -72 }}
colors={{
type: "diverging",
scheme: "red_yellow_blue",
divergeAt: 0.3,
minValue: 0,
maxValue: maxValue,
}}
/>
);
};
return (
<ResponsiveHeatMap
data={convertedData}
valueFormat=">-.2s"
axisTop={{ tickRotation: -90 }}
axisRight={{ legend: 'Weekday', legendOffset: 70 }}
axisLeft={{ legend: 'Weekday', legendOffset: -72 }}
colors={{
type: 'diverging',
scheme: 'red_yellow_blue',
divergeAt: 0.3,
minValue: 0,
maxValue: maxValue
}}
/>
)
}
export default ActivityHeatmap;
export default memo(ActivityHeatmap);
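
One observable difference between the two maxValue computations in this change is how they behave on empty input. The sketch below compares them outside React; the Point and Series types and the sample data are assumptions standing in for the component's own chart types.

```typescript
// Hypothetical stand-ins for the component's chart series types.
type Point = { x: string; y: number };
type Series = { id: string; data: Point[] };

// Earlier approach: spread every value into Math.max.
const maxBySpread = (series: Series[]): number =>
  Math.max(...series.flatMap((day) => day.data.map((point) => point.y)));

// Updated approach: a plain loop seeded with 0.
const maxByLoop = (series: Series[]): number => {
  let max = 0;
  for (const day of series) {
    for (const point of day.data) {
      if (point.y > max) {
        max = point.y;
      }
    }
  }
  return max;
};

const week: Series[] = [
  { id: "Monday", data: [{ x: "00:00", y: 3 }, { x: "01:00", y: 7 }] },
  { id: "Tuesday", data: [{ x: "00:00", y: 5 }] },
];

console.log(maxBySpread(week), maxByLoop(week)); // 7 7
console.log(maxBySpread([]), maxByLoop([])); // -Infinity 0
```

With no data, the spread form yields -Infinity because Math.max called with no arguments returns -Infinity, while the loop yields 0, which matches the minValue passed to the colour scale.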


@@ -0,0 +1,42 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const appLayoutStyles: StyleMap = {
appHeaderWrap: {
padding: "16px 24px 0",
},
appHeaderBrandRow: {
display: "flex",
alignItems: "center",
gap: 10,
flexWrap: "wrap",
},
appTitle: {
margin: 0,
color: palette.textPrimary,
fontSize: 18,
fontWeight: 600,
},
authStatusBadge: {
padding: "3px 8px",
borderRadius: 6,
fontSize: 12,
fontWeight: 600,
fontFamily: '"IBM Plex Sans", "Noto Sans", "Liberation Sans", "Segoe UI", sans-serif',
},
authStatusSignedIn: {
border: `1px solid ${palette.statusPositiveBorder}`,
background: palette.statusPositiveBg,
color: palette.statusPositiveText,
},
authStatusSignedOut: {
border: `1px solid ${palette.statusNegativeBorder}`,
background: palette.statusNegativeBg,
color: palette.statusNegativeText,
},
};


@@ -0,0 +1,92 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const authStyles: StyleMap = {
containerAuth: {
maxWidth: 560,
margin: "0 auto",
padding: "48px 24px",
},
headingXl: {
margin: 0,
color: palette.textPrimary,
fontSize: 28,
fontWeight: 600,
lineHeight: 1.1,
},
headingBlock: {
marginBottom: 22,
textAlign: "center",
},
mutedText: {
margin: "8px 0 0",
color: palette.textSecondary,
fontSize: 14,
},
authCard: {
padding: 28,
},
authForm: {
display: "grid",
gap: 12,
maxWidth: 380,
margin: "0 auto",
},
inputFullWidth: {
width: "100%",
maxWidth: "100%",
boxSizing: "border-box",
},
authControl: {
width: "100%",
maxWidth: "100%",
boxSizing: "border-box",
},
authErrorText: {
color: palette.dangerText,
margin: "12px auto 0",
fontSize: 14,
maxWidth: 380,
textAlign: "center",
},
authInfoText: {
color: palette.successText,
margin: "12px auto 0",
fontSize: 14,
maxWidth: 380,
textAlign: "center",
},
authSwitchRow: {
marginTop: 16,
display: "flex",
alignItems: "center",
justifyContent: "center",
gap: 8,
flexWrap: "wrap",
},
authSwitchLabel: {
color: palette.textSecondary,
fontSize: 14,
},
authSwitchButton: {
border: "none",
background: "transparent",
color: palette.brandGreenBorder,
fontSize: 14,
fontWeight: 600,
cursor: "pointer",
padding: 0,
},
};


@@ -0,0 +1,42 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const cardStyles: StyleMap = {
cardBase: {
background: palette.surface,
border: `1px solid ${palette.borderDefault}`,
borderRadius: 8,
padding: 14,
boxShadow: `0 1px 0 ${palette.shadowSubtle}`,
minHeight: 88,
},
cardTopRow: {
display: "flex",
justifyContent: "space-between",
alignItems: "center",
gap: 10,
},
cardLabel: {
fontSize: 12,
fontWeight: 600,
color: palette.textSecondary,
letterSpacing: "0.02em",
textTransform: "uppercase",
},
cardValue: {
fontSize: 24,
fontWeight: 700,
marginTop: 6,
letterSpacing: "-0.02em",
color: palette.textPrimary,
},
cardSubLabel: {
marginTop: 6,
fontSize: 12,
color: palette.textSecondary,
},
};


@@ -0,0 +1,55 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const datasetStyles: StyleMap = {
sectionHeaderTitle: {
margin: 0,
color: palette.textPrimary,
fontSize: 28,
fontWeight: 600,
},
sectionHeaderSubtitle: {
margin: "8px 0 0",
color: palette.textSecondary,
fontSize: 14,
},
listNoBullets: {
listStyle: "none",
margin: 0,
padding: 0,
},
datasetListItem: {
display: "flex",
alignItems: "center",
justifyContent: "space-between",
gap: 12,
padding: "14px 16px",
borderBottom: `1px solid ${palette.borderMuted}`,
},
datasetName: {
fontWeight: 600,
color: palette.textPrimary,
},
datasetMeta: {
fontSize: 13,
color: palette.textSecondary,
marginTop: 4,
},
datasetMetaSecondary: {
fontSize: 13,
color: palette.textSecondary,
marginTop: 2,
},
subtleBodyText: {
margin: "10px 0 0",
fontSize: 13,
color: palette.textBody,
},
};

@@ -0,0 +1,51 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const emotionalStyles: StyleMap = {
emotionalSummaryRow: {
display: "flex",
flexWrap: "wrap",
gap: 10,
fontSize: 13,
color: palette.textTertiary,
marginTop: 6,
},
emotionalTopicLabel: {
fontSize: 12,
fontWeight: 600,
color: palette.textSecondary,
letterSpacing: "0.02em",
textTransform: "uppercase",
},
emotionalTopicValue: {
fontSize: 24,
fontWeight: 800,
marginTop: 4,
lineHeight: 1.2,
},
emotionalMetricRow: {
display: "flex",
justifyContent: "space-between",
alignItems: "center",
marginTop: 10,
fontSize: 13,
color: palette.textSecondary,
},
emotionalMetricRowCompact: {
display: "flex",
justifyContent: "space-between",
alignItems: "center",
marginTop: 4,
fontSize: 13,
color: palette.textSecondary,
},
emotionalMetricValue: {
fontWeight: 600,
color: palette.textPrimary,
},
};

@@ -0,0 +1,106 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const feedbackStyles: StyleMap = {
loadingPage: {
width: "100%",
minHeight: "100vh",
padding: 20,
display: "flex",
alignItems: "center",
justifyContent: "center",
},
loadingCard: {
width: "min(560px, 92vw)",
background: palette.surface,
border: `1px solid ${palette.borderDefault}`,
borderRadius: 8,
boxShadow: `0 1px 0 ${palette.shadowSubtle}`,
padding: 20,
},
loadingHeader: {
display: "flex",
alignItems: "center",
gap: 12,
},
loadingSpinner: {
width: 18,
height: 18,
borderRadius: "50%",
border: `2px solid ${palette.borderDefault}`,
borderTopColor: palette.brandGreen,
animation: "stats-spin 0.9s linear infinite",
flexShrink: 0,
},
loadingTitle: {
margin: 0,
fontSize: 16,
fontWeight: 600,
color: palette.textPrimary,
},
loadingSubtitle: {
margin: "6px 0 0",
fontSize: 13,
color: palette.textSecondary,
},
loadingSkeleton: {
marginTop: 16,
display: "grid",
gap: 8,
},
loadingSkeletonLine: {
height: 9,
borderRadius: 999,
background: palette.canvas,
animation: "stats-pulse 1.25s ease-in-out infinite",
},
loadingSkeletonLineLong: {
width: "100%",
},
loadingSkeletonLineMed: {
width: "78%",
},
loadingSkeletonLineShort: {
width: "62%",
},
alertCardError: {
borderColor: palette.alertErrorBorder,
background: palette.alertErrorBg,
color: palette.alertErrorText,
fontSize: 14,
},
alertCardInfo: {
borderColor: palette.alertInfoBorder,
background: palette.surface,
color: palette.textBody,
fontSize: 14,
},
statusMessageCard: {
marginTop: 12,
boxShadow: "none",
},
dashboardMeta: {
fontSize: 13,
color: palette.textSecondary,
},
tabsRow: {
display: "flex",
gap: 8,
marginTop: 12,
},
};

@@ -0,0 +1,167 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const foundationStyles: StyleMap = {
appShell: {
minHeight: "100vh",
background: palette.canvas,
fontFamily: '"IBM Plex Sans", "Noto Sans", "Liberation Sans", "Segoe UI", sans-serif',
color: palette.textPrimary,
},
page: {
width: "100%",
minHeight: "100vh",
padding: 20,
background: palette.canvas,
fontFamily: '"IBM Plex Sans", "Noto Sans", "Liberation Sans", "Segoe UI", sans-serif',
color: palette.textPrimary,
overflowX: "hidden",
boxSizing: "border-box",
},
container: {
maxWidth: 1240,
margin: "0 auto",
},
containerWide: {
maxWidth: 1100,
margin: "0 auto",
},
containerNarrow: {
maxWidth: 720,
margin: "0 auto",
},
card: {
background: palette.surface,
borderRadius: 8,
padding: 16,
border: `1px solid ${palette.borderDefault}`,
boxShadow: `0 1px 0 ${palette.shadowSubtle}`,
},
headerBar: {
display: "flex",
flexWrap: "wrap",
alignItems: "center",
justifyContent: "space-between",
gap: 10,
},
controls: {
display: "flex",
gap: 8,
alignItems: "center",
},
controlsWrapped: {
display: "flex",
gap: 8,
alignItems: "center",
flexWrap: "wrap",
},
input: {
width: 280,
maxWidth: "70vw",
padding: "8px 10px",
borderRadius: 6,
border: `1px solid ${palette.borderDefault}`,
outline: "none",
fontSize: 14,
background: palette.surface,
color: palette.textPrimary,
},
buttonPrimary: {
padding: "8px 12px",
borderRadius: 6,
border: `1px solid ${palette.brandGreenBorder}`,
background: palette.brandGreen,
color: palette.surface,
fontWeight: 600,
cursor: "pointer",
boxShadow: "none",
},
buttonSecondary: {
padding: "8px 12px",
borderRadius: 6,
border: `1px solid ${palette.borderDefault}`,
background: palette.canvas,
color: palette.textPrimary,
fontWeight: 600,
cursor: "pointer",
},
buttonDanger: {
padding: "8px 12px",
borderRadius: 6,
border: `1px solid ${palette.borderDefault}`,
background: palette.dangerText,
color: palette.textPrimary,
fontWeight: 600,
cursor: "pointer",
},
grid: {
marginTop: 12,
display: "grid",
gridTemplateColumns: "repeat(12, 1fr)",
gap: 12,
},
sectionTitle: {
margin: 0,
fontSize: 17,
fontWeight: 600,
},
sectionSubtitle: {
margin: "6px 0 14px",
fontSize: 13,
color: palette.textSecondary,
},
chartWrapper: {
width: "100%",
height: 350,
},
heatmapWrapper: {
width: "100%",
height: 320,
},
topUsersList: {
display: "flex",
flexDirection: "column",
gap: 10,
},
topUserItem: {
padding: "10px 12px",
borderRadius: 8,
background: palette.canvas,
border: `1px solid ${palette.borderMuted}`,
},
topUserName: {
fontWeight: 600,
fontSize: 14,
color: palette.textPrimary,
},
topUserMeta: {
fontSize: 13,
color: palette.textSecondary,
},
scrollArea: {
maxHeight: 420,
overflowY: "auto",
},
};

@@ -0,0 +1,28 @@
import { palette } from "./palette";
import type { StyleMap } from "./types";
export const modalStyles: StyleMap = {
modalRoot: {
position: "relative",
zIndex: 50,
},
modalBackdrop: {
position: "fixed",
inset: 0,
background: palette.modalBackdrop,
},
modalContainer: {
position: "fixed",
inset: 0,
display: "flex",
alignItems: "center",
justifyContent: "center",
padding: 16,
},
modalPanel: {
width: "min(520px, 95vw)",
},
};

@@ -0,0 +1,26 @@
export const palette = {
canvas: "#f6f8fa",
surface: "#ffffff",
textPrimary: "#24292f",
textSecondary: "#57606a",
textTertiary: "#4b5563",
textBody: "#374151",
borderDefault: "#d0d7de",
borderMuted: "#d8dee4",
shadowSubtle: "rgba(27, 31, 36, 0.04)",
brandGreen: "#2da44e",
brandGreenBorder: "#1f883d",
statusPositiveBorder: "#b7dfc8",
statusPositiveBg: "#edf9f1",
statusPositiveText: "#1f6f43",
statusNegativeBorder: "#f3c1c1",
statusNegativeBg: "#fff2f2",
statusNegativeText: "#9a2929",
dangerText: "#b91c1c",
successText: "#166534",
alertErrorBorder: "rgba(185, 28, 28, 0.28)",
alertErrorBg: "#fff5f5",
alertErrorText: "#991b1b",
alertInfoBorder: "rgba(0,0,0,0.06)",
modalBackdrop: "rgba(0,0,0,0.45)",
} as const;

@@ -0,0 +1,3 @@
import type { CSSProperties } from "react";
export type StyleMap = Record<string, CSSProperties>;

@@ -1,136 +1,22 @@
import type { CSSProperties } from "react";
import { appLayoutStyles } from "./stats/appLayout";
import { authStyles } from "./stats/auth";
import { cardStyles } from "./stats/cards";
import { datasetStyles } from "./stats/datasets";
import { emotionalStyles } from "./stats/emotional";
import { feedbackStyles } from "./stats/feedback";
import { foundationStyles } from "./stats/foundations";
import { modalStyles } from "./stats/modal";
const StatsStyling: Record<string, CSSProperties> = {
page: {
width: "100%",
minHeight: "100vh",
padding: 24,
background: "#f6f7fb",
fontFamily:
'-apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Inter, Arial, sans-serif',
color: "#111827",
overflowX: "hidden",
boxSizing: "border-box"
},
container: {
maxWidth: 1400,
margin: "0 auto",
},
card: {
background: "white",
borderRadius: 16,
padding: 16,
border: "1px solid rgba(0,0,0,0.06)",
boxShadow: "0 6px 20px rgba(0,0,0,0.06)",
},
headerBar: {
display: "flex",
flexWrap: "wrap",
alignItems: "center",
justifyContent: "space-between",
gap: 12,
},
controls: {
display: "flex",
gap: 10,
alignItems: "center",
},
input: {
width: 320,
maxWidth: "70vw",
padding: "10px 12px",
borderRadius: 12,
border: "1px solid rgba(0,0,0,0.12)",
outline: "none",
fontSize: 14,
background: "#fff",
color: "black"
},
buttonPrimary: {
padding: "10px 14px",
borderRadius: 12,
border: "1px solid rgba(0,0,0,0.08)",
background: "#2563eb",
color: "white",
fontWeight: 600,
cursor: "pointer",
boxShadow: "0 6px 16px rgba(37,99,235,0.25)",
},
buttonSecondary: {
padding: "10px 14px",
borderRadius: 12,
border: "1px solid rgba(0,0,0,0.12)",
background: "#fff",
color: "#111827",
fontWeight: 600,
cursor: "pointer",
},
grid: {
marginTop: 18,
display: "grid",
gridTemplateColumns: "repeat(12, 1fr)",
gap: 16,
},
sectionTitle: {
margin: 0,
fontSize: 16,
fontWeight: 700,
},
sectionSubtitle: {
margin: "6px 0 14px",
fontSize: 13,
color: "#6b7280",
},
chartWrapper: {
width: "100%",
height: 350,
},
heatmapWrapper: {
width: "100%",
height: 320,
},
topUsersList: {
display: "flex",
flexDirection: "column",
gap: 10,
},
topUserItem: {
padding: "10px 12px",
borderRadius: 12,
background: "#f9fafb",
border: "1px solid rgba(0,0,0,0.06)",
},
topUserName: {
fontWeight: 700,
fontSize: 14,
color: "black"
},
topUserMeta: {
fontSize: 13,
color: "#6b7280",
},
scrollArea: {
maxHeight: 450,
overflowY: "auto",
},
...foundationStyles,
...appLayoutStyles,
...authStyles,
...datasetStyles,
...feedbackStyles,
...cardStyles,
...emotionalStyles,
...modalStyles,
};
export default StatsStyling;
export default StatsStyling;
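The spreads that close `StatsStyling` depend on object-spread ordering: if two style maps define the same key, the later spread replaces the earlier entry wholesale rather than merging its properties. The sketch below illustrates that behaviour under stated assumptions: `CSSLike` is a local stand-in for React's `CSSProperties` so the block runs without React, and the two small maps are hypothetical, not the real modules under `./stats/`.

```typescript
// Local stand-in for React's CSSProperties (assumption: illustration only).
type CSSLike = Record<string, string | number>;
type StyleMap = Record<string, CSSLike>;

// Hypothetical style modules with one deliberately colliding key ("card").
const foundationStyles: StyleMap = {
  card: { padding: 16, borderRadius: 8 },
  container: { maxWidth: 1240 },
};

const cardStyles: StyleMap = {
  // Same key as foundationStyles.card: the later spread replaces it entirely.
  card: { padding: 14 },
  cardLabel: { fontSize: 12 },
};

// Later spreads win on key collisions; non-colliding keys are all kept.
const merged: StyleMap = {
  ...foundationStyles,
  ...cardStyles,
};

console.log(Object.keys(merged));
console.log(merged.card);
```

Because the replacement is per key rather than per property, giving each module its own key prefix (as the `auth*`, `card*`, `loading*` names in this diff appear to do) keeps modules from silently overriding one another.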

@@ -1,20 +1,28 @@
// User Responses
type TopUser = {
author: string;
source: string;
count: number
// Shared types
type FrequencyWord = {
word: string;
count: number;
};
type FrequencyWord = {
word: string;
count: number;
}
type NGram = {
count: number;
ngram: string;
};
type AverageEmotionByTopic = {
topic: string;
n: number;
[emotion: string]: string | number;
}
type Emotion = {
emotion_anger: number;
emotion_disgust: number;
emotion_fear: number;
emotion_joy: number;
emotion_sadness: number;
};
// User
type TopUser = {
author: string;
source: string;
count: number;
};
type Vocab = {
author: string;
@@ -26,46 +34,160 @@ type Vocab = {
top_words: FrequencyWord[];
};
type DominantTopic = {
topic: string;
count: number;
};
type User = {
author: string;
post: number;
comment: number;
comment_post_ratio: number;
comment_share: number;
avg_emotions?: Record<string, number>;
dominant_topic?: DominantTopic | null;
vocab?: Vocab | null;
};
type InteractionGraph = Record<string, Record<string, number>>;
type UserEndpointResponse = {
top_users: TopUser[];
users: User[];
};
type UserAnalysisResponse = {
top_users: TopUser[];
users: User[];
interaction_graph: InteractionGraph;
};
// Time Analysis
// Time
type EventsPerDay = {
date: Date;
count: number;
}
date: Date;
count: number;
};
type HeatmapCell = {
date: Date;
hour: number;
count: number;
}
date: Date;
hour: number;
count: number;
};
type TimeAnalysisResponse = {
events_per_day: EventsPerDay[];
weekday_hour_heatmap: HeatmapCell[];
burstiness: number;
}
events_per_day: EventsPerDay[];
weekday_hour_heatmap: HeatmapCell[];
};
// Content (combines emotional and linguistic)
type AverageEmotionByTopic = Emotion & {
n: number;
topic: string;
[key: string]: string | number;
};
type OverallEmotionAverage = {
emotion: string;
score: number;
};
type DominantEmotionDistribution = {
emotion: string;
count: number;
ratio: number;
};
type EmotionBySource = {
source: string;
dominant_emotion: string;
dominant_score: number;
event_count: number;
};
// Content Analysis
type ContentAnalysisResponse = {
word_frequencies: FrequencyWord[];
average_emotion_by_topic: AverageEmotionByTopic[];
}
word_frequencies: FrequencyWord[];
average_emotion_by_topic: AverageEmotionByTopic[];
common_three_phrases: NGram[];
common_two_phrases: NGram[];
overall_emotion_average?: OverallEmotionAverage[];
dominant_emotion_distribution?: DominantEmotionDistribution[];
emotion_by_source?: EmotionBySource[];
};
// Linguistic
type LinguisticAnalysisResponse = {
word_frequencies: FrequencyWord[];
common_two_phrases: NGram[];
common_three_phrases: NGram[];
lexical_diversity?: Record<string, number>;
};
// Emotional
type EmotionalAnalysisResponse = {
average_emotion_by_topic: AverageEmotionByTopic[];
overall_emotion_average?: OverallEmotionAverage[];
dominant_emotion_distribution?: DominantEmotionDistribution[];
emotion_by_source?: EmotionBySource[];
};
// Interactional
type ConversationConcentration = {
total_commenting_authors: number;
top_10pct_author_count: number;
top_10pct_comment_share: number;
single_comment_authors: number;
single_comment_author_ratio: number;
};
type InteractionAnalysisResponse = {
top_interaction_pairs?: [[string, string], number][];
conversation_concentration?: ConversationConcentration;
interaction_graph: InteractionGraph;
};
// Cultural
type IdentityMarkers = {
in_group_usage: number;
out_group_usage: number;
in_group_ratio: number;
out_group_ratio: number;
in_group_posts: number;
out_group_posts: number;
tie_posts: number;
in_group_emotion_avg?: Record<string, number>;
out_group_emotion_avg?: Record<string, number>;
};
type StanceMarkers = {
hedge_total: number;
certainty_total: number;
deontic_total: number;
permission_total: number;
hedge_per_1k_tokens: number;
certainty_per_1k_tokens: number;
deontic_per_1k_tokens: number;
permission_per_1k_tokens: number;
hedge_emotion_avg?: Record<string, number>;
certainty_emotion_avg?: Record<string, number>;
deontic_emotion_avg?: Record<string, number>;
permission_emotion_avg?: Record<string, number>;
};
type EntityEmotionAggregate = {
post_count: number;
emotion_avg: Record<string, number>;
};
type AverageEmotionPerEntity = {
entity_emotion_avg: Record<string, EntityEmotionAggregate>;
};
type CulturalAnalysisResponse = {
identity_markers?: IdentityMarkers;
stance_markers?: StanceMarkers;
avg_emotion_per_entity?: AverageEmotionPerEntity;
};
// Summary
type SummaryResponse = {
@@ -82,22 +204,36 @@ type SummaryResponse = {
sources: string[];
};
// Filtering Response
// Filter
type FilterResponse = {
rows: number
data: any;
}
rows: number;
data: any;
};
export type {
TopUser,
Vocab,
User,
InteractionGraph,
UserAnalysisResponse,
FrequencyWord,
AverageEmotionByTopic,
SummaryResponse,
TimeAnalysisResponse,
ContentAnalysisResponse,
FilterResponse
}
TopUser,
DominantTopic,
Vocab,
User,
InteractionGraph,
ConversationConcentration,
UserAnalysisResponse,
UserEndpointResponse,
FrequencyWord,
AverageEmotionByTopic,
OverallEmotionAverage,
DominantEmotionDistribution,
EmotionBySource,
SummaryResponse,
TimeAnalysisResponse,
ContentAnalysisResponse,
LinguisticAnalysisResponse,
EmotionalAnalysisResponse,
InteractionAnalysisResponse,
IdentityMarkers,
StanceMarkers,
EntityEmotionAggregate,
AverageEmotionPerEntity,
CulturalAnalysisResponse,
FilterResponse,
};

@@ -0,0 +1,371 @@
type EntityRecord = {
text?: string;
[key: string]: unknown;
};
type DatasetRecord = {
id?: string | number;
post_id?: string | number | null;
parent_id?: string | number | null;
author?: string | null;
title?: string | null;
content?: string | null;
timestamp?: string | number | null;
date?: string | null;
dt?: string | null;
hour?: number | null;
weekday?: string | null;
reply_to?: string | number | null;
source?: string | null;
topic?: string | null;
topic_confidence?: number | null;
type?: string | null;
ner_entities?: EntityRecord[] | null;
emotion_anger?: number | null;
emotion_disgust?: number | null;
emotion_fear?: number | null;
emotion_joy?: number | null;
emotion_sadness?: number | null;
[key: string]: unknown;
};
type CorpusExplorerContext = {
authorByPostId: Map<string, string>;
authorEventCounts: Map<string, number>;
authorCommentCounts: Map<string, number>;
};
type CorpusExplorerSpec = {
title: string;
description: string;
emptyMessage?: string;
matcher: (record: DatasetRecord, context: CorpusExplorerContext) => boolean;
};
const IN_GROUP_PATTERN = /\b(we|us|our|ourselves)\b/gi;
const OUT_GROUP_PATTERN = /\b(they|them|their|themselves)\b/gi;
const HEDGE_PATTERN = /\b(maybe|perhaps|possibly|probably|likely|seems|seem|i think|i feel|i guess|kind of|sort of|somewhat)\b/i;
const CERTAINTY_PATTERN = /\b(definitely|certainly|clearly|obviously|undeniably|always|never)\b/i;
const DEONTIC_PATTERN = /\b(must|should|need|needs|have to|has to|ought|required|require)\b/i;
const PERMISSION_PATTERN = /\b(can|allowed|okay|ok|permitted)\b/i;
const EMOTION_KEYS = [
"emotion_anger",
"emotion_disgust",
"emotion_fear",
"emotion_joy",
"emotion_sadness",
] as const;
const toText = (value: unknown) => {
if (typeof value === "string") {
return value;
}
if (typeof value === "number" || typeof value === "boolean") {
return String(value);
}
if (value && typeof value === "object" && "id" in value) {
const id = (value as { id?: unknown }).id;
if (typeof id === "string" || typeof id === "number") {
return String(id);
}
}
return "";
};
const normalize = (value: unknown) => toText(value).trim().toLowerCase();
const getAuthor = (record: DatasetRecord) => toText(record.author).trim();
const getRecordText = (record: DatasetRecord) =>
`${record.title ?? ""} ${record.content ?? ""}`.trim();
const escapeRegExp = (value: string) =>
value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const buildPhrasePattern = (phrase: string) => {
const tokens = phrase
.toLowerCase()
.trim()
.split(/\s+/)
.filter(Boolean)
.map(escapeRegExp);
if (!tokens.length) {
return null;
}
return new RegExp(`\\b${tokens.join("\\s+")}\\b`, "i");
};
const countMatches = (pattern: RegExp, text: string) =>
Array.from(text.matchAll(new RegExp(pattern.source, "gi"))).length;
const getDateBucket = (record: DatasetRecord) => {
if (typeof record.date === "string" && record.date) {
return record.date.slice(0, 10);
}
if (typeof record.dt === "string" && record.dt) {
return record.dt.slice(0, 10);
}
if (typeof record.timestamp === "number") {
return new Date(record.timestamp * 1000).toISOString().slice(0, 10);
}
if (typeof record.timestamp === "string" && record.timestamp) {
const numeric = Number(record.timestamp);
if (Number.isFinite(numeric)) {
return new Date(numeric * 1000).toISOString().slice(0, 10);
}
}
return "";
};
const getDominantEmotion = (record: DatasetRecord) => {
let bestKey = "";
let bestValue = Number.NEGATIVE_INFINITY;
for (const key of EMOTION_KEYS) {
const value = Number(record[key] ?? Number.NEGATIVE_INFINITY);
if (value > bestValue) {
bestValue = value;
bestKey = key;
}
}
return bestKey.replace("emotion_", "");
};
const matchesPhrase = (record: DatasetRecord, phrase: string) => {
const pattern = buildPhrasePattern(phrase);
if (!pattern) {
return false;
}
return pattern.test(getRecordText(record));
};
const recordIdentityBucket = (record: DatasetRecord) => {
const text = getRecordText(record);
const inHits = countMatches(IN_GROUP_PATTERN, text);
const outHits = countMatches(OUT_GROUP_PATTERN, text);
if (inHits > outHits) {
return "in";
}
if (outHits > inHits) {
return "out";
}
return "tie";
};
const buildExplorerContext = (records: DatasetRecord[]): CorpusExplorerContext => {
const authorByPostId = new Map<string, string>();
const authorEventCounts = new Map<string, number>();
const authorCommentCounts = new Map<string, number>();
for (const record of records) {
const author = getAuthor(record);
if (!author) {
continue;
}
authorEventCounts.set(author, (authorEventCounts.get(author) ?? 0) + 1);
if (record.type === "comment") {
authorCommentCounts.set(author, (authorCommentCounts.get(author) ?? 0) + 1);
}
if (record.post_id !== null && record.post_id !== undefined) {
authorByPostId.set(String(record.post_id), author);
}
}
return { authorByPostId, authorEventCounts, authorCommentCounts };
};
const buildAllRecordsSpec = (): CorpusExplorerSpec => ({
title: "Corpus Explorer",
description: "All records in the current filtered dataset.",
emptyMessage: "No records match the current filters.",
matcher: () => true,
});
const buildUserSpec = (author: string): CorpusExplorerSpec => {
const target = normalize(author);
return {
title: `User: ${author}`,
description: `All records authored by ${author}.`,
emptyMessage: `No records found for ${author}.`,
matcher: (record) => normalize(record.author) === target,
};
};
const buildTopicSpec = (topic: string): CorpusExplorerSpec => {
const target = normalize(topic);
return {
title: `Topic: ${topic}`,
description: `Records assigned to the ${topic} topic bucket.`,
emptyMessage: `No records found in the ${topic} topic bucket.`,
matcher: (record) => normalize(record.topic) === target,
};
};
const buildDateBucketSpec = (date: string): CorpusExplorerSpec => ({
title: `Date Bucket: ${date}`,
description: `Records from the ${date} activity bucket.`,
emptyMessage: `No records found on ${date}.`,
matcher: (record) => getDateBucket(record) === date,
});
const buildWordSpec = (word: string): CorpusExplorerSpec => ({
title: `Word: ${word}`,
description: `Records containing the word ${word}.`,
emptyMessage: `No records mention ${word}.`,
matcher: (record) => matchesPhrase(record, word),
});
const buildNgramSpec = (ngram: string): CorpusExplorerSpec => ({
title: `N-gram: ${ngram}`,
description: `Records containing the phrase ${ngram}.`,
emptyMessage: `No records contain the phrase ${ngram}.`,
matcher: (record) => matchesPhrase(record, ngram),
});
const buildEntitySpec = (entity: string): CorpusExplorerSpec => {
const target = normalize(entity);
return {
title: `Entity: ${entity}`,
description: `Records mentioning the ${entity} entity.`,
emptyMessage: `No records found for the ${entity} entity.`,
matcher: (record) => {
const entities = Array.isArray(record.ner_entities) ? record.ner_entities : [];
return entities.some((item) => normalize(item?.text) === target) || matchesPhrase(record, entity);
},
};
};
const buildSourceSpec = (source: string): CorpusExplorerSpec => {
const target = normalize(source);
return {
title: `Source: ${source}`,
description: `Records from the ${source} source.`,
emptyMessage: `No records found for ${source}.`,
matcher: (record) => normalize(record.source) === target,
};
};
const buildDominantEmotionSpec = (emotion: string): CorpusExplorerSpec => {
const target = normalize(emotion);
return {
title: `Dominant Emotion: ${emotion}`,
description: `Records where ${emotion} is the strongest emotion score.`,
emptyMessage: `No records found with dominant emotion ${emotion}.`,
matcher: (record) => getDominantEmotion(record) === target,
};
};
const buildReplyPairSpec = (source: string, target: string): CorpusExplorerSpec => {
const sourceName = normalize(source);
const targetName = normalize(target);
return {
title: `Reply Path: ${source} -> ${target}`,
description: `Reply records authored by ${source} in response to ${target}.`,
emptyMessage: `No reply records found for ${source} -> ${target}.`,
matcher: (record, context) => {
if (normalize(record.author) !== sourceName) {
return false;
}
const replyTo = record.reply_to;
if (replyTo === null || replyTo === undefined || replyTo === "") {
return false;
}
return normalize(context.authorByPostId.get(String(replyTo))) === targetName;
},
};
};
const buildOneTimeUsersSpec = (): CorpusExplorerSpec => ({
title: "One-Time Users",
description: "Records written by authors who appear exactly once in the filtered corpus.",
emptyMessage: "No one-time-user records found.",
matcher: (record, context) => {
const author = getAuthor(record);
return !!author && context.authorEventCounts.get(author) === 1;
},
});
const buildIdentityBucketSpec = (bucket: "in" | "out" | "tie"): CorpusExplorerSpec => {
const labels = {
in: "In-Group Posts",
out: "Out-Group Posts",
tie: "Balanced Posts",
} as const;
return {
title: labels[bucket],
description: `Records in the ${labels[bucket].toLowerCase()} cultural bucket.`,
emptyMessage: `No records found for ${labels[bucket].toLowerCase()}.`,
matcher: (record) => recordIdentityBucket(record) === bucket,
};
};
const buildPatternSpec = (
title: string,
description: string,
pattern: RegExp,
): CorpusExplorerSpec => ({
title,
description,
emptyMessage: `No records found for ${title.toLowerCase()}.`,
matcher: (record) => pattern.test(getRecordText(record)),
});
const buildHedgeSpec = () =>
buildPatternSpec("Hedging Words", "Records containing hedging language.", HEDGE_PATTERN);
const buildCertaintySpec = () =>
buildPatternSpec("Certainty Words", "Records containing certainty language.", CERTAINTY_PATTERN);
const buildDeonticSpec = () =>
buildPatternSpec("Need/Should Words", "Records containing deontic language.", DEONTIC_PATTERN);
const buildPermissionSpec = () =>
buildPatternSpec("Permission Words", "Records containing permission language.", PERMISSION_PATTERN);
export type { DatasetRecord, CorpusExplorerSpec };
export {
buildAllRecordsSpec,
buildCertaintySpec,
buildDateBucketSpec,
buildDeonticSpec,
buildDominantEmotionSpec,
buildEntitySpec,
buildExplorerContext,
buildHedgeSpec,
buildIdentityBucketSpec,
buildNgramSpec,
buildOneTimeUsersSpec,
buildPermissionSpec,
buildReplyPairSpec,
buildSourceSpec,
buildTopicSpec,
buildUserSpec,
buildWordSpec,
getDateBucket,
toText,
};
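Each spec pairs display text with a `matcher` that receives a record and a precomputed context. The following is a minimal sketch of how such a spec might be applied to a record list. The `applySpec` helper, the `buildContext` name, and the sample records are hypothetical and not part of this diff; the types are trimmed local copies so the block runs standalone.

```typescript
// Trimmed local stand-ins for the types in this file (assumption: only the
// fields used below are needed for the illustration).
type DatasetRecord = {
  author?: string | null;
  type?: string | null;
  content?: string | null;
};

type CorpusExplorerContext = {
  authorEventCounts: Map<string, number>;
};

type CorpusExplorerSpec = {
  title: string;
  matcher: (record: DatasetRecord, context: CorpusExplorerContext) => boolean;
};

// Count how many records each author wrote, mirroring buildExplorerContext.
const buildContext = (records: DatasetRecord[]): CorpusExplorerContext => {
  const authorEventCounts = new Map<string, number>();
  for (const record of records) {
    const author = (record.author ?? "").trim();
    if (!author) continue;
    authorEventCounts.set(author, (authorEventCounts.get(author) ?? 0) + 1);
  }
  return { authorEventCounts };
};

// Same rule as buildOneTimeUsersSpec: keep authors that appear exactly once.
const oneTimeUsersSpec: CorpusExplorerSpec = {
  title: "One-Time Users",
  matcher: (record, context) => {
    const author = (record.author ?? "").trim();
    return !!author && context.authorEventCounts.get(author) === 1;
  },
};

// Hypothetical helper: build the context once, then filter with the matcher.
const applySpec = (spec: CorpusExplorerSpec, records: DatasetRecord[]) => {
  const context = buildContext(records);
  return records.filter((record) => spec.matcher(record, context));
};

const sample: DatasetRecord[] = [
  { author: "alice", type: "post", content: "We should meet" },
  { author: "alice", type: "comment", content: "Maybe later" },
  { author: "bob", type: "comment", content: "They never reply" },
];

const oneTimers = applySpec(oneTimeUsersSpec, sample);
console.log(oneTimers.map((record) => record.author));
```

Building the context once per record list, rather than inside each matcher call, keeps the filter linear in the number of records.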

@@ -0,0 +1,20 @@
const DEFAULT_TITLE = "Ethnograph View";
const STATIC_TITLES: Record<string, string> = {
"/login": "Sign In",
"/upload": "Upload Dataset",
"/auto-fetch": "Auto Fetch Dataset",
"/datasets": "My Datasets",
};
export const getDocumentTitle = (pathname: string) => {
if (pathname.includes("status")) {
return "Processing Dataset";
}
if (pathname.includes("stats")) {
return "Ethnography Analysis";
}
return STATIC_TITLES[pathname] ?? DEFAULT_TITLE;
};
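A few sample paths show how the lookup resolves. The function body is copied from the file above so the sketch runs standalone; the sample paths such as `/dataset/3/status` are hypothetical routes chosen for illustration, not taken from the application's router.

```typescript
const DEFAULT_TITLE = "Ethnograph View";

const STATIC_TITLES: Record<string, string> = {
  "/login": "Sign In",
  "/upload": "Upload Dataset",
  "/auto-fetch": "Auto Fetch Dataset",
  "/datasets": "My Datasets",
};

const getDocumentTitle = (pathname: string) => {
  // Substring checks run first, so they take priority over the static table.
  if (pathname.includes("status")) {
    return "Processing Dataset";
  }
  if (pathname.includes("stats")) {
    return "Ethnography Analysis";
  }
  // Exact-match lookup, falling back to the default title.
  return STATIC_TITLES[pathname] ?? DEFAULT_TITLE;
};

// Hypothetical paths covering each branch.
const samplePaths = ["/login", "/dataset/3/status", "/dataset/3/stats", "/unknown"];
for (const path of samplePaths) {
  console.log(path, "->", getDocumentTitle(path));
}
```

Since `includes` matches anywhere in the path, any future route whose path happens to contain "status" or "stats" would also receive these titles; an exact or prefix match would be stricter if that ever becomes a concern.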

@@ -1,4 +0,0 @@
import server.app
if __name__ == "__main__":
    server.app.app.run(debug=True)

BIN report/img/analysis_bar.png Normal file (binary file not shown, 26 KiB)
BIN report/img/architecture.png Normal file (binary file not shown, 70 KiB)
BIN (file name not captured) (binary file not shown, 274 KiB)
BIN (file name not captured) (binary file not shown, 90 KiB)
BIN report/img/frontend.png Normal file (binary file not shown, 302 KiB)
BIN report/img/gantt.png Normal file (binary file not shown, 50 KiB)
BIN report/img/heatmap.png Normal file (binary file not shown, 86 KiB)
BIN (file name not captured) (binary file not shown, 114 KiB)
BIN report/img/kpi_card.png Normal file (binary file not shown, 8.7 KiB)
BIN report/img/moods.png Normal file (binary file not shown, 16 KiB)
BIN report/img/navbar.png Normal file (binary file not shown, 14 KiB)
BIN report/img/ngrams.png Normal file (binary file not shown, 38 KiB)
BIN report/img/nlp_backoff.png Normal file (binary file not shown, 143 KiB)
BIN report/img/pipeline.png Normal file (binary file not shown, 26 KiB)
BIN report/img/reddit_bot.png Normal file (binary file not shown, 232 KiB)
BIN report/img/schema.png Normal file (binary file not shown, 64 KiB)
BIN report/img/signature.jpg Normal file (binary file not shown, 152 KiB)
BIN (file name not captured) (binary file not shown, 111 KiB)
BIN (file name not captured) (binary file not shown, 17 KiB)
BIN report/img/ucc_crest.png Normal file (binary file not shown, 27 KiB)

1401
report/main.tex Normal file

File diff suppressed because it is too large

149
report/references.bib Normal file

@@ -0,0 +1,149 @@
@online{reddit_api,
author = {{Reddit Inc.}},
title = {Reddit API Documentation},
year = {2025},
url = {https://www.reddit.com/dev/api/},
urldate = {2026-04-08}
}
@misc{hartmann2022emotionenglish,
author={Hartmann, Jochen},
title={Emotion English DistilRoBERTa-base},
year={2022},
howpublished = {\url{https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/}},
}
@misc{all_mpnet_base_v2,
author={Microsoft Research},
title={All-MPNet-Base-V2},
year={2021},
howpublished = {\url{https://huggingface.co/sentence-transformers/all-mpnet-base-v2}},
}
@misc{minilm_l6_v2,
author={Microsoft Research},
title={MiniLM-L6-V2},
year={2021},
howpublished = {\url{https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2}},
}
@misc{dslim_bert_base_ner,
author={deepset},
title={dslim/bert-base-NER},
year={2018},
howpublished = {\url{https://huggingface.co/dslim/bert-base-NER}},
}
@inproceedings{demszky2020goemotions,
author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
title = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
year = {2020}
}
@article{dominguez2007virtual,
author = {Domínguez, Daniel and Beaulieu, Anne and Estalella, Adolfo and Gómez, Edgar and Schnettler, Bernt and Read, Rosie},
title = {Virtual Ethnography},
journal = {Forum Qualitative Sozialforschung / Forum: Qualitative Social Research},
year = {2007},
volume = {8},
number = {3},
url = {http://nbn-resolving.de/urn:nbn:de:0114-fqs0703E19}
}
@article{sun2014lurkers,
author = {Sun, Na and Rau, Pei-Luen Patrick and Ma, Liang},
title = {Understanding Lurkers in Online Communities: A Literature Review},
journal = {Computers in Human Behavior},
year = {2014},
volume = {38},
pages = {110--117},
doi = {10.1016/j.chb.2014.05.022}
}
@article{ahmad2024sentiment,
author = {Ahmad, Waqar and others},
title = {Recent Advancements and Challenges of NLP-based Sentiment Analysis: A State-of-the-art Review},
journal = {Natural Language Processing Journal},
year = {2024},
doi = {10.1016/j.nlp.2024.100059}
}
@article{coleman2010ethnographic,
ISSN = {00846570},
URL = {http://www.jstor.org/stable/25735124},
abstract = {This review surveys and divides the ethnographic corpus on digital media into three broad but overlapping categories: the cultural politics of digital media, the vernacular cultures of digital media, and the prosaics of digital media. Engaging these three categories of scholarship on digital media, I consider how ethnographers are exploring the complex relationships between the local practices and global implications of digital media, their materiality and politics, and their banal, as well as profound, presence in cultural life and modes of communication. I consider the way these media have become central to the articulation of cherished beliefs, ritual practices, and modes of being in the world; the fact that digital media culturally matters is undeniable but showing how, where, and why it matters is necessary to push against peculiarly narrow presumptions about the universality of digital experience.},
author = {E. Gabriella Coleman},
journal = {Annual Review of Anthropology},
pages = {487--505},
publisher = {Annual Reviews},
title = {Ethnographic Approaches to Digital Media},
urldate = {2026-04-15},
volume = {39},
year = {2010}
}
@article{shen2021stance,
author = {Shen, Qian and Tao, Yating},
title = {Stance Markers in {English} Medical Research Articles and Newspaper Opinion Columns: A Comparative Corpus-Based Study},
journal = {PLOS ONE},
volume = {16},
number = {3},
pages = {e0247981},
year = {2021},
doi = {10.1371/journal.pone.0247981}
}
@incollection{medvedev2019anatomy,
author = {Medvedev, Alexey N. and Lambiotte, Renaud and Delvenne, Jean-Charles},
title = {The Anatomy of Reddit: An Overview of Academic Research},
booktitle = {Dynamics On and Of Complex Networks III},
series = {Springer Proceedings in Complexity},
publisher = {Springer},
year = {2019},
pages = {183--204}
}
@misc{cook2023ethnography,
author = {Cook, Chloe},
title = {What is the Difference Between Ethnography and Digital Ethnography?},
year = {2023},
month = jan,
day = {19},
howpublished = {\url{https://ethosapp.com/blog/what-is-the-difference-between-ethnography-and-digital-ethnography/}},
note = {Accessed: 2026-04-16},
organization = {EthOS}
}
@misc{giuffre2026sentiment,
author = {Giuffre, Steven},
title = {What is Sentiment Analysis?},
year = {2026},
month = mar,
howpublished = {\url{https://www.vonage.com/resources/articles/sentiment-analysis/}},
note = {Accessed: 2026-04-16},
organization = {Vonage}
}
@misc{mungalpara2022stemming,
author = {Mungalpara, Jaimin},
title = {Stemming Lemmatization Stopwords and {N}-Grams in {NLP}},
year = {2022},
month = jul,
day = {26},
howpublished = {\url{https://jaimin-ml2001.medium.com/stemming-lemmatization-stopwords-and-n-grams-in-nlp-96f8e8b6aa6f}},
note = {Accessed: 2026-04-16},
organization = {Medium}
}
@misc{chugani2025ethicalscraping,
author = {Chugani, Vinod},
title = {Ethical Web Scraping: Principles and Practices},
year = {2025},
month = apr,
day = {21},
howpublished = {\url{https://www.datacamp.com/blog/ethical-web-scraping}},
note = {Accessed: 2026-04-16},
organization = {DataCamp}
}

@@ -1,12 +1,19 @@
beautifulsoup4==4.14.3
Flask==3.1.2
celery==5.6.2
redis==7.2.1
Flask==3.1.3
Flask_Bcrypt==1.0.1
flask_cors==6.0.2
Flask_JWT_Extended==4.7.1
google_api_python_client==2.188.0
keybert==0.9.0
nltk==3.9.2
pandas==3.0.0
python-dotenv==1.2.1
numpy==2.4.2
pandas==3.0.1
psycopg2==2.9.11
psycopg2_binary==2.9.11
python-dotenv==1.2.2
Requests==2.32.5
sentence_transformers==5.2.2
torch==2.10.0
transformers==5.1.0
gunicorn==25.3.0

@@ -1,35 +1,33 @@
import pandas as pd
import re
from collections import Counter
from typing import Any
class CulturalAnalysis:
def __init__(self, df: pd.DataFrame, content_col: str = "content", topic_col: str = "topic"):
self.df = df
def __init__(self, content_col: str = "content", topic_col: str = "topic"):
self.content_col = content_col
self.topic_col = topic_col
def get_identity_markers(self):
df = self.df.copy()
def get_identity_markers(self, original_df: pd.DataFrame) -> dict[str, Any]:
df = original_df.copy()
s = df[self.content_col].fillna("").astype(str).str.lower()
in_group_words = {"we", "us", "our", "ourselves"}
out_group_words = {"they", "them", "their", "themselves"}
emotion_exclusions = {"emotion_neutral", "emotion_surprise"}
emotion_cols = [
c for c in df.columns
c
for c in df.columns
if c.startswith("emotion_") and c not in emotion_exclusions
]
# Tokenize per row
tokens_per_row = s.apply(lambda txt: re.findall(r"\b[a-z]{2,}\b", txt))
in_pattern = re.compile(r"\b(we|us|our|ourselves)\b")
out_pattern = re.compile(r"\b(they|them|their|themselves)\b")
token_pattern = re.compile(r"\b[a-z]{2,}\b")
total_tokens = int(tokens_per_row.map(len).sum())
in_hits = tokens_per_row.map(lambda toks: sum(t in in_group_words for t in toks)).astype(int)
out_hits = tokens_per_row.map(lambda toks: sum(t in out_group_words for t in toks)).astype(int)
in_hits = s.str.count(in_pattern)
out_hits = s.str.count(out_pattern)
total_tokens = s.str.count(token_pattern).sum()
in_count = int(in_hits.sum())
out_count = int(out_hits.sum())
@@ -43,7 +41,6 @@ class CulturalAnalysis:
"out_group_usage": out_count,
"in_group_ratio": round(in_count / max(total_tokens, 1), 5),
"out_group_ratio": round(out_count / max(total_tokens, 1), 5),
"in_group_posts": int(in_mask.sum()),
"out_group_posts": int(out_mask.sum()),
"tie_posts": int(tie_mask.sum()),
@@ -52,103 +49,131 @@ class CulturalAnalysis:
if emotion_cols:
emo = df[emotion_cols].apply(pd.to_numeric, errors="coerce").fillna(0.0)
in_avg = emo.loc[in_mask].mean() if in_mask.any() else pd.Series(0.0, index=emotion_cols)
out_avg = emo.loc[out_mask].mean() if out_mask.any() else pd.Series(0.0, index=emotion_cols)
in_avg = (
emo.loc[in_mask].mean()
if in_mask.any()
else pd.Series(0.0, index=emotion_cols)
)
out_avg = (
emo.loc[out_mask].mean()
if out_mask.any()
else pd.Series(0.0, index=emotion_cols)
)
result["in_group_emotion_avg"] = in_avg.to_dict()
result["out_group_emotion_avg"] = out_avg.to_dict()
return result
def get_stance_markers(self) -> dict[str, Any]:
s = self.df[self.content_col].fillna("").astype(str)
hedges = {
"maybe", "perhaps", "possibly", "probably", "likely", "seems", "seem",
"i think", "i feel", "i guess", "kind of", "sort of", "somewhat"
}
certainty = {
"definitely", "certainly", "clearly", "obviously", "undeniably", "always", "never"
}
def get_stance_markers(self, df: pd.DataFrame) -> dict[str, Any]:
s = df[self.content_col].fillna("").astype(str).str.lower()  # the patterns below are lowercase-only
emotion_exclusions = {"emotion_neutral", "emotion_surprise"}
emotion_cols = [
c
for c in df.columns
if c.startswith("emotion_") and c not in emotion_exclusions
]
deontic = {
"must", "should", "need", "needs", "have to", "has to", "ought", "required", "require"
}
hedge_pattern = re.compile(
r"\b(maybe|perhaps|possibly|probably|likely|seems|seem|i think|i feel|i guess|kind of|sort of|somewhat)\b"
)
certainty_pattern = re.compile(
r"\b(definitely|certainly|clearly|obviously|undeniably|always|never)\b"
)
deontic_pattern = re.compile(
r"\b(must|should|need|needs|have to|has to|ought|required|require)\b"
)
permission_pattern = re.compile(r"\b(can|allowed|okay|ok|permitted)\b")
permission = {"can", "allowed", "okay", "ok", "permitted"}
hedge_counts = s.str.count(hedge_pattern)
certainty_counts = s.str.count(certainty_pattern)
deontic_counts = s.str.count(deontic_pattern)
perm_counts = s.str.count(permission_pattern)
def count_phrases(text: str, phrases: set[str]) -> int:
c = 0
for p in phrases:
if " " in p:
c += len(re.findall(r"\b" + re.escape(p) + r"\b", text))
else:
c += len(re.findall(r"\b" + re.escape(p) + r"\b", text))
return c
token_counts = s.apply(lambda t: len(re.findall(r"\b[a-z]{2,}\b", t))).replace(
0, 1
)
hedge_counts = s.apply(lambda t: count_phrases(t, hedges))
certainty_counts = s.apply(lambda t: count_phrases(t, certainty))
deontic_counts = s.apply(lambda t: count_phrases(t, deontic))
perm_counts = s.apply(lambda t: count_phrases(t, permission))
token_counts = s.apply(lambda t: len(re.findall(r"\b[a-z]{2,}\b", t))).replace(0, 1)
return {
result = {
"hedge_total": int(hedge_counts.sum()),
"certainty_total": int(certainty_counts.sum()),
"deontic_total": int(deontic_counts.sum()),
"permission_total": int(perm_counts.sum()),
"hedge_per_1k_tokens": round(1000 * hedge_counts.sum() / token_counts.sum(), 3),
"certainty_per_1k_tokens": round(1000 * certainty_counts.sum() / token_counts.sum(), 3),
"deontic_per_1k_tokens": round(1000 * deontic_counts.sum() / token_counts.sum(), 3),
"permission_per_1k_tokens": round(1000 * perm_counts.sum() / token_counts.sum(), 3),
"hedge_per_1k_tokens": round(
1000 * hedge_counts.sum() / token_counts.sum(), 3
),
"certainty_per_1k_tokens": round(
1000 * certainty_counts.sum() / token_counts.sum(), 3
),
"deontic_per_1k_tokens": round(
1000 * deontic_counts.sum() / token_counts.sum(), 3
),
"permission_per_1k_tokens": round(
1000 * perm_counts.sum() / token_counts.sum(), 3
),
}
def get_avg_emotions_per_entity(self, top_n: int = 25, min_posts: int = 10) -> dict[str, Any]:
if "entities" not in self.df.columns:
if emotion_cols:
emo = df[emotion_cols].apply(pd.to_numeric, errors="coerce").fillna(0.0)
result["hedge_emotion_avg"] = (
emo.loc[hedge_counts > 0].mean()
if (hedge_counts > 0).any()
else pd.Series(0.0, index=emotion_cols)
).to_dict()
result["certainty_emotion_avg"] = (
emo.loc[certainty_counts > 0].mean()
if (certainty_counts > 0).any()
else pd.Series(0.0, index=emotion_cols)
).to_dict()
result["deontic_emotion_avg"] = (
emo.loc[deontic_counts > 0].mean()
if (deontic_counts > 0).any()
else pd.Series(0.0, index=emotion_cols)
).to_dict()
result["permission_emotion_avg"] = (
emo.loc[perm_counts > 0].mean()
if (perm_counts > 0).any()
else pd.Series(0.0, index=emotion_cols)
).to_dict()
return result
def get_avg_emotions_per_entity(
self, df: pd.DataFrame, top_n: int = 25, min_posts: int = 10
) -> dict[str, Any]:
if "ner_entities" not in df.columns:
return {"entity_emotion_avg": {}}
df = self.df
emotion_cols = [c for c in df.columns if c.startswith("emotion_")]
entity_counter = Counter()
entity_df = df[["ner_entities"] + emotion_cols].explode("ner_entities")
for row in df["entities"].dropna():
if isinstance(row, list):
for ent in row:
if isinstance(ent, dict):
text = ent.get("text")
if isinstance(text, str):
text = text.strip()
if len(text) >= 3: # filter short junk
entity_counter[text] += 1
top_entities = entity_counter.most_common(top_n)
entity_df["entity_text"] = entity_df["ner_entities"].apply(
lambda e: (
e.get("text").strip()
if isinstance(e, dict)
and isinstance(e.get("text"), str)
and len(e.get("text").strip()) >= 3  # measure length after stripping whitespace
else None
)
)
entity_df = entity_df.dropna(subset=["entity_text"])
entity_counts = entity_df["entity_text"].value_counts().head(top_n)
entity_emotion_avg = {}
for entity_text, _ in top_entities:
mask = df["entities"].apply(
lambda ents: isinstance(ents, list) and
any(isinstance(e, dict) and e.get("text") == entity_text for e in ents)
)
post_count = int(mask.sum())
if post_count >= min_posts:
for entity_text, count in entity_counts.items():
if count >= min_posts:
emo_means = (
df.loc[mask, emotion_cols]
.apply(pd.to_numeric, errors="coerce")
.fillna(0.0)
entity_df[entity_df["entity_text"] == entity_text][emotion_cols]
.mean()
.to_dict()
)
entity_emotion_avg[entity_text] = {
"post_count": post_count,
"emotion_avg": emo_means
"post_count": int(count),
"emotion_avg": emo_means,
}
return {
"entity_emotion_avg": entity_emotion_avg
}
return {"entity_emotion_avg": entity_emotion_avg}
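The stance-marker counting in this file can be sketched without pandas. This is a minimal illustration, not the project's implementation: the function name `stance_rates` and the shortened marker lists are assumptions made for the example, and the text is lowercased first because the patterns only contain lowercase letters.

```python
import re

# Hypothetical, shortened marker lists; the patterns in the diff are longer.
HEDGE = re.compile(r"\b(maybe|perhaps|i think|sort of)\b")
CERTAINTY = re.compile(r"\b(definitely|clearly|always|never)\b")
TOKEN = re.compile(r"\b[a-z]{2,}\b")


def stance_rates(texts: list[str]) -> dict[str, float]:
    """Count hedge and certainty markers and scale them per 1,000 tokens."""
    lowered = [t.lower() for t in texts]  # patterns match lowercase text only
    # Guard against division by zero when no tokens are found.
    tokens = max(sum(len(TOKEN.findall(t)) for t in lowered), 1)
    hedges = sum(len(HEDGE.findall(t)) for t in lowered)
    certain = sum(len(CERTAINTY.findall(t)) for t in lowered)
    return {
        "hedge_per_1k_tokens": round(1000 * hedges / tokens, 3),
        "certainty_per_1k_tokens": round(1000 * certain / tokens, 3),
    }
```

Normalising per 1,000 tokens keeps the rates comparable between datasets of different sizes, which raw totals are not.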

@@ -1,36 +1,86 @@
import pandas as pd
class EmotionalAnalysis:
def __init__(self, df: pd.DataFrame):
self.df = df
def avg_emotion_by_topic(self) -> dict:
emotion_cols = [
col for col in self.df.columns
if col.startswith("emotion_")
]
class EmotionalAnalysis:
def _emotion_cols(self, df: pd.DataFrame) -> list[str]:
return [col for col in df.columns if col.startswith("emotion_")]
def avg_emotion_by_topic(self, df: pd.DataFrame) -> list[dict]:
emotion_cols = self._emotion_cols(df)
if not emotion_cols:
return []
counts = (
self.df[
(self.df["topic"] != "Misc")
]
.groupby("topic")
.size()
.rename("n")
df[(df["topic"] != "Misc")].groupby("topic").size().reset_index(name="n")
)
avg_emotion_by_topic = (
self.df[
(self.df["topic"] != "Misc")
]
df[(df["topic"] != "Misc")]
.groupby("topic")[emotion_cols]
.mean()
.reset_index()
)
avg_emotion_by_topic = avg_emotion_by_topic.merge(
counts,
on="topic"
)
avg_emotion_by_topic = avg_emotion_by_topic.merge(counts, on="topic")
return avg_emotion_by_topic.to_dict(orient='records')
return avg_emotion_by_topic.to_dict(orient="records")
def overall_emotion_average(self, df: pd.DataFrame) -> list[dict]:
emotion_cols = self._emotion_cols(df)
if not emotion_cols:
return []
means = df[emotion_cols].mean()
return [
{
"emotion": col.replace("emotion_", ""),
"score": float(means[col]),
}
for col in emotion_cols
]
def dominant_emotion_distribution(self, df: pd.DataFrame) -> list[dict]:
emotion_cols = self._emotion_cols(df)
if not emotion_cols or df.empty:
return []
dominant_per_row = df[emotion_cols].idxmax(axis=1)
counts = dominant_per_row.value_counts()
total = max(len(dominant_per_row), 1)
return [
{
"emotion": col.replace("emotion_", ""),
"count": int(count),
"ratio": round(float(count / total), 4),
}
for col, count in counts.items()
]
def emotion_by_source(self, df: pd.DataFrame) -> list[dict]:
emotion_cols = self._emotion_cols(df)
if not emotion_cols or "source" not in df.columns or df.empty:
return []
source_counts = df.groupby("source").size()
source_means = df.groupby("source")[emotion_cols].mean().reset_index()
rows = source_means.to_dict(orient="records")
output = []
for row in rows:
source = row["source"]
dominant_col = max(emotion_cols, key=lambda col: float(row.get(col, 0)))
output.append(
{
"source": str(source),
"dominant_emotion": dominant_col.replace("emotion_", ""),
"dominant_score": round(float(row.get(dominant_col, 0)), 4),
"event_count": int(source_counts.get(source, 0)),
}
)
return output
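The idea behind `dominant_emotion_distribution` (pick the highest-scoring emotion per row, then count how often each one wins) can be shown in plain Python. This is a sketch under assumptions: rows are represented as dictionaries of `emotion_*` scores rather than a DataFrame, and the sample scores used to exercise it are invented.

```python
from collections import Counter


def dominant_emotion_distribution(rows: list[dict[str, float]]) -> list[dict]:
    """Return how often each emotion is the top-scoring one across rows."""
    # For each non-empty row, keep the name of the column with the highest score.
    dominant = [
        max(row, key=row.get).replace("emotion_", "") for row in rows if row
    ]
    total = max(len(dominant), 1)  # avoid dividing by zero on empty input
    return [
        {"emotion": emotion, "count": count, "ratio": round(count / total, 4)}
        for emotion, count in Counter(dominant).most_common()
    ]
```

A distribution of dominant emotions answers a different question from the column means: a few very strong scores can raise a mean without that emotion dominating many posts.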

@@ -0,0 +1,42 @@
import pandas as pd
from server.analysis.nlp import NLP
class DatasetEnrichment:
def __init__(self, df: pd.DataFrame, topics: dict):
self.df = self._explode_comments(df)
self.topics = topics
self.nlp = NLP(self.df, "title", "content", self.topics)
def _explode_comments(self, df) -> pd.DataFrame:
comments_df = df[["id", "comments"]].explode("comments")
comments_df = comments_df[
comments_df["comments"].apply(lambda x: isinstance(x, dict))
]
comments_df = pd.json_normalize(comments_df["comments"])
posts_df = df.drop(columns=["comments"])
posts_df["type"] = "post"
posts_df["parent_id"] = None
comments_df["type"] = "comment"
comments_df["parent_id"] = comments_df.get("post_id")
df = pd.concat([posts_df, comments_df])
df.drop(columns=["post_id"], inplace=True, errors="ignore")
return df
def enrich(self) -> pd.DataFrame:
self.df["timestamp"] = pd.to_numeric(self.df["timestamp"], errors="raise")
self.df["date"] = pd.to_datetime(self.df["timestamp"], unit="s").dt.date
self.df["dt"] = pd.to_datetime(self.df["timestamp"], unit="s", utc=True)
self.df["hour"] = self.df["dt"].dt.hour
self.df["weekday"] = self.df["dt"].dt.day_name()
self.nlp.add_emotion_cols()
self.nlp.add_topic_col()
self.nlp.add_ner_cols()
return self.df

@@ -1,132 +1,22 @@
import pandas as pd
import re
from collections import Counter
class InteractionAnalysis:
def __init__(self, df: pd.DataFrame, word_exclusions: set[str]):
self.df = df
def __init__(self, word_exclusions: set[str]):
self.word_exclusions = word_exclusions
def _tokenize(self, text: str):
tokens = re.findall(r"\b[a-z]{3,}\b", text)
return [t for t in tokens if t not in self.word_exclusions]
def _vocab_richness_per_user(
self, min_words: int = 20, top_most_used_words: int = 100
) -> list:
df = self.df.copy()
df["content"] = df["content"].fillna("").astype(str).str.lower()
df["tokens"] = df["content"].apply(self._tokenize)
def interaction_graph(self, df: pd.DataFrame):
interactions = {a: {} for a in df["author"].dropna().unique()}
rows = []
for author, group in df.groupby("author"):
all_tokens = [t for tokens in group["tokens"] for t in tokens]
# reply_to refers to the comment id, this allows us to map comment/post ids to usernames
id_to_author = df.set_index("post_id")["author"].to_dict()
total_words = len(all_tokens)
unique_words = len(set(all_tokens))
events = len(group)
# Min amount of words for a user, any less than this might give weird results
if total_words < min_words:
continue
# 100% = they never reused a word (excluding stop words)
vocab_richness = unique_words / total_words
avg_words = total_words / max(events, 1)
counts = Counter(all_tokens)
top_words = [
{"word": w, "count": int(c)}
for w, c in counts.most_common(top_most_used_words)
]
rows.append(
{
"author": author,
"events": int(events),
"total_words": int(total_words),
"unique_words": int(unique_words),
"vocab_richness": round(vocab_richness, 3),
"avg_words_per_event": round(avg_words, 2),
"top_words": top_words,
}
)
rows = sorted(rows, key=lambda x: x["vocab_richness"], reverse=True)
return rows
def top_users(self) -> list:
counts = (
self.df.groupby(["author", "source"]).size().sort_values(ascending=False)
)
top_users = [
{"author": author, "source": source, "count": int(count)}
for (author, source), count in counts.items()
]
return top_users
def per_user_analysis(self) -> dict:
per_user = self.df.groupby(["author", "type"]).size().unstack(fill_value=0)
emotion_cols = [col for col in self.df.columns if col.startswith("emotion_")]
avg_emotions_by_author = {}
if emotion_cols:
avg_emotions = self.df.groupby("author")[emotion_cols].mean().fillna(0.0)
avg_emotions_by_author = {
author: {emotion: float(score) for emotion, score in row.items()}
for author, row in avg_emotions.iterrows()
}
# ensure columns always exist
for col in ("post", "comment"):
if col not in per_user.columns:
per_user[col] = 0
per_user["comment_post_ratio"] = per_user["comment"] / per_user["post"].replace(
0, 1
)
per_user["comment_share"] = per_user["comment"] / (
per_user["post"] + per_user["comment"]
).replace(0, 1)
per_user = per_user.sort_values("comment_post_ratio", ascending=True)
per_user_records = per_user.reset_index().to_dict(orient="records")
vocab_rows = self._vocab_richness_per_user()
vocab_by_author = {row["author"]: row for row in vocab_rows}
# merge vocab richness + per_user information
merged_users = []
for row in per_user_records:
author = row["author"]
merged_users.append(
{
"author": author,
"post": int(row.get("post", 0)),
"comment": int(row.get("comment", 0)),
"comment_post_ratio": float(row.get("comment_post_ratio", 0)),
"comment_share": float(row.get("comment_share", 0)),
"avg_emotions": avg_emotions_by_author.get(author, {}),
"vocab": vocab_by_author.get(author, {"vocab_richness": 0, "avg_words_per_event": 0, "top_words": []}),
}
)
merged_users.sort(key=lambda u: u["comment_post_ratio"])
return merged_users
def interaction_graph(self):
interactions = {a: {} for a in self.df["author"].dropna().unique()}
# reply_to refers to the comment id, this allows us to map comment ids to usernames
id_to_author = self.df.set_index("id")["author"].to_dict()
for _, row in self.df.iterrows():
for _, row in df.iterrows():
a = row["author"]
reply_id = row["reply_to"]
@@ -141,89 +31,40 @@ class InteractionAnalysis:
return interactions
def average_thread_depth(self):
depths = []
id_to_reply = self.df.set_index("id")["reply_to"].to_dict()
for _, row in self.df.iterrows():
depth = 0
current_id = row["id"]
def top_interaction_pairs(self, df: pd.DataFrame, top_n=10):
graph = self.interaction_graph(df)
pairs = []
while True:
reply_to = id_to_reply.get(current_id)
if pd.isna(reply_to) or reply_to == "":
break
for a, targets in graph.items():
for b, count in targets.items():
pairs.append(((a, b), count))
depth += 1
current_id = reply_to
pairs.sort(key=lambda x: x[1], reverse=True)
return pairs[:top_n]
depths.append(depth)
def conversation_concentration(self, df: pd.DataFrame) -> dict:
if "type" not in df.columns:
return {}
if not depths:
return 0
comments = df[df["type"] == "comment"]
if comments.empty:
return {}
return round(sum(depths) / len(depths), 2)
author_counts = comments["author"].value_counts()
total_comments = len(comments)
total_authors = len(author_counts)
def average_thread_length_by_emotion(self):
emotion_exclusions = {"emotion_neutral", "emotion_surprise"}
emotion_cols = [
c
for c in self.df.columns
if c.startswith("emotion_") and c not in emotion_exclusions
]
id_to_reply = self.df.set_index("id")["reply_to"].to_dict()
length_cache = {}
def thread_length_from(start_id):
if start_id in length_cache:
return length_cache[start_id]
seen = set()
length = 1
current = start_id
while True:
if current in seen:
# infinite loop shouldn't happen, but just in case
break
seen.add(current)
reply_to = id_to_reply.get(current)
if (
reply_to is None
or (isinstance(reply_to, float) and pd.isna(reply_to))
or reply_to == ""
):
break
length += 1
current = reply_to
if current in length_cache:
length += length_cache[current] - 1
break
length_cache[start_id] = length
return length
emotion_to_lengths = {}
# Fill NaNs in emotion cols to avoid max() issues
emo_df = self.df[["id"] + emotion_cols].copy()
emo_df[emotion_cols] = emo_df[emotion_cols].fillna(0)
for _, row in emo_df.iterrows():
msg_id = row["id"]
length = thread_length_from(msg_id)
emotions = {c: row[c] for c in emotion_cols}
dominant = max(emotions, key=emotions.get)
emotion_to_lengths.setdefault(dominant, []).append(length)
top_10_pct_n = max(1, int(total_authors * 0.1))
top_10_pct_share = round(
author_counts.head(top_10_pct_n).sum() / total_comments, 4
)
return {
emotion: round(sum(lengths) / len(lengths), 2)
for emotion, lengths in emotion_to_lengths.items()
"total_commenting_authors": total_authors,
"top_10pct_author_count": top_10_pct_n,
"top_10pct_comment_share": float(top_10_pct_share),
"single_comment_authors": int((author_counts == 1).sum()),
"single_comment_author_ratio": float(
round((author_counts == 1).sum() / total_authors, 4)
),
}
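The concentration metrics returned by `conversation_concentration` can be reproduced on a plain list of comment authors. This is a minimal sketch that assumes the input is already restricted to comments; the standalone function and its list-based signature are illustrative, not part of the repository.

```python
from collections import Counter


def conversation_concentration(comment_authors: list[str]) -> dict:
    """Summarise how unevenly comments are spread across authors."""
    if not comment_authors:
        return {}
    # Comment counts per author, largest first.
    ranked = [count for _, count in Counter(comment_authors).most_common()]
    total_authors = len(ranked)
    top_n = max(1, int(total_authors * 0.1))  # always include at least one author
    singles = sum(1 for count in ranked if count == 1)
    return {
        "total_commenting_authors": total_authors,
        "top_10pct_author_count": top_n,
        "top_10pct_comment_share": round(sum(ranked[:top_n]) / len(comment_authors), 4),
        "single_comment_authors": singles,
        "single_comment_author_ratio": round(singles / total_authors, 4),
    }
```

With fewer than ten authors, `int(total_authors * 0.1)` is zero, so the `max(1, ...)` floor means the "top 10%" is in practice the single most active author.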

@@ -1,42 +1,57 @@
import pandas as pd
import re
from collections import Counter
from itertools import islice
from dataclasses import dataclass
import pandas as pd
@dataclass(frozen=True)
class NGramConfig:
min_token_length: int = 3
min_count: int = 2
max_results: int = 100
class LinguisticAnalysis:
def __init__(self, df: pd.DataFrame, word_exclusions: set[str]):
self.df = df
def __init__(self, word_exclusions: set[str]):
self.word_exclusions = word_exclusions
self.ngram_config = NGramConfig()
def _tokenize(self, text: str):
tokens = re.findall(r"\b[a-z]{3,}\b", text)
return [t for t in tokens if t not in self.word_exclusions]
def _tokenize(self, text: str, *, include_exclusions: bool = False) -> list[str]:
pattern = rf"\b[a-z]{{{self.ngram_config.min_token_length},}}\b"
tokens = re.findall(pattern, text)
if include_exclusions:
return tokens
return [token for token in tokens if token not in self.word_exclusions]
def _clean_text(self, text: str) -> str:
text = re.sub(r"http\S+", "", text) # remove URLs
text = re.sub(r"http\S+", "", text) # remove URLs
text = re.sub(r"www\S+", "", text)
text = re.sub(r"&\w+;", "", text) # remove HTML entities
text = re.sub(r"\bamp\b", "", text) # remove stray amp
text = re.sub(r"&\w+;", "", text) # remove HTML entities
text = re.sub(r"\bamp\b", "", text) # remove stray amp
text = re.sub(r"\S+\.(jpg|jpeg|png|webp|gif)", "", text)
return text
def word_frequencies(self, limit: int = 100) -> dict:
texts = (
self.df["content"]
.dropna()
.astype(str)
.str.lower()
)
def _content_texts(self, df: pd.DataFrame) -> pd.Series:
return df["content"].dropna().astype(str).apply(self._clean_text).str.lower()
def _valid_ngram(self, tokens: tuple[str, ...]) -> bool:
if any(token in self.word_exclusions for token in tokens):
return False
if len(set(tokens)) == 1:
return False
return True
def word_frequencies(self, df: pd.DataFrame, limit: int = 100) -> list[dict]:
texts = self._content_texts(df)
words = []
for text in texts:
tokens = re.findall(r"\b[a-z]{3,}\b", text)
words.extend(
w for w in tokens
if w not in self.word_exclusions
)
words.extend(self._tokenize(text))
counts = Counter(words)
@@ -48,25 +63,58 @@ class LinguisticAnalysis:
)
return word_frequencies.to_dict(orient="records")
def ngrams(self, n=2, limit=100):
texts = self.df["content"].dropna().astype(str).apply(self._clean_text).str.lower()
def ngrams(self, df: pd.DataFrame, n: int = 2, limit: int | None = None) -> list[dict]:
if n < 2:
raise ValueError("n must be at least 2")
texts = self._content_texts(df)
all_ngrams = []
result_limit = limit or self.ngram_config.max_results
for text in texts:
tokens = re.findall(r"\b[a-z]{3,}\b", text)
tokens = self._tokenize(text, include_exclusions=True)
# stop word removal causes strange behaviors in ngrams
#tokens = [w for w in tokens if w not in self.word_exclusions]
if len(tokens) < n:
continue
ngrams = zip(*(islice(tokens, i, None) for i in range(n)))
all_ngrams.extend([" ".join(ng) for ng in ngrams])
for index in range(len(tokens) - n + 1):
ngram_tokens = tuple(tokens[index : index + n])
if self._valid_ngram(ngram_tokens):
all_ngrams.append(" ".join(ngram_tokens))
counts = Counter(all_ngrams)
filtered_counts = [
(ngram, count)
for ngram, count in counts.items()
if count >= self.ngram_config.min_count
]
if not filtered_counts:
return []
return (
pd.DataFrame(counts.items(), columns=["ngram", "count"])
.sort_values("count", ascending=False)
.head(limit)
pd.DataFrame(filtered_counts, columns=["ngram", "count"])
.sort_values(["count", "ngram"], ascending=[False, True])
.head(result_limit)
.to_dict(orient="records")
)
)
def lexical_diversity(self, df: pd.DataFrame) -> dict:
tokens = (
df["content"]
.fillna("")
.astype(str)
.str.lower()
.str.findall(r"\b[a-z]{2,}\b")
.explode()
)
tokens = tokens[~tokens.isin(self.word_exclusions)]
total = max(len(tokens), 1)
unique = int(tokens.nunique())
return {
"total_tokens": total,
"unique_tokens": unique,
"ttr": round(unique / total, 4),
}
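The n-gram filtering introduced here (drop any n-gram that contains an excluded word or repeats a single token, then keep only n-grams seen at least `min_count` times) can be sketched with the standard library alone. The function `count_ngrams` and its parameters are hypothetical names for illustration; the sample stopword set is also an assumption.

```python
import re
from collections import Counter


def count_ngrams(
    texts: list[str],
    n: int = 2,
    stopwords: frozenset[str] = frozenset(),
    min_count: int = 2,
) -> list[tuple[str, int]]:
    """Count n-grams, skipping those with stopwords or one repeated token."""
    if n < 2:
        raise ValueError("n must be at least 2")
    counts: Counter[str] = Counter()
    for text in texts:
        # Tokens are lowercase words of three or more letters.
        tokens = re.findall(r"\b[a-z]{3,}\b", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i : i + n])
            if any(tok in stopwords for tok in gram) or len(set(gram)) == 1:
                continue
            counts[" ".join(gram)] += 1
    kept = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Sort by count descending, then alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

Filtering whole n-grams after windowing, rather than removing stopwords from the token stream first, avoids gluing together words that were never adjacent across a stopword. Words shorter than three letters are still dropped before windowing, so that effect is reduced rather than eliminated.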

@@ -6,6 +6,7 @@ from typing import Any
from transformers import pipeline
from sentence_transformers import SentenceTransformer
class NLP:
_topic_models: dict[str, SentenceTransformer] = {}
_emotion_classifiers: dict[str, Any] = {}
@@ -32,7 +33,7 @@ class NLP:
)
self.entity_recognizer = self._get_entity_recognizer(
self.device_str, self.pipeline_device
)
)
except RuntimeError as exc:
if self.use_cuda and "out of memory" in str(exc).lower():
torch.cuda.empty_cache()
@@ -90,7 +91,7 @@ class NLP:
)
cls._emotion_classifiers[device_str] = classifier
return classifier
@classmethod
def _get_entity_recognizer(cls, device_str: str, pipeline_device: int) -> Any:
recognizer = cls._entity_recognizers.get(device_str)
@@ -207,8 +208,7 @@ class NLP:
self.df.drop(columns=existing_drop, inplace=True)
remaining_emotion_cols = [
c for c in self.df.columns
if c.startswith("emotion_")
c for c in self.df.columns if c.startswith("emotion_")
]
if remaining_emotion_cols:
@@ -227,8 +227,6 @@ class NLP:
self.df[remaining_emotion_cols] = normalized.values
def add_topic_col(self, confidence_threshold: float = 0.3) -> None:
titles = self.df[self.title_col].fillna("").astype(str)
contents = self.df[self.content_col].fillna("").astype(str)
@@ -257,7 +255,7 @@ class NLP:
self.df.loc[self.df["topic_confidence"] < confidence_threshold, "topic"] = (
"Misc"
)
def add_ner_cols(self, max_chars: int = 512) -> None:
texts = (
self.df[self.content_col]
@@ -302,8 +300,4 @@ class NLP:
for label in all_labels:
col_name = f"entity_{label}"
self.df[col_name] = [
d.get(label, 0) for d in entity_count_dicts
]
self.df[col_name] = [d.get(label, 0) for d in entity_count_dicts]

server/analysis/stat_gen.py Normal file

@@ -0,0 +1,189 @@
import nltk
import json
import pandas as pd
from nltk.corpus import stopwords
from server.analysis.cultural import CulturalAnalysis
from server.analysis.emotional import EmotionalAnalysis
from server.analysis.interactional import InteractionAnalysis
from server.analysis.linguistic import LinguisticAnalysis
from server.analysis.summary import SummaryAnalysis
from server.analysis.temporal import TemporalAnalysis
from server.analysis.user import UserAnalysis
DOMAIN_STOPWORDS = {
"www",
"https",
"http",
"boards",
"boardsie",
"comment",
"comments",
"discussion",
"thread",
"post",
"posts",
"would",
"get",
"one",
}
EXCLUDED_AUTHORS = {"[deleted]", "automoderator"}
nltk.download("stopwords")
EXCLUDE_WORDS = set(stopwords.words("english")) | DOMAIN_STOPWORDS
class StatGen:
def __init__(self) -> None:
self.temporal_analysis = TemporalAnalysis()
self.emotional_analysis = EmotionalAnalysis()
self.interaction_analysis = InteractionAnalysis(EXCLUDE_WORDS)
self.linguistic_analysis = LinguisticAnalysis(EXCLUDE_WORDS)
self.cultural_analysis = CulturalAnalysis()
self.summary_analysis = SummaryAnalysis()
self.user_analysis = UserAnalysis(EXCLUDE_WORDS)
## Private Methods
def _prepare_filtered_df(self, df: pd.DataFrame, filters: dict | None = None) -> pd.DataFrame:
filters = filters or {}
filtered_df = df.copy()
if "author" in filtered_df.columns:
normalized_authors = (
filtered_df["author"].fillna("").astype(str).str.strip().str.lower()
)
filtered_df = filtered_df[~normalized_authors.isin(EXCLUDED_AUTHORS)]
search_query = filters.get("search_query", None)
start_date_filter = filters.get("start_date", None)
end_date_filter = filters.get("end_date", None)
data_source_filter = filters.get("data_sources", None)
if search_query:
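# Match the query literally in every column, as the title check below already does;
# otherwise a query such as "c++" would be parsed as an invalid regular expression
mask = filtered_df["content"].str.contains(
search_query, case=False, na=False, regex=False
) | filtered_df["author"].str.contains(
search_query, case=False, na=False, regex=False
)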
# Only include title if the column exists
if "title" in filtered_df.columns:
mask = mask | filtered_df["title"].str.contains(
search_query, case=False, na=False, regex=False
)
filtered_df = filtered_df[mask]
if start_date_filter:
filtered_df = filtered_df[(filtered_df["dt"] >= start_date_filter)]
if end_date_filter:
filtered_df = filtered_df[(filtered_df["dt"] <= end_date_filter)]
if data_source_filter:
filtered_df = filtered_df[filtered_df["source"].isin(data_source_filter)]
return filtered_df
def _json_ready_records(self, df: pd.DataFrame) -> list[dict]:
return json.loads(
df.to_json(orient="records", date_format="iso", date_unit="s")
)
## Public Methods
def filter_dataset(self, df: pd.DataFrame, filters: dict | None = None) -> list[dict]:
filtered_df = self._prepare_filtered_df(df, filters)
return self._json_ready_records(filtered_df)
def temporal(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return {
"events_per_day": self.temporal_analysis.posts_per_day(filtered_df),
"weekday_hour_heatmap": self.temporal_analysis.heatmap(filtered_df),
}
def linguistic(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return {
"word_frequencies": self.linguistic_analysis.word_frequencies(filtered_df),
"common_two_phrases": self.linguistic_analysis.ngrams(filtered_df),
"common_three_phrases": self.linguistic_analysis.ngrams(filtered_df, n=3),
"lexical_diversity": self.linguistic_analysis.lexical_diversity(filtered_df)
}
def emotional(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return {
"average_emotion_by_topic": self.emotional_analysis.avg_emotion_by_topic(filtered_df),
"overall_emotion_average": self.emotional_analysis.overall_emotion_average(filtered_df),
"dominant_emotion_distribution": self.emotional_analysis.dominant_emotion_distribution(filtered_df),
"emotion_by_source": self.emotional_analysis.emotion_by_source(filtered_df)
}
def user(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return {
"top_users": self.user_analysis.top_users(filtered_df),
"users": self.user_analysis.per_user_analysis(filtered_df)
}
def interactional(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return {
"top_interaction_pairs": self.interaction_analysis.top_interaction_pairs(filtered_df, top_n=100),
"interaction_graph": self.interaction_analysis.interaction_graph(filtered_df),
"conversation_concentration": self.interaction_analysis.conversation_concentration(filtered_df)
}
def cultural(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return {
"identity_markers": self.cultural_analysis.get_identity_markers(filtered_df),
"stance_markers": self.cultural_analysis.get_stance_markers(filtered_df),
"avg_emotion_per_entity": self.cultural_analysis.get_avg_emotions_per_entity(filtered_df)
}
def summary(
self,
df: pd.DataFrame,
filters: dict | None = None,
dataset_id: int | None = None,
) -> dict:
filtered_df = self._prepare_filtered_df(df, filters)
return self.summary_analysis.summary(filtered_df)

@@ -0,0 +1,64 @@
import pandas as pd
class SummaryAnalysis:
def total_events(self, df: pd.DataFrame) -> int:
return int(len(df))
def total_posts(self, df: pd.DataFrame) -> int:
return int(len(df[df["type"] == "post"]))
def total_comments(self, df: pd.DataFrame) -> int:
return int(len(df[df["type"] == "comment"]))
def unique_users(self, df: pd.DataFrame) -> int:
return int(len(df["author"].dropna().unique()))
def comments_per_post(self, total_comments: int, total_posts: int) -> float:
return round(total_comments / max(total_posts, 1), 2)
def lurker_ratio(self, df: pd.DataFrame) -> float:
events_per_user = df.groupby("author").size()
return round((events_per_user == 1).mean(), 2)
def time_range(self, df: pd.DataFrame) -> dict:
return {
"start": int(df["dt"].min().timestamp()),
"end": int(df["dt"].max().timestamp()),
}
def sources(self, df: pd.DataFrame) -> list:
return df["source"].dropna().unique().tolist()
def empty_summary(self) -> dict:
return {
"total_events": 0,
"total_posts": 0,
"total_comments": 0,
"unique_users": 0,
"comments_per_post": 0,
"lurker_ratio": 0,
"time_range": {
"start": None,
"end": None,
},
"sources": [],
}
def summary(self, df: pd.DataFrame) -> dict:
if df.empty:
return self.empty_summary()
total_posts = self.total_posts(df)
total_comments = self.total_comments(df)
return {
"total_events": self.total_events(df),
"total_posts": total_posts,
"total_comments": total_comments,
"unique_users": self.unique_users(df),
"comments_per_post": self.comments_per_post(total_comments, total_posts),
"lurker_ratio": self.lurker_ratio(df),
"time_range": self.time_range(df),
"sources": self.sources(df),
}
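
The two ratio metrics in `SummaryAnalysis` can be illustrated without pandas. The following is a minimal sketch, assuming events are plain dicts with `type` and `author` keys and that no author is missing (the pandas `groupby` in the original drops missing authors by default). The sample events and the `Counter`-based grouping are illustrative assumptions, not the project's implementation.

```python
from collections import Counter


def comments_per_post(total_comments: int, total_posts: int) -> float:
    # max(..., 1) avoids a division by zero when the dataset has no posts
    return round(total_comments / max(total_posts, 1), 2)


def lurker_ratio(authors: list[str]) -> float:
    # Share of distinct authors who appear exactly once in the dataset
    events_per_user = Counter(authors)
    if not events_per_user:
        return 0.0
    single_event_users = sum(1 for n in events_per_user.values() if n == 1)
    return round(single_event_users / len(events_per_user), 2)


# Hypothetical sample data, for illustration only
events = [
    {"type": "post", "author": "ann"},
    {"type": "comment", "author": "bob"},
    {"type": "comment", "author": "ann"},
    {"type": "comment", "author": "cat"},
]

posts = sum(e["type"] == "post" for e in events)
comments = sum(e["type"] == "comment" for e in events)
ratio = comments_per_post(comments, posts)
lurkers = lurker_ratio([e["author"] for e in events])
print(ratio, lurkers)
```

Here "lurker" follows the definition used in `lurker_ratio` above: an author with exactly one event, whether a post or a comment.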


@@ -1,16 +1,14 @@
import pandas as pd
class TemporalAnalysis:
def __init__(self, df: pd.DataFrame):
self.df = df
def avg_reply_time_per_emotion(self) -> dict:
df = self.df.copy()
def avg_reply_time_per_emotion(self, df: pd.DataFrame) -> list[dict]:
df = df.copy()
replies = df[
(df["type"] == "comment") &
(df["reply_to"].notna()) &
(df["reply_to"] != "")
(df["type"] == "comment")
& (df["reply_to"].notna())
& (df["reply_to"] != "")
]
id_to_time = df.set_index("id")["dt"].to_dict()
@@ -23,48 +21,51 @@ class TemporalAnalysis:
return None
return (row["dt"] - parent_time).total_seconds()
replies["reply_time"] = replies.apply(compute_reply_time, axis=1)
emotion_cols = [col for col in df.columns if col.startswith("emotion_") and col not in ("emotion_neutral", "emotion_surprise")]
emotion_cols = [
col
for col in df.columns
if col.startswith("emotion_")
and col not in ("emotion_neutral", "emotion_surprise")
]
replies["dominant_emotion"] = replies[emotion_cols].idxmax(axis=1)
grouped = (
replies
.groupby("dominant_emotion")["reply_time"]
replies.groupby("dominant_emotion")["reply_time"]
.agg(["mean", "count"])
.reset_index()
)
return grouped.to_dict(orient="records")
def posts_per_day(self) -> dict:
per_day = (
self.df.groupby("date")
.size()
.reset_index(name="count")
)
def posts_per_day(self, df: pd.DataFrame) -> list[dict]:
per_day = df.groupby("date").size().reset_index(name="count")
return per_day.to_dict(orient="records")
def heatmap(self) -> dict:
def heatmap(self, df: pd.DataFrame) -> list[dict]:
weekday_order = [
"Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
]
self.df["weekday"] = pd.Categorical(
self.df["weekday"],
categories=weekday_order,
ordered=True
df = df.copy()
df["weekday"] = pd.Categorical(
df["weekday"], categories=weekday_order, ordered=True
)
heatmap = (
self.df
.groupby(["weekday", "hour"], observed=True)
df.groupby(["weekday", "hour"], observed=True)
.size()
.unstack(fill_value=0)
.reindex(columns=range(24), fill_value=0)
)
heatmap.columns = heatmap.columns.map(str)
return heatmap.to_dict(orient="records")
return heatmap.to_dict(orient="records")
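
The reply-time step in `avg_reply_time_per_emotion` looks up each reply's parent timestamp by id and takes the difference in seconds. The following is a rough, pandas-free sketch of that lookup, assuming events are dicts with `id`, `type`, `reply_to` and `dt` keys. Unlike the original, which records `None` when the parent is not in the dataset, this sketch simply skips such replies; the sample events are hypothetical.

```python
from datetime import datetime, timezone


def reply_times(events: list[dict]) -> dict[str, float]:
    # Map every event id to its timestamp so parents can be looked up directly
    id_to_time = {e["id"]: e["dt"] for e in events}
    result = {}
    for e in events:
        parent = e.get("reply_to")
        # Only comments with a non-empty reply_to count as replies
        if e["type"] != "comment" or not parent:
            continue
        parent_time = id_to_time.get(parent)
        if parent_time is None:
            # Parent is outside the dataset (assumption: skip rather than store None)
            continue
        result[e["id"]] = (e["dt"] - parent_time).total_seconds()
    return result


def at(hour: int, minute: int) -> datetime:
    # Helper for building sample timestamps on an arbitrary day
    return datetime(2026, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    {"id": "p1", "type": "post", "reply_to": None, "dt": at(12, 0)},
    {"id": "c1", "type": "comment", "reply_to": "p1", "dt": at(12, 5)},
    {"id": "c2", "type": "comment", "reply_to": "missing", "dt": at(12, 6)},
    {"id": "c3", "type": "comment", "reply_to": "", "dt": at(12, 7)},
]

times = reply_times(events)
print(times)
```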

152
server/analysis/user.py Normal file

@@ -0,0 +1,152 @@
import pandas as pd
import re
from collections import Counter
class UserAnalysis:
def __init__(self, word_exclusions: set[str]):
self.word_exclusions = word_exclusions
def _tokenize(self, text: str):
tokens = re.findall(r"\b[a-z]{3,}\b", text)
return [t for t in tokens if t not in self.word_exclusions]
def _vocab_richness_per_user(
self, df: pd.DataFrame, min_words: int = 20, top_most_used_words: int = 100
) -> list:
df = df.copy()
df["content"] = df["content"].fillna("").astype(str).str.lower()
df["tokens"] = df["content"].apply(self._tokenize)
rows = []
for author, group in df.groupby("author"):
all_tokens = [t for tokens in group["tokens"] for t in tokens]
total_words = len(all_tokens)
unique_words = len(set(all_tokens))
events = len(group)
# Minimum number of words per user; with fewer than this, vocab richness is unreliable
if total_words < min_words:
continue
# 1.0 means the user never reused a word (excluded words are not counted)
vocab_richness = unique_words / total_words
avg_words = total_words / max(events, 1)
counts = Counter(all_tokens)
top_words = [
{"word": w, "count": int(c)}
for w, c in counts.most_common(top_most_used_words)
]
rows.append(
{
"author": author,
"events": int(events),
"total_words": int(total_words),
"unique_words": int(unique_words),
"vocab_richness": round(vocab_richness, 3),
"avg_words_per_event": round(avg_words, 2),
"top_words": top_words,
}
)
rows = sorted(rows, key=lambda x: x["vocab_richness"], reverse=True)
return rows
def top_users(self, df: pd.DataFrame) -> list:
counts = df.groupby(["author", "source"]).size().sort_values(ascending=False)
top_users = [
{"author": author, "source": source, "count": int(count)}
for (author, source), count in counts.items()
]
return top_users
def per_user_analysis(self, df: pd.DataFrame) -> dict:
per_user = df.groupby(["author", "type"]).size().unstack(fill_value=0)
emotion_cols = [col for col in df.columns if col.startswith("emotion_")]
dominant_topic_by_author = {}
avg_emotions_by_author = {}
if emotion_cols:
avg_emotions = df.groupby("author")[emotion_cols].mean().fillna(0.0)
avg_emotions_by_author = {
author: {emotion: float(score) for emotion, score in row.items()}
for author, row in avg_emotions.iterrows()
}
if "topic" in df.columns:
topic_df = df[
df["topic"].notna()
& (df["topic"] != "")
& (df["topic"] != "Misc")
]
if not topic_df.empty:
topic_counts = (
topic_df.groupby(["author", "topic"])
.size()
.reset_index(name="count")
.sort_values(
["author", "count", "topic"],
ascending=[True, False, True],
)
.drop_duplicates(subset=["author"])
)
dominant_topic_by_author = {
row["author"]: {
"topic": row["topic"],
"count": int(row["count"]),
}
for _, row in topic_counts.iterrows()
}
# ensure columns always exist
for col in ("post", "comment"):
if col not in per_user.columns:
per_user[col] = 0
per_user["comment_post_ratio"] = per_user["comment"] / per_user["post"].replace(
0, 1
)
per_user["comment_share"] = per_user["comment"] / (
per_user["post"] + per_user["comment"]
).replace(0, 1)
per_user = per_user.sort_values("comment_post_ratio", ascending=True)
per_user_records = per_user.reset_index().to_dict(orient="records")
vocab_rows = self._vocab_richness_per_user(df)
vocab_by_author = {row["author"]: row for row in vocab_rows}
# merge vocab richness + per_user information
merged_users = []
for row in per_user_records:
author = row["author"]
merged_users.append(
{
"author": author,
"post": int(row.get("post", 0)),
"comment": int(row.get("comment", 0)),
"comment_post_ratio": float(row.get("comment_post_ratio", 0)),
"comment_share": float(row.get("comment_share", 0)),
"avg_emotions": avg_emotions_by_author.get(author, {}),
"dominant_topic": dominant_topic_by_author.get(author),
"vocab": vocab_by_author.get(
author,
{
"vocab_richness": 0,
"avg_words_per_event": 0,
"top_words": [],
},
),
}
)
merged_users.sort(key=lambda u: u["comment_post_ratio"])
return merged_users


@@ -1,199 +1,608 @@
from flask import Flask, jsonify, request
from flask_cors import CORS
from server.stat_gen import StatGen
import os
import pandas as pd
import traceback
import json
from dotenv import load_dotenv
from flask import Flask, jsonify, request
from flask_cors import CORS
from flask_bcrypt import Bcrypt
from flask_jwt_extended import (
JWTManager,
create_access_token,
jwt_required,
get_jwt_identity,
)
from server.analysis.stat_gen import StatGen
from server.exceptions import NotAuthorisedException, NonExistentDatasetException
from server.db.database import PostgresConnector
from server.core.auth import AuthManager
from server.core.datasets import DatasetManager
from server.utils import get_request_filters, get_env
from server.queue.tasks import process_dataset, fetch_and_process_dataset
from server.connectors.registry import get_available_connectors, get_connector_metadata
app = Flask(__name__)
# Allow for CORS from localhost:5173
CORS(app, resources={r"/*": {"origins": "http://localhost:5173"}})
# Env Variables
load_dotenv()
max_fetch_limit = int(get_env("MAX_FETCH_LIMIT"))
frontend_url = get_env("FRONTEND_URL")
jwt_secret_key = get_env("JWT_SECRET_KEY")
jwt_access_token_expires = int(
os.getenv("JWT_ACCESS_TOKEN_EXPIRES", 1200)
) # Default to 20 minutes
# Global State
posts_df = pd.read_json('small.jsonl', lines=True)
with open("topic_buckets.json", "r", encoding="utf-8") as f:
domain_topics = json.load(f)
stat_obj = StatGen(posts_df, domain_topics)
# Flask Configuration
CORS(app, resources={r"/*": {"origins": frontend_url}})
app.config["JWT_SECRET_KEY"] = jwt_secret_key
app.config["JWT_ACCESS_TOKEN_EXPIRES"] = jwt_access_token_expires
@app.route('/upload', methods=['POST'])
# Security
bcrypt = Bcrypt(app)
jwt = JWTManager(app)
# Helper Objects
db = PostgresConnector()
auth_manager = AuthManager(db, bcrypt)
dataset_manager = DatasetManager(db)
stat_gen = StatGen()
connectors = get_available_connectors()
# Default Files
with open("server/topics.json") as f:
default_topic_list = json.load(f)
def normalize_topics(topics):
if not isinstance(topics, dict) or len(topics) == 0:
return None
normalized = {}
for topic_name, topic_keywords in topics.items():
if not isinstance(topic_name, str) or not isinstance(topic_keywords, str):
return None
clean_name = topic_name.strip()
clean_keywords = topic_keywords.strip()
if not clean_name or not clean_keywords:
return None
normalized[clean_name] = clean_keywords
return normalized
@app.route("/register", methods=["POST"])
def register_user():
data = request.get_json()
if (
not data
or "username" not in data
or "email" not in data
or "password" not in data
):
return jsonify({"error": "Missing username, email, or password"}), 400
username = data["username"]
email = data["email"]
password = data["password"]
try:
auth_manager.register_user(username, email, password)
except ValueError as e:
return jsonify({"error": str(e)}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
print(f"Registered new user: {username}")
return jsonify({"message": f"User '{username}' registered successfully"}), 200
@app.route("/login", methods=["POST"])
def login_user():
data = request.get_json()
if not data or "username" not in data or "password" not in data:
return jsonify({"error": "Missing username or password"}), 400
username = data["username"]
password = data["password"]
try:
user = auth_manager.authenticate_user(username, password)
if user:
access_token = create_access_token(identity=str(user["id"]))
return jsonify({"access_token": access_token}), 200
else:
return jsonify({"error": "Invalid username or password"}), 401
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/profile", methods=["GET"])
@jwt_required()
def profile():
current_user = get_jwt_identity()
return (
jsonify(
message="Access granted", user=auth_manager.get_user_by_id(current_user)
),
200,
)
@app.route("/user/datasets")
@jwt_required()
def get_user_datasets():
current_user = int(get_jwt_identity())
return jsonify(dataset_manager.get_user_datasets(current_user)), 200
@app.route("/datasets/sources", methods=["GET"])
def get_dataset_sources():
list_metadata = list(get_connector_metadata().values())
return jsonify(list_metadata)
@app.route("/datasets/fetch", methods=["POST"])
@jwt_required()
def fetch_data():
data = request.get_json()
connector_metadata = get_connector_metadata()
# Strong validation needed, otherwise data goes to Celery and crashes silently
if not data or "sources" not in data:
return jsonify({"error": "Sources must be provided"}), 400
if "name" not in data or not str(data["name"]).strip():
return jsonify({"error": "Dataset name is required"}), 400
dataset_name = data["name"].strip()
user_id = int(get_jwt_identity())
custom_topics = data.get("topics")
topics_for_processing = default_topic_list
source_configs = data["sources"]
if not isinstance(source_configs, list) or len(source_configs) == 0:
return jsonify({"error": "Sources must be a non-empty list"}), 400
for source in source_configs:
if not isinstance(source, dict):
return jsonify({"error": "Each source must be an object"}), 400
if "name" not in source:
return jsonify({"error": "Each source must contain a name"}), 400
name = source["name"]
limit = source.get("limit", 1000)
category = source.get("category")
search = source.get("search")
if limit:
try:
limit = int(limit)
except (ValueError, TypeError):
return jsonify({"error": "Limit must be an integer"}), 400
if limit > 1000:
limit = 1000
if name not in connector_metadata:
return jsonify({"error": "Source not supported"}), 400
if search and not connector_metadata[name]["search_enabled"]:
return jsonify({"error": f"Source {name} does not support search"}), 400
if category and not connector_metadata[name]["categories_enabled"]:
return jsonify({"error": f"Source {name} does not support categories"}), 400
# if category and not connectors[name]().category_exists(category):
# return jsonify({"error": f"Category does not exist for {name}"}), 400
if custom_topics is not None:
normalized_topics = normalize_topics(custom_topics)
if not normalized_topics:
return (
jsonify(
{
"error": "Topics must be a non-empty JSON object with non-empty string keys and values"
}
),
400,
)
topics_for_processing = normalized_topics
try:
dataset_id = dataset_manager.save_dataset_info(
user_id, dataset_name, topics_for_processing
)
dataset_manager.set_dataset_status(
dataset_id,
"fetching",
f"Data is being fetched from {', '.join(source['name'] for source in source_configs)}",
)
fetch_and_process_dataset.delay(dataset_id, source_configs, topics_for_processing)
except Exception:
print(traceback.format_exc())
return jsonify({"error": "Failed to queue dataset processing"}), 500
return (
jsonify(
{
"message": "Dataset queued for processing",
"dataset_id": dataset_id,
"status": "processing",
}
),
202,
)
@app.route("/datasets/upload", methods=["POST"])
@jwt_required()
def upload_data():
if "posts" not in request.files or "topics" not in request.files:
return jsonify({"error": "Missing required files or form data"}), 400
post_file = request.files["posts"]
topic_file = request.files["topics"]
dataset_name = (request.form.get("name") or "").strip()
if post_file.filename == "" or topic_file == "":
if not dataset_name:
return jsonify({"error": "Missing required dataset name"}), 400
if post_file.filename == "" or topic_file.filename == "":
return jsonify({"error": "Empty filename"}), 400
if not post_file.filename.endswith('.jsonl') or not topic_file.filename.endswith('.json'):
return jsonify({"error": "Invalid file type. Only .jsonl and .json files are allowed."}), 400
try:
global stat_obj
if not post_file.filename.endswith(".jsonl") or not topic_file.filename.endswith(
".json"
):
return (
jsonify(
{"error": "Invalid file type. Only .jsonl and .json files are allowed."}
),
400,
)
posts_df = pd.read_json(post_file, lines=True)
stat_obj = StatGen(posts_df, json.load(topic_file))
return jsonify({"message": "File uploaded successfully", "event_count": len(stat_obj.df)}), 200
try:
current_user = int(get_jwt_identity())
posts_df = pd.read_json(post_file, lines=True, convert_dates=False)
topics = json.load(topic_file)
dataset_id = dataset_manager.save_dataset_info(
current_user, dataset_name, topics
)
process_dataset.delay(dataset_id, posts_df.to_dict(orient="records"), topics)
return (
jsonify(
{
"message": "Dataset queued for processing",
"dataset_id": dataset_id,
"status": "processing",
}
),
202,
)
except ValueError as e:
return jsonify({"error": f"Failed to read JSONL file: {str(e)}"}), 400
return jsonify({"error": f"Failed to read JSONL file"}), 400
except Exception as e:
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
@app.route('/dataset', methods=['GET'])
def get_dataset():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
return stat_obj.df.to_json(orient="records"), 200, {"Content-Type": "application/json"}
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route('/stats/content', methods=['GET'])
def word_frequencies():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
@app.route("/dataset/<int:dataset_id>", methods=["GET"])
@jwt_required()
def get_dataset(dataset_id):
try:
return jsonify(stat_obj.get_content_analysis()), 200
except ValueError as e:
return jsonify({"error": f"Malformed or missing data: {str(e)}"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
@app.route('/stats/summary', methods=["GET"])
def get_summary():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
try:
return jsonify(stat_obj.summary()), 200
except ValueError as e:
return jsonify({"error": f"Malformed or missing data: {str(e)}"}), 400
except Exception as e:
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
@app.route("/stats/time", methods=["GET"])
def get_time_analysis():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
try:
return jsonify(stat_obj.get_time_analysis()), 200
except ValueError as e:
return jsonify({"error": f"Malformed or missing data: {str(e)}"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
@app.route("/stats/user", methods=["GET"])
def get_user_analysis():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
try:
return jsonify(stat_obj.get_user_analysis()), 200
except ValueError as e:
return jsonify({"error": f"Malformed or missing data: {str(e)}"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
@app.route("/stats/cultural", methods=["GET"])
def get_cultural_analysis():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
try:
return jsonify(stat_obj.get_cultural_analysis()), 200
except ValueError as e:
return jsonify({"error": f"Malformed or missing data: {str(e)}"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
user_id = int(get_jwt_identity())
@app.route("/stats/interaction", methods=["GET"])
def get_interaction_analysis():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
try:
return jsonify(stat_obj.get_interactional_analysis()), 200
except ValueError as e:
return jsonify({"error": f"Malformed or missing data: {str(e)}"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
@app.route('/filter/query', methods=["POST"])
def filter_query():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
dataset_info = dataset_manager.get_dataset_info(dataset_id)
included_cols = {"id", "name", "created_at"}
data = request.get_json(silent=True) or {}
if "query" not in data:
return jsonify(stat_obj.df.to_dict(orient="records")), 200
query = data["query"]
filtered_df = stat_obj.filter_by_query(query)
return jsonify(filtered_df), 200
@app.route('/filter/time', methods=["POST"])
def filter_time():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
data = request.get_json(silent=True)
if not data:
return jsonify({"error": "Invalid or missing JSON body"}), 400
if "start" not in data or "end" not in data:
return jsonify({"error": "Please include both start and end dates"}), 400
try:
start = pd.to_datetime(data["start"], utc=True)
end = pd.to_datetime(data["end"], utc=True)
filtered_df = stat_obj.set_time_range(start, end)
return jsonify(filtered_df), 200
return jsonify({k: dataset_info[k] for k in included_cols}), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except Exception:
return jsonify({"error": "Invalid datetime format"}), 400
@app.route('/filter/sources', methods=["POST"])
def filter_sources():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
data = request.get_json(silent=True)
if not data:
return jsonify({"error": "Invalid or missing JSON body"}), 400
if "sources" not in data:
return jsonify({"error": "Ensure sources hash map is in 'sources' key"}), 400
print(traceback.format_exc())
return jsonify({"error": "An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>", methods=["PATCH"])
@jwt_required()
def update_dataset(dataset_id):
try:
filtered_df = stat_obj.filter_data_sources(data["sources"])
return jsonify(filtered_df), 200
except ValueError:
return jsonify({"error": "Please enable at least one data source"}), 400
except Exception as e:
return jsonify({"error": "An unexpected server error occured: " + str(e)}), 500
@app.route('/filter/reset', methods=["GET"])
def reset_dataset():
if stat_obj is None:
return jsonify({"error": "No data uploaded"}), 400
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
body = request.get_json()
new_name = body.get("name")
if not new_name or not new_name.strip():
return jsonify({"error": "A valid name must be provided"}), 400
dataset_manager.update_dataset_name(dataset_id, new_name.strip())
return (
jsonify(
{"message": f"Dataset {dataset_id} renamed to '{new_name.strip()}'"}
),
200,
)
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except Exception:
print(traceback.format_exc())
return jsonify({"error": "An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>", methods=["DELETE"])
@jwt_required()
def delete_dataset(dataset_id):
try:
stat_obj.reset_dataset()
return jsonify({"success": "Dataset successfully reset"})
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_manager.delete_dataset_info(dataset_id)
dataset_manager.delete_dataset_content(dataset_id)
return (
jsonify(
{
"message": f"Dataset {dataset_id} metadata and content successfully deleted"
}
),
200,
)
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except Exception:
print(traceback.format_exc())
return jsonify({"error": "An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/status", methods=["GET"])
@jwt_required()
def get_dataset_status(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_status = dataset_manager.get_dataset_status(dataset_id)
return jsonify(dataset_status), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except Exception:
print(traceback.format_exc())
return jsonify({"error": "An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/linguistic", methods=["GET"])
@jwt_required()
def get_linguistic_analysis(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.linguistic(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred: {str(e)}"}), 500
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/emotional", methods=["GET"])
@jwt_required()
def get_emotional_analysis(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.emotional(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/summary", methods=["GET"])
@jwt_required()
def get_summary(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.summary(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/temporal", methods=["GET"])
@jwt_required()
def get_temporal_analysis(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.temporal(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/user", methods=["GET"])
@jwt_required()
def get_user_analysis(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.user(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/cultural", methods=["GET"])
@jwt_required()
def get_cultural_analysis(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.cultural(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/interactional", methods=["GET"])
@jwt_required()
def get_interaction_analysis(dataset_id):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.interactional(dataset_content, filters, dataset_id=dataset_id)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
@app.route("/dataset/<int:dataset_id>/all", methods=["GET"])
@jwt_required()
def get_full_dataset(dataset_id: int):
try:
user_id = int(get_jwt_identity())
if not dataset_manager.authorize_user_dataset(dataset_id, user_id):
raise NotAuthorisedException(
"This user is not authorised to access this dataset"
)
dataset_content = dataset_manager.get_dataset_content(dataset_id)
filters = get_request_filters()
return jsonify(stat_gen.filter_dataset(dataset_content, filters)), 200
except NotAuthorisedException:
return jsonify({"error": "User is not authorised to access this content"}), 403
except NonExistentDatasetException:
return jsonify({"error": "Dataset does not exist"}), 404
except ValueError as e:
return jsonify({"error": f"Malformed or missing data"}), 400
except Exception as e:
print(traceback.format_exc())
return jsonify({"error": f"An unexpected error occurred"}), 500
if __name__ == "__main__":
app.run(debug=True)
app.run(debug=True)
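
The `normalize_topics` helper used by `/datasets/fetch` accepts only a non-empty mapping of non-empty strings to non-empty strings, and trims whitespace from both. The following sketch reproduces the helper so it can run on its own and shows which inputs pass; the topic names and keywords are made-up examples.

```python
def normalize_topics(topics):
    # Reject anything that is not a non-empty dict
    if not isinstance(topics, dict) or len(topics) == 0:
        return None
    normalized = {}
    for topic_name, topic_keywords in topics.items():
        # Both the topic name and its keyword string must be strings
        if not isinstance(topic_name, str) or not isinstance(topic_keywords, str):
            return None
        clean_name = topic_name.strip()
        clean_keywords = topic_keywords.strip()
        # Whitespace-only names or keywords are treated as missing
        if not clean_name or not clean_keywords:
            return None
        normalized[clean_name] = clean_keywords
    return normalized


valid = normalize_topics({" Housing ": " rent, landlord, mortgage "})
list_value = normalize_topics({"Housing": ["rent", "landlord"]})
blank_value = normalize_topics({"Housing": "   "})
empty = normalize_topics({})
print(valid, list_value, blank_value, empty)
```

A single invalid entry rejects the whole mapping, so the route can answer with one 400 response rather than silently processing a partial topic list.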

24
server/connectors/base.py Normal file

@@ -0,0 +1,24 @@
from abc import ABC, abstractmethod
from dto.post import Post
import os
class BaseConnector(ABC):
source_name: str # machine readable
display_name: str # human readable
required_env: list[str] = []
search_enabled: bool
categories_enabled: bool
@classmethod
def is_available(cls) -> bool:
return all(os.getenv(var) for var in cls.required_env)
@abstractmethod
def get_new_posts_by_search(
self, search: str | None = None, category: str | None = None, post_limit: int = 10
) -> list[Post]: ...
@abstractmethod
def category_exists(self, category: str) -> bool: ...


@@ -7,56 +7,94 @@ from dto.post import Post
from dto.comment import Comment
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from server.connectors.base import BaseConnector
logger = logging.getLogger(__name__)
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; ForumScraper/1.0)"
}
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; Digital-Ethnography-Aid/1.0)"}
class BoardsAPI(BaseConnector):
source_name: str = "boards.ie"
display_name: str = "Boards.ie"
categories_enabled: bool = True
search_enabled: bool = False
class BoardsAPI:
def __init__(self):
self.url = "https://www.boards.ie"
self.source_name = "Boards.ie"
self.base_url = "https://www.boards.ie"
def get_new_category_posts(self, category: str, post_limit: int, comment_limit: int) -> list[Post]:
def get_new_posts_by_search(
self, search: str, category: str, post_limit: int
) -> list[Post]:
if search:
raise NotImplementedError("Search not compatible with boards.ie")
if category:
return self._get_posts(f"{self.base_url}/categories/{category}", post_limit)
else:
return self._get_posts(f"{self.base_url}/discussions", post_limit)
def category_exists(self, category: str) -> bool:
if not category:
return False
url = f"{self.base_url}/categories/{category}"
try:
response = requests.head(url, headers=HEADERS, allow_redirects=True)
if response.status_code == 200:
return True
if response.status_code == 404:
return False
# fallback if HEAD not supported
response = requests.get(url, headers=HEADERS)
return response.status_code == 200
except requests.RequestException as e:
logger.error(f"Error checking category '{category}': {e}")
return False
## Private
def _get_posts(self, url, limit) -> list[Post]:
urls = []
current_page = 1
logger.info(f"Fetching posts from category: {category}")
while len(urls) < post_limit:
url = f"{self.url}/categories/{category}/p{current_page}"
while len(urls) < limit:
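# build the page URL in a separate variable so the base URL is not overwritten on each iteration
page_url = f"{url}/p{current_page}"
html = self._fetch_page(page_url)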
soup = BeautifulSoup(html, "html.parser")
logger.debug(f"Processing page {current_page} for category {category}")
logger.debug(f"Processing page {current_page} for link: {url}")
for a in soup.select("a.threadbit-threadlink"):
if len(urls) >= post_limit:
if len(urls) >= limit:
break
href = a.get("href")
if href:
urls.append(href)
current_page += 1
logger.debug(f"Fetched {len(urls)} post URLs from category {category}")
logger.debug(f"Fetched {len(urls)} post URLs")
# Fetch post details for each URL and create Post objects
posts = []
def fetch_and_parse(post_url):
html = self._fetch_page(post_url)
post = self._parse_thread(html, post_url, comment_limit)
post = self._parse_thread(html, post_url)
return post
with ThreadPoolExecutor(max_workers=30) as executor:
with ThreadPoolExecutor(max_workers=5) as executor:
futures = {executor.submit(fetch_and_parse, url): url for url in urls}
for i, future in enumerate(as_completed(futures)):
post_url = futures[future]
logger.debug(f"Fetching Post {i + 1} / {len(urls)} details from URL: {post_url}")
logger.debug(
f"Fetching Post {i + 1} / {len(urls)} details from URL: {post_url}"
)
try:
post = future.result()
posts.append(post)
@@ -65,15 +103,14 @@ class BoardsAPI:
return posts
def _fetch_page(self, url: str) -> str:
response = requests.get(url, headers=HEADERS)
response.raise_for_status()
return response.text
def _parse_thread(self, html: str, post_url: str, comment_limit: int) -> Post:
def _parse_thread(self, html: str, post_url: str) -> Post:
soup = BeautifulSoup(html, "html.parser")
# Author
author_tag = soup.select_one(".userinfo-username-title")
author = author_tag.text.strip() if author_tag else None
@@ -82,10 +119,16 @@ class BoardsAPI:
timestamp_tag = soup.select_one(".postbit-header")
timestamp = None
if timestamp_tag:
match = re.search(r"\d{2}-\d{2}-\d{4}\s+\d{2}:\d{2}[AP]M", timestamp_tag.get_text())
match = re.search(
r"\d{2}-\d{2}-\d{4}\s+\d{2}:\d{2}[AP]M", timestamp_tag.get_text()
)
timestamp = match.group(0) if match else None
# convert to unix epoch
timestamp = datetime.datetime.strptime(timestamp, "%d-%m-%Y %I:%M%p").timestamp() if timestamp else None
timestamp = (
datetime.datetime.strptime(timestamp, "%d-%m-%Y %I:%M%p").timestamp()
if timestamp
else None
)
# Post ID
post_num = re.search(r"discussion/(\d+)", post_url)
@@ -93,14 +136,16 @@ class BoardsAPI:
# Content
content_tag = soup.select_one(".Message.userContent")
content = content_tag.get_text(separator="\n", strip=True) if content_tag else None
content = (
content_tag.get_text(separator="\n", strip=True) if content_tag else None
)
# Title
title_tag = soup.select_one(".PageTitle h1")
title = title_tag.text.strip() if title_tag else None
# Comments
comments = self._parse_comments(post_url, post_num, comment_limit)
comments = self._parse_comments(post_url, post_num)
post = Post(
id=post_num,
@@ -110,16 +155,16 @@ class BoardsAPI:
url=post_url,
timestamp=timestamp,
source=self.source_name,
comments=comments
comments=comments,
)
return post
def _parse_comments(self, url: str, post_id: str, comment_limit: int) -> list[Comment]:
def _parse_comments(self, url: str, post_id: str) -> list[Comment]:
comments = []
current_url = url
while current_url and len(comments) < comment_limit:
while current_url:
html = self._fetch_page(current_url)
page_comments = self._parse_page_comments(html, post_id)
comments.extend(page_comments)
@@ -128,9 +173,9 @@ class BoardsAPI:
soup = BeautifulSoup(html, "html.parser")
next_link = soup.find("a", class_="Next")
if next_link and next_link.get('href'):
href = next_link.get('href')
current_url = href if href.startswith('http') else self.url + href
if next_link and next_link.get("href"):
href = next_link.get("href")
current_url = href if href.startswith("http") else self.base_url + href
else:
current_url = None
@@ -146,21 +191,29 @@ class BoardsAPI:
comment_id = tag.get("id")
# Author
user_elem = tag.find('span', class_='userinfo-username-title')
user_elem = tag.find("span", class_="userinfo-username-title")
username = user_elem.get_text(strip=True) if user_elem else None
# Timestamp
date_elem = tag.find('span', class_='DateCreated')
date_elem = tag.find("span", class_="DateCreated")
timestamp = date_elem.get_text(strip=True) if date_elem else None
timestamp = datetime.datetime.strptime(timestamp, "%d-%m-%Y %I:%M%p").timestamp() if timestamp else None
timestamp = (
datetime.datetime.strptime(timestamp, "%d-%m-%Y %I:%M%p").timestamp()
if timestamp
else None
)
# Content
message_div = tag.find('div', class_='Message userContent')
message_div = tag.find("div", class_="Message userContent")
if message_div and message_div.blockquote:
message_div.blockquote.decompose()
content = message_div.get_text(separator="\n", strip=True) if message_div else None
content = (
message_div.get_text(separator="\n", strip=True)
if message_div
else None
)
comment = Comment(
id=comment_id,
@@ -169,10 +222,8 @@ class BoardsAPI:
content=content,
timestamp=timestamp,
reply_to=None,
source=self.source_name
source=self.source_name,
)
comments.append(comment)
return comments


@@ -0,0 +1,259 @@
import requests
import logging
import time
import os
from dotenv import load_dotenv
from requests.auth import HTTPBasicAuth
from dto.post import Post
from dto.user import User
from dto.comment import Comment
from server.connectors.base import BaseConnector
logger = logging.getLogger(__name__)
CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
class RedditAPI(BaseConnector):
source_name: str = "reddit"
display_name: str = "Reddit"
search_enabled: bool = True
categories_enabled: bool = True
def __init__(self):
self.url = "https://www.reddit.com/"
self.token = None
self.token_expiry = 0
# Public Methods #
def get_new_posts_by_search(
self, search: str, category: str, post_limit: int
) -> list[Post]:
prefix = f"r/{category}/" if category else ""
params = {"limit": post_limit}
if search:
endpoint = f"{prefix}search.json"
params.update(
{"q": search, "sort": "new", "restrict_sr": "on" if category else "off"}
)
else:
endpoint = f"{prefix}new.json"
posts = []
after = None
while len(posts) < post_limit:
batch_limit = min(100, post_limit - len(posts))
params["limit"] = batch_limit
if after:
params["after"] = after
data = self._fetch_post_overviews(endpoint, params)
if not data or "data" not in data or not data["data"].get("children"):
break
batch_posts = self._parse_posts(data)
posts.extend(batch_posts)
after = data["data"].get("after")
if not after:
break
return posts[:post_limit]
def _get_new_subreddit_posts(self, subreddit: str, limit: int = 10) -> list[Post]:
posts = []
after = None
url = f"r/{subreddit}/new.json"
logger.info(f"Fetching new posts from subreddit: {subreddit}")
while len(posts) < limit:
batch_limit = min(100, limit - len(posts))
params = {"limit": batch_limit, "after": after}
data = self._fetch_post_overviews(url, params)
batch_posts = self._parse_posts(data)
logger.debug(
f"Fetched {len(batch_posts)} new posts from subreddit {subreddit}"
)
if not batch_posts:
break
posts.extend(batch_posts)
after = data["data"].get("after")
if not after:
break
return posts
def get_user(self, username: str) -> User:
data = self._fetch_post_overviews(f"user/{username}/about.json", {})
return self._parse_user(data)
def category_exists(self, category: str) -> bool:
try:
data = self._fetch_post_overviews(f"r/{category}/about.json", {})
return (
data is not None
and "data" in data
and data["data"].get("id") is not None
)
except Exception:
return False
## Private Methods ##
def _parse_posts(self, data) -> list[Post]:
posts = []
total_num_posts = len(data["data"]["children"])
current_index = 0
for item in data["data"]["children"]:
current_index += 1
logger.debug(f"Parsing post {current_index} of {total_num_posts}")
post_data = item["data"]
post = Post(
id=post_data["id"],
author=post_data["author"],
title=post_data["title"],
content=post_data.get("selftext", ""),
url=post_data["url"],
timestamp=post_data["created_utc"],
source=self.source_name,
comments=self._get_post_comments(post_data["id"]),
)
post.subreddit = post_data["subreddit"]
post.upvotes = post_data["ups"]
posts.append(post)
return posts
def _get_post_comments(self, post_id: str) -> list[Comment]:
comments: list[Comment] = []
url = f"comments/{post_id}.json"
data = self._fetch_post_overviews(url, {})
if len(data) < 2:
return comments
comment_data = data[1]["data"]["children"]
def _parse_comment_tree(items, parent_id=None):
for item in items:
if item["kind"] != "t1":
continue
comment_info = item["data"]
comment = Comment(
id=comment_info["id"],
post_id=post_id,
author=comment_info["author"],
content=comment_info.get("body", ""),
timestamp=comment_info["created_utc"],
reply_to=parent_id or comment_info.get("parent_id", None),
source=self.source_name,
)
comments.append(comment)
# Process replies recursively
replies = comment_info.get("replies")
if replies and isinstance(replies, dict):
reply_items = replies.get("data", {}).get("children", [])
_parse_comment_tree(reply_items, parent_id=comment.id)
_parse_comment_tree(comment_data)
return comments
def _parse_user(self, data) -> User:
user_data = data["data"]
user = User(username=user_data["name"], created_utc=user_data["created_utc"])
user.karma = user_data["total_karma"]
return user
def _get_token(self):
if self.token and time.time() < self.token_expiry:
return self.token
logger.info("Fetching new Reddit access token...")
auth = HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)
data = {
"grant_type": "client_credentials"
}
headers = {
"User-Agent": "python:ethnography-college-project:0.1 (by /u/ThisBirchWood)"
}
response = requests.post(
"https://www.reddit.com/api/v1/access_token",
auth=auth,
data=data,
headers=headers,
)
response.raise_for_status()
token_json = response.json()
self.token = token_json["access_token"]
self.token_expiry = time.time() + token_json["expires_in"] - 60
logger.info(
f"Obtained new Reddit access token (expires in {token_json['expires_in']}s)"
)
return self.token
def _fetch_post_overviews(self, endpoint: str, params: dict) -> dict:
url = f"https://oauth.reddit.com/{endpoint.lstrip('/')}"
max_retries = 15
backoff = 1 # seconds
for attempt in range(max_retries):
try:
response = requests.get(
url,
headers={
"User-agent": "python:ethnography-college-project:0.1 (by /u/ThisBirchWood)",
"Authorization": f"Bearer {self._get_token()}",
},
params=params,
)
if response.status_code == 429:
try:
wait_time = int(response.headers.get("X-Ratelimit-Reset", backoff))
wait_time += 1 # Add a small buffer to ensure the rate limit has reset
except ValueError:
wait_time = backoff
logger.warning(
f"Rate limited by Reddit API. Retrying in {wait_time} seconds..."
)
time.sleep(wait_time)
backoff *= 2
continue
if response.status_code == 500:
logger.warning("Server error from Reddit API. Retrying...")
time.sleep(backoff)
backoff *= 2
continue
response.raise_for_status()
return response.json()
except requests.RequestException as e:
logger.error(f"Error fetching data from Reddit API: {e}")
return {}
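The retry behaviour in `_fetch_post_overviews` can be sketched in isolation as below. This is an illustrative reduction under stated assumptions, not the repository's code: `fetch_with_backoff`, the `send` callable returning `(status, headers, body)`, and the injectable `sleep` are all hypothetical names introduced so the logic can run without network access.

```python
import time


def fetch_with_backoff(send, max_retries=5, sleep=time.sleep):
    """Call send() until it succeeds, waiting between rate-limited or failed attempts.

    send is a hypothetical callable returning (status_code, headers, body).
    """
    backoff = 1  # seconds, doubled after every retry
    for _ in range(max_retries):
        status, headers, body = send()
        if status == 429:
            try:
                # Prefer the server's reset hint, plus a one-second buffer.
                wait_time = int(headers.get("X-Ratelimit-Reset", backoff)) + 1
            except ValueError:
                # The header was not an integer string; fall back to the local backoff.
                wait_time = backoff
            sleep(wait_time)
            backoff *= 2
            continue
        if status == 500:
            sleep(backoff)
            backoff *= 2
            continue
        return body
    # Every attempt was rate limited or failed.
    return {}
```

Passing `sleep` as a parameter is only a testing convenience; a caller could record the waits in a list instead of actually pausing.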


@@ -0,0 +1,35 @@
import pkgutil
import importlib
import server.connectors
from server.connectors.base import BaseConnector
def _discover_connectors() -> list[type[BaseConnector]]:
"""Walk the connectors package and collect all BaseConnector subclasses."""
for _, module_name, _ in pkgutil.iter_modules(server.connectors.__path__):
if module_name in ("base", "registry"):
continue
importlib.import_module(f"server.connectors.{module_name}")
return [
cls
for cls in BaseConnector.__subclasses__()
if cls.source_name # guard against abstract intermediaries
]
def get_available_connectors() -> dict[str, type[BaseConnector]]:
return {c.source_name: c for c in _discover_connectors() if c.is_available()}
def get_connector_metadata() -> dict[str, dict]:
res = {}
for id, obj in get_available_connectors().items():
res[id] = {
"id": id,
"label": obj.display_name,
"search_enabled": obj.search_enabled,
"categories_enabled": obj.categories_enabled,
}
return res
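The subclass-based discovery used by the registry can be illustrated with a small standalone sketch. The class names below (`Base`, `Alpha`, `Beta`, `Intermediate`) and the `discover` helper are hypothetical, and the package walk with `pkgutil` is omitted because it needs a real package on disk. One detail worth noting: `__subclasses__()` returns direct subclasses only, and reading an attribute that is merely annotated on the base raises `AttributeError`, so this sketch uses `getattr` with a default as a defensive variant of the `if cls.source_name` guard.

```python
class Base:
    source_name: str  # annotation only, no value on the base class

    @classmethod
    def is_available(cls) -> bool:
        return True


class Alpha(Base):
    source_name = "alpha"


class Beta(Base):
    source_name = "beta"

    @classmethod
    def is_available(cls) -> bool:
        # Simulates a connector whose credentials are missing.
        return False


class Intermediate(Base):
    """Defines no source_name, so it should be skipped."""


def discover(base: type) -> dict:
    """Map source_name to class for every usable direct subclass of base."""
    return {
        cls.source_name: cls
        for cls in base.__subclasses__()
        if getattr(cls, "source_name", None) and cls.is_available()
    }
```

With these definitions, `discover(Base)` contains only `Alpha`, since `Beta` reports itself unavailable and `Intermediate` has no `source_name`.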


@@ -0,0 +1,118 @@
import os
import datetime
import logging
from dotenv import load_dotenv
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from dto.post import Post
from dto.comment import Comment
from server.connectors.base import BaseConnector
load_dotenv()
API_KEY = os.getenv("YOUTUBE_API_KEY")
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
class YouTubeAPI(BaseConnector):
source_name: str = "youtube"
display_name: str = "YouTube"
search_enabled: bool = True
categories_enabled: bool = False
def __init__(self):
self.youtube = build("youtube", "v3", developerKey=API_KEY)
def get_new_posts_by_search(
self, search: str, category: str, post_limit: int
) -> list[Post]:
videos = self._search_videos(search, post_limit)
posts = []
for video in videos:
video_id = video["id"]["videoId"]
snippet = video["snippet"]
title = snippet["title"]
description = snippet["description"]
published_at = datetime.datetime.strptime(
snippet["publishedAt"], "%Y-%m-%dT%H:%M:%SZ"
).timestamp()
channel_title = snippet["channelTitle"]
comments = []
comments_data = self._get_video_comments(video_id)
for comment_thread in comments_data:
comment_snippet = comment_thread["snippet"]["topLevelComment"][
"snippet"
]
comment = Comment(
id=comment_thread["id"],
post_id=video_id,
content=comment_snippet["textDisplay"],
author=comment_snippet["authorDisplayName"],
timestamp=datetime.datetime.strptime(
comment_snippet["publishedAt"], "%Y-%m-%dT%H:%M:%SZ"
).timestamp(),
reply_to=None,
source=self.source_name,
)
comments.append(comment)
post = Post(
id=video_id,
content=f"{title}\n\n{description}",
author=channel_title,
timestamp=published_at,
url=f"https://www.youtube.com/watch?v={video_id}",
title=title,
source=self.source_name,
comments=comments,
)
posts.append(post)
return posts
def category_exists(self, category):
return True
def _search_videos(self, query, limit):
results = []
next_page_token = None
while len(results) < limit:
batch_size = min(50, limit - len(results))
request = self.youtube.search().list(
q=query,
part="snippet",
type="video",
maxResults=batch_size,
pageToken=next_page_token
)
response = request.execute()
results.extend(response.get("items", []))
logger.info(f"Fetched {len(results)} out of {limit} videos for query '{query}'")
next_page_token = response.get("nextPageToken")
if not next_page_token:
logger.warning(f"No more pages of results available for query '{query}'")
break
return results[:limit]
def _get_video_comments(self, video_id):
request = self.youtube.commentThreads().list(
part="snippet", videoId=video_id, textFormat="plainText"
)
try:
response = request.execute()
except HttpError as e:
print(f"Error fetching comments for video {video_id}: {e}")
return []
return response.get("items", [])

61
server/core/auth.py Normal file

@@ -0,0 +1,61 @@
import re
from server.db.database import PostgresConnector
from flask_bcrypt import Bcrypt
EMAIL_REGEX = re.compile(r"[^@]+@[^@]+\.[^@]+")
class AuthManager:
def __init__(self, db: PostgresConnector, bcrypt: Bcrypt):
self.db = db
self.bcrypt = bcrypt
# private
def _save_user(self, username, email, password_hash):
query = """
INSERT INTO users (username, email, password_hash)
VALUES (%s, %s, %s)
"""
self.db.execute(query, (username, email, password_hash))
# public
def register_user(self, username, email, password):
hashed_password = self.bcrypt.generate_password_hash(password).decode("utf-8")
if len(username) < 3:
raise ValueError("Username must be at least 3 characters long")
if not EMAIL_REGEX.match(email):
raise ValueError("Please enter a valid email address")
if self.get_user_by_email(email):
raise ValueError("Email already registered")
if self.get_user_by_username(username):
raise ValueError("Username already taken")
self._save_user(username, email, hashed_password)
def authenticate_user(self, username, password):
user = self.get_user_by_username(username)
if user and self.bcrypt.check_password_hash(user["password_hash"], password):
return user
return None
def get_user_by_id(self, user_id):
query = "SELECT id, username, email FROM users WHERE id = %s"
result = self.db.execute(query, (user_id,), fetch=True)
return result[0] if result else None
def get_user_by_username(self, username) -> dict:
query = (
"SELECT id, username, email, password_hash FROM users WHERE username = %s"
)
result = self.db.execute(query, (username,), fetch=True)
return result[0] if result else None
def get_user_by_email(self, email) -> dict:
query = "SELECT id, username, email, password_hash FROM users WHERE email = %s"
result = self.db.execute(query, (email,), fetch=True)
return result[0] if result else None

202
server/core/datasets.py Normal file

@@ -0,0 +1,202 @@
import pandas as pd
from server.db.database import PostgresConnector
from psycopg2.extras import Json
from server.exceptions import NonExistentDatasetException
class DatasetManager:
def __init__(self, db: PostgresConnector):
self.db = db
def authorize_user_dataset(self, dataset_id: int, user_id: int) -> bool:
dataset_info = self.get_dataset_info(dataset_id)
if dataset_info.get("user_id") is None:
return False
if dataset_info.get("user_id") != user_id:
return False
return True
def get_user_datasets(self, user_id: int) -> list[dict]:
query = "SELECT * FROM datasets WHERE user_id = %s"
return self.db.execute(query, (user_id,), fetch=True)
def get_dataset_content(self, dataset_id: int) -> pd.DataFrame:
query = "SELECT * FROM events WHERE dataset_id = %s"
result = self.db.execute(query, (dataset_id,), fetch=True)
df = pd.DataFrame(result)
if df.empty:
return df
dedupe_columns = [
column
for column in [
"post_id",
"parent_id",
"reply_to",
"author",
"type",
"timestamp",
"dt",
"title",
"content",
"source",
"topic",
]
if column in df.columns
]
if dedupe_columns:
df = df.drop_duplicates(subset=dedupe_columns, keep="first")
else:
df = df.drop_duplicates(keep="first")
return df.reset_index(drop=True)
def get_dataset_info(self, dataset_id: int) -> dict:
query = "SELECT * FROM datasets WHERE id = %s"
result = self.db.execute(query, (dataset_id,), fetch=True)
if not result:
raise NonExistentDatasetException(f"Dataset {dataset_id} does not exist")
return result[0]
def save_dataset_info(self, user_id: int, dataset_name: str, topics: dict) -> int:
query = """
INSERT INTO datasets (user_id, name, topics)
VALUES (%s, %s, %s)
RETURNING id
"""
result = self.db.execute(
query, (user_id, dataset_name, Json(topics)), fetch=True
)
return result[0]["id"] if result else None
def save_dataset_content(self, dataset_id: int, event_data: pd.DataFrame):
if event_data.empty:
return
dedupe_columns = [
column for column in ["id", "type", "source"] if column in event_data.columns
]
if dedupe_columns:
event_data = event_data.drop_duplicates(subset=dedupe_columns, keep="first")
else:
event_data = event_data.drop_duplicates(keep="first")
self.delete_dataset_content(dataset_id)
query = """
INSERT INTO events (
dataset_id,
post_id,
type,
parent_id,
author,
title,
content,
timestamp,
date,
dt,
hour,
weekday,
reply_to,
source,
topic,
topic_confidence,
ner_entities,
emotion_anger,
emotion_disgust,
emotion_fear,
emotion_joy,
emotion_sadness
)
VALUES (
%s, %s, %s, %s, %s,
%s, %s, %s, %s, %s,
%s, %s, %s, %s, %s,
%s, %s, %s, %s, %s,
%s, %s
)
"""
values = [
(
dataset_id,
row["id"],
row["type"],
row["parent_id"],
row["author"],
row.get("title"),
row["content"],
row["timestamp"],
row["date"],
row["dt"],
row["hour"],
row["weekday"],
row.get("reply_to"),
row["source"],
row.get("topic"),
row.get("topic_confidence"),
Json(row["entities"]) if row.get("entities") is not None else None,
row.get("emotion_anger"),
row.get("emotion_disgust"),
row.get("emotion_fear"),
row.get("emotion_joy"),
row.get("emotion_sadness"),
)
for _, row in event_data.iterrows()
]
self.db.execute_batch(query, values)
def set_dataset_status(
self, dataset_id: int, status: str, status_message: str | None = None
):
if status not in ["fetching", "processing", "complete", "error"]:
raise ValueError("Invalid status")
query = """
UPDATE datasets
SET status = %s,
status_message = %s,
completed_at = CASE
WHEN %s = 'complete' THEN NOW()
ELSE NULL
END
WHERE id = %s
"""
self.db.execute(query, (status, status_message, status, dataset_id))
def get_dataset_status(self, dataset_id: int):
query = """
SELECT status, status_message, completed_at
FROM datasets
WHERE id = %s
"""
result = self.db.execute(query, (dataset_id,), fetch=True)
if not result:
print(result)
raise NonExistentDatasetException(f"Dataset {dataset_id} does not exist")
return result[0]
def update_dataset_name(self, dataset_id: int, new_name: str):
query = "UPDATE datasets SET name = %s WHERE id = %s"
self.db.execute(query, (new_name, dataset_id))
def delete_dataset_info(self, dataset_id: int):
query = "DELETE FROM datasets WHERE id = %s"
self.db.execute(query, (dataset_id,))
def delete_dataset_content(self, dataset_id: int):
query = "DELETE FROM events WHERE dataset_id = %s"
self.db.execute(query, (dataset_id,))

62
server/db/database.py Normal file

@@ -0,0 +1,62 @@
import os
import psycopg2
from dotenv import load_dotenv
from psycopg2.extras import RealDictCursor
from psycopg2.extras import execute_batch
load_dotenv()
postgres_host = os.getenv("POSTGRES_HOST", "localhost")
postgres_port = os.getenv("POSTGRES_PORT", 5432)
postgres_user = os.getenv("POSTGRES_USER", "postgres")
postgres_password = os.getenv("POSTGRES_PASSWORD", "postgres")
postgres_db = os.getenv("POSTGRES_DB", "postgres")
from server.exceptions import DatabaseNotConfiguredException
class PostgresConnector:
"""
Simple PostgreSQL connector (single connection).
"""
def __init__(self):
try:
self.connection = psycopg2.connect(
host=postgres_host,
port=postgres_port,
user=postgres_user,
password=postgres_password,
database=postgres_db,
)
except psycopg2.OperationalError as e:
raise DatabaseNotConfiguredException(
f"Ensure database is up and running: {e}"
)
self.connection.autocommit = False
def execute(self, query, params=None, fetch=False) -> list:
try:
with self.connection.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute(query, params)
result = cursor.fetchall() if fetch else None
self.connection.commit()
return result
except Exception:
self.connection.rollback()
raise
def execute_batch(self, query, values):
try:
with self.connection.cursor(cursor_factory=RealDictCursor) as cursor:
execute_batch(cursor, query, values)
self.connection.commit()
except Exception:
self.connection.rollback()
raise
def close(self):
if self.connection:
self.connection.close()

66
server/db/schema.sql Normal file

@@ -0,0 +1,66 @@
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(255) NOT NULL UNIQUE,
email VARCHAR(255) NOT NULL UNIQUE,
password_hash VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE datasets (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL,
name VARCHAR(255) NOT NULL,
description TEXT,
-- Job state machine
status TEXT NOT NULL DEFAULT 'processing',
status_message TEXT,
completed_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
topics JSONB,
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
-- Enforce valid states
CONSTRAINT datasets_status_check
CHECK (status IN ('fetching', 'processing', 'complete', 'error'))
);
CREATE TABLE events (
/* Required Fields */
id SERIAL PRIMARY KEY,
dataset_id INTEGER NOT NULL,
post_id VARCHAR(255) NOT NULL,
type VARCHAR(255) NOT NULL,
author VARCHAR(255) NOT NULL,
content TEXT NOT NULL,
timestamp BIGINT NOT NULL,
date DATE NOT NULL,
dt TIMESTAMP NOT NULL,
hour INTEGER NOT NULL,
weekday VARCHAR(255) NOT NULL,
/* Posts Only */
title TEXT,
/* Comments Only*/
parent_id VARCHAR(255),
reply_to VARCHAR(255),
source VARCHAR(255) NOT NULL,
/* NLP Fields */
topic VARCHAR(255),
topic_confidence FLOAT,
ner_entities JSONB,
emotion_anger FLOAT,
emotion_disgust FLOAT,
emotion_fear FLOAT,
emotion_joy FLOAT,
emotion_sadness FLOAT,
FOREIGN KEY (dataset_id) REFERENCES datasets(id) ON DELETE CASCADE
);

8
server/exceptions.py Normal file

@@ -0,0 +1,8 @@
class NotAuthorisedException(Exception):
pass
class NonExistentDatasetException(Exception):
pass
class DatabaseNotConfiguredException(Exception):
pass


@@ -0,0 +1,23 @@
from celery import Celery
from dotenv import load_dotenv
from server.utils import get_env
load_dotenv()
REDIS_URL = get_env("REDIS_URL")
def create_celery():
celery = Celery(
"ethnograph",
broker=REDIS_URL,
backend=REDIS_URL,
)
celery.conf.task_serializer = "json"
celery.conf.result_serializer = "json"
celery.conf.accept_content = ["json"]
return celery
celery = create_celery()
from server.queue import tasks

84
server/queue/tasks.py Normal file

@@ -0,0 +1,84 @@
from time import time
import pandas as pd
import logging
from server.queue.celery_app import celery
from server.analysis.enrichment import DatasetEnrichment
from server.db.database import PostgresConnector
from server.core.datasets import DatasetManager
from server.connectors.registry import get_available_connectors
logger = logging.getLogger(__name__)
@celery.task(bind=True, max_retries=3)
def process_dataset(self, dataset_id: int, posts: list, topics: dict):
db = PostgresConnector()
dataset_manager = DatasetManager(db)
try:
df = pd.DataFrame(posts)
dataset_manager.set_dataset_status(
dataset_id, "processing", "NLP Processing Started"
)
processor = DatasetEnrichment(df, topics)
enriched_df = processor.enrich()
dataset_manager.save_dataset_content(dataset_id, enriched_df)
dataset_manager.set_dataset_status(
dataset_id, "complete", "NLP Processing Completed Successfully"
)
except Exception as e:
dataset_manager.set_dataset_status(
dataset_id, "error", f"An error occurred: {e}"
)
@celery.task(bind=True, max_retries=3)
def fetch_and_process_dataset(
self, dataset_id: int, source_info: list[dict], topics: dict
):
connectors = get_available_connectors()
db = PostgresConnector()
dataset_manager = DatasetManager(db)
posts = []
try:
for metadata in source_info:
fetch_start = time()
name = metadata["name"]
search = metadata.get("search")
category = metadata.get("category")
limit = metadata.get("limit", 100)
connector = connectors[name]()
raw_posts = connector.get_new_posts_by_search(
search=search, category=category, post_limit=limit
)
posts.extend(post.to_dict() for post in raw_posts)
fetch_time = time() - fetch_start
df = pd.DataFrame(posts)
nlp_start = time()
dataset_manager.set_dataset_status(
dataset_id, "processing", "NLP Processing Started"
)
processor = DatasetEnrichment(df, topics)
enriched_df = processor.enrich()
nlp_time = time() - nlp_start
dataset_manager.save_dataset_content(dataset_id, enriched_df)
dataset_manager.set_dataset_status(
dataset_id, "complete", f"Completed Successfully. Fetch time: {fetch_time:.2f}s, NLP time: {nlp_time:.2f}s"
)
except Exception as e:
dataset_manager.set_dataset_status(
dataset_id, "error", f"An error occurred: {e}"
)


@@ -1,170 +0,0 @@
import pandas as pd
import datetime
import nltk
from nltk.corpus import stopwords
from server.analysis.nlp import NLP
from server.analysis.temporal import TemporalAnalysis
from server.analysis.emotional import EmotionalAnalysis
from server.analysis.interactional import InteractionAnalysis
from server.analysis.linguistic import LinguisticAnalysis
from server.analysis.cultural import CulturalAnalysis
DOMAIN_STOPWORDS = {
"www", "https", "http",
"boards", "boardsie",
"comment", "comments",
"discussion", "thread",
"post", "posts",
"would", "get", "one"
}
nltk.download('stopwords')
EXCLUDE_WORDS = set(stopwords.words('english')) | DOMAIN_STOPWORDS
class StatGen:
def __init__(self, df: pd.DataFrame, domain_topics: dict) -> None:
comments_df = df[["id", "comments"]].explode("comments")
comments_df = comments_df[comments_df["comments"].apply(lambda x: isinstance(x, dict))]
comments_df = pd.json_normalize(comments_df["comments"])
posts_df = df.drop(columns=["comments"])
posts_df["type"] = "post"
posts_df["parent_id"] = None
comments_df["type"] = "comment"
comments_df["parent_id"] = comments_df.get("post_id")
self.domain_topics = domain_topics
self.df = pd.concat([posts_df, comments_df])
self.df.drop(columns=["post_id"], inplace=True, errors="ignore")
self.nlp = NLP(self.df, "title", "content", domain_topics)
self.nlp.add_emotion_cols()
self.nlp.add_topic_col()
self.nlp.add_ner_cols()
self._add_time_cols(self.df)
self.temporal_analysis = TemporalAnalysis(self.df)
self.emotional_analysis = EmotionalAnalysis(self.df)
self.interaction_analysis = InteractionAnalysis(self.df, EXCLUDE_WORDS)
self.linguistic_analysis = LinguisticAnalysis(self.df, EXCLUDE_WORDS)
self.cultural_analysis = CulturalAnalysis(self.df)
self.original_df = self.df.copy(deep=True)
## Private Methods
def _add_time_cols(self, df: pd.DataFrame) -> None:
df['timestamp'] = pd.to_numeric(df['timestamp'], errors='coerce')
df['date'] = pd.to_datetime(df['timestamp'], unit='s').dt.date
df["dt"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
df["hour"] = df["dt"].dt.hour
df["weekday"] = df["dt"].dt.day_name()
## Public
# topics over time
# emotions over time
def get_time_analysis(self) -> dict:
return {
"events_per_day": self.temporal_analysis.posts_per_day(),
"weekday_hour_heatmap": self.temporal_analysis.heatmap()
}
# average topic duration
def get_content_analysis(self) -> dict:
return {
"word_frequencies": self.linguistic_analysis.word_frequencies(),
"common_two_phrases": self.linguistic_analysis.ngrams(),
"common_three_phrases": self.linguistic_analysis.ngrams(n=3),
"average_emotion_by_topic": self.emotional_analysis.avg_emotion_by_topic(),
"reply_time_by_emotion": self.temporal_analysis.avg_reply_time_per_emotion()
}
# average emotion per user
# average chain length
def get_user_analysis(self) -> dict:
return {
"top_users": self.interaction_analysis.top_users(),
"users": self.interaction_analysis.per_user_analysis()
}
# average / max thread depth
# high engagement threads based on volume
def get_interactional_analysis(self) -> dict:
return {
"average_thread_depth": self.interaction_analysis.average_thread_depth(),
"average_thread_length_by_emotion": self.interaction_analysis.average_thread_length_by_emotion(),
"interaction_graph": self.interaction_analysis.interaction_graph()
}
    # detect community jargon
    # in-group and out-group linguistic markers
    def get_cultural_analysis(self) -> dict:
        return {
            "identity_markers": self.cultural_analysis.get_identity_markers(),
            "stance_markers": self.cultural_analysis.get_stance_markers(),
            "entity_salience": self.cultural_analysis.get_avg_emotions_per_entity()
        }
    def summary(self) -> dict:
        total_posts = (self.df["type"] == "post").sum()
        total_comments = (self.df["type"] == "comment").sum()
        events_per_user = self.df.groupby("author").size()
        return {
            "total_events": int(len(self.df)),
            "total_posts": int(total_posts),
            "total_comments": int(total_comments),
            "unique_users": int(events_per_user.count()),
            "comments_per_post": round(total_comments / max(total_posts, 1), 2),
            "lurker_ratio": round((events_per_user == 1).mean(), 2),
            "time_range": {
                "start": int(self.df["dt"].min().timestamp()),
                "end": int(self.df["dt"].max().timestamp())
            },
            "sources": self.df["source"].dropna().unique().tolist()
        }
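    def filter_by_query(self, search_query: str) -> dict:
        # regex=False matches the query literally, so characters such as "(" or "+"
        # cannot raise a regex error; na=False excludes rows with missing content
        # instead of producing a mask that contains NaN
        self.df = self.df[
            self.df["content"].str.contains(search_query, regex=False, na=False)
        ]
        return {
            "rows": len(self.df),
            "data": self.df.to_dict(orient="records")
        }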
    def set_time_range(self, start: datetime.datetime, end: datetime.datetime) -> dict:
        # self.df["dt"] is timezone-aware (UTC), so start and end must be
        # timezone-aware too; pandas refuses to compare aware and naive datetimes
        self.df = self.df[
            (self.df["dt"] >= start) &
            (self.df["dt"] <= end)
        ]
        return {
            "rows": len(self.df),
            "data": self.df.to_dict(orient="records")
        }
    def filter_data_sources(self, data_sources: dict) -> dict:
        """
        Input is a hash map (source_name: str -> enabled: bool)
        """
        enabled_sources = [src for src, enabled in data_sources.items() if enabled]
        if not enabled_sources:
            raise ValueError("Please choose at least one data source")
        self.df = self.df[self.df["source"].isin(enabled_sources)]
        return {
            "rows": len(self.df),
            "data": self.df.to_dict(orient="records")
        }
    def reset_dataset(self) -> None:
        self.df = self.original_df.copy(deep=True)
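
The `summary` arithmetic above can be checked by hand on a tiny sample. The following is a dependency-free sketch, not the project's implementation: the `summarise` helper and the sample events are hypothetical, and it only mirrors how `comments_per_post` and `lurker_ratio` (the share of authors with exactly one event) are computed.

```python
from collections import Counter


def summarise(events: list[dict]) -> dict:
    """Mirror the comments_per_post and lurker_ratio arithmetic on plain dicts."""
    total_posts = sum(1 for e in events if e["type"] == "post")
    total_comments = sum(1 for e in events if e["type"] == "comment")
    events_per_user = Counter(e["author"] for e in events)
    single_event_users = sum(1 for n in events_per_user.values() if n == 1)
    return {
        "total_events": len(events),
        "unique_users": len(events_per_user),
        # max(..., 1) avoids dividing by zero when there are no posts
        "comments_per_post": round(total_comments / max(total_posts, 1), 2),
        # guard against an empty dataset (an assumption of this sketch)
        "lurker_ratio": round(single_event_users / max(len(events_per_user), 1), 2),
    }


# Hypothetical events: author "a" appears twice, "b" and "c" once each
sample = [
    {"type": "post", "author": "a"},
    {"type": "comment", "author": "b"},
    {"type": "comment", "author": "a"},
    {"type": "comment", "author": "c"},
]
result = summarise(sample)
```

With this sample, two of the three authors have a single event, so the ratio rounds to 0.67, and the one post has three comments.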

67
server/topics.json Normal file

@@ -0,0 +1,67 @@
{
"Personal Life": "daily life, life updates, what happened today, personal stories, life events, reflections",
"Relationships": "dating, relationships, breakups, friendships, family relationships, marriage, relationship advice",
"Family & Parenting": "parents, parenting, children, raising kids, family dynamics, family stories",
"Work & Careers": "jobs, workplaces, office life, promotions, quitting jobs, career advice, workplace drama",
"Education": "school, studying, exams, university, homework, academic pressure, learning experiences",
"Money & Finance": "saving money, debt, budgeting, cost of living, financial advice, personal finance",
"Health & Fitness": "exercise, gym, workouts, running, diet, fitness routines, weight loss",
"Mental Health": "stress, anxiety, depression, burnout, therapy, emotional wellbeing",
"Food & Cooking": "meals, cooking, recipes, restaurants, snacks, food opinions",
"Travel": "holidays, trips, tourism, travel experiences, airports, flights, travel tips",
"Entertainment": "movies, TV shows, streaming services, celebrities, pop culture",
"Music": "songs, albums, artists, concerts, music opinions",
"Gaming": "video games, gaming culture, consoles, PC gaming, esports",
"Sports": "sports matches, teams, players, competitions, sports opinions",
"Technology": "phones, gadgets, apps, AI, software, tech trends",
"Internet Culture": "memes, viral trends, online jokes, internet drama, trending topics",
"Social Media": "platforms, influencers, content creators, algorithms, online communities",
"News & Current Events": "breaking news, world events, major incidents, public discussions",
"Politics": "political debates, elections, government policies, ideology",
"Culture & Society": "social issues, cultural trends, generational debates, societal changes",
"Identity & Lifestyle": "personal identity, lifestyle choices, values, self-expression",
"Hobbies & Interests": "art, photography, crafts, collecting, hobbies",
"Fashion & Beauty": "clothing, style, makeup, skincare, fashion trends",
"Animals & Pets": "pets, animal videos, pet care, wildlife",
"Humour": "jokes, funny stories, sarcasm, memes",
"Opinions & Debates": "hot takes, controversial opinions, arguments, discussions",
"Advice & Tips": "life advice, tutorials, how-to tips, recommendations",
"Product Reviews": "reviews, recommendations, experiences with products",
"Complaints & Rants": "frustrations, complaining, venting about things",
"Motivation & Inspiration": "motivational quotes, success stories, encouragement",
"Questions & Curiosity": "asking questions, seeking opinions, curiosity posts",
"Celebrations & Achievements": "birthdays, milestones, achievements, good news",
"Random Thoughts": "shower thoughts, observations, random ideas"
}
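
Each topic in `server/topics.json` maps a label to one comma-separated string of descriptors. How the NLP code consumes that string is not shown in this diff, so the following is only a sketch of one way to load the file's shape and split the descriptors; `parse_topics` is a hypothetical helper, and the inline sample copies two entries from the file.

```python
import json

# Two entries copied from server/topics.json; the rest are omitted for brevity.
raw = """
{
  "Music": "songs, albums, artists, concerts, music opinions",
  "Gaming": "video games, gaming culture, consoles, PC gaming, esports"
}
"""


def parse_topics(text: str) -> dict[str, list[str]]:
    """Split each topic's comma-separated description into a list of descriptors."""
    topics = json.loads(text)
    return {
        name: [part.strip() for part in description.split(",") if part.strip()]
        for name, description in topics.items()
    }


topics = parse_topics(raw)
```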

57
server/utils.py Normal file

@@ -0,0 +1,57 @@
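import datetime
import os

from flask import request


def parse_datetime_filter(value):
    """Parse an ISO-8601 string or Unix timestamp into a timezone-aware datetime."""
    if not value:
        return None
    try:
        parsed = datetime.datetime.fromisoformat(value)
    except ValueError:
        try:
            # Interpret Unix timestamps as UTC rather than server-local time
            return datetime.datetime.fromtimestamp(
                float(value), tz=datetime.timezone.utc
            )
        except ValueError as err:
            raise ValueError(
                "Date filters must be ISO-8601 strings or Unix timestamps"
            ) from err
    # Naive ISO strings are assumed to be UTC so they can be compared with the
    # timezone-aware "dt" column
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=datetime.timezone.utc)
    return parsed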
def get_request_filters() -> dict:
    filters = {}
    search_query = request.args.get("search_query") or request.args.get("query")
    if search_query:
        filters["search_query"] = search_query
    start_date = parse_datetime_filter(
        request.args.get("start_date") or request.args.get("start")
    )
    if start_date:
        filters["start_date"] = start_date
    end_date = parse_datetime_filter(
        request.args.get("end_date") or request.args.get("end")
    )
    if end_date:
        filters["end_date"] = end_date
    data_sources = request.args.getlist("data_sources")
    if not data_sources:
        data_sources = request.args.getlist("sources")
    if len(data_sources) == 1 and "," in data_sources[0]:
        data_sources = [
            source.strip() for source in data_sources[0].split(",") if source.strip()
        ]
    if data_sources:
        filters["data_sources"] = data_sources
    return filters
def get_env(name: str) -> str:
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
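
The `data_sources` handling in `get_request_filters` accepts both repeated query parameters and a single comma-separated value. The sketch below reproduces that branch without Flask so it can be run in isolation; `parse_sources` is a hypothetical helper, and `urllib.parse.parse_qs` stands in for `request.args.getlist`.

```python
from urllib.parse import parse_qs


def parse_sources(query_string: str) -> list[str]:
    """Accept ?data_sources=a&data_sources=b as well as ?data_sources=a,b."""
    args = parse_qs(query_string)
    # Fall back to the shorter "sources" alias when "data_sources" is absent
    sources = args.get("data_sources") or args.get("sources") or []
    if len(sources) == 1 and "," in sources[0]:
        sources = [s.strip() for s in sources[0].split(",") if s.strip()]
    return sources


repeated = parse_sources("data_sources=reddit&data_sources=youtube")
comma_separated = parse_sources("sources=reddit,%20youtube")
missing = parse_sources("")
```

Both request styles produce the same list, and an absent parameter yields an empty list, which matches the original's behaviour of leaving `data_sources` out of the filters.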