Automatic Scraping of dataset options #9

Merged
dylan merged 36 commits from feat/automatic-scraping-datasets into main 2026-03-14 21:58:49 +00:00
They will now be used more in the backend.
dylan added 1 commit 2026-03-09 20:53:41 +00:00
They will now be used more in the backend.
dylan added 1 commit 2026-03-09 20:55:15 +00:00
Ensures consistency with the other dataset-based endpoints and follows REST API conventions more closely.
dylan added 1 commit 2026-03-09 21:29:11 +00:00
The idea is to have a plugin-style system, where new connectors extend the `BaseConnector` class and implement the fetch posts method.

The registry detects these automatically, and new Flask endpoints use them to return a list of possible sources.

Allows for an open-ended system where new data scrapers / API consumers can be added dynamically.
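
As a rough illustration of the plugin idea described above, here is a minimal sketch. Only `BaseConnector` is named in this PR; `fetch_posts`, `source_name`, `ExampleForumConnector`, `get_available_connectors` and `list_sources` are hypothetical names chosen for the example. The real registry seems to locate connectors by module path, whereas this sketch relies on `__subclasses__()` to stay self-contained.

```python
from abc import ABC, abstractmethod


class BaseConnector(ABC):
    """Base class that every connector extends."""

    # Hypothetical attribute: a short identifier for the data source.
    source_name: str = ""

    @abstractmethod
    def fetch_posts(self, search=None, category=None):
        """Return a list of post dicts from the source."""


class ExampleForumConnector(BaseConnector):
    """Hypothetical connector, used only to show the extension point."""

    source_name = "example_forum"

    def fetch_posts(self, search=None, category=None):
        # A real connector would scrape a site or call an API here.
        return [{"title": "hello", "source": self.source_name}]


def get_available_connectors():
    """Map each source name to its connector class (direct subclasses only)."""
    return {
        cls.source_name: cls
        for cls in BaseConnector.__subclasses__()
        if cls.source_name
    }


def list_sources():
    """Return the sorted source names, e.g. as the payload of a sources endpoint."""
    return sorted(get_available_connectors())
```

Under these assumptions, adding a new source only requires defining another subclass; nothing in the endpoint code has to change.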
dylan added 1 commit 2026-03-10 18:08:03 +00:00
dylan added 1 commit 2026-03-10 18:11:37 +00:00
dylan added 1 commit 2026-03-10 18:18:46 +00:00
Registry paths were using the incorrect connector path locations.
dylan added 3 commits 2026-03-10 18:41:09 +00:00
This is easier and quicker than deriving a topic list from the dataset that has been scraped.

While using LLMs to create a personalised topic list based on the query, category or dataset itself would yield better results in most cases, it is beyond the scope of this project.
dylan added 4 commits 2026-03-10 19:17:45 +00:00
dylan added 1 commit 2026-03-10 19:23:51 +00:00
dylan added 2 commits 2026-03-10 22:17:04 +00:00
Calling the original `process_dataset` function led to issues with JSON serialisation.
dylan added 2 commits 2026-03-10 22:48:08 +00:00
dylan added 1 commit 2026-03-10 23:15:36 +00:00
dylan added 1 commit 2026-03-10 23:28:33 +00:00
dylan added 2 commits 2026-03-11 19:41:37 +00:00
dylan added 1 commit 2026-03-11 19:44:40 +00:00
dylan added 1 commit 2026-03-11 19:47:45 +00:00
dylan added 1 commit 2026-03-11 21:16:28 +00:00
Ideally category and search would be fully optional; however, some sites break if one or the other is not provided.

Unfortunately `boards.ie` has a different page type for searches, and I can't be bothered to implement a scraper for it from scratch.

In addition, removed comment limit options.
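
One way to express per-source constraints like these is sketched below. It assumes each source declares which parameters it needs. `SourceRequirements`, its fields and `validate_params` are hypothetical names, not taken from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRequirements:
    """Hypothetical per-source flags describing which parameters are usable."""

    search_supported: bool = True
    search_required: bool = False
    category_required: bool = False


def validate_params(reqs, search=None, category=None):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if search and not reqs.search_supported:
        errors.append("search is not supported for this source")
    if reqs.search_required and not search:
        errors.append("search is required for this source")
    if reqs.category_required and not category:
        errors.append("category is required for this source")
    return errors


# A source whose search pages are not scraped and which needs a category.
category_only = SourceRequirements(search_supported=False, category_required=True)
```

An endpoint could run this check before starting a scrape and reject the request early if the list is not empty.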
dylan added 4 commits 2026-03-13 21:59:19 +00:00
dylan added 4 commits 2026-03-14 17:14:07 +00:00
dylan added 1 commit 2026-03-14 17:35:10 +00:00
dylan added 1 commit 2026-03-14 21:53:16 +00:00
In addition, I made some methods private to better align with the BaseConnector parent class.
dylan added 1 commit 2026-03-14 21:58:03 +00:00
Mainly a security measure: we don't want raw code errors returned in the API response, as someone could use them to work out how the code behaves internally.
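
A minimal sketch of this pattern follows. It is not the code from this PR: `hide_internal_errors` and `scrape_endpoint` are hypothetical names. The handler returns a `(body, status)` tuple, a shape Flask view functions accept, and the sketch itself needs only the standard library.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def hide_internal_errors(view):
    """Log unexpected exceptions server-side and return a generic message."""

    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        try:
            return view(*args, **kwargs)
        except Exception:
            # The full traceback goes to the server log only.
            logger.exception("Unhandled error in %s", view.__name__)
            return {"error": "An unexpected error occurred"}, 500

    return wrapper


@hide_internal_errors
def scrape_endpoint():
    # Simulated failure whose message should never reach the client.
    raise RuntimeError("connection to postgres://user:secret@db failed")
```

The client sees only the generic message and a 500 status, while the details remain available in the logs for debugging.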
dylan changed title from WIP: Automatic Scraping of dataset options to Automatic Scraping of dataset options 2026-03-14 21:58:45 +00:00
dylan merged commit 94befb61c5 into main 2026-03-14 21:58:49 +00:00
dylan deleted branch feat/automatic-scraping-datasets 2026-03-14 21:58:49 +00:00
Reference: dylan/crosspost#9