Compare commits

...

3 Commits

5 changed files with 144 additions and 28 deletions

report/img/gnatt.png (new binary file, 50 KiB; not shown)

@@ -83,9 +83,9 @@ A defining feature of this project is its focus on a geographically grounded dat
\subsection{What is Digital Ethnography?}
Digital Ethnography is the study of cultures and interactions in various online spaces, such as forums, posts and video comments. The goal is not only to describe high-level statistics such as the number of posts and posts per day, but also to analyse people's behaviour at an interactional and cultural level, delving into common phrases, interaction patterns, and common topics and entities.
There are multiple methods to carry out digital ethnography, such as online participant observation through automated or manual methods, digital interviews via text or video, or tracing digital footprints.
Compared to traditional ethnography, digital ethnography is usually faster and more cost-effective, due to the availability of large swathes of data across social media sites such as Reddit, YouTube, and Facebook, and the lack of any need to travel. Traditional ethnography often relied on in-person interviews and in-person observation of communities. \cite{coleman2010ethnographic}
\subsubsection{Traditional Ethnography}
Ethnography originated in the late nineteenth and early twentieth centuries as a method for understanding cultures through long-term fieldwork. The goal was not just to describe behaviour, but to show how people made sense of their world. Over time, ethnography grew beyond anthropology into sociology, media studies, education, and human-computer interaction, becoming a broadly used qualitative research approach. Traditional ethnography was closely tied to physical locations: villages, workplaces or towns. However, as communication technologies developed and social life increasingly took place through technological mediums, social interaction was no longer tied to a physical place, and researchers questioned whether it could still be studied properly.
@@ -137,6 +137,8 @@ Some patterns, such as usage of words like "we, us, our, ourselves", where posts
\label{sec:stance_markers}
Stance markers refer to the use of different phrasing patterns which can reveal the speaker's attitude towards topics. There are several kinds of these phrasings, such as hedge, certainty, deontic and permission patterns.
Hedge and certainty markers are discussed in detail by \cite{shen2021stance}.
\textbf{Hedge Patterns} are usually phrases that contain words like "maybe, possibly, probably, i think, i feel" and generally indicate that the speaker is unsure or tentative about something.
\textbf{Certainty Patterns} contain phrases like "definitely, certainly, clearly, obviously" and, as the name suggests, imply certainty or assuredness.
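As an illustration, marker counts of this kind can be computed with simple whole-word phrase matching over the event text. The marker lists, function name, and column names below are illustrative, not the system's actual configuration:

```python
import re
import pandas as pd

# Hypothetical marker lists; the real system's phrase sets may differ.
HEDGE_MARKERS = ["maybe", "possibly", "probably", "i think", "i feel"]
CERTAINTY_MARKERS = ["definitely", "certainly", "clearly", "obviously"]

def count_markers(text: str, markers: list[str]) -> int:
    """Count occurrences of whole-word marker phrases in a text."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(m) + r"\b", lowered))
               for m in markers)

events = pd.DataFrame({"text": [
    "I think it's probably closed on Sundays",
    "It is definitely the best spot in Cork, obviously",
]})
events["hedges"] = events["text"].apply(count_markers, markers=HEDGE_MARKERS)
events["certainty"] = events["text"].apply(count_markers, markers=CERTAINTY_MARKERS)
```

The per-event counts can then be aggregated over the whole dataset, or compared between sources.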
@@ -197,7 +199,7 @@ The dataset is drawn from four distinct online platforms, each of which represen
\item \textbf{Boards.ie Cork section} — an older Irish forum platform with a distinct demographic profile and lower volume compared to Reddit, providing a counterpoint to the Reddit-dominant data and representing a longer-established form of Irish online community.
\end{itemize}
Reddit's hierarchical comment threading enables deep conversational analysis and reply-chain metrics \cite{medvedev2019anatomy}, whereas YouTube comments are largely flat and unthreaded. Boards.ie occupies a middle ground, with linear threads but a more intimate community character. Taken together, the four sources offer variation in interaction structure, community age, demographic composition, and linguistic register, all of which are factors that the system's analytical modules are designed to detect and compare.
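As a sketch of what a reply-chain metric looks like in practice, the depth of each event in its reply tree can be derived from parent pointers. The column names here are assumptions, not the system's actual schema:

```python
import pandas as pd

def reply_depths(events: pd.DataFrame) -> pd.Series:
    """Depth of each event in its reply tree (0 = top-level post).
    Assumes illustrative 'id' and 'parent_id' columns."""
    parents = events.set_index("id")["parent_id"].to_dict()

    def depth(event_id):
        d = 0
        parent = parents.get(event_id)
        while parent is not None:  # walk up the chain to the root post
            d += 1
            parent = parents.get(parent)
        return d

    return events["id"].map(depth)

events = pd.DataFrame({
    "id": ["p1", "c1", "c2"],
    "parent_id": [None, "p1", "c1"],  # c2 replies to c1, which replies to p1
})
print(reply_depths(events).tolist())  # [0, 1, 2]
```

On flat platforms such as YouTube every comment has the post as its parent, so depths never exceed 1, which is exactly the structural difference noted above.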
Because data is collected across multiple platforms, it must be normalised into a single data model. Posts, comments, and metadata fields differ in schema and semantics across sources. A core design requirement of the system is the normalisation of these inputs into a unified event-based internal representation, allowing the same analytical pipeline to operate uniformly regardless of the source.
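A minimal sketch of such a normalisation step, assuming a hypothetical unified event schema and using a few well-known Reddit API field names (\texttt{name}, \texttt{selftext}, \texttt{created\_utc}); the real system's connectors and column names may differ:

```python
import pandas as pd

# Illustrative unified event schema, not the system's actual one.
EVENT_COLUMNS = ["event_id", "source", "author", "text", "timestamp", "parent_id"]

def normalise_reddit(raw: pd.DataFrame) -> pd.DataFrame:
    """Map Reddit's source-specific fields onto the shared event schema."""
    return pd.DataFrame({
        "event_id": raw["name"],
        "source": "reddit",
        "author": raw["author"],
        "text": raw["selftext"],
        "timestamp": pd.to_datetime(raw["created_utc"], unit="s"),
        "parent_id": None,  # top-level posts have no parent
    })[EVENT_COLUMNS]

raw = pd.DataFrame({"name": ["t3_abc"], "author": ["u1"],
                    "selftext": ["Any good cafes in Cork?"],
                    "created_utc": [1700000000]})
events = normalise_reddit(raw)
```

One such function per connector lets every downstream statistic assume the same columns regardless of source.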
@@ -1071,19 +1073,6 @@ With the identity markers, in-group markers such as "we", "us", "our" were count
\label{fig:stance_markers}
\end{figure}
\subsubsection{Summary}
During development, it was helpful to see a high-level summary of the entire dataset, and it would also be helpful for end-users on the frontend to have a quick overview. Therefore, a "summary" statistic was implemented that returns a high-level overview of the dataset, including:
\begin{itemize}
\item Total number of posts and comments in the dataset.
\item Total number of unique users in the dataset.
\item Comments per post.
\item Lurker Ratio, which is the percentage of users that only have one event in the dataset.
\item The time range of the dataset, from the earliest event to the latest event.
\item Sources included in the dataset.
\end{itemize}
This is implemented in the same way as the other statistics, using Pandas queries and in its own class.
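A sketch of how such a summary can be computed with Pandas; the column names and the function name are illustrative, not the system's actual implementation:

```python
import pandas as pd

def summarise(events: pd.DataFrame) -> dict:
    """High-level overview of a dataset; assumes illustrative column names."""
    posts = events["parent_id"].isna().sum()      # top-level posts
    comments = events["parent_id"].notna().sum()  # replies of any depth
    user_counts = events["author"].value_counts()
    return {
        "total_posts": int(posts),
        "total_comments": int(comments),
        "unique_users": int(events["author"].nunique()),
        "comments_per_post": comments / max(posts, 1),
        # Lurker ratio: fraction of users with exactly one event
        "lurker_ratio": float((user_counts == 1).mean()),
        "time_range": (events["timestamp"].min(), events["timestamp"].max()),
        "sources": sorted(events["source"].unique()),
    }
```

Each field corresponds to one bullet above, and the whole dictionary can be returned directly as the endpoint's JSON payload.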
\subsubsection{StatGen Class}
The \texttt{StatGen} (Statistics Generator) class is a higher-level module that aggregates all of the different statistics into a single class, which is called by the API endpoints to generate statistics.
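The aggregator pattern can be sketched as follows; apart from \texttt{StatGen} itself, the class and method names are assumptions rather than the system's actual API:

```python
# Minimal sketch: each statistic lives in its own class behind a
# common compute() interface, and StatGen dispatches by name.
class SummaryStat:
    def compute(self, events):
        return {"total_events": len(events)}

class StanceStat:
    def compute(self, events):
        # Toy stand-in for the real hedge-marker logic
        return {"hedges": sum("maybe" in t for t in events)}

class StatGen:
    """Aggregates the individual statistic classes behind one interface."""
    def __init__(self):
        self.stats = {"summary": SummaryStat(), "stance": StanceStat()}

    def generate(self, name, events):
        return self.stats[name].compute(events)

gen = StatGen()
print(gen.generate("summary", ["maybe later", "hello"]))  # {'total_events': 2}
```

Keeping each statistic in its own class means new analyses can be registered without touching the API layer.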
@@ -1128,8 +1117,6 @@ All of these routes begin with an ownership check via \texttt{dataset\_manager.a
\texttt{POST /datasets/scrape} handles automated data fetching. The request body contains a list of source configurations, each specifying a connector and an optional search query, category, and limit. Careful validation is performed on the source configurations, as any failure within the Celery task would otherwise be silent. The dataset metadata is saved to the database, and the \texttt{fetch\_and\_process\_dataset} task is dispatched asynchronously via Celery. This task fetches each source's data using the appropriate connector, combines the results into a single DataFrame, then passes it through the same enrichment and storage process.
\texttt{GET /datasets/sources} is an unauthenticated endpoint that returns the connector registry metadata so the frontend can dynamically render the available sources and what they can do.
\texttt{GET /dataset/<id>/status} allows the frontend to poll the state of a dataset. It returns the current status string and message stored in the \texttt{datasets} table, which the Celery worker updates at each stage of the pipeline, from \texttt{"fetching"} through \texttt{"processing"} to \texttt{"complete"} or \texttt{"error"}.
\texttt{GET /dataset/<id>/all} returns the full raw event table for a dataset as a list of records, which powers the raw data viewer in the frontend.
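The status-polling route can be sketched as a minimal Flask handler. Here an in-memory dictionary stands in for the \texttt{datasets} table that the Celery worker updates, and the exact response shape is an assumption:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the PostgreSQL `datasets` table; in the real system the
# Celery worker updates these rows at each pipeline stage.
DATASETS = {1: {"status": "processing", "message": "Running NLP enrichment"}}

@app.route("/dataset/<int:dataset_id>/status")
def dataset_status(dataset_id):
    dataset = DATASETS.get(dataset_id)
    if dataset is None:
        return jsonify({"error": "dataset not found"}), 404
    return jsonify({"status": dataset["status"], "message": dataset["message"]})
```

The frontend polls this endpoint until the status reaches \texttt{"complete"} or \texttt{"error"}.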
@@ -1141,9 +1128,6 @@ For each type of analysis, there is a corresponding endpoint, the base configura
Each endpoint requires a JWT authorization header corresponding to the user that owns the dataset, and the dataset ID is validated against the user's datasets to ensure they have access to it. The endpoint then fetches the entire dataset and passes it through the global \texttt{StatGen} instance to generate statistics. The resulting statistics are returned as JSON to the frontend for visualisation.
\subsubsection{Access Control}
Endpoints are protected with Flask's \texttt{@jwt\_required()} decorator. This ensures that only authenticated users can access the protected endpoints. For dataset-specific endpoints, an additional ownership check is performed using \texttt{dataset\_manager.authorize\_user\_dataset()} to ensure that users can only access their own datasets. If a user attempts to access a dataset they do not own, a \texttt{403 Forbidden} response is returned.
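The ownership check can be sketched as follows; the exception and method names follow the report, but their signatures and the backing store are assumptions:

```python
# Sketch of the ownership check pattern; in the real system the
# ownership mapping lives in the PostgreSQL database.
class NotAuthorisedException(Exception):
    pass

class DatasetManager:
    def __init__(self, ownership: dict):
        self._ownership = ownership  # maps dataset_id -> owner's user_id

    def authorize_user_dataset(self, user_id, dataset_id):
        """Raise if the user does not own the dataset; routes map this to 403."""
        if self._ownership.get(dataset_id) != user_id:
            raise NotAuthorisedException(
                f"user {user_id} does not own dataset {dataset_id}"
            )

dataset_manager = DatasetManager({42: "alice"})
dataset_manager.authorize_user_dataset("alice", 42)  # owner: passes silently
```

Raising an exception rather than returning a flag means a forgotten check fails loudly instead of silently leaking data.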
\subsubsection{Error Handling}
Each route handler wraps its logic in a \texttt{try/except} block that catches three categories of exception. \texttt{NotAuthorisedException} maps to a \texttt{403} response. \texttt{NonExistentDatasetException} maps to \texttt{404}. \texttt{ValueError}, which is raised by input validation in the manager layers, maps to \texttt{400}.
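The exception-to-status-code mapping can be sketched as a small helper; in the real system this logic sits inline in each route handler, so the helper shown here is purely illustrative:

```python
# Illustrative helper: run a route body and map the three known
# exception categories to the HTTP status codes described above.
class NotAuthorisedException(Exception): ...
class NonExistentDatasetException(Exception): ...

def handle(route_logic):
    try:
        return route_logic(), 200
    except NotAuthorisedException:
        return {"error": "forbidden"}, 403
    except NonExistentDatasetException:
        return {"error": "not found"}, 404
    except ValueError as exc:  # raised by input validation in the manager layers
        return {"error": str(exc)}, 400

def bad_input():
    raise ValueError("limit must be positive")

print(handle(bad_input))  # ({'error': 'limit must be positive'}, 400)
```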
@@ -1370,16 +1354,89 @@ In addition, ensuring a well-curated topic list that is specific to the dataset
\subsection{Performance Benchmarks}
The performance of the system was benchmarked in terms of the time taken for each stage of the data pipeline, including both fetching and NLP processing. The benchmarks were measured across several configurations: different dataset sizes, different numbers of sources, and pre-gathered versus auto-fetched data.
It must be noted that these benchmarks are based on posts, and each post brings with it a number of comments, so the total number of events (posts + comments) is higher than the number of posts; performance is affected by the total number of events rather than just the number of posts. The benchmarks are nonetheless reported per post, as this is a more intuitive metric for users to understand when creating datasets.
Celery tasks return the time taken for a function to complete, so both the fetching and NLP processing times are recorded in the logs. Testing was carried out on a machine with an AMD Ryzen 7 5800X3D, an NVIDIA RTX 3070 Ti, and 16GB of RAM, running Arch Linux.
\subsubsection{NLP Performance}
This section outlines the performance of the NLP processing, which enriches the dataset with the NLP models after the data has been fetched. Performance is measured as the time taken to process a given number of posts through the NLP pipeline, which includes both emotion classification and topic classification. The benchmarks are as follows:
\begin{itemize}
\item \textbf{10 posts}: 0.40s
\item \textbf{100 posts}: 6.73s
\item \textbf{1000 posts}: 67.12s
\end{itemize}
Overall this tends to follow a linear trend, with the time taken increasing linearly with the number of posts. As noted above, the number of events the pipeline is processing is likely 10-20x the number of posts, due to comments, so the actual number of events being processed is likely around 1000 for the 100 post benchmark, and around 10,000 for the 1000 post benchmark.
The 1000-post benchmark for \texttt{boards.ie} took 312.83s for NLP processing, much higher than the other sources. This is likely because \texttt{boards.ie} is a forum site with long-running conversations that can last years, so the number of comments per thread is significantly higher than on other sources. There is an average of around 900 comments per post in the \texttt{boards.ie} dataset, compared to around 30 comments per post in the Reddit and YouTube datasets, which explains the significant increase in NLP processing time.
\subsubsection{Auto-fetching Performance}
This section outlines the performance of the auto-fetching feature, which is the process of fetching data from the sources using the connectors. The performance of this feature is measured in terms of the time taken to fetch a certain number of posts from each source. The benchmarks are shown in Table \ref{tab:performance_benchmarks}.
\begin{table}[!h]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
\textbf{Size} & \textbf{Reddit} & \textbf{Boards.ie} & \textbf{YouTube} \\
\hline
10 posts & 3.25s & 103.28s & 2.08s \\
100 posts & 37.46s & 1182.71s & 12.52s \\
1000 posts & 482.87s & DNF & 74.80s \\
\hline
\end{tabular}
\caption{Performance Benchmarks for Auto-fetching}
\label{tab:performance_benchmarks}
\end{table}
\subsection{Limitations}
Several limitations of the system became apparent through development, evaluation and user testing.
\subsubsection{NLP Accuracy}
As shown in the accuracy evaluation above, both emotion and topic classification only achieve accuracy in the 60--70\% range. This is acceptable for detecting patterns across the entire dataset, but at the level of individual posts it is quite inaccurate. The removal of the "neutral" class, while initially motivated by the desire to surface more meaningful emotional signals, introduced misclassification for genuinely neutral posts such as simple arithmetic or factual statements.
\subsubsection{Temporal Coverage}
The system is designed to fetch only the most recent posts and comments from social media platforms, which means the dataset is limited to a relatively short time window, usually a few weeks at most. This limits the ability to perform true long-term temporal analysis or to study the evolution of a community over time.
\subsubsection{Platform Coverage}
The system currently supports three data sources: Reddit, YouTube, and Boards.ie. It excludes major platforms such as Twitter/X, Facebook, and TikTok, all of which would be valuable for ethnography; however, their APIs are either extremely restrictive or non-existent for academic research. Researchers who need to study communities on those platforms would have to upload their own datasets manually.
\subsubsection{Boards.ie Scraping Fragility}
The Boards.ie connector relies on web scraping, which is fragile and prone to breaking. Its fetching benchmarks are also significantly worse than the other sources: fetching 100 posts from Boards.ie took around 20 minutes.
\subsubsection{English-Only Support}
Two of the three NLP models used in the system are trained exclusively on English-language data. This means the system cannot accurately analyse datasets in other languages, limiting its usefulness for researchers working with non-English communities. This was noted as a specific concern by participants in the user feedback session, who work with both English and Turkish datasets.
\subsubsection{Scalability}
While asynchronous processing via Celery and Redis mitigates blocking during NLP enrichment and data fetching, the system is not designed to scale horizontally. A single Celery worker handles all tasks sequentially, and the PostgreSQL database is not configured for high-availability or replication. For research use at small to medium scale this is fine, but the system would require significant infrastructure changes to support concurrent large-scale usage across many users.
\newpage
\section{Conclusions}
\subsection{Reflection}
I have learned a lot through the process of building this system, both in terms of technical skills and personal growth. This project represented the most technically complex system I had built independently to date. Being able
The analytical scope is the project's most visible limitation. Six analytical angles across many data sources may sound comprehensive, but with a single developer and a fixed timeline, the actual ethnographic depth achievable was modest. The trade-off between depth of ethnographic analysis and typical SaaS-style infrastructure and features was a tension throughout the project. Eventually a balance between the two was achieved, but some depth of analysis was sacrificed for the sake of building a more complete and polished system.
Planning the project was a challenge, as I generally tend to work iteratively. I jump in and start building straight away, and I find that the process of building helps me figure out what I actually want to build. This led to some awkward parts in the report, as design and implementation often overlapped and proceeded in a non-linear fashion. Writing the design section was difficult when implementation had already started, and the design continued to change throughout the implementation process.
On a personal level, the project was a significant learning experience in terms of time management and project planning. The plan for the project was ambitious, and it was easy to get carried away: I found myself spending a lot of time on features that were not essential to the core functionality of the system. Implementation felt productive and visible in a way that writing the report did not, so I spent more time on the implementation, and the report was pushed to the sidelines until the end of the project.
\subsection{How the project was conducted}
\begin{figure}[!h]
\centering
\includegraphics[width=1\textwidth]{img/gnatt.png}
\caption{Gantt Chart of the Project Timeline}
\label{fig:gnatt_chart}
\end{figure}
The project was maintained and developed using Git for version control, with the repository hosted on both GitHub and a self-hosted Gitea instance. The project eventually adopted Conventional Commits to maintain a clean commit history, and commit messages contained the rationale for non-obvious decisions.
Starting in November, the project went through a few iterations of basic functionality such as data retrieval and storage. Research was done on digital ethnography, the traditional metrics used, and how they are implemented in code. The design of the system was also iterated on, evolving from a very simple frontend showing simple aggregates into a more complex and feature-rich dashboard with multiple analytical perspectives and NLP enrichments.
The majority of real development and implementation took place between January and April, with the final month of April being focused on testing, bug fixing, writing the report and preparation for the open day. The project was developed in an agile and iterative way, with new features being added and improved upon throughout the development process, rather than having a fixed plan for the entire project from the beginning.
Git served as a changelog of decisions and rationale, which aided writing the report. If this project were to be done again, however, I would maintain the report alongside the implementation from the beginning, as it would have made writing the report much easier and less stressful at the end.
\newpage
\bibliography{references}


@@ -68,3 +68,39 @@
year = {2024},
doi = {10.1016/j.nlp.2024.100059}
}
@article{coleman2010ethnographic,
ISSN = {00846570},
URL = {http://www.jstor.org/stable/25735124},
abstract = {This review surveys and divides the ethnographic corpus on digital media into three broad but overlapping categories: the cultural politics of digital media, the vernacular cultures of digital media, and the prosaics of digital media. Engaging these three categories of scholarship on digital media, I consider how ethnographers are exploring the complex relationships between the local practices and global implications of digital media, their materiality and politics, and their banal, as well as profound, presence in cultural life and modes of communication. I consider the way these media have become central to the articulation of cherished beliefs, ritual practices, and modes of being in the world; the fact that digital media culturally matters is undeniable but showing how, where, and why it matters is necessary to push against peculiarly narrow presumptions about the universality of digital experience.},
author = {E. Gabriella Coleman},
journal = {Annual Review of Anthropology},
pages = {487--505},
publisher = {Annual Reviews},
title = {Ethnographic Approaches to Digital Media},
urldate = {2026-04-15},
volume = {39},
year = {2010}
}
@article{shen2021stance,
author = {Shen, Qian and Tao, Yating},
title = {Stance Markers in {English} Medical Research Articles and Newspaper Opinion Columns: A Comparative Corpus-Based Study},
journal = {PLOS ONE},
volume = {16},
number = {3},
pages = {e0247981},
year = {2021},
doi = {10.1371/journal.pone.0247981}
}
@incollection{medvedev2019anatomy,
author = {Medvedev, Alexey N. and Lambiotte, Renaud and Delvenne, Jean-Charles},
title = {The Anatomy of Reddit: An Overview of Academic Research},
booktitle = {Dynamics On and Of Complex Networks III},
series = {Springer Proceedings in Complexity},
publisher = {Springer},
year = {2019},
pages = {183--204}
}


@@ -234,6 +234,7 @@ class RedditAPI(BaseConnector):
if response.status_code == 429:
    try:
        wait_time = int(response.headers.get("X-Ratelimit-Reset", backoff))
        wait_time += 1  # Add a small buffer to ensure the rate limit has reset
    except ValueError:
        wait_time = backoff


@@ -1,5 +1,6 @@
import os
import datetime
import logging
from dotenv import load_dotenv
from googleapiclient.discovery import build
@@ -9,9 +10,11 @@ from dto.comment import Comment
from server.connectors.base import BaseConnector

load_dotenv()
API_KEY = os.getenv("YOUTUBE_API_KEY")

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

class YouTubeAPI(BaseConnector):
    source_name: str = "youtube"
@@ -77,11 +80,30 @@ class YouTubeAPI(BaseConnector):
        return True

    def _search_videos(self, query, limit):
        # Paginate through results 50 at a time (the API's per-page maximum)
        # instead of issuing a single capped request.
        results = []
        next_page_token = None
        while len(results) < limit:
            batch_size = min(50, limit - len(results))
            request = self.youtube.search().list(
                q=query,
                part="snippet",
                type="video",
                maxResults=batch_size,
                pageToken=next_page_token
            )
            response = request.execute()
            results.extend(response.get("items", []))
            logger.info(f"Fetched {len(results)} out of {limit} videos for query '{query}'")
            next_page_token = response.get("nextPageToken")
            if not next_page_token:
                logger.warning(f"No more pages of results available for query '{query}'")
                break
        return results[:limit]
    def _get_video_comments(self, video_id):
        request = self.youtube.commentThreads().list(