\documentclass{article} \usepackage{graphicx} \usepackage{setspace} \begin{document}
\begin{titlepage} \centering \vspace*{3cm} {\Huge \textbf{Web-Based Tool for Observing and Analysing Online Communities} \par} \vspace{2cm} {\Large Dylan De Faoite \par} \vspace{0.5cm} {\large April 2026 \par} \vfill {\large Bachelor of Science in Computer Science \\ University College Cork \\ Supervisor: Paolo Palmeiri \par} \vspace{2cm} \end{titlepage}
\section{Introduction} This project presents the design and implementation of a web-based analytics engine for the exploration and analysis of online discussion data. Built using \textbf{Flask and Pandas}, and supplemented with \textbf{Natural Language Processing} (NLP) techniques, the system provides an API for extracting structural, temporal, linguistic, and emotional insights from social media posts. A React-based frontend delivers interactive visualisations and user controls, while the backend implements an analytical pipeline for the data, including data parsing, manipulation and analysis. \vspace{0.5cm} Beyond its technical objectives, the system is conceptually informed by approaches from \textbf{digital ethnography} and computational social science. Traditional ethnography is the practice of studying an individual or group culture from the point of view of the subject of the study. Digital ethnography seeks to understand how social relations, topics and norms are constructed in online spaces.
\subsection{Motivation} There are many beneficiaries of a digital ethnography analytic system: social scientists gain a deeper understanding of contemporary culture and online communities; businesses and marketers can better understand consumer behaviour and online engagement; educators and designers can improve digital learning environments and user experiences; and policymakers can make informed decisions regarding digital platforms, online safety, and community regulation.
\subsection{Goals \& Objectives} \begin{itemize} \item \textbf{Collect data ethically}: Enable users to link or upload text and interaction data (messages, etc.) from specified online communities. Potentially, an automated method for importing data (using APIs or scraping techniques) could be included as well. \item \textbf{Organise content}: Store gathered material in a structured database with tagging for themes, dates, and sources. \item \textbf{Analyse patterns}: Use natural language processing (NLP) to detect frequent keywords, sentiment, and interaction networks. \item \textbf{Visualise insights}: Present findings as charts, timelines, and network diagrams to reveal how conversations and topics evolve. \end{itemize}
\subsection{The Cork Dataset} A defining feature of this project is its focus on a geographically grounded dataset centred on \textbf{Cork, Ireland}. The system analyses publicly available discussions relating to Cork drawn from multiple online platforms: \begin{itemize} \item The \textbf{r/Cork} subreddit \item The \textbf{r/Ireland} subreddit using a Cork-specific search filter \item \textbf{YouTube} videos retrieved using Cork-related search queries \item The \textbf{Boards.ie Cork section} \end{itemize}
\newpage \section{Background} This section describes what digital ethnography is, how it stems from traditional ethnography and why it is useful.
\subsection{Digital Ethnography} Digital ethnography is the study of cultures and interactions in various online spaces, such as forums, posts and video comments. The goal is not only to describe high-level statistics such as the number of posts and posts per day, but also to analyse people's behaviour at an interactional and cultural level, delving into common phrases, interaction patterns, and common topics and entities.
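As a toy illustration of the high-level statistics mentioned above (posts per day and frequent words), the following Python sketch computes both from a handful of hand-written events. The \texttt{date}/\texttt{text} field names and the stopword list are illustrative assumptions, not the system's actual schema:

```python
from collections import Counter
from datetime import date

# Toy events; the "date"/"text" field names are illustrative assumptions.
events = [
    {"date": date(2025, 3, 1), "text": "traffic in cork city is terrible again"},
    {"date": date(2025, 3, 1), "text": "cork city centre traffic was grand today"},
    {"date": date(2025, 3, 2), "text": "anyone know a good cafe in cork city"},
]

# Posts per day: the simplest temporal statistic.
posts_per_day = Counter(e["date"] for e in events)

# Word frequencies, excluding a small illustrative stopword list.
stopwords = {"in", "is", "a", "was", "the", "again", "anyone", "know", "today"}
word_counts = Counter(
    w for e in events for w in e["text"].split() if w not in stopwords
)

print(posts_per_day[date(2025, 3, 1)])  # 2 posts on 1 March
print(word_counts.most_common(2))       # [('cork', 3), ('city', 3)]
```

The deeper interactional analysis described above (phrases, interaction patterns, entities) builds on exactly this kind of counting, but over richer units than single words.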
There are multiple methods of carrying out digital ethnography, such as online participant observation through automated or manual means, digital interviews via text or video, or tracing digital footprints. Compared to traditional ethnography, digital ethnography is usually faster and more cost-effective due to the availability of large swathes of data across social media sites such as Reddit, YouTube, and Facebook, and the lack of any need to travel. Traditional ethnography, by contrast, often relied on in-person interviews and in-person observation of communities.
\subsection{Traditional Ethnography} Ethnography originated in the late nineteenth and early twentieth centuries as a method for understanding cultures through long-term, immersive fieldwork. The goal was not merely to describe behaviour, but to interpret how people made sense of the world they were in. Over time, ethnography expanded beyond anthropology into sociology, media studies, education, and human–computer interaction, becoming a broadly used qualitative research approach. Traditional ethnography was closely tied to physical locations: villages, workplaces or towns. However, as communication technologies developed and social life increasingly took place through technological mediums, the assumption of always being tied to a physical place dissipated. Researchers began to question whether social interactions could still be studied properly if they were no longer anchored to physical places.
\subsection{Transition to Digital Spaces} The rise of the internet in the late twentieth century massively changed social interaction. Online forums, email, SMS and social media platforms became central to human communication. All types of groups and identities were constructed. As a result, ethnographic methods were adapted to study these emerging digital environments.
Early work in this area was referred to as ``virtual ethnography'' or ``digital ethnography'', as online spaces began to mix and intertwine with traditional cultural spaces. Digital ethnography presents new challenges in comparison to traditional ethnography. The field is distributed across platforms, devices and online--offline interactions. For example, a digital ethnographer studying influencer culture might examine Instagram posts, comment sections, private messages, algorithms, and also conduct interviews or observe offline events. This transition requires flexibility, since researchers can no longer rely solely on face-to-face interactions.
\subsection{Online Communities} There are many different types of online communities, often structured in various ways, with many different types of users, norms and power dynamics. These communities can range from large-scale social networking platforms and discussion forums to niche interest groups. Each type of community fosters different forms of interaction, participation, and identity construction. Participation within these communities is usually not evenly distributed. The majority of users are passive consumers (lurkers), a smaller percentage contribute occasionally, and a very small core group produces most of the content. This uneven contribution structure has significant implications for digital ethnography, as visible discourse may disproportionately reflect the perspectives of highly active members rather than the broader community. This is particularly evident in reputation-based systems such as Reddit, which allow the opinions of a few to rise above the rest.
\subsection{Digital Ethnography Metrics} This section describes common keywords and metrics used to measure and quantify online communities using digital ethnography.
\subsubsection{Sentiment Analysis} Sentiment analysis involves capturing the emotions associated with a specific post, topic or entity.
This type of analysis can be as simple as classifying a post as ``positive'' or ``negative'', or classifying a post into a set of pre-existing emotions such as anger, joy or sadness.
\subsubsection{Active vs Passive Participation} Not everyone in an online community participates in the same way. Some users post regularly, leave comments, and interact with others, while many more simply read content without ever contributing anything themselves; others might only contribute occasionally. This distinction between active and passive participation (passive users are often referred to as ``lurkers'') is an important one in digital ethnography, because looking only at posts and comments can give a misleading picture of how large or engaged a community actually is.
\subsubsection{Temporal Activity Patterns} Looking at when a community is active can reveal quite a lot about its nature and membership. A subreddit that peaks at 2am UTC might have a mostly American userbase, while one that is consistently active across all hours could suggest a more globally distributed community. Beyond timezones, temporal patterns can also capture how a community responds to external events: a sudden spike in posting activity often corresponds to something newsworthy happening that is relevant to the community.
\subsubsection{Cultural Markers} Cultural markers are the words, phrases, memes, and behaviours that are specific to a particular community and signal that someone is a member of it. These might include in-jokes, niche slang, recurring references, or even particular ways of formatting posts. In the context of digital ethnography, identifying these markers is useful because they reveal how communities build a shared identity and distinguish themselves from outsiders.
\subsection{Natural Language Processing} \textbf{Natural Language Processing} is a branch of artificial intelligence that allows machines to interpret, analyse and generate human language.
The aim of NLP models is not only to understand single words individually, but to understand the context of those words in a broader paragraph or story. NLP can carry out many different types of tasks, such as classifying sentences or paragraphs, generating text content, extracting answers from text or even speech recognition in audio. However, even with the advances in NLP models, many challenges and limitations remain. These include understanding ambiguity, cultural context, sarcasm, and humour.
\subsubsection{Sentiment Analysis} \textbf{Sentiment Analysis} involves determining the emotional tone behind a piece of text. It is commonly used to classify text as positive, negative, or neutral. This technique is widely applied in areas such as customer feedback analysis, social media monitoring, and market research. More advanced sentiment analysis models can detect nuanced emotions, such as frustration, satisfaction, or sarcasm, although accurately identifying these emotions remains a challenge.
\subsubsection{Named Entity Recognition} \textbf{Named Entity Recognition (NER)} is the process of identifying and classifying key entities within a text into predefined categories such as names of people, organisations, locations, or dates. NER is essential for structuring unstructured text data and is often used in information extraction, search engines, and question-answering systems. Despite its usefulness, NER can struggle with ambiguous entities or context-dependent meanings.
\subsubsection{Topic Modelling} \textbf{Topic Modelling} is the process of automatically discovering the abstract themes present in a collection of documents by grouping texts that share related vocabulary. In this system, it underpins the classification of each event into a topic, allowing discussions to be compared at a thematic level. Like other NLP techniques, it can produce incoherent or overlapping topics, particularly on short, informal texts.
\subsection{Cork Dataset} The Cork dataset serves as the foundation for this project, providing a geographically and culturally grounded corpus for analysis. Rather than examining a globally distributed or topic-neutral community, the dataset centres on a single city, Cork, Ireland, which allows the system's analytical outputs to be interpreted against a known social and cultural context.
The dataset is drawn from four distinct online platforms, each of which represents a structurally different mode of online community participation: \begin{itemize} \item \textbf{r/Cork} — a dedicated subreddit for Cork residents and diaspora, characterised by threaded discussion, upvote-based content curation, and an established community identity with its own norms and recurring topics. \item \textbf{r/Ireland} (Cork-filtered) — the broader Irish national subreddit, filtered by Cork-related keywords, capturing how Cork is discussed within a wider national discourse rather than within its own community space. \item \textbf{YouTube} — video comments retrieved via Cork-related search queries, representing a flatter, less threaded interaction model and a potentially more casual or emotionally expressive register than forum-style platforms. \item \textbf{Boards.ie Cork section} — an older Irish forum platform with a distinct demographic profile and lower volume compared to Reddit, providing a counterpoint to the Reddit-dominant data and representing a longer-established form of Irish online community. \end{itemize} Reddit's hierarchical comment threading enables deep conversational analysis and reply-chain metrics, whereas YouTube comments are largely flat and unthreaded. Boards.ie occupies a middle ground, with linear threads but a more intimate community character. Taken together, the four sources offer variation in interaction structure, community age, demographic composition, and linguistic register, all of which are factors that the system's analytical modules are designed to detect and compare. Collecting data across multiple platforms also introduces the challenge of normalisation. Posts, comments, and metadata fields differ in schema and semantics across sources. 
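The schema mismatch can be made concrete with a small Python sketch. The per-platform field names below (e.g. \texttt{body}, \texttt{textOriginal}) are illustrative assumptions rather than the exact schemas the system consumes; the point is the mapping of structurally different records into one shared event shape:

```python
# Hedged sketch: map structurally different platform records into one
# unified "event" shape. All field names here are illustrative
# assumptions, not the system's actual schemas.
def normalise_reddit(rec: dict) -> dict:
    return {
        "source": "reddit",
        "author": rec["author"],
        "text": rec["body"],
        "parent_id": rec.get("parent_id"),  # threaded replies preserved
    }

def normalise_youtube(rec: dict) -> dict:
    return {
        "source": "youtube",
        "author": rec["authorDisplayName"],
        "text": rec["textOriginal"],
        "parent_id": None,  # comments treated as flat
    }

reddit_rec = {"author": "corkfan", "body": "Up the Rebels!", "parent_id": "t1_abc"}
youtube_rec = {"authorDisplayName": "visitor99", "textOriginal": "Lovely city"}

# Both sources now share one shape, so one pipeline can serve them all.
events = [normalise_reddit(reddit_rec), normalise_youtube(youtube_rec)]
print([e["source"] for e in events])  # ['reddit', 'youtube']
```

Keeping the threading information (\texttt{parent\_id}) where a platform provides it, and explicitly nulling it where it does not, is what lets reply-chain metrics degrade gracefully across sources.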
A core design requirement of the system is the normalisation of these inputs into a unified event-based internal representation, allowing the same analytical pipeline to operate uniformly regardless of the source.
\newpage \section{Analysis} This section sets out the objectives of the project and the functional and non-functional requirements derived from them.
\subsection{Goals \& Objectives} The objective of this project is to provide a tool that can assist social scientists, digital ethnographers, and researchers in observing and interpreting online communities and the interactions within them. Rather than replacing the study of digital ethnography or the related fields, this tool aims to aid researchers in analysing communities. Specifically, the system aims to: \begin{itemize} \item \textbf{Enable scalable observation}: Provide researchers with the ability to process and explore large volumes of online discussion data that would be impractical to analyse manually. \item \textbf{Support mixed-method research}: Bridge quantitative computational analysis and qualitative ethnographic interpretation by presenting statistically derived insights that can inform deeper contextual study. \item \textbf{Reveal structural dynamics}: Expose interaction patterns such as reply networks, participation inequality, conversation depth, and influential contributors within a community. \item \textbf{Identify thematic structures}: Detect dominant topics, recurring phrases, and emerging themes using Natural Language Processing techniques. \item \textbf{Track emotional and cultural signals}: Analyse sentiment and emotion distributions across posts, users, and topics to better understand the affective tone of discussions and how it evolves over time. \item \textbf{Examine temporal evolution}: Provide time-series analysis of activity levels, topic trends, and emotional shifts, enabling longitudinal observation of community development.
\item \textbf{Promote ethical data practices}: Restrict analysis to publicly available data, provide opt-out mechanisms for computationally intensive processing, and ensure responsible handling of user-generated content. \end{itemize} Ultimately, the project seeks to demonstrate how computational systems can augment the toolkits of social scientists and digital ethnographers.
\subsection{Requirements} The following requirements are derived from the backend architecture, NLP processing pipeline, and the React-based frontend interface.
\subsubsection{Functional Requirements} \paragraph{Data Ingestion and Preparation} \begin{itemize} \item The system shall accept social media data in \texttt{.jsonl} format containing posts and nested comments. \item The system shall validate uploaded files and return structured error responses for invalid formats or malformed data. \item The system shall normalise posts and comments into a unified event-based dataset. \item The system shall give the user the option to automatically fetch datasets from social media sites filtered for specific keywords or categories. \item The system shall provide a loading screen with a progress bar after the dataset is uploaded. \end{itemize} \paragraph{Dataset Management} \begin{itemize} \item The system shall utilise Natural Language Processing models to generate average emotions per event. \item The system shall utilise Natural Language Processing models to classify each event into a topic. \item The system shall utilise Natural Language Processing models to identify entities in each event. \item The system shall allow users to view the raw dataset. \item The system shall provide detailed endpoints that return calculated statistics grouped into themes. \end{itemize} \paragraph{Filtering and Search} \begin{itemize} \item The system shall support keyword-based filtering across content, author, and optionally title fields. \item The system shall support filtering by start and end date ranges.
\item The system shall support filtering by one or more data sources. \item The system shall allow multiple filters to be applied simultaneously. \item The system shall return a filtered dataset reflecting all active filters. \end{itemize} \paragraph{Temporal Analysis} \begin{itemize} \item The system shall compute event frequency per day. \item The system shall generate weekday--hour heatmap data representing activity distribution. \end{itemize} \paragraph{Linguistic Analysis} \begin{itemize} \item The system shall compute word frequency statistics excluding standard and domain-specific stopwords. \item The system shall extract common bi-grams and tri-grams from textual content. \item The system shall compute lexical diversity metrics for the dataset. \end{itemize} \paragraph{Emotional Analysis} \begin{itemize} \item The system shall compute average emotional distribution per topic. \item The system shall compute overall average emotional distribution across the dataset. \item The system shall determine dominant emotion distributions. \item The system shall compute emotional distribution grouped by data source. \end{itemize} \paragraph{User Analysis} \begin{itemize} \item The system shall identify top users based on activity. \item The system shall compute per-user activity and behavioural metrics. \end{itemize} \paragraph{Interaction Analysis} \begin{itemize} \item The system shall compute average conversation thread depth. \item The system shall identify top interaction pairs between users. \item The system shall generate an interaction graph based on user relationships. \item The system shall compute conversation concentration metrics. \end{itemize} \paragraph{Cultural Analysis} \begin{itemize} \item The system shall identify identity-related linguistic markers. \item The system shall detect stance-related linguistic markers. \item The system shall compute average emotional expression per detected entity. 
\end{itemize} \paragraph{Frontend} \begin{itemize} \item The system shall provide a frontend UI to accommodate all of the above functions. \item The system shall provide a tab for each endpoint in the frontend. \end{itemize}
\subsubsection{Non-Functional Requirements} \paragraph{Performance} \begin{itemize} \item The system shall utilise GPU acceleration where available for NLP. \item The system shall utilise existing React libraries for visualisations. \end{itemize} \paragraph{Scalability} \begin{itemize} \item The system shall utilise cookies and session tracking for multi-user support. \item NLP models shall be cached to prevent redundant loading. \end{itemize} \paragraph{Reliability and Robustness} \begin{itemize} \item The system shall implement structured exception handling. \item The system shall return meaningful JSON error responses for invalid requests. \item The dataset reset functionality shall preserve data integrity. \end{itemize}
\subsection{Limits of Computational Analysis} While computational methods enable large-scale observation and analysis of online communities, there are many limitations that must be acknowledged. Many of these limitations stem from NLP techniques and the practical boundaries of computational resources. NLP models will be central to many aspects of the system, such as emotional and topic classification. While these models have shown strong results in many areas, they are imperfect and may produce inaccurate or misleading results. One key limitation is that the models will likely find it difficult to interpret context-dependent language. Online communities often use sarcasm, irony or culturally specific references, all of which are challenging for NLP models to correctly interpret. For example, a sarcastic comment might be incorrectly classified as positive, despite conveying negativity.
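This failure mode can be illustrated with a deliberately naive, lexicon-based scorer written in Python. It is a toy stand-in for the system's actual models, not a description of them, but it makes the mechanism clear: surface-level positive vocabulary dominates, so sarcasm is scored as positive.

```python
# A deliberately naive lexicon-based sentiment scorer -- a toy stand-in
# for the system's actual NLP models, used only to illustrate why
# sarcasm defeats surface-level analysis.
POSITIVE = {"great", "love", "brilliant", "grand"}
NEGATIVE = {"terrible", "hate", "awful", "broken"}

def naive_sentiment(text: str) -> str:
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Sarcastic complaint: the surface words are positive, the intent is not.
sarcastic = "Oh great, the bus is late again. I just love waiting in the rain"
print(naive_sentiment(sarcastic))                    # positive -- sarcasm missed
print(naive_sentiment("the service here is awful"))  # negative
```

Transformer-based models are far better at this than a bag-of-words lexicon, but they still rely on patterns seen in training data, so novel or culturally specific sarcasm remains an open weakness.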
This could be especially prominent in online Irish communities, which often include regional slang, abbreviations or informal grammar. Many NLP models are trained on standardised datasets such as research papers or novels, reducing their accuracy on informal data. In addition, the simplification of complex human interactions and emotions into discrete categories like ``happy'' or ``sad'' will inevitably overlook some nuance and ambiguity, even if the model is not inherently ``wrong''. As a result, the outputs of NLP models should be interpreted as indicative patterns rather than definitive representations of user meaning.
\subsubsection{Computational Constraints} The performance and speed of the system will be influenced by the computational resources available during development and execution. While the system will attempt to use GPU acceleration during NLP inference, these resources may not always be available, or may be limited where they do exist. As a result, there are practical limits on the size of datasets that can be processed efficiently, and large datasets may produce long processing times.
\newpage \section{Design} \subsection{System Architecture} \newpage \section{Implementation} \newpage \section{Evaluation} \newpage \section{Conclusions} \end{document}