From afae7f42a158cdb6dd6f27a9a6a92a2c6315e3be Mon Sep 17 00:00:00 2001 From: Dylan De Faoite Date: Sat, 11 Apr 2026 15:03:24 +0100 Subject: [PATCH] docs(report): add data pipeline diagram and update references for embedding models --- report/img/pipeline.png | Bin 0 -> 26117 bytes report/main.tex | 25 +++++++++++++++++++++++++ report/references.bib | 16 +++++++++++++++- 3 files changed, 40 insertions(+), 1 deletion(-) create mode 100644 report/img/pipeline.png diff --git a/report/img/pipeline.png b/report/img/pipeline.png new file mode 100644 index 0000000000000000000000000000000000000000..c98fe6a29be694b5e736943669ce17234194a17b GIT binary patch literal 26117 zcmeFZ1z40{*DoyND1(SFAR&k}C=${jDIpz7OLun*NJtDIp&%hCO1Hp}(xnI}4MRvP z9ny97AU<*4_na@z|2og}z302Gxkl!`_rCAF*IsMwwSMcj4N+2%#JNIx<-&yvIMPz$ zk1t$+pupebu*=}v5?PmY;leFJX9*2w8+TI+Ym*Cf9Adx!qGP{jZs+7o$01I~&Te3D z&thz0VCHCGHPbXy<5b;`sa0J@h3G3A%d^KvVRGL+tz! z;D@b&&3R8H12Yrm^X?YLzq_i+n(^8x^6*G2@bEmeFy%3GIUB`4J?+mkZQLat4eZV3 z?Tk&V!Gn$6e?Nkci|_Z4j6Hr|;O64^eZkD}&uh-#ioT$1@$}D3&c@At&&VX(HE5egGA9+ z)h&#j&40J$W#{=l3-om<6ALr*KW}E^KA*6S!Jo$GGjTFEFt&61-Tyx<%5O3L5vHS^ z9eCb9tl2-_{?o|DUChK9E$csWJbw=z`@{d|eg7s6M-$ugsr;`z&Rs%T)YZyN?}4lBX0tHAoy48x7V<9bUJTq;^=I0?z+x2?0-I)|5UK%|KJX|&Q;#+@2-Q5 z?flyLYxvH8qSqY^f}P|1_rK3lu%Fx7Ut+gFpU(EIzc1qdZ@0hI=zsK$wwV9#*zNiH z+8g|pcB|?Hq8avk;s(xWqgA!HwlgsPcQIIwe?No$<6-_1gFO$|oc>QYSZ=h!{lC}_ z{>}H<^X2_(?N!0f(Z;~q!U+TzcDDaAL*=~p&s*!?e*gPyt!#hg)_SI)ARYWa-BSOH zM#Su#zb)x6@#_ClUHN;;be0PJ$(a5x(x=~f7CJ{l_d+LYAXWWePM^+%`d`njpRey< z>((V~9W9K^ZO{n8SrG5~zdu{z;ySm2^K|Lo(5JI={e^aV=F9$7KKMSYuQF1eKMBacuaUByyOFs-ym+LY>|HQ@Q

^=145S)4H5c*yjVs!}#;@%f=V6MlFw&5*oHq=3bQ9#`jx<{jfg2Ek%UE zfQ%lXEq@3Vrxb+N{^I2FV^&?GD^`vkq%D_(T8fAm)?CJW9S9Q-6L~{>RoDJodeLZ% zg2+ep`*$<0W55xt7$GZrJ39TAg$Lq6zO-`2)uM!O40!e{k4+UNE{D$$P-ZCL|vO>dr zVYD&61Sl*Ut+}pVnNmmB`^r%9v_dGS_W)u26=HN_515p7&vRM=sy?O5{wECKA}@km zdcWogqF+P_xkFD`l04n5-ZvwkroF)wW>$NsRjx?m=jbzMVSN2zqG}LKMH0;M{O#Sy`jhH{iixKgN$H$G z1atVeIU8dVBU5!n`|=0nypb*^YD0rn7gcIu2$2U6r&iZS+Aa!nOMlP9m5LRAY?k7M z6wmI55jQQ~Nk$@HtLu&^@JS|n;Oxtfdy+AF@R4)}e5zP^*X*6~V?rI-y59x&trf@ zLGC(iV4Hd|ncocw=ARi;jnSEy!GdpJKP3M0d2;D zo;zh9cSiFZQp4MCd4Ay@cUHDSAq#Y#?ay(0?j+qwx^z+P0K7<;A?4t)PdA;u&ZPTH zxXp7M?&7U({>W@-ARiNcKP4Ak6I6xkkLf`CL|V8FJqGhN2j7Ze)@**hYcpJI(5*A` z<6Wfd7re6F3D2VKwaH0xGEVcTapykyHon8<(#VyvK82lb{n z@8R_eNlp>T-a8{Z^NDuVMm&>VrH8hw&ue#4cEX;EIa%85n;!ECW=Wo5VI5wzJEK;G zwLw9bJ$FXT3f$Hv<`=wwybZqZaQgi@?kECILn?6?DSTqCvB{BDQ+c>r({s!HJBHan z`v*lbzRWYr0Gja@FZ&0*u>?b4ZxZ5qx+OxS6$*i)_8#C;2#sCio5foi$cwY7SwRkF z%9AnZGH>C&^&WF*uc{m@&_ZdNC7VPF9b{8wZ*;DXmgf;5VJ-n~^D%9weVOv5J||nL z-PGf`s&}=_l3kX=c_z#Sbj>e1csF2R#;?gVLG%Yy)yrS2J;~X)e)NUkZPog4t-gLL zI^5)7qn)B6M>*@h8%c!S$L&GakxN{m0|)m6mi+hw+ASo8~a{TfMY`xGKFu$kL02989$9ij&r* zUD}!>?F1EqS&8#Dv4*)cl!XHueIvh}QwS~IP^S^4;bFhBqKoaVXNoj(MIU#bh?#FO zHsJ7#ecjuM+*&s}+*!f#`SB^5;qY5fSQ~+&z_-et%eY*dYA?a`CZs&uZkSc_gxlw9 zG;Ds+dSp*WH5;}P%dGXJ!gf5#j)yw&DEJz=rBy+BY>EE**6CC*g*K_rM=PGsKIcv*%M=6F=>8#_93 z1-Vh;vtg6H4AFx5_ff3xiLln~O%GPPSS>*OYg!eBk7uvVkaaZ&*D^+7(&>535fVqb zlLIS2?Fb7U{Gi#rKr^^L`RuT(%H})uGzIHU_+6|fpORK0CzjhIKlxo2XPGAY!=5Gq zeP1nL)F>ZWLsb;Xo^;Sek`&=M^J`M{hYPq<{sP{n?Ah^wCZ9{^XPLy`A}yiDQ@?SK z*zri~UIjDPBqNN0p0S)H%75jqYkEYSD5auxoTJPJ=bj~n*LoxJ+e0!wgBN;Sc0U_V zJx=x)c%ZCxBN{I}wlIlipI^J*q3Ksty~v0%iWE!|<{31D+PQ|zhSgv};*UH?qJ!^v z472g;ytVOa@W_*{Q%(FO@FE1(E*Mxn<@RjEB(Cb|lw=)nrPUQNU$1#p^D*!zIrmDt z0{=;(_O$nwLkm18m1nz4AYw&$@(#+J$`*AuF1e7Qb{Rp`02wf*dh$3#1s7j%GIKs>& z(f=-rrx}-or{B$skk59~jcacXuPuF2)N(g06CXq;h_jD)h{#COZJF2VI!&}+V}=Xm z_cfo}gMGoZl~A1o7kl%Gi0*Dh<7G0iJv3V?tKm2rU^LX!TSdM8@V7V(anFw_rLfLa(zz*r=5V2#MeA%&Tw-g~us$|KweP%2C>vgsUDd<`r>I@U7I(+kuFq 
zl%t!snWg#%^-S_;*+xd3uA>h$%cL{b^}8P?=~@~6$;siG8+E1ALO1d=V_s#w&?kzF zDU!90c~)h^J9d4U;ex-uiiy|UGi-sO41&q6%Lmjry1R(BHAd8RJxq2n-S%iWV_Y^r zF`znL(yRc}bzUIO?PqxMv%y-!HgSd8n-P zj51H|OerGNa;OYeZAieN;OrxQ*&sD5*oJ?qC`kC*y$qPG_ZI@`j@%y!3|BKh0&B5L>N6LIqRe6B9UC@VN@z^qOl&(6RBkRrO=HGpXlppK+kYqa0+%{Asgj(2*kO zIXYZolw3|qwc?~>5J+eZ95RPjO-QqMfL+sc7^^MV*p2!wos#%uIDC^Jc4>bYdCO+) z<8BIb{m~}N`aOEc-cnIL61W4@!BWv$_uNd{^K0xUpXpG~8lFyQdw=Mo`LT~3i#XU> zL6Hr*ju{`-{sL|R1S5F&{9avBZrE1Z*`T3(XLEGG6vTB++nOc6|+taw!#WTqp zhwW8X!aid=&Cu}V#cq`DN%sD5d+&|NB`v6>BqdA&E}l|u<#6BWaca!?8orzf zO&=B*(iHsmZKRcEAPCF9*blic6;x;=`?3^yZb`LdJxwr)b-)#z785jRg1Xg$;0p*AcKIK5|8QxQ!l zNqd43+nN=GYwQA*SP+St>x5`iq5`k41gR{q9kD^Zl6SH3mOZTnjQqayJ#Bpi*?s5i z@n?rC;&;*BSHu@;Om`Kp@B@(_KJ0b~EJlHnvqb*2yD!{|y(#wisJK_=ZdEX`!_N+3 zwNFAmCHUXEQ4cqJ77A#vnTkaeIOk03xib0>)_rF|*lkmgeL0k zdpQXaD1-qoclcT@7bE2>D8A|)oi&1iWJ|du*n0;*c-F}y9njBdmnvIat12HR{=ot?D`Pa!Hfi}lME%1PC6!x1^Dy91 z205AfiNq)_W|@Q#5pkH*g7|xyUTqwmPrYM@mNj6?)ga}~%BvrE0_#^3rvUrImG(335~Sx?GZvfA^p3ZKQ`br<9t z#@{w552+@>{KkrpdZ475vW2hor50)|5+4_0C{CFHeZ@;oTOxmlJjK%WHF!ZDl%%GQ z>rf)A(^v$dpqEXbLxWd-nZe<{lwo2zx2;O5J08Msc(A8DQ%suIDzsEB!2(C+?yJv| z&|HSJNu%wnub;yZ4cZVo#7b4jQfY=o&jL&Ga$DobR*9jTQa@ZD_hw zLx@e!)|aj80<5D~pBN9YhCY{e!S_>wW#gS^VhA`!3o~rdERc*+091k0Nx5 zP{@5DvTpWih~2%7kt)u&O`8RGB~Ui4e}e}#UW-#<%P0>`SJZdCt@ff-2BIH3Fc@DN3Z+g0`6wRd~*Dl68S)3IgX+=t8uk)Aabp8}==5w7TaTU&ZImyI0@kt$*PScS z#qtRG^5F>4b_V3NZPITcmvV#V@V};@oAn3zvZ$9&kpydlMz4n>ntvdS$d|a%XF6nD zerHPAWPWe73m4Tq3AGg=87%#@$fXfkMt%&a%yAA~T8^dvx$qDoG=7hXE%8y|9c6HQ zzqBxe(=D^QJApWFY2zTJKq#p%*kQT>@p3rGAG}t?3#NaH zP_$81MlWl2sP8hi>Qq?ECzFsBseX;ue~Xmx{9^E;ys1venM=q|h4OPRJz0LBHvKV!O{gIwk`8Xg_2E60ENu>zI`wV3FhY#*B@LH2Mlu&jZYqpZo-)%=3%}&& zs*3doEb}pj8lR=F_i=LWi>SY(#&dwYA-0@&z($DTT1dMpMvs>~{U$;64WCNtLFZLX ziq$(Ts$wjy$wir2vA)lCxz?VBth}L74FcBxCBOH}Tlexu<%E7$u*7W}H4XSfQ`HWy zI~WkhJsqqIj(GqOg{{!`DWMNNy=wO`xD_4SsH~@M|U62|o7w3`L3q z)aN#t&E|Kd)ivA-Ua?I3VS3Q0v0pu`P`{E<{rtWZ9zX=4E}3trAvttP7R}}QxiLxI zXlx7ve*2|wi)%<)j2v_m%M0xNJ7BvS?h!ss@{u}pph~Mzark*>fa4()KyEkQjb2t( 
zog-!P3n5^5(&!MIFywQzA$hpojB5jO{YJI-En{>@S!jka^>+diKF9Yq08&71w2`5x zf^QniGMC4={OoOgc6#h4EFC>*SAUEwx9H3KWHcZy1mZ>{Ix)I;dbFh#M#MbE?a*H| zRx8+ZcQn>J~(HdU-t5+5U!c?o7w(w3xOP}KIVoe9sis=arSLiZW!Vv=2kbcWhq z+~~9J1!z$Ft!H~vrhPy3-Jt*gV{|s1k>((%OQL*-QVo=*jo`1Qx**fF3nxh2>E1lG%PEiLSKNx$aL4l7l zc)8QWc-wX^Ms+eGp&utfsp3@lX#M$0aq~4-fU%JZPoO~qp~Gbkv2|VV?E#mZAzjav zIeKxm{ffzFHRV=Ag$}1+qIt5(D%rrYx1o6bAdys)xWUGH1Y&`Q`h^D|pj`fZx5%2k zXf9T(JH>1BQ<4#Y={To#jN2ansG~W8KWBUxKarHHg$8cy#1-H+Sk^L0;< z{f0Z#w~<<-H%k`zoh|BX~OImk>_Qj`79?K8{Wvv^G*5+HNFR ze%Y^Yx^hz+KwRw<%3VUArpD(tmz9_uqk<-@%~YyjTJhwF~FkymSC1* zF<#@w!-NCmQ{b}LpR++?;W}!;K*nYWl@7ZM-3wtzF_+^;+13;Y?tOnAzVPd3U%Td~ z>{Ew!ax)NaTM5kDX*++D*zq~t^*Jf5lc3y(QujB3kg;DuaJ#>4(kDYz%lM^4K1kEJ z?Z1myElz@{&q7gfTcw+-h1-8z!t{j%h|a^<6q!oDup>cySjqR9ywi*FX)!0rI1c!n zZeRy4*I0p!Fq1>YSmD!tr0gxV4pnuXfi6#2P7;)E*`4nFqgTCH`dkw}r^jUNUqFs9 zzHRu4F0M@Y_^0ASy@M(=wz~vjG&3IS6h!D=XVUPsWa$*0+771%GQonjQqhcmg;PWtRU|{VDQzNog$y(P94P_ zp+jr8;5e!LZ5$k&ou65$rbO2w2~^`KOLT~r8OyFf#!q-FtmZvk={b!>%}+~=Ilj#j zcUZo`dhX6N2J*pIf(Jj|xsJOG-7t&`yVicR4xqm(RpID{(jFnJkDj~8M7z2Xa?{6# z+Y!7%N9)bwEn975c9ll?%T*t1K&n-JY)Apn)XK!KmvL_`FZxHBZAW%u6FV%nv*mBW zj+T!=ddL2byq}F<7llq9@OV%5X2xyy(3oZ}Btdx<1l@<7dMCZcLI~P1d0@2UZb#_k z9gTru5fi`M*Z&s&`ldrmdU92Zn;;*^tZlXj)ZM~7u3`|-%XK6fbjyTJ=A%A&?Jbyb zJDoTQ+1g#aeAQk}THD0Z0^W5;j$Z;x|K9YQtuGusm2AyItE8UGCBXy*8m}BKHnG0h zum@?Y=473>&Z!f*(b2m61C4gg<*CAc7N?=@$$&fZwH2rb~ykZE~}d)$YWB=z=!S9UlH-1j;ZBRHJYuO(lb?b;5_Qm8VOy z%H?go@^sgyr=!}MI|*j2u0K8Tv>N%4akug-rjQ9(zExMEyDm)p8X+J0qA$x52?+p?Sx#zg523N+WM znYtW+;AAEo9tSrjfNN}P-ESqJnwx zJLzb8zBrsV48uK8TX3uE*LGssisY|_nP0t6Vh=zIS)&*C=3xo6t5griR9z>%-Q1X` zlS;eqcLMBoMLSj3ed^LKH6&nreX4o9^fAIIfXgUoiF#lEPS!K)1x^Pn2R89`9l@ZD&hVbn6O;7EQZJ(d}Xd;dpTn^7xwJf|Z63g^rq zE3x9~Cu@2c@jbqUq?sdj*4p)^r2I<_FDA7a!cupX5~GbnOHd2loL9cBX?NO|qk47g@p1>5m71!CG< zpEW#5p=j5Z5?FQq5(u4MFYi*!JJ!eg-reC&MJcE0%nV)}ba=KSYtxd0AzmB|3qy>ZVzL!m|oHurNF%FAe2TAYp~J zMe2Ev5&b0aNwovgM0J+G2qbnhu$P=IyaIp|sjRhQ@50o;t9&mop`0@fxJhA3ZQCzv 
z?miQO*`-)f3oi!qGYqk4D@!#GH)jWEVY4tW-3}>|j2bS)ba}0b;}=vL?fo*DA~i0b zyJkeeO%2wH*>ToZP-JzlNd>YmrilvWhEXC&A(-IDjR zQn0(GbH|qg}MIH6kQL%U(`KTV*41 zhKf#@f*xs`(0TzI`;;izM{bf@i61idC0EDx+n1aJZfe&OYZ}uJC=Mw1VT-i-j&77*%X>p;5W9Zu87lXa!$3$U}`fCF5U82v9_BOij ze{y1a|LGdSOILldhXcXOlseKR#Qzc5%uqaHnpnl-L1}#F7(kxRm&x+GD}5;as}vk~ z0++C6zHJHYMo|wv+H#=%g9VV}Pis3>sy&&fm>H&^Nf=K$)fRS8e4y{-8;6Ux;E`e= zXm)j@n7$pRtyS_Kly*i18#%8J62!y&kgQr4If zYWuNbX-Q=|r&2d_J5Juo8HrA{cVE19l`@n03Z-7c?%p_of&< z`~F%=171nOhVeZ(NI20);FL@q|CT_LAl<2k(;?H9O7btEslimAf_WF)5-(;8H&MjJ zdif~ilL6oU(CaI~BPW}$Y;XjRUsZbf8&~_dP`;qn52?6e7SFOq{Iu3jHXQ2{?iN4Z zO7>xaT&}l<4nAEfFcV;E7nt@)aakE1s8kNR3tM#_lb`XL7he6sHgpe{eF&aPjwe$) zhouf(y~(bwespKJbKMfgkwpx@RCN-M(4tt#&IIc7OqP}I0v!(&-K~u`Zt%T%`A$ z4818+!uVpDkeb)^B^m^SyaT_(`X9c!q+ahlr#(5i>ABz*?|nBlP^jKy{-$N{4@&uG zBKnj{dmp0cE9cTjBNdbyBL8RplC-;RB0 zLaN#Q``8hucAvfa6#r>_eUD2-4Xqc!uJGxWa2q)hatWSGgxbz7`+VJx6W;J9;A`85 zs7%+q0yGM)B!n?e!v}uyVo@>u4M;e<#51vvG)uM zzUd&5u#iT*iDh0N0i|_n3Szx$JKUnGYXVqF(6!r<>v>7T@3ANF|V>~SRk$*t& zHKY8*JB{=@ecFa*y!<^$XD;j@D`qEEU26SUnUVr*|Db2uSndu5m5aYo`L6;P^6On& zAsgRRLRUz3)zk9|+s?r?5lgHN{P^L*xM2m2Axnd#;%ll-0C^Fmi{c6ndQynyxJa>p z1V3QAfT#^?Nr?-T(kN9RZ^Xn}3!20!389uWNM%2na6lmX2V7?pCjP{F$yL`QIj0{8 z#0mBTkbQnvEJQDqbm^2z1<7TwpUO90VMl!CeJm(W83Zj$iZsZ|`Am%mKnbVMSd>Vc zo#l}$Y(`O=4teuoA~>+(h@uDFpVO~Eo~)~2C}xHF!de0VqlhghFE`;b#PS^o6hmR= zjnV1dYRWWHlxp=28minCrF2Lq4|4#PP+oZv8DX z)=LLTD34JBKv+U-yuYC}pIeE>MZ{x7#wpE1d5}POSJc`wsfZHth^C=I&eGTAG3sHBx!&uy+#|C(leziP3X1nIE^9x& zI3C-N{Khg$Tp{u?n)FB5dt_IhsCBbRN*baPu5YfcweV#2kqxeIF7LgH!n zAviust3IbCd8}|)fUj`((VEDW9J5R+gpJP8;z*-JXQ9MyKXqg}Ar8IYAXcM&xI$rW z2$0?-X!VIh#OZN+VVyW79^`~~tN!Ah_bCZMlt?-=(2=pm1Hz`uz&Ux399mw}e&9p0qhboy(mpL!|ijZQRHSFsUx49M}a>MSD~3ZTr9rTvRg8GV4j<2C4xyUVboi z*Kju+wjfmX(0^Hq_dw#+5Bm>T{rx`@ayQgetcR?=*8+0YPM@Oi2pZi%gDc4{gPJ>_ z%6z?`Tli!H*YosvH!M>ba;E!YX_>@Pe46w>mEB+UCtAwV!m{x#Xp{&_y!+&NMgUG{ z4A^v;IzGqS;a4SJkQe}H?0Erl)N-)O$?OpDWyVT^DV_}%R?U;5>+i`3*O!-sh&1#u zP!?Iq%ST&16?sK<)ocT3k`v(RmDm(FUPc3RReSZP$7^V+mqr|m&hmh|o{NKgh7;!J 
zy}kW~RGW5kw=w__Ob+V#c%$r_aaKS!X3HK7>iOMjdfqNtpaQ!*=G0TJ=W{%eK?b4< zG=PN!EHe}u;sLOs(}tG-8l)Uh(;6Anva|X4beme{Z{=uoMH&MNcALtWF;FuetgzGb zTyMl&0aY#}AnvtbS&46g9N+3WCNA<4vG&&pp#$^aWi+McZeeAUDL~f%Q!YR3?#?$r zZX*N8=`kP#05%=;-51PT?hSak0Vl_Ys80;3)}T0=LNJLK68#za`n69Iy93wTBw}6n z#%_TvIsVnkna$`CK%g;2)8QTwxBN6jYw`M_=4Dv@80WVm!}D{5#n2%Ws}p%GdTA5cO+w`X1PdrJ z{GNa)W&;QpWuO>3Hu%Jf*{sTTyt-V+WA1g3Qmc&Ir43my-i%vAO(R zpF`B7_6?2I(jEzOnx>7Z%&y8h0e~{Ge1c2foFc;53o;7Ns=Mhv_gXzkf^4Hp=R$zM z#uY4@ul2J$LS;*^)5{o-e6ccRFnl5l<~F7vBWR$oA8FhIl7rD`*}eyvYIR_lhl zW8g5~*!Jz6B-yheut2qNym*ZdXb9wkqb-5$y9UWR?)~M`9%}$_l?M_%>Af_#T4@U#iy1Nm)MAurV#xw4LnXxPQ1A$UyO^^0L^njd6iiWMHD^07@v+OQ&7~~8tZ$y6lrr&#o=rN z2gOB1<6Mpe+v?tJKpqrexKT*cLa=|lJE?_cp_v?l1u9(~uR(G22pBPIABl7QE~_ zEip-2z;&rsqK3M1NDP}&y=-7CPco^_NO$49*vbWQwfvu3qau(#2EN;l$YbO(hbQ<5Qo91L6K{9huyZ7 zQMDDH-6{hAr zc46S{TsK;YYohTQlkdB*Ys;>mt_~HY?yFfaeMM;DbH_!@JWE^Mb637L=r(-^cdGUi zePZzvq>#<}JH)?Z;pWb9F>q;oyF?S|qS|W{A8q1vT-i1)zNH7Qfb3YoIJ!b_^5gCV zYTw-`xn1INy-t)z45zcubk(aHI0ONF zcH|q88)(w!FyJvyc@6LY=53;Y#|LufuMY+RPl2Jj#H){mX*RF0I#)};{)Gh0?MiX+ zjK7%fmGdo%5)wj}7Lfe?z;%B z(=xy)D?qJi>>~x)F6>qsZ3-r)2k)zuqhdfQN=yC!m$e z?|!HD(Z0MPtFw_jZ_qs7yYI0-iO|y)vp@b$rO34Cv)`$A)C}mXS&9_9I|SmKgI?FB zM@3P?ZM`P|Jsk?VdczPv)cINg-%;Ic0sLV>Z^eSEc}l$Yrh>!mXdAbKcq-Q~6A+&u*Yf8t9Onx~qNT=DhR}tr#3PIHEr~ZvB(>dEM$7YSf3p(A(|G-Q zQ>O2vD8}DWiA9I3F=8Ob3{!h6U0-Kd^A(>mZ+a9R!TZJI$GZjHc`64(+=L2}LmBmA#4zGW;h`2a?{d>+GFK0{cp+nmZws;wP2zY8qpi(w zGPYEsk%iySc>n1$dw!S4D85>wG%JI3u|XWM;i<&0ECOl_LRCgT@z53MDN=StJ>#ri zSo)a0`Ef5I1275%pq3kb2@J{-w68>{TajvsAvg0ZW5DzMVdg0cN@lrtsFfHX+X<4N zS-D;s)+l)2dTH260fpzdzqD+bE8c6*$WrM!z>iWbt~*kFssVMDhrM$4o`1P1`+E4A zJwN1>FPp`hi6Ots@!QKF@QHbl7K)Ys;_<83D%7_% zk8cbN{;0Sykn-rGA-Ya%N}jn0(iRn2N{j_rC|w%C02Ii)3=fepJm^Bt+^oEEo<{iv z5GZZoD^kW9YPl&92(t``oH^&S1V9Sv$Jl~g?gYX8hGl@|-LvrTObJV&zqs=PI= z!ZmQ5Rn^e$&{9!^54;i~;s8-atV|~`+?V(~L>jWH#iekbk)a!j1E!76y8}kDzrP0f zAp{Oc{Dz%hGXh?Tk{IN(%y|z{P)H`fVoLZ*UgQ?<2xr;vBvXkD;!sx&cngjp^x!Cp zU=K%+@_OywJ;b~{V}aUJWErP|4Bc;b#Wf5aozHJG%BJ8V)}9H|*S}16H1zH%+FXmh 
zARM4CtF(Hq)={@0h?QSuYRn**zbzwze4CDlNt1zKX?<0d7Rv(SrcDO*d#ASq#b37ts}M8A|OEa&zUFs^1w}KCGjKJa6@}3D_kVX+XK@22U~_riH=qJ=N>@|#;aH~l#gQ2NwExlTA=Bt?3iLp$mcYC2-{w` zjJ0}i30*3|bGVYDmW<+U9kug=G`tw{=I}L2Rn~2&;W`=6H|Ll(eJFJFhF`eP1K{I&1mYXLwp__U;4eny{#P~l0p-!8ZS`DtZ> zRs?=s_W^}u3=yi-T}>*H z&5@fOdQfu3{XQ2Pl-jFA;i?grIMkmys(VS}lA#uXj*^(tKDVUEOd;R>M|_#Op)!iJ zRkwcDYA{qu4sgSz?)zP7ah0d9e($7uXsH02%}ZVEQ2KcNf{MwV-`S59gb=Z!QKA0& z>>v?6?DijB^+)UKRrNXa$n_v3BY7}g!i6{8<+&vNZnH7!PhTC9;!E36*t{?`uU3;@ zi)=wE8rJ;8D9}JVLxOlgPo(NCQ&k6BuDQ&^G2UxcpW8L6535tgrg^)M?hi#LelA?* zk)4#b7NatKfctjNIo=hB`TRrq0Z5)^gsZF5VzpHK!)Pw>U%s&SDZUrZ77Y!nXy1a5 zL?=Z^NoJG&-IXg5`c;Rnt;|;#~gjH^) z1mP#SUe|tdJZAN^HHw;IkKUN{Bc}qbVRUDJuMbAEXF0 zmj=%lL&~tGn&sOFoJ#2_m-8CZSxx2+1Vh^e3OH{f&~!P<3kg7ubJx_zA(WIGctG$8 z8tD7@os3xqqd5rQ`GwHppYJ;eqJ|^9h;#h^81QAtld+7qaec#W2VpCmkJS^;XLKf` z$AERhPdJ|c5U6pCjL6UPL8Zk4u~|tgd!Q9qL<%}PPKs|%NCVlXOY0>-pJHoLh%wU) z{OxI! zSMK2g33^1`76oZl0o0heFn^(3%HmC;;X5gjcU++#U;Vl-ZO!m8ct8m43}NxUM0YSQ zs_k&mrhR)N7XLX{90G@<=myn3{Y{1ts(XO6fj-{UWqgDaP(&xaaLV=&Ftf=HL8kS5 z>UEkn9g?6w!7>!gCmd#MQVJ^3qx6Z^hV{ogT1Lw)Ap`{yVFz-lvMg2ay+Kg416a2x zfTj;=WI6IY-C4GnX^(Wv(s3D3b?udAsIu`E54&)@?2+OzQ9A{g6O4Hu(7a)95UF13 zxdAwpHh}MIni6<``_c!bblvcSx3M=J4SZnzQL;GiE7dR88uG7{MT45Z&xi6MUDV0Z zg0NPqkhivG&QHam(O#iNr}Encw!gi#Jp4y|h;>K6*ApLbKnYZ5D+1MY?_8n^1$DBy z$FH~(<|$71frhepWmM#MpNd-9G(Es*V$MunA_b@qT4nlOku>v^9A6Rs%hqE8Ov+jK_jn3hT zB`I=+IifaCWUUu4z7@j6Lq$4>GNsBt+M8T9y|xmCH5?)eA4l^!2;K$e`=bd1Hf7%) zqC-o5P$pdggZZM;vkZ^`BmfH96a+TJ`Iqubm%%bspi}(SrZSLkjsz0xnuC(4KffFL z+!)A?Yqs_(0oBN6CY*h+zfw695p;<>ZvZ5!(q-W!Y)uh@I~qeNTtnJ@XV?oM zw2YSoJPQ!&W}NDXuU@TQ8jF8o@>$6!}DxLZdVG=Wi}L`b`;p4kgINDDaI3^bf8WV z0T@zkx1Ro>O1A+Zi`yic>~RdDZd?99ii;dJ6I2AIW};&B;yb4Rb}$1BhXOP;y)Xis zr84b!yU2%&XdN)AQ?rDzyY5hU|9VLd4#KqQgabxMb!~A(=|@0ALV-%Bg`W5J>#&9A zRnF+N7EGb%WWUE}6tGD|%BDU$0+^uc;K-z6+B_I^fGDBG>~~*?D-zhrPK~yOW~0QGGZSq;uo<=Sc21St zcq9W6-vtL<3^bl1#PVl8_+@`LY;479^ae#@zL%?Ei&zBo!`{2&X1D_DzJdUHVKK&b z*gOFT)>Z(hW!e|=l*M;h&%W`J!gU(^D|653$cV3h7E_uc5*yE101Pi1G%3UT;AfU= 
z6BffTpiddaY=gDncL*TNvzx3#0!q>_pfD_dR$Q0Np2^3Pgnj{(Fw2oCU~W#IZYam; zKFt9nqEAn^7fk?#tQ7FO+sVmt+hXAP3Q0VI3|0DWc2j-T(a zr0{Z+tc5XsfF)Ysyd<$}{mzOe5j9`quM8X?F-x`ig2^-pPNlk=*>|ieYR6say2F%D zLh(Xto_aptshh98qfc-4&;#ct@7LA3L)*jM`coT(&HSyF`!49iBRGx1;PeqcrO|SO zhHqf9(IYIdhCf5Hhnk6zCvMjl!&*W`1yzEd;*rGPiHX{GJiSOfj&ozf_udTw_u{)> zPk&5&tSPsfoNyg6A=Uz4;G~3Yx7!B5>g=W0&1R|nUcZ6$8eAG>Hu3|96XP5#Tdt#>tD4?_oZqro}Z}3)O9L)KX=6^fIjx)>G9-gmxl(vEH!mI zt4$!GS}{0J@Fi0rc`?2CW=*)T(2-WrGPSS!h3FN4h1PFiutC=Wq_M=JpJ;(U=(T5Gyu(z9JC0-@Z;fKm8*sZ;I!bI0J7Hu*_#3 z%QBl3)o#itG`v~KYV~fJ>|TolWYB7GB7uxL(FS6%-e^MhZC3cGlec?y=5ZU#^a7gs za`XiYkqVBFM3rMd)4B6}BO$q77H>i_rcJ|DXHjTE<|pLf+P5$a|9uQ%+_4qBx{(Fq zMCJI>;uxoEmoG4lj=9Wq-&nP2A+Qe@lPi^PpbN2UBf~8+x}aYRN6aW(ZGVTKPK$ik zB)*Nfh>@Ae+h2-wF%EgFzp8^}Urc}EK!*HULjNj`xV1G9y9xM_33uu^yMBQS7cd{7 z|Gxmk8$4HqA$K{bFWpV1h&x^h8(6M9vboiE1P=Np)ia$$>`gq?oia6`nq8r;ETFUn z$I(A`G1N2nW8b|4NwIZM?F$}9h#kPr(lvgxgAz6J-iQT-#4u_FxRg$s)ff#Iz&&qtN6FL zsyr3l2GP;&U=cB0!&Ac;G$kZy+@tSl%EE+9(?!9PN46e@E(!Mx{KJ|sUuykIS*SXe z5n7jM!6X*ef^zOVb(SgVY9PVU%?C7sb#oe`9_AmayzmtEA7wETtuhSw@l;6V6O*AjM3i zo9m2FgA=qX8E`E;Uz?t|m;E{pYjM9Dl#H3hc6t5YyHp}K8X_BkqeTy0BcE7~tb2M6 zLT-J&tHjQ?`F2QJ_Y<|99QC9 z!;lTcJux}1MBWk1L<&2&1;2Nv*m$jUfJYE!$8j)G0?+1QX+Pf_P(JQ?-uZi*vkx4# zr&B!UZ#3InA!bIJ(uA-2i_cwa77BsH^nee$l~Ku~gotYrTRSOgMKV)P4= zYx*kAO?0!xEb~j-slfV=FdOQPqIv2@GVyidmiuA?L}tAoZ+nzm9%7P@E@DNpR4eACX*f&bGH%q59X6z10UL>U8{(mPHt6R$>~O!nD{ zhCyP3P8j;c>%PGTB`J#{A_2t9qvM%b`O88RfRWy{Hgn{7C0s}Qimum&geQoWU+}Yo zqquCJe|4HQHgIFe3fkx=Vt%iT_E$0!4y3}LDNtD@xk(}nh`BMI-gk&hZoS#*=~?X& zb-I#HYE|iT&26?L7?XaV>;n44x@RO0Hk&RN;=iq0DBt(c*1py`^~ZS^_DgeK%X@b% zq2|Nqo>SW)d1-|z?E1soG6ft8HuHS<~w|XS&qiY?Lr{dy@NL zdxwS7n-uvhJ^6E)TKasoKA)=J5u<)vFq_jelcabMraNDuThaO&!m}3U&%qeg%MSW#;46W>Pcmy*1%F~>Oh!)8BcPt^m^IRIE z1Sd!*uJzEG|3d&+5BtZjTn6%XfIQPB_pcA?uDHxNv1^v$HkGECCCk_q_zeQ{+}{ zF_^0k?)FY#Vaft-)pGa++Hto2{M96+p}xim94-+#?Jf>SA#ny=_OR}<_q#Zk)oZfA zop3pzc{U2*z1EFtoEn?>Sc?9BE&Bc{KvopA7~;YyBb`eNrznF*1tB}AfUZzrQWerj zOpQ$Pi#*A6(nPyHS4;^wvewb_2;S{Z>EU0!0`+UHx3v 
IIVCg!06@p%6#xJL literal 0 HcmV?d00001 diff --git a/report/main.tex b/report/main.tex index e3847d3..944319f 100644 --- a/report/main.tex +++ b/report/main.tex @@ -461,6 +461,13 @@ As this project is focused on the collection and analysis of online community da A unified data model is used to represent all incoming data, regardless of its original source or structure. This ensures that the same pipeline works across YouTube, Reddit and boards.ie data, and can be easily extended to new sources in the future. +\begin{figure} + \centering + \includegraphics[width=1.0\textwidth]{img/pipeline.png} + \caption{Data Pipeline Diagram} + \label{fig:pipeline} +\end{figure} + \subsubsection{Data Ingestion} The system will support two methods of data ingestion: \begin{itemize} @@ -952,6 +959,24 @@ A middle ground was found with the "Emotion English DistilRoBERTa-base" model fr As the project progressed and more posts were classified, the "surprise" and "neutral" emotions were found to be dominating the dataset, which made it difficult to analyse the other emotions. This could possible be because the model is not fine-tuned for internet slang, and usage of exclamation marks and emojis, which are common in social media posts, may be classified as "surprise" or "neutral" rather than the intended emotion. Therefore, the "surprise" and "neutral" emotion classes were removed from the dataset, and the confidence numbers were re-normalised to the remaining 5 emotions. +\subsubsection{Topic Classification} +For topic classification, a zero-shot classification approach was used, which allows for classification of text into arbitrary topic classes without needing to fine-tune a model for each specific set of topics. Initially, attempts were made to automatically generate topic classes based on the most common words in the dataset using TF-IDF, but this led to generic and strange classes that weren't useful for analysis. 
Therefore, it was decided that a topic list would be provided manually, either by the user or using a generic list of broad common topics. + +Initially, the "all-mpnet-base-v2" \cite{all_mpnet_base_v2} was used as the base model for the zero-shot classification, which is a general-purpose sentence embedding model. While this worked well and produced good results, it was slow to run inference on large datasets, and would often take hours to classify a dataset of over 60,000 posts and comments. + +The choice between this model and a weaker model that was faster to run inference on was a difficult one, as the stronger model produced better results, but the performance was a significant issue. + +After testing multiple models, the "MiniLM-L6-v2" \cite{minilm_l6_v2} was chosen as the base model for zero-shot classification, which is a smaller and faster sentence embedding model. While it may not produce quite as good results as the larger model, it still produces good results and is much faster to run inference on, which makes it more practical for use in this project. + +To produce the results, the topic list is embedded and cached using the sentence embedding model, and then each post is embedded and compared to the topic embeddings using cosine similarity to produce a relevance score for each topic. The topic with the highest relevance score is then assigned to the post as its topic classification, along with the model confidence score for that classification to allow end-users to see how confident the model is in that classification. + +\subsubsection{Entity Recognition} + + + +\subsubsection{Optimization} + + \subsection{Ethnographic Statistics} This section will discuss the implementation of the various ethnographic statistics that are available through the API endpoints, such as temporal analysis, linguistic analysis, emotional analysis, user analysis, interactional analysis, and cultural analysis. 
Each of these are available through the API and visualised in the frontend. diff --git a/report/references.bib b/report/references.bib index 519d062..11ebddb 100644 --- a/report/references.bib +++ b/report/references.bib @@ -13,9 +13,23 @@ howpublished = {\url{https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/}}, } +@misc{all_mpnet_base_v2, + author={Microsoft Research}, + title={All-MPNet-Base-V2}, + year={2021}, + howpublished = {\url{https://huggingface.co/sentence-transformers/all-mpnet-base-v2}}, +} + +@misc{minilm_l6_v2, + author={Microsoft Research}, + title={MiniLM-L6-V2}, + year={2021}, + howpublished = {\url{https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2}}, +} + @inproceedings{demszky2020goemotions, author = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith}, booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)}, title = {{GoEmotions: A Dataset of Fine-Grained Emotions}}, year = {2020} -} \ No newline at end of file +}