WebSci '23: Proceedings of the 15th ACM Web Science Conference 2023

SESSION: Politics and Ideology

Political Honeymoon Effect on Social Media: Characterizing Social Media Reaction to the Changes of Prime Minister in Japan

New leaders in democratic countries typically enjoy high approval ratings immediately after taking office. This phenomenon, called the honeymoon effect, is regarded as a significant political phenomenon; however, its mechanism remains underexplored. This study therefore examines how social media users respond to changes in political leadership in order to better understand the honeymoon effect in politics. In particular, we constructed a 15-year Twitter dataset of 6.6M tweets covering eight changes of Japanese prime minister and analyzed it in terms of sentiments, topics, and users. We found that social media tend to show a honeymoon effect when the prime minister changes, although not always. The study also revealed that sentiment about prime ministers differed by topic, indicating that public expectations vary from one prime minister to another. Furthermore, the user base was largely replaced before and after each change of prime minister, and the sentiment of the two groups also differed significantly. These findings have implications for administrative management.

Political advertisement on Facebook and Instagram in the run up to 2022 Italian general election

Targeted advertising on online social platforms has become increasingly relevant in the political marketing toolkit. Monitoring political advertising is crucial to ensure the accountability and transparency of democratic processes. Leveraging Meta's public library of sponsored content, we study the extent to which political ads were delivered on Facebook and Instagram in the run-up to the 2022 Italian general election. Analyzing over 23k unique ads paid for by 2.7k unique sponsors, with an associated spend of 4M EUR and over 1 billion views generated, we investigate temporal, geographical, and demographic patterns in the political campaigning activity of the main coalitions. Our results are in accordance with their political agendas and the electoral outcome: the most active coalitions also obtained most of the votes, and regional differences are coherent with the (targeted) political base of each group. Our work draws attention to the need for further studies of digital advertising and its implications for individuals' opinions and choices.

Wearing Masks Implies Refuting Trump?: Towards Target-specific User Stance Prediction across Events in COVID-19 and US Election 2020

People who share similar opinions on controversial topics can form an echo chamber and may share similar political views on other topics as well. The existence of such connections, which we call connected behavior, gives researchers a unique opportunity to predict how one would behave in a future event given one's past behaviors. In this work, we propose a framework for conducting connected behavior analysis. Neural stance detection models are trained on Twitter data collected on three seemingly independent topics, i.e., wearing a mask, racial equality, and Trump, to detect people's stance, which we consider to be their online behavior in each topic-related event. Our results reveal a strong connection between the stances toward the three topical events and demonstrate the power of past behaviors in predicting one's future behavior.

Analyzing Polarization And Toxicity On Political Debate In Brazilian TikTok Videos Transcriptions

With the rise of TikTok's popularity, there is an opportunity to understand how political communication takes place on this short-video platform. In addition to examining what topics and themes are discussed on TikTok, we also analyze the polarization and toxic behavior of its users. This opportunity brings a challenge as well: because TikTok is a video platform, most of the content's information is carried in the video itself, so it must be extracted from the video data and features.

In this paper, we propose a methodology for extracting topics from TikTok video transcriptions in order to identify polarization and toxicity in their contents, using techniques that range from web crawling to speech recognition algorithms. A robust audio-cleaning pipeline that removes silence and music segments yields a less noisy dataset. We validate our methodology by applying it to 8,329 Brazilian political TikTok videos collected over the last two years, extracting topics in order to identify signs of political polarization and toxicity. Our work shows that it is possible to extract coherent and meaningful topics from TikTok videos despite the challenges that spoken texts bring. We find that topics related to religion and social classes contain a higher percentage of toxicity and polarization, as do opposing hashtags such as "direita" (right-wing) and "esquerda" (left-wing).
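The silence-removal step of such a pipeline can be illustrated with a short sketch. The following is a minimal example using pydub, which is an assumption on our part (the abstract does not name an audio toolkit); the thresholds are illustrative, and music-segment removal, which typically needs a dedicated classifier, is not shown.

```python
# Minimal silence-removal sketch using pydub (assumed toolkit, not the
# authors' actual pipeline). Thresholds are illustrative.
from pydub import AudioSegment
from pydub.silence import split_on_silence

def clean_audio(in_path: str, out_path: str) -> None:
    """Drop silent stretches so speech recognition sees mostly speech."""
    audio = AudioSegment.from_file(in_path)
    # Split wherever there is >= 700 ms of audio quieter than -40 dBFS.
    chunks = split_on_silence(audio, min_silence_len=700, silence_thresh=-40)
    cleaned = AudioSegment.empty()
    for chunk in chunks:
        cleaned += chunk  # concatenate the voiced segments
    cleaned.export(out_path, format="wav")

clean_audio("tiktok_video_audio.mp4", "cleaned.wav")  # hypothetical paths
```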

Beyond Fish and Bicycles: Exploring the Varieties of Online Women’s Ideological Spaces

The Internet has been instrumental in connecting under-represented and vulnerable groups of people. Platforms built to foster social interaction and engagement have enabled historically disenfranchised groups to have a voice. One such vulnerable group is women. In this paper, we explore the diversity in online women’s ideological spaces using a multi-dimensional approach. We perform a large-scale, data-driven analysis of over 6M Reddit comments and submissions from 14 subreddits. We elicit a diverse taxonomy of online women’s ideological spaces, ranging from counterparts to the so-called Manosphere to Gender-Critical Feminism. We then perform content analysis, finding meaningful differences across topics and communities. Finally, we explore two platforms, namely, ovarit.com and thepinkpill.co, where two toxic communities of online women’s ideological spaces (Gender-Critical Feminism and Femcels) migrated after their ban on Reddit.

One of Many: Assessing User-level Effects of Moderation Interventions on r/The_Donald

Evaluating the effects of moderation interventions is a task of paramount importance, as it allows assessing the success of content moderation processes. So far, intervention effects have been evaluated almost solely at the aggregated platform or community levels. Here, we carry out a multidimensional evaluation of the user-level effects of the sequence of moderation interventions that targeted r/The_Donald: a community of Donald Trump adherents on Reddit. We demonstrate that the interventions: 1) strongly reduced user activity; 2) slightly increased the diversity of the subreddits in which users participated; 3) slightly reduced user toxicity; and 4) gave way to the sharing of less factual and more politically biased news. Importantly, we also find that interventions with strong community-level effects are associated with extreme and diversified user-level reactions. Our results highlight that community-level effects are not always representative of the underlying behavior of individuals or smaller user groups. We conclude by discussing the practical and ethical implications of our results. Overall, our findings can inform the development of targeted moderation interventions and provide useful guidance for policing online platforms.

SESSION: Misinformation and Misperceptions

Propaganda and Misinformation on Facebook and Twitter during the Russian Invasion of Ukraine

Online social media represent an oftentimes unique source of information, and having access to reliable and unbiased content is crucial, especially during crises and contentious events. We study the spread of propaganda and misinformation that circulated on Facebook and Twitter during the first few months of the Russia-Ukraine conflict. Leveraging two large datasets of millions of social media posts, we estimate the prevalence of Russian propaganda and low-credibility content on the two platforms, describing temporal patterns and highlighting the disproportionate role played by superspreaders in amplifying unreliable content. We infer the political leaning of Facebook pages and Twitter users sharing propaganda and misinformation, and observe that they tend to be more right-leaning than average. By estimating the amount of content moderated by the two platforms, we show that only about 8-15% of the posts and tweets sharing links to Russian propaganda or untrustworthy sources were removed. Overall, our findings show that Facebook and Twitter are still vulnerable to abuse, especially during crises, and we highlight the need to urgently address this issue to preserve the integrity of online conversations.

On the Globalization of the QAnon Conspiracy Theory Through Telegram

QAnon is a far-right conspiracy theory with real-world implications: supporters of the theory have participated in violent acts such as the US Capitol attack in 2021. At the same time, QAnon has been evolving into a global phenomenon, attracting followers across the globe and in particular in Europe. It is therefore imperative to understand how QAnon became a worldwide phenomenon and how this dissemination happens in the online space. This paper performs a large-scale data analysis of QAnon on Telegram, collecting 4.4M messages posted in 161 QAnon groups/channels. Using Google's Perspective API, we analyze the toxicity of QAnon content across languages and over time. Using a BERT-based topic modeling approach, we also analyze the QAnon discourse across multiple languages. Among other things, we find that German is prevalent in our QAnon dataset, even overshadowing English after 2020, and that content posted in German and Portuguese tends to be more toxic than English content. Our topic modeling indicates that QAnon supporters discuss various topics of interest within far-right movements, including world politics, conspiracy theories, COVID-19, and the anti-vaccination movement. Taken together, we perform the first multilingual study of QAnon on Telegram and paint a nuanced overview of its globalization.
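The Perspective API used here is a public REST service, so per-message toxicity scoring can be sketched in a few lines. The sketch below assumes a hypothetical API key and requests only the TOXICITY attribute; the paper's exact request parameters are not specified.

```python
# Hedged sketch of scoring one message with Google's Perspective API.
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity(text: str, language: str = "en") -> float:
    """Return the TOXICITY probability (0-1) for one message."""
    body = {
        "comment": {"text": text},
        "languages": [language],  # e.g. "en", "de", "pt"
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```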

Characterizing and Predicting Social Correction on Twitter

Online misinformation is a serious threat to public health and society. Social media users are known to reply to misinformation posts with counter-misinformation messages, which have been shown to be effective in curbing the spread of misinformation. This is called social correction. However, the characteristics of tweets that attract social correction, versus those that do not, remain unknown. To close this gap, we focus on answering two research questions: (1) "Given a tweet, will it be countered by other users?", and (2) "If yes, what will be the magnitude of the countering?". This exploration will help develop mechanisms to guide users' misinformation-correction efforts and to measure disparities across users who get corrected. In this work, we first create a novel dataset with 690,047 pairs of misinformation tweets and counter-misinformation replies. Then, we conduct stratified analyses of tweet linguistic and engagement features, as well as tweet posters' user attributes, to identify the factors that are significant in determining whether a tweet will get countered. Finally, we build predictive classifiers to estimate the likelihood that a misinformation tweet will be countered and the degree to which it will be countered. The code and data are available at https://github.com/claws-lab/social-correction-twitter.
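Research question (1) reduces to a binary classification over tweet-level features. The snippet below is a minimal illustration with hypothetical feature and column names, not the paper's actual feature set or model.

```python
# Toy "will this misinformation tweet be countered?" classifier.
# Feature/column names are hypothetical stand-ins for the paper's
# linguistic, engagement, and user attributes.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("misinfo_tweets.csv")  # hypothetical dataset file
features = ["follower_count", "retweet_count", "tweet_length", "has_url"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["was_countered"], test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```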

Emotional Framing in the Spreading of False and True Claims

The explosive growth of online misinformation, such as false claims, has affected the social behavior of online users. To be persuasive and mislead the audience, false claims are crafted to trigger emotions in their audience. This paper contributes to understanding how misinformation in social media is shaped by investigating the emotional framing that authors of claims try to create for their audience. We investigate, first, how the existence of emotional framing in claims depends on their topic and credibility. Second, we explore how emotionally framed content triggers emotional response posts by social media users, and how emotions expressed in claims and in the corresponding users' response posts affect sharing behavior on social media. Analysis of four datasets covering different topics (politics, health, the Syrian war, and COVID-19) reveals that authors shape their claims depending on the topic area to pass targeted emotions to their audience. By analysing responses to claims, we show that the credibility of a claim influences the distribution of emotions it incites in its audience. Moreover, our analysis shows that emotions expressed in claims are repeated in users' responses. Finally, the analysis of users' sharing behavior shows that negative emotional framing of false claims, such as anger, fear, and sadness, leads to more interaction among users than positive emotions. This analysis also reveals that among claims that trigger happy responses, true claims are shared more than false claims.

Misinformation Detection Algorithms and Fairness across Political Ideologies: The Impact of Article Level Labeling

Multiple recent efforts have used large-scale data and computational models to automatically detect misinformation in online news articles. Given the potential impact of misinformation on democracy, many of these efforts have also used the political ideology of the articles to better model misinformation and to study political bias in such algorithms. However, almost all such efforts have used source-level labels for credibility and political alignment, thereby assigning the same credibility and political alignment label to all articles from the same source (e.g., the New York Times or Breitbart). Here, we report on the impact of applying journalistic best practices to label individual news articles for credibility and political alignment. We found that while source-level labels are decent proxies for political alignment labeling, they are very poor proxies for credibility ratings, almost the same as flipping a coin. Next, we study the implications of source-level labeling for downstream processes such as the development of automated misinformation detection algorithms and the political fairness audits thereof. We find that automated misinformation detection and fairness algorithms can be suitably revised to support their intended goals, but they might require different assumptions and methods than those appropriate under source-level labeling. The results suggest caution in generalizing recent results on misinformation detection and the political bias therein. On a positive note, this work shares a new dataset of articles individually labeled according to journalistic quality standards, along with an approach for misinformation detection and fairness audits.

Understanding the Use of e-Prints on Reddit and 4chan’s Politically Incorrect Board

The dissemination and reach of scientific knowledge have increased at a blistering pace. In this context, e-Print servers have played a central role by providing scientists with a rapid and open mechanism for disseminating research without waiting for the (lengthy) peer review process. While helping the scientific community in several ways, e-Print servers also provide scientific communicators and the general public with access to a wealth of knowledge without paying hefty subscription fees. This motivates us to study how e-Prints are positioned within Web community discussions.

In this paper, we analyze data from two Web communities: 14 years of Reddit data and over 4 years of data from 4chan's Politically Incorrect board. Our findings highlight the presence of e-Prints in both science-enthusiast and general-audience communities. Real-world events and other distinct factors influence discussions of e-Prints; e.g., the surge of COVID-19-related research publications during the early months of the outbreak was accompanied by increased references to e-Prints in online discussions. The text of e-Prints and that of the online discussions referencing them have low similarity, suggesting that the latter are not exclusively talking about the findings in the former. Further, our analysis of a sample of threads highlights: 1) misinterpretation and generalization of research findings, 2) early research findings being amplified as a source for future predictions, and 3) questioning of findings from a pseudoscientific e-Print. Overall, our work emphasizes the need to quickly and effectively validate non-peer-reviewed e-Prints that get substantial press/social media coverage, to help mitigate wrongful interpretations of scientific outputs.

SESSION: Language and Emotions

Multi-emotion Recognition Using Multi-EmoBERT and Emotion Analysis in Fake News

Emotion recognition techniques are increasingly applied to fake news veracity and stance detection. While multiple emotions tend to co-occur in a single news article, most existing fake news detection has leveraged only single-label emotion recognition mechanisms. In addition, the relationship between the emotion of an article and its stance has not been sufficiently explored. To address these research gaps, we developed a multi-label emotion recognition tool called Multi-EmoBERT and applied it to fake news datasets. The tool delivers state-of-the-art performance on SemEval-2018 Task 1. We apply the tool to identify emotions in several fake news datasets and examine the relationships between veracity/stance and emotion. Our work demonstrates the potential of predicting multiple co-occurring emotions in fake news and has implications for combating its spread.

Geolocated Social Media Posts are Happier: Understanding the Characteristics of Check-in Posts on Twitter

The increasing prevalence of location-sharing features on social media has enabled researchers to ground computational social science research in geolocated data, affording opportunities to study human mobility, the impact of real-world events, and more. This paper analyzes what crucially separates tweets with geotags from tweets without. Our findings show that geotagged tweets are not representative of Twitter data at large, limiting the generalizability of research that uses only geolocated data. We collected 1.3M geotagged tweets on Twitter (most of which came from Instagram) and compared them with a random dataset of tweets on three aspects: affect, content, and audience engagement. We show that geotagged tweets on Twitter exhibit significantly more positivity, often citing joyous and special events such as weddings, graduations, and vacations. They also convey more collectivism, using more first-person plural pronouns, and contain more additional features such as hashtags or objects in images. However, geotagged tweets generate less engagement. These findings suggest that there are significant differences in the messages conveyed in geotagged posts. Our research carries important implications for future work utilizing geolocated social media data.
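The affect comparison can be sketched as scoring each tweet with a sentiment lexicon and testing the difference in means. VADER below is a stand-in chosen for illustration, not necessarily the affect tool the authors used, and the example tweets are toy data.

```python
# Compare sentiment of geotagged vs. random tweets (illustrative only).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from scipy.stats import ttest_ind

analyzer = SentimentIntensityAnalyzer()

def compound_scores(tweets):
    # VADER's compound score lies in [-1, 1]; higher means more positive.
    return [analyzer.polarity_scores(t)["compound"] for t in tweets]

geotagged = ["Finally graduated!! Best day ever", "Our wedding day!"]  # toy
random_sample = ["stuck in traffic again...", "so tired of this news"]  # toy

geo, rand = compound_scores(geotagged), compound_scores(random_sample)
t_stat, p_val = ttest_ind(geo, rand, equal_var=False)  # Welch's t-test
print(f"geo mean={sum(geo)/len(geo):.2f}, "
      f"random mean={sum(rand)/len(rand):.2f}, p={p_val:.3f}")
```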

On the Prevalence of Leichte Sprache on the German Web

Web accessibility guidelines call for website content to be ‘understandable’. In the German public sector, this principle has been interpreted as a specific set of writing rules known as ‘Leichte Sprache’ (LS). In this paper, we set out to investigate the prevalence of LS on the German web, using both web measurements and qualitative methods. We find that while many of the prerequisites for the creation of content in LS are now in place, such as accessibility monitoring authorities or procedures to translate content into LS, the vast majority of public sector websites are still not accessible in this regard. Based on these findings, we offer four technical and policy recommendations to move towards a more inclusive web.

Language on Reddit Reveals Differential Mental Health Markers for Individuals posting in Immigration Communities

The experience of immigrating to a foreign land is associated with exposure to new cultures, changes in social networks, and challenges to prevalent systems of meaning. A body of literature has shown that the immigration experience, while pursued with hope of a better quality of life, is associated with adverse effects on immigrants' mental health and well-being due to sociopolitical and economic factors. In this paper, we study first-hand accounts of struggles with mental health by individuals who participate in immigration communities (aka subreddits) on Reddit, a popular social media platform. First, we compare and contrast the sentiment and content of posts made in mental health subreddits by individuals who also post in immigration subreddits with those of a matched group that does not. Second, we adopt a case-crossover approach to evaluate the changes in their mental health language before and after their first post in an immigration subreddit. We find that mental health concerns among individuals posting in immigration subreddits center on race, politics, violence, employment, and affordability, whereas among the matched group, mental health posts concern anger and self-harm, family and relationships, swearing, and introspection. We also find that the language of mental health before and after the first post in an immigration subreddit evolves from seeking support and therapy to a more concrete and specific discussion of mental health and a positive outlook toward future goals.

Analyzing Social Media Activities at Bellingcat

Open-source journalism has emerged as a new phenomenon in the media ecosystem: it uses crowdsourcing to fact-check and generate investigative reports on world events using open sources (e.g., social media). A particularly prominent example is Bellingcat, known for its investigations of the illegal use of chemical weapons during the Syrian war, the Russian responsibility for downing flight MH17, the identification of the perpetrators in the attempted murder of Alexei Navalny, and war crimes in the Russo-Ukrainian war. Social media is crucial for this work, both to disseminate findings and to crowdsource fact-checks. In this work, we characterize the social media activities of Bellingcat on Twitter. For this, we built a comprehensive dataset of all N = 24,682 tweets posted by Bellingcat on Twitter since its inception in July 2014. Our analysis is three-fold: (1) We analyze how Bellingcat uses Twitter to disseminate information and to collect information from its follower base. Here, we find a steady increase in both posts and replies over time, particularly during the Russo-Ukrainian war, in line with the growing importance of Bellingcat for the traditional media ecosystem. (2) We identify characteristics of posts that are successful in eliciting user engagement. Engagement is particularly high for posts embedding additional media items and for posts with a more negative sentiment. (3) We examine how the follower base responded to the Russian invasion of Ukraine. Here, we find that sentiment became more polarized and negative, which we attribute to a ∼13-fold increase in bots interacting with the Bellingcat account. Overall, our findings provide recommendations for how open-source journalism such as Bellingcat can operate successfully on social media.

Detecting Symptoms of Depression on Reddit

Depression is known to have heterogeneous symptom manifestations. Investigating various symptoms of depression is essential to understanding underlying mechanisms and personalizing treatments. Reddit, an online peer-to-peer social media platform, contains varied communities (subreddits) where individuals discuss their detailed mental health experiences and seek support. The current paper has two aims. The first is to identify psycho-linguistic and open-vocabulary language markers associated with different symptoms using 1,318,749 posts from 43 subreddit communities (e.g., r/bingeeating) clustered into 13 expert-validated depression symptoms (e.g., disordered eating). The second aim is to develop prediction models based on the above linguistic features and RoBERTa embeddings to detect specific symptom discourse in contrast to control subreddit posts contributed by the same Reddit users. These predictive models are then validated on a second sample of individuals (N = 2,986) who shared their Facebook posts and completed self-report depression (PHQ-9), anxiety (GAD-7), and loneliness (UCLA-3) surveys.

Based on the differential linguistic patterns that emerged across the various symptoms in our data, we identified three potential clusters, which could also be mapped to the Research Domain Criteria (RDoC) framework. RoBERTa embeddings demonstrated the highest accuracy at predicting most symptoms and were particularly robust at predicting the severity of suicidal thoughts and attempts, self-loathing, loneliness, and disordered eating. Our study demonstrates the potential of using large, pseudonymous online forums to train language-based symptom-estimation machine-learning models that can be applied to other text sources. Such technologies could be helpful in clinical psychology, population health, and other areas where early mental health monitoring could improve diagnosis, risk reduction, and treatment.
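The RoBERTa-based prediction models can be sketched as mean-pooled transformer embeddings feeding a linear classifier. The snippet below uses the generic roberta-base checkpoint and toy posts; the paper's actual architecture, pooling choice, and training setup are not specified here.

```python
# Mean-pooled RoBERTa features + logistic regression (illustrative sketch).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state     # (n, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

posts = ["I can't stop binge eating at night", "Had a great hike today"]  # toy
labels = [1, 0]  # 1 = symptom subreddit, 0 = control (toy labels)
clf = LogisticRegression().fit(embed(posts), labels)
```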

SESSION: Fairness and Bias

Tracking Machine Learning Bias Creep in Traditional and Online Lending Systems with Covariance Analysis

Machine Learning (ML) algorithms are embedded within online banking services, proposing decisions about consumers' credit cards, car loans, and mortgages. These algorithms are sometimes biased, resulting in unfair decisions toward certain groups. One common approach for addressing such bias is simply dropping the sensitive attributes from the training data (e.g., gender). However, sensitive attributes can be indirectly represented by other attributes in the data (e.g., maternity leave taken). This paper addresses the problem of identifying attributes that can mimic sensitive attributes by proposing a new approach based on covariance analysis. Our evaluation, conducted on two different credit datasets extracted from a traditional and an online banking institution respectively, shows how our approach: (i) effectively identifies the attributes in the data that encapsulate sensitive information and (ii) leads to the reduction of biases in ML models, while maintaining their overall performance.
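The core idea, flagging attributes whose covariance with a sensitive attribute is high, can be sketched in a few lines of pandas. The column names, input file, and threshold below are hypothetical; the paper's exact procedure may differ.

```python
# Flag potential proxy attributes via (normalized) covariance with gender.
import pandas as pd

df = pd.read_csv("credit_applications.csv")     # hypothetical dataset
sensitive = df["gender"].map({"F": 1, "M": 0})  # numeric encoding

candidates = df.drop(columns=["gender"]).select_dtypes("number")
# Pearson correlation is covariance normalized by standard deviations,
# which makes scores comparable across attributes on different scales.
proxy_scores = candidates.apply(lambda col: col.corr(sensitive)).abs()
print(proxy_scores[proxy_scores > 0.3].sort_values(ascending=False))
```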

The Impact of Data Persistence Bias on Social Media Studies

Social media studies often collect data retrospectively to analyze public opinion. Social media data may decay over time, and such decay may prevent the collection of the complete dataset. As a result, the collected dataset may differ from the complete one, and the study may suffer from data persistence bias. Past research suggests that datasets collected retrospectively are largely representative of the original in terms of textual content. However, no study has analyzed the impact of data persistence bias on social media studies such as those focusing on controversial topics. In this study, we analyze data persistence and the bias it introduces in datasets of three types: controversial topics, trending topics, and framing of issues. We report which topics are more likely to suffer from data persistence among these datasets. We quantify data persistence bias using the change in political orientation and the presence of potentially harmful content and topics as measures. We found that controversial datasets are more likely to suffer from data persistence and that they lean towards the political left upon recollection. The retention of data containing potentially harmful content is significantly lower in non-controversial datasets. Overall, we found that topics promoted by right-aligned users are more likely to suffer from data persistence. Account suspensions are the primary factor contributing to data removals, if not the only one. Our results emphasize the importance of accounting for data persistence bias by collecting data in real time when the dataset employed is vulnerable to it.

Diversity matters: Robustness of bias measurements in Wikidata

With the widespread use of knowledge graphs (KGs) in various automated AI systems and applications, it is very important to ensure that the information retrieval algorithms leveraging them are free from societal biases. Previous works have documented biases that persist in KGs and have employed several metrics for measuring them. However, such studies lack a systematic exploration of how sensitive the bias measurements are to the source of data or to the embedding algorithm used. To address this research gap, we present a holistic analysis of bias measurement on knowledge graphs. First, we reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we examine the variance in bias detection across two different knowledge graph embedding algorithms, TransE and ComplEx. We conduct extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute of gender. Our results show that the inherent data bias that persists in a KG can be altered by the algorithmic bias introduced by KG embedding learning algorithms. Further, we show that the choice of KG embedding algorithm has a strong impact on the ranking of biased occupations, irrespective of gender. In particular, we find that ComplEx is more robust to the choice of demographics than TransE. Subsequently, we observe that the similarity of biased occupations across demographics is minimal, reflecting socio-cultural differences around the globe; this is often overlooked by coarse-grained approaches working at the aggregate level. We believe that this full-scale audit of the bias measurement pipeline will raise awareness in the community, yield insights into the design choices of both data and algorithms, and caution against the popular dogma of "one-size-fits-all".
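One common way to probe such embeddings, shown here only for illustration (it is not necessarily the paper's exact metric), compares how plausible the model finds an occupation for different genders under TransE's scoring function f(h, r, t) = -||h + r - t||.

```python
# Gender-bias probe on TransE embeddings (illustrative, with random vectors).
import numpy as np

def transe_score(h, r, t):
    # TransE plausibility: higher (closer to 0) means more plausible.
    return -np.linalg.norm(h + r - t)

def occupation_bias(male_vec, female_vec, occupation_rel, occ_vec):
    """Positive: occupation scored as more plausible for the male entity."""
    return (transe_score(male_vec, occupation_rel, occ_vec)
            - transe_score(female_vec, occupation_rel, occ_vec))

rng = np.random.default_rng(0)  # stand-ins for trained embedding vectors
male, female, rel, occ = (rng.normal(size=50) for _ in range(4))
print(occupation_bias(male, female, rel, occ))
```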

Fair Link Prediction with Multi-Armed Bandit Algorithms

Recommendation systems have been used in many domains, and in recent years, ethical problems associated with such systems have gained serious attention. The problem of unfairness in friendship or link recommendation systems in social networks has begun attracting attention, as such unfairness can cause problems like segmentation and echo chambers. One challenge in this problem is that there are many fairness metrics for networks, and existing methods only consider the improvement of a single specific fairness indicator [16, 17, 20].

In this work, we model the fair link prediction problem as a multi-armed bandit problem. We propose FairLink, a multi-armed bandit-based framework that predicts new edges that are both accurate and well-behaved with respect to a fairness property of choice. This method allows the user to specify the desired fairness metric. Experiments on five real-world datasets show that FairLink achieves a significant fairness improvement compared to a standard recommendation algorithm, with only a small reduction in accuracy.
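To make the bandit framing concrete, here is a generic epsilon-greedy sketch, explicitly not FairLink's actual algorithm: each arm is a candidate link-prediction strategy, and the reward blends accuracy with the user-chosen fairness metric.

```python
# Generic epsilon-greedy bandit over link-prediction strategies (toy sketch).
import random

def epsilon_greedy(arms, reward_fn, rounds=1000, eps=0.1):
    counts = {a: 0 for a in arms}
    values = {a: 0.0 for a in arms}  # running mean reward per arm
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.choice(arms)                  # explore
        else:
            arm = max(arms, key=lambda a: values[a])   # exploit
        r = reward_fn(arm)  # e.g. accuracy - lambda * unfairness
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return max(arms, key=lambda a: values[a])

# Toy usage: three hypothetical strategies with noisy expected rewards.
mean_reward = {"adamic_adar": 0.60, "jaccard": 0.55, "fair_reweighted": 0.70}
best = epsilon_greedy(list(mean_reward),
                      lambda a: mean_reward[a] + random.gauss(0, 0.1))
print(best)
```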

Monitoring Gender Gaps via LinkedIn Advertising Estimates: the case study of Italy

Women remain underrepresented in the labour market. Although significant advancements are being made to increase female participation in the workforce, the gender gap is still far from bridged. We contribute to the growing literature on gender inequalities in the labour market by evaluating the potential of LinkedIn estimates to monitor the evolution of gender gaps sustainably, complementing official data sources; in particular, we assess labour market patterns at a subnational level in Italy. Our findings show that the LinkedIn estimates accurately capture gender disparities in Italy with respect to sociodemographic attributes such as age, geographic location, seniority, and industry category. At the same time, we assess data biases such as the digitalisation gap, which affects the representativity of the workforce in an imbalanced manner, confirming that women are under-represented in Southern Italy. Beyond confirming the gender disparities found in official census data, LinkedIn estimates are a valuable tool for dynamic insights: we show a migration flow of highly skilled women, predominantly from the South. Digital surveillance of gender inequalities with detailed and timely data is particularly significant in enabling policymakers to tailor impactful campaigns.

Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions

This publication describes the motivation for and generation of Qbias, a large dataset of Google and Bing search queries; a scraping tool and dataset for biased news articles; and language models for the investigation of bias in online search. Web search engines are a major factor in, and trusted source for, information search, especially in the political domain. However, biased information can influence opinion formation and lead to biased opinions. To interact with search engines, users formulate search queries and interact with the search-query suggestions provided by the engines. A lack of datasets on search queries inhibits research on the subject. We use Qbias to evaluate different approaches to fine-tuning transformer-based language models with the goal of producing models capable of biasing text with left and right political stances. In addition to this work, we provide datasets and language models for biasing texts, enabling further research on bias in online information search.

SESSION: Harmful and Problematic Behavior

Transfer Learning for Multilingual Abusive Meme Detection

The exponential growth of social media platforms has allowed people to connect worldwide. However, it has also fueled a rise in harmful and abusive content on the Internet. Repeated exposure to abusive content may have psychological effects on the targeted users. It is thus necessary to detect such abusive content in all its forms to keep these platforms safe and healthy. Several works address abusive speech detection, but most are text-based. Yet social media content is often multimodal, comprising text, images, videos, etc. Internet memes have recently emerged as a predominant mode of content shared on social media and are used to express vitriol or harm toward others; hence it is essential to detect abusive memes as well. Although several works address abusive/harmful meme detection, most are in English, with only a very few extending to non-English datasets. One immediate solution, therefore, is to detect abusive memes in one language and transfer that capability to other languages. This work explores several model transfer techniques to bridge the gap, creating various baseline models.

Understanding Online Migration Decisions Following the Banning of Radical Communities

The proliferation of radical online communities and their violent offshoots has sparked great societal concern. However, the current practice of banning such communities from mainstream platforms has unintended consequences: (i) the further radicalization of their members on the fringe platforms to which they migrate; and (ii) the spillover of harmful content from fringe platforms back onto mainstream ones. Here, in a large observational study of two banned subreddits, r/The_Donald and r/fatpeoplehate, we examine how factors associated with the RECRO radicalization framework relate to users' migration decisions. Specifically, we quantify how these factors affect users' decisions to post on fringe platforms and, for those who do, whether they continue posting on the mainstream platform. Our results show that individual-level factors, those relating to users' behavior, are associated with the decision to post on the fringe platform, whereas social-level factors, users' connections with the radical community, only affect the propensity to be coactive on both platforms. Overall, our findings pave the way for evidence-based moderation policies, as the decisions to migrate and to remain coactive amplify the unintended consequences of community bans.

Social Media as a Vector for Escort Ads: A Study on OnlyFans advertisements on Twitter

Online sex trafficking is on the rise and a majority of trafficking victims report being advertised online. The use of OnlyFans as a platform for adult content is also increasing, with Twitter as its main advertising tool. Furthermore, we know that traffickers usually work within a network and control multiple victims. Consequently, we suspect that there may be networks of traffickers promoting multiple OnlyFans accounts belonging to their victims. To this end, we present the first study of OnlyFans advertisements on Twitter in the context of finding organized activities.

Preliminary analysis of this space shows that most tweets related to OnlyFans contain generic text, making text-based methods less reliable. Instead, focusing on what ties the authors of these tweets together, we propose a novel method for uncovering coordinated networks of users based on their behaviour. Our method, called Multi-Level Clustering (MLC), combines two levels of clustering that consider both the network structure and embedded node attribute information, focusing jointly on user connections (through mentions) and content (through shared URLs). We apply MLC to real-world data of 2 million tweets pertaining to OnlyFans and analyse the detected groups. We also evaluate our method on synthetically generated data (with injected ground truth) and show its superior performance compared to competitive baselines. Finally, we discuss examples of organized clusters as case studies and draw conclusions from them.
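The behavioural signals MLC builds on, mention ties and co-shared URLs, can be illustrated with a simple single-level sketch: connect accounts through both signals and extract dense communities. This is only a flattened illustration of the idea; MLC's actual two-level procedure is more involved, and the tweet schema below is hypothetical.

```python
# Build a user graph from mentions and co-shared URLs, then cluster it.
import networkx as nx

tweets = [  # toy records with a hypothetical schema
    {"author": "a1", "mentions": ["a2"], "urls": ["onlyfans.com/x"]},
    {"author": "a3", "mentions": [], "urls": ["onlyfans.com/x"]},
]

G = nx.Graph()
url_to_authors: dict[str, list[str]] = {}
for t in tweets:
    for mentioned in t["mentions"]:
        G.add_edge(t["author"], mentioned, kind="mention")
    for url in t["urls"]:
        for other in url_to_authors.setdefault(url, []):
            G.add_edge(t["author"], other, kind="shared_url")
        url_to_authors[url].append(t["author"])

# Dense communities are candidate coordinated groups.
print(nx.community.louvain_communities(G, seed=0))
```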

Understanding Misogynoir: A Study of Annotators’ Perspectives

"Misogynoir" is the anti-Black racist misogyny experienced by Black women, characterised by components of both racism and sexism. Misogynoir is challenging to detect due to its inherent subjectivity and intersectional nature, and people's opinions and interpretations of such hate can vary, which adds to the challenge of understanding it. In this paper, we explore how, and potentially why, different annotator characteristics influence how annotators interpret and label a dataset for potential cases of Misogynoir and Allyship. We sampled tweets containing public responses to self-reported misogynoir cases by four prominent Black women in technology, designed an online annotation study, and recruited annotators of diverse ethnicities and genders from the Prolific crowdsourcing platform. We found that the sources of evidence participants use to judge and interpret content for potential cases of Misogynoir and Allyship, even in circumstances where they all agree on a prospective label, vary with factors such as ethnicity, lived experience, and gender. In addition, we present a variety of plausible interpretations influenced by these annotator characteristics. This study demonstrates the relevance of different annotator perspectives and content comprehension in hate speech annotation and the need for further efforts to better understand intersectional hate.

From Yellow Peril to Model Minority: Asian stereotypes in social media during the COVID-19 pandemic

Heightened racial tensions during the COVID-19 pandemic contributed to the increase and rapid propagation of online hate speech towards Asians. In this work, we study the relationship between the racist narratives and conspiracy theories that emerged related to COVID-19 and historical stereotypes underpinning Asian hate and counter-hate speech on Twitter, in particular the Yellow Peril and model minority tropes. We find that the pandemic catalyzed a broad increase in discourse engaging with racist stereotypes extending beyond COVID-19 specifically. We also find that racist narratives and conspiracy theories which emerged during the pandemic and gained widespread attention were rooted in deeply-embedded Asian stereotypes. In alignment with theories of idea habitat and processing fluency, our work suggests that historical stereotypes provided an environment vulnerable to the racist narratives and conspiracy theories which emerged during the pandemic. Our work offers insight for ongoing and future anti-racist efforts.

A longitudinal study of the top 1% toxic Twitter profiles

Toxicity is endemic to online social networks (OSNs), including Twitter. It follows a Pareto-like distribution in which most of the toxicity is generated by a very small number of profiles; analyzing and characterizing these "toxic profiles" is therefore critical. Prior research has largely focused on sporadic, event-centric toxic content (i.e., tweets) to characterize toxicity on the platform. Instead, we approach the problem of characterizing toxic content from a profile-centric point of view. We study 143K Twitter profiles and focus on the behavior of the top 1% producers of toxic content on Twitter, based on the toxicity scores of their tweets provided by the Perspective API. With a total of 293M tweets spanning 16 years of activity, the longitudinal data allows us to reconstruct the timelines of all profiles involved. We use these timelines to compare the behavior of the most toxic Twitter profiles to that of the rest of the Twitter population. We study the posting patterns of highly toxic accounts, based on how frequent and prolific they are, the nature of their hashtags and URLs, profile metadata, and Botometer scores. We find that highly toxic profiles post coherent and well-articulated content; their tweets keep to a narrow theme with lower diversity in hashtags, URLs, and domains; they are thematically similar to each other; and they have a high likelihood of bot-like behavior, likely with progenitors intending to influence, based on high fake-follower scores. Our work contributes insight into the top 1% of toxic profiles on Twitter and establishes the benefit of a profile-centric approach to investigating toxicity. Identifying the most toxic profiles can aid the reporting and suspension of such profiles, making Twitter a better place for discussion. Finally, we contribute to the research community a large-scale, longitudinal dataset annotated with six types of toxicity scores.
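The profile-centric selection itself is straightforward: average the per-tweet toxicity scores by author and keep everyone above the 99th percentile. A minimal sketch with toy scores:

```python
# Select the top 1% most toxic profiles by mean per-tweet toxicity.
import pandas as pd

tweets = pd.DataFrame({  # toy data; real scores come from the Perspective API
    "author":   ["u1", "u1", "u2", "u3", "u4"],
    "toxicity": [0.9,  0.8,  0.1,  0.3,  0.2],
})

per_profile = tweets.groupby("author")["toxicity"].mean()
cutoff = per_profile.quantile(0.99)
top_profiles = per_profile[per_profile >= cutoff].sort_values(ascending=False)
print(f"cutoff={cutoff:.2f}\n{top_profiles}")
```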

SESSION: Online Communities and Digital Analytics

Popular, but hardly used: Has Google Analytics been to the detriment of Web Analytics?

Since 2005, Google has offered a free version of Google Analytics, allowing website owners to access detailed user behavior data. However, while more and more features and tools have been added to the Google measurement suite since then, it is unclear whether the free availability of these tools has really helped users derive actionable insights for their websites. Earlier studies based on a small number of interviews suggested that users tend to play with the tools because they lack data literacy, but a broader analysis has been missing until now. Our contribution is a large-scale study of Google Analytics implementations that examines which advanced features are used, allowing conclusions to be drawn about webmasters' analysis capabilities. In addition, we detail how difficult it has become to conduct such a study, both due to the arrangements that website owners have to put in place to comply with GDPR requirements and due to the possibility of obfuscation in the latest generation of web analytics software.

What Makes Some Workplaces More Favorable to Remote Work? Unpacking Employee Experiences During COVID-19 Via Glassdoor

The COVID-19 pandemic has altered the working culture at many organizations; what began as a public health safety measure, remote work, is continuing to reshape work in America and beyond. However, remote work has fared differently for different workers and organizations, contributing to better work-life balance for some and increased burnout for others. What aspects of an organization's culture make it less or more favorable to remote work? We answer this question by creating, analyzing, and subsequently releasing a large dataset of employee reviews shared anonymously on Glassdoor. Adopting a worker-centered approach grounded in organizational culture theory, we extract the organizational cultural factors salient in the language of employee reviews of 52 Fortune 500 companies. Through a prediction task, we identify what distinguishes companies perceived as desirable for remote work, as noted in company rankings following the pandemic, from the rest. Our dataset and findings can serve as a valuable evidence base and resource for efforts to define a new future of work post-pandemic.

What Web Search Behaviors Lead to Online Purchase Satisfaction?

This study investigates the relationship between web information-seeking behavior and post-purchase satisfaction. We examine web search logs as a record of web information-seeking behavior and product ratings on an e-commerce (EC) site as self-rated post-purchase satisfaction. Our analysis revealed that web search behaviors differ between satisfied and dissatisfied customers, and even across types of customers and products. In particular, we found that (1) within a week prior to purchase, satisfied users searched more frequently for a wider range of product-related information than dissatisfied users; (2) satisfied users searched with more specific queries than dissatisfied users prior to purchase, especially when they were relatively familiar with web search and were purchasing a relatively expensive product; and (3) customers looking for the opinions of others were more likely to be satisfied with their purchase. Furthermore, we addressed the problem of predicting customers' post-purchase satisfaction from their web search behaviors before and after purchase. This attempt demonstrated that customers are likely to have different levels of post-purchase satisfaction if they conducted different types of web searches.

Who Broke Amazon Mechanical Turk?: An Analysis of Crowdsourcing Data Quality over Time

We present the results of a survey fielded in June 2022 as a lens through which to examine recent data reliability issues on Amazon Mechanical Turk. We contrast bad data from this survey with bad data from the same survey fielded among US workers in October 2013, April 2018, and February 2019. Application of an established data cleaning scheme reveals that unusable data has risen from a little over 2% in 2013 to almost 90% in 2022. Through symptomatic diagnosis, we attribute the drop in data reliability not to an increase in bad-faith work, but rather to a continuum of English proficiency levels. A qualitative analysis of workers' responses to open-ended questions allows us to distinguish between low-fluency workers, ultra-low-fluency workers, satisficers, and bad-faith workers. We go on to show the effects of the new low-fluency work on Likert-scale data and on the study's qualitative results. Attention checks are shown to be much less effective than they once were at identifying survey responses that should be discarded.

Follow Us and Become Famous! Insights and Guidelines From Instagram Engagement Mechanisms

With 1.3 billion users, Instagram (IG) has become an essential business tool. IG influencer marketing, expected to generate $33.25 billion in 2022, encourages companies and influencers to create trending content. Various methods have been proposed for predicting a post's popularity, i.e., how much engagement (e.g., Likes) it will generate. However, these methods are limited: first, they focus on forecasting Likes, ignoring the number of comments, which became crucial in 2021. Second, studies often use biased or limited data. Third, researchers have focused on Deep Learning models to increase predictive performance, but such models are difficult to interpret. As a result, end-users can only estimate engagement after a post is created, which is inefficient and expensive. A better approach is to generate a post based on what people and IG like, e.g., by following guidelines.

In this work, we uncover part of the underlying mechanisms driving IG engagement. To achieve this goal, we rely on statistical analysis and interpretable models rather than Deep Learning (black-box) approaches. Leveraging innovative domain-relevant features, we first build classifiers to predict posts' engagement. Then, we interpret the best models to determine which types of content generate the most engagement, maximizing influencers' and companies' profits. We conduct extensive experiments using a worldwide dataset of 10 million posts created by 34K global influencers in nine different categories. Our simple yet powerful algorithms effectively predict engagement, performing comparably to and even better than Deep Learning-based methods, reaching up to a 94% F1-score. Furthermore, we propose a novel unsupervised algorithm for finding highly engaging topics on IG. Thanks to our interpretable approaches, we conclude by outlining guidelines for creating successful posts.

Link Topics from Q&A Platforms using Wikidata: A Tool for Cross-platform Hierarchical Classification

This paper proposes a novel rule-based topic classification tool for questions on Q&A platforms, mediated by the Wikidata ontology, an open and accessible multilingual ontology curated by a large community of online users. Q&A platforms are important sources of information on the Web and often appear in Web search results. By adopting Wikidata taxonomic relations as references, our tool can categorize Web content from different platforms in a unified coarse-to-fine mode based on its domain coverage. To validate and demonstrate the potential applicability of our tool, we carry out a set of use cases and experiments on two popular Q&A platforms, Zhihu and Quora, exploring the impact of topic categories on question lifecycles. Furthermore, we compare our results with the output generated by a GPT-3 classifier. This tool sheds light on how structured knowledge bases can enable data interoperability and serve as a filtering functionality to mitigate the classification bias of OpenAI models.
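As an illustration of how Wikidata can mediate such a classification, the sketch below walks a topic's subclass-of (P279) ancestors via the public SPARQL endpoint. Mapping ancestors to a coarse target taxonomy (e.g., treating Q11016, "technology", as a bucket root) is our assumption, not the tool's actual rule set.

```python
# Fetch all P279 ("subclass of") ancestors of a Wikidata item (sketch).
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def superclasses(qid: str) -> set[str]:
    query = f"SELECT ?a WHERE {{ wd:{qid} wdt:P279+ ?a . }}"
    resp = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "topic-linker-demo/0.1"},  # required by WDQS
    )
    resp.raise_for_status()
    return {b["a"]["value"].rsplit("/", 1)[-1]
            for b in resp.json()["results"]["bindings"]}

topic_qid = "Q11660"  # hypothetical topic item (artificial intelligence)
# Coarse bucketing: a topic is "technology" if Q11016 is among its ancestors.
print("technology bucket?", "Q11016" in superclasses(topic_qid))
```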