WebSci '20: 12th ACM Conference on Web Science


Still out there: Modeling and Identifying Russian Troll Accounts on Twitter

There is evidence that Russia’s Internet Research Agency attempted to interfere with the 2016 U.S. election by running fake accounts on Twitter—often referred to as “Russian trolls”. In this work, we: 1) develop machine learning models that predict whether a Twitter account is a Russian troll within a set of 170K control accounts; and, 2) demonstrate that it is possible to use this model to find active accounts on Twitter still likely acting on behalf of the Russian state. Using both behavioral and linguistic features, we show that it is possible to distinguish between a troll and a non-troll with a precision of 78.5% and an AUC of 98.9%, under cross-validation. Applying the model to out-of-sample accounts still active today, we find that up to 2.6% of top journalists’ mentions are occupied by Russian trolls. These findings imply that the Russian trolls are very likely still active today. Additional analysis shows that they are not merely software-controlled bots, and manage their online identities in various complex ways. Finally, we argue that if it is possible to discover these accounts using externally-accessible data, then the platforms—with access to a variety of private internal signals—should succeed at similar or better rates.
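
As a rough illustration of the evaluation protocol described above, the following sketch assumes scikit-learn, a placeholder matrix of behavioral and linguistic features, and an arbitrary classifier choice; it is not the authors' exact pipeline.

```python
# Hypothetical sketch: cross-validated troll-vs-control classification.
# Feature construction and model choice are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))    # behavioral + linguistic features per account
y = rng.integers(0, 2, size=1000)  # 1 = known troll, 0 = control account

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_validate(clf, X, y, cv=5, scoring=["precision", "roc_auc"])
print("precision:", scores["test_precision"].mean())
print("AUC:", scores["test_roc_auc"].mean())
```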

DeepHate: Hate Speech Detection via Multi-Faceted Text Representations

Online hate speech is an important issue that breaks the cohesiveness of online social communities and even raises public safety concerns in our societies. Motivated by this rising issue, researchers have developed many traditional machine learning and deep learning methods to detect hate speech in online social platforms automatically. However, most of these methods consider only a single type of textual feature, e.g., term frequency or word embeddings. Such approaches neglect other rich textual information that could be utilized to improve hate speech detection. In this paper, we propose DeepHate, a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information to detect hate speech in online social platforms. We conduct extensive experiments and evaluate DeepHate on three large publicly available real-world datasets. Our experiment results show that DeepHate outperforms the state-of-the-art baselines on the hate speech detection task. We also perform case studies to provide insights into the salient features that best aid in detecting hate speech in online social platforms.
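
A minimal sketch of the general idea of fusing multiple text representations into one classifier; the branch dimensions and architecture below are illustrative assumptions, not the DeepHate architecture itself.

```python
# Illustrative sketch (not the DeepHate architecture): fuse word-embedding,
# sentiment, and topic representations of a post into one classifier.
import torch
import torch.nn as nn

class MultiFacetedClassifier(nn.Module):
    def __init__(self, emb_dim=300, sent_dim=3, topic_dim=50, hidden=64, n_classes=2):
        super().__init__()
        self.emb_branch = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.sent_branch = nn.Sequential(nn.Linear(sent_dim, hidden), nn.ReLU())
        self.topic_branch = nn.Sequential(nn.Linear(topic_dim, hidden), nn.ReLU())
        self.head = nn.Linear(3 * hidden, n_classes)

    def forward(self, emb, sent, topic):
        fused = torch.cat(
            [self.emb_branch(emb), self.sent_branch(sent), self.topic_branch(topic)], dim=-1
        )
        return self.head(fused)

model = MultiFacetedClassifier()
logits = model(torch.randn(8, 300), torch.randn(8, 3), torch.randn(8, 50))
print(logits.shape)  # torch.Size([8, 2])
```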

How India Censors the Web

One of the primary ways in which India engages in online censorship is by ordering Internet Service Providers (ISPs) operating in its jurisdiction to block access to certain websites for its users. This paper reports the different techniques Indian ISPs are using to censor websites, and investigates whether website blocklists are consistent across ISPs. We propose a suite of tests that prove more robust than previous work in detecting DNS and HTTP based censorship. Our tests also discern the use of SNI inspection for blocking websites, which is previously undocumented in the Indian context. Using information from court orders, user reports and government orders, we compile the largest known list of potentially blocked websites in India. We pass this list to our tests and run them from connections of six different ISPs, which together serve more than 98% of Internet users in India. Our findings not only confirm that ISPs are using different techniques to block websites, but also demonstrate that different ISPs are not blocking the same websites.
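
One of the simpler checks described, DNS-based blocking detection, can be sketched as comparing answers from the ISP's default resolver with a trusted public resolver; the domain and resolver address below are placeholders, and the sketch omits the HTTP and SNI tests.

```python
# Hedged sketch of a DNS tampering check: compare the ISP's default resolver
# with a public resolver. Domain and resolver addresses are placeholders.
import dns.exception
import dns.resolver

def resolve_a_records(domain, nameserver=None):
    resolver = dns.resolver.Resolver()
    if nameserver:
        resolver.nameservers = [nameserver]
    try:
        return sorted(r.to_text() for r in resolver.resolve(domain, "A"))
    except dns.exception.DNSException:
        return []

domain = "example.com"                                  # placeholder for a potentially blocked site
isp_answer = resolve_a_records(domain)                  # ISP-configured resolver
control_answer = resolve_a_records(domain, "8.8.8.8")   # public resolver as control
if isp_answer != control_answer:
    print(f"Possible DNS tampering for {domain}: {isp_answer} vs {control_answer}")
```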

Semi-Supervised Granular Classification Framework for Resource Constrained Short-texts: Towards Retrieving Situational Information During Disaster Events

During disasters, large volumes of short texts are generated that contain crucial situational information. Proper extraction and identification of situational information can be useful for various rescue and relief operations. A few specific types of infrequent situational information can be critical. However, obtaining labels for those resource-constrained classes is challenging as well as expensive, so supervised methods have limited usability in such scenarios. To overcome this challenge, we propose a semi-supervised learning framework which utilizes abundantly available unlabelled data through self-learning. The proposed framework improves the performance of the classifier for resource-constrained classes by selectively incorporating highly confident samples from the unlabelled data for self-learning. Incremental incorporation of unlabelled data, as and when it becomes available, is suitable for ongoing disaster mitigation. Experiments on three disaster-related datasets show that this improvement results in an overall performance increase over a standard supervised approach.
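
A minimal self-training sketch in the spirit of the framework described above, assuming scikit-learn; the confidence threshold, model, and data are illustrative placeholders.

```python
# Sketch of self-training: repeatedly add high-confidence pseudo-labelled
# samples from the unlabelled pool (thresholds and model are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.95, rounds=5):
    X, y = X_labeled.copy(), y_labeled.copy()
    pool = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        clf.fit(X, y)
        proba = clf.predict_proba(pool)
        confident = proba.max(axis=1) >= confidence      # keep only highly confident samples
        if not confident.any():
            break
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, pseudo])
        pool = pool[~confident]
    return clf

rng = np.random.default_rng(0)
Xl, yl = rng.normal(size=(50, 10)), rng.integers(0, 2, 50)   # small labelled set
Xu = rng.normal(size=(500, 10))                              # abundant unlabelled data
model = self_train(Xl, yl, Xu)
```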

Phans, Stans and Cishets: Self-Presentation Effects on Content Propagation in Tumblr

Research on content propagation in social media has largely focused on features from the content of posts and the network structure of users. However, social media platforms are also spaces where users present their identities in particular ways. How do the ways users present themselves affect how content they produce is propagated? In this paper, we address this question with an empirical study of interaction and self-presentation data from Tumblr. We use a pairwise learning-to-rank framework to predict whether a given user will reblog (share) another user’s post from features comparing self-presented textual and visual identity information. We find evidence that alignment in identity presentation is associated with content propagation, as these features increase performance over a baseline of content features. Interpreting learned feature weights on self-presented text identity labels, we find that users who present labels that match or indicate shared interests and values are generally more likely to propagate each other’s content.

Roots of Trumpism: Homophily and Social Feedback in Donald Trump Support on Reddit

We study the emergence of support for Donald Trump in Reddit’s political discussion. With almost 800k subscribers, “r/The_Donald” is one of the largest communities on Reddit, and one of the main hubs for Trump supporters. It was created in 2015, shortly after Donald Trump began his presidential campaign. By using only data from 2012, we predict the likelihood of being a supporter of Donald Trump in 2016, the year of the last US presidential elections. To characterize the behavior of Trump supporters, we draw from three different sociological hypotheses: homophily, social influence, and social feedback. We operationalize each hypothesis as a set of features for each user, and train classifiers to predict their participation in r/The_Donald.

We find that homophily-based and social feedback-based features are the most predictive signals. Conversely, we do not observe a strong impact of social influence mechanisms. We also perform an introspection of the best-performing model to build a “persona” of the typical supporter of Donald Trump on Reddit. We find evidence that the most prominent traits include a predominance of masculine interests, a conservative and libertarian political leaning, and links with politically incorrect and conspiratorial content.

Blogger or President? Exploitation of Patterns in Entity Type Graphs for Representative Entity Type Classification

Thirty years of the Web have led to a tremendous amount of content. While content from the early years consisted predominantly of “simple” HTML documents, more recent content has become more and more “machine-interpretable”. Named entities - ideally explicitly and intentionally annotated - pave the way toward a semantic exploration and exploitation of the data. While this appears to be the golden sky toward a more human-centric Web, it is not necessarily so. The key point is simple: “the more the merrier” does not necessarily hold along all dimensions. For instance, each and every named entity provides, via the Web of Data, a plenitude of information that can potentially overwhelm the end-user. In particular, named entities are predominantly annotated with multiple types without any associated order of importance. In order to depict the most concise type information, we introduce an approach towards Pattern Utilization for Representative Entity type classification called PURE. To this end, PURE exploits solely structural patterns derived from knowledge graphs in order to “purify” the most representative type(s) associated with a named entity. Our experiments with named entities in Wikipedia demonstrate the viability of our approach and its improvement over competing strategies.

International Scientific Collaboration in Artificial Intelligence: An Analysis based on Web Data

In this era of interdisciplinary science, Web science and artificial intelligence (AI) have brought dramatic revolutions to human society. The increasing availability of structured and unstructured information on the Web has become a critical research resource and offers unprecedented opportunities to explore the science of science (SciSci). SciSci uses scientific techniques to study past, current, and evolving scientific discovery. Although many significant SciSci works study the patterns of international scientific collaboration, the relevant knowledge about AI is sorely lacking. In this work, we study the evolution of international scientific collaboration patterns in the AI field. By graphing the entities and relationships in the international collaboration pattern, we conduct multiple multidimensional statistical analyses from the perspectives of institutions and countries.

What a Tangled Web We Weave: Understanding the Interconnectedness of the Third Party Cookie Ecosystem

When users browse to a so-called “First Party” website, other third parties are able to place cookies on the users’ browsers. Although this practice can enable some important use cases, in practice these third party cookies also allow trackers to identify that a user has visited two or more first parties that both embed the same third party. This simple feature has been used to bootstrap an extensive tracking ecosystem that can severely compromise user privacy.

In this paper, we develop a metric called “tangle factor” that measures how a set of first party websites may be interconnected or tangled with each other based on the common third parties used. Our insight is that the interconnectedness can be calculated as the chromatic number of a graph where the first party sites are the nodes, and edges are induced based on shared third parties.
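
The tangle factor construction can be sketched as follows; since computing the exact chromatic number is NP-hard, the sketch uses NetworkX's greedy coloring as a practical upper bound, which is an assumption rather than the paper's exact procedure.

```python
# Sketch of the "tangle factor": build a graph of first-party sites, connect
# two sites if they share a third party, and colour the graph.
from itertools import combinations
import networkx as nx

# toy mapping: first-party site -> set of third parties observed on it
third_parties = {
    "news.example": {"trackerA", "trackerB"},
    "shop.example": {"trackerB", "trackerC"},
    "blog.example": {"trackerD"},
}

G = nx.Graph()
G.add_nodes_from(third_parties)
for a, b in combinations(third_parties, 2):
    if third_parties[a] & third_parties[b]:      # shared third party induces an edge
        G.add_edge(a, b)

coloring = nx.greedy_color(G, strategy="largest_first")
tangle_factor = len(set(coloring.values()))      # upper bound on the chromatic number
print("tangle factor (upper bound):", tangle_factor)
```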

We use this technique to measure the interconnectedness of the browsing patterns of over 100 users in 25 different countries, through a Chrome browser plugin which we have deployed. The users of our plugin consist of a small, carefully selected set of 15 test users in the UK and China, and 1000+ in-the-wild users, of whom 124 have shared data with us. We show that different countries have different levels of interconnectedness; for example, China has a lower tangle factor than the UK. We also show that when visiting the same sets of websites from China, the tangle factor is smaller, due to the blocking of major operators like Google and Facebook.

We show that selectively removing the largest trackers is a very effective way of decreasing the interconnectedness of third party websites. We then consider blocking practices employed by privacy-conscious users (such as ad blockers) as well as those enabled by default by Chrome and Firefox, and compare their effectiveness using the tangle factor metric we have defined. Our results help quantify for the first time the extent to which one ad blocker is more effective than others, and how Firefox defaults also greatly help decrease third party tracking compared to Chrome.

Russian trolls speaking Russian: Regional Twitter operations and MH17

The role of social media in promoting media pluralism was initially viewed as wholly positive as social media could break the oligopoly of (often state-owned) mainstream media. However, some governments are allegedly manipulating social media by hiring online commentators (also known as trolls) to spread propaganda and disinformation. In particular, an alleged system of professional trolls operating both domestically and internationally exists in Russia.

To improve transparency about trolls’ influence on social media, in 2018 Twitter released longitudinal data on accounts identified as Russian trolls and their tweets, starting a wave of quantitative research on Russian trolls. However, while the foreign-targeted English-language operations of these trolls have received significant attention, no research has analyzed their Russian-language domestic and regional-targeted activities. This is despite the fact that half of the tweets in the Twitter-released data are in Russian. We address this gap by characterizing the Russian-language operations of Russian trolls using the Twitter data. We first take a broad view with a descriptive and temporal analysis, and then focus on the trolls’ operation related to the crash of Malaysia Airlines flight MH17, one of the deadliest incidents in the conflict in Ukraine.

Among other things, we find that Russian-language trolls ran 163 hashtag campaigns (where the use of a hashtag grows abruptly within one month). The main political sentiments of such campaigns are praising Russia and Putin (29%), criticizing Ukraine (26%), and criticizing the United States (US) along with Obama (9%). Further, we discovered that trolls actively reshared information: 76% of tweets were retweets or contained a URL. In particular, trolls often redistributed news from mainstream media. Additionally, we observe periodic temporal patterns of tweet arrival, with three distinct periods that change over time, suggesting that trolls use automation tools for posting. Further, we find that the trolls’ information campaign on the MH17 crash was the largest in terms of tweet count. However, around 68% of tweets posted with MH17 hashtags were likely used simply for hashtag amplification. With these tweets excluded, about 49% of the tweets suggested, to varying degrees, that Ukraine was responsible for the crash, and only 13% contained disinformation and propaganda presented as news. Interestingly, trolls promoted inconsistent alternative theories for the incident: half of the false news tweets suggested that Ukraine downed the plane with an air-to-air missile, whereas 23% promoted the ground-to-air missile version.
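
The hashtag-campaign heuristic mentioned above (abrupt growth of a hashtag within one month) could be approximated with a simple monthly aggregation; the data, hashtag, and growth threshold below are placeholders.

```python
# Illustrative pandas sketch: flag hashtags whose monthly usage grows abruptly.
# Toy data, placeholder hashtag, and an assumed growth threshold.
import pandas as pd

tweets = pd.DataFrame({
    "created_at": pd.to_datetime(["2014-06-15"] + [f"2014-07-{d:02d}" for d in range(1, 11)]),
    "hashtag": ["#example_campaign"] * 11,
})

monthly = (tweets.set_index("created_at")
                 .groupby("hashtag")
                 .resample("MS").size())          # tweets per hashtag per month

for hashtag, counts in monthly.groupby(level=0):
    series = counts.droplevel(0)
    growth = series / series.shift(1).clip(lower=1)
    spikes = growth >= 10                         # "abrupt" month-over-month growth (assumed)
    if spikes.any():
        print(f"{hashtag}: possible campaign months", list(growth.index[spikes]))
```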

Constructive Approach for Early Extraction of Viral Spreading Social Issues from Twitter

In recent years, there has been a rapid increase in viral spreading social issues, which emerge in the public consciousness through the fast spread of information online. With the advent of social media, they sometimes yield unexpected side effects such as product boycotts. Therefore, it is important to recognize them as early as possible and take preventive measures against them. Existing research on social issue extraction has mainly focused on news channels and newspapers as the primary information sources. However, such approaches cannot be applied to the early extraction of viral spreading social issues because their epicenter is online public opinion. In this study, we propose a constructive method, inspired by a social issues research approach called constructivism, for the early extraction of viral spreading social issues. A distinctive feature of our method is that it identifies keywords related to social issues using information obtained from claims-making activities on Twitter and Twitter-user clustering. We conducted experiments on a large Twitter dataset comprising tens of billions of tweets, and the proposed method successfully extracted six out of the seven viral spreading social issues earlier than their first TV news coverage. Furthermore, the proposed method identified such cases approximately two weeks earlier, on average, than the first national TV news coverage.

Gender Classification and Bias Mitigation in Facial Images

Gender classification algorithms have important applications in many domains today, such as demographic research, law enforcement, and human-computer interaction. Recent research showed that algorithms trained on biased benchmark databases could result in algorithmic bias. However, to date, little research has been carried out on gender classification algorithms’ bias towards gender minority subgroups, such as the LGBTQ and the non-binary population, who have distinct characteristics in gender expression. In this paper, we began by conducting surveys on existing benchmark databases for facial recognition and gender classification tasks. We discovered that the current benchmark databases lack representation of gender minority subgroups. We worked on extending the current binary gender classifier to include a non-binary gender class. We did that by assembling two new facial image databases: 1) a racially balanced inclusive database with a subset of the LGBTQ population, and 2) an inclusive-gender database that consists of people with non-binary gender. We worked to increase classification accuracy and mitigate algorithmic biases on our baseline model trained on the augmented benchmark database. Our ensemble model achieved an overall accuracy score of 90.39%, which is a 38.72% increase over the baseline binary gender classifier trained on Adience. While this is an initial attempt towards mitigating bias in gender classification, more work is needed in modeling gender as a continuum by assembling more inclusive databases.

ACT : Automatic Fake News Classification Through Self-Attention

Automatic detection of fake news is an important issue given the disproportionate effect of fake news on democratic processes, individuals, and institutions. Research on automated fact-checking has proposed different approaches based on traditional machine learning methods using hand-crafted lexical features. Nevertheless, these approaches focus on analyzing the claim text without considering facts that are not explicitly given but can be derived from it. For example, external evidence retrieved from the Web as a knowledge source can provide complementary context for the claim and convincing reasons to support or oppose it. Recent approaches address this deficit by incorporating supporting evidence (articles) corresponding to the claim. However, these methods either require substantial feature modeling, do not consider multiple pieces of supporting evidence, or do not analyze the language of the supporting evidence in depth.

To this end, we propose an end-to-end framework, named Automatic Fake News Classification Through Self-Attention (ACT), which exploits multiple supporting articles for a claim and mimics manual fact-checking processes. The model computes the claim’s credibility by aggregating over the predictions generated for every claim-retrieved article pair. The article input is represented using self-attention on top of a bidirectional LSTM neural network. By using self-attention, the model concentrates on nuanced linguistic features and does not require any feature engineering, lexicons, or other manual intervention. Moreover, different aspects of the supporting article are extracted into multiple vector representations; hence, different meaningful article representations can be combined into a two-dimensional matrix that represents the article. In the end, a majority vote over the several external articles of a given claim is applied to assess the claim’s credibility. We conduct experiments on three different real-world datasets, compare against state-of-the-art approaches, and analyze our results, which show performance improvements.
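
A hedged sketch of the kind of encoder described, a bidirectional LSTM with multi-hop self-attention that yields a two-dimensional article matrix; the dimensions and number of attention hops are assumptions, not ACT's exact configuration.

```python
# Illustrative sketch: biLSTM with multi-hop self-attention producing a
# matrix representation of an article (dimensions are assumptions).
import torch
import torch.nn as nn

class SelfAttentiveEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, attn_dim=64, hops=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.w1 = nn.Linear(2 * hidden, attn_dim, bias=False)
        self.w2 = nn.Linear(attn_dim, hops, bias=False)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))          # (batch, seq_len, 2*hidden)
        a = torch.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # attention per hop
        return a.transpose(1, 2) @ h                # (batch, hops, 2*hidden) article matrix

enc = SelfAttentiveEncoder(vocab_size=5000)
article_matrix = enc(torch.randint(0, 5000, (2, 40)))
print(article_matrix.shape)  # torch.Size([2, 4, 256])
```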

Measuring and Characterizing Hate Speech on News Websites

The Web has become the main source for news acquisition. At the same time, news discussion has become more social: users can post comments on news articles or discuss news articles on other platforms like Reddit. These features empower and enable discussions among users; however, they also act as a medium for the dissemination of toxic discourse and hate speech. The research community lacks a general understanding of what type of content attracts hateful discourse and of the possible effects of social networks on the commenting activity on news articles.

In this work, we perform a large-scale quantitative analysis of 125M comments posted on 412K news articles over the course of 19 months. We analyze the content of the collected articles and their comments using temporal analysis, user-based analysis, and linguistic analysis, to shed light on what elements attract hateful comments on news articles. We also investigate commenting activity when an article is posted on either 4chan’s Politically Incorrect board (/pol/) or six selected subreddits. We find statistically significant increases in hateful commenting activity around real-world divisive events like the “Unite the Right” rally in Charlottesville and political events like the second and third 2016 US presidential debates. Also, we find that articles that attract a substantial number of hateful comments have different linguistic characteristics when compared to articles that do not attract hateful comments. Furthermore, we observe that posting a news article on either /pol/ or the six subreddits is correlated with an increase in (hateful) commenting activity on that article.

HPRA: Hyperedge Prediction using Resource Allocation

Many real-world systems involve higher-order interactions and thus demand complex models such as hypergraphs. For instance, a research article could have multiple collaborating authors, and therefore the co-authorship network is best represented as a hypergraph. In this work, we focus on the problem of hyperedge prediction. This problem has immense applications in multiple domains, such as predicting new collaborations in social networks, discovering new chemical reactions in metabolic networks, etc. Despite its significant importance, the problem of hyperedge prediction has not received adequate attention, mainly because of its inherent complexity. In a graph with n nodes, the number of potential edges is n(n-1)/2, whereas in a hypergraph the number of potential hyperedges is of the order of 2^n. To avoid searching through this huge space of hyperedges, current methods restrict the original problem in the following two ways. One class of algorithms assumes the hypergraphs to be k-uniform, where each hyperedge has exactly k nodes. However, many real-world systems are not confined to interactions involving exactly k components; thus, these algorithms are not suitable for many real-world applications. The second class of algorithms requires a candidate set of hyperedges from which the potential hyperedges are chosen. In the absence of domain knowledge, the candidate set can contain on the order of 2^n possible hyperedges, which makes the problem intractable. More often than not, domain knowledge is not readily available, making these methods limited in applicability. We propose HPRA - Hyperedge Prediction using Resource Allocation, the first of its kind algorithm, which overcomes these issues and predicts hyperedges of any cardinality without using any candidate hyperedge set. HPRA is a similarity-based method working on the principles of the resource allocation process. In addition to recovering missing hyperedges, we demonstrate that HPRA can predict future hyperedges in a wide range of hypergraphs. Our extensive set of experiments shows that HPRA achieves statistically significant improvements over state-of-the-art methods.
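
For intuition, the resource allocation principle on an ordinary graph scores a node pair by summing the inverse degrees of their common neighbours; HPRA generalises this idea to hyperedges of any cardinality. The NetworkX sketch below shows only the classic pairwise index.

```python
# Pairwise resource allocation index for link prediction, for intuition only;
# the extension to hyperedges is the paper's contribution and is not shown here.
import networkx as nx

G = nx.karate_club_graph()
candidate_pairs = [(0, 9), (1, 33)]
for u, v, score in nx.resource_allocation_index(G, candidate_pairs):
    print(f"RA({u}, {v}) = {score:.3f}")   # sum over common neighbours z of 1/deg(z)
```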

Comparing Audience Appreciation to Fact-Checking Across Political Communities on Reddit

As a countermeasure to disinformation, many fact-checking websites, such as Snopes.com, provide valuable resources to verify news stories or claims. In this paper, we study how such fact-checking resources are used in online political discussions on Reddit, and how audiences or readers respond to their use in the context of the 2016 US Presidential Election. We first characterize the role of fact-checking resources by developing a typology for labeling instances in which they are employed in three political subreddits: r/politics, r/The_Donald and r/hillaryclinton. We find that fact-checking, when used as a correction to false information, is more prevalent on r/politics than on r/The_Donald or r/hillaryclinton. Next, we quantify audience responses to fact-checking by using comment score as a measure of popularity and find that the correction of facts is also more appreciated in r/politics than in the other subreddits. Finally, we estimate the impact of corrections, and other uses of fact-checks, on the sustainability of a conversational thread and find that the presence of corrections in r/politics appears to be correlated with short conversations. Overall, these findings indicate that the use of fact-checking resources within r/politics is distinct from that in more partisan subreddits.

Analyzing Temporal Relationships between Trending Terms on Twitter and Urban Dictionary Activity

As an online, crowd-sourced, open English-language slang dictionary, the Urban Dictionary platform contains a wealth of opinions, jokes, and definitions of terms, phrases, acronyms, and more. However, it is unclear exactly how activity on this platform relates to larger conversations happening elsewhere on the web, such as discussions on larger, more popular social media platforms. In this research, we study the temporal activity trends on Urban Dictionary and provide the first analysis of how this activity relates to content being discussed on a major social network: Twitter. By collecting the whole of Urban Dictionary, as well as a large sample of tweets over seven years, we explore the connections between the words and phrases that are defined and searched for on Urban Dictionary and the content that is talked about on Twitter. Through a series of cross-correlation calculations, we identify cases in which Urban Dictionary activity closely reflects the larger conversation happening on Twitter. Then, we analyze the types of terms that have a stronger connection to discussions on Twitter, finding that Urban Dictionary activity that is positively correlated with Twitter is centered around terms related to memes, popular public figures, and offline events. Finally, we explore the relationship between periods of time when terms are trending on Twitter and the corresponding activity on Urban Dictionary, revealing that new definitions are more likely to be added to Urban Dictionary for terms that are currently trending on Twitter.
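
The lagged cross-correlation analysis described above can be sketched as follows, with synthetic weekly series standing in for Twitter mention counts and Urban Dictionary activity.

```python
# Hedged sketch of lagged cross-correlation between two activity series
# (the series here are synthetic; positive lag means Twitter leads).
import numpy as np

def lagged_cross_correlation(a, b, max_lag=8):
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        out[lag] = float(np.corrcoef(x, y)[0, 1])
    return out

rng = np.random.default_rng(1)
twitter = rng.poisson(20, 104).astype(float)          # weekly mention counts (synthetic)
urban = np.roll(twitter, 2) + rng.normal(0, 2, 104)   # UD activity lagging Twitter by ~2 weeks
corr = lagged_cross_correlation(twitter, urban)
best_lag = max(corr, key=corr.get)
print("strongest correlation at lag", best_lag, "weeks")
```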

Examining the Role of Mood Patterns in Predicting Self-Reported Depressive symptoms

Researchers have explored automatic screening models as a quick way to identify potential risks of developing depressive symptoms. Most existing models include a person’s mood as reflected on social media at a single point in time as one of the predictive variables. In this paper, we study the changes and transitions in mood reflected in social media text over a period of one year using a mood profile. We used a subset of the “MyPersonality” Facebook data set that comprises users who consented to and completed an assessment of depressive symptoms. The subset consists of 93,378 Facebook posts from 781 users. We observed less evidence of mood fluctuation expressed in social media text from those with low symptom measures compared to others with high symptom scores. Next, we leveraged a daily mood representation in Hidden Markov Models to determine its associations with subsequent self-reported symptoms. We found that individuals who have specific mood patterns are highly likely to have reported high depressive symptoms. However, not all of the high-symptom individuals necessarily displayed this characteristic, which indicates the presence of potential subgroups driving these findings. Finally, we leveraged multiple mood representations to characterize levels of depressive symptoms with a logistic regression model. Our findings support the claim that, for some people, mood derived from social media text can be a proxy for real-life mood, in particular depressive symptoms. Combining the mood representations with other proxy signals can potentially advance responsibly used semi-automatic screening procedures.
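
A hedged sketch of fitting a Hidden Markov Model to a daily mood series, assuming hmmlearn's GaussianHMM on synthetic valence scores; the number of hidden states and the feature choice are assumptions, not the paper's exact setup.

```python
# Illustrative HMM over a daily mood series (synthetic data, assumed settings).
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# one year of daily mood valence scores for a single user (synthetic)
daily_mood = np.concatenate([rng.normal(0.5, 0.1, 200), rng.normal(-0.3, 0.2, 165)])
X = daily_mood.reshape(-1, 1)

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=100, random_state=0)
model.fit(X)                       # pass lengths=[...] when concatenating several users
states = model.predict(X)          # inferred hidden mood-state sequence
print(np.bincount(states))         # time spent in each latent mood state
```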

Every Colour You Are: Stance Prediction and Turnaround in Controversial Issues

Web platforms have allowed political manifestation and debate for decades. Technology changes have brought new opportunities for expression, and the availability of longitudinal data on these debates entices new questions regarding who participates and who updates their opinion. The aim of this work is to provide a methodology to measure these phenomena, and to test this methodology on a specific topic, abortion, as observed on one of the most popular micro-blogging platforms. To do so, we followed the discussion on Twitter about abortion in two Spanish-speaking countries from 2015 to 2018. Our main insights are twofold. On the one hand, people adopted new technologies to express their stances, particularly colored variations of heart emojis, in a way that mirrored physical manifestations on abortion. On the other hand, even on issues with strong opinions, opinions can change, and these changes show differences across demographic groups. These findings imply that debate on the Web embraces new ways of stance adherence, and that changes of opinion can be measured and characterized.

Cores matter? An analysis of graph decomposition effects on influence maximization problems

Estimating the spreading potential of nodes in a social network is an important problem which finds application in a variety of different contexts, ranging from viral marketing to the spread of viruses and rumor blocking. Several studies have exploited both mesoscale structures and local centrality measures in order to estimate the spreading potential of nodes. To this end, one known result in the literature establishes a correlation between the spreading potential of a node and its coreness: i.e., in a core-decomposition of a network, nodes in higher cores have a stronger influence potential on the rest of the network. In this paper we show that the above result does not hold in general under common settings of propagation models with submodular activation functions on directed networks, such as those used in the influence maximization (IM) problem.

Motivated by this finding, we extensively explore where the set of influential nodes extracted by state-of-the-art IM methods are located in a network w.r.t. different notions of graph decomposition. Our analysis on real-world networks provides evidence that, regardless of the particular IM method, the best spreaders are not always located within the inner-most subgraphs defined according to commonly used graph-decomposition methods. We identify the main reasons that explain this behavior, which can be ascribed to the inability of classic decomposition methods in incorporating higher-order degree of nodes. By contrast, we find that a distance-based generalization of the core-decomposition for directed networks can profitably be exploited to actually restrict the location of candidate solutions for IM to a single, well-defined portion of a network graph.
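
For illustration, the kind of check discussed above can be sketched with the classic (undirected) k-core decomposition in NetworkX, locating a placeholder seed set within the core hierarchy; the paper itself works with directed networks and richer decompositions.

```python
# Sketch: compute each node's coreness and see where a given seed set
# (e.g., produced by some IM method) falls in the core hierarchy.
import networkx as nx

G = nx.karate_club_graph()
core = nx.core_number(G)                     # classic k-core decomposition
seeds = [0, 33, 2]                           # placeholder seed set from some IM method
max_core = max(core.values())
for s in seeds:
    shell = "inner-most" if core[s] == max_core else "outer"
    print(f"seed {s}: core {core[s]} of {max_core} ({shell} shell)")
```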

The Reception of Education Reforms through the Blogosphere

Teachers and other Edu-professionals in the United Kingdom are a professional body charged with educating each generation of children from birth to age 18, and beyond to Higher Education (HE). Over the last 10 years or so, they have formed a significant community in the blogosphere.

This paper presents an exploration of the Edu-blogosphere in order to gain a better understanding of the topics discussed, and how reforms instigated by the government are received. This is particularly important as the Secretary of State for Education appointed in 2010, Michael Gove, introduced a number of reforms which were wide-ranging and not always well received by the Edu-community. Blogs written by some members of this community contributed to an eventual change in policy.

The challenges of harvesting and analysing data on such a large scale require semi-automated approaches; however, it is not sufficient to use such tools and techniques ‘out of the box’. Here, we present a methodology that draws on the specialist domain knowledge of the researchers, as well as a robust approach to evaluating the parameters of the selected algorithms. Attention has also been paid to the cleaning and pre-processing of the data, in particular the generation of a bespoke list of stopwords. A review of the existing literature provided a list of seven categories to classify the blogs, using semi-supervised learning. This was followed by topic modelling using Latent Dirichlet Allocation (LDA). Both methods present challenges, discussed in this paper. This approach represents the first part of the original contribution of this work: the combination of the analysis of a community (research rooted in the Social Sciences) with tools from Computer Science.
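
The LDA topic-modelling step mentioned above could be sketched with gensim on a toy corpus; the posts, stopword list, and parameter choices below are placeholders, not the study's actual configuration.

```python
# Illustrative gensim LDA sketch (tiny toy corpus, placeholder stopword list).
from gensim import corpora
from gensim.models import LdaModel

stopwords = {"the", "and", "of", "a", "to", "on", "for"}   # bespoke stopword list (placeholder)
posts = [
    "the new curriculum reform and assessment changes",
    "sharing classroom resources for the new term",
    "reflections on the impact of the education reforms",
]
tokenised = [[w for w in post.lower().split() if w not in stopwords] for post in posts]

dictionary = corpora.Dictionary(tokenised)
corpus = [dictionary.doc2bow(doc) for doc in tokenised]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```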

Following the application of the methodology on blog posts written by the Edu-community, a discourse focused on professional practice, sharing resources, and discussing a wide range of topics was revealed. There was also a noticeable spike in the number of blogs focusing on the impact of the Education reforms mentioned above. The richness of the discourse in the Edu-community is presented as the second original contribution of this research. This methodology is promising for the analysis of other blog communities.

Challenges and proposals for further work include adding blog post titles, comments and author tags to the data; and combining the data with tweets made by those members of the community that are identifiable on Twitter.

Representativity Fairness in Clustering

Incorporating fairness constructs into machine learning algorithms is a topic of much societal importance and recent interest. Clustering, a fundamental task in unsupervised learning that manifests across a number of web data scenarios, has also been a subject of attention within fair ML research. In this paper, we develop a novel notion of fairness in clustering, called representativity fairness. Representativity fairness is motivated by the need to alleviate disparity across objects’ proximity to their assigned cluster representatives, to aid fairer decision making. We illustrate the importance of representativity fairness in real-world decision making scenarios involving clustering and provide ways of quantifying objects’ representativity and fairness over it. We develop a new clustering formulation, RFKM, that aims to optimize representativity fairness along with clustering quality. Inspired by the K-Means framework, RFKM incorporates novel loss terms to formulate an objective function. The RFKM objective and optimization approach guide it towards clustering configurations that yield higher representativity fairness. Through an empirical evaluation over a variety of public datasets, we establish the effectiveness of our method. We illustrate that we are able to significantly improve representativity fairness at only marginal impact to clustering quality.
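
A small sketch of quantifying representativity as each object's distance to its assigned cluster representative, with a simple disparity summary; the disparity measure here is an illustrative choice, and RFKM's own objective is not reproduced.

```python
# Sketch: per-object representativity (distance to assigned centroid) and a
# simple disparity summary across objects (illustrative, not RFKM itself).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)  # representativity
print("mean distance:", dist.mean())
print("disparity (std, max gap):", dist.std(), dist.max() - dist.min())
```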

Preference-driven Control over Incompleteness of Knowledge Graph Query Answers

Entities in today’s knowledge graphs differ not only in their property values but also in the schematic structures they are represented by. Given their extraction-based foundation, it is quite common that in practical knowledge base instances totally unrelated graph structures describe entities of the same type. Hence, operators for handling such heterogeneity are mandatory when designing a robust query language for knowledge graphs. While SPARQL does offer optional patterns for this purpose, their query answers often suffer from an unintuitive matching behavior. In contrast, preference semantics seem to be a much more intuitive and still robust way of expressing what the optimal query result may look like. While preferences over data value domains are already applied to graph data, we argue for structural preferences to achieve fine-grained control of heterogeneity in the query answers. Therefore, we propose a new operator for SPARQL, enabling the expression of structural as well as some value preferences. Equipped with a Pareto-style semantics, we give examples of how to model preferences with the new operator. Our prototypical implementation allows for evaluating several encodings of the new construct at DBpedia’s SPARQL endpoint.

On the use of Jargon and Word Embeddings to Explore Subculture within the Reddit’s Manosphere

Understanding the identities, needs, realities, and development of subcultures has been a long-term goal of sociology and cultural studies. Socio-cultural linguistics, in particular, examines the use of language, including the existence and use of neologisms, slang, and jargon. These terms capture concepts and expressions that are not in common use and represent the new realities, norms, and values of subcommunities. Identifying and understanding such terms, however, is a very complex task, particularly considering the vast amount of content that is currently available online for many such groups. In this paper, we propose a combination of computational and socio-linguistic methods to automatically extract new terminology from large amounts of data, using word embeddings to semantically contextualise their meaning. As a use case, we explore subculture on the platform Reddit. More specifically, we investigate groups considered part of the manosphere, a loose online community where men’s perspectives, gripes, frustrations and desires are explicitly expressed and where women are typically targets of hostility. Characterisations of this group as a subculture are then provided, based on an in-depth analysis of the identified jargon.
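
The embedding step could be sketched with gensim's Word2Vec, training on community text and inspecting a candidate jargon term's nearest neighbours; the toy corpus and the example term below are placeholders.

```python
# Hedged sketch: train word embeddings on community text and use nearest
# neighbours to contextualise a candidate jargon term (placeholder corpus/term).
from gensim.models import Word2Vec

# each element is one tokenised comment from the community under study
corpus = [
    ["this", "sub", "explains", "hypergamy", "clearly"],
    ["hypergamy", "is", "discussed", "in", "every", "thread"],
    ["the", "thread", "derailed", "again"],
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50, seed=0)
print(model.wv.most_similar("hypergamy", topn=5))  # semantic context of the jargon term
```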

Unveiling Community Dynamics on Instagram Political Network

Online Social Networks (OSNs) allow users to generate and consume content in an easy and personalized way. Among OSNs, Instagram has seen a surge in popularity, and political actors exploit it to reach people at scale, bypassing traditional media and often triggering harsh debates with and among followers. Uncovering the structural properties and dynamics of such interactions is paramount for understanding the online political debate. This is a challenging task due to both the size of the network and the nature of interactions.

In this paper, we define a probabilistic model to extract the backbone of the interaction network among Instagram commenters and, after that, we uncover communities. We apply our model to 10 weeks of comments centered around election times in Brazil and Italy. We monitor both politicians and other categories of influencers, finding persistent commenters, i.e., those who often comment together on Instagram posts.

Our methodology allows us to unveil interesting facts: i) commenters’ networks are split into a few communities; ii) community structure in politics is weaker than in general profiles, indicating that the political debate is blurred, with some commenters bridging strongly opposed political actors; and iii) communities engaging on political profiles are bigger, more active, and more stable during electoral periods.

Exploring Low-degree nodes first accelerates Network Exploration

We consider information diffusion on Web-like networks and how random walks can simulate it. A well-studied problem in this domain is Partial Cover Time, i.e., the calculation of the expected number of steps a random walker needs to visit a given fraction of the nodes of the network. We notice that some of the fastest solutions in fact require that nodes have perfect knowledge of the degree distribution of their neighbors, which in many practical cases is not obtainable, e.g., for privacy reasons. We thus introduce a version of the cover problem that considers such limitations: Partial Cover Time with Budget. The budget is a limit on the number of neighbors that can be inspected for their degree; we have adapted optimal random walk strategies from the literature to operate under such a budget. Our solution is called Min-degree (MD) and, essentially, it biases random walkers towards visiting peripheral areas of the network first. Extensive benchmarking on six real datasets proves that the perhaps counter-intuitive MD strategy is in fact highly competitive with state-of-the-art algorithms for cover.
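
A sketch of the Min-degree idea under a neighbour-inspection budget: at each step the walker inspects the degrees of at most a budgeted number of neighbours and moves to the lowest-degree one. The parameters and termination condition below are assumptions, not the paper's exact algorithm.

```python
# Illustrative budgeted low-degree-biased walk for partial cover
# (budget, target fraction, and graph are placeholder assumptions).
import random
import networkx as nx

def md_walk_cover(G, start, budget=3, target_fraction=0.5, rng=random.Random(0)):
    visited, current, steps = {start}, start, 0
    target = target_fraction * G.number_of_nodes()
    while len(visited) < target:
        neighbours = list(G.neighbors(current))
        inspected = rng.sample(neighbours, min(budget, len(neighbours)))
        current = min(inspected, key=G.degree)   # bias towards low-degree (peripheral) nodes
        visited.add(current)
        steps += 1
    return steps

G = nx.barabasi_albert_graph(500, 3, seed=0)
print("steps to cover 50% of nodes:", md_walk_cover(G, start=0))
```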

Scholarly Social Machines: A Web Science Perspective on our Knowledge Infrastructure

A Knowledge Infrastructure comprises the people, artefacts, and institutions that generate, share, and maintain knowledge, very often mediated by the Web. Our scholarly Knowledge Infrastructure is evolving as researchers embrace digital techniques enabled by increasing availability of digital data, computational power, and analytical tools and techniques. Crucially, the social structures are changing also. Taking a Web Science approach, this paper encourages the reader to view the scholarly Knowledge Infrastructure as an ecosystem of interacting and evolving Social Machines. We illustrate these Scholarly Social Machines with a series of descriptive examples, and reflect on these to propose Scholarly Primitives associated with Scholarly Social Machines. We suggest that this approach facilitates a holistic understanding of our scholarly Knowledge Infrastructure and informs its evolution.

Dancing to the Partisan Beat: A First Analysis of Political Communication on TikTok

TikTok is a video-sharing social networking service, whose popularity is increasing rapidly. It was the world’s second-most downloaded app in 2019. Although the platform is known for having users posting videos of themselves dancing, lip-syncing, or showcasing other talents, user-videos expressing political views have seen a recent spurt. This study aims to perform a primary evaluation of political communication on TikTok. We collect a set of US partisan Republican and Democratic videos to investigate how users communicated with each other about political issues. With the help of computer vision, natural language processing, and statistical tools, we illustrate that political communication on TikTok is much more interactive in comparison to other social media platforms, with users combining multiple information channels to spread their messages. We show that political communication takes place in the form of communication trees since users generate branches of responses to existing content. In terms of user demographics, we find that users belonging to both the US parties are young and behave similarly on the platform. However, Republican users generated more political content and their videos received more responses; on the other hand, Democratic users engaged significantly more in cross-partisan discussions.

Is There Personalization in Twitter Search? A Study on polarized opinions about the Brazilian Welfare Reform

Personalization algorithms play an essential role in the way search platforms fetch results for users. While there are many empirical studies about the effects of these algorithms on Web searches like Google and Bing, reports about personalization on social media searches are rare. This exploratory study aims to understand and quantify the limits of personalization in Twitter search results. We developed a measurement methodology and agents to train a pair of polarized Twitter accounts, and simultaneously collected search results from these accounts. The agents were run in a political context, the Brazilian Welfare Reform. Our findings show a significant amount of personalization differences when we compare search results from a fresh new profile to non-fresh ones. Peculiarly, we found little evidence of differences between two profiles that followed different accounts with polarized viewpoints about the same topic, so the filter bubble hypothesis is not supported in this setting.

An Automated Pipeline for Character and Relationship Extraction from Readers Literary Book Reviews on Goodreads.com

Reader reviews of literary fiction on social media, especially those in persistent, dedicated forums, create and are in turn driven by underlying narrative frameworks. In their comments about a novel, readers generally include only a subset of characters and their relationships, thus offering a limited perspective on that work. Yet in aggregate, these reviews capture an underlying narrative framework comprised of different actants (people, places, things), their roles, and interactions that we label the “consensus narrative framework”. We represent this framework in the form of an actant-relationship story graph. Extracting this graph is a challenging computational problem, which we pose as a latent graphical model estimation problem. Posts and reviews are viewed as samples of subgraphs/networks of the hidden narrative framework. Inspired by the qualitative narrative theory of Greimas, we formulate a graphical generative Machine Learning (ML) model where nodes represent actants, and multi-edges and self-loops among nodes capture context-specific relationships. We develop a pipeline of interlocking automated methods to extract key actants and their relationships, and apply it to thousands of reviews and comments posted on Goodreads.com. We manually derive the ground truth narrative framework from SparkNotes, and then use word embedding tools to compare relationships in ground truth networks with our extracted networks. We find that our automated methodology generates highly accurate consensus narrative frameworks: for our four target novels, with approximately 2900 reviews per novel, we report average coverage/recall of important relationships of >80% and an average edge detection rate of >89%. These extracted narrative frameworks can generate insight into how people (or classes of people) read and how they recount what they have read to others.

Analysing Privacy Leakage of Life Events on Twitter

People share a wide variety of information on Twitter, including the events in their lives, without understanding the size of their audience. While some of these events can be considered harmless such as getting a new pet, some of them can be sensitive such as gender-transition experiences. Every interaction increases the visibility of the tweets and even if the original tweet is protected or deleted, public replies to it will stay in the platform. These replies might signal the events in the original tweet which cannot be managed by the event subject. In this paper, we aim to understand the scope of life event disclosures for those with both public and protected (private) accounts. We collected 635k tweets with the phrase “happy for you” over four months. We found that roughly 10% of the tweets collected were celebrating a mentioned user’s life event, ranging from marriage to surgery recovery. 8% of these tweets were directed at protected accounts. The majority of mentioned users also interacted with these tweets by liking, retweeting, or replying.

User Identity Linkage in Social Media Using Linguistic and Social Interaction Features

Social media users often hold several accounts in their effort to multiply the spread of their thoughts, ideas, and viewpoints. In the particular case of objectionable content, users tend to create multiple accounts to bypass the combating measures enforced by social media platforms and thus retain their online identity even if some of their accounts are suspended. User identity linkage aims to reveal social media accounts likely to belong to the same natural person so as to prevent the spread of abusive/illegal activities. To this end, this work proposes a machine learning-based detection model, which uses multiple attributes of users’ online activity in order to identify whether two or more virtual identities belong to the same real natural person. The model’s efficacy is demonstrated on two cases involving abusive and terrorism-related Twitter content.

A Systematic Media Frame Analysis of 1.5 Million New York Times Articles from 2000 to 2017

Framing is an indispensable narrative device for news media because even the same facts may lead to conflicting understandings if deliberate framing is employed. Therefore, identifying media framing is a crucial step to understanding how news media influence the public. Framing is, however, difficult to operationalize and detect, and thus traditional media framing studies had to rely on manual annotation, which is challenging to scale up to massive news datasets. Here, by developing a media frame classifier that achieves state-of-the-art performance, we systematically analyze the media frames of 1.5 million New York Times articles published from 2000 to 2017. By examining the ebb and flow of media frames over almost two decades, we show that short-term frame abundance fluctuation closely corresponds to major events, while there also exist several long-term trends, such as the gradually increasing prevalence of the “Cultural identity” frame. By examining specific topics and sentiments, we identify characteristics and dynamics of each frame. Finally, as a case study, we delve into the framing of mass shootings, revealing three major framing patterns. Our scalable, computational approach to massive news datasets opens up new pathways for systematic media framing studies.

Misplaced Trust: Measuring the Interference of Machine Learning in Human Decision-Making

ML decision-aid systems are increasingly common on the web, but their successful integration relies on people trusting them appropriately: they should use the system to fill in gaps in their ability, but recognize signals that the system might be incorrect. We measured how people’s trust in ML recommendations differs by expertise and with more system information through a task-based study of 175 adults. We used two tasks that are difficult for humans: comparing large crowd sizes and identifying similar-looking animals. Our results provide three key insights: (1) People trust incorrect ML recommendations for tasks that they perform correctly the majority of the time, even if they have high prior knowledge about ML or are given information indicating the system is not confident in its prediction; (2) Four different types of system information all increased people’s trust in recommendations; and (3) Math and logic skills may be as important as ML for decision-makers working with ML recommendations.

How Biased is the Population of Facebook Users? Comparing the Demographics of Facebook Users with Census Data to Generate Correction Factors

Censuses and representative sampling surveys around the world are key sources of data to guide government investments and public policies. However, these sources are very expensive to obtain and are collected relatively infrequently. Over the last decade, there has been growing interest in the use of data from social media to complement more traditional data sources. However, social media users are not representative of the general population. Thus, analyses based on social media data require statistical adjustments, like post-stratification, in order to remove the bias and make solid statistical claims. These adjustments are possible only when we have information about the frequency of demographic groups using social media. These data, when compared with official statistics, enable researchers to produce appropriate statistical correction factors. In this paper, we leverage the Facebook advertising platform to compile the equivalent of an aggregate-level census of Facebook users. Our compilation includes the population distribution for seven demographic attributes such as gender, political leaning, and educational attainment at different geographic levels for the U.S. (country, state, and city). By comparing the Facebook counts with official reports provided by the U.S. Census and Gallup, we found very high correlations, especially for political leaning and race. We also identified instances where official statistics may be underestimating population counts as in the case of immigration. We use the information collected to calculate bias correction factors for all computed attributes in order to evaluate the extent to which different demographic groups are more or less represented on Facebook, and to derive the actual distributions for specific audiences of interest. We provide the first comprehensive analysis for assessing biases in Facebook users across several dimensions. This information can be used to generate bias-adjusted population estimates and demographic counts in a timely way and at fine geographic granularity in between data releases of official statistics.
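
The correction-factor idea reduces to the ratio between a group's share in official statistics and its share among Facebook users; the sketch below uses made-up numbers purely for illustration.

```python
# Minimal sketch of bias correction factors: ratio of a group's share in
# official statistics to its share among Facebook users (numbers are made up).
census_share = {"18-24": 0.12, "25-44": 0.34, "45-64": 0.33, "65+": 0.21}
facebook_share = {"18-24": 0.22, "25-44": 0.41, "45-64": 0.27, "65+": 0.10}

correction = {g: census_share[g] / facebook_share[g] for g in census_share}
for group, factor in correction.items():
    status = "over-represented" if factor < 1 else "under-represented"
    print(f"{group}: correction factor {factor:.2f} ({status} on Facebook)")
```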

Joint Local and Global Sequence Modeling in Temporal Correlation Networks for Trending Topic Detection

Trending topics represent the topics that are becoming increasingly popular and attract a sudden spike in human attention. Trending topics are critical and useful in modern search engines, as they can not only enhance user engagement but also improve user search experiences. Large volumes of user search queries over time are indicative of aggregated user interests and thus provide rich information for detecting trending topics. The topics derived from query logs can be naturally treated as a temporal correlation network, suggesting both local and global trending signals. The local signals represent the trending/non-trending information within each frequency sequence, and the global correlation signals denote the relationships across frequency sequences. We hypothesize that integrating local and global signals can benefit trending topic detection. In an attempt to jointly exploit the complementary information of local and global signals in temporal correlation networks, we propose a novel framework, Local-Global Ranking (LGRank), to both capture local temporal sequence representations with adversarial learning and model global sequence correlations simultaneously for trending topic detection. The experimental results on real-world datasets from a commercial search engine demonstrate the effectiveness of LGRank for detecting trending topics.

AI 2000: A Decade of Artificial Intelligence

In the past decades, artificial intelligence has dramatically changed the way we work and live. Moreover, it is increasingly becoming a national strategy because of its rapid development and broad application in industries. However, an understanding of how artificial intelligence itself advances has been sorely lacking until now. One of the most important reasons is the deficiency of timely and reliable knowledge graphs in this field. To address this problem, we introduce an academic knowledge graph of AI, named AI 2000, which combines techniques from data mining, bibliometrics, natural language processing, etc. In this work, we link the entities of scholars, academic papers, research topics, etc. in the field of artificial intelligence. AI 2000 aims to serve as a medium for exploring the evolution of artificial intelligence in the past decade and looking forward to its future trends. The methodology illustrates its timeliness and reliability, and our analysis demonstrates its high quality and availability. It is freely available at AMiner.

Tackling Peer-to-Peer Discrimination in the Sharing Economy

Sharing economy platforms such as Airbnb and Uber face a major challenge in the form of peer-to-peer discrimination based on sensitive personal attributes such as race and gender. As shown by a recent study under controlled settings, reputation systems can eliminate social biases on these platforms by building trust between the users. However, for this to work in practice, the reputation systems must themselves be non-discriminatory. In fact, a biased reputation system will further reinforce the bias and create a vicious feedback loop. Given that the reputation scores are generally aggregates of ratings provided by human users to one another, it is not surprising that the scores often inherit the human bias. In this paper, we address the problem of making reputation systems on sharing economy platforms more fair and unbiased. We show that a game-theoretical incentive mechanism can be used to encourage users to go against common bias and provide a truthful rating about others, obtained through a more careful and deeper evaluation. In situations where an incentive mechanism can’t be implemented, we show that a simple post-processing approach can also be used to correct bias in the reputation scores, while minimizing the loss in the useful information provided by the scores. We evaluate the proposed solution on synthetic and real datasets from Airbnb.