WebSci '18: Proceedings of the 10th ACM Conference on Web Science

SESSION: Best of Web Science 2018

Understanding the Roots of Radicalisation on Twitter

In an increasingly digital world, identifying signs of online extremism sits at the top of the priority list for counter-extremist agencies. Researchers and governments are investing in the creation of advanced information technologies to identify and counter extremism through intelligent large-scale analysis of online data. However, to the best of our knowledge, these technologies are neither based on, nor do they take advantage of, the existing theories and studies of radicalisation. In this paper we propose a computational approach for detecting and predicting the radicalisation influence a user is exposed to, grounded in the notion of 'roots of radicalisation' from social science models. This approach has been applied to analyse and compare the radicalisation level of 112 pro-ISIS vs. 112 "general" Twitter users. Our results show the effectiveness of our proposed algorithms in detecting and predicting radicalisation influence, obtaining an F1 measure of up to 0.9 for detection and a precision between 0.7 and 0.8 for prediction. While this is an initial attempt towards the effective combination of social and computational perspectives, more work is needed to bridge these disciplines, and to build on their strengths to target the problem of online radicalisation.

Collective Attention towards Scientists and Research Topics

Emergent patterns of collective attention towards scientists and their research may function as a proxy for scientific impact which traditionally is assessed via committees that award prizes to scientists. Therefore it is crucial to understand the relationships between scientific impact and online demand and supply for information about scientists and their work. In this paper, we compare the temporal pattern of information supply (article creations) and information demand (article views) on Wikipedia for two groups of scientists: scientists who received one of the most prestigious awards in their field and influential scientists from the same field who did not receive an award. Our research highlights that awards function as external shocks which increase supply and demand for information about scientists, but hardly affect information supply and demand for their research topics. Further, we find interesting differences in the temporal ordering of information supply between the two groups: (i) award-winners have a higher probability that interest in them precedes interest in their work; (ii) for award winners interest in articles about them and their work is temporally more clustered than for non-awarded scientists.

Fake News vs Satire: A Dataset and Analysis

Fake news has become a major societal issue and a technical challenge for social media companies to identify. This content is difficult to identify because the term "fake news" covers intentionally false, deceptive stories as well as factual errors, satire, and sometimes, stories that a person just does not like. Addressing the problem requires clear definitions and examples. In this work, we present a dataset of fake news and satire stories that are hand coded, verified, and, in the case of fake news, include rebutting stories. We also include a thematic content analysis of the articles, identifying major themes that include hyperbolic support or condemnation of a figure, conspiracy theories, racist themes, and discrediting of reliable sources. In addition to releasing this dataset for research use, we analyze it and show results based on language that are promising for classification purposes. Overall, our contribution of a dataset and initial analysis are designed to support future work by fake news researchers.

Third Party Tracking in the Mobile Ecosystem

Third party tracking allows companies to identify users and track their behaviour across multiple digital services. This paper presents an empirical study of the prevalence of third-party trackers on 959,000 apps from the US and UK Google Play stores. We find that most apps contain third party tracking, and the distribution of trackers is long-tailed with several highly dominant trackers accounting for a large portion of the coverage. The extent of tracking also differs between categories of apps; in particular, news apps and apps targeted at children appear to be amongst the worst in terms of the number of third party trackers associated with them. Third party tracking is also revealed to be a highly trans-national phenomenon, with many trackers operating in jurisdictions outside the EU. Based on these findings, we draw out some significant legal compliance challenges facing the tracking industry.

A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research

A quality annotated corpus is essential to research. Despite the recent focus of the Web science community on cyberbullying research, the community lacks standard benchmarks. This paper provides both a quality annotated corpus and an offensive words lexicon capturing different types of harassment content: (i) sexual, (ii) racial, (iii) appearance-related, (iv) intellectual, and (v) political. We first crawled data from Twitter using this content-tailored offensive lexicon. As mere presence of an offensive word is not a reliable indicator of harassment, human judges annotated tweets for the presence of harassment. Our corpus consists of 25,000 annotated tweets for the five types of harassment content and is available on the Git repository.

SESSION: Digging Into Social Networks

Uncovering the Nucleus of Social Networks

Many social network studies have focused on identifying communities through clustering or partitioning a large social network into smaller parts. While community structure is important in social network analysis, relatively little attention has been paid to the problem of "core structure" analysis in many social networks. Intuitively, one may expect that many social networks possess some sort of a "core" which holds various parts of the network (or constituent "communities") together. We believe that it is just as important to uncover and extract the "core" structure -- referred to as the "nucleus" in this paper -- of a social network as to identify its community structure. In this paper, we propose a scalable and effective procedure to uncover the "nucleus" of social networks by building upon and generalizing ideas from the existing k-shell decomposition approach. We employ our approach to uncover the nucleus in several example communication, collaboration, interaction, location-based and online social networks. Our methodology is very scalable and can also be applied to massive networks (hundreds of millions of nodes and billions of edges).
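
For readers unfamiliar with the k-shell decomposition that the abstract above builds upon, the following is a minimal sketch of the standard peeling procedure only; it is not the authors' generalized nucleus method. The function name `k_shell_decomposition` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def k_shell_decomposition(edges):
    """Return {node: shell index} for an undirected graph given as an edge list.

    Standard peeling: repeatedly remove nodes whose remaining degree is at
    most k, assigning them to shell k, then move on to the next k.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {node: len(neighbours) for node, neighbours in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # The next shell index is at least the smallest remaining degree.
        k = max(k, min(degree[n] for n in remaining))
        stack = [n for n in remaining if degree[n] <= k]
        while stack:
            node = stack.pop()
            if node not in remaining:
                continue  # already peeled via another path
            remaining.discard(node)
            shell[node] = k
            for neighbour in adj[node]:
                if neighbour in remaining:
                    degree[neighbour] -= 1
                    if degree[neighbour] <= k:
                        stack.append(neighbour)
    return shell


# A triangle (a, b, c) with one pendant node d attached to a.
example = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
shells = k_shell_decomposition(example)
```

In this toy graph the pendant node falls into shell 1 and the triangle into shell 2; the innermost shell is the usual starting point for notions of a network "core".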

Viewpoint Discovery and Understanding in Social Networks

The Web has evolved into a dominant platform where everyone has the opportunity to express their opinions, to interact with other users, and to debate emerging events happening around the world. On the one hand, this has enabled the presence of different viewpoints and opinions about a - usually controversial - topic (like Brexit), but at the same time, it has led to phenomena like media bias, echo chambers and filter bubbles, where users are exposed to only one point of view on the same topic. Therefore, there is a need for methods that are able to detect and explain the different viewpoints. In this paper, we propose a graph partitioning method that exploits social interactions to enable the discovery of different communities (representing different viewpoints) discussing a controversial topic in a social network like Twitter. To explain the discovered viewpoints, we describe a method, called Iterative Rank Difference (IRD), which allows detecting descriptive terms that characterize the different viewpoints as well as understanding how a specific term is related to a viewpoint (by detecting other related descriptive terms). The results of an experimental evaluation showed that our approach outperforms state-of-the-art methods on viewpoint discovery, while a qualitative analysis of the proposed IRD method on three different controversial topics showed that IRD provides comprehensive and deep representations of the different viewpoints.

Guidelines for Online Network Crawling: A Study of Data Collection Approaches and Network Properties

Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the properties or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many online network crawling methods have been proposed, but it is not always clear which method should be used for a given network. In this paper, we perform a detailed, hypothesis-driven analysis of several online crawling algorithms, ranging from classical crawling methods to modern, state-of-the-art algorithms, with respect to the task of collecting as much data (nodes or edges) as possible given a fixed query budget. We show that the performance of these methods depends strongly on the network structure. We identify three relevant network characteristics: community separation, average community size, and average node degree. We present experiments on both real and synthetic networks, and provide guidelines to researchers regarding selection of an appropriate sampling method.
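
To make the fixed-query-budget setting in the abstract above concrete, here is a minimal sketch of one classical crawler, breadth-first search, under the assumption that each query reveals the full neighbour list of a single node. The names `bfs_crawl` and `query_neighbors` and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import deque


def bfs_crawl(query_neighbors, seed, budget):
    """Breadth-first crawl that stops after `budget` neighbour queries.

    `query_neighbors(node)` stands in for one API call returning the node's
    neighbours. Returns the set of discovered nodes and the set of observed
    undirected edges.
    """
    seen = {seed}
    queue = deque([seed])
    edges = set()
    queries = 0
    while queue and queries < budget:
        node = queue.popleft()
        queries += 1  # one unit of budget per queried node
        for neighbour in query_neighbors(node):
            edges.add(frozenset((node, neighbour)))
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen, edges


# Toy hidden network, visible to the crawler only through queries.
hidden = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}
nodes, observed_edges = bfs_crawl(hidden.__getitem__, "a", budget=2)
```

With a budget of two queries the crawler discovers four of the five nodes here; how well such a strategy does on a given network is exactly what the abstract says depends on structure.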

Under the Shadow of Sunshine: Characterizing Spam Campaigns Abusing Phone Numbers Across Online Social Networks

Cybercriminals abuse Online Social Networks (OSNs) to lure victims into a variety of spam. Among different spam types, a less explored area is OSN abuse that leverages the telephony channel to defraud users. Phone numbers are advertised via OSNs, and users are tricked into calling these numbers. To expand the reach of such scam / spam campaigns, phone numbers are advertised across multiple platforms like Facebook, Twitter, GooglePlus, Flickr, and YouTube. In this paper, we present the first data-driven characterization of cross-platform campaigns that use multiple OSN platforms to reach their victims and use phone numbers for monetization. We collect ~23M posts containing ~1.8M unique phone numbers from Twitter, Facebook, GooglePlus, YouTube, and Flickr over a period of six months. Clustering these posts helps us identify 202 campaigns operating across the globe with Indonesia, United States, India, and United Arab Emirates being the most prominent originators. We find that even though Indonesian campaigns generate the highest volume (~3.2M posts), only 1.6% of the accounts propagating Indonesian campaigns have been suspended so far. By examining campaigns running across multiple OSNs, we discover that Twitter detects and suspends ~93% more accounts than Facebook. Therefore, sharing intelligence about abuse-related user accounts across OSNs can aid in spam detection. According to our dataset, around 35K victims and $8.8M could have been saved if intelligence was shared across the OSNs. By analyzing phone number based spam campaigns running on OSNs, we highlight the unexplored variety of phone-based attacks surfacing on OSNs.

Wisdom in Sum of Parts: Multi-Platform Activity Prediction in Social Collaborative Sites

In this paper, we propose a novel framework which uses user interests inferred from activities (a.k.a., activity interests) in multiple social collaborative platforms to predict users' platform activities. Included in the framework are two prediction approaches: (i) direct platform activity prediction, which predicts a user's activities in a platform using his or her activity interests from the same platform (e.g., predict if a user answers a given Stack Overflow question using the user's interests inferred from his or her prior answer and favorite activities in Stack Overflow), and (ii) cross-platform activity prediction, which predicts a user's activities in a platform using his or her activity interests from another platform (e.g., predict if a user answers a given Stack Overflow question using the user's interests inferred from his or her fork and watch activities in GitHub). To evaluate our proposed method, we conduct prediction experiments on two widely used social collaborative platforms in the software development community: GitHub and Stack Overflow. Our experiments show that combining both direct and cross-platform activity prediction approaches yields the best accuracies for predicting user activities in GitHub (AUC=0.75) and Stack Overflow (AUC=0.89).

SESSION: Flow, Information and News

On Identifying Anomalies in Tor Usage with Applications in Detecting Internet Censorship

We develop a means to detect ongoing per-country anomalies in the daily usage metrics of the Tor anonymous communication network, and demonstrate the applicability of this technique to identifying likely periods of internet censorship and related events. The presented approach identifies contiguous anomalous periods, rather than daily spikes or drops, and allows anomalies to be ranked according to deviation from expected behaviour. The developed method is implemented as a running tool, with outputs published daily by mailing list. This list highlights per-country anomalous Tor usage, and produces a daily ranking of countries according to the level of detected anomalous behaviour. This list has been active since August 2016, and is in use by a number of individuals, academics, and NGOs as an early warning system for potential censorship events. We focus on Tor; however, the presented approach is more generally applicable to usage data of other services, both individually and in combination. We demonstrate that combining multiple data sources allows more specific identification of likely Tor blocking events. We demonstrate our approach in comparison to existing anomaly detection tools, and against both known historical internet censorship events and synthetic datasets. Finally, we detail a number of significant recent anomalous events and behaviours identified by our tool.

Web Access Literacy Scale to Evaluate How Critically Users Can Browse and Search for Web Information

We propose a web access literacy scale to assess user ability to scrutinize web information and gather accurate information using information access systems, such as web search engines.

We conducted an online study with participants recruited through a crowdsourcing service. Analysis of the questionnaire responses confirmed that the proposed web access literacy scale is reliable and valid. We also noted the following pointers: (1) Web users may not pay significant attention to web page authors and their expertise when judging information credibility. (2) Users may have weaknesses relative to the use of web search engines and tolerance for cognitive bias that appears in credibility assessment of web information.

The results of this study are expected to contribute to the design of information access systems or educational classes to encourage users to reflect on and improve their web access literacy relative to critical information seeking.

Investigating the Effects of Google's Search Engine Result Page in Evaluating the Credibility of Online News Sources

Recent research has suggested that young users are not particularly skilled in assessing the credibility of online content. A follow-up study comparing students to fact checkers noticed that students spend too much time on the page itself, while fact checkers performed "lateral reading", searching other sources. We have taken this line of research one step further and designed a study in which participants were instructed to do lateral reading for credibility assessment by inspecting Google's search engine result page (SERP) of unfamiliar news sources. In this paper, we summarize findings from interviews with 30 participants. A component of the SERP noticed regularly by the participants is the so-called Knowledge Panel, which provides contextual information about the news source being searched. While this is expected, there are other parts of the SERP that participants use to assess the credibility of the source, for example, the freshness of top stories, the panel of recent tweets, or a verified Twitter account. Given the importance attached to the presence of the Knowledge Panel, we discuss how variability in its content affected participants' opinions. Additionally, we collect the SERPs for a large number of online news sources and compare them. Our results indicate that there are widespread inconsistencies in the coverage and quality of information included in Knowledge Panels.

Observing Burstiness in Wikipedia Articles during New Disease Outbreaks

Wikipedia can be conceptualized as an open sociotechnical environment that supports communities of humans and bots that update and contest information in Wikipedia articles. This environment affords a view to community or domain interactions and reactions to salient topics, such as disease outbreaks. But do reactions to different topics vary, and how can we measure them? One widely-used approach when answering these questions is to delineate levels of burstiness (communication flows characterized by repeated bursts instead of a continuous stream) in the construction of a Wikipedia article. A literature review, however, reveals that current burstiness approaches do not fully support efforts to compare Wikipedia community reactions to different articles. Through an empirical analysis of the construction of Wikipedia health-related articles, we both extend and refine burstiness as an analytical technique to understand the community dynamics underlying the construction of Wikipedia articles. We define a method by which we can categorize burstiness as high, medium, and low. Our empirical results suggest a model of burstiness.
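
The abstract above does not say how burstiness is computed or where the category boundaries lie. As a point of reference only, the sketch below uses one well-known measure from the wider literature, the Goh-Barabási coefficient B = (σ − μ) / (σ + μ) over the gaps between consecutive edits, which ranges from −1 (perfectly regular) through 0 (Poisson-like) towards 1 (highly bursty). The thresholds in `burst_level` are hypothetical placeholders, not the paper's definitions.

```python
from statistics import mean, pstdev


def burstiness(timestamps):
    """Goh-Barabasi burstiness coefficient of a sequence of event times."""
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        raise ValueError("need at least two events")
    mu = mean(gaps)
    sigma = pstdev(gaps)
    if mu + sigma == 0:
        raise ValueError("all events share the same timestamp")
    return (sigma - mu) / (sigma + mu)


def burst_level(b, low=-0.3, high=0.3):
    """Map a coefficient to a coarse label; thresholds are illustrative only."""
    if b < low:
        return "low"
    if b > high:
        return "high"
    return "medium"


regular_edits = [0, 10, 20, 30]            # evenly spaced edits
clustered_edits = [0, 1, 2, 100, 101, 102]  # two tight bursts of edits
```

Evenly spaced edits give the minimum value of −1, while the clustered sequence gives a positive coefficient.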

Early Public Responses to the Zika-Virus on YouTube: Prevalence of and Differences Between Conspiracy Theory and Informational Videos

In this paper, we analyze the content of the most popular videos posted on YouTube in the first phase of the Zika-virus outbreak in 2016, and the user responses to those videos. More specifically, we examine the extent to which informational and conspiracy theory videos differ in terms of user activity (number of comments, shares, likes and dislikes), and the sentiment and content of the user responses. Our results show that 12 out of the 35 videos in our data set focused on conspiracy theories, but no statistical differences were found in the amount of user activity and sentiment between the two types of videos. The content of the user responses shows that users respond differently to sub-topics related to the Zika-virus. The implications of the results for future online health promotion campaigns are discussed.

SESSION: New Perspectives on Web Science

Social Gamification in Enterprise Crowdsourcing

Enterprise crowdsourcing capitalises on the availability of employees for in-house data processing. Gamification techniques can help align employees' motivation to the crowdsourcing endeavour. Although research efforts have so far unravelled the wide arsenal of gamification techniques for constructing engagement loops, little research has shed light on the social game dynamics that those techniques foster and how they impact crowdsourcing activities. This work reports on a study that involved 101 employees from two multinational enterprises. We adopt a user-centric approach to apply and experiment with gamification for enterprise crowdsourcing purposes. Through a qualitative study, we highlight the importance of the competitive and collaborative social dynamics within the enterprise. By engaging the employees with a mobile crowdsourcing application, we showcase the effectiveness of competitiveness towards higher levels of engagement and quality of contributions. Moreover, we underline the contradictory nature of those dynamics, which combined might lead to detrimental effects on engagement in crowdsourcing activities.

A Fair Share of the Work?: The Evolving Ecosystem of Crowd Workers

Crowdsourcing's ability to forge new digital, and thus not location-bound, job opportunities spurred many visions of crowdsourcing's social impact as an answer to failing economies and recessions, especially in developing countries. Yet, did the digital solution take the business world by storm and redefine the classical business process? Did it indeed mature into a stable source of income for a vast agile workforce, and did it fulfill the visions of social impact? While exploring whether the market place's visions were fulfilled or not, we uncover a whole ecosystem the workers have built to leverage their productivity and earnings in the market place. In this paper, we shed light upon this system and all of its components, thus providing insights into the inner workings of the crowdsourcing platforms from the crowd's side and drawing attention to the need for engaging in research and creating tools.

Not Every Remix is an Innovation: A Network Perspective on the 3D-Printing Community

A better understanding of how information in networks is reused or mixed has the potential to significantly contribute to the way value is exchanged under a market- or commons-based paradigm. Data as collaborative commons, distributed under creative commons licenses, can generate novel business models and significantly spur the continuing development of the knowledge society. However, looking at data reuse in a large 3D-printing community, we show that the remixing of existing 3D models is substantially influenced by bots, customizers and self-referential designs. Linking these phenomena to a more fine-grained understanding of the process and product dimensions of innovations, we conclude that remixing patterns cannot be taken as direct indicators of innovative behavior on sharing platforms. A further exploration of remixing networks in terms of their topological characteristics is suggested as a way forward. For the empirical underpinning of our arguments, we analyzed 893,383 three-dimensional designs shared by 193,254 members.

And Now for Something Completely Different: Visual Novelty in an Online Network of Designers

Novelty is a key ingredient of innovation but quantifying it is difficult. This is especially true for visual work like graphic design. Using designs shared on an online social network of professional digital designers, we measure visual novelty using statistical learning methods to compare an image's features with those of images that have been created before. We then relate social network position to the novelty of the designer's images. We find that on this professional platform, users with dense local networks tend to produce more novel but generally less successful images, with important exceptions. Namely, users making novel images while embedded in cohesive local networks are more successful.

SESSION: Methods and Practice

Perspectives on Data and Practices

There is no single thing called big data research, but a variety of diverse big data research practices in different fields. They are based on different logics, rationalities, epistemological beliefs, types of data, and even different forms of objectivity as well as concepts of theory. Practices are specific regarding their temporalities and materialities. Furthermore, big data practices are not necessarily about doing research, but also about governance. To make these differences explicit is a step towards better collaboration in the many fields of the web sciences.

Predicting Email and Article Clickthroughs with Domain-adaptive Language Models

Marketing practices have adopted the use of computational approaches in order to optimize the performance of their promotional emails and site advertisements. In the case of promotional emails, subject lines have been found to offer a reliable signal of whether the recipient will open an email or not. Clickbait headlines are also known to drive reader engagement. In this study, we explore the differences in recipients' preferences for subject lines of marketing emails from different industries, in terms of their clickthrough rates on marketing emails sent by different businesses in Finance, Cosmetics and Television industries. Different stylistic strategies of subject lines characterize high clickthroughs in different commercial verticals. For instance, words providing insight and signaling cognitive processing lead to more clickthroughs for the Finance industry; on the other hand, social words yield more clickthroughs for the Movies and Television industry. Domain adaptation can further improve predictive performance for unseen businesses by an average of 16.52% over generic industry-specific predictive models. We conclude with a discussion on the implications of our findings and suggestions for future work.

Quest for the Gold Par: Minimizing the Number of Gold Questions to Distinguish between the Good and the Bad

The benefits of crowdsourcing for data science have furthered its widespread use over the past decade. Yet fraudulent workers undermine the emerging crowdsourcing economy: requestors face the choice of either risking low quality results or having to pay extra money for quality safeguards such as gold questions or majority voting. Obviously, the more safeguards injected into the workload, the lower the risks imposed by fraudulent workers, yet the higher the costs are. So, how many of them are really needed? Is there such a 'one size fits all' number? The aim of this paper is to identify custom-tailored numbers of gold questions per worker for managing the cost/quality balance. Our new method follows real life experiences: the more we know about workers before assigning a task, the clearer our belief or disbelief in this worker's reliability gets. Employing probabilistic models, namely Bayesian belief networks and certainty factor models, our method creates worker profiles reflecting different a-priori belief values, and we prove that the actual number of gold questions per worker can indeed be assessed. Our evaluation on real-world crowdsourcing datasets demonstrates our method's efficiency in saving money while maintaining high quality results. Moreover, our method performs well despite the quite limited information known about workers in today's platforms.

Automated Discovery of Internet Censorship by Web Crawling

Censorship of the Internet is widespread around the world. As access to the web becomes increasingly ubiquitous, filtering of this resource becomes more pervasive. Transparency about specific content and information that citizens are denied access to is atypical. To counter this, numerous techniques for maintaining URL filter lists have been proposed by various individuals, organisations and researchers. These aim to improve empirical data on censorship for the benefit of the public and wider censorship research community, while also increasing the transparency of filtering activity by oppressive regimes. We present a new approach for discovering filtered domains in different target countries. This method is fully automated and requires no human interaction. The system uses web crawling techniques to traverse between filtered sites and implements a robust method for determining if a domain is filtered. We demonstrate the effectiveness of the approach by running experiments to search for filtered content in four different censorship regimes. Our results show that we perform better than the current state of the art and have built domain filter lists an order of magnitude larger than the most widely available public lists as of April 2018. Further, we build a dataset mapping the interlinking nature of blocked content between domains and exhibit the tightly networked nature of censored web resources.

SESSION: The Reality of Social Media

Worth its Weight in Likes: Towards Detecting Fake Likes on Instagram

Instagram is a significant platform for users to share media reflecting their interests. It is used by marketers and brands to reach their potential audience for advertisement. The number of likes on posts serves as a proxy for social reputation of the users, and in some cases, social media influencers with an extensive reach are compensated by marketers to promote products. This emerging market has led to users artificially bolstering the likes they get to project an inflated social worth. In this study, we enumerate the potential factors which contribute towards a genuine like on Instagram. Based on our analysis of liking behaviour, we build an automated mechanism to detect fake likes on Instagram which achieves a high precision of 83.5%. Our work serves as an important first step in reducing the effect of fake likes on the Instagram influencer market.

Public Opinion Spamming: A Model for Content and Users on Sina Weibo

Microblogs serve hundreds of millions of active users, but have also attracted large numbers of spammers. While traditional spam often seeks to endorse specific products or services, nowadays there are increasingly also paid posters intent on promoting particular views on hot topics and influencing public opinion. In this work, we fill an important research gap by studying how to detect such opinion spammers and their micro-manipulation of public opinion. Our model is unsupervised and adopts a Bayesian framework to distinguish spammers from other classes of users. Experiments on a Sina Weibo hot topic dataset demonstrate the effectiveness of the proposed approach. A further diachronic analysis of the collected data demonstrates that public opinion spammers have developed sophisticated techniques and have seen success in subtly manipulating the public sentiment.

Can We Count on Social Media Metrics?: First Insights into the Active Scholarly Use of Social Media

Measuring research impact is important for ranking publications in academic search engines and for research evaluation. Social media metrics or altmetrics measure the impact of scientific work based on social media activity. Altmetrics are complementary to traditional, citation-based metrics, e.g. allowing the assessment of new publications for which citations are not yet available. Despite the increasing importance of altmetrics, their characteristics are not well understood: Until now it has not been researched what kind of researchers are actively using which social media services and why - important questions for scientific impact prediction. Based on a survey among 3,430 scientists, we uncover previously unknown and significant differences between social media services: We identify services which attract young and experienced researchers, respectively, and detect differences in usage motivations. Our findings have direct implications for the future design of altmetrics for scientific impact prediction.

DistrustRank: Spotting False News Domains

In this paper we propose a semi-supervised learning strategy to automatically separate fake news from reliable news sources: DistrustRank. We first select a small set of unreliable news sources, manually evaluated and classified by experts on fact checking portals. Once this set is created, DistrustRank constructs a weighted graph where nodes represent websites, connected by edges based on a minimum similarity between a pair of websites. Next it computes the centrality using a biased PageRank, where a bias is applied to the selected set of seeds. As an output of the proposed model we obtain a trust (or distrust) rank that can be used in two ways: a) as a counter-bias to be applied when news about a specific subject is ranked, in order to discount possible boosts achieved by false claims; and b) to assist humans in identifying sources that are likely to be sources of fake news (or that are likely to be reputable), suggesting websites that should be examined more closely or avoided. In our experiments, DistrustRank outperforms the supervised approaches in both ranking and classification tasks.
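
The biased PageRank step mentioned in the abstract above can be illustrated with a generic personalized PageRank in which teleportation returns only to the seed nodes. This is a minimal sketch under stated assumptions, not the DistrustRank implementation: the function name `biased_pagerank`, the damping factor, the iteration count, the site names and the similarity weights are all hypothetical.

```python
def biased_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration PageRank whose teleport mass goes only to `seeds`.

    `graph` maps each node to {neighbour: edge weight}; every neighbour must
    also appear as a key, and every seed must be a node of the graph.
    """
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            if total == 0:
                # Dangling node: hand its mass back to the seeds.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
                continue
            for m, weight in graph[n].items():
                nxt[m] += damping * rank[n] * weight / total
        rank = nxt
    return rank


# Toy similarity graph: a chain of sites leading away from one known-bad seed.
sites = {
    "bad": {"near": 1.0},
    "near": {"bad": 1.0, "far": 1.0},
    "far": {"near": 1.0, "clean": 1.0},
    "clean": {"far": 1.0},
}
distrust = biased_pagerank(sites, seeds={"bad"})
```

In this toy chain the score decays with distance from the seed, which is the intuition behind using a seed-biased rank to flag sites that resemble known unreliable sources.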

SESSION: Location, Geography and Fragmentation

Where in the World Is Carmen Sandiego?: Detecting Person Locations via Social Media Discussions

In today's social media, news often spreads faster than in mainstream media, along with additional context and aspects about current affairs. Consequently, users in social networks are up-to-date with the details of real-world events and the involved individuals. Examples include crime scenes and potential perpetrator descriptions, public gatherings with rumors about celebrities among the guests, rallies by prominent politicians, concerts by musicians, etc. We are interested in the problem of tracking persons mentioned in social media, namely detecting the locations of individuals by leveraging the online discussions about them.

Existing literature focuses on the well-known and more convenient problem of user location detection in social media, mainly as the location discovery of the user profiles and their messages. In contrast, we track individuals with text mining techniques, regardless of whether they hold a social network account or not. We observe what the community shares about them and estimate their locations. Our approach consists of two steps: first, we introduce a noise filter that prunes irrelevant posts using a recursive partitioning technique. Second, we build a model that reasons over the set of messages about an individual and determines his/her locations. In our experiments, we successfully trace the last U.S. presidential candidates through millions of tweets published from November 2015 until January 2017. Our results outperform previously introduced techniques and various baselines.

Assessing Twitter Geocoding Resolution

User-defined location privacy settings on Twitter cause geolocated tweets to be placed at four different resolutions: precise, point of interest (POI), neighbourhood and city levels. The latter two levels are not described by Twitter or the API, resulting in a risk that clustered tweets are unintentionally treated as real clusters in spatial analyses. This paper outlines a framework to address these differing spatial resolutions and highlights the impact they can have on cartographic representations. As part of this framework this paper also outlines a method of discovering sources (third-party applications) that produce geolocated tweets but do not reflect genuine human activity. We found that including tweets at all spatial resolutions created an artificially inflated importance of certain locations within a city. Discovering device-level geocoded tweets was straightforward, but querying Foursquare's API was required to differentiate between neighbourhood level clusters and POIs.

Domain-Independent Detection of Emergency Situations Based on Social Activity Related to Geolocations

In general, existing methods for automatically detecting emergency situations using Twitter rely on features based on domain-specific keywords found in messages. Keyword-based methods of this type usually require training on domain-specific labeled data, using multiple languages, and for different types of events (e.g., earthquakes, floods, wildfires, etc.). In addition to being costly, these approaches may fail to detect previously unexpected situations, such as uncommon catastrophes or terrorist attacks. However, collective mentions of certain keywords are not the only type of self-organizing phenomena that may arise in social media when a real-world extreme situation occurs. Just as nearby physical sensors become activated when stimulated, localized citizen sensors (i.e., users) will also react in a similar manner. To leverage this information, we propose to use self-organized activity related to geolocations to identify emergency situations. We propose to detect such events by tracking the frequencies and probability distributions of the interarrival times of the messages related to specific locations. Using an off-the-shelf classifier that is independent of domain-specific features, we study and describe emergency situations based solely on location-based features in messages. Our findings indicate that anomalies in location-related social media user activity indeed provide information for automatically detecting emergency situations independent of their domain.
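To make the interarrival-time feature concrete, here is a minimal sketch under stated assumptions. The timestamps are invented, and the simple ratio threshold in `is_anomalous` is only a stand-in for the off-the-shelf classifier mentioned in the abstract; it is not the authors' method.

```python
from statistics import mean


def interarrival_times(timestamps):
    """Gaps (in seconds) between consecutive messages mentioning one location."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def is_anomalous(baseline, window, ratio=0.25):
    """Flag a window whose mean gap is below `ratio` times the baseline mean gap.

    `ratio` is a hypothetical threshold chosen for illustration only.
    """
    base_gaps = interarrival_times(baseline)
    window_gaps = interarrival_times(window)
    if not base_gaps or not window_gaps:
        return False
    return mean(window_gaps) < ratio * mean(base_gaps)


# Invented message timestamps (seconds) for a single location.
quiet = [0, 600, 1150, 1800, 2400, 3050]         # roughly one message per 10 minutes
burst = [3600, 3620, 3635, 3660, 3670, 3700]     # messages arriving seconds apart

print(interarrival_times(quiet))
print(is_anomalous(quiet, burst))
```

The point of the sketch is that the feature depends only on when location-related messages arrive, not on which keywords they contain.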

Pathways to Fragmentation: User Flows and Web Distribution Infrastructures

This study analyzes how web audiences flow across online digital features. We construct a directed network of user flows based on sequential user clickstreams for all popular websites (n=1761), using traffic data obtained from a panel of a million web users in the United States. We analyze these data to identify constellations of websites that are frequently browsed together in temporal sequences, both by similar user groups in different browsing sessions as well as by disparate users. Our analyses thus render visible previously hidden online collectives and generate insight into the varied roles that curatorial infrastructures may play in shaping audience fragmentation on the web.
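As a rough illustration of how sequential clickstreams can be turned into a directed flow network, consider the sketch below. The session data and site names are hypothetical, and the choice to ignore consecutive visits to the same site is an assumption of this example rather than something stated in the abstract.

```python
from collections import Counter


def build_flow_network(sessions):
    """Count directed site-to-site transitions across browsing sessions.

    Returns a Counter mapping (source, target) -> number of observed flows.
    """
    flows = Counter()
    for session in sessions:
        for source, target in zip(session, session[1:]):
            if source != target:  # assumption: skip repeated visits to one site
                flows[(source, target)] += 1
    return flows


# Hypothetical sessions: each list is one user's ordered sequence of sites.
sessions = [
    ["search.example", "news.example", "video.example"],
    ["search.example", "news.example", "news.example", "shop.example"],
    ["social.example", "video.example"],
]

flows = build_flow_network(sessions)
for (source, target), count in flows.most_common():
    print(f"{source} -> {target}: {count}")
```

The resulting weighted edge list could then be handed to any community-detection tool to look for groups of sites that are browsed together.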

SESSION: Networks, Collaboration and Participation

Locations & Languages: Towards Multilingual User Movement Analysis in Social Media

Social microblogging platforms such as Twitter have been used by many users to express their sentiments and opinions, resulting in exponentially growing amounts of heterogeneous data. This opens up new research opportunities to map many natural phenomena with such data. In this paper, we visualize various patterns related to multilingualism on social microblogging platforms. In particular, we analyze characteristics of Twitter users based on their choice of languages and the change in locations from which they tweet. The analysis we undertake assumes language and location of tweets as key factors. The results show that locations and languages are correlated with mobility patterns of multilingual Twitter users.

The Refugee/Migrant Crisis Dichotomy on Twitter: A Network and Sentiment Perspective

Media reports, political statements, and social media debates on the refugee/migrant crisis shape the ways in which people and societies respond to those displaced people arriving at their borders worldwide. These current events are framed and experienced as a crisis, entering the media, capturing worldwide political attention, and producing diverse and contradictory discourses and responses. The labels "migrant" and "refugee" are frequently distinguished and conflated in traditional as well as social media when describing the same groups of people. In this paper, we focus on the simultaneous struggle over meaning, legitimization, and power in representations of the refugee crisis, through the specific lens of Twitter. The 369,485 tweets analyzed in this paper cover two days after a picture of Alan Kurdi -- a three-year-old Syrian boy who drowned in the Mediterranean Sea while trying to reach Europe with his family -- made global headlines and sparked wide media engagement. More specifically, we investigate the existence of the dichotomy between the "deserving" refugee versus the "undeserving" migrant, as well as the relationship between sentiment expressed in tweets, their influence, and the popularity of Twitter users involved in this dichotomous characterization of the crisis. Our results show that the Twitter debate was predominantly focused on refugee related hashtags and that those tweets containing such hashtags were more positive in tone. Furthermore, we find that popular Twitter users as well as popular tweets are characterized by less emotional intensity and slightly less positivity in the debate, contrary to prior expectations. Co-occurrence networks expose the structure underlying hashtag usage and reveal a refugee-centric core of meaning, yet divergent goals of some prominent users. As social media become increasingly prominent venues for debate over a crisis, how and why people express their opinions offer valuable insights into the nature and direction of these debates.

Ego-Centric Analysis of Supportive Networks

The way we think about ourselves has a direct influence on our emotional state and our mood. Consequently, by changing the way we think we can positively influence our mood and how we respond to situations. Many studies have shown a robust relationship in which emotional support from others positively affects how we think about ourselves. Emotional support can come from many sources, such as family, friends, and neighbors, and more recently we have seen the emergence of emotional support networks. In such networks users share their moods and receive emotional support from others, and the objective is to nurture supportive relationships and build a social support network. In this paper we present an ego-centric study of supportive networks to show how user mood evolves as the user ego-network is expanded. We considered different types of ego-networks induced by gender and psychological disorders. We found that the way user mood evolves strongly depends on the type of connections that are created. The behavior of users that show mood improvement is very distinct from the behavior of users that do not show mood improvement.

Everybody Thinks Online Participation is Great - for Somebody Else: A Qualitative and Quantitative Analysis of Perceptions and Expectations of Online Participation in the Green Party Germany

Based on a case study from the Green Party Germany, we discuss the expectations and potential effects of the introduction of new online participation opportunities. These methods are often used in hopes of drawing in a wider group of participants, but existing literature on digital inequality suggests that this is unlikely to happen. Applying a mixed methods approach, we investigate how likely the expectations related to these new opportunities are to be met. We used semi-structured interviews to draw out what effects party members think online participation will have. We then conducted a survey asking members about their plans to change their behaviour. Comparing expectations to prospective behavioural changes, we find that the high hopes of both party members and leaders - to draw in those members who currently do not engage - are likely to be disappointed. Members who are better off, better educated, and already more active, will likely benefit more than those the party hopes to engage. We argue that this is linked to the prevailing digital divide, and that those who are targeted for more participation need to be more actively addressed to achieve broader participation.

SESSION: Controversy and Culture

Tweets, Death and Rock 'n' Roll: Social Media Mourning on Twitter and Sina Weibo

This paper introduces a new line of investigation into Social Media Mourning (SMM), the act of individual and collective grieving on social media. Previous research has analysed this behaviour as a response to a death within a family unit or amongst a group of friends. We report on SMM in the context of the death of a celebrity.

We present a comparative analysis of two social media platforms, Twitter and Sina Weibo (henceforth 'Weibo'). Uniquely, we have also sought to understand the feelings and attitudes of social media users who do not engage in SMM, but inevitably encounter the posts of others. This was accomplished through online surveys in both English and Chinese, representing the majority language groups of each platform.

We have critically evaluated the theoretical frameworks of slacktivism, information cascades, and herd behaviour, and found herd behaviour to be the most applicable lens for understanding our specific case study: SMM centred on the death of Chester Bennington, an American singer-songwriter best known as the lead vocalist for the group Linkin Park.

Through a mixed methods approach combining qualitative and quantitative analyses, we discovered that Twitter users, who are more likely than Weibo users to actively mourn the death of a celebrity by posting on social media, are also more likely to be emotionally affected by it. Weibo users, on the other hand, are more willing to see content mourning the death of a celebrity, but are more emotionally distanced, viewing SMM postings simply as news. Finally, although SMM is a manifestation of herd behaviour in our case study, we also point to an example where the power of the masses was successfully harnessed for real world effect.

The Shape of Arab Feminism on Facebook

Much has been said about the influence of Western culture on social movements worldwide, and this claimed influence has caused some to accuse Arabic feminism of being merely an alien import to the Arab world. New waves of feminism have arisen as a reaction to the claimed prevalent Western culture. Global Feminism argues that women worldwide experience similar subjugation in many social constructs because many cultures are based on a patriarchal past, but other waves reject the concept of a universal women's experience, stress the significance of diversity in women's experiences, and see their activities as transnational rather than global. Others expect that the confrontation of secular and Islamist paradigms will dominate. Social media has global reach, and there are signs that Facebook pages are used by feminists worldwide to boost their social and political activism. Facebook gives public pages' owners the ability to associate their pages with pages with similar ideologies. This provides a global space where feminist pages are clustered and exposes clues about their patterns of influence. By crawling Arabic feminist pages over Facebook, this paper builds a dataset that can be analysed using social network analysis tools and reveals the map of influence between the Arabic feminist network and the Western, transnational, and Global feminist networks. The map shows that Arabic women's pages are clustered in two segments: Arab feminism and Sect feminism. The latter consists of pages which distance themselves from associating with secular feminism pages, whether they are Arabic or not, and in contrast to the former, they are less likely to restrict themselves to a national identity.

Analyzing Right-wing YouTube Channels: Hate, Violence and Discrimination

As of 2018, YouTube, the major online video sharing website, hosts multiple channels promoting right-wing content. In this paper, we observe issues related to hate, violence and discriminatory bias in a dataset containing more than 7,000 videos and 17 million comments. We investigate similarities and differences between users' comments and video content in a selection of right-wing channels and compare them to a baseline set using a three-layered approach, in which we analyze (a) lexicon, (b) topics and (c) implicit biases present in the texts. Among other results, our analyses show that right-wing channels tend to (a) contain a higher degree of words from "negative" semantic fields, (b) raise more topics related to war and terrorism, and (c) demonstrate more discriminatory bias against Muslims (in videos) and towards LGBT people (in comments). Our findings shed light not only on the collective conduct of the YouTube community promoting and consuming right-wing content, but also on the general behavior of YouTube users.

SESSION: Finding Information Now and Then

Focused Crawl of Web Archives to Build Event Collections

Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as from a manually curated collection. Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past.

Decay of Relevance in Exponentially Growing Networks

We propose a new preferential attachment-based network growth model in order to explain two properties of growing networks: (1) the power-law growth of node degrees and (2) the decay of node relevance. In preferential attachment models, the ability of a node to acquire links is affected by its degree, its fitness, as well as its relevance which typically decays over time. After a review of existing models, we argue that they cannot explain the above-mentioned two properties (1) and (2) at the same time. We have found that apart from being empirically observed in many systems, the exponential growth of the network size over time is the key to sustain the power-law growth of node degrees when node relevance decays. We therefore make a clear distinction between the event time and the physical time in our model, and show that under the assumption that the relevance of a node decays with its age τ, there exists an analytical solution of the decay function f_R with the form f_R(τ) = τ^(-1). Other properties of real networks such as power-law-like degree distributions can still be preserved, as supported by our experiments. This makes our model useful in explaining and analysing many real systems such as citation networks.

Micro Archives as Rich Digital Object Representations

Digital objects as well as real-world entities are commonly referred to in literature or on the Web by mentioning their name, linking to their website or citing unique identifiers, such as DOI and ORCID, which are backed by a set of meta information. However, all of these methods have severe disadvantages and are not always suitable: they are not very precise, are not guaranteed to be persistent, or mean a big additional effort for the author, who needs to collect the metadata to describe the reference accurately. Especially for complex, evolving entities and objects like software, pre-defined metadata schemas are often not expressive enough to capture their temporal state comprehensively. We found in previous work that a lot of meaningful information about software, such as a description, rich metadata, its documentation and source code, is usually available online. However, all of this needs to be preserved coherently in order to constitute a rich digital representation of the entity. We show that this is currently not the case, as only 10% of the studied blog posts and roughly 30% of the analyzed software websites are archived completely, i.e., all linked resources are captured as well. Therefore, we propose Micro Archives as rich digital object representations, which semantically and logically connect archived resources and ensure a coherent state. With Micrawler we present a modular solution to create, cite and analyze such Micro Archives. In this paper, we show the need for this approach as well as discuss opportunities and implications for various applications also beyond scholarly writing.

SESSION: Methods and Navigation

Internet Regulation Media Coverage in Russia: Topics and Countries

Russia first introduced Internet regulation in 2012 with site blockings and then progressed to personal data retention and a ban on VPNs. This makes an interesting case because online media had spread and established a parallel political agenda in Russia in the 2000s, before the onset of regulations. The focus of this study is the contents and dynamics of media coverage of Internet regulation in Russia over the years, particularly the topics covered and the countries involved. It uses topic modeling and social network analysis to analyze 6,140 texts from Russia's largest mass media collection. The automatic modeling approach helps obtain reproducible evidence on the structure and actors of the otherwise highly politicized discourse. The study demonstrated, first, the growing interest of Russian media in Internet regulation, with comparable shares of state-controlled and private media in this discourse. Second, it revealed the structure of 50 topics arranged into nine clusters, from gambling to international relations, with one dominant network segment spanning five clusters. Third, it identified groups of countries by their appearance in the texts and co-appearance in one text as 'communities' of countries that can 'put on the map' the discourse on certain topics of Internet regulation in Russia.

EPICURE - Aspect-based Multimodal Review Summarization

Restaurant reviews are popular and a valuable source of information. Often, a large number of reviews are written for restaurants, which warrants the need for automated summarization systems. In this paper we present EPICURE, a novel text and image summarization platform. For the summarization of opinionated content like reviews, considering different aspects has largely been ignored, and we address this by creating balanced reviews for different aspects like food and service. We argue that traditional criteria for extractive review summarization such as coverage and diversity have limited applicability. We draw on the power and usefulness of submodular functions for extractive summarization and introduce novel submodular functions such as importance, freshness, purity, trustworthiness and balanced opinion. We are also one of the first to provide an image summary for different aspects of a restaurant by mapping text to images using a multimodal neural network, for which we provide initial experiments. We show the effectiveness of our platform by evaluating it against strong baselines and also use crowdsourcing experiments for a subjective comparison of our approach with existing works.
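For readers unfamiliar with submodular extractive summarization, the following sketch shows the general greedy selection scheme. It uses plain aspect coverage as the objective, which is exactly the kind of traditional criterion the abstract argues is limited; the paper's own functions (importance, freshness, purity, trustworthiness, balanced opinion) are not reproduced here, and the review data is hypothetical.

```python
def coverage(selected, reviews):
    """Number of distinct aspects mentioned by the selected reviews (submodular)."""
    covered = set()
    for idx in selected:
        covered |= reviews[idx]
    return len(covered)


def greedy_select(reviews, budget):
    """Pick up to `budget` reviews, each time adding the largest marginal gain."""
    selected = []
    for _ in range(budget):
        best, best_gain = None, 0
        for idx in range(len(reviews)):
            if idx in selected:
                continue
            gain = coverage(selected + [idx], reviews) - coverage(selected, reviews)
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:  # no remaining review adds a new aspect
            break
        selected.append(best)
    return selected


# Hypothetical reviews, each reduced to the set of aspects it discusses.
reviews = [
    {"food", "price"},
    {"food"},
    {"service", "ambience", "food"},
    {"price"},
]

summary = greedy_select(reviews, budget=3)
print(summary, coverage(summary, reviews))
```

Swapping `coverage` for another monotone submodular function leaves the greedy loop unchanged, which is what makes this family of objectives convenient for summarization.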

Query for Architecture, Click through Military: Comparing the Roles of Search and Navigation on Wikipedia

As one of the richest sources of encyclopedic information on the Web, Wikipedia generates an enormous amount of traffic. In this paper, we study large-scale article access data of the English Wikipedia in order to compare articles with respect to the two main paradigms of information seeking, i.e., search by formulating a query, and navigation by following hyperlinks. To this end, we propose and employ two main metrics to characterize articles, namely (i) searchshare -- the relative amount of views an article received by search -- and (ii) resistance -- the ability of an article to relay traffic to other Wikipedia articles. We demonstrate how articles in distinct topical categories differ substantially in terms of these properties. For example, architecture-related articles are often accessed through search and are simultaneously a "dead end" for traffic, whereas historical articles about military events are mainly navigated. We further link traffic differences to varying network, content, and editing activity features. Lastly, we measure the impact of the article properties by modeling access behavior on articles with a gradient boosting approach. The results of this paper constitute a step towards understanding human information seeking behavior on the Web.
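The searchshare metric can be illustrated with a small sketch. It assumes, based only on the wording of the abstract, that searchshare is the fraction of an article's views that arrive via search; the view counts and source labels are invented, and the paper's exact definition may differ. Resistance is not sketched because the abstract does not give enough detail to do so.

```python
def searchshare(views_by_source):
    """Fraction of an article's views that came from search (assumed definition)."""
    total = sum(views_by_source.values())
    if total == 0:
        return 0.0
    return views_by_source.get("search", 0) / total


# Invented view counts per referrer type for two hypothetical articles.
architecture_article = {"search": 800, "internal_link": 150, "other": 50}
military_article = {"search": 200, "internal_link": 700, "other": 100}

print(searchshare(architecture_article))
print(searchshare(military_article))
```

With these made-up numbers the architecture article is mostly reached by search and the military article mostly by navigation, mirroring the contrast described in the abstract.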

Using the Web of Data to Study Gender Differences in Online Knowledge Sources: the Case of the European Parliament

Gender inequalities are known to exist in Wikipedia. However, objective measures of inequality are hard to obtain, especially when comparing across languages. We study gender differences in the various Wikipedia language editions with respect to coverage of the Members of the European Parliament. This topic allows a relatively fair comparison of coverage between the (European) language editions of Wikipedia. Moreover, the availability of open data about this group allows us to relate measures of Wikipedia coverage to objective measures of their notable actions in the offline world. In addition, we measure gender differences in the content of Wikidata entries, which aggregate content from across Wikipedia language editions.