WWW '22: Companion Proceedings of the Web Conference 2022

SESSION: Alternate Track: Industry Track

WISE: Wavelet based Interpretable Stock Embedding for Risk-Averse Portfolio Management

Markowitz’s portfolio theory is the cornerstone of the risk-averse portfolio selection (RPS) problem, the core of which lies in minimizing the risk, i.e., a value calculated from a portfolio risk matrix. Because the real risk matrix is unobservable, common practice compromises by estimating it with the covariance matrix of the portfolio’s stocks computed from their historical prices, which, however, offers no interpretability of the computed risk degree. In this paper, we propose a novel RPS method named WISE based on wavelet decomposition, which not only fully exploits stock time series from both the time-domain and frequency-domain perspectives, but also has the advantage of providing interpretability of the portfolio decision from different frequency angles. In addition, in WISE, we design a theoretically guaranteed wavelet basis selection mechanism and three auxiliary enhancement tasks to adaptively find suitable wavelet parameters and to improve the representation ability of the stock embeddings, respectively. Extensive experiments conducted on three real-world datasets demonstrate WISE’s superiority over state-of-the-art portfolio selection methods in terms of return and risk. In addition, we provide a qualitative analysis of the computed risk matrices of portfolios to illustrate the interpretability of WISE’s computed risk degrees from different frequency angles.
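The signal-processing step underlying this abstract — decomposing a price series into frequency bands — can be pictured with a standard discrete wavelet transform. The sketch below uses PyWavelets with a fixed `db4` basis purely for illustration; WISE itself selects the basis adaptively and learns embeddings on top of the bands.

```python
# Minimal sketch: multi-level discrete wavelet decomposition of a price series.
# The 'db4' basis and level=3 are illustrative choices, not the paper's
# adaptively selected wavelet parameters.
import numpy as np
import pywt

prices = np.cumsum(np.random.randn(256)) + 100.0   # synthetic daily prices

# wavedec returns [approximation, detail_level3, detail_level2, detail_level1]:
# the approximation captures the low-frequency trend, the details capture
# progressively higher-frequency fluctuations (level 1 = finest band).
approx, *details = pywt.wavedec(prices, wavelet="db4", level=3)

print("trend band length:", len(approx))
for level, d in zip([3, 2, 1], details):
    print(f"detail level {level} (lower level = higher frequency): energy={np.sum(d**2):.2f}")
```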

Cyclic Arbitrage in Decentralized Exchanges

Decentralized Exchanges (DEXes) enable users to create markets for exchanging any pair of cryptocurrencies. The direct exchange rate of two tokens may not match the cross-exchange rate in the market, and such price discrepancies open up arbitrage opportunities for trading through different cryptocurrencies cyclically. In this paper, we conduct a systematic investigation of cyclic arbitrage in DEXes. We propose a theoretical framework for studying cyclic arbitrage. With our framework, we analyze the profitability conditions and optimal trading strategies of cyclic transactions. We further examine exploitable arbitrage opportunities and the market size of cyclic arbitrage with transaction-level data from Uniswap V2. We find that traders executed 292,606 cyclic arbitrages over eleven months and exploited more than 138 million USD in revenue. However, the revenue of the most profitable unexploited opportunity is persistently higher than 1 ETH (4,000 USD), which indicates that DEX markets may not be efficient enough. By analyzing how traders implement cyclic arbitrages, we find that traders can utilize smart contracts to issue atomic transactions and that such atomic implementations can mitigate traders’ financial losses from price impact during cyclic arbitrage.
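The price discrepancy behind a cyclic arbitrage reduces to simple arithmetic: a cycle of swaps is (naively) profitable when the product of its exchange rates, net of pool fees, exceeds one. The hypothetical check below ignores price impact and gas, both of which the paper's framework models explicitly; the rates and fee are made-up values.

```python
# Hedged sketch: flag a token cycle as a naive arbitrage candidate when the
# product of fee-adjusted spot rates exceeds 1. Real DEX trades also face
# price impact and gas costs, which are not modeled here.
from math import prod

def cycle_is_candidate(rates, fee=0.003):
    """rates: spot exchange rates along a cycle, e.g. ETH->DAI->USDC->ETH."""
    return prod(r * (1.0 - fee) for r in rates) > 1.0

# Example with invented rates for a three-hop cycle starting and ending in ETH.
print(cycle_is_candidate([3000.0, 1.002, 1.0 / 2960.0]))   # True -> worth a closer look
```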

An Exploratory Study of Stock Price Movements from Earnings Calls

Financial market analysis has focused primarily on extracting signals from accounting, stock price, and other numerical “hard” data reported in P&L statements or earnings per share reports. Yet, it is well known that decision-makers routinely use “soft” text-based documents that interpret the hard data they narrate. Recent advances in computational methods for analyzing unstructured and soft text-based data at scale offer possibilities for understanding financial market behavior that could improve investments and market equity. A critical and ubiquitous form of soft data is the earnings call. Earnings calls are periodic (often quarterly) statements, usually by CEOs, that attempt to influence investors’ expectations of a company’s past and future performance. Here, we study the statistical relationship between earnings calls, company sales, stock performance, and analysts’ recommendations. Our study covers a decade of observations with approximately 100,000 transcripts of earnings calls from 6,300 public companies from January 2010 to December 2019. In this study, we report three novel findings. First, the buy, sell, and hold recommendations that professional analysts make prior to an earnings call have low correlation with stock price movements after the call. Second, using our graph neural network based method that processes the semantic features of earnings calls, we reliably and accurately predict stock price movements in five major areas of the economy. Third, the semantic features of transcripts are more predictive of stock price movements than sales and earnings per share, i.e., traditional hard data, in most cases.

DC-GNN: Decoupled Graph Neural Networks for Improving and Accelerating Large-Scale E-commerce Retrieval

In large-scale E-commerce retrieval, Graph Neural Networks (GNNs) have become one of the state-of-the-art approaches due to their powerful capability for topological feature extraction and relational reasoning. However, conventional GNN-based large-scale E-commerce retrieval suffers from low training efficiency, as such scenarios normally involve billions of entities and tens of billions of relations. Under this efficiency constraint, only shallow graph algorithms can be employed, which severely limits the representation capability of GNNs and consequently weakens retrieval quality. To deal with the trade-off between training efficiency and representation capability, we propose Decoupled Graph Neural Networks (DC-GNN) to improve and accelerate GNN-based large-scale E-commerce retrieval. Specifically, DC-GNN decouples the conventional framework into three stages: pre-training, deep aggregation, and CTR prediction. By decoupling the graph operations from the CTR prediction, DC-GNN effectively improves training efficiency. More importantly, it enables deeper graph operations that adequately mine higher-order proximity to boost model performance. Extensive experiments on large-scale industrial datasets demonstrate that DC-GNN gains significant improvements in both model performance and training efficiency.

Multilingual Semantic Sourcing using Product Images for Cross-lingual Alignment

In online retail stores with ever-increasing catalogs, product search is the primary means for customers to discover products of interest. Surfacing irrelevant products can lead to a poor customer experience and, in extreme situations, loss of engagement. With recent advances in NLP, deep learning models are being used to represent queries and products in a shared semantic space to enable semantic sourcing. These models require large numbers of human-annotated (query, product, relevance) tuples, which are expensive to generate, to give competitive results. The problem becomes more prominent in emerging marketplaces/languages due to data paucity. When expanding to new marketplaces, it becomes imperative to support regional languages to reach a wider customer base and delight them with a good customer experience. Recently, in the NLP domain, approaches using parallel data corpora for training multilingual models have become prominent, but such corpora are expensive to generate. In this work, we learn semantic alignment across languages using product images as an anchor between them, which removes the need for a parallel data corpus. We use human-annotated data from an established marketplace to transfer relevance-classification knowledge to new/emerging marketplaces and thus address the data paucity problem. Our experiments on datasets from Amazon reveal that we outperform state-of-the-art baselines with 2.4%-3.65% ROC-AUC lifts on the relevance classification task across non-English marketplaces, 34.69%-51.67% Recall@k lifts on the language-agnostic retrieval task, and 6.25%-13.42% Precision@k lifts on the semantic neighborhood quality task, respectively. Our models demonstrate efficient transfer of relevance-classification knowledge from data-rich marketplaces to new marketplaces, achieving ROC-AUC lifts of 3.74%-6.25% for the relevance classification task in the zero-shot setting where human-annotated relevance data for the target marketplace is unavailable during training.

Short, Colorful, and Irreverent! A Comparative Analysis of New Users on WallstreetBets During the Gamestop Short-squeeze

WallStreetBets (WSB) is a Reddit community that primarily discusses high-risk options and stock trading. In January 2021, it attracted worldwide attention as one of the epicentres of a significant short squeeze on US markets. Following this event, the number of users and their activity increased exponentially. In this paper, we study the changes caused in the WSB community by such an increase in activity. We perform a comparative analysis between long-term users and newcomers and examine their respective writing styles, topics, and susceptibility to community feedback. We report a significant difference in post length and in the number of emojis between regular users and the new users joining WSB. Newer users’ activity also closely follows the affected companies’ stock prices. Finally, although community feedback affects the choice of topics for all users, new users are less prone to select their subsequent message topics based on past community feedback.

PEAR: Personalized Re-ranking with Contextualized Transformer for Recommendation

The goal of recommender systems is to provide ordered item lists to users that best match their interests. As a critical task in the recommendation pipeline, re-ranking has received increasing attention in recent years. In contrast to conventional ranking models that score each item individually, re-ranking aims to explicitly model the mutual influences among items to further refine the ordering of items given an initial ranking list. In this paper, we present a personalized re-ranking model (dubbed PEAR) based on a contextualized transformer. PEAR makes several major improvements over existing methods. Specifically, PEAR not only captures feature-level and item-level interactions, but also models item contexts from both the initial ranking list and the historical clicked item list. In addition to item-level ranking score prediction, we also augment the training of PEAR with a list-level classification task to assess users’ satisfaction with the whole ranking list. Experimental results on both public and production datasets have shown the superior effectiveness of PEAR compared to previous re-ranking models.

FastClip: An Efficient Video Understanding System with Heterogeneous Computing and Coarse-to-fine Processing

Recently, video media have been growing exponentially in many areas such as E-commerce shopping and gaming. Understanding video content is critical for real-world applications. However, processing long videos is usually time-consuming and expensive. In this paper, we present an efficient video understanding system that aims to speed up video processing with a coarse-to-fine two-stage pipeline and a heterogeneous computing framework. First, we use a coarse but fast multi-modal filtering module to recognize and remove useless segments from a long video; it can be deployed on an edge device and reduces the computation required for subsequent processing. Second, several semantic models are applied to finely parse the remaining sequences. To accelerate model inference, we propose a novel heterogeneous computing framework, which trains a model with lightweight and heavyweight backbones to support distributed deployment on a powerful device (e.g., cloud or GPU) and a different device (e.g., edge or CPU). In this way, the model can be both efficient and effective. The proposed system is widely used in Alibaba, including “Taobao Live Analysis” and “Commodity Short-Video Generation”, where it achieves a 10× speedup for the system.

Modeling Position Bias Ranking for Streaming Media Services

We tackle the problem of position bias estimation for streaming media services. Position bias is a widely studied topic in the ranking literature and its impact on ranking quality is well understood. Although several methods exist to estimate position bias, their applicability to an industrial setting is limited, either because they require ad-hoc interventions that harm user experience, or because their learning accuracy is poor. In this paper, we present a novel position bias estimator that overcomes these limitations: it can be applied to streaming media services without manual interventions while delivering best-in-class estimation accuracy. We compare the proposed method against existing ones on real and synthetic data and illustrate its applicability to Amazon Music.

Fair Effect Attribution in Parallel Online Experiments

A/B tests serve the purpose of reliably identifying the effect of changes introduced in online services. It is common for online platforms to run a large number of simultaneous experiments by splitting incoming user traffic randomly into treatment and control groups. Despite perfect randomization between groups, simultaneous experiments can interact with each other and negatively impact average population outcomes such as engagement metrics, which are measured globally and monitored to protect the overall user experience. It is therefore crucial to measure these interaction effects and attribute their overall impact fairly to the respective experimenters. We suggest an approach to measure and disentangle the effects of simultaneous experiments through a cost-sharing approach based on Shapley values. We also provide a counterfactual perspective that predicts shared impact based on conditional average treatment effects, making use of causal inference techniques. We illustrate our approach in real-world and synthetic data experiments.
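Shapley-value cost sharing over a handful of simultaneous experiments can be computed exactly by enumerating coalitions. The sketch below assumes a function v(S) that returns the measured (or counterfactually predicted) metric impact when exactly the experiments in S run; how v is estimated is where the paper's causal machinery comes in, and the toy numbers are invented.

```python
# Minimal Shapley-value sketch for attributing a joint metric impact to
# simultaneous experiments. v(S) is an assumed oracle for the impact of
# running exactly the experiments in S.
from itertools import combinations
from math import factorial

def shapley(players, v):
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (v(frozenset(S) | {p}) - v(frozenset(S)))
    return phi

# Toy example: A and B each cost -1.0 engagement alone but interact badly
# together (-3.0 total); the Shapley split attributes -1.5 to each.
impacts = {frozenset(): 0.0, frozenset("A"): -1.0,
           frozenset("B"): -1.0, frozenset("AB"): -3.0}
print(shapley(["A", "B"], lambda S: impacts[frozenset(S)]))
```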

On Reliability Scores for Knowledge Graphs

The Instacart KG is a central data store which contains facts regarding grocery products, ranging from taxonomic classifications to product nutritional information. With a view towards providing reliable and complete information for downstream applications, we propose an automated system that assigns these facts a score based on their reliability. This system passes data through a series of contextualized unit tests; the outcomes of these tests are aggregated in order to give each fact a discrete score: reliable, questionable, or unreliable. These unit tests are written with explainability, scalability, and correctability in mind.

ROSE: Robust Caches for Amazon Product Search

Product search engines like Amazon Search often use caches to improve the customer experience; caches can improve both system latency and search quality. However, as search traffic increases over time, the cache’s ever-growing size can diminish overall system performance. Furthermore, typos, misspellings, and redundancy, widely witnessed in real-world product search queries, can cause unnecessary cache misses, reducing the cache’s utility. In this paper, we introduce ROSE, a RObuSt cachE: a system that is tolerant to misspellings and typos while retaining the look-up cost of traditional caches. The core component of ROSE is a randomized hashing schema that enables ROSE to index and retrieve an arbitrarily large set of queries with constant memory and constant time. ROSE is also robust to query-intent variations, typos, and grammatical errors, with theoretical guarantees. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of ROSE. ROSE has been deployed in the Amazon Search Engine and has produced significant improvements over existing solutions across several key business metrics.
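The abstract does not disclose the hashing schema itself. As a purely hypothetical illustration of why character-level representations help with typos, the snippet below compares queries via character 3-gram overlap; ROSE's randomized schema replaces this kind of pairwise comparison with a constant-time, constant-memory lookup and is not shown here.

```python
# Hypothetical illustration only (NOT ROSE's schema): misspelled variants of a
# query share most of their character 3-grams, so an n-gram-based signature can
# map them to the same cache entry where an exact string key would miss.
def gram_set(query, n=3):
    q = " ".join(query.lower().split())            # normalize case and whitespace
    return {q[i:i + n] for i in range(max(1, len(q) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b)

clean, typo = "nike running shoes", "nike runnning shoes"
print(jaccard(gram_set(clean), gram_set(typo)))    # close to 1.0 -> reuse cached results
print(clean == typo)                               # an exact-match cache key would miss
```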

Semantic IR fused Heterogeneous Graph Model in Tag-based Video Search

With the rapid growth of video resources on the Internet, text-video retrieval has become a common requirement. Prior work has handled text-video retrieval with two broad categories of approaches: concept-based methods and neural semantic matching networks. Besides deep neural semantic matching models, some work mines query-video relationships from click graphs, which express users’ implicit judgments of relevance. However, click-based and concept-based models generalize poorly and hardly capture semantic information from short queries, which prevents existing methods from being fully utilized to enhance IR performance. In this paper, we propose ETHGS, a framework that combines the abilities of concept-based, click-based, and semantic-based models in IR, and we publish QVT, a new video retrieval dataset from a real-world video search engine. In ETHGS, we make use of tags (i.e., concepts) to construct a heterogeneous graph that alleviates the sparsity of click-through data. We also overcome the problem of representing long-tailed queries without graph information by fusing tag embeddings to represent queries. ETHGS further leverages semantic embeddings to revise deviant semantic information in the graph node representations. Finally, we evaluate ETHGS on QVT.

Beyond NDCG: Behavioral Testing of Recommender Systems with RecList

As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and tests must be employed to ensure the desired quality in actual deployments. We introduce RecList, a testing methodology providing a general plug-and-play framework to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box APIs, and we release it as an open source, extensible package for the community.

Privacy-Preserving Methods for Repeated Measures Designs

Evolving privacy practices have led to increasing restrictions around the collection and storage of user-level data. In turn, this has resulted in analytical challenges, such as properly estimating experimental statistics, especially in the case of long-running tests with repeated measurements. We propose a method for analyzing A/B tests which avoids aggregating and storing data at the unit level. The approach utilizes a unit-level hashing mechanism that generates and stores the first and second moments of random subsets of the original population, thus allowing statistics, such as the variance of the average treatment effect (ATE), to be estimated by bootstrap. Across a sample of past A/B tests at Netflix, we provide empirical results that demonstrate the effectiveness of the approach, and show how techniques to improve the sensitivity of experiments, such as regression adjustment, are still feasible under this new design.
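A rough sketch of the general mechanism described above — hashing units into random subsets and keeping only per-subset moments, from which the ATE and a bootstrap-style variance can be estimated — is shown below. The bucket count, hash choice, and variance estimator are illustrative assumptions, not the production design.

```python
# Rough sketch of the idea: store only per-bucket (count, sum, sum of squares)
# per treatment arm, never unit-level rows. All concrete choices here are
# illustrative assumptions.
import hashlib
import random

B = 100  # number of random subsets (buckets)

def bucket(user_id: str) -> int:
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % B

# moments[arm][b] = [count, sum_x, sum_x2]; second moments are kept so that
# within-bucket variances can also be derived without raw data.
moments = {"treatment": [[0, 0.0, 0.0] for _ in range(B)],
           "control":   [[0, 0.0, 0.0] for _ in range(B)]}

def record(user_id, arm, metric_value):
    m = moments[arm][bucket(user_id)]
    m[0] += 1
    m[1] += metric_value
    m[2] += metric_value ** 2

def mean(arm, buckets):
    n = sum(moments[arm][b][0] for b in buckets)
    s = sum(moments[arm][b][1] for b in buckets)
    return s / n if n else 0.0

def ate_with_bootstrap(n_boot=1000, seed=0):
    all_b = list(range(B))
    point = mean("treatment", all_b) - mean("control", all_b)
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        rb = [rng.randrange(B) for _ in range(B)]   # resample buckets with replacement
        draws.append(mean("treatment", rb) - mean("control", rb))
    var = sum((d - point) ** 2 for d in draws) / (n_boot - 1)
    return point, var

# Toy usage
record("u1", "treatment", 3.2)
record("u2", "control", 2.9)
print(ate_with_bootstrap())
```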

DCAF-BERT: A Distilled Cachable Adaptable Factorized Model For Improved Ads CTR Prediction

In this paper we present a click-through-rate (CTR) prediction model for product advertisement at Amazon. CTR prediction is challenging because the model needs to a) learn from text and numeric features, b) maintain low latency at inference time, and c) adapt to a temporal shift in the advertisement distribution. Our proposed model is DCAF-BERT, a novel lightweight, cache-friendly factorized model that consists of twin-structured BERT-like encoders for text with a late-fusion mechanism for tabular and numeric features. The factorization of the model allows for compartmentalised retraining, which enables the model to easily adapt to distribution shifts. The twin encoders are carefully trained to leverage historical CTR data, using a large pre-trained language model and cross-architecture knowledge distillation (KD). We empirically find the right combination of pretraining, distillation, and fine-tuning strategies for the teacher and student, which leads to a 1.7% ROC-AUC lift over the previous best model offline. In an online experiment, we show that our compartmentalised refresh strategy boosts the CTR of DCAF-BERT by 3.6% on average over the baseline model, consistently across a month.

A Multi-Task Learning Approach for Delayed Feedback Modeling

Conversion rate (CVR) prediction is one of the most essential tasks in digital display advertising. In industrial recommender systems, online learning is particularly favored for its capability to capture dynamic changes in the data distribution, which often leads to significant improvements in conversion rates. However, the gap between a click and the corresponding conversion ranges from a few minutes to days; therefore, fresh data may not have accurate labels when ingested by the training algorithm, which is known as the delayed feedback problem in CVR prediction. To solve this problem, previous works label delayed positive samples as negative and correct them at their conversion time, then optimize the expectation over the actual conversion distribution via importance sampling under the observed distribution. However, these methods approximate the actual feature distribution with the observed feature distribution, which may introduce additional bias into delayed feedback modeling. In this paper, we prove that the observed conversion rate is the product of the actual conversion rate and the observed non-delayed positive rate. We then propose the Multi-Task Delayed Feedback Model (MTDFM), which consists of two sub-networks: an actual CVR network and an NDPR (non-delayed positive rate) network. We train the actual CVR network by simultaneously optimizing the observed conversion rate and the non-delayed positive rate. The proposed method does not require the observed feature distribution to remain the same as the actual distribution. Finally, experimental results on both public and industrial datasets demonstrate that the proposed method consistently outperforms the previous state-of-the-art methods.
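One natural reading of the stated factorization can be written compactly; the notation below is ours, not necessarily the paper's, with $y_{\mathrm{obs}}$ the label observed at training time and $d$ indicating that a conversion's feedback is delayed beyond the observation window.

```latex
% Observed-rate factorization behind MTDFM (notation ours): a click shows up as
% an observed positive only if it actually converts (y = 1) AND its feedback is
% not delayed past the observation window (d = 0).
\begin{align*}
p(y_{\mathrm{obs}} = 1 \mid x)
  \;=\; p(y = 1 \mid x)\, p(d = 0 \mid y = 1, x)
  \;=\; \underbrace{p_{\mathrm{CVR}}(x)}_{\text{actual CVR network}}
        \cdot
        \underbrace{p_{\mathrm{NDPR}}(x)}_{\text{non-delayed positive rate network}} .
\end{align*}
% The two sub-networks can thus be trained jointly by fitting their product to
% the observed labels, while only p_CVR(x) is needed at serving time.
```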

Search Filter Ranking with Language-Aware Label Embeddings

A search on major eCommerce platforms returns up to thousands of relevant products, making it impossible for an average customer to audit all the results. Browsing the list of relevant items can be simplified using search filters for specific requirements (e.g., filtering out shoes of the wrong size). The complete list of available filters is often overwhelming and hard to visualize. Thus, successful user interfaces aim to display only the filters relevant to a customer’s query.

In this work, we frame the filter selection task as an extreme multi-label classification (XMLC) problem based on historical interactions with eCommerce sites. We learn from customers’ clicks and purchases which subset of filters is most relevant to their queries, treating the relevant/not-relevant signal as binary labels.

A common problem in classification settings with a large number of classes is that some classes are underrepresented. These rare categories are difficult to predict. Building on previous work, we show that classification performance for rare classes can be improved by accounting for the language structure of the class labels. Furthermore, our results demonstrate that including the language structure of category names enables relatively simple deep learning models to achieve better predictive performance than transformer networks with much higher capacity.

Multi-task Ranking with User Behaviors for Text-video Search

Text-video search has become an important demand in many industrial video sharing platforms, e.g., YouTube, TikTok, and WeChat Channels, thereby attracting increasing research attention. Traditional relevance-based ranking methods for text-video search concentrate on exploiting the semantic relevance between video and query. However, relevance is no longer the principal issue in the ranking stage, because the candidate items retrieved from the matching stage naturally guarantee adequate relevance. Instead, we argue that boosting user satisfaction should be an ultimate goal for ranking and it is promising to excavate cheap and rich user behaviors for model training. To achieve this goal, we propose an effective Multi-Task Ranking pipeline with User Behaviors (MTRUB) for text-video search. Specifically, to exploit the multi-modal data effectively, we put forward a Heterogeneous Multi-modal Fusion Module (HMFM) to fuse the query and video features of different modalities in adaptive ways. Besides that, we design an Independent Multi-modal Input Scheme (IMIS) to alleviate competing task correlation problems in multi-task learning. Experiments on the offline dataset gathered from WeChat Search demonstrate that MTRUB outperforms the baseline by 12.0% in mean gAUC and 13.3% in mean nDCG@10. We also conduct live experiments on a large-scale mobile search engine, i.e., WeChat Search, and MTRUB obtains substantial improvement compared with the traditional relevance-based ranking model.

Deriving Customer Experience Implicitly from Social Media

Organizations that focus on maximizing satisfaction and on a consistent, seamless experience throughout the entire customer journey are the ones that dominate the market. Net Promoter Score (NPS) is a widely accepted metric for measuring customer experience, and the most common way to calculate it to date is by conducting a survey. But this comes with bottlenecks: the process can be costly, low-sample, and respondent-biased, and the issues uncovered are limited to those covered by the survey questionnaire. We have devised a mechanism to approximate NPS implicitly from mentions extracted from four major social media platforms - Twitter, Facebook, Instagram, and YouTube. Our Data Cleaning pipeline discards viral and promotional content (from brands, sellers, marketplaces, or public figures), and the Machine Learning pipeline captures the different customer journey nodes specific to e-commerce (like discovery, delivery, and pricing) with their appropriate sentiment. Since the framework is generic and relies only on publicly available social media data, any organization can estimate its NPS after making suitable adjustments for its industry and geography. Our NPS model has a mean absolute percentage error (MAPE) of 1.9% and a Pearson correlation of 79%, and it enables us to understand the actual drivers at the weekly level.
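For reference, the survey-based NPS that the system approximates is computed from 0-10 "likelihood to recommend" scores as the share of promoters (9-10) minus the share of detractors (0-6); the paper replaces the survey scores with sentiment mined from social media mentions, which is not shown here.

```python
# Standard survey-based NPS: % promoters (scores 9-10) minus % detractors (0-6).
# This snippet only shows the target metric; the paper estimates it implicitly
# from social-media data instead of survey responses.
def nps(scores):
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

print(nps([10, 9, 9, 8, 7, 6, 3, 10, 5, 9]))   # 50% promoters - 30% detractors = 20.0
```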

A Cluster-Based Nearest Neighbor Matching Algorithm for Enhanced A/A Validation in Online Experimentation

Online controlled experiments are commonly used to measure how much value new features deployed to a product bring to its users. Although the experiment design is straightforward in theory, running large-scale online experiments can be quite complex. An essential step in running a rigorous experiment is to validate the balance between the buckets (a.k.a. the random samples) before proceeding to the A/B phase. This step is called A/A validation and it serves to ensure that there is no pre-existing significant difference between the test and control buckets. In this paper, we propose a new matching algorithm to assign users to buckets and improve A/A balance. It can handle a massive user base and shows improved performance compared to existing methods.
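The abstract does not spell out the algorithm, so the sketch below is a hypothetical illustration of the general idea only: cluster users on pre-experiment covariates, then deal each cluster's members round-robin into buckets so that buckets start out balanced for the A/A check. The cluster count and covariates are invented.

```python
# Hypothetical sketch (not the paper's algorithm): cluster users on
# pre-experiment covariates, then spread each cluster evenly across buckets.
import numpy as np
from sklearn.cluster import KMeans

def balanced_buckets(covariates, n_buckets=2, n_clusters=50, seed=0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(covariates)
    rng = np.random.default_rng(seed)
    buckets = np.empty(len(covariates), dtype=int)
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)
        buckets[members] = np.arange(len(members)) % n_buckets
    return buckets

users = np.random.default_rng(1).normal(size=(10_000, 5))   # fake covariates
b = balanced_buckets(users)
print([np.mean(users[b == k], axis=0).round(3) for k in range(2)])  # near-identical bucket means
```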

Informative Integrity Frictions in Social Networks

Social media platforms such as Facebook and Twitter have benefited from massive adoption in the last decade and have, in turn, facilitated the spread of harmful content, including false and misleading information. Some of this content gets massive distribution through user actions such as sharing, to the point that content removal or distribution reduction does not always stop its viral spread. At the same time, social media platforms’ efforts to implement solutions that preserve integrity are typically not transparent, so users are unaware of any integrity intervention happening on the site. In this paper we present the rationale for adding what are now visible friction mechanisms to content-sharing actions in the Facebook News Feed, their design and implementation challenges, and the results obtained when applying them on the platform. We discuss effectiveness metrics for such interventions and show their effects in terms of positive integrity outcomes, as well as in terms of bringing awareness to users that they may be making harmful content viral.

Personalized Complementary Product Recommendation

Complementary product recommendation aims at providing product suggestions that are often bought together to serve a joint demand. Existing work mainly focuses on modeling product relationships at a population level, but does not consider the personalized preferences of different customers. In this paper, we propose a framework for personalized complementary product recommendation capable of recommending products that fit the demand and preferences of the customer. Specifically, we model product relations and user preferences with a graph attention network and a sequential behavior transformer, respectively. The two networks are tied together through personalized re-ranking and contrastive learning, in which the user and product embeddings are learned jointly in an end-to-end fashion. The system recognizes different customer interests by learning from their purchase history and the correlations among customers and products. Experimental results demonstrate that our model benefits from learning personalized information and outperforms non-personalized methods on real production data.

Spot Virtual Machine Eviction Prediction in Microsoft Cloud

Azure Spot Virtual Machines (Spot VMs) utilize unused compute capacity at significant cost savings. They can be evicted when Azure needs the capacity back and are therefore suitable for workloads that can tolerate interruptions. Good predictions of Spot VM evictions help Azure optimize capacity utilization and give users the information to better plan Spot VM deployments by selecting clusters that reduce potential evictions. The current in-service cluster-level prediction method ignores node heterogeneity by aggregating node information. In this paper, we propose a spatial-temporal node-level Spot VM eviction prediction model to capture inter-node relations and time dependency. Experiments with Azure data show that our node-level eviction prediction model performs better than the node-level and cluster-level baselines.

Unsupervised Customer Segmentation with Knowledge Graph Embeddings

We propose an unsupervised customer segmentation method from behavioral data. We model sequences of beer consumption from a publicly available dataset of 2.9M reviews of more than 110,000 brands over 12 years as a knowledge graph, learn their representations with knowledge graph embedding models, and apply off-the-shelf cluster analysis. Experiments and cluster interpretation show that we learn meaningful clusters of beer customers, without relying on expensive consumer surveys or time-consuming data annotation campaigns.

SESSION: Track: Poster and Demo Track

From Discrimination to Generation: Knowledge Graph Completion with Generative Transformer

Knowledge graph completion aims to address the problem of extending a KG with missing triples. In this paper, we provide an approach, GenKGC, which converts knowledge graph completion into a sequence-to-sequence generation task with a pre-trained language model. We further introduce relation-guided demonstration and entity-aware hierarchical decoding for better representation learning and faster inference. Experimental results on three datasets show that our approach obtains better or comparable performance than baselines and achieves faster inference than previous methods based on pre-trained language models. We also release OpenBG500, a new large-scale Chinese knowledge graph dataset, for research purposes.

Web Mining to Inform Locations of Charging Stations for Electric Vehicles

The availability of charging stations is an important factor for promoting electric vehicles (EVs) as a carbon-friendly way of transportation. Hence, for city planners, the crucial question is where to place charging stations so that they achieve high utilization. Here, we hypothesize that the utilization of EV charging stations is driven by the proximity to points-of-interest (POIs), as EV owners have a certain limited willingness to walk between charging stations and POIs. To address our research question, we propose the use of web mining: we characterize the influence of different POIs from OpenStreetMap on the utilization of charging stations. For this, we present a tailored interpretable model that takes into account the full spatial distributions of both the POIs and the charging stations. This then allows us to estimate the distance and magnitude of the influence of different POI types. We evaluate our model with data from approx. 300 charging stations and 4,000 POIs in Amsterdam, Netherlands. Our model achieves superior performance over state-of-the-art baselines and, on top of that, offers an unmatched level of interpretability. To the best of our knowledge, no previous paper has quantified the POI influence on charging station utilization from real-world usage data by estimating the spatial proximity in which POIs are relevant. As such, our findings help city planners identify effective locations for charging stations.
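To make the "distance and magnitude of influence" idea concrete, a generic distance-decay specification is sketched below; the notation, the exponential kernel, and the link function are our illustrative assumptions, not the paper's exact model.

```latex
% Generic distance-decay form for POI influence on charging-station utilization
% (notation and kernel choice are illustrative assumptions, not the paper's
% specification):
\begin{align*}
\mathbb{E}[u_i] \;=\; f\!\Big(\alpha \;+\; \sum_{t \in \text{POI types}} \beta_t
  \sum_{j \in \text{POIs of type } t} \exp\!\big(-d_{ij}/\lambda_t\big)\Big),
\end{align*}
% where u_i is the utilization of station i, d_{ij} the distance from station i
% to POI j, beta_t the magnitude and lambda_t the spatial reach of POI type t --
% the two quantities such a model is designed to expose.
```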

XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages

Multiple critical scenarios require automated generation of descriptive text in low-resource (LR) languages given English fact triples: for example, generating Wikipedia text from English Infoboxes, or automatically generating non-English product descriptions from English product attributes. Previous work on fact-to-text (F2T) generation has focused on English only. Building an effective cross-lingual F2T (XF2T) system requires alignment between English structured facts and LR sentences. Either we must manually obtain such alignment data at a large scale, which is expensive, or build automated models for cross-lingual alignment. To the best of our knowledge, there has been no previous attempt at automated cross-lingual alignment or generation for LR languages. We propose two unsupervised methods for cross-lingual alignment. We contribute XAlign, an XF2T dataset with 0.45M pairs across 8 languages, of which 5,402 pairs have been manually annotated. We also train strong baseline XF2T generation models on XAlign. We make our code and dataset publicly available, and hope that this will help advance further research in this critical area.

Hypermedea: A Framework for Web (of Things) Agents

Hypermedea is an extension of the JaCaMo multi-agent programming framework for acting on Web and Web of Things environments. In this demo, the performance of Hypermedea’s Linked Data navigation and planning components is evaluated; both encapsulate computation-intensive algorithms.

GraphReformCD: Graph Reformulation for Effective Community Detection in Real-World Graphs

Community detection, one of the most important tools for graph analysis, finds groups of strongly connected nodes in a graph. However, community detection may suffer from misleading information in a graph, such as a nontrivial number of inter-community edges or an insufficient number of intra-community edges. In this paper, we propose GraphReformCD, which reformulates a given graph into a new graph in such a way that community detection can be conducted more accurately. For the reformulation, it builds a k-nearest-neighbor graph that gives each node k opportunities to connect to the nodes most likely to belong to the same community. To find such nodes, it employs structural similarities such as the Jaccard index and SimRank. To validate the effectiveness of GraphReformCD, we perform extensive experiments with six real-world and four synthetic graphs. The results show that GraphReformCD enables state-of-the-art methods to improve their community detection accuracy by up to 40.6%.
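The reformulation step can be pictured with a small sketch: score node pairs with a structural similarity (here the Jaccard index over neighbor sets) and keep each node's top-k as the edges of the new graph. The all-pairs loop and the inclusion of the node itself in its neighborhood are simplifications for clarity; SimRank and the paper's candidate-selection details are omitted.

```python
# Small sketch of the reformulation idea: rebuild the graph as a k-NN graph
# under Jaccard similarity of (closed) neighborhoods. The O(n^2) loop is for
# clarity only.
from itertools import combinations

def jaccard_knn_graph(adj, k=2):
    """adj: dict node -> set of neighbors; returns the reformulated edge set."""
    sims = {u: [] for u in adj}
    for u, v in combinations(adj, 2):
        a, b = adj[u] | {u}, adj[v] | {v}          # closed neighborhoods
        s = len(a & b) / len(a | b)
        sims[u].append((s, v))
        sims[v].append((s, u))
    edges = set()
    for u, cand in sims.items():
        for s, v in sorted(cand, reverse=True)[:k]:
            if s > 0:
                edges.add(tuple(sorted((u, v))))
    return edges

# Two triangles joined by the bridge 3-4.
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(jaccard_knn_graph(g, k=2))
```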

GraphZoo: A Development Toolkit for Graph Neural Networks with Hyperbolic Geometries

Hyperbolic spaces have recently gained prominence for representation learning in graph processing tasks such as link prediction and node classification. Several Euclidean graph models have been adapted to work in hyperbolic space and the variants have shown a significant increase in performance. However, research and development in graph modeling currently involves several tedious tasks that would benefit from standardization, including data processing, parameter configuration, and optimization tricks, and is hampered by the unavailability of public codebases. With the proliferation of new tasks such as knowledge graph reasoning and generation, the community needs a unified framework that eases the development and analysis of both Euclidean and hyperbolic graph networks, especially for new researchers in the field. To this end, we present GraphZoo, a novel framework that makes learning, designing, and applying graph processing pipelines/models systematic through abstraction over the redundant components. The framework contains a versatile library that supports several hyperbolic manifolds and an easy-to-use modular framework for graph processing tasks, which aids researchers in several ways, namely to (i) reproduce evaluation pipelines of state-of-the-art approaches, (ii) design new hyperbolic or Euclidean graph networks and compare them against state-of-the-art approaches on standard benchmarks, (iii) add custom datasets for evaluation, and (iv) add new tasks and evaluation criteria.

QuatRE: Relation-Aware Quaternions for Knowledge Graph Embeddings

We propose a simple yet effective embedding model to learn quaternion embeddings for entities and relations in knowledge graphs. Our model aims to enhance the correlation between the head and tail entities of a relation within the quaternion space via the Hamilton product. It achieves this goal by further associating each relation with two relation-aware rotations, which are used to rotate the quaternion embeddings of the head and tail entities, respectively. Experimental results show that our proposed model produces state-of-the-art performance on well-known benchmark datasets for knowledge graph completion. Our code is available at: https://github.com/daiquocnguyen/QuatRE.
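The Hamilton product at the core of quaternion KG embeddings composes two quaternions non-commutatively. The numpy sketch below shows the product and a rotate-then-compare pattern; the scoring composition is a simplified illustration, not QuatRE's exact score function, and the embedding dimension is arbitrary.

```python
# Hamilton product of quaternions q = a + bi + cj + dk, applied block-wise to
# quaternion embeddings of dimension k. The score below is only an
# illustrative rotate-then-compare composition.
import numpy as np

def hamilton(p, q):
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.stack([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2])

def normalize(q):
    return q / np.linalg.norm(q, axis=0, keepdims=True)   # unit quaternions = pure rotations

k = 8                                   # quaternion embedding dimension (arbitrary)
rng = np.random.default_rng(0)
head, tail = rng.normal(size=(4, k)), rng.normal(size=(4, k))
rel_h, rel_t = normalize(rng.normal(size=(4, k))), normalize(rng.normal(size=(4, k)))

# Rotate head and tail by their relation-aware unit quaternions, then compare.
h_rot, t_rot = hamilton(head, rel_h), hamilton(tail, rel_t)
score = float(np.sum(h_rot * t_rot))    # higher = more plausible (illustrative only)
print(score)
```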

Universal Graph Transformer Self-Attention Networks

We introduce a transformer-based GNN model, named UGformer, to learn graph representations. In particular, we present two UGformer variants: the first variant (publicized in September 2019) leverages the transformer on a set of sampled neighbors for each input node, while the second (publicized in May 2021) leverages the transformer on all input nodes. Experimental results demonstrate that the first UGformer variant achieves state-of-the-art accuracies on benchmark datasets for graph classification in both the inductive setting and the unsupervised transductive setting, and the second UGformer variant obtains state-of-the-art accuracies for inductive text classification. The code is available at: https://github.com/daiquocnguyen/Graph-Transformer.

A Two-stage User Intent Detection Model on Complicated Utterances with Multi-task Learning

As one of the most natural forms of human-machine interaction, dialogue systems, such as chatbots and intelligent customer service bots, have attracted much attention in recent years. Intents of concise user utterances can be easily detected with classic text classification or text matching models, while complicated utterances are harder to understand directly. In this paper, to improve user intent detection from complicated utterances in JIMI (JD Instant Messaging intelligence), an intelligent customer service bot designed to create an innovative online shopping experience in E-commerce, we propose MCC, a two-stage model that combines sentence Compression and intent Classification with Multi-task learning. In addition, a dialogue-oriented language model is trained to further improve the performance of MCC. Experimental results show that our model achieves good performance on both a public dataset and the JIMI dataset.

Know Your Victim: Tor Browser Setting Identification via Network Traffic Analysis

Network traffic analysis (NTA) is widely researched as a way to fingerprint users’ behavior by analyzing network traffic with machine learning algorithms. It has introduced new lines of de-anonymizing attacks [1] in the Tor network, including Website Fingerprinting (WF) and Hidden Service Fingerprinting (HSF). Previous work [4] observed that the Tor browser version may affect network traffic and claimed that obtaining identical browsing settings between users and adversaries is one of the challenges in WF and HSF. Based on this observation, we propose an NTA method to identify users’ browser settings in the Tor network. We confirm that browser settings have a notable impact on network traffic and build a classifier to identify the browser settings. The classifier achieves over 99% accuracy under the closed-world assumption. The open-world results indicate successful classification for all but one security setting option. Finally, we provide our observations and insights through feature analysis and changelog inspection.

HybEx: A Hybrid Tool for Template Extraction

HybEx is a site-level web template extractor that combines two algorithms for template and content extraction: (i) TemEx, a site-level template detection technique, and (ii) Page-level ConEx, a content extraction technique. The key idea is to add a preprocessing step to TemEx that removes the main content inferred by Page-level ConEx. Adding this new phase to TemEx increases its runtime; however, the increase is very small compared to the TemEx runtime because Page-level ConEx is a page-level technique. On the other hand, HybEx improves the precision, recall, and F1 of TemEx. This paper describes the new template extractor and its internal architecture, and presents the results of its empirical evaluation.

Knowledge Distillation for Discourse Relation Analysis

Automatically identifying discourse relations can help many downstream NLP tasks such as reading comprehension. The task can be categorized into explicit and implicit discourse relation recognition (EDRR and IDRR). Due to the lack of connectives, IDRR remains a big challenge. In this paper, we take the first step toward exploiting the knowledge distillation (KD) technique for discourse relation analysis. Our target is to train a focused single-data, single-task student with the help of a general multi-data, multi-task teacher. Specifically, we first train one teacher for both the top-level and second-level relation classification tasks with explicit and implicit data. We then transfer the feature embeddings and soft labels from the teacher network to the student network. Extensive experimental results on the popular PDTB dataset show that our model achieves a new state-of-the-art performance. We also show the effectiveness of our proposed KD architecture through detailed analysis.

User Donations in Online Social Game Streaming: The Case of Paid Subscription in Twitch.tv

Online social game streaming has proliferated with the rise of communities like Twitch.tv and YouTube Gaming. Beyond entertainment, they have become vibrant communities where streamers and viewers interact and support each other, and the phenomenon of user donation is rapidly emerging within them. In this article, we provide a publicly available (anonymized) dataset and conduct an in-depth analysis of user donations (made through paid user subscriptions) on Twitch, a worldwide popular online social game streaming community.

Based on information about over 2.77 million subscription relationships worth in total over 14.1 million US dollars, we reveal the scale and diversity of the paid user subscriptions received and made. Among other results, we find that (i) the paid subscriptions received and made are highly skewed, (ii) the majority of streamers are casual streamers who only come online occasionally, while regular streamers often stream in multiple categories and receive more paid subscriptions, in total as well as per streaming hour, and (iii) a considerable number of viewers on Twitch subscribe to multiple streamers; most viewers support their streamers moderately, while a small group of devoted fans are willing to pay more and for longer. Our discussion and findings shed light on how to maintain community prosperity and provide a significant reference for system design.

COCTEAU: an Empathy-Based Tool for Decision-Making

Traditional approaches to data-informed policymaking are often tailored to specific contexts and lack strong citizen involvement and collaboration, which are required to design sustainable policies. We argue for the importance of empathy-based methods in the policymaking domain, given their successes in diverse settings such as healthcare and education. In this paper, we introduce COCTEAU (Co-Creating The European Union), a novel framework built on the combination of empathy and gamification to create a tool aimed at strengthening interactions between citizens and policy-makers. We describe our design process and our concrete implementation, which has already undergone preliminary assessments with different stakeholders, and we briefly report pilot results from these assessments. Finally, we describe the structure and goals of our demonstration with regard to the newfound formats and organizational aspects of academic conferences.

Personal Attribute Prediction from Conversations

Personal knowledge bases (PKBs) are critical to many applications, such as Web-based chatbots and personalized recommendation. Conversations containing rich personal knowledge can be regarded as a main source for populating the PKB. Given a user, a user attribute, and user utterances from a conversational system, we aim to predict the personal attribute value for the user, which is helpful for the enrichment of PKBs. However, three issues exist in previous studies: (1) manually labeled utterances are required for model training; (2) personal attribute knowledge embedded in both utterances and external resources is underutilized; (3) the performance on predicting some difficult personal attributes is unsatisfactory. In this paper, we propose DSCGN, a framework based on a pre-trained language model with a noise-robust loss function that predicts personal attributes from conversations without requiring any labeled utterances. We derive two categories of supervision, i.e., document-level supervision via a distant supervision strategy and contextualized word-level supervision via a label-guessing method, by mining the personal attribute knowledge embedded in both unlabeled utterances and external resources to fine-tune the language model. Extensive experiments on two real-world data sets (a profession data set and a hobby data set) show that our framework obtains the best performance compared with all twelve baselines in terms of nDCG and MRR.

Multi-task GNN for Substitute Identification

Substitute product recommendation is important for improving customer satisfaction in the E-commerce domain. E-commerce naturally provides rich sources of substitute relationships; for example, customers purchase a substitute product when the viewed product is sold out. However, existing recommendation systems usually learn product substitution correlations without jointly considering the various customer behavior sources. In this paper, we propose a unified multi-task heterogeneous graph neural network (M-HetSage), which captures the complementary information across various customer behavior data sources. This allows us to exploit synergy across sources with different attributes and quality. Moreover, we introduce a list-aware average precision (LaAP) loss, which exploits correlations among lists of substitutes and non-substitutes by directly optimizing an approximation of the target ranking metric. On top of that, LaAP leverages a list-aware attention mechanism to differentiate substitute quality for better recommendations. Comprehensive experiments on Amazon proprietary datasets demonstrate the superiority of our M-HetSage framework equipped with the LaAP loss, showing 33%+ improvements in NDCG and mAP metrics compared to a traditional HetSage optimized with a single triplet loss that does not differentiate customer behavior data sources.

Unsupervised Post-Time Fake Social Message Detection with Recommendation-aware Representation Learning

This paper deals with a more realistic scenario of fake message detection on social media, i.e., unsupervised post-time detection. Given a source message, our goal is to determine whether it is fake without using labeled data and without requiring users to have interacted with the given message. We present a novel learning framework, Recommendation-aware Message Representation (RecMR), to achieve this goal. The key idea is to learn user preferences and encode them into the representation of the source message by jointly training the tasks of user recommendation and binary detection. Experiments conducted on two real Twitter datasets exhibit the promising performance of RecMR and show the effectiveness of recommended users in unsupervised detection.

Cross-Language Learning for Product Matching

Transformer-based entity matching methods have significantly moved the state of the art for less-structured matching tasks such as matching product offers in e-commerce. In order to excel at these tasks, Transformer-based matching methods require a decent amount of training pairs. Providing enough training data can be challenging, especially if a matcher for non-English product descriptions is to be learned. Using the use case of matching product offers from different e-shops, this poster explores to what extent it is possible to improve the performance of Transformer-based matchers by complementing a small set of training pairs in the target language, German in our case, with a larger set of English-language training pairs. Our experiments using different Transformers show that extending the German set with English pairs improves matching performance in all cases. The impact of adding the English pairs is especially high in low-resource settings in which only a rather small number of non-English pairs is available. As it is often possible to automatically gather English training pairs from the Web by exploiting schema.org annotations, our results are relevant for many product matching scenarios targeting low-resource languages.

A Graph Temporal Information Learning Framework for Popularity Prediction

Effectively predicting the future popularity of online content has important implications in a wide range of areas, including online advertising, user recommendation, and fake news detection. Existing approaches mainly tackle the popularity prediction task via path modeling or discrete graph modeling. However, most of them heavily exploit the underlying diffusion structure and sequential information, while ignoring the temporal evolution among different snapshots of cascades. In this paper, we propose a graph temporal information learning framework based on an improved graph convolutional network (GTGCN), which can capture both the temporal information governing the spread of information within a snapshot and the inherent temporal dependencies among different snapshots. We validate the effectiveness of GTGCN by applying it to a Sina Weibo dataset in the scenario of predicting retweet cascades. Experimental results demonstrate the superiority of our proposed method over state-of-the-art approaches.

PREP: Pre-training with Temporal Elapse Inference for Popularity Prediction

Predicting the popularity of online content is a fundamental problem in various applications. One practical challenge takes root in the varying length of the observation time or prediction horizon: a good model for popularity prediction should handle various prediction settings. However, most existing methods adopt a separate training paradigm for each prediction setting, and the model obtained for one setting is difficult to generalize to others, causing a great waste of computational resources and a large demand for downstream labels. To solve these issues, we propose PREP, a novel pre-training framework for popularity prediction that pre-trains a general representation model from readily available unlabeled diffusion data, which can then be effectively transferred to various prediction settings. We design a novel pretext task for pre-training, i.e., temporal elapse inference for two randomly sampled time slices of popularity dynamics, impelling the representation model to learn intrinsic knowledge about popularity dynamics. Experimental results on two real datasets demonstrate the generalization ability and efficiency of the pre-training framework for different popularity prediction task settings.

Supervised Contrastive Learning for Product Matching

Contrastive learning has moved the state of the art for many tasks in computer vision and information retrieval in recent years. This poster is the first work that applies supervised contrastive learning to the task of product matching in e-commerce using product offers from different e-shops. More specifically, we employ a supervised contrastive learning technique to pre-train a Transformer encoder which is afterward fine-tuned for the matching task using pair-wise training data. We further propose a source-aware sampling strategy that enables contrastive learning to be applied for use cases in which the training data does not contain product identifiers. We show that applying supervised contrastive pre-training in combination with source-aware sampling significantly improves the state-of-the-art performance on several widely used benchmarks: for Abt-Buy, we reach an F1-score of 94.29 (+3.24 compared to the previous state of the art), and for Amazon-Google 79.28 (+3.7). For the WDC Computers datasets, we reach improvements of between +0.8 and +8.84 in F1-score depending on the training set size. Further experiments with data augmentation and self-supervised contrastive pre-training show that the former can be helpful for smaller training sets, while the latter leads to a significant decline in performance due to inherent label noise. We thus conclude that contrastive pre-training has a high potential for product matching use cases in which explicit supervision is available.
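The pre-training objective referenced here is the standard supervised contrastive loss, which pulls offers of the same product together and pushes all others apart. Below is a small batch-level numpy version for illustration; the paper's source-aware sampling and Transformer encoder are not shown, and the random embeddings stand in for encoder outputs.

```python
# Standard supervised contrastive loss (batch version): for each anchor, offers
# with the same product label are positives, all other offers are negatives.
import numpy as np
from scipy.special import logsumexp

def sup_con_loss(embeddings, labels, temperature=0.07):
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, sim)           # never contrast a sample with itself
    log_prob = logits - logsumexp(logits, axis=1, keepdims=True)
    labels = np.asarray(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos_mask.any(axis=1)
    mean_log_prob_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1) / np.maximum(pos_mask.sum(axis=1), 1)
    return -mean_log_prob_pos[has_pos].mean()

# Toy batch: random "encoder outputs" for three products with two offers each.
emb = np.random.default_rng(0).normal(size=(6, 16))
print(sup_con_loss(emb, labels=[0, 0, 1, 1, 2, 2]))
```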

QAnswer: Towards Question Answering Search over Websites

Question Answering (QA) is increasingly used by search engines to provide results to their end-users, yet very few websites currently use QA technologies for their search functionality. To illustrate the potential of QA technologies for website search practitioners, we demonstrate web searches that combine QA over knowledge graphs and QA over free text, which are usually tackled separately. We also discuss the benefits and drawbacks of both approaches for website search. We use case studies of websites hosted by the Wikimedia Foundation (namely Wikipedia and Wikidata). Unlike a general search engine (e.g., Google or Bing), we index the data integrally, i.e., we do not index only a subset, and exclusively, i.e., we index only data available on the corresponding website.

Scriptoria: A Crowd-powered Music Transcription System

In this demo we present Scriptoria, an online crowdsourcing system that tackles the complex transcription process of classical orchestral scores. The system’s requirements are based on feedback from experts who are members of classical orchestras. The architecture enables an end-to-end transcription process (from PDF to MEI) using a scalable microtask design. Reliability, stability, and task and UI design were also evaluated and improved through focus group discussions. Finally, we gathered valuable comments on the transcription process itself, alongside future additions that could greatly enhance current practices in the field.

SHACL and ShEx in the Wild: A Community Survey on Validating Shapes Generation and Adoption

Knowledge Graphs (KGs) are widely used to represent heterogeneous domain knowledge on the Web and within organizations. Various methods exist to manage KGs and ensure the quality of their data. Among these, the Shapes Constraint Language (SHACL) and the Shapes Expression Language (ShEx) are the two state-of-the-art languages for defining validating shapes for KGs. Since the usage of these constraint languages has recently increased, new needs have arisen. One such need is the efficient generation of these shapes. Yet, since these languages are relatively new, there is a lack of understanding of how they are effectively employed for existing KGs. Therefore, in this work, we address the question: how are validating shapes being generated and adopted? Our contribution is threefold. First, we conducted a community survey to analyze the needs of users (both from industry and academia) generating validating shapes. Then, we cross-referenced our results with an extensive survey of existing tools and their features. Finally, we investigated how existing automatic shape extraction approaches work in practice on real, large KGs. Our analysis shows the need for semi-automatic methods that can help users generate shapes from large KGs.

Towards Knowledge-Driven Symptom Monitoring & Trigger Detection of Primary Headache Disorders

Headache disorders are experienced by many people around the world. In current clinical practice, the follow-up and diagnosis of headache disorder patients only happen intermittently, based on subjective data self-reported by the patient. The mBrain system tries to make this process more continuous, autonomous and objective by additionally collecting contextual and physiological data via a wearable, a mobile app and machine learning algorithms. To support the monitoring of headache symptoms during attacks for headache classification and the detection of headache triggers, much knowledge and contextual data are available from heterogeneous sources, which can be consolidated with semantics. This paper presents a demonstrator of knowledge-driven services that perform these tasks using Semantic Web technologies. These services are deployed in a distributed cascading architecture that includes DIVIDE to derive and manage the RDF stream processing queries that perform the contextually relevant filtering in an intelligent and efficient way.

Using Schema.org and Solid for Linked Data-based Machine-to-Machine Sales Contract Conclusion

We present a demo in which two robotic arms, controlled by rule-based Linked Data agents, trade a good in Virtual Reality. The agents follow the necessary steps to conclude a sales contract under German law. To conclude the contract, the agents exchange messages between their Solid Pods. The data in the messages is modelled using suitable terms from Schema.org.

VisPaD: Visualization and Pattern Discovery for Fighting Human Trafficking

Human trafficking analysts investigate groups of related online escort advertisements (called micro-clusters) to detect suspicious activities and identify various modus operandi. This task is complex as it requires finding patterns and linked meta-data across micro-clusters, such as the geographical spread of ads, cluster sizes, etc. Additionally, drawing insights from the data is challenging without visualizing these micro-clusters. To address this, in close collaboration with domain experts, we built VisPaD, a novel interactive tool for characterizing and visualizing micro-clusters and their associated meta-data, all in one place. VisPaD helps discover underlying patterns in the data by projecting micro-clusters into a lower-dimensional space. It also allows the user to select micro-clusters involved in suspicious patterns and interactively examine them, leading to faster detection and identification of trends in the data. A demo of VisPaD is also released.

ECCE: Entity-centric Corpus Exploration Using Contextual Implicit Networks

In the Digital Age, the analysis and exploration of unstructured document collections is of central importance to members of investigative professions, whether they are scholars, journalists, paralegals, or analysts. In many of their domains, entities play a key role in the discovery of implicit relations between the contents of documents and thus serve as natural entry points for a detailed manual analysis, such as the prototypical 5Ws in journalism or stock symbols in finance. To assist in these analyses, entity-centric networks have been proposed as a language model that represents document collections as a co-occurrence graph of entities and terms, and thereby enables the visual exploration of corpora. Here, we present ECCE, a web-based application that implements entity-centric networks, augments them with contextual language models, and provides users with the ability to upload, manage, and explore document collections. Our application is available as a web-based service at http://dimtools.uni.kn/ecce.

Towards Preserving Server-Side Privacy of On-Device Models

Machine learning-based predictions are popular in many applications including healthcare, recommender systems and finance. More recently, the development of low-end edge hardware (e.g., Apple’s Neural Engine and Intel’s Movidius VPU) has provided a path for the proliferation of machine learning on the edge with on-device modeling. Modeling on the device reduces latency and helps maintain the user’s privacy. However, on-device modeling can leak private server-side information. In this work, we investigate on-device machine learning models that are used to provide a service and propose novel privacy attacks that can leak sensitive proprietary information of the service provider. We demonstrate that different adversaries can easily exploit such models to maximize their profit and accomplish content theft. Motivated by the need to preserve both client and server privacy, we present preliminary ideas on thwarting such attacks.

Demo: PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs

Blacklists are a widely used Internet security mechanism to protect Internet users from financial scams, malicious web pages, and other cyber attacks based on blacklisted URLs. This demo introduces PhishChain, a transparent and decentralized system for blacklisting phishing URLs. At present, public/private domain blacklists, such as PhishTank, CryptoScamDB, and APWG, are maintained by a centralized authority, but operate in a crowdsourcing fashion to create a manually verified blacklist periodically. In addition to being a single point of failure, the blacklisting process utilized by such systems is not transparent. We utilize blockchain technology to support transparency and decentralization, where no single authority controls the blacklist and all operations are recorded in an immutable distributed ledger. Further, we design a PageRank-based truth discovery algorithm to assign a phishing score to each URL based on crowdsourced assessments of URLs. As an incentive for voluntary participation, we assign skill points to each user based on their participation in URL verification.
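
The abstract does not spell out the scoring algorithm; the sketch below only illustrates the general idea of PageRank-style truth discovery on a bipartite user–URL assessment graph using networkx. The nodes, edge weights, and damping factor are made up and are not PhishChain's.

```python
# Illustrative only: users vote on URLs; a PageRank-style score propagates between
# user skill and URL phishing likelihood on a bipartite graph.
import networkx as nx

G = nx.DiGraph()
votes = [                # (user, url, weight of the "this is phishing" vote)
    ("u1", "http://phish.example/login", 1.0),
    ("u2", "http://phish.example/login", 1.0),
    ("u3", "http://benign.example/",     0.2),
]
for user, url, w in votes:
    G.add_edge(user, url, weight=w)   # user endorses the URL as phishing
    G.add_edge(url, user, weight=w)   # URL "rewards" the users who assessed it

scores = nx.pagerank(G, alpha=0.85, weight="weight")
phishing_scores = {n: s for n, s in scores.items() if n.startswith("http")}
user_skill      = {n: s for n, s in scores.items() if n.startswith("u")}
print(phishing_scores)
print(user_skill)
```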

Technology Growth Ranking Using Temporal Graph Representation Learning

A key component of technology sector business strategy is understanding the mechanisms by which technologies are adopted and the rate of their growth over time. Furthermore, predicting how technologies grow in relation to each other informs business decision-making in terms of product definition, research and development, and marketing strategies. An important avenue for exploring technology trends is by looking at activity in the software community. Social networks for developers can provide useful technology trend insights and have an inherent temporal graph structure. We demonstrate an approach to technology growth ranking that adapts spatiotemporal graph neural networks to work with structured temporal relational graph data.

Linking Streets in OpenStreetMap to Persons in Wikidata

Geographic web sources such as OpenStreetMap (OSM) and knowledge graphs such as Wikidata are often unconnected. An example of a connection that can be established between these sources is a link between a street in OSM and the person in Wikidata it was named after. This paper presents StreetToPerson, an approach for connecting streets in OSM to persons in a knowledge graph based on relations in the knowledge graph and spatial dependencies. Our evaluation shows that we outperform existing approaches by 26 percentage points. In addition, we apply StreetToPerson to all OSM streets in Germany, for which we identify more than 180,000 links between streets and persons.
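
As a hedged illustration of the kind of Wikidata lookup such linking relies on (not the paper's StreetToPerson pipeline), the sketch below retrieves candidate persons whose German label matches a street's name stem via the public Wikidata SPARQL endpoint and SPARQLWrapper. The street name, the name-stem heuristic, and the exact-match strategy are assumptions.

```python
# Candidate lookup sketch: humans (wdt:P31 wd:Q5) in Wikidata whose German label
# equals the street's name stem. The street name below is hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

street = "Konrad-Adenauer-Straße"
name_stem = street.rsplit("-", 1)[0].replace("-", " ")   # -> "Konrad Adenauer"

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="street-to-person-demo/0.1")
sparql.setQuery(f"""
SELECT ?person WHERE {{
  ?person wdt:P31 wd:Q5 ;              # instance of human
          rdfs:label "{name_stem}"@de .
}}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["person"]["value"])        # candidate entities to disambiguate further
```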

Effectiveness of Data Augmentation to Identify Relevant Reviews for Product Question Answering

With the rapid growth of e-commerce and an increasing number of questions posted on the Question Answer (QA) platforms of e-commerce websites, there is a need for providing automated answers to questions. In this paper, we use transformer-based review ranking models which provide a ranked list of reviews as a potential answer to a new question. Since no explicit training data is available, we exploit the product reviews along with available QA pairs to learn a relevance function between a question and a review sentence. Further, we present a data augmentation technique that fine-tunes the T5 model to generate new questions from customer reviews, treating the summary of a review as the answer and the review as the document. We conduct experiments on a real-world dataset covering three categories on Amazon.com. To assess the performance of the models, we use the annotated question–review dataset from RIKER [13]. Experimental results show that the Deberta-RR model with the augmentation technique outperforms the current state-of-the-art model by 5.84%, 4.38%, 3.96%, and 2.96% on average in nDCG@1, nDCG@3, nDCG@5, and nDCG@10, respectively.
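
As a hedged sketch of the augmentation step, the snippet below generates a question from a (summary, review) pair with T5 via Hugging Face transformers. It uses an off-the-shelf rather than fine-tuned checkpoint, and the prompt format is an assumption rather than the paper's setup.

```python
# T5-based question generation sketch (inference only). In practice a checkpoint
# fine-tuned on review/question pairs would be needed for useful questions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

review  = "The battery easily lasts two days with moderate use."
summary = "great battery life"   # treated as the 'answer'

# One plausible input format: answer plus context, asking the model for a question.
prompt = f"generate question: answer: {summary} context: {review}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

output_ids = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```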

Does Evidence from Peers Help Crowd Workers in Assessing Truthfulness?

Misinformation has been spreading rapidly online. The current approach to dealing with it is to deploy expert fact-checkers who follow forensic processes to identify the veracity of statements. Unfortunately, such an approach does not scale well. To address this, crowdsourcing has been looked at as an opportunity to complement the work of trained journalists. In this work, we look at the effect of presenting the crowd with evidence from others while judging the veracity of statements. We implement several variants of the judgment task design to understand if and how the presented evidence affects the way crowd workers judge truthfulness and their performance. Our results show that, in certain cases, the presented evidence may mislead crowd workers who would otherwise be more accurate if judging independently from others. Those who made correct use of the provided evidence, however, could benefit from it and generate better judgments.

Graph-level Semantic Matching model for Knowledge base Aggregate Question Answering

In knowledge base question answering, complex questions often exhibit long-distance dependencies, especially aggregate questions, which affect query graph matching. Many previous approaches have made conspicuous progress in complex question answering. However, they mostly compare queries based only on the textual similarity of the predicate sequences, ignoring the semantic information of either the questions or the query graphs. In this paper, we propose a Graph-level Semantic Matching (GSM) model to obtain a global semantic representation. Due to the structural complexity of query graphs, we propose a global semantic model that explicitly encodes the structural and relational semantics of query graphs. Then, a question-guiding mechanism is applied to enhance the understanding of question semantics in the query graph representation. Finally, GSM outperforms existing question answering models and exhibits the capability to deal with aggregate questions, e.g., correctly handling counting and comparison in questions.

SESSION: Alternate Track: PhD Symposium

Query-Driven Graph Processing

Graphs are data model abstractions that are becoming pervasive in several real-life applications and practical use cases. In these settings, users primarily focus on entities and their relationships, further enhanced with multiple labels and properties to form the so-called property graphs. The processing of property graphs relies on graph queries that are used to extract and manipulate entities and relationships [2]. Whereas property graphs can be defined in a schema-less fashion, schema constraints and schema concepts are important assets when considering complex graph processing [1, 3, 6]. As witnessed by ongoing standardization activities for property graphs and their query language [5], graph processing systems are prone to enable query-driven, schema-driven or constraint-driven tasks. In this talk, I will focus on our work on graph processing in the static and streaming environments [7, 9], as well as schema inference methods [6], graph constraint and query workload designs tailored for property graphs [1, 4]. I will conclude by pinpointing future directions for graph processing and graph analytics inspired by our recent community-wide vision paper [8].

Interactions in Information Spread

Large quantities of data flow on the internet. When a user decides to help the spread of a piece of information (by retweeting, liking, or posting content), most research works assume she does so according to the information’s content, its publication date, the user’s position in the network, the platform used, etc. However, there is another aspect that has received little attention in the literature: information interaction. The idea is that a user’s choice is partly conditioned by the previous pieces of information she has been exposed to. In this document, we review the work done on interaction modeling and underline several aspects of interactions that complicate their study. Then, we present an approach seemingly fit to answer those challenges and detail a dedicated interaction model based on it. We show our approach fits the problem better than existing methods, and present leads for future work. Throughout the text, we show that taking interactions into account improves our comprehension of information interaction processes in real-world datasets, and argue that this aspect of information spread should not be neglected when modeling spreading processes.

Canonicalisation of SPARQL 1.1 Queries

SPARQL is the standard query language for RDF, as recommended by the W3C. It is a highly expressive query language that contains the standard operations based on set algebra, as well as navigational operations found in graph query languages. Because of this, there are various ways to represent the same query, which may lead to redundancy in applications of the Semantic Web such as caching systems. We propose a canonicalisation method as a solution, such that all (monotone) congruent queries will have the same canonical form. Despite the theoretical complexity of this problem, our experiments show good performance over real-world queries. Finally, we anticipate applications in caching, log analysis, query optimisation, etc.

Towards Automated Technologies in the Referencing Quality of Wikidata

Wikidata is a general-purpose knowledge graph whose content is crowd-sourced through an open wiki, along with bot accounts. The Wikidata data model enables assigning references to every single statement. Currently, there are more than 1 billion statements in Wikidata, of which about 70% have references. Due to the rapid growth of Wikidata, the quality of Wikidata references is not well covered in the literature. To fill this gap, we suggest using automated tools to verify and improve the quality of Wikidata references. For verifying reference quality, we develop a comprehensive referencing assessment framework based on Data Quality dimensions and criteria. Then, we implement the framework as automated reusable scripts. To improve reference quality, we use Relation Extraction methods to establish a reference-suggesting framework for Wikidata. During the research, we developed a subsetting approach to create a comparison platform and handle the large size of Wikidata. We also investigated reference statistics in 6 Wikidata topical subsets. The results of the latter investigation indicate the need for a wider assessment framework, which we aim to address in this dissertation.

User Access Models to Event-Centric Information

Events such as terrorist attacks and Brexit play an important role in the research of social scientists and Digital Humanities researchers. They need innovative user access models to event-centric information which support them throughout their research. Current access models such as recommendation and information retrieval methods often fail to adequately capture essential features of events and provide acceptable access to event-centric information. This PhD research aims to develop efficient and effective user access models to event-centric information by leveraging well-structured information in Knowledge Graphs. The goal is to tackle the challenges researchers encounter during their research in a workflow, from exploratory search to accessing well-defined and complete information collections. This paper presents the specific research questions and presents the approach and preliminary results.

Geometric and Topological Inference for Deep Representations of Complex Networks

Understanding the deep representations of complex networks is an important step of building interpretable and trustworthy machine learning applications in the age of internet. Global surrogate models that approximate the predictions of a black box model (e.g. an artificial or biological neural net) are usually used to provide valuable theoretical insights for the model interpretability. In order to evaluate how well a surrogate model can account for the representation in another model, we need to develop inference methods for model comparison. Previous studies have compared models and brains in terms of their representational geometries (characterized by the matrix of distances between representations of the input patterns in a model layer or cortical area). In this study, we propose to explore these summary statistical descriptions of representations in models and brains as part of a broader class of statistics that emphasize the topology as well as the geometry of representations. The topological summary statistics build on topological data analysis (TDA) and other graph-based methods. We evaluate these statistics in terms of the sensitivity and specificity that they afford when used for model selection, with the goal to relate different neural network models to each other and to make inferences about the computational mechanism that might best account for a black box representation. These new methods enable brain and computer scientists to visualize the dynamic representational transformations learned by brains and models, and to perform model-comparative statistical inference.

Predicting SPARQL Query Dynamics

The SPARQL language is the recommendation for querying Linked Data, but querying SPARQL endpoints has problems with performance, particularly when clients remotely query SPARQL endpoints over the Web. Traditionally, caching techniques have been used to deal with performance issues by allowing the reuse of intermediate data and results across different queries. However, the resources in Linked Data represent real-world things which change over time. The resources described by these datasets are thus continuously created, moved, deleted, linked, and unlinked, which may lead to stale data in caches. This situation is more critical in the case of applications that consume or interact intensively with Linked Data through SPARQL, including query engines and browsers that constantly send expensive and repetitive queries. Applications that leverage Linked Data could benefit from knowledge about the dynamics of changing query results to efficiently deliver accurate services, since they could refresh at least the dynamic part of the queries. Along these lines, we want to address open questions in terms of assessing the dynamics of SPARQL query results in order to improve the way applications access dynamic Linked Data, making queries more efficient and ensuring fresher results.

Personal Knowledge Graphs: Use Cases in e-learning Platforms

Personal Knowledge Graphs (PKGs) have been introduced by the Semantic Web community as small, user-centric knowledge graphs (KGs). PKGs fill the gap of personalised representation of user data and interests on top of large, well-established encyclopedic KGs, such as DBpedia [21]. Inspired by the recent widespread usage of PKGs in the medical domain to represent patient data, this PhD proposal aims to adopt a similar technique in the educational domain, deploying PKGs in e-learning platforms to represent users and learners. We propose a novel PKG development approach that relies on an ontology and interlinks to Linked Open Data, thereby adding personalisation and explainability to users’ featured data while respecting privacy. This research design is developed in two use cases: a collaborative search learning platform and an e-learning platform. Our preliminary results show that e-learning platforms can benefit from our approach by providing personalised recommendations and more user- and group-specific data.

Enhancing Multilingual Accessibility of Question Answering over Knowledge Graphs

There are more than 7000 languages spoken in the world today. Yet, English dominates in many research communities, in particular in the field of Knowledge Graph Question Answering (KGQA). The goal of a KGQA system is to provide natural-language access to a knowledge graph. While many research works aim to achieve the best possible QA quality over English benchmarks, only a small portion of them focuses on providing these systems in a way that different user groups (e.g., speakers of different languages) may use them with the same efficiency (i.e., accessibility). To address this research gap, we investigate the multilingual aspect of the accessibility, which enables speakers of different languages (including low-resource and endangered languages) to interact with KGQA systems with the same efficiency.

Enhancing Query Answer Completeness with Query Expansion based on Synonym Predicates

Community-based knowledge graphs are generated following hybrid approaches, where human intelligence empowers computational methods to effectively integrate encyclopedic knowledge or provide a common understanding of a domain. Existing community-based knowledge graphs represent essential sources of knowledge for enhancing the accuracy of data mining, information retrieval, question answering, and multimodal processing. However, despite the enormous effort of the contributing communities, community-based knowledge graphs may be incomplete and may integrate duplicated data and metadata. We tackle the problem of enhancing query answering against incomplete community-based knowledge graphs by proposing an efficient query processing approach to estimate answer completeness and increase the completeness of the results. It assumes that community-based knowledge graphs comprise synonym predicates that complement the knowledge graph triples required to raise query answering completeness. The aim is to propose a novel query expansion method based on synonym predicates identified using embeddings built on a knowledge graph. Our preliminary analysis shows that our approach improves query answer completeness. However, queries can be expanded with similar predicates that do not lead to complete answers. This shows that more work is required on query expansion with the minimum set of synonym predicates that maximizes answer completeness.
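
As a toy illustration of the idea (not the author's method), the snippet below selects candidate synonym predicates by cosine similarity over precomputed predicate embeddings; the predicates, vectors, and threshold are made up.

```python
# Pick candidate synonym predicates by cosine similarity of (assumed, precomputed)
# predicate embeddings; a query could then be expanded with a UNION over them.
import numpy as np

predicate_vecs = {                       # toy embeddings, not from a real KG
    "dbo:birthPlace":   np.array([0.90, 0.10, 0.00]),
    "dbp:placeOfBirth": np.array([0.88, 0.12, 0.05]),
    "dbo:deathPlace":   np.array([0.10, 0.90, 0.00]),
}

def synonyms(predicate, threshold=0.95):
    v = predicate_vecs[predicate]
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return [p for p, u in predicate_vecs.items()
            if p != predicate and cos(v, u) >= threshold]

print(synonyms("dbo:birthPlace"))   # ['dbp:placeOfBirth']
```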

Comprehensive Event Representations using Event Knowledge Graphs and Natural Language Processing

Recent work has utilised knowledge-aware approaches to natural language understanding, question answering, recommendation systems, and other tasks. These approaches rely on well-constructed and large-scale knowledge graphs that can be useful for many downstream applications and empower knowledge-aware models with commonsense reasoning. Such knowledge graphs are constructed through knowledge acquisition tasks such as relation extraction and knowledge graph completion. This work seeks to utilise and build on the growing body of work that uses findings from the field of natural language processing (NLP) to extract knowledge from text and build knowledge graphs. The focus of this research project is on how we can use transformer-based approaches to extract and contextualise event information, matching it to existing ontologies, to build comprehensive knowledge graph-based event representations. Specifically, sub-event extraction is used as a way of creating sub-event-aware event representations. These event representations are then further enriched through fine-grained location extraction and contextualised through the alignment of historically relevant quotes.

SESSION: Alternate Track: Web developer and W3C

Web Audio Modules 2.0: An Open Web Audio Plugin Standard

A group of academic researchers and developers from the computer music industry have joined forces for over a year to propose a new version of Web Audio Modules, an open source framework facilitating the development of high-performance Web Audio plugins (instruments, real-time audio effects and MIDI processors). While JavaScript and Web standards are becoming increasingly flexible and powerful, C, C++, and domain-specific languages such as FAUST or Csound remain the prevailing languages used by professional developers of native plugins. Fortunately, it is now possible to compile them to WebAssembly, which means they can be integrated with the Web platform. Our work aims to create a continuum between native and browser-based audio app development and to appeal to programmers from both worlds. This paper presents our proposal, including guidelines and implementations for an open Web Audio plugin standard - essentially the infrastructure to support high-level audio plugins for the browser.

Walks in Cyberspace: Improving Web Browsing and Network Activity Analysis with 3D Live Graph Rendering

Web navigation generates traces that are useful for Web cartography, user behavior analysis (UEBA) and resource allocation planning. However, this data still needs to be interpreted, sometimes enriched, and appropriately visualized to reach its full potential. In this paper, we explore the strengths and weaknesses of standard data collection methods such as mining Web browser history and network traffic dumps. We developed the DynaGraph framework, which combines classical trace-dumping tools with a Web application for live 3D rendering of graph data. We show that mining navigation history provides useful insights but fails to provide real-time analytics and is not easy to deploy. Conversely, mining network traffic dumps appears easy to set up but rapidly fails once the data traffic is encrypted. We show that 3D rendering makes it possible to highlight navigation patterns for a given data sampling rate.

JSRehab: Weaning Common Web Interface Components from JavaScript Addiction

Leveraging JavaScript (JS) for User Interface (UI) interactivity has been the norm on the web for many years. Yet, using JS increases bandwidth and battery consumption as scripts need to be downloaded and processed by the browser. In addition, client-side JS may expose visitors to security vulnerabilities such as Cross-Site Scripting (XSS). This paper introduces a new server-side plugin, called JSRehab, that automatically rewrites common web interface components into alternatives that do not require any JS. The main objective of JSRehab is to drastically reduce—and ultimately remove—the inclusion of JS in a web page to improve its responsiveness and consume fewer resources. We report on our implementation of JSRehab for Bootstrap, by far the most popular UI framework, and evaluate it on a corpus of 100 webpages. We show through manual validation that it is indeed possible to lower the dependency of pages on JS while keeping their interactivity and accessibility intact. We observe that JSRehab brings energy savings of at least 5% for the majority of web pages on the tested devices, while introducing a median on-the-wire overhead of only 5% to the HTML payload.

With One Voice: Composing a Travel Voice Assistant from Repurposed Models

Voice assistants provide users with a new way of interacting with digital products, allowing them to retrieve information and complete tasks with an increased sense of control and flexibility. Such products are composed of several machine learning models, like Speech-to-Text transcription, Named Entity Recognition and Resolution, and Text Classification. Building a voice assistant from scratch takes the prolonged efforts of several teams constructing numerous models and orchestrating between components. Alternatives such as using third-party vendors or re-purposing existing models may be considered to shorten time-to-market and development costs. However, each option has its benefits and drawbacks. We present key insights from building a voice search assistant for Booking.com. Our paper compares the achieved performance and development efforts of dedicated tailor-made solutions against existing re-purposed models. We share and discuss our data-driven decisions about implementation trade-offs and their estimated outcomes in hindsight, showing that a fully functional machine-learning product can be built from existing models.

SESSION: Alternate Track: Journal Track

Exploiting Anomalous Structural Nodes in Dynamic Social Networks

As a long-standing challenge in dynamic social networks, anomaly analysis research has attracted much attention. Unfortunately, existing methods focus on the macro representation of dynamic social networks and fail to analyze nodes at the micro level. Therefore, this research proposes a multiple-neighbor fluctuation method to exploit anomalous structural nodes in dynamic social networks. Our method introduces a new multiple-neighbor similarity index by incorporating extensional similarity indices, which introduces observation nodes and characterizes the structural similarities of nodes within multiple-neighbor ranges. Subsequently, our method maximally reflects the structural change of each node, using a new superposition similarity fluctuation index from the perspective of diverse multiple-neighbor similarities. As a result, our method not only identifies anomalous structural nodes by detecting the anomalous structural changes of nodes, but also evaluates their anomalous degrees by quantifying these changes. Extensive experimental comparisons with state-of-the-art methods show that our method can accurately identify anomalous structural nodes and evaluate their anomalous degrees well.

Cross-Site Prediction on Social Influence for Cold-Start Users in Online Social Networks

Online social networks (OSNs) have become a commodity in our daily life. As an important concept in sociology and viral marketing, the study of social influence has received a lot of attention in academia. Most of the existing proposals work well on dominant OSNs, such as Twitter, since these sites are mature and many users have generated a large amount of data for the calculation of social influence. Unfortunately, cold-start users on emerging OSNs generate much less activity data, which makes it challenging to identify potential influential users among them. In this work, we propose a practical solution to predict whether a cold-start user will become an influential user on an emerging OSN, by opportunistically leveraging the user’s information on dominant OSNs. A supervised machine learning-based approach is adopted, transferring the knowledge of both the descriptive information and dynamic activities on dominant OSNs to emerging OSNs. Descriptive features are extracted from the public data on a user’s homepage. In particular, to extract useful information from the fine-grained dynamic activities that cannot be represented by statistical indices, we use deep learning technologies to deal with the sequential activity data. Using the real data of millions of users collected from Twitter (a dominant OSN) and Medium (an emerging OSN), we evaluate the performance of our proposed framework in predicting prospective influential users. Our system achieves a high prediction performance under different social influence definitions.

Who Has the Last Word? Understanding How to Sample Online Discussions

In online debates, as in offline ones, individual utterances or arguments support or attack each other, leading to some subset of arguments (potentially from different sides of the debate) being considered more relevant than others. However, online conversations are much larger in scale than offline ones, with often hundreds of thousands of users weighing in, collaboratively forming large trees of comments by starting from an original post and replying to each other. In large discussions, readers are often forced to sample a subset of the arguments being put forth. Since such sampling is rarely done in a principled manner, users may not read all the relevant arguments needed to get a full picture of the debate from a sample. This paper is interested in answering the question of how users should sample online conversations to selectively favour the currently justified or accepted positions in the debate. We apply techniques from argumentation theory and complex networks to build a model that predicts the probabilities of the normatively justified arguments given their location in idealised online discussions of comments and replies, which we represent as trees. Our model shows that the proportion of replies that are supportive, the distribution of the number of replies that comments receive, and the locations of comments that do not receive replies (i.e., the “leaves” of the reply tree) all determine the probability that a comment is a justified argument given its location. We show that when the distribution of the number of replies is homogeneous along the tree length, for acrimonious discussions (with more attacking comments than supportive ones), the distribution of justified arguments depends on the parity of the tree level, i.e., the distance from the root expressed as a number of edges. In supportive discussions, which have more supportive comments than attacks, the probability of having justified comments increases as one moves away from the root. For discussion trees with a non-homogeneous in-degree distribution, we observe the same behaviour as before for supportive discussions, while for acrimonious discussions we cannot observe the same parity-based distribution. This is verified with data obtained from the online debating platform Kialo. By predicting the locations of the justified arguments in reply trees, we can therefore suggest which arguments readers should sample to grasp the currently accepted opinions in such discussions. Our models have important implications for the design of future online debating platforms.
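
One way to make the notion of a justified argument concrete is grounded semantics on an attack-only reply tree, sketched below. The paper's full model also covers supportive replies, reply-count distributions, and sampling, none of which are reproduced in this illustration.

```python
# Minimal sketch (attack-only reply trees, grounded semantics): every reply is
# modelled as an attack on its parent, and a comment is justified iff none of
# its attackers is itself justified. Supportive replies are deliberately omitted.
def justified(tree: dict, node: str) -> bool:
    """tree maps a comment id to the list of ids that reply to (attack) it."""
    children = tree.get(node, [])
    return all(not justified(tree, child) for child in children)

# Original post "op" is attacked by "a" and "b"; "a" is in turn attacked by "c".
reply_tree = {"op": ["a", "b"], "a": ["c"], "b": [], "c": []}
for comment in reply_tree:
    print(comment, justified(reply_tree, comment))
# Leaves ("b", "c") are justified; "a" is not (it is attacked by the justified "c");
# "op" is not justified because the unanswered attack "b" stands.
```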

SESSION: Alternate Track: Tutorials

Accepted Tutorials at The Web Conference 2022

This paper summarizes the content of the 20 tutorials given at The Web Conference 2022: 85% of these tutorials are lecture-style, and 15% are hands-on.

SESSION: Workshop: DataLit – 3rd International Workshop on Data Literacy

3rd Data Literacy Workshop

The 31st edition of the ACM Web Conference (Lyon, 2022, held online) hosted the Third Data Literacy workshop. This series of Data Literacy workshops responds to the increasing relevance of data literacy in the field of education as well as to the need for a data-literate workforce in our current data-driven economy and society. This half-day workshop included three presentations of the accepted and published papers, a keynote by Ellen Mandinach, a leading researcher in the field, and a round table with principal investigators in European Data Literacy projects. In line with its two previous editions, this workshop aims to bring together a diverse community of practice and research from different fields with a common interest: equipping society with the necessary competencies to make sense of the sheer amounts of data we come across in our daily lives.

Culturally Responsive Data Literacy: An Emerging and Important Construct

Ellen B. Mandinach gave a keynote talk at the 3rd Data Literacy Workshop, co-located with The Web Conference 2022, on Tuesday 26 April 2022. A summary of the talk, titled “Culturally Responsive Data Literacy: An Emerging and Important Construct”, is given in this paper.

One year of DALIDA Data Literacy Workshops for Adults: A Report

This paper reports on the data literacy discussion workshops we held between April 2021 and January 2022. We developed these workshops for adults in Ireland and, while they were open to everyone, we sought to reach adults from socially, economically, or educationally disadvantaged groups. This experience paper describes the workshops’ structure, elaborates on our challenges (primarily due to the pandemic), and details some of the lessons we have learned. We also present the findings of our participant evaluations. While the drop-out rate was high, the participants who stayed and filled in the evaluation form were delighted with the workshops’ content and format. The most important lesson we have learned is that a collaboration between scholars and education and public engagement (EPE) teams, where both stakeholders approach the projects as equals, is crucial for successful projects.

Towards Benchmarking Data Literacy

Data literacy as a term is growing in presence in society. Until recently, most of the educational focus has been on how to equip people with the skills to use data. However, the increased impact that data is having on society has demonstrated the need for a different approach, one where people are able to understand and think critically about how data is being collected, used and shared. Going beyond definitions, in this paper we present research on benchmarking data literacy through self-assessment, based upon the creation of a set of data literacy levels for adults. Although the work highlights the limitations of self-assessment, there is clear potential to build on the definitions to create IQ-style tests that help boost critical thinking and demonstrate the importance of data literacy education.

Towards digital economy through data literate workforce

In today's digital economy, data are part of everyone's work. Not only decision-makers but also average workers are invited to conduct data-based experiments, interpret data, and create innovative data-based products and services. In this endeavour, the entire workforce needs additional skills to thrive in this world. This type of competence is united under the name data literacy, and as such, it is becoming one of the most valuable skills in the labor market. This paper aims to highlight the needs and shortcomings in terms of competencies for working with data as a critical factor in the business of modern companies striving for digital transformation. Through systematic desk research spanning over 15 European countries, this paper sheds light on how data literacy is addressed in European higher education and professional training. In addition, our analysis uses results from an online survey conducted in 20 countries in Europe and North Africa. The results show that the most valuable data literacy competence of an employee is the ability to evaluate or reflect on data, together with skills related to reading or creating data classifications.

SESSION: Workshop: BeyondFacts – 2nd International Workshop on Knowledge Graphs for Online Discourse Analysis

BeyondFacts’22: 2nd International Workshop on Knowledge Graphs for Online Discourse Analysis

Expressing opinions and interacting with others on the Web has led to an abundance of online discourse: claims and viewpoints on controversial topics, their sources and contexts. This constitutes a valuable source of insights for studies into mis- / disinformation spread, bias reinforcement, echo chambers or political agenda setting. While knowledge graphs promise to provide the key to a Web of structured information, they are mainly focused on facts without keeping track of the diversity, connection or temporal evolution of online discourse. As opposed to facts, claims and viewpoints are inherently more complex. Their interpretation strongly depends on the context and a variety of intentional or unintended meanings, where terminology and conceptual understandings strongly diverge across communities from computational social science, to argumentation mining, fact-checking, or viewpoint/stance detection. The 2nd International Workshop on Knowledge Graphs for Online Discourse Analysis (BeyondFacts’22, equivalently abbreviated as KnOD’22) aims at strengthening the relations between these communities, providing a forum for shared works on the modeling, extraction and analysis of discourse on the Web. It addresses the need for a shared understanding and structured knowledge about discourse in order to enable machine-interpretation, discoverability and reuse, in support of studies into the analysis of societal debates.

Accurate and Explainable Misinformation Detection: Too Good to be True?

Many of the challenges entailed in detecting online misinformation are related to our own cognitive limitations as human beings: We can only see a small part of the world at once, so we need to rely on others to pre-process part of that information for us. This makes us vulnerable to misinformation and points at AI as a necessary means to amplify our ability to deal with it at scale. Recent advances [1] demonstrate it is possible to build semi-automatic tools to detect online misinformation. However, the limitations are still many: our algorithms are hard to explain to human stakeholders, the reduced availability of ground truth data is a bottleneck to train better models, and our processing pipelines are long and complex, with multiple points of potential failure. To address such limitations, strategies that wisely combine algorithms that learn from data with explicit knowledge representations are fundamental to reason with misinformation while engaging [2] with human stakeholders. In this talk, I advocate for a partnership between humans and AI to deal with online misinformation detection, go through the challenges such partnership faces, and share some of the ongoing work that pursues this vision.

Have you been misinformed?: Computational tools and analysis of our interactions with false and corrective information

Misinformation has always been part of humankind’s information ecosystem. The development of tools and methods for automatically detecting the reliability of information has received a great deal of attention in recent years, such as calculating the authenticity of images, calculating the likelihood of claims, and assessing the credibility of sources. Unfortunately, there is little evidence that the presence of these advanced technologies or the constant effort of fact-checkers worldwide can help stop the spread of misinformation. I will try to convince you that you also hold various false beliefs, and argue for the need for technologies and processes to assess the information shared by ourselves or by others, over a longer period of time, in order to improve our knowledge of our information credibility and vulnerability, as well as those of the people we listen to. Also, I will describe the benefits, challenges, and risks of automated information corrective actions, both for the target recipients and their wider audience.

Incorporating External Knowledge for Evidence-based Fact Verification

Existing fact verification methods employ pre-trained language models such as BERT for the contextual representation of evidence sentences. However, such representations do not take into account commonsense knowledge and these methods often conclude that there is not enough information to predict whether a claim is supported or refuted by the evidence sentences. In this work, we propose a framework called CGAT that incorporates external knowledge from ConceptNet to enrich the contextual representations of evidence sentences. We employ graph attention models to propagate the information among the evidence sentences before predicting the veracity of the claim. Experiment results on the benchmark FEVER dataset and UKP Snopes Corpus indicate that the proposed approach leads to higher accuracy and FEVER score compared to state-of-the-art claim verification methods.

Geotagging TweetsCOV19: Enriching a COVID-19 Twitter Discourse Knowledge Base with Geographic Information

Various aspects of the recent COVID-19 outbreak have been extensively discussed on online social media platforms and, in particular, on Twitter. Geotagging COVID-19-related discourse data on Twitter is essential for understanding the different discourse facets and their regional relevance, including calls for social distancing, acceptance of measures implemented to contain virus spread, anti-vaccination campaigns, and misinformation. In this paper, we aim at enriching TweetsCOV19—a large COVID-19 discourse knowledge base of more than 20 million tweets—with geographic information. For this purpose, we evaluate two state-of-the-art Geotagging algorithms: (1) DeepGeo—predicting the tweet location and (2) GeoLocation—predicting the user location. We compare pre-trained models with models trained on context-specific ground truth geolocation data extracted from TweetsCOV19. Models trained on our context-specific data achieve more than 6.7% improvement in Acc@25 compared to the pre-trained models. Further, our results show that DeepGeo outperforms GeoLocation and that longer tweets are, in general, easier to geotag. Finally, we use the two geotagging methods to study the distribution of tweets per country in TweetsCOV19 and compare the geographic coverage, i.e., the number of countries and cities each algorithm can detect.
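
Acc@25 here denotes the share of predictions that fall within 25 km of the gold coordinates. The sketch below shows how such a metric is typically computed (haversine distance plus a threshold); the geotagging models themselves are not reproduced, and the toy coordinates are made up.

```python
# Acc@k for geotagging: fraction of predictions whose great-circle distance to the
# gold coordinates is at most k kilometres (here k = 25).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))          # mean Earth radius ~6371 km

def acc_at_k(predictions, gold, k_km=25.0):
    hits = sum(haversine_km(*p, *g) <= k_km for p, g in zip(predictions, gold))
    return hits / len(gold)

# Toy example: one prediction ~1 km off, one several hundred km off.
pred = [(48.85, 2.35), (45.76, 4.84)]
gold = [(48.86, 2.35), (50.85, 4.35)]
print(acc_at_k(pred, gold))   # 0.5
```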

Towards Building Live Open Scientific Knowledge Graphs

Due to the large number and heterogeneity of data sources, it becomes increasingly difficult to follow the research output and the scientific discourse. For example, a publication listed on DBLP may be discussed on Twitter and its underlying data set may be used in a different paper published on arXiv. The scientific discourse this publication is involved in is thus scattered across systems that are not integrated, and it might be very hard for researchers to follow all the discourses a publication or data set is involved in. Also, many of these data sources—DBLP, arXiv, or Twitter, to name a few—are often updated in real time. These systems are not integrated (silos), and there is no system for users to query the content/data actively or, what would be even more beneficial, in a publish/subscribe fashion, i.e., a system that would actively notify researchers of work interesting to them when such work or discussions become available.

In this position paper, we introduce our concept of a live open knowledge graph which can integrate an extensible set of existing or new data sources in a streaming fashion, continuously fetching data from these heterogeneous sources, and interlinking and enriching it on-the-fly. Users can subscribe to continuously query the content/data of their interest and get notified when new content/data becomes available. We also highlight open challenges in realizing a system enabling this concept at scale.

Towards Analyzing the Bias of News Recommender Systems Using Sentiment and Stance Detection

News recommender systems are used by online news providers to alleviate information overload and to provide personalized content to users. However, algorithmic news curation has been hypothesized to create filter bubbles and to intensify users’ selective exposure, potentially increasing their vulnerability to polarized opinions and fake news. In this paper, we show how information on news items’ stance and sentiment can be utilized to analyze and quantify the extent to which recommender systems suffer from biases. To that end, we have annotated a German news corpus on the topic of migration using stance detection and sentiment analysis. In an experimental evaluation with four different recommender systems, our results show a slight tendency of all four models for recommending articles with negative sentiments and stances against the topic of refugees and migration. Moreover, we observed a positive correlation between the sentiment and stance bias of the text-based recommenders and the preexisting user bias, which indicates that these systems amplify users’ opinions and decrease the diversity of recommended news. The knowledge-aware model appears to be the least prone to such biases, at the cost of predictive accuracy.

Methodology to Compare Twitter Reaction Trends between Disinformation Communities, to COVID related Campaign Events at Different Geospatial Granularities

With the COVID pandemic still ongoing, there is an immediate need for a deeper understanding of how Twitter discussions (or chatters) in disinformation-spreading communities get triggered. More specifically, the value lies in monitoring how such trigger events in Twitter discussions align with the timelines of relevant influencing events in society (referred to in this work as campaign events). For campaign events related to the COVID pandemic, we consider both NPI (non-pharmaceutical intervention) campaigns and disinformation-spreading campaigns. In this short paper, we present a novel methodology to quantify, compare and relate two Twitter disinformation communities in terms of their reaction patterns to the timelines of major campaign events. We also analyze these campaigns in three geospatial granularity contexts: local county, state, and country/federal. We have collected a novel dataset on campaigns (NPI + disinformation) at these different geospatial granularities. Then, using the collected dataset on Twitter disinformation communities, we perform a case study to validate our proposed methodology.

SESSION: Workshop: CAAW – International Workshop on Cryptoasset Analytics Workshop

CAAW’22: 2022 International Workshop on Cryptoasset Analytics

The half-day workshop on Cryptoasset Analytics allows researchers from different disciplines to present their newest findings related to cryptoassets. This workshop is relevant for the Web research community for two reasons. First, on a technical level, fundamental concepts of cryptoassets are increasingly integrated with Web technologies. Second, we witness the formation of socio-technical cryptoasset ecosystems, which are tightly connected to the Web. The program features a mix of invited talks and a selection of peer-reviewed submissions. Workshop topics range from empirical studies, over analytics methods and tools, to case studies, datasets, and cross-cutting issues like legal or ethical aspects.

How much is the fork? Fast Probability and Profitability Calculation during Temporary Forks

Estimating the probability, as well as the profitability, of different attacks is of utmost importance when assessing the security and stability of prevalent cryptocurrencies. Previous modeling attempts of classic chain-racing attacks have different drawbacks: they either focus on theoretical scenarios such as infinite attack durations, do not account for already contributed blocks, assume honest victims which immediately stop extending their chain as soon as it falls behind, or rely on computationally heavy approaches which render them ill-suited when fast decisions are required. In this paper, we present a simple yet practical model to calculate the success probability of finite attacks, while considering already contributed blocks and victims that do not give up easily. In doing so, we introduce a finer-grained distinction between different actor types and the sides they take during an attack. The presented model simplifies assessing the profitability of forks in practical settings, while also enabling fast and more accurate estimations of the economic security guarantees in certain scenarios. By applying and testing our model in the context of bribing attacks, we further emphasize that approaches where the attacker compensates already contributed attack-chain blocks are particularly cheap. Better and more realistic attack models also help to spot and explain certain events observed in the empirical analysis of cryptocurrencies, and provide valuable directions for future studies. For better reproducibility and to foster further research in this area, all source code, artifacts and calculations are made available on GitHub.
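
The authors' model is richer (already contributed blocks, persistent victims, distinct actor types); the sketch below only shows the basic finite-horizon catch-up calculation that such models refine: the probability that an attacker holding a hash-rate share q, currently some blocks behind, overtakes the honest chain within the next n blocks. All parameters are illustrative.

```python
# Simplified finite-horizon catch-up probability (NOT the paper's model): the
# attacker finds each next block with probability q, starts `deficit` blocks
# behind, and succeeds as soon as its chain gets one block ahead.
from functools import lru_cache

def catch_up_probability(q: float, deficit: int, horizon: int) -> float:
    @lru_cache(maxsize=None)
    def p(d: int, blocks_left: int) -> float:
        if d < 0:                 # attacker is ahead: success
            return 1.0
        if blocks_left == 0:      # attack window exhausted
            return 0.0
        # Next block is the attacker's with probability q, the honest chain's otherwise.
        return q * p(d - 1, blocks_left - 1) + (1 - q) * p(d + 1, blocks_left - 1)
    return p(deficit, horizon)

# E.g. a 30%-hash-rate attacker, one block behind, with 20 blocks of attack window:
print(round(catch_up_probability(0.3, 1, 20), 4))
```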

Analysis of Arbitrary Content on Blockchain-Based Systems using BigQuery

Blockchain-based systems have gained immense popularity as enablers of independent asset transfers and smart contract functionality. They have also, since as early as the first Bitcoin blocks, been used for storing arbitrary contents such as texts and images. On-chain data storage functionality is useful for a variety of legitimate use cases. It does, however, also pose a systematic risk. If abused, for example by posting illegal contents on a public blockchain, data storage functionality can lead to legal consequences for operators and users that need to store and distribute the blockchain, thereby threatening the operational availability of entire blockchain ecosystems. In this paper, we develop and apply a cloud-based approach for quickly discovering and classifying content on public blockchains. Our method can be adapted to different blockchain systems and offers insights into content-related usage patterns and potential cases of abuse. We apply our method on the two most prominent public blockchain systems—Bitcoin and Ethereum—and discuss our results. To the best of our knowledge, the presented study is the first to systematically analyze non-financial content stored on the Ethereum blockchain and the first to present a side-by-side comparison between different blockchains in terms of the quality and quantity of stored data.
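
The abstract does not list the queries used. As a rough illustration of the cloud-based approach, the sketch below counts Bitcoin OP_RETURN outputs (a common carrier of arbitrary content) per year with the google-cloud-bigquery client. The dataset and column names (bigquery-public-data.crypto_bitcoin, outputs.type = 'nulldata', block_timestamp) are assumptions about the public dataset and should be verified against the current schema; this is not the paper's pipeline.

```python
# Rough sketch only: yearly counts of OP_RETURN ("nulldata") outputs on Bitcoin.
# Requires configured Google Cloud credentials; running queries incurs BigQuery costs.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT EXTRACT(YEAR FROM block_timestamp) AS year, COUNT(*) AS op_return_outputs
FROM `bigquery-public-data.crypto_bitcoin.transactions`,
     UNNEST(outputs) AS o
WHERE o.type = 'nulldata'
GROUP BY year
ORDER BY year
"""

for row in client.query(sql).result():
    print(row.year, row.op_return_outputs)
```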

Characterizing the OpenSea NFT Marketplace

Non-Fungible Tokens (NFTs) are unique digital identifiers used to represent ownership of various cryptoassets such as music, artwork, collectibles and game assets, and allow for an unalterable and provable chain of creation and ownership. The market for NFTs exploded in 2021, and much of the public is being exposed to this new digital ownership medium for the first time. While uptake of this new medium is relatively slow, we feel the prospects of blockchain technology are up-and-coming. NFTs are still largely misunderstood by the wider community, and as such, this provides a fascinating opportunity to explore the NFT marketplace in greater depth. We analyzed sales data from OpenSea, the NFT marketplace with the most extensive user base and sales volume. Our study considered 5.25 million sales that occurred between January 1, 2019, and December 31, 2021. We first present an overview of our data collection process and summarize key statistics of the data set. We then examine user behaviour in the market to show that a small subset of heavy-hitters is driving massive growth. Secondly, we review the economic activity within the network to show how these power users drive extreme price volatility within the Art and Collectible categories. Lastly, we review the network of buyers and sellers to show that, despite the sparsity of the network, communities of users are forming, and most power users tend to congregate in these structures. These findings shed light on areas of the NFT marketplace that have been relatively unexamined and provide a multi-level analysis of a multi-billion dollar market.

SESSION: Workshop: CLEOPATRA – 3rd International Workshop on Cross-lingual Event-centric Open Analytics

CLEOPATRA’22: 3rd International Workshop on Cross-lingual Event-centric Open Analytics

The 3rd International Workshop on Cross-lingual Event-centric Open Analytics (CLEOPATRA 2022) is co-located with The Web Conference (WWW) and held on the 25th of April, 2022.

Modern society faces an unprecedented number of events that impact countries, communities and economies around the globe across language, country and community borders. Recent examples include sudden or unexpected events such as terrorist attacks and political shake-ups such as Brexit, events related to the ongoing COVID-19 pandemic, as well as longer ongoing and evolving topics such as the migration crisis in Europe that regularly spawn events of global importance affecting local communities. These developments result in a vast amount of event-centric, multilingual information available from heterogeneous sources on the Web, on the Web of Data, within Knowledge Graphs, in social media, inside Web archives and in news sources. Such event-centric information differs across sources, languages and communities, potentially reflecting community-specific aspects, opinions, sentiments and bias.

The theme of the workshop includes a variety of interdisciplinary challenges related to analysis, interaction with and interpretation of vast amounts of event-centric textual, semantic and visual information in multiple languages originating from different communities. The goal of the interdisciplinary CLEOPATRA workshop is to bring together researchers and practitioners from the fields of Semantic Web, the Web, NLP, IR, Human Computation, Visual Analytics and Digital Humanities to discuss and evaluate methods and solutions for effective and efficient analytics of event-centric multilingual information spread across heterogeneous sources. This will support the delivery of analytical results in ways meaningful to users, helping them to cross language barriers and better understand event representations and their context in other languages.

The workshop features advanced methods for extracting event-centric information, multi-modal geo-localisation, cross-lingual sentiment detection and causality detection, NLP tools for under-resourced languages and applications of these methods to digital humanities research. Furthermore, the workshop introduces selected tools developed by researchers working in the CLEOPATRA ITN - a Marie Skłodowska-Curie Innovative Training Network.

We would like to take this opportunity to sincerely thank the authors and presenters for their inspiring contributions to the workshop. Our sincere thanks are due to the program committee members for reviewing the submissions and ensuring the high quality of our workshop program. We also thank Manolis Koubarakis for his keynote talk in the workshop.

We are also very grateful to the organisers of The Web Conference 2022, and particularly the Workshops Chairs, Nathalie Hernandez and Preslav Nakov, for their support with the workshop organisation.

Geospatial Interlinking with JedAI-spatial

In this talk I will present the framework JedAI-spatial for the interlinking of geospatial data encoded in RDF.

Exploring Cross-Lingual Transfer to Counteract Data Scarcity for Causality Detection

Finding causal relations in text is an important task for many types of textual analysis. It is a challenging task, especially for the many languages with little or no annotated training data available. To overcome this issue, we explore cross-lingual methods. Our main focus is on Swedish, for which we have a limited amount of data, and where we explore transfer from English and German. We also present additional results for German with English as a source language. We explore both a zero-shot setting without any target training data, and a few-shot setting with a small amount of target data. An additional challenge is the fact that the annotation schemes for the different data sets differ, and we discuss how we can address this issue. Moreover, we explore the impact of different types of sentence representations. We find that we have the best results for Swedish with German as a source language, for which we have a rather small but compatible data set. We are able to take advantage of a limited amount of noisy Swedish training data, but only if we balance its classes. In addition, we find that the newer transformer-based representations can make better use of target language data, but that a representation based on recurrent neural networks is surprisingly competitive in the zero-shot setting.

SESSION: Workshop: COnSeNT – 2nd International Workshop on Consent Management in Online Services, Networks and Things

COnSeNT 2022: 2nd International Workshop on Consent Management in Online Services, Networks and Things

Consent of the Governed

Informed consent is rooted in human subjects research, where it is used to operationalise autonomy. From there it was transplanted into a computing context, but in the process it became separated from key assumptions that make it work for research, such as an institutional review board to weed out unethical use, or rare and voluntary participation. The result is a procedural ghost of consent that undermines personal autonomy irrespective of what user interface constraints are imposed on it, and a hyper-individualistic approach to privacy that is poorly adapted to life in a complex digital society.

All is not dark, however, and there is a future for consent built for people rather than for homo economicus. We can use the legal basis of consent as an implementation brick and evolve consent towards collective, socialised decision-making: consent of the governed.

A Policy-Oriented Architecture for Enforcing Consent in Solid

The Solid project aims to restore end-users’ control over their data by decoupling services and applications from data storage. To realize data governance by the user, the Solid Protocol 0.9 relies on Web Access Control, which has limited expressivity and interpretability. In contrast, recent privacy and data protection regulations impose strict requirements on personal data processing applications and the scope of their operation. The Web Access Control mechanism lacks the granularity and contextual awareness needed to enforce these regulatory requirements. Therefore, we suggest a possible architecture for relating Solid’s low-level technical access control rules with higher-level concepts such as the legal basis and purpose for data processing, the abstract types of information being processed, and the data sharing preferences of the data subject. Our architecture combines recent technical efforts by the Solid community panels with prior proposals made by researchers on the use of ODRL and SPECIAL policies as an extension to Solid’s authorization mechanism. While our approach appears to avoid a number of pitfalls identified in previous research, further work is needed before it can be implemented and used in a practical setting.

Internalization of Privacy Externalities through Negotiation: Social costs of third-party web-analytic tools and the limits of the legal data protection framework

Tools for web analytics such as Google Analytics are implemented across the majority of websites. In most cases, their usage is free of charge for the website owners. However, to use those tools to their full potential, it is necessary to share the collected personal data of the users with the tool provider. This paper examines whether this constellation of data collection and sharing can be interpreted as an externality of consumption in the sense of welfare economic theory. As it is shown that this is the case, the further analysis examines whether the current technical and legal framework allows for an internalization of this externality through means of negotiation. It is illustrated that an internalization through negotiation is highly unlikely to succeed because of the existence of information asymmetries, transaction costs, and improper means for the enforcement of rights of disposal. It is further argued that even if some of those issues are addressed by data protection laws, the legal framework does not ensure a market situation necessary for a successful internalization. As a result, the externalities caused by data collection through third-party web-analytic tools continue to exist. This leads to an inefficiently high usage of third-party tools for web analytics by website owners.

Context, Prioritization, and Unexpectedness: Factors Influencing User Attitudes About Infographic and Comic Consent

Being asked to agree to data disclosure is a ubiquitous experience in digital services - yet it is rare to encounter a well-designed consent experience. Considering the momentum for a European data space where personal information easily flows across organizations, sectors, and nations, solving the thorny issue of “how to get consent right” cannot be postponed any further. In this paper, we describe the first findings from a study based on 24 semi-structured interviews investigating participants’ expectations and opinions toward a consent form redesigned as a comic and an infographic in a data-sharing scenario. We found that time, information prioritization, tone, and audience fit are crucial when individuals are invited to disclose their information and the infographic is a better fit in biomedical scenarios.

SESSION: Workshop: EMDC – 2nd International Workshop on the Efficiency of Modern Datacenters

Second International Workshop on the Efficiency of Modern Data Centers (EMDC) Chairs’ Welcome

Optimizing Data Layout for Training Deep Neural Networks

The widespread popularity of deep neural networks (DNNs) has made them an important workload in modern datacenters. Training DNNs is both computation-intensive and memory-intensive. While prior works focus on training parallelization (e.g., data parallelism and model parallelism) and model compression schemes (e.g., pruning and quantization) to reduce the training time, choosing an appropriate data layout for input feature maps also plays an important role and is considered to be orthogonal to parallelization and compression in delivering the overall training performance. However, finding an optimal data layout is non-trivial since the preferred data layout varies depending on different DNN models as well as the different pruning schemes that are applied. In this paper, we propose a simple-yet-effective data layout arbitration framework that automatically picks the beneficial data layout for different DNNs under different pruning schemes. The proposed framework is built upon a formulated cache estimation model. Experimental results indicate that our approach is always able to select the most beneficial data layout and achieves average training performance improvements of 14.3% and 3.1% compared to uniformly using two popular data layouts.
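To make the idea concrete, here is a minimal Python sketch of cache-cost-driven layout arbitration; the cost model, layer descriptions, and constants below are hypothetical placeholders rather than the framework described in the paper.

    # Hypothetical sketch: pick the input-feature-map layout (e.g. NCHW vs NHWC)
    # that a simple cache-cost model estimates to be cheapest for a given model.

    def estimated_cache_misses(layer, layout):
        """Placeholder cost model; a real framework would derive this from
        tile sizes, sparsity after pruning, and cache parameters."""
        n, c, h, w = layer["shape"]
        density = layer.get("density", 1.0)   # fraction of weights kept after pruning
        if layout == "NCHW":
            return n * c * h * w * density * 1.00
        # "NHWC": assume wide channels stream better, narrow channels worse (made-up factors)
        return n * c * h * w * density * (0.92 if c >= 64 else 1.10)

    def pick_layout(model_layers, layouts=("NCHW", "NHWC")):
        costs = {L: sum(estimated_cache_misses(layer, L) for layer in model_layers)
                 for L in layouts}
        return min(costs, key=costs.get), costs

    layers = [{"shape": (32, 3, 224, 224)}, {"shape": (32, 128, 56, 56), "density": 0.4}]
    best, costs = pick_layout(layers)
    print(best, costs)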

Security Challenges for Modern Data Centers with IoT: A Preliminary Study

The wide deployment of internet of things (IoT) devices makes a profound impact on the data center industry from various perspectives, ranging from infrastructure operation and resource management to end users. This is a double-edged sword: it enables ubiquitous resource monitoring and intelligent management, and therefore significantly enhances the efficiency of daily operation, while introducing new security issues for modern data centers. The emerging security challenges are not only related to detecting new IoT attacks or vulnerabilities but also include the implementation of cybersecurity protection mechanisms (e.g., intrusion detection systems, vulnerability management systems) to enhance data center security. As the new security challenges with IoT have not been thoroughly explored in the literature, this paper provides a survey on the most recent IoT security issues regarding modern data centers by highlighting IoT attacks and the trend of newly discovered vulnerabilities. We find that vulnerabilities related to data center management systems have increased significantly since 2019. Compared to the total in 2018 (25 vulnerabilities), the number of data center management system vulnerabilities increased almost fourfold (to 98 vulnerabilities) in 2020. This paper also introduces the existing cybersecurity tools and discusses the associated challenges and research issues for enhancing data center security.

Streaming Analytics with Adaptive Near-data Processing

Streaming analytics applications need to process massive volumes of data in a timely manner, in domains ranging from datacenter telemetry and geo-distributed log analytics to Internet-of-Things systems. Such applications suffer from significant network transfer costs to transport the data to a stream processor and compute costs to analyze the data in a timely manner. Pushing the computation closer to the data source by partitioning the analytics query is an effective strategy to reduce resource costs for the stream processor. However, the partitioning strategy depends on the nature of the resource bottleneck and the resource variability encountered at the compute resources near the data source. In this paper, we investigate different issues which affect query partitioning strategies. We first study new partitioning techniques within cloud datacenters, which operate under constrained compute conditions that vary widely across data sources and time slots. With insights obtained from the study, we suggest several ways to improve the performance of stream analytics applications operating in different resource environments by making effective partitioning decisions for a variety of use cases, such as geo-distributed streaming analytics.

Powering Multi-Task Federated Learning with Competitive GPU Resource Sharing

Federated learning (FL) nowadays involves compound learning tasks as cognitive applications’ complexity increases. For example, a self-driving system hosts multiple tasks simultaneously (e.g., detection, classification, etc.) and expects FL to retain life-long intelligence involvement. However, our analysis demonstrates that, when deploying compound FL models for multiple training tasks on a GPU, certain issues arise: (1) as different tasks’ skewed data distributions and corresponding models cause highly imbalanced learning workloads, current GPU scheduling methods lack effective resource allocations; (2) therefore, existing FL schemes, which focus only on heterogeneous data distribution but not on runtime computing, cannot practically achieve optimally synchronized federation. To address these issues, we propose a full-stack FL optimization scheme covering both intra-device GPU scheduling and inter-device FL coordination for multi-task training. Specifically, our work illustrates two key insights in this research domain: (1) competitive resource sharing is beneficial for parallel model executions, and the proposed concept of “virtual resource” can effectively characterize and guide practical per-task resource utilization and allocation; (2) FL can be further improved by taking architecture-level coordination into consideration. Our experiments demonstrate that FL throughput can be significantly escalated.

SESSION: Workshop: FinWeb – 2nd International Workshop on Financial Technology on the Web

An Overview of Financial Technology Innovation

In this paper, we provide an overview of financial technology (FinTech) innovation based on our experience of organizing multiple FinTech-related events since 2018, including the FinNLP workshop series, the FinWeb workshop series, and the FinNum shared task series. These event series aim to provide a forum for blending research on FinTech and artificial intelligence (AI) and for further accelerating development in the FinTech domain. We hope that, through researchers’ sharing at these events, challenging problems will be identified and future research directions will be shaped. Both the development of the technology and the trend of the focused topics over the past four years are discussed in this paper. We also propose a research agenda along with the plan for our FinTech-related events.

A Generative Approach for Financial Causality Extraction

Causality represents the foremost relation between events in financial documents such as financial news articles and financial reports. Each financial causality contains a cause span and an effect span. Previous works proposed sequence labeling approaches to solve this task, but sequence labeling models find it difficult to extract multiple causalities and overlapping causalities from text segments. In this paper, we explore a generative approach for causality extraction using the encoder-decoder framework and pointer networks. We use a causality dataset from the financial domain, FinCausal, for our experiments, and our proposed framework achieves very competitive performance on this dataset.

Rayleigh Portfolios and Penalised Matrix Decomposition

Since the development and growth of personalised financial services online, effective, tailor-made and fast statistical portfolio allocation techniques have been sought after. In this paper, we introduce a framework called Rayleigh portfolios that encompasses many well-known approaches, such as the Sharpe ratio, maximum diversification or minimum concentration. By showing the commonalities amongst these approaches, we are able to provide a solution to all such optimisation problems via matrix decomposition, and principal component analysis in particular. In addition, thanks to this reformulation, we show how to include sparsity upper bounds in such portfolios, thereby catering for two additional requirements in portfolio construction: robustness and low transaction costs. Importantly, modifications to the usual penalised matrix decomposition algorithms can be applied to other problems in statistics. Finally, empirical applications show promising results.
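As a rough illustration of the Rayleigh-quotient view (not the paper's penalised algorithm), the squared Sharpe ratio is a ratio of two quadratic forms, so its maximiser can be read off a generalized eigenproblem; the return and covariance numbers below are toy values.

    import numpy as np
    from scipy.linalg import eigh

    # Toy data: expected returns mu and covariance Sigma for 4 assets (made-up numbers).
    mu = np.array([0.08, 0.05, 0.12, 0.07])
    Sigma = np.diag([0.04, 0.02, 0.09, 0.03]) + 0.005

    # Squared Sharpe ratio is the Rayleigh quotient w'(mu mu')w / w'Sigma w.
    A = np.outer(mu, mu)
    eigvals, eigvecs = eigh(A, Sigma)   # generalized eigenproblem A w = lambda * Sigma w
    w = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    w = w / w.sum()                     # normalize weights to sum to one
    print("weights:", np.round(w, 3), "max squared Sharpe:", round(eigvals[-1], 4))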

FiNCAT: Financial Numeral Claim Analysis Tool

While making investment decisions by reading financial documents, investors need to differentiate between in-claim and out-of-claim numerals. In this paper, we present a tool which can do this task automatically. It extracts context embeddings of the numerals using a transformer-based pre-trained language model – BERT. Subsequently, it uses a Logistic Regression based model to detect whether a numeral is in-claim or out-of-claim. We use the FinNum-3 (English) dataset to train our model. We conducted rigorous experiments and our best model achieved a Macro F1 score of 0.8223 on the validation set. We have open-sourced this tool, which can be accessed at https://github.com/sohomghosh/FiNCAT_Financial_Numeral_Claim_Analysis_Tool
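A generic sketch of this kind of pipeline, assuming the Hugging Face transformers and scikit-learn APIs, could look as follows; it is not the released FiNCAT code, and the example sentences and labels are made up.

    # Generic sketch (not the released FiNCAT code): embed sentences containing a
    # numeral with a pre-trained BERT model, then train a logistic-regression
    # classifier to label the numeral as in-claim or out-of-claim.
    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentences):
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = enc(**batch).last_hidden_state[:, 0, :]   # [CLS] embedding per sentence
        return out.numpy()

    train_sents = ["Revenue is expected to grow 15% next year.",   # toy examples
                   "The meeting lasted 45 minutes."]
    train_labels = [1, 0]                                          # 1 = in-claim

    clf = LogisticRegression(max_iter=1000).fit(embed(train_sents), train_labels)
    print(clf.predict(embed(["We forecast margins of 30%."])))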

Understanding Financial Information Seeking Behavior from User Interactions with Company Filings

Publicly-traded companies are required to regularly file financial statements and disclosures. Analysts, investors, and regulators leverage these filings to support decision making, with high financial and legal stakes. Despite their ubiquity in finance, little is known about the information-seeking behavior of users accessing such filings. In this work, we present the first study of this behavior. We analyze 14 years of logs of users accessing company filings of more than 600K distinct companies on the U.S. Securities and Exchange Commission’s (SEC) Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system, the primary resource for accessing company filings. We provide an analysis of the information-seeking behavior for this high-impact domain. We find that little behavioral history is available for the majority of users, while frequent users have rich histories. Most sessions focus on filings belonging to a small number of companies, and individual users are interested in a limited number of companies. Out of all sessions, 66% contain filings from one or two companies, and 50% of frequent users are interested in six companies or fewer. Understanding user interactions with EDGAR can suggest ways to enhance the user journey in browsing filings, e.g., via filing recommendation. Our work provides a stepping stone for the academic community to tackle retrieval and recommendation tasks for the finance domain.

FinRED: A Dataset for Relation Extraction in Financial Domain

Relation extraction models trained on a source domain cannot be applied to a different target domain due to the mismatch between relation sets. In the current literature, there is no extensive open-source relation extraction dataset specific to the finance domain. In this paper, we release FinRED, a relation extraction dataset curated from financial news and earnings call transcripts containing relations from the finance domain. FinRED has been created by mapping Wikidata triplets using the distant supervision method. We manually annotate the test data to ensure proper evaluation. We also experiment with various state-of-the-art relation extraction models on this dataset to create the benchmark. We see a significant drop in their performance on FinRED compared to general relation extraction datasets, which indicates the need for better models for financial relation extraction.

SEBI Regulation Biography

The Securities and Exchange Board of India is the regulatory body for the securities and commodity markets in India. A growing number of SEBI documents, ranging from government regulations to legal case files, are now available in digital form. Advances in natural language processing and machine learning provide opportunities for extracting semantic insights from these documents. We present here a system that performs semantic processing of SEBI documents using state-of-the-art language models to produce enriched regulations containing timelines of amendments and cross-references to legal case files.

Numeral Tense Detection in Chinese Financial News

Time information is a very important dimension of information space, and it can be expressed through tense in natural language. Meanwhile, numerals play an important role in financial texts as the embodiment of fine-grained information, and most financial events contain numerals. We observe that Chinese does not express the tense of a text intuitively at the lexical level of verbs but through adverbial or auxiliary tense operators, and there has been no further research on numeral tense in Chinese financial texts yet. However, the tense of numerals in a text is crucial for financial fields that pay attention to time series. Therefore, to assist Chinese tense understanding in financial texts, in this paper we propose a novel task of numeral tense detection for Chinese financial fields. We first annotate a numeral tense dataset based on Chinese financial news texts, named CFinNumTense, which divides numeral tenses into the categories “past tense”, “future tense”, “static state” and “time”, and then conduct the Chinese financial numeral tense detection task on CFinNumTense. We employ the RoBERTa (Robustly Optimized BERT Pretraining Approach) pre-trained model as the embedding layer and use four baseline models, i.e., FNN (Feedforward Neural Network), TextCNN (Text Convolutional Neural Networks), RNN (Recurrent Neural Networks) and BiLSTM (Bi-directional Long Short-Term Memory), to detect numeral tenses. In the ablation experiments, we design NE (Numeral Encoding) to enrich the information on the target numeral in the texts and design an auxiliary learning model based on BiLSTM. Experiments show that multitask learning of target numeral tense detection and tense operator extraction strengthens the understanding of target numeral tense in Chinese financial texts.

Detecting Regulation Violations for an Indian Regulatory Body through Multi Label Classification

The Securities and Exchange Board of India (SEBI) is the regulatory body for securities and commodities in India. SEBI creates and enforces regulations that must be followed by all listed companies. To the best of our knowledge, this is the first work on identifying the regulation(s) that a SEBI-related case violates, which could be of substantial value to companies, lawyers, and other stakeholders in the regulatory process. We create a dataset for this task by automatically extracting violations from publicly available case files. Using this data, we explore various multi-label text classification methods to determine the potentially multiple regulations violated by (the facts of) a case. Our experiments demonstrate the importance of employing contextual text representations to understand complex financial and legal concepts. We also highlight the challenges that must be addressed to develop a fully functional system in the real world.
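A hedged sketch of such a multi-label setup with scikit-learn is shown below; the case facts and regulation labels are invented placeholders, not entries from the paper's dataset, and the model choice is illustrative rather than the authors'.

    # Hedged sketch of multi-label violation classification (labels are hypothetical).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    case_facts = ["The company failed to disclose the acquisition to shareholders.",
                  "The broker executed trades without client authorisation."]
    violations = [["Disclosure"], ["Broker-Conduct", "Insider-Trading"]]   # placeholder labels

    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(violations)          # binary indicator matrix, one column per regulation

    model = make_pipeline(TfidfVectorizer(),
                          OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    model.fit(case_facts, y)
    pred = model.predict(["Material information was withheld from investors."])
    print(mlb.inverse_transform(pred))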

Improving Operation Efficiency through Predicting Credit Card Application Turnaround Time with Index-based Encoding

This paper presents the successful use of index encoding and machine learning to predict the turnaround time of a complex business process – the credit card application process. Predictions are made on in-progress processes and refreshed when new information is available. The business process is complex, with each individual instance having different steps, sequence, and length. For instances predicted to have a higher-than-normal turnaround time, model explainability is employed to identify the top reasons. This allows for intervention in the process to potentially reduce turnaround time before completion.

TweetBoost: Influence of Social Media on NFT Valuation

NFT, or Non-Fungible Token, is a token that certifies a digital asset to be unique. A wide range of assets, including digital art, music, tweets, and memes, are being sold as NFTs. NFT-related content has been widely shared on social media sites such as Twitter. We aim to understand the dominant factors that influence NFT asset valuation. Towards this objective, we create a first-of-its-kind dataset linking Twitter and OpenSea (the largest NFT marketplace) to capture social media profiles and linked NFT assets. Our dataset contains 245,159 tweets posted by 17,155 unique users, directly linking 62,997 NFT assets on OpenSea worth 19 million USD. We have made the dataset publicly available.

We analyze the growth of NFTs, characterize the Twitter users promoting NFT assets, and gauge the impact of Twitter features on the virality of an NFT. Further, we investigate the effectiveness of different social media and NFT platform features by experimenting with multiple machine learning and deep learning models to predict an asset’s value. We model the problem as both a binary classification and an ordinal classification task. Our results show that social media features improve the ordinal classification accuracy by 6% over baseline models that use only NFT platform features. Among social media features, the count of user membership lists and the numbers of likes and replies are important. On the other hand, OpenSea features such as offer entered, bids withdrawn, bid entered, and is presale turn out to be important predictors.
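The feature-ablation idea can be sketched as follows; the column names, synthetic data, and classifier are hypothetical stand-ins, not the paper's dataset or models.

    # Hedged sketch: compare a classifier trained on NFT-platform features alone
    # against one that also sees social-media features (all data below is synthetic).
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "bids_entered": rng.poisson(3, n),
        "is_presale": rng.integers(0, 2, n),
        "tweet_likes": rng.poisson(20, n),
        "user_list_count": rng.poisson(5, n),
    })
    y = (df["tweet_likes"] + df["bids_entered"] > 22).astype(int)   # toy target

    platform_only = ["bids_entered", "is_presale"]
    with_social = platform_only + ["tweet_likes", "user_list_count"]

    for cols in (platform_only, with_social):
        acc = cross_val_score(GradientBoostingClassifier(), df[cols], y, cv=5).mean()
        print(cols, round(acc, 3))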

Graph Representation Learning of Banking Transaction Network with Edge Weight-Enhanced Attention and Textual Information

In this paper, we propose a novel approach to capture inter-company relationships from banking transaction data using graph neural networks with a special attention mechanism and textual industry or sector information. Transaction data owned by financial institutions can be an alternative source of information to comprehend real-time corporate activities. Such transaction data can be applied to predict stock prices and miscellaneous macroeconomic indicators as well as to sophisticate credit and customer relationship management. Although inter-company relationships are important, traditional methods for extracting information have not captured them sufficiently. With the recent advances in deep learning on graphs, we can expect better extraction of inter-company information from banking transaction data. In particular, we analyze common issues that arise when we represent banking transactions as a network and propose an efficient solution to such problems by introducing a novel edge weight-enhanced attention mechanism, using textual information, and designing an efficient combination of existing graph neural networks.
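One simple way to fold edge weights into graph attention, offered here only as a hedged sketch rather than the paper's exact mechanism, is to add a function of the transaction amount to the attention logits before the per-node softmax:

    # Minimal sketch (not the paper's exact design): a single graph-attention layer
    # whose attention logits are modulated by transaction-amount edge weights.
    import torch
    import torch.nn as nn

    class EdgeWeightedAttention(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, out_dim, bias=False)
            self.att = nn.Linear(2 * out_dim, 1, bias=False)

        def forward(self, x, edge_index, edge_weight):
            h = self.lin(x)                                     # [N, out_dim]
            src, dst = edge_index                               # each of shape [E]
            logits = self.att(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)
            logits = logits + torch.log1p(edge_weight)          # inject transaction amounts
            alpha = torch.zeros_like(logits)
            for node in dst.unique():                           # softmax over incoming edges
                mask = dst == node
                alpha[mask] = torch.softmax(logits[mask], dim=0)
            out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
            return out

    layer = EdgeWeightedAttention(8, 16)
    x = torch.randn(4, 8)
    edge_index = torch.tensor([[0, 1, 2, 3], [1, 1, 3, 3]])     # (source, destination) pairs
    edge_weight = torch.tensor([100.0, 5.0, 30.0, 1.0])
    print(layer(x, edge_index, edge_weight).shape)              # torch.Size([4, 16])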

SESSION: Workshop: LocWeb – 12th International Workshop on Location and the Web

LocWeb2022: 12th International Workshop on Location and the Web

The LocWeb2022 workshop at The Web Conference 2022 will run for the 12th time with evolving topics around location-aware information access, Web architecture, spatial social computing, and social good. It is designed as a meeting place for researchers around the location topic at The Web Conference and takes an interdisciplinary perspective.

Anonymous Hyperlocal Communities: What do they talk about?

In this paper, we study what users talk about in a plethora of independent hyperlocal and anonymous online communities in a single country: Saudi Arabia (KSA). We base this perspective on performing a content classification of the Jodel network in the KSA. To do so, we first contribute a content classification schema that assesses both the intent (why) and the topic (what) of posts. We use the schema to label 15k randomly sampled posts and further classify the top 1k hashtags. We observe a rich set of benign (yet at times controversial in conservative regimes) intents and topics that dominantly address information requests, entertainment, or dating/flirting. By comparing two large cities (Riyadh and Jeddah), we further show that hyperlocality leads to shifts in topic popularity between local communities. By evaluating votes (content appreciation) and replies (reactions), we show that the communities react differently to different topics; e.g.,  entertaining posts are much appreciated through votes, receiving the least replies, while beliefs & politics receive similarly few replies but are controversially voted.

Exploiting Geodata to Improve Image Recognition with Deep Learning

Due to the widespread availability of smartphones and digital cameras with GPS functionality, the number of photos associated with geographic coordinates or geoinformation on the internet is continuously increasing. Besides the obvious benefits of geotagged images for users, geodata can enable a better understanding of the image content and thus facilitate its classification. This work shows the added value of integrating auxiliary geodata into a multi-class single-label image classification task. Various ways of encoding and extracting auxiliary features from raw coordinates are compared, followed by an investigation of approaches to integrate these features into a convolutional neural network (CNN) via fusion models. We show the classification improvements of adding the raw coordinates and derived auxiliary features such as satellite photos and location-related texts (address information and tags).

The results show that the best performance is achieved by a fusion model that incorporates textual features based on address information. It improves performance the most while reducing the training time: the accuracy on the considered 25 concepts was increased to 85%, compared to 71% in the baseline, while the training time was reduced by 21%. Adding the satellite photos to the neural network shows significant performance improvements as well, but increases the training time. In contrast, numerical features derived directly from raw coordinates do not yield a convincing improvement in classification performance.
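A minimal late-fusion sketch in PyTorch, assuming a recent torchvision and using placeholder encoders rather than the paper's architecture, might concatenate CNN image features with a bag-of-words encoding of the address text:

    # Hedged sketch of late fusion: CNN image features + encoded location text.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class GeoFusionClassifier(nn.Module):
        def __init__(self, text_vocab=5000, n_classes=25):
            super().__init__()
            cnn = resnet18(weights=None)
            cnn.fc = nn.Identity()                               # keep the 512-dim image features
            self.image_encoder = cnn
            self.text_encoder = nn.EmbeddingBag(text_vocab, 64)  # bag-of-words address text
            self.head = nn.Linear(512 + 64, n_classes)

        def forward(self, image, text_token_ids, text_offsets):
            img_feat = self.image_encoder(image)
            txt_feat = self.text_encoder(text_token_ids, text_offsets)
            return self.head(torch.cat([img_feat, txt_feat], dim=-1))

    model = GeoFusionClassifier()
    images = torch.randn(2, 3, 224, 224)
    tokens = torch.tensor([10, 42, 7, 99])                       # two bag-of-words documents
    offsets = torch.tensor([0, 2])
    print(model(images, tokens, offsets).shape)                  # torch.Size([2, 25])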

Predicting Spatial Spread on Social Media

Understanding and predicting spreading phenomena is vital for numerous applications. The massive availability of social network data provides a platform for studying spreading phenomena. Past works studying and predicting spreading phenomena have explored the spread along the dimensions of time and volume, such as predicting the total number of infected users, predicting popularity, and predicting the time at which content reaches a threshold number of infected users. However, as information spreads from user to user, it also spreads from location to location. In this paper, we attempt to predict the spread in the dimension of geographic space. In accordance with past spreading prediction problems, we design our problem to predict the spatial spread at an early stage. For this, we utilize spatial features, social features, and emotion features. We feed these features into existing classification algorithms and evaluate them on three datasets from Twitter.

SESSION: Workshop: MAISoN – 8th International Workshop on Mining Actionable Insights from Social Networks, Special Edition on Mental Health and Social Media

MAISoN’22: 8th International Workshop on Mining Actionable Insights from Social Networks Special Edition on Mental Health and Social Media

The eighth edition of the workshop on Mining Actionable Insights from Social Networks (MAISoN 2022) took place virtually on April 26th, 2022, co-located with the ACM Web Conference 2022 (WWW 2022). This year, we organized a special edition with a focus on mental health and social media. The aim of this edition was to bring together researchers from different disciplines to discuss research that goes beyond descriptive analysis of social media data and instead investigates different techniques that use social media data for building diagnostic, predictive and prescriptive analysis models for mental health applications. This topic attracted considerable interest from the community, especially given the impact of social media on people’s mental health during the COVID-19 pandemic.

Utilizing Pattern Mining and Classification Algorithms to Identify Risk for Anxiety and Depression in the LGBTQ+ Community During the COVID-19 Pandemic

In this paper, we examine the results of pattern mining and decision trees applied to a dataset of survey responses about life for individuals in the LGBTQ+ community during COVID, which have the potential to be used as a tool to identify those at risk for anxiety and depression. The world was immensely affected by the pandemic in 2020 through 2022, and our study attempts to use the data from this period to analyze the impact on anxiety and depression. First, we used the FP-growth algorithm for frequent pattern mining, which finds groups of items that frequently occur together, and utilized the resulting patterns and measures to determine which features were significant when inspecting anxiety and depression. Then, we trained a decision tree with the selected features to classify if a person has anxiety or depression. The resulting decision trees can be used to identify individuals at risk for these conditions. From our results, we also identified specific risk factors that helped predict whether an individual was likely to experience anxiety and/or depression, such as satisfaction with their sex life, cutting meals, and worries of healthcare discrimination due to their gender identity or sexual orientation.
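A compact sketch of the two-stage pipeline, using mlxtend's FP-growth implementation and a scikit-learn decision tree on invented survey items (not the study's actual variables), is given below.

    # Hedged sketch: mine frequent patterns with FP-growth, then train a decision
    # tree on the retained features (survey items and values are invented).
    import pandas as pd
    from mlxtend.frequent_patterns import fpgrowth
    from sklearn.tree import DecisionTreeClassifier

    responses = pd.DataFrame({
        "cut_meals": [1, 0, 1, 1, 0],
        "healthcare_worry": [1, 0, 1, 0, 0],
        "satisfied_sex_life": [0, 1, 0, 0, 1],
        "anxiety": [1, 0, 1, 1, 0],
    }).astype(bool)

    patterns = fpgrowth(responses, min_support=0.4, use_colnames=True)
    print(patterns)                                  # frequently co-occurring item sets

    features = ["cut_meals", "healthcare_worry", "satisfied_sex_life"]
    tree = DecisionTreeClassifier(max_depth=3).fit(responses[features], responses["anxiety"])
    print(tree.predict(responses[features]))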

Supporting People Receiving Substance Use Treatment during COVID-19 through a Professional-Moderated Online Peer Support Group

The COVID-19 pandemic exacerbated the ongoing opioid crisis in the United States. Individuals with a substance use disorder are vulnerable to relapse during times of acute stress. Online peer support communities (OPSCs) have the potential to decrease social isolation and increase social support for participants. In September 2020, we launched a private, professional-moderated OPSC using the Facebook Group platform to study its effects on the mental health wellness of women undergoing substance use treatment. This study was particularly meaningful as the participants were not able to join in-person treatment sessions due to the COVID-19 pandemic. Preliminary findings indicate that study participants reported decreased loneliness and increased online social support three months after initiating the OPSC. They tended to interact with content initiated by a clinical professional more than those generated by peers.

A Large-scale Temporal Analysis of User Lifespan Durability on the Reddit Social Media Platform

Social media platforms thrive upon the intertwined combination of user-created content and social interaction between these users.

In this paper, we aim to understand what early user activity patterns fuel an ultimately durable user lifespan. We do so by analyzing what behavior causes potentially durable contributors to abandon their “social career” at an early stage, despite a strong start. We use a uniquely processed temporal dataset of over 6 billion Reddit user interactions covering over 14 years, which we make available together with this paper. The temporal data allows us to assess both user content creation activity and the way in which this content is perceived. We do so along three dimensions of a user’s content: a) engagement and perception, b) diversification, and c) contribution.

Our experiments reveal that users who leave the platform quickly may initially receive good feedback on their posts, but in time experience a decrease in the perceived quality of their content. Concerning diversification, we find that early departing users focus on fewer content categories in total, but do “jump” between those content categories more frequently, perhaps in an (unsuccessful) search for recognition or a sense of belonging. Third, we see that users who stay with the platform for a more extended period gradually start contributing, whereas early departing users post their first comments relatively quickly. The findings from this paper may prove crucial for better understanding how social media platforms can in an early stage improve the overall user experience and feeling of belonging within the social ecosystem of the platform.

”I’m always in so much pain and no one will understand” - Detecting Patterns in Suicidal Ideation on Reddit

Social media has become another venue for those struggling with thoughts of suicide. Many turn to social media to express suicidal ideation and look for peer support. In our study, we seek to better understand patterns in the behaviors of these users, particularly on the social media platform Reddit. This study explores how Reddit users move or progress between subreddits until they express active suicidal ideation. We also look at these users’ posting patterns in the time leading up to expressing suicidal ideation and the time after. We examined a large dataset of posts from users who created at least one thread on SuicideWatch during January 2019 - August 2019 and collected their posts starting in July 2018 to create a look-back period of 6 months. This generated a total of 5,892,310 posts. We defined what it means to progress between subreddits and generated a graph of progressions of all users in our dataset. We found that these users mostly progressed to or from 8 different subreddits, and each of these subreddits could point to a particular emotional difficulty that a user was having, such as self-harm or relationship problems. Furthermore, we examined the volume of posts and the proportion of posts with negative sentiment leading up to the first incident of active suicidal ideation and found that there is an increase in both negative sentiment and volume of posts leading up to the day of the first incident of suicidal ideation on Reddit. However, on the day of the first incident of suicidal ideation, there is a precipitous drop in the number of posts, which goes back up on the following day. Using this insight, we can better understand these users. This will allow for developing interventions for suicide prevention on social media platforms in the future.

SESSION: Workshop: MUWS'2022 – 1st International Workshop on Multimodal Understanding for the Web and Social Media

MUWS’22: 1st International Workshop on Multimodal Understanding for the Web and Social Media

The 1st International Workshop on Multimodal Understanding for the Web and Social Media (MUWS 2022) is co-located with The Web Conference (WWW) and held on the 26th of April, 2022.

Multimodal learning and analysis is an emerging research area that cuts through several disciplines like Computer Vision, Natural Language Processing (NLP), Speech Processing, and Multimedia. Recently, several multimodal learning techniques have shown the benefit of combining multiple modalities in video representation learning and downstream tasks on videos. At the core, these methods are focused on modelling the modalities and their complex interactions by using large amounts of data, different loss functions and deep neural network architectures. Although these research directions are exciting and challenging, interdisciplinary fields such as semiotics are rarely considered. Literature in semiotics provides a detailed theory and analysis on meaning creation through signs and symbols via multiple modalities. In general, it provides a compelling view of multimodality and perception that can further expand computational research and applications on the web and social media.

The goal of the interdisciplinary MUWS Workshop is to bring together researchers and practitioners from the fields of Information Retrieval, Natural Language Processing, Computer Vision, Human Computation, and Semiotics to discuss and evaluate methods and solutions for effective and efficient analytics of multimodal information present in the Web or social media. We are interested in approaches, tasks, and metrics for effectively analysing multimedia information such as image-text pairs and videos to design methodologies that jointly consider information from multiple modalities. The interdisciplinary nature of processing such multimodal data involves combining ideas and methods from the fields mentioned above. We envision the workshop as a forum for researchers and practitioners from academia and industry for original contributions and practical application on multimodal information processing, mining, retrieval, search, and management.

The workshop features advanced methods for combining visual and textual content for problems such as fake news detection, predicting reliability and popularity of news articles, generating image narrative with emotion, and injecting knowledge graph information to improve visual question answering performance.

We would like to take this opportunity to sincerely thank the authors and presenters for their inspiring contributions to the workshop. Our sincere thanks are due to the program committee members for reviewing the submissions and ensuring the high quality of our workshop program. We also thank Ichiro Ide for his keynote talk, Chiao-I Tseng and Christian Otto for their invited talks in the workshop.

We are also very grateful to the organisers of The Web Conference 2022, and particularly the Workshops Chairs, Nathalie Hernandez and Preslav Nakov, for their support with the workshop organisation.

Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization

Social media content routinely incorporates multi-modal design to convey information, shape meanings, and sway interpretations toward desirable implications, but the choices and impacts of using both texts and visual images have not been sufficiently studied. This work proposes a computational approach to analyze the impacts of persuasive multi-modal content on popularity and reliability, in COVID-19-related news articles shared on Twitter. The two aspects are intertwined in the spread of misinformation: for example, an unreliable article that aims to misinform has to attain some popularity. This work has several contributions. First, we propose a multi-modal (image and text) approach to effectively identify the popularity and reliability of information sources simultaneously. Second, we identify textual and visual elements that are predictive of information popularity and reliability. Third, by modeling cross-modal relations and similarity, we are able to uncover how unreliable articles construct multi-modal meaning in a distorted, biased fashion. Our work demonstrates how to use multi-modal analysis for understanding influential content and has implications for social media literacy and engagement.

Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection

Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question about an associated image. Recent single-modality text work has shown that knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study how and whether such methods, applied in a bi-modal setting, can improve an existing VQA system’s performance on the KBVQA task. We experiment with two large publicly available VQA datasets: (1) KVQA, which contains mostly rare Wikipedia entities, and (2) OKVQA, which is less entity-centric and more aligned with common sense reasoning. Both lack explicit entity spans, and we study the effect of different weakly supervised and manual methods for obtaining them. Additionally, we analyze how recently proposed bi-modal and single-modal attention explanations are affected by the incorporation of such entity-enhanced representations. Our results show substantially improved performance on the KBVQA task without the need for additional costly pre-training, and we provide insights for when entity knowledge injection helps improve a model’s understanding. We provide code and enhanced datasets for reproducibility.

ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer

Image narrative generation is the task of creating a story from an image with a subjective viewpoint. Given the importance of the subjective feelings of writers, readers, and characters in storytelling, an image narrative generation method should consider human emotion. In this study, we propose a novel method for image narrative generation called ViNTER (Visual Narrative Transformer with Emotion arc Representation), which takes an “emotion arc” as input to capture a sequence of emotional changes. Since emotion arcs represent the trajectory of emotional change, they are expected to provide the model with detailed information about the emotional changes in the story. We present experimental results of both automatic and manual evaluations on the Image Narrative dataset and demonstrate the effectiveness of the proposed approach.

Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection

Recent years have witnessed massive growth in the proliferation of fake news online. User-generated content is a blend of text and visual information, leading to different variants of fake news. As a result, researchers have started targeting multimodal methods for fake news detection. Existing methods capture high-level information from different modalities and jointly model them to make a decision. Given multiple input modalities, we hypothesize that not all modalities may be equally responsible for decision-making. Hence, this paper presents a novel architecture that effectively identifies and suppresses information from weaker modalities and extracts relevant information from the strong modality on a per-sample basis. We also establish intra-modality relationships by extracting fine-grained image and text features. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average of 3.05% and 4.525% on accuracy and F1-score, respectively. We also release the code, implementation details, and model checkpoints for the community’s interest.

SESSION: Workshop: Sci-K – 2nd International Workshop on Scientific Knowledge Representation, Discovery, and Assessment

Sci-K 2022 - International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment

In this paper we present the 2nd edition of the Scientific Knowledge: Representation, Discovery, and Assessment (Sci-K 2022) workshop. Sci-K aims to explore innovative solutions and ideas for the generation of approaches, data models, and infrastructures (e.g., knowledge graphs) for supporting, directing, monitoring and assessing scientific knowledge and progress. This edition is also a reflection point, as the community is seeking alternative solutions to the now-defunct Microsoft Academic Graph (MAG).

The Semantic Scholar Academic Graph (S2AG)

The Semantic Scholar Academic Graph, or S2AG (pronounced “stag”), is a large, open, heterogeneous knowledge graph of scholarly works, authors, and citations that powers the Semantic Scholar discovery service. S2AG currently contains over 205M publications, 121M authors, and nearly 2.5B citation edges. Semantic Scholar integrates metadata from Crossref, PubMed, Unpaywall, and other sources. In addition, through partnerships with academic publishers and through web crawling, we source and process the full text of nearly 60M publications in order to extract and classify the document structure, including references, citation contexts, figures, tables, and more. S2AG is available via an open API as well as via downloadable monthly snapshots. In this talk, we will describe the S2AG resource as well as the Semantic Scholar Open Research Corpus, or S2ORC (pronounced “stork”), a general-purpose, multi-domain corpus for NLP and text mining research.
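A minimal sketch of querying the open API mentioned above is shown below; the endpoint layout and field names follow the publicly documented Semantic Scholar Graph API at the time of writing and should be treated as assumptions to check against the current documentation.

    # Sketch: query the Semantic Scholar Graph API (endpoint and fields are assumptions;
    # consult the current API documentation before relying on them).
    import requests

    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": "knowledge graph of scholarly works",
                "fields": "title,year,citationCount"},
        timeout=30,
    )
    for paper in resp.json().get("data", []):
        print(paper.get("year"), paper.get("title"), paper.get("citationCount"))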

Data Models for Annotating Biomedical Scholarly Publications: the Case of CORD-19

Semantic text annotations have been a key factor for supporting computer applications ranging from knowledge graph construction to biomedical question answering. In this systematic review, we provide an analysis of the data models that have been applied to semantic annotation projects for the scholarly publications available in the CORD-19 dataset, an open database of the full texts of scholarly publications about COVID-19. Based on Google Scholar and the screening of specific research venues, we retrieve seventeen publications on the topic mostly from the United States of America. Subsequently, we outline and explain the inline semantic annotation models currently applied on the full texts of biomedical scholarly publications. Then, we discuss the data models currently used with reference to semantic annotation projects on the CORD-19 dataset to provide interesting directions for the development of semantic annotation models and projects.

Sequence-Based Extractive Summarisation for Scientific Articles

This paper presents the results of research on supervised extractive text summarisation for scientific articles. We show that a simple sequential tagging model based only on the text within a document achieves strong results compared with a simple classification model. Improvements can be achieved through additional sentence-level features, though these improvements were minimal. Through further analysis, we show the potential of the sequential model to exploit the structure of the document, depending on the academic discipline the document is from.

Assessing Network Representations for Identifying Interdisciplinarity

Many studies have sought to identify interdisciplinary research as a function of the diversity of disciplines identified in an article’s references or citations. However, given the constant evolution of the scientific landscape, disciplinary boundaries are shifting and blurring, making it increasingly difficult to describe research within a strict taxonomy. In this work, we explore the potential for graph learning methods to learn embedded representations for research papers that encode their ‘interdisciplinarity’ in a citation network. This facilitates the identification of interdisciplinary research without the use of disciplinary categories. We evaluate these representations and their ability to identify interdisciplinary research, according to their utility in interdisciplinary citation prediction. We find that those representations which preserve structural equivalence in the citation graph are best able to predict distant, interdisciplinary interactions in the network, according to multiple definitions of citation distance.

Personal Research Knowledge Graphs

Maintaining research-related information in an organized manner can be challenging for a researcher. In this paper, we envision personal research knowledge graphs (PRKGs) as a means to represent structured information about the research activities of a researcher. PRKGs can be used to power intelligent personal assistants, and personalize various applications. We explore what entities and relations could be potentially included in a PRKG, how to extract them from various sources, and how to share a PRKG within a research group.

Quantifying the Topic Disparity of Scientific Articles

Citation count is a popular index for assessing scientific papers. However, it depends not only on the quality of a paper but also on various factors, such as conventionality, journal, team size, career age, and gender. Here, we examine the extent to which the conventionality of a paper is related to its citation count by using our measure, topic disparity. The topic disparity is the cosine distance between a paper and its discipline in a neural embedding space. Using this measure, we show that the topic disparity is negatively associated with citation count, even after controlling for journal impact, team size, and the career age and gender of the first and last authors. This result indicates that less conventional research tends to receive fewer citations than conventional research. The topic disparity can be used to complement citation count and to recommend papers at the periphery of a discipline because of their less conventional topics.
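For clarity, a small sketch of the measure follows; how the discipline vector is formed (here, a centroid of toy paper vectors) is an assumption for illustration, not necessarily the authors' construction.

    # Topic disparity as described: cosine distance between a paper's embedding and
    # a vector representing its discipline (toy vectors; centroid is an assumption).
    import numpy as np

    def topic_disparity(paper_vec, discipline_vecs):
        centroid = np.mean(discipline_vecs, axis=0)
        cos_sim = paper_vec @ centroid / (np.linalg.norm(paper_vec) * np.linalg.norm(centroid))
        return 1.0 - cos_sim                      # cosine distance

    paper = np.array([0.2, 0.9, 0.1])
    discipline = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]])
    print(round(topic_disparity(paper, discipline), 3))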

Beyond Reproduction, Experiments want to be Understood

The content of experiments must be semantically described. This topic has already been largely covered. However, some neglected benefits of such an approach provide more arguments in favour of scientific knowledge graphs. Beyond being searchable through flat metadata, a knowledge graph of experiment descriptions may be able to provide answers to scientific and methodological questions. This includes identifying non-experimented conditions or retrieving specific techniques used in experiments. In turn, this is useful for researchers, as this information can be used for repurposing experiments, checking claimed results or performing meta-analyses.

GraphCite: Citation Intent Classification in Scientific Publications via Graph Embeddings

Citations are crucial in scientific works as they help position a new publication. Each citation carries a particular intent, for example, to highlight the importance of a problem or to compare against results provided by another method. The authors’ intent when making a new citation has been studied to understand the evolution of a field over time or to make recommendations for further citations. In this work, we address the task of citation intent prediction from a new perspective. In addition to textual clues present in the citation phrase, we also consider the citation graph, leveraging high-level information of citation patterns. In this novel setting, we perform a thorough experimental evaluation of graph-based models for intent prediction. We show that our model, GraphCite, improves significantly upon models that take into consideration only the citation phrase. Our code is available online.

A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software

Datasets and software packages are considered important resources that can be used for replicating computational experiments. With the advocacy of Open Science and the growing interest in investigating the reproducibility of scientific claims, including URLs linking to publicly available datasets and software packages has become an institutionalized part of research publications. In this preliminary study, we investigated the disciplinary dependency and chronological trends of including open access datasets and software (OADS) in electronic theses and dissertations (ETDs), based on a hybrid classifier called OADSClassifier, consisting of a heuristic and a supervised learning model. The classifier achieves a best F1 of 0.92. We found that the inclusion of OADS-URLs exhibits a strong disciplinary dependence and that the fraction of ETDs containing OADS-URLs has been gradually increasing over the past 20 years. We develop and share a ground-truth corpus consisting of 500 manually labeled sentences containing URLs from scientific papers. The dataset and source code are available at https://github.com/lamps-lab/oadsclassifier.
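The heuristic half of such a classifier can be sketched as below; the host list is illustrative and not the paper's OADSClassifier rules.

    # Hedged sketch: flag sentences whose URLs point to common open dataset or
    # software hosts (domain list is illustrative, not the paper's rule set).
    import re

    OADS_HOSTS = ("github.com", "gitlab.com", "zenodo.org", "figshare.com", "osf.io")
    URL_RE = re.compile(r"https?://[^\s)>\]]+")

    def looks_like_oads(sentence):
        return any(host in url.lower()
                   for url in URL_RE.findall(sentence)
                   for host in OADS_HOSTS)

    print(looks_like_oads("Code is available at https://github.com/lamps-lab/oadsclassifier."))  # True
    print(looks_like_oads("See https://example.com/press-release for details."))                  # False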

Semi-automated Literature Review for Scientific Assessment of Socioeconomic Climate Change Scenarios

Climate change is now recognized as a global threat, and the literature surrounding it continues to increase exponentially. Expert bodies such as the Intergovernmental Panel on Climate Change (IPCC) are tasked with periodically assessing the literature to extract policy-relevant scientific conclusions that might guide policymakers. However, concerns have been raised that climate change research may be too voluminous for traditional literature review to adequately cover. It has been suggested that practices for literature review for scientific assessment be updated or augmented with semi-automated approaches from bibliometrics or scientometrics. In this study, we explored the feasibility of such recommendations for the scientific assessment of the literature around socioeconomic climate change scenarios, the so-called Shared Socioeconomic Pathways (SSPs). For automated literature reviews, most methods can be subsumed under two broad categories of classification tasks that use either (1) Natural Language Processing (NLP) or (2) citation networks. We performed two levels of classification: (1) identifying SSP articles from a large corpus of climate change research and developing a database of SSP-related articles; (2) classifying SSP articles into different sectoral categories. We applied three machine learning algorithms for the text classification task: Multinomial Naïve Bayes, Logistic Regression, and Linear Support Vector Classification. However, the vocabulary of the SSP literature too closely resembles the vocabulary of broader climate change research for an NLP approach to be effective. We then attempted a citation network approach, comparing two community detection algorithms (the Louvain algorithm and the fluid community detection algorithm), with one set of runs containing 8 clusters and the next containing 16. The citation network approach outperformed NLP with respect to false negatives. It also provided the ability to assess the uptake of SSPs across different sectors of climate change research. We concluded that, at the time of the study, the SSP corpus may not yet be large enough, or distinct enough from broader climate change research, for applying machine learning techniques to automated literature review. However, our research suggests that until there is a critical mass of SSP studies, there is the potential to divide labor between human and machine readers. Some of the data collection tasks currently done by human author teams, such as assessing scenario research, could be semi-automated to ensure and enhance the coverage of the literature. We also drew conclusions about the uptake of the SSP framework over its first 5 years in the broader climate change research literature. We observed that the uptake of SSPs in certain sub-disciplines (e.g., food systems) progressed slowly. Hence, to keep SSPs relevant, it may be fruitful to target SSP studies to particular research communities (e.g., sectors with slower uptake).
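For readers who want to try the citation-network route, a toy sketch using networkx (assuming a version that ships both Louvain and fluid community detection, and a stand-in graph rather than the study's citation network) is given below.

    # Hedged sketch of the citation-network route on a toy graph.
    import networkx as nx
    from networkx.algorithms.community import louvain_communities, asyn_fluidc

    G = nx.karate_club_graph()                     # stand-in for a citation network

    louvain = louvain_communities(G, seed=42)
    fluid = list(asyn_fluidc(G, k=8, seed=42))     # fixed cluster count, loosely mirroring the 8-cluster runs
    print(len(louvain), "Louvain communities;", len(fluid), "fluid communities")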

SciNoBo: A Hierarchical Multi-Label Classifier of Scientific Publications

Classifying scientific publications according to Field-of-Science (FoS) taxonomies is of crucial importance, allowing funders, publishers, scholars, companies and other stakeholders to organize scientific literature more effectively. Most existing works address classification either at venue level or solely based on the textual content of a research publication. We present SciNoBo, a novel classification system of publications to predefined FoS taxonomies, leveraging the structural properties of a publication and its citations and references organized in a multilayer network. In contrast to other works, our system supports assignments of publications to multiple fields by considering their multidisciplinarity potential. By unifying publications and venues under a common multilayer network structure made up of citing and publishing relationships, classifications at the venue-level can be augmented with publication-level classifications. We evaluate SciNoBo on a dataset of publications extracted from Microsoft Academic Graph, and we perform a comparative analysis against a state-of-the-art neural-network baseline. The results reveal that our proposed system is capable of producing high-quality classifications of publications.

Examining the ORKG towards Representation of Control Theoretic Knowledge – Preliminary Experiences and Conclusions

Control theory is an interdisciplinary academic domain which contains sophisticated elements from various sub-domains of both mathematics and engineering. The issue of knowledge transfer thus poses a considerable challenge w.r.t. transfer between researchers focusing on different niches as well as w.r.t. transfer into potential application domains. The paper investigates the Open Research Knowledge Graph (ORKG) as a medium to facilitate such knowledge transfer. In particular, it investigates the current state of control theoretic knowledge represented in the ORKG and describes the process of extending that knowledge as well as the challenges observed along the way. The main results are a) a list of best practice suggestions for ORKG contributions and b) a list of improvement suggestions for the further development of the ORKG and similar platforms. All relevant claims w.r.t. the ORKG are backed by SPARQL queries and some further evaluation code, both publicly available for the sake of reproducibility.

SESSION: Workshop: SeBiLAn – International Workshop on Semantics-enabled Biomedical Literature Analytics

The International Workshop on Semantics-enabled Biomedical Literature Analytics (SeBiLAn)

Powering Semantic Analysis with Bio-ontologies

In the past three decades, bio-ontologies have moved from esoteric artifacts to key resources for the semantic analysis of biomedical text. In this presentation, we will follow the evolution of the creation and integration of bio-ontologies, as well as their role in biomedical applications, including literature analysis.

Bio-ontologies include a variety of resources that provide a source of names for biomedical entities and specify relations among these entities. These resources are generally developed independently by individuals, collectives, institutions, and standard development organizations. Examples of bio-ontologies include the Medical Subject Headings (MeSH), developed by the National Library of Medicine to support the indexing and retrieval of the biomedical literature, the Gene Ontology (GO), developed by the GO Consortium to support consistent annotation and analysis of gene products across organisms, and SNOMED CT, developed by SNOMED International to support clinical documentation and analytics worldwide. While these three examples illustrate resources with a large scope, many other bio-ontologies focus on a specialized subdomain of medicine. Bio-ontologies use different formalisms and various degrees of formality for their representation.

Bio-ontologies play an important role in the semantic analysis of biomedical datasets, including the biomedical literature. Bio-ontologies provide a source of vocabulary for biomedical entities, used for named entity recognition (i.e., finding mentions of biomedical entities in text) and entity resolution (i.e., mapping mentions to a specific reference). Bio-ontologies also provide semantic categorization for biomedical entities, which is leveraged for word-sense disambiguation and co-reference resolution (especially when a specific entity is referred to with a broader category). Finally, bio-ontologies provide a source of relations among entities, which can form the basis for relation extraction, hypothesis generation and, more generally, literature-based discovery.

Since most bio-ontologies are developed independently, but often need to be used together, ontology alignment techniques have been developed to identify correspondences among entities across ontologies. Repositories of bio-ontologies, such as the Unified Medical Language System (UMLS) Metathesaurus, the National Center for Biomedical Ontology (NCBO) BioPortal and the Open Biological and Biomedical Ontology (OBO) Foundry are useful sources of ontologies and contribute to the development of tools to support semantic analysis of biomedical text.

Why Bother Enabling Biomedical Literature Analysis with Semantics?

These days, ELMo [3], BERT [1], BART [2] and other similarly cutely-named models appear to have dramatically advanced the state of the art in basically every problem in natural language processing and information retrieval. It can leave a researcher wondering whether there is more to language processing than deploying or fine-tuning contextual word embeddings. What of formal semantics and knowledge representation? What value do these bring to text analysis, either in modelling or in task definitions? In this talk, I will try to explore these questions, from the perspective of my long-running experiences in biomedical information extraction and literature exploration. Perhaps we can shift the academic conversation from a one-model-fits-all solution for individual tasks to a more nuanced consideration of complex, multi-faceted problems in which such models certainly can play a critical role but aren’t necessarily “all you need” [4].

Exploring Representations for Singular and Multi-Concept Relations for Biomedical Named Entity Normalization

Since the rise of the COVID-19 pandemic, peer-reviewed biomedical repositories have experienced a surge in chemical and disease related queries. These queries use a wide variety of naming conventions and nomenclatures, ranging from trademark and generic names to chemical composition mentions. Normalizing or disambiguating these mentions within texts provides researchers and data-curators with more relevant articles returned by their search query. Named entity normalization aims to automate this disambiguation process by linking entity mentions onto their appropriate candidate concepts within a biomedical knowledge base or ontology. We explore several term embedding aggregation techniques, as well as how the term's context affects evaluation performance. We also evaluate our embedding approaches for normalizing term instances containing one or many relations within unstructured texts.
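
For readers unfamiliar with the pipeline sketched above, the toy example below shows the general shape of embedding-based normalization (not the paper's implementation): per-token vectors for a mention are aggregated (mean or max pooling) and the mention is linked to the candidate concept with the highest cosine similarity. The concept names and random vectors are placeholders for real contextual embeddings:

```python
# Aggregate token embeddings of a mention and link it to the nearest concept.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical candidate concepts from a knowledge base, each with an embedding.
concepts = {"aspirin": rng.normal(size=dim), "influenza": rng.normal(size=dim)}

def aggregate(token_vectors, how="mean"):
    """Aggregate per-token embeddings into a single mention embedding."""
    stack = np.vstack(token_vectors)
    return stack.mean(axis=0) if how == "mean" else stack.max(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

mention_tokens = [rng.normal(size=dim), rng.normal(size=dim)]  # e.g. "acetylsalicylic acid"
mention_vec = aggregate(mention_tokens, how="mean")

best = max(concepts, key=lambda c: cosine(mention_vec, concepts[c]))
print("linked concept:", best)
```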

Graph Convolutional Networks for Chemical Relation Extraction

Extracting information regarding novel chemicals and chemical reactions from chemical patents plays a vital role in the chemical and pharmaceutical industry. Due to the increasing volume of chemical patents, there is an urgent need for automated solutions to extract relations between chemical compounds. Several studies have used models that apply attention mechanisms such as Bidirectional Encoder Representations from Transformers (BERT) to capture the contextual information within a text. However, these models do not capture global information about a specific vocabulary. On the other hand, Graph Convolutional Networks (GCNs) capture global dependencies between terms within a corpus but not the local contextual information. In this work, we propose two novel approaches, GCN-Vanilla and GCN-BERT, for chemical relation extraction. The GCN-Vanilla approach builds a single graph for the whole corpus based on word co-occurrence and sentence-word relations. Then, we model the graph with a GCN to capture the global information and classify the sentence nodes. The GCN-BERT approach combines GCN and BERT to capture both global and local information and build a joint final representation for relation extraction. We evaluate our approaches on the CLEF-2020 dataset. Our results show the combined GCN-BERT approach outperforms standalone BERT and GCN models, and achieves a higher F1 than that reported in our previous studies.

Biomedical Word Sense Disambiguation with Contextualized Representation Learning

Representation learning is an important component in solving most Natural Language Processing (NLP) problems, including Word Sense Disambiguation (WSD). The WSD task tries to find the best meaning in a knowledge base for a word with multiple meanings (an ambiguous word). WSD methods choose this best meaning based on the context, i.e., the words around the ambiguous word in the input text document. Thus, word representations may improve the effectiveness of disambiguation models if they carry useful information from the context and the knowledge base. Most current representation learning approaches are trained on general English text and are not domain-specific. In this paper, we present a novel contextual, knowledge-base-aware sense representation method for the biomedical domain. The novelty of our representation is the integration of the knowledge base and the context. This representation lies in a space comparable to that of contextualized word vectors, thus allowing a word occurrence to be easily linked to its meaning by applying a simple nearest neighbor approach. Comparing our approach with state-of-the-art methods shows the effectiveness of our method in terms of text coherence.

SESSION: Workshop: SocialNLP – 10th International Workshop on Natural Language Processing for Social Media

SocialNLP’22: 10th International Workshop on Natural Language Processing for Social Media

SocialNLP is a new inter-disciplinary area of natural language processing (NLP) and social computing. We consider three plausible directions of SocialNLP: (1) addressing issues in social computing using NLP techniques; (2) solving NLP problems using information from social networks or social media; and (3) handling new problems related to both social computing and natural language processing. The 10th SocialNLP workshop is held at TheWebConf 2022 and NAACL 2022. For SocialNLP @ TheWebConf 2022, we eventually accepted six papers, including five research papers and one data paper. Each submission was reviewed by at least two reviewers. The acceptance ratio is 60%. We will also have a special event to celebrate the success of SocialNLP over the past ten years. Last, we sincerely thank all authors, program committee members, and workshop chairs for their great contributions and help in this edition of the SocialNLP workshop.

Multi-Context Based Neural Approach for COVID-19 Fake-News Detection

While the world is facing the COVID-19 pandemic, society is also fighting another battle: tackling misinformation. Due to the widespread effect of COVID-19 and the increased usage of social media, fake news and rumors about COVID-19 are being spread rapidly. Identifying such misinformation is a challenging and active research problem. The lack of suitable datasets and external world knowledge contributes to the challenges associated with this task. In this paper, we propose MiCNA, a multi-context neural architecture to mitigate the problem of COVID-19 fake news detection. In the proposed model, we leverage the rich information of three different pre-trained transformer-based models, i.e., BERT, BERTweet and COVID-Twitter-BERT, covering three different aspects of information (viz. general English language semantics, tweet semantics, and information related to tweets on COVID-19), which together give us a single multi-context representation. Our experiments provide evidence that the proposed model outperforms the existing baseline and the candidate models (i.e., the three transformer architectures) and becomes a state-of-the-art model on the task of COVID-19 fake-news detection. We achieve new state-of-the-art performance on a benchmark COVID-19 fake-news dataset with 98.78% accuracy on the validation dataset and 98.69% accuracy on the test dataset.

Measuring the Privacy Dimension of Free Content Websites through Automated Privacy Policy Analysis and Annotation

Websites that provide books, music, movies, and other media free of charge are a central piece of the web ecosystem, although they remain vastly unexplored, especially with respect to their security and privacy risks. In this paper, we contribute to the understanding of those websites by focusing on a comparative analysis of their privacy policies, a primary channel through which service providers inform users about their data collection and use. To better understand the data usage risks associated with such websites, we study 1,562 websites and their privacy policies in contrast to premium websites. We uncover that premium websites are more transparent in reporting their privacy practices, particularly in categories such as “Data Retention” and “Do Not Track”, with premium websites being 85% and ≈70% more likely to report their practices than the free content websites. We found the free content websites’ privacy policies to be more similar to one another and more generic in comparison to the premium websites’ privacy policies. Our findings raise several concerns, including that the reported privacy policies may not reflect the data collection practices used by service providers, as well as pronounced biases across privacy policy categories. This calls for further investigation of the risks associated with the usage of such free content websites and services through active measurements.

KAHAN: Knowledge-Aware Hierarchical Attention Network for Fake News detection on Social Media

In recent years, fake news detection has attracted a great deal of attention due to the vast amount of misinformation. Some previous methods have focused on modeling the news content, while others have combined user comments and user information on social media. However, existing methods ignore some important clues for detecting fake news, such as temporal information on social media and external knowledge related to the news. To this end, we propose a Knowledge-Aware Hierarchical Attention Network (KAHAN) that integrates this information into the model to establish fact-based associations with entities in the news content. Specifically, we introduce two hierarchical attention networks to model news content and user comments respectively, in which news content and user comments are represented by different aspects for modeling various degrees of semantic granularity. In addition, to handle the irregular arrival of user comments at the post level, we design a time-based subevent division algorithm that aggregates user comments at the subevent level to learn temporal patterns. Moreover, News towards Entities (N-E) attention and Comments towards Entities (C-E) attention are introduced to measure the importance of external knowledge. Finally, we detect the veracity of the news by combining these three aspects: news content, user comments, and external knowledge. We conducted extensive experiments and ablation studies on two real-world datasets, showed that our proposed method outperforms the previous methods, and empirically validated each component of KAHAN.

Hoaxes and Hidden agendas: A Twitter Conspiracy Theory Dataset: Data Paper

Hoaxes and hidden agendas make for compelling conspiracy theories. While many of these theories are ultimately innocuous, others have the potential to do real harm, instigating real-world support or disapproval of the theories. This is further fueled by social media, which provides a platform for conspiracy theories to spread at unprecedented rates. Thus, there is a need for the development of automated models to detect conspiracy theories in the social media space in order to quickly and effectively identify the topics of the season and the prevailing stance. To support this development, we create ground truth data through human annotation. In this work, we collect and manually annotate a dataset from Twitter comprising four conspiracy theories. Each Tweet is annotated with one of the four topics {climate change, COVID-19 origin, COVID-19 vaccine, Epstein-Maxwell trial} and its stance towards the conspiracy theory {support, neutral, against}. We perform experiments on this multi-topic dataset to demonstrate its usage in conspiracy detection, stance detection and topic detection.

Influence of Language Proficiency on the Readability of Review Text and Transformer-based Models for Determining Language Proficiency

In this study, we analyze the influence of the English language proficiency of non-native speakers on the readability of the text written by them. In addition, we present multiple approaches for automatically determining the language proficiency levels of non-native English speakers from review data. To accomplish the above-mentioned tasks, we first introduce an annotated social media corpus of around 1000 reviews written by non-native English speakers of the following five English language proficiency (ELP) groups: very high proficiency (VHP), high proficiency (HP), moderate proficiency (MP), low proficiency (LP), and very low proficiency (VLP). We employ the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade (FKG) tests to compute the readability scores of the reviews written by various ELP groups. We leverage both classical machine learning (ML) classifiers and transformer-based language models for deciding the language proficiency groups of the reviewers. We observe that distinct ELP groups do not exhibit any noticeable differences in mean FRE scores, although slight differences are observed in the FKG test. These results imply that the readability measures do not possess high discriminating capability to distinguish various ELP groups. In the language proficiency determination task, we notice that fine-tuned transformer-based approaches yield slightly better efficacy than the traditional ML classifiers.
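
For reference, the two readability tests used above have standard closed-form definitions; the sketch below computes both, with a deliberately rough vowel-run syllable counter (the paper's exact tooling is not specified, so this is only an approximation):

```python
# Standard Flesch Reading Ease and Flesch-Kincaid Grade formulas.
import re

def count_syllables(word):
    """Very rough syllable estimate: runs of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)
    # FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)
    # FKG = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    fkg = 0.39 * (n_words / sentences) + 11.8 * (n_syll / n_words) - 15.59
    return fre, fkg

print(readability("The product arrived quickly. I am satisfied with the quality."))
```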

Making Adversarially-Trained Language Models Forget with Model Retraining: A Case Study on Hate Speech Detection

Adversarial training has become almost the de facto standard for robustifying Natural Language Processing models against adversarial attacks. Although adversarial training has proven to achieve accuracy gains and boost the performance of algorithms, research has not shown how adversarial training will stand “the test of time” when models are deployed and updated with new non-adversarial data samples. In this study, we aim to quantify the temporal impact of adversarial training on naturally-evolving language models using the hate speech task. We conduct extensive experiments on the Tweet Eval benchmark dataset using multiple hate speech classification models. In particular, our findings indicate that adversarial training is highly task- and dataset-dependent: models trained on the same dataset achieve high prediction accuracy but fare poorly when tested on a new dataset, even after retraining with adversarial examples. We attribute this temporal and limited effect of adversarial training to distribution shift in the training data, which implies that model quality will degrade over time as models are deployed in the real world and start serving new data.

SESSION: Workshop: TempWeb – 12th International Workshop on Temporal Web Analytics

12th Temporal Web Analytics Workshop (TempWeb) Overview

TempWeb focuses on investigating infrastructures, scalable methods, and innovative software for aggregating, querying, and analyzing heterogeneous data at Web scale. Emphasis is given to data analysis along the time dimension for web data that has been collected over extended time periods. A major challenge in this regard is the sheer size of the data and the ability to make sense of it in a useful and meaningful manner for its users. It is worth noting that this trend of using big data to make inferences is not specific to Web content analytics, so work presented here might be useful in other areas, too. As such, longitudinal aspects in Web content analysis become relevant for analysts from various domains, including, but not limited to, sociology, marketing, environmental studies, and politics. Studies in this context range from “low-level” structural network log analysis over time, up to “high-level” entity-level Web content analytics and terminology evolution. While the two aforementioned aspects represent the extremes of the spectrum, they have one thing in common: Web-scale data analytics needs infrastructures and extended analytical tools in order to make use of that data. TempWeb has been created for this purpose and is the ideal venue to discuss all its facets.

Temporal Question Answering in News Article Collections

The fields of automatic question answering and reading comprehension have been rapidly advancing recently [6]. Open-domain question answering, in particular, assumes answering arbitrary user questions from a large document collection. Existing QA approaches, however, generally work on synchronic document collections such as Wikipedia, Web data or short-term news corpora. This talk is about our latest efforts in automatic question answering over temporal news collections, which can contain millions of news articles published during several decades’ long time frames. Temporal aspects of both news articles and user questions form an additional challenge for this kind of temporal question answering task. To correctly answer questions over such collections one usually needs first to find candidate documents that are likely to contain these answers. We will first discuss a re-ranking approach [3] for news articles which works by utilizing temporal information embedded in questions and in the underlying document collection, thus combining methods from Temporal Information Retrieval [1, 2] and Natural Language Processing. Next, we will discuss a dedicated solution for answering ”When” type questions, which require finding occurrence dates of events described in input questions based on the underlying news archive [5]. Finally, we will introduce ArchivalQA [4], a large-scale question answering dataset which has been automatically created from a two decades’ long news article collection and which contains over 500k question-answer pairs. The dataset has been processed to remove temporally ambiguous questions for which more than one correct answer exists.

Semantic Modelling of Document Focus-Time for Temporal Information Retrieval

An accurate understanding of the temporal dynamics of Web content and user behaviors plays a crucial role during the interactive process between search engines and users. In this work, we focus on how to improve retrieval performance via a better understanding of the time factor. On the one hand, we proposed a novel method to estimate the focus-time of documents leveraging their semantic information. On the other hand, we introduced query trend time for understanding the temporal intent underlying a search query based on Google Trends. Furthermore, we applied the proposed methods to two search scenarios: temporal information retrieval and temporal diversity retrieval. Our experimental results based on NTCIR Temporalia test collections show that: (1) Semantic information can be used to predict the temporal tendency of documents. (2) The semantic-based model works effectively even when few temporal expressions and entity names are available in documents. (3) The effectiveness of the estimated focus-time was comparable to that of the article’s publication time in relevance modelling, and thus, our method can be used as an alternative or supplementary tool when reliable publication dates are not available. (4) The trend time can improve the representation of temporal intents behind queries over query issue time.

Analytical Models for Motifs in Temporal Networks

Dynamic evolving networks capture temporal relations in domains such as social networks, communication networks, and financial transaction networks. In such networks, temporal motifs, which are repeated sequences of time-stamped edges/transactions, offer valuable information about the networks’ evolution and function. However, calculating temporal motif frequencies is computationally expensive as it requires, first, identifying all instances of the static motifs in the static graph induced by the temporal graph and, second, counting the number of subsequences of temporal edges that correspond to a temporal motif and occur within a time window. Since the number of temporal motifs changes over time, finding interesting temporal patterns involves iterative application of the above process over many consecutive time windows. This makes it impractical to scale to large real temporal networks. Here, we develop a fast and accurate model-based method for counting motifs in temporal networks. We first develop the Temporal Activity State Block Model (TASBM) to model temporal motifs in temporal graphs. Then we derive closed-form analytical expressions that allow us to quickly calculate expected motif frequencies and their variances in a given temporal network. Finally, we develop an efficient model fitting method, so that for a given network, we quickly fit the TASBM model and compute motif frequencies. We apply our approach to two real-world networks: a network of financial transactions and an email network. Experiments show that our TASBM framework (1) accurately counts temporal motifs in temporal networks; (2) easily scales to networks with tens of millions of edges/transactions; (3) is about 50x faster than explicit motif counting methods on networks of about 5 million temporal edges, a factor which increases with network size.
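
To make the counting step concrete, here is a deliberately naive brute-force sketch (not the TASBM model) that counts one particular temporal motif, a directed triangle whose three time-stamped edges occur in increasing order within a window delta; the edge list and window are invented for the example:

```python
# Brute-force counting of a 3-edge cyclic temporal motif within a time window.
from itertools import combinations

# Hypothetical temporal edge list: (source, target, timestamp)
edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("b", "c", 9), ("c", "a", 10)]
delta = 5

def count_triangle_motifs(temporal_edges, delta):
    count = 0
    for e1, e2, e3 in combinations(sorted(temporal_edges, key=lambda e: e[2]), 3):
        in_window = e3[2] - e1[2] <= delta
        ordered = e1[2] < e2[2] < e3[2]
        closes_cycle = e1[1] == e2[0] and e2[1] == e3[0] and e3[1] == e1[0]
        if in_window and ordered and closes_cycle:
            count += 1
    return count

print(count_triangle_motifs(edges, delta))  # brute force; real methods avoid this blow-up
```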

Multi-touch Attribution for Complex B2B Customer Journeys using Temporal Convolutional Networks

Customer journeys in Business-to-Business (B2B) transactions contain long and complex sequences of interactions between different stakeholders from the buyer and seller companies. On the seller side, there is significant interest in the multi-touch attribution (MTA) problem, which aims to identify the most influential stage transitions (in the B2B customer funnel), channels, and touchpoints. We design a novel deep learning-based framework, which solves these attribution problems by modeling the conversion of journeys as functions of stage transitions that occur in them. Each stage transition is modeled as a Temporal Convolutional Network (TCN) on the touchpoints that precede it. Further, a global conversion model Stage-TCN is built by combining these individual stage transition models in a non-linear fashion. We apply Layer-wise Relevance Propagation (LRP) based techniques to compute the relevance of all nodes and inputs in our network and use these to compute the required attribution scores. We run extensive experiments on two real-world B2B datasets and demonstrate superior accuracy of the conversion model compared to prior works. We validate the attribution scores using perturbation-based techniques that measure the change in model output when parts of the input having high attribution scores are deleted.

Diachronic Analysis of Time References in News Articles

Time expressions embedded in text are important for many downstream tasks in NLP and IR. They have been, for example, utilized for timeline summarization, named entity recognition, temporal information retrieval, question answering and others. In this paper, we introduce a novel analytical approach to analyzing characteristics of time expressions in diachronic text collections. Based on a collection of news articles published over a 33-year-long time span, we investigate several aspects of time expressions with a focus on their interplay with the publication dates of the containing documents. We utilize a graph-based representation of temporal expressions to represent them through their co-occurring named entities. The proposed approach results in several observations that could be utilized in automatic systems that rely on processing temporal signals embedded in text. It could also be of importance for professionals (e.g., historians) who wish to understand fluctuations in collective memories and collective expectations based on large-scale, diachronic document collections.

Detection of Infectious Disease Outbreaks in Search Engine Time Series Using Non-Specific Syndromic Surveillance with Effect-Size Filtering

Novel infectious disease outbreaks, including most recently that of the COVID-19 pandemic, could be detected by non-specific syndromic surveillance systems. Such systems, utilizing a variety of data sources ranging from Electronic Health Records to internet data such as aggregated search engine queries, create alerts when unusually high rates of symptom reports occur. This is especially important for the detection of novel diseases, where their manifested symptoms are unknown.

Here we improve upon a set of previously-proposed non-specific syndromic surveillance methods by taking into account both how unusual a preponderance of symptoms is and their effect size.

We demonstrate that our method is as accurate as previously-proposed methods for low dimensional data and show its effectiveness for high-dimensional aggregated data by applying it to aggregated time-series health-related search engine queries. We find that in 2019 the method would have raised alerts related to several disease outbreaks earlier than health authorities did. During the COVID-19 pandemic the system identified the beginning of pandemic waves quickly, through combinations of symptoms which varied from wave to wave.

Thus, the proposed method could be used as a practical tool for decision makers to detect new disease outbreaks using time series derived from search engine data even in the absence of specific information on the diseases of interest and their symptoms.
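
As a rough, hypothetical illustration of the general idea described above (not the authors' method or thresholds), an alert can be required to pass both a statistical-unusualness test and an effect-size test before it fires; all numbers below are invented:

```python
# Flag days whose query rate is both statistically unusual and large in effect size.
import numpy as np

def alerts(series, z_thresh=3.0, effect_thresh=0.5, window=28):
    """series: daily counts of a symptom-related query; thresholds are illustrative."""
    series = np.asarray(series, dtype=float)
    flagged = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu, sigma = hist.mean(), hist.std() + 1e-9
        z = (series[t] - mu) / sigma                  # how unusual today is
        effect = (series[t] - mu) / (mu + 1e-9)       # relative increase over baseline
        if z > z_thresh and effect > effect_thresh:
            flagged.append(t)
    return flagged

baseline = [20, 22, 19, 21, 20, 23, 18] * 4           # four quiet weeks
print(alerts(baseline + [21, 22, 45]))                 # spike on the last day
```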

A Bi-level Assessment of Twitter Data for Election Prediction: Delhi Assembly Elections 2020

Elections are the backbone of any democratic country, where voters elect candidates as their representatives. The emergence of social networking sites has provided a platform for political parties and their candidates to connect with voters in order to spread their political ideas. Our study aims to use Twitter to assess the outcome of the Delhi Assembly elections held in 2020, using a bi-level approach, i.e., concerning political parties and their candidates. We analyze the correlation of election results with the activities of different candidates and parties on Twitter, and the response of voters to them, especially the mentions and sentiment of voters towards a party over time. The Twitter profiles of the candidates are compared both at the party level and at the candidate level to evaluate their association with the outcome of the election. We observe that the number of followers and the replies to candidates’ tweets are good indicators for predicting actual election outcomes. However, we observe that the number of tweets mentioning a party and the temporal analysis of voters’ sentiment towards the party shown in tweets are not aligned with the election result. Moreover, the variations over time in the activeness of candidates and political parties on Twitter are also not very helpful in identifying the winner. Thus, merely using temporal data from Twitter is not sufficient to make accurate predictions, especially for countries like India.

SESSION: Workshop: DLWoT – The 2nd International Workshop on Deep Learning for the Web of Things

DLWoT’22: 2nd International Workshop on Deep Learning for the Web of Things

In recent years, deep learning and the Web of Things (WoT) have become hot topics. The relevant research issues in deep learning have been increasingly investigated and published. This workshop, the 2nd International Workshop on Deep Learning for the Web of Things (DLWoT’22), is organized at the Web Conference 2022 (WWW’22). The workshop solicits papers on various disciplines, which include but are not limited to: (1) deep learning for massive IoT, (2) deep learning for critical IoT, (3) deep learning for enhancing IoT security, (4) deep learning for enhancing IoT privacy, (5) preprocessing of IoT data for AI modeling, and (6) deep learning for IoT applications (e.g., smart home, smart agriculture, interactive art, and so on). DLWoT’22 includes two tracks: (1) keynote speaker and (2) workshop papers. One keynote speaker is invited to give a talk, and 8 accepted workshop papers are presented.

Adaptively Offloading the Software for Mobile Edge Computing

In mobile edge computing (MEC), computation offloading is a promising way to support resource-constrained mobile devices, since it moves some time-consuming computation activities to nearby edge servers. Owing to the geographical distribution of edge servers and the mobility of mobile devices, the runtime environment of MEC is highly complex and dynamic, so it is challenging to efficiently support computation offloading in MEC. In the face of such an environment, an “adaptive” model, in which software perceives the environment, changes its behavior and improves performance, has become an inevitable demand for computation offloading in MEC. However, this involves two main challenges:

(1) Adaptability: The software often faces changes in the runtime environment of MEC, so adaptation of the offloading is needed. However, due to the inherent architecture of the software, it is hard to utilize the dispersed computing resources and change the offloading scheme.

(2) Effectiveness: When the environment changes, the system needs to calculate which parts are worth offloading; the reduction in execution time must be greater than the network delay caused by computation offloading (a simple decision-rule sketch follows this abstract). In addition, the offloading decision is expected to be made at runtime.

This report introduces our research on adaptively offloading the software for MEC. First, the code of software can be automatically refactored to implement a special program structure supporting dynamic offloading in MEC [2, 3]. Second, the execution costs of offloading schemes can be accurately estimated without runtime execution [1, 3]. Third, new offloading strategies are proposed to reduce decision time to an acceptable level [4].
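
The effectiveness condition in challenge (2) boils down to a cost comparison. Purely as a hedged illustration (the cost model, function, and numbers below are assumptions for the example, not the authors' offloading strategy):

```python
# Offload a code part only if the saved execution time exceeds the extra network delay.

def should_offload(local_time_s, remote_time_s, data_mb, bandwidth_mbps, rtt_s):
    """Offload iff local execution is slower than remote execution plus transfer."""
    network_delay = rtt_s + data_mb * 8 / bandwidth_mbps  # round trip + upload time
    return local_time_s > remote_time_s + network_delay

# e.g. a 2.0 s local task that runs in 0.4 s on the edge server,
# shipping 5 MB over a 50 Mbps link with 40 ms RTT:
print(should_offload(local_time_s=2.0, remote_time_s=0.4,
                     data_mb=5, bandwidth_mbps=50, rtt_s=0.04))
```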

Word Embedding based Heterogeneous Entity Matching on Web of Things

The Web of Things (WoT) is capable of promoting knowledge discovery and addressing interoperability problems of diverse Internet of Things (IoT) applications. However, due to the dynamic and diverse features of data entities on WoT, heterogeneous entity matching has become arguably the greatest “new frontier” for WoT advancements. Currently, the data entities and the corresponding knowledge on WoT are generally modelled with ontologies, and therefore matching heterogeneous data entities on WoT can be converted to the problem of matching ontologies. Ontology matching is a complex cognitive process that is usually done manually by domain experts. To effectively distinguish the heterogeneous entities and determine high-quality ontology alignment, this work proposes a word embedding based matching technique. Our approach models each word’s semantics in a vector space and uses the cosine of the angle between two vectors to measure the corresponding words’ similarity. In addition, the word embedding approach does not depend on a specific knowledge base and retains the rich semantic information of words, which makes our proposal more robust. The experiments use the Ontology Alignment Evaluation Initiative (OAEI) benchmark, and the results show that our approach outperforms other advanced matching methods.
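
A minimal sketch of that cosine-based matching idea (not the paper's system): embed entity labels, score every cross-ontology pair by cosine similarity, and keep pairs above a threshold as candidate correspondences. The ontologies, labels, random vectors, and threshold below are all placeholders:

```python
# Score cross-ontology label pairs by cosine similarity and threshold them.
import numpy as np

rng = np.random.default_rng(42)
dim = 16
onto_a = {"Temperature Sensor": rng.normal(size=dim), "Light Bulb": rng.normal(size=dim)}
onto_b = {"Thermometer": rng.normal(size=dim), "Lamp": rng.normal(size=dim)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

threshold = 0.0  # illustrative only; real embeddings allow a much stricter cut-off
alignment = [(ea, eb) for ea, va in onto_a.items() for eb, vb in onto_b.items()
             if cosine(va, vb) > threshold]
print(alignment)
```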

A Spatio-Temporal Data-Driven Automatic Control Method for Smart Home Services

With the rapid development of smart home technologies, various smart devices have entered and brought convenience to people’s daily lives. Meanwhile, higher demands for smart home services have gradually emerged, which cannot be well satisfied by traditional service provisioning approaches. This is because traditional smart home control systems commonly rely on manual operations and fixed rules, which cannot satisfy changeable user demands and may seriously degrade the user experience. Therefore, it is necessary to capture user preferences based on their historical behavior data. To address the above problems, a temporal knowledge graph is first proposed to support the acquisition of user-perceived environmental data and user behavior data. Next, a user-oriented smart home service prediction model is designed based on the temporal knowledge graph, which can predict the service status and automatically perform the corresponding service for each user. Finally, a prototype system is built according to a real-world smart home environment. The experimental results show that the proposed method can provide personalized smart home services and satisfy user demands well.

Discovering Top-k Profitable Patterns for Smart Manufacturing

In the past, many studies have been developed to discover useful knowledge from rich data for decision-making in a wide range of Internet of Things and Web of Things applications, such as smart manufacturing. Utility-driven pattern mining (UPM) is a well-known technology in the knowledge discovery domain. However, one of the biggest issues of UPM is the setting of a suitable minimum utility threshold (minUtil). The higher minUtil is, the fewer interesting patterns are obtained. Conversely, the lower minUtil is, the more useless patterns are discovered. In this paper, we propose a solution for discovering top-k profitable patterns with the average-utility measure, which can be applied to manufacturing. The average-utility of a pattern, i.e., its utility normalized by its length, can be used to measure the pattern fairly. The proposed new upper-bounds on average-utility are tighter than previous upper-bounds. Moreover, based on these upper-bounds, the novel PPT algorithm utilizes merging and projection techniques to greatly reduce the search space. By adopting several threshold raising strategies, the PPT algorithm can discover the correct top-k patterns in a short time. We also evaluated the efficiency and effectiveness of the algorithm on real and synthetic datasets. The experimental results reveal that the algorithm not only obtains a complete set of top-k interesting patterns, but also works better than the state-of-the-art algorithm in terms of runtime, memory consumption and scalability. In particular, the proposed algorithm performs very well on dense datasets.
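
To illustrate the average-utility measure itself (this toy enumeration is not the PPT algorithm, and the transactions, profits, and candidate patterns are invented), a pattern's utility summed over supporting transactions is divided by its length, and the top-k candidates are kept:

```python
# Score candidate patterns by average-utility (utility / length) and keep the top-k.
import heapq

# Hypothetical transactions: item -> (quantity, unit profit)
transactions = [
    {"milk": (2, 3), "bread": (1, 2)},
    {"milk": (1, 3), "bread": (2, 2), "butter": (1, 5)},
    {"butter": (2, 5)},
]

def average_utility(pattern, db):
    total = 0
    for t in db:
        if all(item in t for item in pattern):
            total += sum(t[item][0] * t[item][1] for item in pattern)
    return total / len(pattern)

candidates = [("milk",), ("butter",), ("milk", "bread"), ("bread", "butter")]
k = 2
topk = heapq.nlargest(k, candidates, key=lambda pat: average_utility(pat, transactions))
print([(pat, average_utility(pat, transactions)) for pat in topk])
```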

Fast RFM Model for Customer Segmentation

With the boom of e-commerce and the World Wide Web (WWW), a powerful tool in customer relationship management (CRM), the RFM analysis model, has been used to help major enterprises make more profit. Combined with data mining technologies, a CRM system can automatically predict the future behavior of customers to raise the customer retention rate. However, a key issue is that existing RFM analysis models are not efficient enough. Thus, in this study, a fast algorithm based on a compact list-based data structure is proposed, along with several efficient pruning strategies, to address this issue. The new algorithm considers recency (R), frequency (F), and monetary/utility (M) as three different thresholds to discover interesting patterns whose R, F, and M values are no less than the user-specified minimum values. More significantly, the downward-closure property of the frequency and monetary metrics is utilized to discover super-itemsets. An extensive experimental study demonstrates that the algorithm outperforms state-of-the-art algorithms on various datasets. It is also demonstrated that the proposed algorithm performs well when considering the frequency metric alone.
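
For readers unfamiliar with RFM, the plain (non-optimized) scoring it builds on looks roughly like the sketch below; the customers, purchases, and thresholds are invented, and the paper's list-based mining algorithm is considerably more involved:

```python
# Plain RFM scoring: recency, frequency, monetary per customer, filtered by thresholds.
from datetime import date

# Hypothetical purchase history: customer -> list of (date, amount)
purchases = {
    "alice": [(date(2022, 3, 1), 40.0), (date(2022, 4, 10), 25.0)],
    "bob":   [(date(2021, 11, 5), 300.0)],
}
today = date(2022, 4, 20)

def rfm(history):
    recency = min((today - d).days for d, _ in history)   # days since last purchase
    frequency = len(history)                               # number of purchases
    monetary = sum(amount for _, amount in history)        # total spend
    return recency, frequency, monetary

# Keep customers whose R, F and M all pass user-specified minimum thresholds.
segment = {c: rfm(h) for c, h in purchases.items()}
loyal = [c for c, (r, f, m) in segment.items() if r <= 30 and f >= 2 and m >= 50]
print(segment, loyal)
```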

Mining with Rarity for Web Intelligence

Mining with rarity is a meaningful way to take advantage of data mining for Web intelligence. In some scenarios, rare patterns are meaningful in intelligent data systems. Interesting pattern discovery plays an important role in real-world applications, and a great deal of work has been done in this field. In general, a high-utility pattern may include frequent items as well as rare items. Rare pattern discovery has gradually emerged and helps policy-makers devise related marketing strategies. However, the existing Apriori-like methods for discovering high-utility rare itemsets (HURIs) are not efficient. In this paper, we address the problem of mining with rarity and propose an efficient algorithm, named HURI-Miner, which uses a data structure called the revised utility-list to find HURIs in a transaction database. Furthermore, we utilize several powerful pruning strategies to prune the search space and reduce the computational cost. In the process of rare pattern mining, the HURIs are generated directly, without the generate-and-test method. Finally, a series of experimental results show that the proposed method has superior effectiveness and efficiency.

A Hand Over and Call Arrival Cellular Signals-based Traffic Density Estimation Method

The growing number of vehicles has put a lot of pressure on the transportation system, and Intelligent Transportation Systems (ITS) face the great challenge of traffic congestion. Traffic density captures the congestion of current traffic and explicitly reflects traffic status. With the development of communication technology, people use mobile stations (MSs) at any time and cellular signals are everywhere. Different from traditional traffic information estimation methods based on the global positioning system (GPS) and vehicle detectors (VD), this paper resorts to Cellular Floating Vehicle Data (CFVD) to estimate traffic density. In this paper, handover (HO) and call arrival (CA) cellular signals are essential for estimating traffic flow and traffic speed. In addition, a mixture probability density distribution generator is adopted to assist in estimating the probabilities of HO and CA events. Through accurate traffic flow and traffic speed estimations, precise traffic density is achieved. In the simulation experiments, the proposed method achieves estimation MAPEs of 11.92%, 13.97% and 16.47% for traffic flow, traffic speed and traffic density, respectively.
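
The final step rests on the fundamental traffic-flow relation density = flow / speed; the back-of-the-envelope sketch below shows only that relation (the calibration factors and counts are invented, and the paper's HO/CA estimators are far more elaborate):

```python
# Once flow and speed are estimated, density follows from density = flow / speed.

handovers_per_hour = 900          # HO events seen on a road segment (assumed)
vehicles_per_handover = 1.2       # calibration factor (assumed)
segment_length_km = 2.0
mean_travel_time_h = 0.04         # ~2.4 minutes between consecutive handovers (assumed)

flow = handovers_per_hour * vehicles_per_handover        # vehicles per hour
speed = segment_length_km / mean_travel_time_h           # km per hour
density = flow / speed                                   # vehicles per km

print(f"flow={flow:.0f} veh/h, speed={speed:.0f} km/h, density={density:.1f} veh/km")
```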

Fraship: A Framework to Support End-User Personalization of Smart Home Services with Runtime Knowledge Graph

With the continuous popularization of smart home devices, people often anticipate using different smart devices through natural language instructions and require personalized smart home services. However, existing challenges include the interoperability of smart devices and a comprehensive understanding of the user environment. This study proposes Fraship, a framework supporting smart home service personalization for end-users. It incorporates a runtime knowledge graph acting as a bridge between users’ language instructions and the corresponding operations of smart devices. The runtime knowledge graph is used to reflect contextual information in a specific smart home, based on which a language-instruction parser is proposed to allow users to manage smart home devices and services in natural language. We evaluated Fraship on a real-world smart home. Our results show that Fraship can effectively manage smart home devices and services based on the runtime knowledge graph, and it recognizes instructions more accurately than other approaches.

MI-GCN: Node Mutual Information-based Graph Convolutional Network

Graph Neural Networks (GNNs) have been widely used in various tasks for processing graphs and complex network data. However, recent studies show that GNNs cannot effectively exploit the structural topology and characteristics of the nodes in a graph, or may even fail to handle node information at all. This weakness may severely affect the quality of node embedding aggregation and delivery, and hence the ability of GNNs to classify nodes. In order to overcome this issue, we propose a novel node Mutual Information-based Graph Convolutional Network (MI-GCN) for semi-supervised node classification. First, we analyze the node information entropy that measures the importance of nodes in the complex network, and further define the node joint information entropy and node mutual information in the graph data. Then, we use node mutual information to strengthen the ability of GNNs to fuse node structural information. Extensive experiments demonstrate that our MI-GCN not only retains the advantages of the most advanced GNNs, but also improves the ability to fuse node structural information. MI-GCN can achieve superior performance on node classification compared to several baselines on real-world multi-type datasets, with both fixed and random data splits.

SESSION: Workshop: GraphLearning – The First International Workshop on Graph Learning

GraphLearning’22: 1st International Workshop on Graph Learning

The First Workshop on Graph Learning aims to bring together researchers and practitioners from academia and industry to discuss recent advances and core challenges of graph learning. This workshop will be established as a platform for multiple disciplines such as computer science, applied mathematics, physics, social sciences, data science, complex networks, and systems engineering. Core challenges in regard to theory, methodology, and applications of graph learning will be the main center of discussions at the workshop.

Structure-based Large-scale Dynamic Heterogeneous Graphs Processing: Applications, Challenges and Solutions

Wenjie Zhang gave a keynote talk at the First Workshop on Graph Learning, associated with The ACM Web Conference 2022, on Monday, 25 April 2022. This paper provides a summary of the topics she addressed during her talk.

Graphs in Computer Vision then and now: how Deep Learning has reinvigorated Structural Pattern Recognition

Computer vision problems, such as object detection, object tracking, action recognition and so on, have in the past usually been addressed through statistical pattern recognition techniques. SVMs, regression and neural networks are some examples of classical statistical techniques that have been used, quite effectively, in many application contexts of computer vision.

Nevertheless, some attempts have been made to use more complex data structures (notably graphs) for solving computer vision tasks. However, in terms of performance, their use did not have the same success as techniques based on vector representations. The first part of this talk will present some of these proposals, in the context of object tracking ([1]), people re-identification ([3]) and action recognition ([2]). A graph representation is proposed in [1] to deal with the occlusion problem. The representation is based on a graph pyramid, namely, each moving region is represented at different levels of resolution using a graph for each level. The algorithm compares the topmost levels of each pyramid in the association phase between moving objects in two consecutive frames. If the comparison outcome is sufficient to assign a label to each node, the tracking algorithm stops. If instead some ambiguities arise (as is the case when two objects overlap), the algorithm is repeated using the next levels of the pyramids until a consistent labelling is found. The purpose of re-identification (re-id) is to identify people coming back into the field of view of a camera or to recognize an individual across different cameras in a distributed network. At the heart of the process is a comparison between signatures from the probe and gallery sets. In [3] graphs are used to represent people’s appearance and comparison is done by means of Graph Kernels. Finally, action recognition is a classification problem in which each video representing an action has to be classified with the correct action label. In [2] we proposed representing videos as graph sequences, together with a model inspired by bag-of-words techniques to classify a sequence.

Recently, graphs have gained a lot of attention in the Computer Vision community thanks to the use of this kind of data within deep learning techniques. Graph Neural Networks have demonstrated their effectiveness in solving Computer Vision problems, and in some cases recent proposals have bridged the gap between statistical and structural pattern recognition. The second part of the talk will be devoted to illustrating some of these examples ([4, 5, 6]). Starting from the already mentioned applications in Computer Vision (object tracking, action recognition), we will discuss the new proposals based on Deep Learning with graphs and the open problems in this context.

Revisiting Neighborhood-based Link Prediction for Collaborative Filtering

Collaborative filtering (CF) is one of the most successful and fundamental techniques in recommendation systems. In recent years, Graph Neural Network (GNN)-based CF models, such as NGCF [31], LightGCN [10] and GTN [9] have achieved tremendous success and significantly advanced the state-of-the-art. While there is a rich literature of such works using advanced models for learning user and item representations separately, item recommendation is essentially a link prediction problem between users and items. Furthermore, while there have been early works employing link prediction for collaborative filtering [5, 6], this trend has largely given way to works focused on aggregating information from user and item nodes, rather than modeling links directly.

In this paper, we propose a new linkage (connectivity) score for bipartite graphs, generalizing multiple standard link prediction methods. We combine this new score with an iterative degree update process in the user-item interaction bipartite graph to exploit local graph structures without any node modeling. The result is a simple, non-deep learning model with only six learnable parameters. Despite its simplicity, we demonstrate our approach significantly outperforms existing state-of-the-art GNN-based CF approaches on four widely used benchmarks. In particular, on Amazon-Book, we demonstrate an over 60% improvement for both Recall and NDCG. We hope our work would invite the community to revisit the link prediction aspect of collaborative filtering, where significant performance gains could be achieved through aligning link prediction with item recommendations.
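
As background for the classical view the abstract revisits, the toy snippet below scores user-item links with a plain common-neighbour-style count of paths through shared items and co-purchasers in the bipartite interaction graph; it is not the paper's proposed linkage score or degree-update procedure, and the interaction data is invented:

```python
# Neighborhood-based link prediction on a user-item bipartite graph.
interactions = {  # hypothetical user -> purchased items
    "u1": {"i1", "i2"},
    "u2": {"i1", "i3"},
    "u3": {"i2", "i3", "i4"},
}

def item_users(item):
    return {u for u, items in interactions.items() if item in items}

def score(user, item):
    """Count length-3 paths user -> shared item -> co-user -> item."""
    return sum(len(interactions[user] & interactions[v])
               for v in item_users(item) if v != user)

candidates = {"i3", "i4"} - interactions["u1"]
print(sorted(candidates, key=lambda it: score("u1", it), reverse=True))
```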

MarkovGNN: Graph Neural Networks on Markov Diffusion

Most real-world networks contain well-defined community structures where nodes are densely connected internally within communities. To learn from these networks, we develop MarkovGNN that captures the formation and evolution of communities directly in different convolutional layers. Unlike most Graph Neural Networks (GNNs) that consider a static graph at every layer, MarkovGNN generates different stochastic matrices using a Markov process and then uses these community-capturing matrices in different layers. MarkovGNN is a general approach that could be used with most existing GNNs. We experimentally show that MarkovGNN outperforms other GNNs for clustering, node classification, and visualization tasks. The source code of MarkovGNN is publicly available at https://github.com/HipGraph/MarkovGNN.
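
To make the core construction concrete, here is a minimal sketch (not the MarkovGNN code) of turning an adjacency matrix into a row-stochastic transition matrix and taking its powers, which is the kind of diffusion matrix the abstract describes feeding into different convolutional layers; the small graph is invented:

```python
# Build a row-stochastic Markov transition matrix and take successive powers.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

P = A / A.sum(axis=1, keepdims=True)       # one step of the Markov process
for k in (1, 2, 3):
    M = np.linalg.matrix_power(P, k)       # k-step diffusion matrix
    print(f"k={k}\n{M.round(2)}")
```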

Multi-view Omics Translation with Multiplex Graph Neural Networks

The rapid development of high-throughput experimental technologies for biological sampling has made the collection of omics data (e.g., genomics, epigenomics, transcriptomics and metabolomics) possible at a small cost. While multi-view approaches to omics data have a long history, omics-to-omics translation is a relatively new strand of research with useful applications such as recovering missing or censored data and finding new correlations between samples. As the relations between omics can be non-linear and exhibit long-range dependencies between parts of the genome, deep neural networks can be an effective tool. Graph neural networks have been applied successfully in many different areas of research, especially in problems where annotated data is sparse, and have recently been extended to the heterogeneous graph case, allowing for the modelling of multiple kinds of similarities and entities. Here, we propose a meso-scale approach to construct multiplex graphs from multi-omics data, which can construct several graphs per omics and cross-omics graphs. We also propose a neural network architecture for omics-to-omics translation from these multiplex graphs, featuring a graph neural network encoder, coupled with an attention layer. We evaluate the approach on the open The Cancer Genome Atlas dataset (N=3023), showing that for MicroRNA expression prediction our approach has lower prediction error than regularized linear regression or modern generative adversarial networks.

Improving Bundles Recommendation Coverage in Sparse Product Graphs

In e-commerce, a group of similar or complementary products is recommended as a bundle based on the product category. Existing work on modeling bundle recommendations consists of graph-based approaches in which user-product interactions provide a more personalized experience. However, these approaches require rich user-product interactions and cannot be applied to cold-start scenarios. When a new product is launched, or for products with a limited purchase history, the lack of user-product interactions renders these algorithms inapplicable. Hence, no bundle recommendations will be provided to users for such product categories. These scenarios are frequent for retailers like Target, where much of the stock is seasonal and new brands are launched throughout the year. This work alleviates this problem by modeling product bundle recommendation as a supervised graph link prediction problem. A graph neural network (GNN) based product bundle recommendation system, BundlesSEAL, is presented. First, we build a graph using add-to-cart data and then use BundlesSEAL to predict the link representing a bundle relation between products represented as nodes. We also propose a heuristic to identify relevant pairs of products for efficient inference. Further, we apply BundlesSEAL to predict edge weights instead of just link existence. BundlesSEAL-based link prediction alleviates the above-mentioned cold-start problem, increasing the coverage of product bundle recommendations in various categories by 50% while achieving a 35% increase in revenue over a behavioral baseline. The model was also validated on the Amazon product metadata dataset.

Unsupervised Superpixel-Driven Parcel Segmentation of Remote Sensing Images Using Graph Convolutional Network

Accurate parcel segmentation of remote sensing images plays an important role in ensuring various downstream tasks. Traditionally, parcel segmentation is based on supervised learning using precise parcel-level ground truth information, which is difficult to obtain. In this paper, we propose an end-to-end unsupervised Graph Convolutional Network (GCN)-based framework for superpixel-driven parcel segmentation of remote sensing images. The key component is a novel graph-based superpixel aggregation model, which effectively learns superpixels’ latent affinities and better aggregates similar ones in spatial and spectral spaces. We construct a multi-temporal multi-location testing dataset using Sentinel-2 images and the ground truth annotations in four different regions. Extensive experiments are conducted to demonstrate the efficacy and robustness of our proposed model. The best performance is achieved by our model compared with the competing methods.

Deep Partial Multiplex Network Embedding

Network embedding is an effective technique to learn low-dimensional representations of nodes in networks. Real-world networks are usually multiplex, with multi-view representations derived from different relations. Recently, there has been increasing interest in network embedding on multiplex data. However, most existing multiplex approaches assume that the data is complete in all views. In real applications, it is often the case that each view suffers from missing data, resulting in partial multiplex data.

In this paper, we present a novel Deep Partial Multiplex Network Embedding approach to deal with incomplete data. In particular, the network embeddings are learned by simultaneously minimizing the deep reconstruction loss with the autoencoder neural network, enforcing the data consistency across views via common latent subspace learning, and preserving the data topological structure within the same network through graph Laplacian. We further prove the orthogonal invariant property of the learned embeddings and connect our approach with the binary embedding techniques. Experiments on four multiplex benchmarks demonstrate the superior performance of the proposed approach over several state-of-the-art methods on node classification, link prediction and clustering tasks.
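
The three ingredients of the objective described above can be written down compactly. The PyTorch sketch below is a minimal illustration under assumptions (layer sizes, loss weights alpha and beta, and the per-view masking scheme are ours, not the authors' implementation): a per-view autoencoder reconstruction term on observed rows, a consistency term tying each view's embedding to a shared latent matrix H, and a graph-Laplacian term preserving topology.

    import torch
    import torch.nn as nn

    class ViewAutoencoder(nn.Module):
        def __init__(self, dim_in, dim_z):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(), nn.Linear(128, dim_z))
            self.dec = nn.Sequential(nn.Linear(dim_z, 128), nn.ReLU(), nn.Linear(128, dim_in))

        def forward(self, x):
            z = self.enc(x)
            return z, self.dec(z)

    def partial_multiplex_loss(views, masks, aes, H, L, alpha=1.0, beta=0.1):
        """views: list of (n, d_v) tensors; masks: list of (n,) 0/1 tensors marking observed rows
        per view; aes: one autoencoder per view; H: (n, dim_z) shared embedding; L: (n, n) Laplacian."""
        loss = 0.0
        for x, m, ae in zip(views, masks, aes):
            z, x_hat = ae(x)
            m = m.unsqueeze(1)
            loss = loss + ((x_hat - x) ** 2 * m).mean()          # reconstruction on observed rows
            loss = loss + alpha * ((z - H) ** 2 * m).mean()      # consistency with shared subspace
        loss = loss + beta * torch.trace(H.T @ L @ H) / H.shape[0]  # topology preservation
        return loss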

Graph Augmentation Learning

Graph Augmentation Learning (GAL) provides outstanding solutions for graph learning in handling incomplete data, noisy data, etc. Numerous GAL methods have been proposed for graph-based applications such as social network analysis and traffic flow forecasting. However, the underlying reasons for the effectiveness of these GAL methods are still unclear; as a consequence, how to choose the optimal graph augmentation strategy for a given application scenario remains a black box, and there is a lack of systematic, comprehensive, and experimentally validated guidelines on GAL for scholars. Therefore, in this survey, we review GAL techniques in depth at the macro (graph), meso (subgraph), and micro (node/edge) levels, and illustrate in detail how GAL enhances data quality and model performance. The aggregation of augmentation strategies and graph learning models is also discussed across different application scenarios, i.e., data-specific, model-specific, and hybrid scenarios. To better demonstrate the benefits of GAL, we experimentally validate the effectiveness and adaptability of different GAL strategies in different downstream tasks. Finally, we share our insights on several open issues of GAL, including heterogeneity, spatio-temporal dynamics, scalability, and generalization.

Scaling R-GCN Training with Graph Summarization

Training Relational Graph Convolutional Networks (R-GCN) is a memory-intensive task. The amount of gradient information that needs to be stored during training for real-world graphs is often too large for the memory available on most GPUs. In this work, we experiment with graph summarization techniques to compress the graph and hence reduce the amount of memory needed. After training the R-GCN on the graph summary, we transfer the weights back to the original graph and perform inference on it. We obtain reasonable results on the AIFB, MUTAG and AM datasets. Our experiments show that training on the graph summary can yield accuracy comparable to or higher than training on the original graph. Furthermore, if we leave the time needed to compute the summary out of the equation, the smaller graph representations obtained with graph summarization methods reduce the computational overhead. However, further experiments are needed to evaluate additional graph summary models and to check whether our findings also hold for very large graphs.
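
To make the summarize-train-transfer idea concrete, the sketch below groups nodes that share the same outgoing (relation, neighbor-class) signature into super-nodes, which is one simple quotient-style summarizer; the actual summarization models evaluated in the paper may differ, so treat this purely as an assumed example. Because R-GCN parameters are per relation rather than per node, weights trained on the summary can be reused unchanged on the original graph.

    from collections import defaultdict

    def quotient_summary(triples, node_class):
        """triples: iterable of (subject, relation, object) edges; node_class: initial class per node."""
        signature = defaultdict(set)
        for s, r, o in triples:
            signature[s].add((r, node_class[o]))
        # nodes sharing an outgoing signature collapse into the same super-node
        groups = defaultdict(list)
        for node, sig in signature.items():
            groups[frozenset(sig)].append(node)
        super_of = {n: i for i, members in enumerate(groups.values()) for n in members}
        summary_triples = {(super_of[s], r, super_of[o])
                           for s, r, o in triples if s in super_of and o in super_of}
        return super_of, sorted(summary_triples)

    # The R-GCN is trained on summary_triples; its per-relation weight matrices are then
    # copied unchanged to the original graph for inference, as described in the abstract.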

RePS: Relation, Position and Structure aware Entity Alignment

Entity Alignment (EA) is the task of recognizing the same entity present in different knowledge bases. Recently, embedding-based EA techniques have established dominance, where alignment is done based on closeness in the latent space. Graph Neural Networks (GNN) gained popularity as the embedding module due to their ability to learn entities’ representations based on their local sub-graph structures. Although GNNs show promising results, few works have aimed to capture relations while considering their global importance and entities’ relative positions during EA. This paper presents Relation, Position and Structure aware Entity Alignment (RePS), a multi-faceted representation learning-based EA method that encodes local, global, and relation information for aligning entities. To capture relations and neighborhood structure, we propose a relation-based aggregation technique – Graph Relation Network (GRN) – that incorporates relation importance during aggregation. To capture the position of an entity, we propose the Relation-aware Position Aggregator (RPA), which encodes entities’ positions in a non-Euclidean space using training labels as anchors and thereby provides a global view of entities. Finally, we introduce Knowledge Aware Negative Sampling (KANS), which generates harder-to-distinguish negative samples for the model to learn optimal representations. We perform exhaustive experiments on four cross-lingual datasets and report an ablation study to demonstrate the effectiveness of GRN, KANS, and the position encodings.

CCGG: A Deep Autoregressive Model for Class-Conditional Graph Generation

Graph data structures are fundamental for studying connected entities. With an increase in the number of applications where data is represented as graphs, the problem of graph generation has recently become a hot topic. However, despite its significance, conditional graph generation that creates graphs with desired features is relatively less explored in previous studies. This paper addresses the problem of class-conditional graph generation that uses class labels as generation constraints by introducing the Class Conditioned Graph Generator (CCGG). We built CCGG by injecting the class information as an additional input into a graph generator model and including a classification loss in its total loss along with a gradient passing trick. Our experiments show that CCGG outperforms existing conditional graph generation methods on various datasets. It also manages to maintain the quality of the generated graphs in terms of distribution-based evaluation metrics.

JGCL: Joint Self-Supervised and Supervised Graph Contrastive Learning

Semi-supervised and self-supervised learning on graphs are two popular avenues for graph representation learning. We demonstrate that no single semi-supervised or self-supervised method works uniformly well across all settings of the node classification task. Self-supervised methods generally work well with very limited training data, but their performance can be further improved using the limited label information. We propose a joint self-supervised and supervised graph contrastive learning method (JGCL) to capture the mutual benefits of both learning strategies. JGCL utilizes both supervised and self-supervised data augmentation and a joint contrastive loss function. Our experiments demonstrate that JGCL and its variants are among the best performers across various proportions of labeled data when compared with state-of-the-art self-supervised, unsupervised, and semi-supervised methods on various benchmark graphs.
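
A minimal sketch of how such a joint contrastive objective can be assembled is shown below (PyTorch), with an InfoNCE term over two augmented views plus a supervised contrastive term over labeled nodes; the exact loss form, temperature, and weighting here are assumptions, not necessarily JGCL's.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, tau=0.5):
        # self-supervised term: same node in the other augmented view is the positive
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.T / tau
        targets = torch.arange(z1.size(0))
        return F.cross_entropy(logits, targets)

    def supervised_contrastive(z, y, tau=0.5):
        # supervised term: nodes sharing a label are treated as positives
        z = F.normalize(z, dim=1)
        sim = torch.exp(z @ z.T / tau)
        sim = sim - torch.diag(torch.diag(sim))            # drop self-similarity
        same = (y.unsqueeze(0) == y.unsqueeze(1)).float() - torch.eye(len(y))
        pos = (sim * same).sum(1)
        return -torch.log(pos / sim.sum(1) + 1e-12).mean()

    def joint_contrastive_loss(z1, z2, z_labeled, y_labeled, lam=0.5):
        return info_nce(z1, z2) + lam * supervised_contrastive(z_labeled, y_labeled)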

Surj: Ontological Learning for Fast, Accurate, and Robust Hierarchical Multi-label Classification

We consider multi-label classification in the context of complex hierarchical relationships organized into an ontology. These situations are ubiquitous in learning problems on the web and in science, where rich domain models are developed but labeled data is rare. Most existing solutions model the problem as a sequence of simpler problems: one classifier for each level in the hierarchy, or one classifier for each label. These approaches require more training data, which is often unavailable in practice: as the ontology grows in size and complexity, it becomes unlikely to find training examples for all expected combinations. In this paper, we learn offline representations of the ontology using a graph autoencoder and separately learn to classify input records, reducing dependence on training data: Since the relationships between labels are encoded independently of training data, the model can make predictions even for underrepresented labels, naturally generalize to DAG-structured ontologies, remain robust to low-data regimes, and, with minor offline retraining, tolerate evolving ontologies. We show empirically that our label predictions respect the hierarchy (predicting a descendant implies predicting its ancestors) and propose a method of evaluating hierarchy violations that properly ignores irrelevant violations. Our main result is that our model outperforms all state-of-the-art models on 17 of 20 datasets across multiple domains by a significant margin, even with limited training data.

A Triangle Framework among Subgraph Isomorphism, Pharmacophore and Structure-function Relationship

Coronavirus disease 2019 (COVID-19) has attracted intense attention from academic research and industrial practice because it continues to rage in many countries. Pharmacophore models exploit both topological similarity between molecules and functional similarity between compounds, and the concept of bioisosterism makes them reliable. In this work, we analyze the targets of coronavirus proteins and the structure of RNA virus variation, thereby completing the safety and pharmacodynamic evaluation of small-molecule anti-coronavirus oral drugs. Because chemical structures can be converted to graphs, common pharmacophore identification can be cast as a subgraph querying problem, which remains a hard problem pressing for a solution. We adopt simplified pharmacophore graphs, reducing complete molecular structures to abstract representations, in order to detect isomorphic topological patterns and improve substructure retrieval efficiency. Our three-stage subgraph-isomorphism-based method retrieves query subgraphs over large graphs. First, we extract a sequence of subgraphs to be matched and compare the numbers of vertices and edges between the potential isomorphic subgraphs and the query graph, which markedly lowers the computational cost. Next, we build a directed vertex-and-edge matrix that records the positional, directional, and distance relations of vertices and edges; based on the permutation theorem, we compute the row sums of the vertex and edge adjacency matrices of the query graph and each candidate. Finally, following the equinumerosity theorem, we check whether the eigenvalues of the vertex and edge adjacency matrices of the two graphs are equinumerous. The topological distance can then be calculated from the graph isomorphism, and subgraph isomorphism can be established after combining the subgraphs. The proposed quantitative structure–function relationship (QSFR) approach can be effectively applied to identify pharmacophoric abstract patterns. Based on this triangle, we establish a framework for the development of new drugs for COVID-19.
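
The cheap-to-expensive filtering cascade described above can be illustrated with the following sketch; it uses plain adjacency matrices and is only an approximation of the paper's vertex-and-edge matrices (candidate generation, the permutation/equinumerosity details, and the final exact matcher are simplified assumptions).

    import numpy as np
    import networkx as nx

    def passes_filters(query: nx.Graph, candidate: nx.Graph) -> bool:
        # cheapest checks first: vertex and edge counts
        if query.number_of_nodes() != candidate.number_of_nodes():
            return False
        if query.number_of_edges() != candidate.number_of_edges():
            return False
        a_q = nx.to_numpy_array(query)
        a_c = nx.to_numpy_array(candidate)
        # row sums of the adjacency matrices (degree sequences) must agree
        if sorted(a_q.sum(axis=1)) != sorted(a_c.sum(axis=1)):
            return False
        # eigenvalue spectra must agree (a necessary, not sufficient, condition)
        return bool(np.allclose(np.sort(np.linalg.eigvalsh(a_q)),
                                np.sort(np.linalg.eigvalsh(a_c))))

    # Only candidates passing all filters are handed to an exact matcher, e.g.
    # nx.algorithms.isomorphism.GraphMatcher(candidate, query).subgraph_is_isomorphic()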

Understanding Dropout for Graph Neural Networks

Graph neural networks (GNN) have demonstrated superior performance on graph learning tasks. A GNN captures data dependencies via message passing among neural network layers, so the prediction of a node label can utilize information from its neighbors in the graph. Dropout is a regularization and ensemble method for convolutional neural networks (CNN) that has been carefully studied; however, few existing works have focused on dropout schemes for GNNs. Although GNNs and CNNs share a similar model architecture, both with convolutional and fully connected layers, their input data structures and convolution operations differ, which suggests that dropout schemes for CNNs should not be applied directly to GNNs without a good understanding of their impact. In this paper, we divide the existing dropout schemes for GNNs into two categories: (1) dropout on feature maps and (2) dropout on the graph structure. Based on the drawbacks of current GNN dropout models, we propose a novel layer compensation dropout and a novel adaptive heteroscedastic Gaussian dropout, which can be applied to any type of GNN model and outperform their corresponding baselines in shallow GNNs. An experimental study then shows that Bernoulli dropout generalizes better, while Gaussian dropout is slightly stronger in transductive performance. Finally, we theoretically study how different dropout schemes mitigate the over-smoothing problem; experimental results show that layer compensation dropout allows a GNN model to maintain or slightly improve its performance as more layers are added, while all the other dropout models suffer from performance degradation when the GNN goes deep.
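
For reference, the two existing dropout families that the paper categorizes can be written in a few lines (PyTorch assumed; the proposed layer compensation and adaptive heteroscedastic Gaussian variants are not reproduced here, since their exact forms are not given in this abstract).

    import torch
    import torch.nn.functional as F

    def feature_dropout(x, p=0.5, training=True):
        # category (1): standard Bernoulli dropout applied to node feature maps
        return F.dropout(x, p=p, training=training)

    def edge_dropout(edge_index, p=0.5, training=True):
        # category (2): structural dropout that removes a random fraction p of edges
        # (edge_index follows the common 2 x num_edges convention)
        if not training:
            return edge_index
        keep = torch.rand(edge_index.size(1)) >= p
        return edge_index[:, keep]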

Mining Homophilic Groups of Users using Edge Attributed Node Embedding from Enterprise Social Networks

We develop a method to identify groups of similarly behaving users with similar work contexts from their activity on enterprise social media. This would allow organizations to discover redundancies and increase efficiency. To better capture the network structure and communication characteristics, we model user communications with directed attributed edges in a graph. Communication parameters including engagement frequency, emotion words, and post lengths act as edge weights of the multiedge. Upon the resultant adjacency tensor, we develop a node embedding algorithm using higher order singular value tensor decomposition and convolutional autoencoder. We develop a peer group identification algorithm using the cluster labels obtained from the node embedding and show its results on Enron emails and StackExchange Workplace community. We observe that people of the same roles in enterprise social media are clustered together by our method. We provide a comparison with existing node embedding algorithms as a reference indicating that attributed social networks and our formulations are an efficient and scalable way to identify peer groups in an enterprise social network that aids in professional social matching.
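
A minimal numpy sketch of the higher-order SVD step is given below; the tensor layout (users x users x edge attributes), the chosen ranks, and the way factors are combined into node embeddings are illustrative assumptions, and the convolutional autoencoder and clustering stages are omitted.

    import numpy as np

    def unfold(tensor, mode):
        # mode-n unfolding: bring the chosen axis to the front and flatten the rest
        return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

    def hosvd_factors(tensor, ranks):
        # one factor matrix per mode: leading left singular vectors of each unfolding
        return [np.linalg.svd(unfold(tensor, m), full_matrices=False)[0][:, :r]
                for m, r in enumerate(ranks)]

    # adjacency tensor: entry [i, j, k] = value of edge attribute k (engagement frequency,
    # emotion words, post length, ...) on the directed multiedge from user i to user j
    A = np.random.rand(50, 50, 3)
    U_src, U_dst, U_attr = hosvd_factors(A, ranks=(16, 16, 3))
    node_embeddings = np.hstack([U_src, U_dst])   # simple per-user embedding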

Mining Multivariate Implicit Relationships in Academic Networks

Multivariate cooperative relations exist widely in academic society. In-depth research on multivariate relationships can effectively promote the integration of disciplines and advance scientific and technological progress. The mining and analysis of advisor-advisee relationships in cooperative networks, a hot research issue in sociology and other disciplines, still faces challenges such as the lack of universal models and the difficulty of identifying multivariate relationships. Traditional advisor-advisee relationship mining methods focus only on the binary relationship and require secondary processing of node and edge attributes. Therefore, based on the attributes of nodes, edges, and the network, we transferred the Capsule Network to multivariate relation analysis. The experimental results demonstrate the simplicity and reliability of this model. We also studied the effects of the network feature vectors’ dimension, the number of routing iterations, and normalization on the performance of the Capsule Network. Considering that the Capsule Network takes a long time to train, we adopted the Warm Restarts method to speed up the training process. In addition, we used the model to generate a large-scale multivariate academic genealogy.
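
The warm-restart speed-up mentioned above is commonly realized as cosine annealing with warm restarts; the PyTorch snippet below shows that schedule on a stand-in model, with restart periods chosen for illustration only (the authors' exact schedule is not specified in the abstract).

    import torch

    model = torch.nn.Linear(128, 8)                      # stand-in for the capsule model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2, eta_min=1e-5)       # restart after 10, then 20, 40, ... epochs

    for epoch in range(70):
        # ... forward/backward passes over batches would go here ...
        optimizer.step()
        scheduler.step()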

SchemaWalk: Schema Aware Random Walks for Heterogeneous Graph Embedding

Heterogeneous Information Network (HIN) embedding has been a prevalent approach to learning representations of semantically rich heterogeneous networks. Most HIN embedding methods exploit meta-paths to retain high-order structures, yet their performance is conditioned on the quality of the (generated or manually defined) meta-paths and their suitability for the specific label set. Other methods adjust random walks to harness or skip certain heterogeneous structures (e.g., node types), but in doing so the adjusted random walker may casually omit other node or edge types. Our key insight is that, with no domain knowledge, the random walker should hold no assumptions about the heterogeneous structure (i.e., edge types). Thus, aiming for a flexible and general method, we utilize the network schema as a unique blueprint of the HIN and propose SchemaWalk, a random walk that uniformly samples all edge types within the network schema. Moreover, we identify the starvation phenomenon, which induces random walkers on HINs to under- or over-sample certain edge types. Accordingly, we design SchemaWalkHO, which skips locally deficient connectivity to preserve a uniform sampling distribution. Finally, we carry out node classification experiments on four real-world HINs and provide an in-depth qualitative analysis. The results highlight the robustness of our method, regardless of the graph structure, in contrast with the state-of-the-art baselines.
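
The core sampling idea can be sketched as a two-stage choice at every step: first an edge type, then a neighbor within that type, so no edge type is over-sampled merely because it is more frequent. The toy Python below is an assumed illustration (the adjacency layout, the example HIN, and the handling of dead ends are ours), not the SchemaWalk/SchemaWalkHO implementation.

    import random
    from collections import defaultdict

    def schema_walk(adj, start, length):
        """adj[node][edge_type] -> list of neighbors reachable via that edge type."""
        walk, node = [start], start
        for _ in range(length):
            types = [t for t, nbrs in adj[node].items() if nbrs]
            if not types:
                break
            t = random.choice(types)                # uniform over available edge types
            node = random.choice(adj[node][t])      # uniform within the chosen type
            walk.append(node)
        return walk

    # toy HIN: authors (a*), papers (p*), venues (v*)
    adj = defaultdict(dict)
    adj["a1"] = {"writes": ["p1", "p2"]}
    adj["p1"] = {"written_by": ["a1"], "published_in": ["v1"]}
    adj["p2"] = {"written_by": ["a1"], "published_in": ["v1"]}
    adj["v1"] = {"publishes": ["p1", "p2"]}
    print(schema_walk(adj, "a1", length=6))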

Multi-Graph based Multi-Scenario Recommendation in Large-scale Online Video Services

Recently, industrial recommendation services have been boosted by the continual upgrade of deep learning methods. However, they still face de-biasing challenges such as exposure bias and the cold-start problem, where repeatedly training machine learning models on human interaction history leads algorithms to keep suggesting already-exposed items while ignoring less active ones. Additional problems exist in multi-scenario platforms, e.g., appropriate data fusion from subsidiary scenarios, which we observe can be alleviated through graph-structured data integration via message passing.

In this paper, we present a multi-graph structured multi-scenario recommendation solution, which encapsulates interaction data across scenarios in a multi-graph and obtains representations via graph learning. Extensive offline and online experiments on real-world datasets are conducted, where the proposed method demonstrates increases of 0.63% and 0.71% in CTR and Video Views per capita on new users over the deployed set of baselines, and outperforms the regular method by increasing the number of outer-scenario videos by 25% and video watches by 116%, validating its superiority in activating cold videos and enriching target recommendations.

SESSION: Workshop: UserNLP – International Workshop on User-centered Natural Language Processing

UserNLP’22: 2022 International Workshop on User-centered Natural Language Processing

We report the goals, paper submissions, keynotes, and organization of the UserNLP workshop. While traditional NLP tasks tend to focus on single documents (e.g., sentiment analysis), user-centered NLP aims to make inferences for individual users on the basis of one or more documents associated with each user, explicitly considering stylistic variations across individuals or groups of individuals and focusing on user-level modeling tasks. This workshop aims to create a platform where researchers can present rising challenges in building user-centered NLP models and discuss shared issues across multidisciplinary fields. We received 11 submissions and accepted 6 of them, reviewed by our 19 program committee members. The program included four invited keynote talks from both academia and industry. We appreciate the valuable contributions of the organizing committee, program committee, keynote speakers, and manuscript authors.

Personalization and Relevance in NLG

Despite the recent advances in language modeling techniques, personalization remains a challenge for many NLP tasks. In this talk, we will explore personalization through several different lenses to understand how we can make progress on this front, and emphasize why a human-centered approach is a crucial part of the solution. First, I will challenge the ground-truth assumption in the context of user- or situation-sensitive language tasks: I will argue that the same question might be addressed differently by a system depending on the user or the situation they are currently facing. We will then discuss what our users’ needs are, how we can design these tasks to produce useful and relevant responses, and what potential harms we should be aware of when working on personalization [5]. Next, we will look into personalization in the augmentative and alternative communication (AAC) world; specifically, how, through an icon-based language, we can accommodate the needs of individuals with compromised language abilities (which may arise due to Traumatic Brain Injury (TBI) or Cerebral Palsy (CP)), and what the challenges are in developing icon-based language models [3]. Third, in the process of personalization, models are expected to accommodate and adapt to the specific language and jargon spoken by the user; I will discuss the remaining challenges for deep learning architectures in adapting to user data or new domains [2, 4]. Finally, I will share work in progress in which, through Wizard-of-Oz (WoZ) experiments [1], we identify and learn useful actions of social conversational systems in a classroom setting.

Stylistic Control for Neural Natural Language Generation

With the rise of conversational assistants, it has become more critical for dialog systems to keep users engaged by responding in a natural, interesting, and often personalized way, even in a task-oriented setting. Recent work has thus focused on stylistic control for natural language generation (NLG) systems in order to jointly control response semantics and style. In this talk, I will describe our work on automatic data curation and modeling approaches to facilitate style control for both personality-specific attributes of style (based on Big-Five personality traits), and other style attributes that are helpful for personalization, e.g., response length, descriptiveness, point-of-view, and sentiment. I will present work that incorporates these attributes into the training and generation pipelines for different NLG architectures, and will show how our data curation and modeling approaches are generalizable to new domains and style choices. Finally, I will describe how we use a combination of automatic and human evaluation methods to measure how well models successfully hit multiple style targets without sacrificing semantics.

Concept Annotation from Users Perspective: A New Challenge

Text data is highly unstructured and can often be viewed as a complex representation of different concepts, entities, events, sentiments, etc. For a wide variety of computational tasks, it is thus very important to annotate text data with the associated concepts / entities, which can put some initial structure / index on raw text data. However, it is not feasible to manually annotate a large amount of text, raising the need for automatic text annotation.

In this paper, we focus on concept annotation in text data from the perspective of real-world users. Concept annotation is not a trivial task, and its utility often relies heavily on the preferences of the user. Despite significant progress in natural language processing research, we still lack a general-purpose concept annotation tool that can effectively serve users from a wide range of application domains. Thus, further investigation is needed from a user-centric point of view to design an automated concept annotation tool that will ensure maximum utility to its users. To achieve this goal, we created a benchmark corpus of two real-world data-sets, i.e., the “News Concept Data-set” and the “Medical Concept Data-set”, to introduce the notion of user-oriented concept annotation and provide a way to evaluate this task. The term “user-centric” means that the desired concepts are defined as well as characterized by the users themselves. Throughout the paper, we describe how we created the data-sets, what the unique characteristics of each data-set are, how these data-sets reflect real users’ perspectives on the concept annotation task, and finally, how they can serve as a valuable resource for future research on user-centric concept annotation.

Detecting Addiction, Anxiety, and Depression by Users Psychometric Profiles

Detecting and characterizing people with mental disorders is an important task that could support the work of different healthcare professionals. A diagnosis for specific mental disorders can sometimes take a long time, and such delays are problematic because being diagnosed gives access to support groups, treatment programs, and medications that might help the patients. In this paper, we study the problem of exploiting supervised learning approaches, based on users’ psychometric profiles extracted from Reddit posts, to detect users dealing with Addiction, Anxiety, and Depression disorders. The empirical evaluation shows an excellent predictive power of the psychometric profile and that features capturing the post’s content are more effective for the classification task than features describing the user’s writing style. We achieve an accuracy of 96% using the entire psychometric profile and an accuracy of 95% when we exclude the linguistic features from the user profile.

Expressing Metaphorically, Writing Creatively: Metaphor Identification for Creativity Assessment in Writing

Metaphor, which can implicitly express profound meanings and emotions, is a unique writing technique frequently used in human language. In writing, meaningful metaphorical expressions can enhance the literariness and creativity of texts. Therefore, the usage of metaphor is a significant factor when assessing the creativity and literariness of writing. However, few, if any, automatic writing assessment systems consider metaphorical expressions when scoring creativity. To improve the accuracy of automatic writing assessment, this paper proposes a novel creativity assessment model that incorporates a token-level metaphor identification method to extract metaphors as indicators for creativity scoring. The experimental results show that our model can accurately assess the creativity of different texts with precise metaphor identification. To the best of our knowledge, we are the first to apply automatic metaphor identification to assess writing creativity. Moreover, identifying features (e.g., metaphors) that influence writing creativity using computational approaches can offer fair and reliable assessment methods for educational settings.

A Decision Model for Designing NLP Applications

Among applications of NLP models, some provide multiple output options while others offer only a single result to end users. However, there is little research on the situations in which providing multiple outputs from NLP models benefits the user experience. Therefore, in this position paper, we summarize the progress of NLP applications that show parallel outputs from an NLP model to users at once. We then present a decision model that can assist in deciding whether a given situation is suitable for showing multiple outputs from an NLP model at once. We hope developers and UX designers can consult the decision model and create easy-to-use interfaces that present numerous results from an NLP model at once. Moreover, we hope future researchers can reference the decision model in this paper to explore the potential of other NLP model usages that show parallel outputs at once to create a more satisfactory user experience.

Do Not Read the Same News! Enhancing Diversity and Personalization of News Recommendation

Personalized news recommendation is a widely studied area. As the production of news articles increases and topics diversify, it is impractical for users to read all the available articles. Therefore, the purpose of a news recommendation system should be to provide relevant news based on the user’s interest. Unlike other recommendation systems, explicit feedback from users on each item, such as ratings, is rarely available in news recommendation; most systems instead use implicit feedback such as click histories to profile user interest, which biases recommendations towards generally popular articles. In this paper, we suggest a novel news recommendation model for more personalized recommendations. If a user reads news not widely clicked by others, that news reflects the user’s personal interest rather than general popularity. We implement two user encoders, one to encode the general interest of the set of users and another to encode the user’s individual interest, and we propose regularization methods that induce the two encoders to encode different types of user interest. Experiments on real-world data show that our proposed method improves the diversity and quality of recommendations for different click histories without any significant performance drop.

SESSION: Workshop: WebAndTheCity – 8th International Workshop on Web and Smart Cities

WebAndTheCity'22: 8th International Workshop on The Web and Smart Cities

This is the 8th edition of the workshop series labeled “AW4City – Web Applications and Smart Cities”, which started in Florence in 2015 and has taken place every year in conjunction with the WWW conference series; last year the workshop was held virtually in Ljubljana, Slovenia. The workshop series aims to investigate the role of the Web and of Web applications in delivering on smart city (SC) promises and supporting SC growth. This year, the workshop focuses on the role of the Web in smart environments. In the era of cities, and under the UN 2030 Agenda and the European Green Deal for sustainable growth, cities appear to play a crucial role in securing humanity against environmental threats and in generating sustainable and circular cities. In this regard, cities attempt to improve their form (e.g., becoming more compact and eco-friendlier) and performance in order to become friendlier and able to host their increasing populations. Additionally, new types of business appear (e.g., businesses that utilize IoT and data, manage e-waste, and recycle), while the co-existence of autonomous things and people generates another challenge that cities have started facing. This workshop aims to demonstrate how Web applications, apps, and Web intelligence can serve smart environments in general.

Human Centric Design in Smartcity Technologies: Implications for the Governance, Control and Performance Evaluation of Mobility Ecosystems

Governance can be understood as the system by which actors in society are directed and controlled. Given the trinity “Institution, Market and Organization”, a pressing question is: which governance structure minimizes the transaction costs of governing and controlling the build and service design of organizations similar to a city? We investigate the notion of the governance of common goods and the problem of organizations – how and when to balance control mechanisms. As Tirole observes, there is a need to explicate incentives for all stakeholders on the basis of some measure of their aggregate welfare. In this paper we develop a mathematical model to explicate the information needs in a bilateral contract and use these insights in a case study initiated and inspired by the procurement process of transport services at a care institution in the Netherlands in 2020 and 2021. Our results show that the information problem emerges when the object of what is exchanged between two parties is not considered as the unit of analysis. Once we understand the nature of the bilateral exchange relationship, we are able to consider the consequences of the control loss that causes transaction costs due to conflicting objectives, moral hazard, adverse selection, opportunism, and so on.

Citizens as Developers and Consumers of Smart City Services: A Drone Tour Guide Case

The trend of urbanization started over two centuries ago and is no longer limited to high-income countries. As a result, city population growth has led to the emergence of applications that manage complex processes within cities by utilizing recent technological advances, thereby transforming them into smart cities. Besides automating complex processes within a city, technology also enables a simplified integration of citizens into identifying problems and creating corresponding solutions. This paper discusses an approach that enables citizens to design and later execute their own services within a smart city environment by employing conceptual modeling and microservices. The overall aim is to establish the role of the citizen developer. The proposed approach is then discussed within our proof-of-concept environment, based on a drone tour guide case.

A Human-Centered Design Approach for the Development of a Digital Care Platform in a Smart City Environment: Implications for Business Models

Digital solutions are increasingly being sought in the care sector to make services more efficient and to prepare for demographic change and the coming shortage of care services and staff. One possibility is the implementation of a digital care platform that is target-group-oriented and built according to the needs of the individual stakeholders. To build such a platform successfully, it is also necessary to take a closer look at the business model. This paper examines these points by applying a human-centered design approach that considers all perspectives and allows a deep understanding of the opportunities and challenges of a digital care platform in a smart city environment. A digital care platform was found to be promising for the various stakeholders involved due to its value proposition. Consequently, the stakeholders benefit from, e.g., the simplification of processes and data management, less bureaucratic effort, and intensified care services for the elderly.

Sensor Network Design for Uniquely Identifying Sources of Contamination in Water Distribution Networks

Sensors are being extensively adopted in smart cities to monitor various parameters, so that any anomalous behaviours manifesting in the deployment area can be easily detected. Sensors in a deployment area have two functions, sensing/coverage and communication, with this paper focusing on the former. Over the years, several coverage models have been proposed that utilize a Set Cover based problem formulation. Unfortunately, this formulation has a drawback: it lacks the capability to uniquely identify the location where anomalous behavior is sensed. This limitation can be overcome through the utilization of Identifying Codes. The optimal solution of the Identifying Code problem provides the minimum number of sensors needed to uniquely identify the location where anomalous behavior is sensed. In this paper, we introduce a novel budget-constrained version of the problem, whose goal is to find the largest number of locations that can be uniquely identified with the sensors that can be deployed within the specified budget. We provide an Integer Linear Programming formulation and a Maximum Set-Group Cover (MSGC) formulation for the problem and prove that the MSGC problem cannot have a polynomial-time approximation algorithm with a 1/k factor performance guarantee unless P = NP.
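
For readers unfamiliar with Identifying Codes, the classic (non-budgeted) ILP can be stated in a few lines: pick the fewest sensors such that every location is covered and no two locations have the same sensor signature. The PuLP sketch below uses a toy coverage map and is only the standard textbook formulation, not necessarily the exact model or the budget-constrained variant studied in the paper.

    import itertools
    import pulp

    locations = ["A", "B", "C", "D"]
    # each candidate sensor senses a set of locations (toy instance)
    cover = {"s1": {"A", "B"}, "s2": {"B", "C"}, "s3": {"C", "D"}, "s4": {"A", "D"}}

    x = {s: pulp.LpVariable(f"x_{s}", cat="Binary") for s in cover}
    prob = pulp.LpProblem("identifying_code", pulp.LpMinimize)
    prob += pulp.lpSum(x.values())                                    # minimize sensor count

    for v in locations:                                               # every location is covered
        prob += pulp.lpSum(x[s] for s in cover if v in cover[s]) >= 1
    for u, v in itertools.combinations(locations, 2):                 # signatures are unique
        prob += pulp.lpSum(x[s] for s in cover if (v in cover[s]) != (u in cover[s])) >= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    chosen = [s for s in cover if x[s].value() == 1]
    print(chosen)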

Enhancing Crowd Flow Prediction in Various Spatial and Temporal Granularities

The diffusion of the Internet of Things nowadays allows human mobility to be sensed in great detail, fostering human mobility studies and their applications in various contexts, from traffic management to public security and computational epidemiology. A mobility task that is becoming prominent is crowd flow prediction, i.e., forecasting aggregated incoming and outgoing flows in the locations of a geographic region. Although several deep learning approaches have been proposed to solve this problem, their usage is limited to specific types of spatial tessellations and they cannot provide sufficient explanations of their predictions. We propose CrowdNet, a solution to crowd flow prediction based on graph convolutional networks. Compared with state-of-the-art solutions, CrowdNet can be used with regions of irregular shape and provides meaningful explanations of the predicted crowd flows. We conduct experiments on public data, varying the spatio-temporal granularity of crowd flows, to show the superiority of our model with respect to existing methods, and we investigate CrowdNet’s robustness to missing or noisy input data. Our model is a step forward in the design of reliable deep learning models to predict and explain human displacements in urban environments.

A Framework to Enhance Smart Citizen Science in Coastal Areas

Quality of life in a city can be affected by the way citizens interact with the city. Under the smart city concept, citizens act as human sensors, reporting natural hazards, generating real-time data, and enhancing awareness of environmental issues. This crowdsourced knowledge supports the city’s sustainability and tourism. Specifically, smart seaside cities can fully utilize citizen science data to improve the efficiency of city services, such as smart tourism, smart transportation, etc. Environmental assistance and awareness is a beach-monitoring issue that could be enhanced through crowdsourced knowledge, especially for coastal areas that are part of the Natura 2000 network and are characterized as Blue Flag seas, where it is important to identify and map citizens’ knowledge. To facilitate this, we introduce a novel framework aimed at: i) utilizing biodiversity data from open source platforms and organizational observations, ii) collecting the knowledge generated by citizens, iii) enhancing citizens’ awareness, and iv) reporting environmental issues in the city’s coastal areas. The proposed framework exploits these aspects and, through the creation of a novel knowledge platform, aims to provide geospatial, collective-awareness applications as an output to support the sustainability of smart coastal spaces.

Multi-Tenancy in Smart City Platforms

Multi-tenancy emerged as a software architecture pattern in an effort to optimize the use of compute resources and minimize the operational cost of large scale deployments. Its applicability, however, needs to take into account each particular context as the challenges of this software architecture pattern may not make it an optimal choice in every situation. A Smart City Platform is by definition a type of software that is also expected to be deployed at a large scale. The applicability of the multi-tenancy architecture pattern in this context is debatable as the benefits it brings may not outweigh the challenges.

SESSION: Workshop: Wiki – 9th International Wiki Workshop

WIKI’2022: 9th Annual Wiki Workshop

Rows from Many Sources: Enriching row completions from Wikidata with a pre-trained Language Model

Row completion is the task of augmenting a given table of text and numbers with additional, relevant rows. The task divides into two steps: subject suggestion, the task of populating the main column; and gap filling, the task of populating the remaining columns. We present state-of-the-art results for subject suggestion and gap filling measured on a standard benchmark (WikiTables).

Our idea is to solve this task by harmoniously combining knowledge base table interpretation and free text generation. We interpret the table using the knowledge base to suggest new rows and generate metadata like headers through property linking. To improve candidate diversity, we synthesize additional rows using free text generation via GPT-3, and crucially, we exploit the metadata we interpret to produce better prompts for text generation. Finally, we verify that the additional synthesized content can be linked to the knowledge base or a trusted web source such as Wikipedia.

Are democratic User Groups More Inclusive?

User groups form an important part of the Wikimedia movement and ecosystem. They gather a large number of free-knowledge enthusiasts, both online and offline, who share a geographical location, a thematic interest, or both. At their creation, these structures and their volunteer founders did not receive any formal introduction or guidelines on ways of working. Governance of these groups therefore lies in the hands of their founders, who can have different methods and approaches depending on their background and vision. In the past, several conflicts related to user group management have occurred, in which words such as “transparency” and “democracy” were mentioned as solutions to the current gaps and lack of inclusivity in the groups. The current paper investigates the veracity of this claim by detailing and unpacking the concept of democracy, specifically in digital online communities such as Wikimedia volunteer user groups. Since the question remains open, the paper's contribution is a set of elements to enrich the discussion, highlighting the concerns raised in different communities.

A Map of Science in Wikipedia

In recent decades, the rapid growth of Internet adoption has offered opportunities for convenient and inexpensive access to scientific information. Wikipedia, one of the largest encyclopedias worldwide, has become a reference in this respect and has attracted widespread attention from scholars. However, a clear understanding of the scientific sources underpinning Wikipedia’s contents remains elusive. In this work, we rely on an open dataset of citations from Wikipedia to map the relationship between Wikipedia articles and scientific journal articles. We find that most journal articles cited from Wikipedia belong to STEM fields, in particular biology and medicine (47.6% of citations; 46.1% of cited articles). Furthermore, Wikipedia’s biographies play an important role in connecting STEM fields with the humanities, especially history. These results contribute to our understanding of Wikipedia’s reliance on scientific sources and its role as a knowledge broker to the public.

Improving Linguistic Bias Detection in Wikipedia using Cross-Domain Adaptive Pre-Training

Wikipedia is a collective intelligence platform that helps contributors to collaborate efficiently for creating and disseminating knowledge and content. A key guiding principle of Wikipedia is to maintain a neutral point of view (NPOV), which can be challenging for new contributors and experienced editors alike. Hence, several previous studies have proposed automated systems to detect biased statements on Wikipedia with mixed results. In this paper, we investigate the potential of cross-domain pre-training to learn bias features from multiple sources, including Wikipedia, news articles, and ideological statements from political figures in an effort to learn richer cross-domain indicators of bias that may be missed by existing methods. Concretely, we study the effectiveness of bias detection via cross-domain pre-training of deep transformer models. We find that the cross-domain bias classifier with continually pre-trained RoBERTa model achieves a precision of 89% with an F1 score of 87%, and can detect subtle forms of bias with higher accuracy than existing methods.
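
A hedged sketch of the two-step recipe (domain-adaptive masked-language-model pre-training followed by classifier fine-tuning) using the Hugging Face transformers and datasets libraries is shown below; the toy corpora, label conventions, paths, and hyperparameters are placeholders and illustrative assumptions, not the authors' setup.

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              AutoModelForSequenceClassification,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    tok = lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

    # toy stand-ins for the cross-domain corpora (news, ideological statements) and labeled data
    domain_corpus = Dataset.from_dict({"text": ["Critics slammed the disastrous policy.",
                                                "The bill was passed by a narrow margin."]}).map(tok, batched=True)
    npov_data = Dataset.from_dict({"text": ["The so-called expert testified.",
                                            "The committee published its report."],
                                   "label": [1, 0]}).map(tok, batched=True)

    # Step 1: continued (domain-adaptive) masked-language-model pre-training
    mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")
    Trainer(model=mlm_model,
            args=TrainingArguments(output_dir="roberta-bias-dapt", num_train_epochs=1),
            train_dataset=domain_corpus,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
            ).train()
    mlm_model.save_pretrained("roberta-bias-dapt")

    # Step 2: fine-tune the adapted encoder as a biased/neutral statement classifier
    clf = AutoModelForSequenceClassification.from_pretrained("roberta-bias-dapt", num_labels=2)
    Trainer(model=clf,
            args=TrainingArguments(output_dir="roberta-bias-clf", num_train_epochs=3),
            train_dataset=npov_data,
            ).train()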

Anchor Prediction: A Topic Modeling Approach

Networks of documents connected by hyperlinks, such as Wikipedia, are ubiquitous. Hyperlinks are inserted by authors to enrich the text and facilitate navigation through the network. However, authors tend to insert only a fraction of the relevant hyperlinks, mainly because this is a time-consuming task. In this paper we address an annotation task which we refer to as anchor prediction. Even though it is conceptually close to link prediction or entity linking, it is a different task that requires a specific method to solve it. Given a source document and a target document, the task consists in automatically identifying anchors in the source document, i.e., words or terms that should carry a hyperlink pointing towards the target document. We propose a contextualized relational topic model, CRTM, that models directed links between documents as a function of the local context of the anchor in the source document and the whole content of the target document. The model can be used to predict anchors in a source document, given the target document, without relying on a dictionary of previously seen mentions or titles, nor on any external knowledge graph. Authors can benefit from CRTM by letting it automatically suggest hyperlinks, given a new document and the set of target documents to connect to. It can also benefit readers, by dynamically inserting hyperlinks between the documents they are reading. Experiments conducted on several Wikipedia corpora (in English, Italian and German) highlight the practical usefulness of anchor prediction and demonstrate the relevance of our approach.

The Gender Perspective in Wikipedia: A Content and Participation Challenge

Wikipedia is one of the most widely used information sources in the world. Although one of the guiding pillars of this digital platform is ensuring access to the diversity of human knowledge from a neutral point of view, there is a clear and persistent gender bias in terms of content about or contributed by women. The challenge is to include women as equal partners in the public sphere, in which Wikipedia is developing a central role as the most used educational resource among students, professionals, and many other profiles. In this paper, we introduce the gender perspective in the analysis of the gender gap in the content and participation of women in Wikipedia. While most studies focus on one of the two dimensions in which the gender gap has been observed, we review both approaches to provide an overview of the available evidence. Firstly we introduce how the gender gap is framed by the Wikimedia Movement strategy, then we evaluate the gender gap on content and participation, especially regarding editor practices. Finally, we provide some insights to broaden the discussion about the consequences of not addressing the gender gap in Wikipedia, and we provide some research topics that can support the generation of recommendations and guidelines for a community that needs both equity and diversity.

Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions

“Wiki rabbit holes” are informally defined as navigation paths followed by Wikipedia readers that lead them to long explorations, sometimes involving unexpected articles. Although wiki rabbit holes are a popular concept in Internet culture, our current understanding of their dynamics is based on anecdotal reports only. To bridge this gap, this paper provides a large-scale quantitative characterization of the navigation traces of readers who fell into a wiki rabbit hole. First, we represent user sessions as navigation trees and operationalize the concept of wiki rabbit holes based on the depth of these trees. Then, we characterize rabbit hole sessions in terms of structural patterns, time properties, and topical exploration. We find that article layout influences the structure of rabbit hole sessions and that the fraction of rabbit hole sessions is higher during the night. Moreover, readers are more likely to fall into a rabbit hole starting from articles about entertainment, sports, politics, and history. Finally, we observe that, on average, readers tend to stay focused on one topic by remaining in the semantic neighborhood of the first articles even during rabbit hole sessions. These findings contribute to our understanding of Wikipedia readers’ information needs and user behavior on the Web.
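
The depth-based operationalization can be made concrete with a small sketch: represent each session's (article, referrer) pairs as a forest and flag the session when any tree exceeds a depth threshold. The threshold, data layout, and example below are illustrative assumptions rather than the paper's exact definitions.

    from collections import defaultdict

    def tree_depth(children, root):
        # iterative depth-first traversal returning the maximum depth of the tree
        stack, depth = [(root, 1)], 1
        while stack:
            node, d = stack.pop()
            depth = max(depth, d)
            stack.extend((c, d + 1) for c in children.get(node, []))
        return depth

    def is_rabbit_hole(session_clicks, threshold=5):
        """session_clicks: list of (article, referrer_article_or_None) in one reading session."""
        children, roots = defaultdict(list), []
        for article, referrer in session_clicks:
            if referrer is None:
                roots.append(article)
            else:
                children[referrer].append(article)
        return any(tree_depth(children, r) >= threshold for r in roots)

    clicks = [("Coffee", None), ("Caffeine", "Coffee"), ("Adenosine", "Caffeine"),
              ("Nucleoside", "Adenosine"), ("RNA", "Nucleoside")]
    print(is_rabbit_hole(clicks, threshold=5))   # True for this chain of depth 5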

Building a Public Domain Voice Database for Odia

Projects like Mozilla Common Voice were born to address the challenges of unavailability of voice data or the high cost of available data for use in speech technology such as Automatic Speech Recognition (ASR) research and application development. The pilot detailed in this paper is about creating a large freely-licensed public repository of transcribed speech in the Odia language as such a repository was not known to be available. The strategy and methodology behind this process are based on the OpenSpeaks project. Licensed under a Public Domain Dedication (CC0 1.0), the repository currently includes audio recordings of pronunciations for more than 55,000 unique words in Odia, including more than 5,600 recordings of words in the northern Odia dialect Baleswari. No known public listing of words in this dialect was found by the author prior to this pilot. This repository is arguably the most extensive transcribed speech corpus in Odia that is also available publicly under any free and open license. This paper details the strategy, approach, and process behind building both the text and the speech corpus using many open source tools such as Lingua Libre, which can be helpful in building text and speech data for different low-medium-resource languages.