Prabhakar Raghavan has given a Keynote Talk at The ACM Web Conference 2022 on Wednesday 27th April 2022. This paper provides a summary of the topics he addressed during his talk.
Jaime Teevan has given a Keynote Talk at The Web Conference on Friday 29th April 2022. This page provides a summary of the topics she addressed during her talk.
Virginia Dignum has given a Keynote Talk at The ACM Web Conference 2022 on Thursday 28th April 2022. This paper provides a summary of the topics she addressed during her talk.
Personalized pricing is a business strategy to charge different prices to individual consumers based on their characteristics and behaviors. It has become common practice in many industries nowadays due to the availability of a growing amount of high granular consumer data. The discriminatory nature of personalized pricing has triggered heated debates among policymakers and academics on how to design regulation policies to balance market efficiency and equity. In this paper, we propose two sound policy instruments, i.e., capping the range of the personalized prices or their ratios. We investigate the optimal pricing strategy of a profit-maximizing monopoly under both regulatory constraints and the impact of imposing them on consumer surplus, producer surplus, and social welfare. We theoretically prove that both proposed constraints can help balance consumer surplus and producer surplus at the expense of total surplus for common demand distributions, such as uniform, logistic, and exponential distributions. Experiments on both simulation and real-world datasets demonstrate the correctness of these theoretical results1. Our findings and insights shed light on regulatory policy design for the increasingly monopolized business in the digital era.
We study the problem of financial assistance (bailouts, stimulus payments, or subsidy allocations) in a network where individuals experience income shocks. These questions are pervasive both in policy domains and in the design of new Web-enabled forms of financial interaction. We build on the financial clearing framework of Eisenberg and Noe that allows the incorporation of a bailout policy that is based on discrete bailouts motivated by stimulus programs in both off-line and on-line settings. We show that optimally allocating such bailouts on a financial network in order to maximize a variety of social welfare objectives of this form is a computationally intractable problem. We develop approximation algorithms to optimize these objectives and establish guarantees for their approximation ratios. Then, we incorporate multiple fairness constraints in the optimization problems and study their boundedness. Finally, we apply our methodology to data, both in the context of a system of large financial institutions with real-world data, as well as in a realistic societal context with financial interactions between people and businesses for which we use semi-artificial data derived from mobility patterns. Our results suggest that the algorithms we develop and study have reasonable results in practice and outperform other network-based heuristics. We argue that the presented problem through the societal-level lens could assist policymakers in making informed decisions on issuing subsidies.
We propose a general Variational Embedding Learning Framework (VELF) for alleviating the severe cold-start problem in CTR prediction. VELF addresses the cold start problem via alleviating over-fits caused by data-sparsity in two ways: learning probabilistic embedding, and incorporating trainable and regularized priors which utilize the rich side information of cold start users and advertisements (Ads). The two techniques are naturally integrated into a variational inference framework, forming an end-to-end training process. Abundant empirical tests on benchmark datasets well demonstrate the advantages of our proposed VELF. Besides, extended experiments confirmed that our parameterized and regularized priors provide more generalization capability than traditional fixed priors.
Multi-round competitions often double or triple the points awarded in the final round, calling it a bonus, to maximize spectators’ excitement. In a two-player competition with n rounds, we aim to derive the optimal bonus size to maximize the audience’s overall expected surprise (as defined in ). We model the audience’s prior belief over the two players’ ability levels as a beta distribution. Using a novel analysis that clarifies and simplifies the computation, we find that the optimal bonus depends greatly upon the prior belief and obtain solutions of various forms for both the case of a finite number of rounds and the asymptotic case. In an interesting special case, we show that the optimal bonus approximately and asymptotically equals to the “expected lead”, the number of points the weaker player will need to come back in expectation. Moreover, we observe that priors with a higher skewness lead to a higher optimal bonus size, and in the symmetric case, priors with a higher uncertainty also lead to a higher optimal bonus size. This matches our intuition since a highly asymmetric prior leads to a high “expected lead”, and a highly uncertain symmetric prior often leads to a lopsided game, which again benefits from a larger bonus.
We analyze the optimal information design in a click-through auction with stochastic click-through rates and known valuations per click. The auctioneer takes as given the auction rule of the click-through auction, namely the generalized second-price auction. Yet, the auctioneer can design the information flow regarding the click-through rates among the bidders. We require that the information structure to be calibrated in the learning sense. With this constraint, the auction needs to rank the ads by a product of the value and a calibrated prediction of the click-through rates. The task of designing an optimal information structure is thus reduced to the task of designing an optimal calibrated prediction.
We show that in a symmetric setting with uncertainty about the click-through rates, the optimal information structure attains both social efficiency and surplus extraction. The optimal information structure requires private (rather than public) signals to the bidders. It also requires correlated (rather than independent) signals, even when the underlying uncertainty regarding the click-through rates is independent. Beyond symmetric settings, we show that the optimal information structure requires partial information disclosure, and achieves only partial surplus extraction.
First-price auctions have many desirable properties, including uniquely possessing some, like credibility. However, first-price auctions are also inherently non-truthful, and non-truthfulness may result in instability and inefficiencies. Given these pros and cons, we seek to quantify the extent to which first-price auctions are susceptible to manipulation.
In this work we adopt a metric that was introduced in the context of bitcoin fee design markets: the percentage change in payment that can be achieved by being strategic. We study the behavior of this metric for single-unit and k-unit auction environments with n i.i.d. buyers, and seek conditions under which the percentage change tends to zero as n grows large.
To the best of our knowledge, ours is the first rigorous study of the extent to which large multi-unit first price auctions are susceptible to manipulation. We provide an almost complete picture of the conditions under which they are “truthful in the large,” and exhibit some surprising boundaries.
This paper studies equilibrium quality of semi-separable position auctions (known as the Ad Types setting ) with greedy or optimal allocation combined with generalized second-price (GSP) or Vickrey-Clarke-Groves (VCG) pricing. We make three contributions: first, we give upper and lower bounds on the Price of Anarchy (PoA) for auctions which use greedy allocation with GSP pricing, greedy allocation with VCG pricing, and optimal allocation with GSP pricing. Second, we give Bayes-Nash equilibrium characterizations for two-player, two-slot instances (for all auction formats) and show that there exists both a revenue hierarchy and revenue equivalence across some formats. Finally, we use no-regret learning algorithms and bidding data from a large online advertising platform to evaluate the performance of the mechanisms under semi-realistic conditions. We find that the VCG mechanism tends to obtain revenue and welfare comparable to or better than that of the other mechanisms. We also find that in practice, each of the mechanisms obtains significantly better welfare than our worst-case bounds might suggest.
We study a market of investments on networks, where each agent (vertex) can invest in any enterprise linked to her, and at the same time, raise capital for her firm’s enterprise from other agents she is linked to. Failing to raise sufficient capital results with the firm defaulting, being unable to invest in others. Our main objective is to examine the role of collateral contracts in handling the strategic risk that can propagate to a systemic risk throughout the network in a cascade of defaults. We take a mechanism-design approach and solve for the optimal scheme of collateral contracts that capital raisers offer their investors. These contracts aim at sustaining the efficient level of investment as a unique Nash equilibrium, while minimizing the total collateral.
Our main results contrast the network environment with its non-network counterpart (where the sets of investors and capital raisers are disjoint). We show that for acyclic investment networks, the network environment does not necessitate any additional collaterals, and systemic risk can be fully handled by optimal bilateral collateral contracts between capital raisers and their investors. This is, unfortunately, not the case for cyclic investment networks. We show that bilateral contracting will not suffice to resolve systemic risk, and the market will need an external entity to design a global collateral scheme for all capital raisers. Furthermore, the minimum total collateral that will sustain the efficient level of investment as a unique equilibrium may be arbitrarily higher, even in simple cyclic investment networks, compared with the corresponding non-network environment. Additionally, we prove computational-complexity results, both for a single enterprise and for networks.
For the scalability of industrial online advertising systems, a two-stage auction architecture is widely used to enable efficient ad allocation on a large set of corpus within a limited response time. The current deployed two-stage ad auction usually retrieves an ad subset by a coarse ad quality metric in a pre-auction stage, and then determines the auction outcome by a refined metric in the subsequent stage. However, this simple and greedy solution suffers from performance degradation, as it regards the decision in each stage separately, leading to an improper ad selection metric for the pre-auction stage. In this work, we explicitly investigate the relation between the coarse and refined ad quality metrics, and design a two-stage ad auction by taking the decision interaction between the two stages into account. We decouple the design of the two-stage auction by solving a stochastic subset selection problem in the pre-auction stage and conducting a general second price (GSP) auction in the second stage. We demonstrate that this decouple still preserves the incentive compatibility of the auction mechanism. As the proposed formulation of the pre-auction stage is an NP-hard problem, we propose a scalable approximation solution by defining a new subset selection metric, namely Pre-Auction Score (PAS). Experiment results on both public and industrial dataset demonstrate the improvement on social welfare and revenue of the proposed two-stage ad auction, than the intuitive greedy two-stage auction and other baselines.
We analyze a scenario in which software agents implemented as regret-minimizing algorithms engage in a repeated auction on behalf of their users. We study first-price and second-price auctions, as well as their generalized versions (e.g., as those used for ad auctions). Using both theoretical analysis and simulations, we show that, surprisingly, in second-price auctions the players have incentives to misreport their true valuations to their own learning agents, while in first-price auctions it is a dominant strategy for all players to truthfully report their valuations to their agents.
Time-series prediction is of high practical value in a wide range of applications such as econometrics and meteorology, where the data are commonly formed by temporal patterns. Most prior works ignore the diversity of dynamic pattern frequency, i.e., different granularities, suffering from insufficient information exploitation. Thus, multi-granularity learning is still under-explored for time-series prediction. In this paper, we propose a Multi-granularity Residual Learning Framework (MRLF) for more effective time series prediction. For a given time series, intuitively, there are more or less semantic overlaps and validity differences among its representations of different granularities. Due to the information redundancy, straightforward methods that leverage multi-granularity data, such as concatenation or ensemble, can easily lead to the model being dominated by the redundant coarse-grained trend information. Therefore, we design a novel residual learning net to model the prior knowledge of the fine-grained data’s distribution through the coarse-grained one. Then, by calculating the residual between multi-granularity data, the redundant information be removed. Furthermore, to alleviate the side effect of validity differences, we introduce a self-supervised objective for confidence estimation, which delivers more effective optimization without the requirement of additional annotation efforts. Extensive experiments on the real-world datasets indicate that multi-granular information significantly improves the time series prediction performance, and our model is superior in capturing such information.
In this paper, we study how to fairly allocate a set of indivisible chores to a number of (asymmetric) agents with additive cost functions. We consider the fairness notion of (weighted) proportionality up to any item (PROPX), and show that a (weighted) PROPX allocation always exists and can be computed efficiently. We also consider the partial information setting, where the algorithms can only use agents’ ordinal preferences. We design algorithms that achieve 2-approximate (weighted) PROPX, and the approximation ratio is optimal. We complement the algorithmic results by investigating the relationship between (weighted) PROPX and other fairness notions such as maximin share and AnyPrice share, and bounding the social welfare loss by enforcing the allocations to be (weighted) PROPX.
Understanding the value of acquiring or retaining subscribers is crucial for subscription-based businesses. While customer lifetime value (LTV) is commonly used to do so, we demonstrate that LTV likely over-states the true value of acquisition or retention. We establish a methodology to estimate the monetary value of acquired or retained subscribers based on estimating both on and off-service LTV. To overcome the lack of data on off-service households, we use an approach based on Markov chains that recovers off-service LTV from minimal data on non-subscriber transitions. Furthermore, we demonstrate how the methodology can be used to (i) forecast aggregate subscriber numbers that respect both aggregate market constraints and account-level dynamics, (ii) estimate the impact of price changes on revenue and subscription growth and (iii) provide optimal policies, such as price discounting, that maximize expected lifetime revenue.
Understanding the convergence properties of learning dynamics in repeated auctions is a timely and important question in the area of learning in auctions, with numerous applications in, e.g., online advertising markets. This work focuses on repeated first price auctions where bidders with fixed values for the item learn to bid using mean-based algorithms – a large class of online learning algorithms that include popular no-regret algorithms such as Multiplicative Weights Update and Follow the Perturbed Leader. We completely characterize the learning dynamics of mean-based algorithms, in terms of convergence to a Nash equilibrium of the auction, in two senses: (1) time-average: the fraction of rounds where bidders play a Nash equilibrium approaches 1 in the limit; (2) last-iterate: the mixed strategy profile of bidders approaches a Nash equilibrium in the limit. Specifically, the results depend on the number of bidders with the highest value: Our discovery opens up new possibilities in the study of convergence dynamics of learning algorithms.
Cloud computing customers often submit repeating jobs and computation pipelines on approximately regular schedules, with arrival and running times that exhibit variance. This pattern, typical of training tasks in machine learning, allows customers to partially predict future job requirements. We develop a model of cloud computing platforms that receive statements of work (SoWs) in an online fashion. The SoWs describe future jobs whose arrival times and durations are probabilistic, and whose utility to the submitting agents declines with completion time. The arrival and duration distributions, as well as the utility functions, are considered private customer information and are reported by strategic agents to a scheduler that is optimizing for social welfare.
We design pricing, scheduling, and eviction mechanisms that incentivize truthful reporting of SoWs. An important challenge is maintaining incentives despite the possibility of the platform becoming saturated. We introduce a framework to reduce scheduling under uncertainty to a relaxed scheduling problem without uncertainty. Using this framework, we tackle both adversarial and stochastic submissions of statements of work, and obtain logarithmic and constant competitive mechanisms, respectively.
Budget-management systems are one of the key components of modern auction markets. Internet advertising platforms typically offer advertisers the possibility to pace the rate at which their budget is depleted, through budget-pacing mechanisms. We focus on multiplicative pacing mechanisms in an online setting in which a bidder is repeatedly confronted with a series of advertising opportunities. After collecting bids, each item is then allocated through a single-item, second-price auction. If there were no budgetary constraints, bidding truthfully would be an optimal choice for the advertiser. However, since their budget is limited, the advertiser may want to shade their bid downwards in order to preserve their budget for future opportunities, and to spread expenditures evenly over time. The literature on online pacing problems mostly focuses on the setting in which the bidder optimizes an additive separable objective, such as the total click-through rate or the revenue of the allocation. In many settings, however, bidders may also care about other objectives which oftentimes are non-separable. We study the frequent case in which the utility of a (proxy) bidder depends on the rewards obtained from items they are allocated, and on the distance of the realized distribution of impressions from a target distribution. We introduce a novel regularizer which can describe those distributional preferences, while keeping the problem tractable. We show that this regularizer can be integrated into an existing online mirror descent scheme with minor modifications, attaining the optimal order of sub-linear regret compared to the optimal allocation in hindsight when inputs are drawn independently, from an unknown distribution. Moreover, we show that our approach can easily be incorporated in standard existing pacing systems that are not usually built for this objective. The effectiveness of our algorithm in internet advertising applications is confirmed by numerical experiments on real-world data.
Auto-bidding is an area of increasing importance in the domain of online advertising. We study the problem of designing auctions in an auto-bidding setting with the goal of maximizing welfare at system equilibrium. Previous results showed that the price of anarchy (PoA) under VCG is 2 and also that this is tight even with two bidders. This raises an interesting question as to whether VCG yields the best efficiency in this setting, or whether the PoA can be improved upon. We present a prior-free randomized auction in which the PoA is approx. 1.896 for the case of two bidders, proving that one can achieve an efficiency strictly better than that under VCG in this setting. We also provide a stark impossibility result for the problem in general as the number of bidders increases – we show that no (randomized) anonymous truthful auction can have a PoA strictly better than 2 asymptotically as the number of bidders per query increases. While it was shown in previous work that one can improve on the PoA of 2 if the auction is allowed to use the bidder’s values for the queries in addition to the bidder’s bids, we note that our randomized auction is prior-free and does not use such additional information; our impossibility result also applies to auctions without additional value information.
Two-sided marketplace platforms often run experiments (or A/B tests) to test the effect of an intervention before launching it platform-wide. A typical approach is to randomize users into a treatment group, which receives the intervention, and a control group, which does not. The platform then compares the performance in the two groups to estimate the effect if the intervention were launched to everyone. We focus on two common experiment types, where the platform randomizes users either on the supply side or on the demand side. For these experiments, it is known that the resulting estimates of the treatment effect are typically biased: individuals in the market compete with each other, which creates interference and leads to a biased estimate. Here, we observe that economic interactions (competition among demand and supply) lead to statistical phenomenon (biased estimates).
We develop a simple, tractable market model to study bias and variance in these experiments with interference. We focus on two choices available to the platform: (1) Which side of the platform should it randomize on (supply or demand)? (2) What proportion of individuals should be allocated to treatment? We find that both choices affect the bias and variance of the resulting estimators, but in different ways. The bias-optimal choice of experiment type depends on the relative amounts of supply and demand in the market, and we discuss how a platform can use market data to select the experiment type. Importantly, we find that in many circumstances choosing the bias-optimal experiment type has little effect on variance, and in some cases coincide with the variance-optimal type. On the other hand, we find that the choice of treatment proportion can induce a bias-variance tradeoff, where the bias-minimizing proportion increases variance. We discuss how a platform can navigate this tradeoff and best choose the proportion, using a combination of modeling as well as contextual knowledge about the market, the risk of the intervention, and reasonable effect sizes of the intervention.
The core objective of modelling recommender systems from implicit feedback is to maximize the positive sample score sp and minimize the negative sample score sn, which can usually be summarized into two paradigms: the pointwise and the pairwise. The pointwise approaches fit each sample with its label individually, which is flexible in weighting and sampling on instance-level but ignores the inherent ranking property. By qualitatively minimizing the relative score sn − sp, the pairwise approaches capture the ranking of samples naturally but suffer from training efficiency. Additionally, both approaches are hard to explicitly provide a personalized decision boundary to determine if users are interested in items unseen. To address those issues, we innovatively introduce an auxiliary score bu for each user to represent the User Interest Boundary(UIB) and individually penalize samples that cross the boundary with pairwise paradigms, i.e., the positive samples whose score is lower than bu and the negative samples whose score is higher than bu. In this way, our approach successfully achieves a hybrid loss of the pointwise and the pairwise to combine the advantages of both. Analytically, we show that our approach can provide a personalized decision boundary and significantly improve the training efficiency without any special sampling strategy. Extensive results show that our approach achieves significant improvements on not only the classical pointwise or pairwise models but also state-of-the-art models with complex loss function and complicated feature encoding.
User preference modeling is a vital yet challenging problem in personalized product search. In recent years, latent space based methods have achieved state-of-the-art performance by jointly learning semantic representations of products, users, and text tokens. However, existing methods are limited in their ability to model user preferences. They typically represent users by the products they visited in a short span of time using attentive models and lack the ability to exploit relational information such as user-product interactions or item co-occurrence relations. In this work, we propose to address the limitations of prior arts by exploring local and global user behavior patterns on a user successive behavior graph, which is constructed by utilizing short-term actions of all users. To capture implicit user preference signals and collaborative patterns, we use an efficient jumping graph convolution to explore high-order relations to enrich product representations for user preference modeling. Our approach can be seamlessly integrated with existing latent space based methods and be potentially applied in any product retrieval method that uses purchase history to model user preferences. Extensive experiments on eight Amazon benchmarks demonstrate the effectiveness and potential of our approach. The source code is available at https://github.com/floatSDSDS/SBG .
Click models are widely used for user simulation, relevance inference, and evaluation in Web search. Most existing click models implicitly assume that users’ relevance judgment and behavior patterns are homogeneous. However, previous studies have shown that different users interact with search engines in rather different ways. Therefore, a unified click model can hardly capture the heterogeneity in users’ click behavior. To shed light on this research question, we propose a Click Model Personalization framework (CMP) that adaptively selects from global and local models for individual users. Different adaptive strategies are designed to personalize click behavior modeling only for specific users and queries. We also reveal that capturing personalized behavior patterns is more important than modeling personalized relevance assessments in constructing personalized click models. To evaluate the performance of the proposed CMP framework, we build a large-scale practical Personalized Web Search (PWS) dataset, which consists of the search logs of 1,249 users from a commercial search engine over six months. Experimental results show that the proposed CMP framework achieves significant performance improvements than the non-personalized click models in click prediction.
Machine-learning based recommender system(RS) has become an effective means to help people automatically discover their interests. Existing models often represent the rich information for recommendation, such as items, users, and contexts, as embedding vectors and leverage them to predict users’ feedback. In the view of causal analysis, the associations between these embedding vectors and users’ feedback are a mixture of the causal part that describes why an item is preferred by a user, and the non-causal part that merely reflects the statistical dependencies between users and items, for example, the exposure mechanism, public opinions, display position, etc. However, existing RSs mostly ignored the striking differences between the causal parts and non-causal parts when using these embedding vectors. In this paper, we propose a model-agnostic framework named IV4Rec that can effectively decompose the embedding vectors into these two parts, hence enhancing recommendation results. Specifically, we jointly consider users’ behaviors in search scenarios and recommendation scenarios. Adopting the concepts in causal analysis, we embed users’ search behaviors as instrumental variables (IVs), to help decompose original embedding vectors in recommendation, i.e., treatments. IV4Rec then combines the two parts through deep neural networks and uses the combined results for recommendation. IV4Rec is model-agnostic and can be applied to a number of existing RSs such as DIN and NRHUB. Experimental results on both public and proprietary industrial datasets demonstrate that IV4Rec consistently enhances RSs and outperforms a framework that jointly considers search and recommendation.
Myriads of web applications in the Big Data era demand an effective measure of similarity based on billion-scale network structures, e.g., collaborative filtering. Recently, CoSimRank has been devised as a promising graph-theoretic similarity model, which iteratively captures the notion that “two distinct nodes are evaluated as similar if they are connected with similar nodes”. However, the existing CoSimRank model for assessing similarities may either yield unsatisfactory results or rather cost-inhibitive, rendering it impractical in massive graphs. In this paper, we propose CoSimHeat, a novel scalable graph-theoretic similarity model based on heat diffusion. Specifically, we first formulate CoSimHeat model by taking advantage of heat diffusion to emulate the activities of similarity propagations on the Web. Then, we show that the similarities produced by CoSimHeat are more satisfactory than those from CoSimRank families since CoSimHeat fulfils four axioms that an ideal similarity model should satisfy while circumventing the “dead-loop” problem of CoSimRank. Next, we propose a fast algorithm to substantially accelerate CoSimHeat computations on billion-sized graphs, with guarantees of accuracy. Our experiments on various datasets validate that CoSimHeat achieves higher accuracy and is order-of-magnitude faster than state-of-the-art competitors.
Click-through rate (CTR) prediction plays a crucial role in sponsored search advertising (search ads). User click behavior usually showcases strong comparison patterns among relevant/competing items within the user awareness. Explicit user awareness could be characterized by user behavior sequence modeling, which however suffers from issues such as cold start, behavior noise and hidden channels. Instead, in this paper, we study the problem of modeling implicit user awareness about relevant/competing items. We notice that candidate items of the CTR prediction model could play as surrogates for relevant/competing items within the user awareness. Motivated by this finding, we propose a novel framework, named CIM (Candidate Item Modeling), to characterize users’ awareness on candidate items. CIM introduces an additional module to encode candidate items into a context vector and therefore is plug-and-play for existing neural network-based CTR prediction models. Offline experiments on a ten-billion-scale production dataset collected from the real traffic of a search advertising system, together with the corresponding online A/B testing, demonstrate CIM’s superior performance. Notably, CIM has been deployed in production at JD.com, serving the main traffic of hundreds of millions of users, which shows great application value. Our code and dataset are available at https://github.com/kaifuzheng/cim.
A good personalized product search (PPS) system should not only focus on retrieving relevant products, but also consider user personalized preference. Recent work on PPS mainly adopts the representation learning paradigm, e.g., learning representations for each entity (including user, product and query) from historical user behaviors (aka. user-product-query interactions). However, we argue that existing methods do not sufficiently exploit the crucial collaborative signal, which is latent in historical interactions to reveal the affinity between the entities. Collaborative signal is quite helpful for generating high-quality representation, exploiting which would benefit the representation learning of one node from its connected nodes.
To tackle this limitation, in this work, we propose a new model IHGNN for personalized product search. IHGNN resorts to a hypergraph constructed from the historical user-product-query interactions, which could completely preserve ternary relations and express collaborative signal based on the topological structure. On this basis, we develop a specific interactive hypergraph neural network to explicitly encode the structure information (i.e., collaborative signal) into the embedding process. It collects the information from the hypergraph neighbors and explicitly models neighbor feature interaction to enhance the representation of the target entity. Extensive experiments on three real-world datasets validate the superiority of our proposal over the state-of-the-arts.
Neural document ranking approaches, specifically transformer models, have achieved impressive gains in ranking performance. However, query processing using such over-parameterized models is both resource and time intensive. In this paper, we propose the Fast-Forward index – a simple vector forward index that facilitates ranking documents using interpolation of lexical and semantic scores – as a replacement for contextual re-rankers and dense indexes based on nearest neighbor search. Fast-Forward indexes rely on efficient sparse models for retrieval and merely look up pre-computed dense transformer-based vector representations of documents and passages in constant time for fast CPU-based semantic similarity computation during query processing. We propose index pruning and theoretically grounded early stopping techniques to improve the query processing throughput. We conduct extensive large-scale experiments on TREC-DL datasets and show improvements over hybrid indexes in performance and query processing efficiency using only CPUs. Fast-Forward indexes can provide superior ranking performance using interpolation due to the complementary benefits of lexical and semantic similarities.
Selecting reliable negative training instances is the challenging task in the implicit feedback-based recommendation, which is optimized by pairwise learning on user feedback data. The existing methods usually exploit various negative samplers (i.e., heuristic-based or GAN-based sampling) on user feedback data to improve the quality of negative samples. However, these methods usually focused on maintaining the hard negative samples with a high gradient for training, causing the false negative samples to be selected preferentially. The limitation of the false negative noise amplification may lead to overfitting and further poor generalization of the model. To address this issue, we propose a Gain-Tuning Dynamic Negative Sampling GDNS to make the recommendation more robust and effective. Our proposed model designs an expectational gain sampler, concerning the expectation of user’ preference gap between the positive and negative samples in training, to guide the negative selection dynamically. This gain-tuning negative sampler can effectively identify the false negative samples and further diminish the risk of introducing false negative instances. Moreover, for improving the training efficiency, we construct positive and negative groups for each user in each iteration, and develop a group-wise optimizer to optimize them in a cross manner. Experiments on two real-world datasets show our approach significantly outperforms state-of-the-art negative sampling baselines.
Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, the embedding-based retrieval (EBR) becomes a promising solution, where deep learning based document representation and ANN search techniques are allied to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted in disk for fine-grained post verification. For the best of retrieval accuracy, a Progressive Optimization framework is designed. The sparse embeddings are learned ahead for high-quality search of candidates. Conditioned on the candidate distribution induced by the sparse embeddings, the dense embeddings are continuously learned to optimize the discrimination of ground-truth from the shortlisted candidates. Besides, two techniques: the contrastive quantization and the locality-centric sampling are introduced for the learning of sparse and dense embeddings, which substantially contribute to their performances. Thanks to the above features, our method effectively handles massive-scale EBR with strong advantages in accuracy: with up to recall gain on million-scale corpus, and up to recall gain on billion-scale corpus. Besides, Our method is applied to a major sponsored search platform with substantial gains on revenue (), Recall () and CTR (). Our code is available at https://github.com/microsoft/BiDR.
Ranking algorithms in recommender systems influence people to make decisions. Conventional ranking algorithms based on implicit feedback data aim to maximize the utility to users by capturing users’ preferences over items. However, these utility-focused algorithms tend to cause fairness issues that require careful consideration in online platforms. Existing fairness-focused studies does not explicitly consider the problem of lacking negative feedback in implicit feedback data, while previous utility-focused methods ignore the importance of fairness in recommendations. To fill this gap, we propose a Generative Adversarial Networks (GANs) based learning algorithm FairGAN mapping the exposure fairness issue to the problem of negative preferences in implicit feedback data. FairGAN does not explicitly treat unobserved interactions as negative, but instead, adopts a novel fairness-aware learning strategy to dynamically generate fairness signals. This optimizes the search direction to make FairGAN capable of searching the space of the optimal ranking that can fairly allocate exposure to individual items while preserving users’ utilities as high as possible.
Similarity search over a bipartite graph aims to retrieve from the graph the nodes that are similar to each other, which finds applications in various fields such as online advertising, recommender systems etc. Existing similarity measures either (i) overlook the unique properties of bipartite graphs, or (ii) fail to capture high-order information between nodes accurately, leading to suboptimal result quality. Recently, Hidden Personalized PageRank (HPP) is applied to this problem and found to be more effective compared with prior similarity measures. However, existing solutions for HPP computation incur significant computational costs, rendering it inefficient especially on large graphs.
In this paper, we first identify an inherent drawback of HPP and overcome it by proposing bidirectional HPP (BHPP). Then, we formulate similarity search over bipartite graphs as the problem of approximate BHPP computation, and present an efficient solution Approx-BHPP. Specifically, Approx-BHPP offers rigorous theoretical accuracy guarantees with optimal computational complexity by combining deterministic graph traversal with matrix operations in an optimized and non-trivial way. Moreover, our solution achieves significant gain in practical efficiency due to several carefully-designed optimizations. Extensive experiments, comparing BHPP against 8 existing similarity measures over 7 real bipartite graphs, demonstrate the effectiveness of BHPP on query rewriting and item recommendation. Moreover, Approx-BHPP outperforms baseline solutions often by up to orders of magnitude in terms of computational time on both small and large datasets.
In Information Retrieval (IR) evaluation, preference judgments are collected by presenting to the assessors a pair of documents and asking them to select which of the two, if any, is the most relevant. This is an alternative to the classic relevance judgment approach, in which human assessors judge the relevance of a single document on a scale; such an alternative allows to make relative rather than absolute judgments of relevance. While preference judgments are easier for human assessors to perform, the number of possible document pairs to be judged is usually so high that it makes it unfeasible to judge them all. Thus, following a similar idea to pooling strategies for single document relevance judgments where the goal is to sample the most useful documents to be judged, in this work we focus on analyzing alternative ways to sample document pairs to judge, in order to maximize the value of a fixed number of preference judgments that can feasibly be collected. Such value is defined as how well we can evaluate IR systems given a budget, that is, a fixed number of human preference judgments that may be collected. By relying on several datasets featuring relevance judgments gathered by means of experts and crowdsourcing, we experimentally compare alternative strategies to select document pairs and show how different strategies lead to different IR evaluation result quality levels. Our results show that, by using the appropriate procedure, it is possible to achieve good IR evaluation results with a limited number of preference judgments, thus confirming the feasibility of using preference judgments to create IR evaluation collections.
Based on the success of recommender systems in e-commerce and entertainment, there is growing interest in their use in matching markets like job search. While this holds potential for improving market fluidity and fairness, we show in this paper that naively applying existing recommender systems to matching markets is sub-optimal. Considering the standard process where candidates apply and then get evaluated by employers, we present a new recommendation framework to model this interaction mechanism and propose efficient algorithms for computing personalized rankings in this setting. We show that the optimal rankings need to not only account for the potentially divergent preferences of candidates and employers, but they also need to account for capacity constraints. This makes conventional ranking systems that merely rank by some local score (e.g., one-sided or reciprocal relevance) highly sub-optimal — not only for an individual user, but also for societal goals (e.g., low unemployment). To address this shortcoming, we propose the first method for jointly optimizing the rankings for all candidates in the market to explicitly maximize social welfare. In addition to the theoretical derivation, we evaluate the method both on simulated environments and on data from a real-world networking-recommendation system that we built and fielded at a large computer science conference.
Utilizing pre-trained language models has achieved great success for neural document ranking. Limited by the computational and memory requirements, long document modeling becomes a critical issue. Recent works propose to modify the full attention matrix in Transformer by designing sparse attention patterns. However, most of them only focus on local connections of terms within a fixed-size window. How to build suitable remote connections between terms to better model document representation remains underexplored. In this paper, we propose the model Socialformer, which introduces the characteristics of social networks into designing sparse attention patterns for long document modeling in document ranking. Specifically, we consider several attention patterns to construct a graph like social networks. Endowed with the characteristic of social networks, most pairs of nodes in such a graph can reach with a short path while ensuring the sparsity. To facilitate efficient calculation, we segment the graph into multiple subgraphs to simulate friend circles in social scenarios. Experimental results confirm the effectiveness of our model on long document modeling.
User cold-start recommendation is a serious problem that limits the performance of recommender systems (RSs). Recent studies have focused on treating this issue as a few-shot problem and seeking solutions with model-agnostic meta-learning (MAML). Such methods regard making recommendations for one user as a task and adapt to new users with a few steps of gradient updates on the meta-model. However, none of those methods consider the limitation of user representation learning imposed by the special task setting of MAML-based RSs. And they learn a common meta-model for all users while ignoring the implicit grouping distribution induced by the correlation differences among users. In response to the above problems, we propose a pretrained network modulation and task adaptation approach (PNMTA) for user cold-start recommendation. In the pretraining stage, a pretrained model is obtained with non-meta-learning methods to achieve better user representation and generalization, which can also transfer the learned knowledge to the meta-learning stage for modulation. During the meta-learning stage, an encoder modulator is utilized to realize the memorization and correction of prior parameters for the meta-learning task, and a predictor modulator is introduced to condition the model initialization on the task identity for adaptation steps. In addition, PNMTA can also make use of the existing non-cold-start users for pretraining. Comprehensive experiments on two benchmark datasets demonstrate that our model can achieve significant and consistent improvements against other state-of-the-art methods.
Product search has been an important way for people to find products on online shopping platforms. Existing approaches in personalized product search mainly embed user preferences into one single vector. However, this simple strategy easily results in sub-optimal representations, failing to model and disentangle user’s multiple preferences. To overcome this problem, we proposed a category-aware multi-interest model to encode users as multiple preference embeddings to represent user-specific interests. Specifically, we also capture the category indications for each preference to indicate the distribution of categories it focuses on, which is derived from rich relations between users, products, and attributes. Based on these category indications, we develop a category attention mechanism to aggregate these various preference embeddings considering current queries and items as the user’s comprehensive representation. By this means, we can use this representation to calculate matching scores of retrieved items to determine whether they meet the user’s search intent. Besides, we introduce a homogenization regularization term to avoid the redundancy between user interests. Experimental results show that the proposed method significantly outperforms existing approaches.
Alleviating the delayed feedback problem is of crucial importance for the conversion rate(CVR) prediction in online advertising. Previous delayed feedback modeling methods using an observation window to balance the trade-off between waiting for accurate labels and consuming fresh feedback. Moreover, to estimate CVR upon the freshly observed but biased distribution with fake negatives, the importance sampling is widely used to reduce the distribution bias. While effective, we argue that previous approaches falsely treat fake negative samples as real negative during the importance weighting and have not fully utilized the observed positive samples, leading to suboptimal performance.
In this work, we propose a new method, DElayed Feedback modeling with UnbiaSed Estimation, (DEFUSE), which aim to respectively correct the importance weights of the immediate positive, the fake negative, the real negative, and the delay positive samples at finer granularity. Specifically, we propose a two-step optimization approach that first infers the probability of fake negatives among observed negatives before applying importance sampling. To fully exploit the ground-truth immediate positives from the observed distribution, we further develop a bi-distribution modeling framework to jointly model the unbiased immediate positives and the biased delay conversions. Experimental results on both public and our industrial datasets validate the superiority of DEFUSE. Codes are available at https://github.com/ychen216/DEFUSE.git.
Reading comprehension is a complex cognitive process involving many human brain activities. However, little is known about what happens in human brain during reading comprehension and how these cognitive activities can affect information retrieval process. Additionally, with the advances in brain imaging techniques such as electroencephalogram (EEG), it is possible to collect brain signals in almost real time and explore whether it can be utilized as feedback to facilitate information acquisition performance.
In this paper, we carefully design a lab-based user study to investigate brain activities during reading comprehension. Our findings show that neural responses vary with different types of reading contents, i.e., contents that can satisfy users’ information needs and contents that cannot. We suggest that various cognitive activities, e.g., cognitive loading, semantic-thematic understanding, and inferential processing, underpin these neural responses at the micro-time scale during reading comprehension. From these findings, we illustrate several insights for information retrieval tasks, such as ranking models construction and interface design. Besides, with the emerging of portable EEG-based applications, we suggest the possibility of detecting reading comprehension status for a proactive real-world system. To this end, we propose a Unified framework for EEG-based Reading Comprehension Modeling (UERCM). To verify its effectiveness, we conduct extensive experiments based on EEG features for two reading comprehension tasks: answer sentence classification and answer extraction. Results show that it is feasible to improve the performance of two tasks with brain signals. These findings imply that brain signals are valuable feedback for enhancing human-computer interactions during reading comprehension.
Research on click models usually focuses on developing effective approaches to reduce biases in user clicks. However, one of the major drawbacks of existing click models is the lack of scalability. In this work, we tackle the scalability of Expectation-Maximization (EM)-based click models by introducing ParClick, a new parallel algorithm designed by following the Partitioning-Communication-Aggregation-Mapping (PCAM) method. To this end, we first provide a generic formulation of EM-based click models. Then, we design an efficient parallel version of this generic click model following the PCAM approach: we partition user click logs and model parameters into separate tasks, analyze communication among them, and aggregate these tasks to reduce communication overhead. Finally, we provide a scalable, parallel implementation of the proposed design, which maps well on a multi-core machine. Our experiments on the Yandex relevance prediction dataset show that ParClick scales well when increasing the amount of training data and computational resources. In particular, ParClick is 24.7 times faster to train with 40 million search sessions and 40 threads compared to the standard sequential version of the Click Chain Model (CCM) without any degradation in effectiveness.
E-commerce platforms usually display a mixed list of ads and organic items in feed. One key problem is to allocate the limited slots in the feed to maximize the overall revenue as well as improve user experience, which requires a good model for user preference. Instead of modeling the influence of individual items on user behaviors, the arrangement signal models the influence of the arrangement of items and may lead to a better allocation strategy. However, most of previous strategies fail to model such a signal and therefore result in suboptimal performance. In addition, the percentage of ads exposed (PAE) is an important indicator in ads allocation. Excessive PAE hurts user experience while too low PAE reduces platform revenue. Therefore, how to constrain the PAE within a certain range while keeping personalized recommendation under the PAE constraint is a challenge.
In this paper, we propose Cross Deep Q Network (Cross DQN) to extract the crucial arrangement signal by crossing the embeddings of different items and modeling the crossed sequence by multi-channel attention. Besides, we propose an auxiliary loss for batch-level constraint on PAE to tackle the above-mentioned challenge. Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on Meituan feed to serve more than 300 millions of customers.
In spite of the tremendous development of recommender system owing to the progressive capability of machine learning recently, the current recommender system is still vulnerable to the distribution shift of users and items in realistic scenarios, leading to the sharp decline of performance in testing environments. It is even more severe in many common applications where only the implicit feedback from sparse data is available. Hence, it is crucial to promote the performance stability of recommendation method in different environments. In this work, we first make a thorough analysis of implicit recommendation problem from the viewpoint of out-of-distribution (OOD) generalization. Then under the guidance of our theoretical analysis, we propose to incorporate the recommendation-specific DAG learner into a novel causal preference-based recommendation framework named CausPref, mainly consisting of causal learning of invariant user preference and anti-preference negative sampling to deal with implicit feedback. Extensive experimental results from real-world datasets clearly demonstrate that our approach surpasses the benchmark models significantly under types of out-of-distribution settings, and show its impressive interpretability.
In many classical e-commerce platforms, personalized recommendation has been proven to be of great business value, which can improve user satisfaction and increase the revenue of platforms. In this paper, we present a new recommendation problem, Trigger-Induced Recommendation (TIR), where users’ instant interest can be explicitly induced with a trigger item and follow-up related target items are recommended accordingly. TIR has become ubiquitous and popular in e-commerce platforms. In this paper, we figure out that although existing recommendation models are effective in traditional recommendation scenarios by mining users’ interests based on their massive historical behaviors, they are struggling in discovering users’ instant interests in the TIR scenario due to the discrepancy between these scenarios, resulting in inferior performance. To tackle the problem, we propose a novel recommendation method named Deep Interest Highlight Network (DIHN) for Click-Through Rate (CTR) prediction in TIR scenarios. It has three main components including 1) User Intent Network (UIN), which responds to generate a precise probability score to predict user’s intent on the trigger item; 2) Fusion Embedding Module (FEM), which adaptively fuses trigger item and target item embeddings based on the prediction from UIN; and (3) Hybrid Interest Extracting Module (HIEM), which can effectively highlight users’ instant interest from their behaviors based on the result of FEM. Extensive offline and online evaluations on a real-world e-commerce platform demonstrate the superiority of DIHN over state-of-the-art methods. Our code is available 1.
Existing online learning to rank (OL2R) solutions are limited to linear models, which are incompetent to capture possible non-linear relations between queries and documents. In this work, to unleash the power of representation learning in OL2R, we propose to directly learn a neural ranking model from users’ implicit feedback (e.g., clicks) collected on the fly. We focus on RankNet and LambdaRank, due to their great empirical success and wide adoption in offline settings, and control the notorious explore-exploit trade-off based on the convergence analysis of neural networks using neural tangent kernel. Specifically, in each round of result serving, exploration is only performed on document pairs where the predicted rank order between the two documents is uncertain; otherwise, the ranker’s predicted order will be followed in result ranking. We prove that under standard assumptions our OL2R solution achieves a gap-dependent upper regret bound of O(log 2(T)), in which the regret is defined on the total number of mis-ordered pairs over T rounds. Comparisons against an extensive set of state-of-the-art OL2R baselines on two public learning to rank benchmark datasets demonstrate the effectiveness of the proposed solution.
A table is composed of data values that are organized in rows and columns providing implicit structural information. A table is usually accompanied by secondary information such as the caption, page title, etc., that form the textual information. Understanding the connection between the textual and structural information is an important, yet neglected aspect in table retrieval, as previous methods treat each source of information independently. In this paper, we propose StruBERT, a structure-aware BERT model that fuses the textual and structural information of a data table to produce context-aware representations for both textual and tabular content of a data table. We introduce the concept of horizontal self-attention, which extends the idea of vertical self-attention introduced in TaBERT and allows us to treat both dimensions of a table equally. StruBERT features are integrated in a new end-to-end neural ranking model to solve three table-related downstream tasks: keyword- and content-based table retrieval, and table similarity. We evaluate our approach using three datasets, and we demonstrate substantial improvements in terms of retrieval and classification metrics over state-of-the-art methods.
Tree-based models underpin many modern semantic search engines and recommender systems due to their sub-linear inference times. In industrial applications, these models operate at extreme scales, where every bit of performance is critical. Memory constraints at extreme scales also require that models be sparse, hence tree-based models are often back-ended by sparse matrix algebra routines. However, there are currently no sparse matrix techniques specifically designed for the sparsity structure one encounters in tree-based models for extreme multi-label ranking/classification (XMR/XMC) problems. To address this issue, we present the masked sparse chunk multiplication (MSCM) technique, a sparse matrix technique specifically tailored to XMR trees. MSCM is easy to implement, embarrassingly parallelizable, and offers a significant performance boost to any existing tree inference pipeline at no cost. We perform a comprehensive study of MSCM applied to several different sparse inference schemes and benchmark our methods on a general purpose extreme multi-label ranking framework. We observe that MSCM gives consistently dramatic speedups across both the online and batch inference settings, single- and multi-threaded settings, and on many different tree models and datasets. To demonstrate its utility in industrial applications, we apply MSCM to an enterprise-scale semantic product search problem with 100 million products and achieve sub-millisecond latency of 0.88 ms per query on a single thread — an 8x reduction in latency over vanilla inference techniques. The MSCM technique requires absolutely no sacrifices to model accuracy as it gives exactly the same results as standard sparse matrix techniques. Therefore, we believe that MSCM will enable users of XMR trees to save a substantial amount of compute resources in their inference pipelines at very little cost. Our code is publicly available at https://github.com/amzn/pecos , as well as our complete benchmarks and code for reproduction at https://github.com/UniqueUpToPermutation/pecos/tree/benchmark.
With the increasing demands on e-commerce platforms, numerous user action history is emerging. Those enriched action records are vital to understand users’ interests and intents. Recently, prior works for user behavior prediction mainly focus on the interactions with product-side information. However, the interactions with search queries, which usually act as a bridge between users and products, are still under investigated. In this paper, we explore a new problem named temporal event forecasting, a generalized user behavior prediction task in a unified query product evolutionary graph, to embrace both query and product recommendation in a temporal manner. To fulfill this setting, there involves two challenges: (1) the action data for most users is scarce; (2) user preferences are dynamically evolving and shifting over time. To tackle those issues, we propose a novel Retrieval-Enhanced Temporal Event (RETE) forecasting framework. Unlike existing methods that enhance user representations via roughly absorbing information from connected entities in the whole graph, RETE efficiently and dynamically retrieves relevant entities centrally on each user as high-quality subgraphs, preventing the noise propagation from the densely evolutionary graph structures that incorporate abundant search queries. And meanwhile, RETE autoregressively accumulates retrieval-enhanced user representations from each time step, to capture evolutionary patterns for joint query and product prediction. Empirically, extensive experiments on both the public benchmark and four real-world industrial datasets demonstrate the effectiveness of the proposed RETE method.
Ranking has been one of the most important tasks in information retrieval. With the development of deep representation learning, many researchers propose to encode both the query and items into embedding vectors and rank the items according to the inner product or distance measures in the embedding space. However, the ranking models based on vector embeddings may have shortages in effectiveness and efficiency. For effectiveness, they lack the intrinsic ability to model the diversity and uncertainty of queries and items in ranking. For efficiency, nearest neighbor search in a large collection of item vectors can be costly. In this work, we propose to use the recently proposed probabilistic box embeddings for effective and efficient ranking, in which queries and items are parameterized as high-dimensional axis-aligned hyper-rectangles. For effectiveness, we utilize probabilistic box embeddings to model the diversity and uncertainty with the overlapping relations of the hyper-rectangles, and prove that such overlapping measure is a kernel function which can be adopted in other kernel-based methods. For efficiency, we propose a box embedding-based indexing method, which can safely filter irrelevant items and reduce the retrieval latency. We further design a training strategy to increase the proportion of irrelevant items that can be filtered by the index. Experiments on public datasets show that the box embeddings and the box embedding-based indexing approaches are effective and efficient in two ranking tasks: ad hoc retrieval and product recommendation.
Conversational recommender systems (CRSs) have been proposed recently to mitigate the cold-start problem suffered by the traditional recommender systems. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for extensive exploration and elicit the user preferences faster and more accurately. However, existing conversational recommenders leveraging key-terms heavily rely on the availability and quality of the key-terms, and their performances might degrade significantly when the key-terms are incomplete or not well labeled, which usually happens when there are new items being consistently incorporated into the systems and involving lots of human efforts to acquire well-labeled key-terms is costly. Besides, existing CRS methods leverage the feedback to different conversational key-terms separately, without considering the underlying relations between the key-terms. In this case, the learning of the conversational recommenders is sample inefficient, especially when there is a large number of candidate conversational key-terms.
In this paper, we propose a knowledge-aware conversational preference elicitation framework and a bandit-based algorithm GraphConUCB. To achieve efficient preference elicitation given items with incompletely labeled key-terms, our algorithm leverage the underlying relations between the key-terms, guided by the knowledge graph. Being knowledge-aware, our algorithm propagates the user preferences via a pseudo graph feedback module, which also accelerates the exploration in the large action space of key-terms and improves the conversational sample efficiency. To select the most informative conversational key-terms in the graphs to conduct conversations, we further devise a graph-based optimal design module which leverages the graph structure. We provide the theoretical analysis of the regret upper bound for GraphConUCB. With extensive experiments, we show that our algorithm can effectively handle the items with incompletely labeled key-terms, and improves over the state-of-the-art baselines significantly.
Product ranking is a crucial component for many e-commerce services. One of the major challenges in product search is the vocabulary mismatch between query and products, which may be a larger vocabulary gap problem compared to other information retrieval domains. While there is a growing collection of neural learning to match methods aimed specifically at overcoming this issue, they do not leverage the recent advances of large language models for product search. On the other hand, product ranking often deals with multiple types of engagement signals such as clicks, add-to-cart, and purchases, while most of the existing works are focused on optimizing one single metric such as click-through rate, which may suffer from data sparsity. In this work, we propose a novel end-to-end multi-task learning framework for product ranking with BERT to address the above challenges. The proposed model utilizes domain-specific BERT with fine-tuning to bridge the vocabulary gap and employs multi-task learning to optimize multiple objectives simultaneously, which yields a general end-to-end learning framework for product search. We conduct a set of comprehensive experiments on a real-world e-commerce dataset and demonstrate significant improvement of the proposed approach over the state-of-the-art baseline methods.
A BERT-based Neural Ranking Model (NRM) can be either a cross-encoder or a bi-encoder. Between the two, bi-encoder is highly efficient because all the documents can be pre-processed before the actual query time. In this work, we show two approaches for improving the performance of BERT-based bi-encoders. The first approach is to replace the full fine-tuning step with a lightweight fine-tuning. We examine lightweight fine-tuning methods that are adapter-based, prompt-based, and hybrid of the two. The second approach is to develop semi-Siamese models where queries and documents are handled with a limited amount of difference. The limited difference is realized by learning two lightweight fine-tuning modules, where the main language model of BERT is kept common for both query and document. We provide extensive experiment results for monoBERT, TwinBERT, and ColBERT where three performance metrics are evaluated over Robust04, ClueWeb09b, and MS-MARCO datasets. The results confirm that both lightweight fine-tuning and semi-Siamese are considerably helpful for improving BERT-based bi-encoders. In fact, lightweight fine-tuning is helpful for cross-encoder, too.1
Recent advancements in web-based multimedia technologies, such as face recognition web services powered by deep learning, have been significant. As a result, companies such as Microsoft, Amazon, and Naver provide highly accurate commercial face recognition web services for a variety of multimedia applications. Naturally, such technologies face persistent threats, as virtually anyone with access to deepfakes can quickly launch impersonation attacks. These attacks pose a serious threat to authentication services, which rely heavily on the performance of their underlying face recognition technologies. Despite its gravity, deepfake abuse involving commercial web services and their robustness have not been thoroughly measured and investigated. By conducting a case study on celebrity face recognition, we examine the robustness of black-box commercial face recognition web APIs and open-source tools against Deepfake Impersonation (DI) attacks. While the majority of APIs do not make specific claims of deepfake robustness, we find that authentication mechanisms may get built one top of them, nonetheless. We demonstrate the vulnerability of face recognition technologies to DI attacks, achieving respective success rates of 78.0% for targeted (TA) attacks; we also propose mitigation strategies, lowering respective attack success rates to as low as 1.26% for TA attacks with adversarial training.
Recent years have seen increased attacks targeting embedded web applications of IoT devices. An important target of such attacks is the hidden interface of embedded web applications, which employs no protection but exposes security-critical actions and sensitive information to illegitimate users. With the severity and the pervasiveness of this issue, it is crucial to identify the vulnerable hidden interfaces, shed light on best practices and raise public awareness.
In this paper, we present, a new approach that automatically exposes hidden web interfaces of IoT devices. Specifically, constructs probing requests through firmware analysis to test physical devices, and narrows down the scope of identification by filtering out irrelevant requests and interfaces through differential analysis. It pinpoints hidden interfaces by attaching various device-setting parameters in the probing requests and matching keywords of sensitive information. Evaluated on 17 IoT devices, successfully identified 44 vulnerabilities, including 43 previously unknown ones. also demonstrates surprising efficiency: on average, it delivered 151438 probing requests, taking only 47 minutes on each target device.
Web measurement studies can shed light on not yet fully understood phenomena and thus are essential for analyzing how the modern Web works. This often requires building new and adjusting existing crawling setups, which has led to a wide variety of analysis tools for different (but related) aspects. If these efforts are not sufficiently documented, the reproducibility and replicability of the measurements may suffer—two properties that are crucial to sustainable research. In this paper, we survey 117 recent research papers to derive best practices for Web-based measurement studies and specify criteria that need to be met in practice. When applying these criteria to the surveyed papers, we find that the experimental setup and other aspects essential to reproducing and replicating results are often missing. We underline the criticality of this finding by performing a large-scale Web measurement study on 4.5 million pages with 24 different measurement setups to demonstrate the influence of the individual criteria. Our experiments show that slight differences in the experimental setup directly affect the overall results and must be documented accurately and carefully.
Socialbots are software-driven user accounts on social platforms, acting autonomously (mimicking human behavior), with the aims to influence the opinions of other users or spread targeted misinformation for particular goals. As socialbots undermine the ecosystem of social platforms, they are often considered harmful. As such, there have been several computational efforts to auto-detect the socialbots. However, to our best knowledge, the adversarial nature of these socialbots has not yet been studied. This begs a question “can adversaries, controlling socialbots, exploit AI techniques to their advantage?” To this question, we successfully demonstrate that indeed it is possible for adversaries to exploit computational learning mechanism such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding being detected. We first formulate the adversarial socialbot learning as a cooperative game between two functional hierarchical RL agents. While one agent curates a sequence of activities that can avoid the detection, the other agent aims to maximize network influence by selectively connecting with right users. Our proposed policy networks train with a vast amount of synthetic graphs and generalize better than baselines on unseen real-life graphs both in terms of maximizing network influence (up to +18%) and sustainable stealthiness (up to +40% undetectability) under a strong bot detector (90% detection accuracy). During inference, the complexity of our approach scales linearly, independent of a network’s structure and the virality of news. This makes our attack very practical in a real-life setting.
Social media platforms are driven by user engagement metrics. Unfortunately, such metrics are susceptible to manipulation and expose the platforms to abuse. Video view fraud is a unique class of fake engagement abuse on video-sharing platforms, such as YouTube, where the view count of videos is artificially inflated. There exists limited research on such abuse, and prior work focused on automated or bot-driven approaches. In this paper, we explore organic or human-driven approaches to view fraud, conducting a case study on a long-running YouTube view fraud campaign operated on a popular free video streaming service, 123Movies. Before 123Movies users are allowed to access a stream on the service, they must watch an unsolicited YouTube video displayed as a pre-roll advertisement. Due to 123Movies’ popularity, this activity drives large-scale YouTube view fraud. In this study, we reverse-engineer how 123Movies distributes these YouTube videos as pre-roll advertisements, and track the YouTube videos involved over a 9-month period. For a subset of these videos, we monitor their view counts and metrics for their respective YouTube channels over the same period. Our analysis reveals the characteristics of YouTube channels and videos participating in this view fraud, as well as the efficacy of such view fraud efforts. Ultimately, our study provides empirical grounding on organic YouTube view fraud.
Past privacy measurement studies on web tracking focused on high-ranked commercial websites, as user tracking is extensively used for monetization on those sites. Conversely, governments across the globe now offer services online, which unlike commercial sites, are funded by public money, and do not generally make it to the top million website lists. As such, web tracking on those services has not been comprehensively studied, even though these services deal with privacy and security-sensitive user data, and used by a significant number of users. In this paper, we perform privacy and security measurements on government websites and Android apps: 150,244 unique websites (from 206 countries) and 1166 Android apps (from 71 countries). We found numerous commercial trackers on these services—e.g., 17% of government websites and 37% of government Android apps host Google trackers; 13% of government sites contain YouTube cookies with an expiry date in the year of 9999. 27% of government Android apps leak sensitive information (e.g., user/device identifiers, passwords, API keys) to third parties, or any network attacker (when sent over HTTP). We also found 304 government sites and 40 apps are flagged by VirusTotal as malicious. We hope our findings to help improve privacy and security of online government services, given that governments are now apparently taking Internet privacy/security seriously and imposing strict regulations on commercial sites.
Ad blockers heavily rely on filter lists to block ad domains, which can serve advertisements and trackers. However, recent research has reported that some advertisers keep registering replica ad domains (RAD domains)—new domains that serve the same purpose as the original ones—which tend to slip through ad-blocker filter lists. Although this phenomenon might negatively affect ad blockers’ effectiveness, no study to date has thoroughly investigated its prevalence and the issues caused by RAD domains. In this work, we proposed methods to discover RAD domains and categorized their change patterns. From a crawl of 50,000 websites, we identified 1,748 unique RAD domains, 1,096 of which survived for an average of 410.5 days before they were blocked; the rest have not been blocked as of February 2021. Notably, we found that non-blocked RAD domains could extend the timespan of ad or tracker distribution by more than two years. Our analysis further revealed a taxonomy of four techniques used to create RAD domains, including two less-studied ones. Additionally, we discovered that the RAD domains affected 10.2% of the websites we crawled, and 23.7% of the RAD domains exhibiting privacy-intrusive behaviors, undermining ad blockers’ privacy protection.
Digital media (including websites and online social networks) facilitate the broadcasting of news via flexible and personalized channels. Unlike conventional newspapers which become “read-only” upon publication, online news sources are free to arbitrarily modify news headlines after their initial release. The motivation, frequency, and effect of post-publication headline changes are largely unknown, with no offline equivalent from where researchers can draw parallels.
In this paper, we collect and analyze over 41K pairs of altered news headlines by tracking ∼ 411K articles from major US news agencies over a six month period (March to September 2021), identifying that 7.5% articles have at least one post-publication headline edit with a wide range of types, from minor updates, to complete rewrites. We characterize the frequency with which headlines are modified and whether certain outlets are more likely to be engaging in post-publication headline changes than others. We discover that 49.7% of changes go beyond minor spelling or grammar corrections, with 23.13% of those resulting in drastically disparate information conveyed to readers. Finally, to better understand the interaction between post-publication headline edits and social media, we conduct a temporal analysis of news popularity on Twitter. We find that an effective headline post-publication edit should occur within the first ten hours after the initial release to ensure that the previous, potentially misleading, information does not fully propagate over the social network.
Recent years, local differential privacy (LDP) has been adopted by many web service providers like Google , Apple  and Microsoft  to collect and analyse users’ data privately. In this paper, we consider the problem of discrete distribution estimation under local differential privacy constraints. Distribution estimation is one of the most fundamental estimation problems, which is widely studied in both non-private and private settings. In the local model, private mechanisms with provably optimal sample complexity are known. However, they are optimal only in the worst-case sense; their sample complexity is proportional to the size of the entire universe, which could be huge in practice. In this paper, we consider sparse or approximately sparse (e.g. highly skewed) distribution, and show that the number of samples needed could be significantly reduced. This problem has been studied recently , but they only consider strict sparse distributions and the high privacy regime. We propose new privatization mechanisms based on compressive sensing. Our methods work for approximately sparse distributions and medium privacy, and have optimal sample and communication complexity.
Given a stream of entries over time in a multi-dimensional data setting where concept drift is present, how can we detect anomalous activities? Most of the existing unsupervised anomaly detection approaches seek to detect anomalous events in an offline fashion and require a large amount of data for training. This is not practical in real-life scenarios where we receive the data in a streaming manner and do not know the size of the stream beforehand. Thus, we need a data-efficient method that can detect and adapt to changing data trends, or concept drift, in an online manner. In this work, we propose MemStream, a streaming anomaly detection framework, allowing us to detect unusual events as they occur while being resilient to concept drift. We leverage the power of a denoising autoencoder to learn representations and a memory module to learn the dynamically changing trend in data without the need for labels. We prove the optimum memory size required for effective drift handling. Furthermore, MemStream makes use of two architecture design choices to be robust to memory poisoning. Experimental results show the effectiveness of our approach compared to state-of-the-art streaming baselines using 2 synthetic datasets and 11 real-world datasets.
We explore the problem of selectively forgetting categories from trained CNN classification models in federated learning (FL). Given that the data used for training cannot be accessed globally in FL, our insights probe deep into the internal influence of each channel. Through the visualization of feature maps activated by different channels, we observe that different channels have a varying contribution to different categories in image classification.
Inspired by this, we propose a method for scrubbing the model cleanly of information about particular categories. The method does not require retraining from scratch, nor global access to the data used for training. Instead, we introduce the concept of Term Frequency Inverse Document Frequency (TF-IDF) to quantize the class discrimination of channels. Channels with high TF-IDF scores have more discrimination on the target categories and thus need to be pruned to unlearn. The channel pruning is followed by a fine-tuning process to recover the performance of the pruned model.
Evaluated on CIFAR10 dataset, our method accelerates the speed of unlearning by 8.9× for the ResNet model, and 7.9× for the VGG model under no degradation in accuracy, compared to retraining from scratch. For CIFAR100 dataset, the speedups are 9.9× and 8.4×, respectively. We envision this work as a complementary block for FL towards compliance with legal and ethical criteria.
Encrypted traffic classification requires discriminative and robust traffic representation captured from content-invisible and imbalanced traffic data for accurate classification, which is challenging but indispensable to achieve network security and network management. The major limitation of existing solutions is that they highly rely on the deep features, which are overly dependent on data size and hard to generalize on unseen data. How to leverage the open-domain unlabeled traffic data to learn representation with strong generalization ability remains a key challenge. In this paper, we propose a new traffic representation model called Encrypted Traffic Bidirectional Encoder Representations from Transformer (ET-BERT), which pre-trains deep contextualized datagram-level representation from large-scale unlabeled data. The pre-trained model can be fine-tuned on a small number of task-specific labeled data and achieves state-of-the-art performance across five encrypted traffic classification tasks, remarkably pushing the F1 of ISCX-VPN-Service to 98.9% (5.2%↑), Cross-Platform (Android) to 92.5% (5.4%↑), CSTNET-TLS 1.3 to 97.4% (10.0%↑). Notably, we provide explanation of the empirically powerful pre-training model by analyzing the randomness of ciphers. It gives us insights in understanding the boundary of classification ability over encrypted traffic. The code is available at: https://github.com/linwhitehat/ET-BERT.
OpenStreetMap (OSM), a collaborative, crowdsourced Web map, is a unique source of openly available worldwide map data, increasingly adopted in Web applications. Vandalism detection is a critical task to support trust and maintain OSM transparency. This task is remarkably challenging due to the large scale of the dataset, the sheer number of contributors, various vandalism forms, and the lack of annotated data. This paper presents Ovid - a novel attention-based method for vandalism detection in OSM. Ovid relies on a novel neural architecture that adopts a multi-head attention mechanism to summarize information indicating vandalism from OSM changesets effectively. To facilitate automated vandalism detection, we introduce a set of original features that capture changeset, user, and edit information. Furthermore, we extract a dataset of real-world vandalism incidents from the OSM edit history for the first time and provide this dataset as open data. Our evaluation conducted on real-world vandalism data demonstrates the effectiveness of Ovid.
Github Copilot, trained on billions of lines of public code, has recently become the buzzword in the computer science research and practice community. Although it is designed to help developers implement safe and effective code with powerful intelligence, practitioners and researchers raise concerns about its ethical and security problems, e.g., should the copyleft licensed code be freely leveraged or insecure code be considered for training in the first place? These problems pose a significant impact on Copilot and other similar products that aim to learn knowledge from large-scale open-source code through deep learning models, which are inevitably on the rise with the fast development of artificial intelligence. To mitigate such impacts, we argue that there is a need to invent effective mechanisms for protecting open-source code from being exploited by deep learning models. Here, we design and implement a prototype, CoProtector, which utilizes data poisoning techniques to arm source code repositories for defending against such exploits. Our large-scale experiments empirically show that CoProtector is effective in achieving its purpose, significantly reducing the performance of Copilot-like deep learning models while being able to stably reveal the secretly embedded watermark backdoors.
In recent years, phishing scams have become the most serious type of crime involved in Ethereum, the second-largest blockchain platform. The existing phishing scams detection technology on Ethereum mostly uses traditional machine learning or network representation learning to mine the key information from the transaction network to identify phishing addresses. However, these methods adopt the last transaction record or even completely ignore these records, and only manual-designed features are taken for the node representation. In this paper, we propose a Temporal Transaction Aggregation Graph Network (TTAGN) to enhance phishing scams detection performance on Ethereum. Specifically, in the temporal edges representation module, we model the temporal relationship of historical transaction records between nodes to construct the edge representation of the Ethereum transaction network. Moreover, the edge representations around the node are aggregated to fuse topological interactive relationships into its representation, also named as trading features, in the edge2node module. We further combine trading features with common statistical and structural features obtained by graph neural networks to identify phishing addresses. Evaluated on real-world Ethereum phishing scams datasets, our TTAGN (92.8% AUC, and 81.6% F1-score) outperforms the state-of-the-art methods, and the effectiveness of temporal edges representation and edge2node module is also demonstrated.
Smart Voice Assistants are transforming the way users interact with technology. This transformation is mostly fostered by the proliferation of voice-driven applications (called skills) offered by third-party developers through an online market. We see how the number of skills has rocked in recent years, with the Amazon Alexa skill ecosystem growing from just 135 skills in early 2016 to about 125k skills in early 2021. Along with the growth in skills, there is increasing concern over the risks that third-party skills pose to users’ privacy. In this paper, we perform a systematic and longitudinal measurement study of the Alexa marketplace. We shed light on how this ecosystem evolves using data collected across three years between 2019 and 2021. We demystify developers’ data disclosure practices and present an overview of the third-party ecosystem. We see how the research community continuously contribute to the market’s sanitation, but the Amazon vetting process still requires significant improvement. We perform a responsible disclosure process reporting 675 skills with privacy issues to both Amazon and all affected developers, out of which 246 skills suffer from important issues (i.e., broken traceability). We see that 107 out of the 246 (43.5%) skills continue to display broken traceability almost one year after being reported. As a result, the overall state of affairs has improved in the ecosystem over the years. Yet, newly submitted skills and unresolved known issues pose an endemic risk.
Email authentication protocols such as SPF, DKIM, and DMARC are used to detect spoofing attacks, but they face key challenges when handling email forwarding scenarios. Recently in 2019, a new Authenticated Received Chain (ARC) protocol was introduced to support mail forwarding applications to preserve the authentication records. After 2 years, it is still not well understood how ARC is implemented, deployed, and configured in practice. In this paper, we perform an empirical analysis on ARC usage and examine how it affects spoofing detection decisions on popular email provides that support ARC. After analyzing an email dataset of 600K messages, we show that ARC is not yet widely adopted, but it starts to attract adoption from major email providers (e.g., Gmail, Outlook). Our controlled experiment shows that most email providers’ ARC implementations are done correctly. However, some email providers (Zoho) have misinterpreted the meaning of ARC results, which can be exploited by spoofing attacks. Finally, we empirically investigate forwarding-based “Hide My Email” services offered by iOS 15 and Firefox, and show their implementations break ARC and can be leveraged by attackers to launch more successful spoofing attacks against otherwise well-configured email receivers (e.g., Gmail).
Human labeling is time-consuming and costly. This problem is further exacerbated in extremely imbalanced class label scenarios, such as detecting fraudsters in online websites. Active learning selects the most relevant example for human labelers to improve the model performance at a lower cost. However, existing methods for active learning for graph data often assumes that both data and label distributions are balanced. These assumptions fail in extreme rare-class classification scenarios, such as classifying abusive reviews in an e-commerce website.
We propose a novel framework ALLIE to address this challenge of active learning in large-scale imbalanced graph data. In our approach, we efficiently sample from both majority and minority classes using a reinforcement learning agent with imbalance-aware reward function. We employ focal loss in the node classification model in order to focus more on rare class and improve the accuracy of the downstream model. Finally, we use a graph coarsening strategy to reduce the search space of the reinforcement learning agent. We conduct extensive experiments on benchmark graph datasets and real-world e-commerce datasets. ALLIE out-performs state-of-the-art graph-based active learning methods significantly, with up to 10% improvement of F1 score for the positive class. We also validate ALLIE on a proprietary e-commerce graph data by tasking it to detect abuse. Our coarsening strategy reduces the computational time by up to 38% in both proprietary and public datasets.
Different techniques have been recommended to detect fraudulent responses in online surveys, but little research has been taken to systematically test the extent to which they actually work in practice. In this paper, we conduct an empirical evaluation of 22 anti-fraud tests in two complementary online surveys. The first survey recruits Rust programmers on public online forums and social media networks. We find that fraudulent respondents involve both bot and human characteristics. Among different anti-fraud tests, those designed based on domain knowledge are the most effective. By combining individual tests, we can achieve a detection performance as good as commercial techniques while making the results more explainable. To explore these tests under a broader context, we ran a different survey on Amazon Mechanical Turk (MTurk). The results show that for a generic survey without requiring users to have any domain knowledge, it is more difficult to distinguish fraudulent responses. However, a subset of tests still remain effective.
Despite active privacy research on sophisticated web tracking techniques (e.g., fingerprinting, cache collusion, bounce tracking, CNAME cloaking), most tracking on the web is basic “stateful” tracking enabled by classical browser storage policies sharing per-site storage across all HTTP contexts. Alternative, privacy-preserving storage policies, especially for third-party contexts, have been proposed and even deployed, but these can break websites that presume traditional, non-partitioned storage. Such breakage discourages privacy-preserving experimentation, cementing the dismal status quo. Our work measures the privacy vs. compatibility trade-offs of representative third-party storage policies to enable design of browsers that are both compatible and privacy respecting. Our contributions include web-scale measurements of page behaviors under multiple third-party storage policies inspired by production browsers. We define metrics for measuring aggregate effects on web privacy and compatibility, including a novel system for quantitatively estimating aggregate website breakage under different policies. We find that making third-party storage partitioned by first-party, and lifetimes by site-session achieves the best privacy and compatibility trade-off. We provide complete measurement datasets and storage policy implementations.
While vast amounts of personal data are shared daily on public online platforms and used by companies and analysts to gain valuable insights, privacy concerns are also on the rise: Modern authorship attribution techniques have proven effective at identifying individuals from their data, such as their writing style or behavior of picking and judging movies. It is hence crucial to develop data sanitization methods that allow sharing of users’ data while protecting their privacy and preserving quality and content of the original data.
In this paper, we tackle anonymization of textual data and propose an end-to-end differentially private variational autoencoder architecture. Unlike previous approaches that achieve differential privacy on a per-word level through individual perturbations, our solution works at an abstract level by perturbing the latent vectors that provide a global summary of the input texts. Decoding an obfuscated latent vector thus not only allows our model to produce coherent, high-quality output text that is human-readable, but also results in strong anonymization due to the diversity of the produced data. We evaluate our approach on IMDb movie and Yelp business reviews, confirming its anonymization capabilities and preservation of the semantics and utility of the original sentences.
Although federated learning improves privacy of training data by exchanging local gradients or parameters rather than raw data, the adversary still can leverage local gradients and parameters to obtain local training data by launching reconstruction and membership inference attacks. To defend against such privacy attacks, many noises perturbed methods (like differential privacy or CountSketch matrix) have been widely designed. However, the strong defence ability and high learning accuracy of these schemes cannot be ensured at the same time, which will impede the wide application of FL in practice (especially for medical or financial institutions that require both high accuracy and strong privacy guarantee). To overcome this issue, we propose an efficient model perturbation method for federated learning to defend against reconstruction and membership inference attacks launched by curious clients. On the one hand, similar to the differential privacy, our method also selects random numbers as perturbed noises added to the global model parameters, and thus it is very efficient and easy to be integrated in practice. Meanwhile, the random selected noises are positive real numbers and the corresponding value can be arbitrarily large, and thus the strong defence ability can be ensured. On the other hand, unlike differential privacy or other perturbation methods that cannot eliminate added noises, our method allows the server to recover the true aggregated gradients by eliminating the added noises. Therefore, our method does not hinder learning accuracy at all. Extensive experiments demonstrate that for both regression and classification tasks, our method achieves the same accuracy as non-private approaches and outperforms the state-of-the-art defence schemes. Besides, the defence ability of our method against reconstruction and membership inference attack is significantly better than the state-of-the-art related defence schemes.
Black-box web scanners have been a prevalent means of performing penetration testing to find reflected cross-site scripting (XSS) vulnerabilities. Unfortunately, off-the-shelf black-box web scanners suffer from unscalable testing as well as false negatives that stem from a testing strategy that employs fixed attack payloads, thus disregarding the exploitation of contexts to trigger vulnerabilities. To this end, we propose a novel method of adapting attack payloads to a target reflected XSS vulnerability using reinforcement learning (RL). We present Link, a general RL framework whose states, actions, and a reward function are designed to find reflected XSS vulnerabilities in a black-box and fully automatic manner. Link finds 45, 213, and 60 vulnerabilities with no false positives in Firing-Range, OWASP, and WAVSEP benchmarks, respectively, outperforming state-of-the-art web scanners in terms of finding vulnerabilities and ending testing campaigns earlier. Link also finds 43 vulnerabilities in 12 real-world applications, demonstrating the promising efficacy of using RL in finding reflected XSS vulnerabilities.
A code property graph (CPG) is a joint representation of syntax, control flows, and data flows of a target application. Recent studies have demonstrated the promising efficacy of leveraging CPGs for the identification of vulnerabilities. It recasts the problem of implementing a specific static analysis for a target vulnerability as a graph query composition problem. It requires devising coarse-grained graph queries that model vulnerable code patterns. Unfortunately, such coarse-grained queries often leave vulnerabilities due to faulty input sanitization undetected. In this paper, we propose, a scalable system designed to identify various web vulnerabilities, including bugs that stem from incorrect sanitization. We designed to find a subgraph in a target CPG that matches a given CPG query having a known vulnerability, which is known as the subgraph isomorphism problem. To address the scalability challenge that stems from the NP-complete nature of this problem, leverages optimization techniques designed to boost the efficiency of matching vulnerable subgraphs. found confirmed vulnerabilities including CVEs among 2,464 potential vulnerabilities in real-world CPGs having a combined total of 1 billion nodes and 1.2 billion edges.
Since the users of open source software (OSS) projects may not use the latest version all the time, OSS development teams often support code maintenance for old versions through maintaining multiple stable branches. Typically, the developers create a stable branch for each old stable version, deploy security patches on the branch, and release fixed versions at regular intervals. As such, old-version applications in production environments are protected from the disclosed vulnerabilities in a long time. However, the rapidly growing number of OSS vulnerabilities has greatly strained this patch deployment model, and a critical need has arisen for the security community to understand the practice of security patch management across stable branches. In this work, we conduct a large-scale empirical study of stable branches in OSS projects and the security patches deployed on them via investigating 608 stable branches belonging to 26 popular OSS projects as well as more than 2,000 security fixes for 806 CVEs deployed on stable branches.
Our study distills several important findings: (i) more than 80% affected CVE-Branch pairs are unpatched; (ii) the unpatched vulnerabilities could pose a serious security risk to applications in use, with 47.39% of them achieving a CVSS score over 7 (High or Critical Severity); and (iii) the patch porting process requires great manual efforts and takes an average of 40.46 days, significantly extending the time window for N-day vulnerability attacks. Our results reveal the worrying state of security patch management across stable branches. We hope our study can shed some light on improving the practice of patch management in OSS projects.
Few-shot Learning (FSL) is aimed to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries has been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by the existing methods suffer from challenging knowledge missing, knowledge noise, and knowledge heterogeneity, which hinder the performance for few-shot learning. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop the ontology transformation based on the external knowledge graph to address the knowledge missing issue, which fulfills and converts structure knowledge to text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge to handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt in three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.
Knowledge graph (KG) alignment is to match entities in different KGs, which is important to knowledge fusion and integration. Temporal KGs (TKGs) extend traditional Knowledge Graphs (KGs) by associating static triples with specific timestamps (e.g., temporal scopes or time points). Moreover, open-world KGs (OKGs) are dynamic with new emerging entities and timestamps. While entity alignment (EA) between KGs has drawn increasing attention from the research community, EA between TKGs and OKGs still remains unexplored. In this work, we propose a novel Temporal Relational Entity Alignment method (TREA) which is able to learn alignment-oriented TKG embeddings and represent new emerging entities. We first map entities, relations and timestamps into an embedding space, and the initial feature of each entity is represented by fusing the embeddings of its connected relations and timestamps as well as its neighboring entities. A graph neural network (GNN) is employed to capture intra-graph information and a temporal relational attention mechanism is utilized to integrate relation and time features of links between nodes. Finally, a margin-based full multi-class log-loss is used for efficient training and a sequential time regularizer is used to model unobserved timestamps. We use three well-established TKG datasets, as references for evaluating temporal and non-temporal EA methods. Experimental results show that our method outperforms the state-of-the-art EA methods.
Graph convolutional networks (GCNs)—which are effective in modeling graph structures—have been increasingly popular in knowledge graph completion (KGC). GCN-based KGC models first use GCNs to generate expressive entity representations and then use knowledge graph embedding (KGE) models to capture the interactions among entities and relations. However, many GCN-based KGC models fail to outperform state-of-the-art KGE models though introducing additional computational complexity. This phenomenon motivates us to explore the real effect of GCNs in KGC. Therefore, in this paper, we build upon representative GCN-based KGC models and introduce variants to find which factor of GCNs is critical in KGC. Surprisingly, we observe from experiments that the graph structure modeling in GCNs does not have a significant impact on the performance of KGC models, which is in contrast to the common belief. Instead, the transformations for entity representations are responsible for the performance improvements. Based on the observation, we propose a simple yet effective framework named LTE-KGE, which equips existing KGE models with linearly transformed entity embeddings. Experiments demonstrate that LTE-KGE models lead to similar performance improvements with GCN-based KGC methods, while being more computationally efficient. These results suggest that existing GCNs are unnecessary for KGC, and novel GCN-based KGC models should count on more ablation studies to validate their effectiveness. The code of all the experiments is available on GitHub at https://github.com/MIRALab-USTC/GCN4KGC.
Developing ontologies for the Semantic Web is a time-consuming and error-prone task that typically requires the investment of considerable manpower and resources, as well as collaborative efforts. A potentially better idea is to reuse the “off-the-shelf” ontologies, whenever possible, somehow as per certain demands and requirements. A promising way to achieve ontology reuse is through creating views of ontologies, analogous to creating views of databases, with the resulting views focusing on specific topics and content of the original ontologies. This paper explores the problem of creating views of ontologies using a uniform interpolation approach. In particular, we develop a novel and practical uniform interpolation method for creating signature-based views for ontologies specified in the description logic , a very expressive description logic for which uniform interpolation has not been fully addressed. The method is terminating and sound, and computes uniform interpolants of -ontologies by eliminating from the input ontologies the names not used in the view using a forgetting procedure. This makes it the first and so far the only approach to eliminate both concept and (non-transitive) role names from -ontologies. Despite the inherent difficulty of uniform interpolation for this level of expressivity, an empirical evaluation with a prototypical implementation show very good success rates on a corpus of real-world ontologies, and demonstrates clear algorithmic advantage over the state-of-the-art system LETHE. This is extremely useful from the semantic web perspective, as it provides knowledge engineers with a powerful tool to create views of ontologies for ontology reuse.
Classifying nodes in knowledge graphs is an important task, e.g., for predicting missing types of entities, predicting which molecules cause cancer, or predicting which drugs are promising treatment candidates. While black-box models often achieve high predictive performance, they are only post-hoc and locally explainable and do not allow the learned model to be easily enriched with domain knowledge. Towards this end, learning description logic concepts from positive and negative examples has been proposed. However, learning such concepts often takes a long time and state-of-the-art approaches provide limited support for literal data values, although they are crucial for many applications. In this paper, we propose EvoLearner—an evolutionary approach to learn concepts in , which is the attributive language with complement () paired with qualified cardinality restrictions () and data properties (). We contribute a novel initialization method for the initial population: starting from positive examples, we perform biased random walks and translate them to description logic concepts. Moreover, we improve support for data properties by maximizing information gain when deciding where to split the data. We show that our approach significantly outperforms the state of the art on the benchmarking framework SML-Bench for structured machine learning. Our ablation study confirms that this is due to our novel initialization method and support for data properties.
Entity alignment (EA), which aims to discover equivalent entities in knowledge graphs (KGs), bridges heterogeneous sources of information and facilitates the integration of knowledge. Recently, based on translational models, EA has achieved impressive performance in utilizing graph structures or by adopting auxiliary information. However, existing entity alignment methods mainly rely on manually labeled entity alignment seeds, limiting their applicability in real scenarios. In this paper, a simple but effective Uncertainty-aware Pseudo Label Refinery (UPLR) framework is proposed without manually labeling requirement and is capable of learning high-quality entity embeddings from pseudo-labeled data sets containing noisy data. Our proposed model relies on two key factors: First, a non-sampling calibration strategy is provided that does not require artificially designed thresholds to reduce the influence of noise labels. Second, the entity alignment model achieves goal-oriented uncertainty correction through a gradual enhancement strategy. Experimental results on benchmark datasets demonstrate that our proposed model outperforms the existing supervised methods in cross-lingual knowledge graph tasks. Our source code is available at: https://github.com/Jia-Li2/UPLR/.
Knowledge graph embedding (KGE) has shown great potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training cost and large storage space, thus limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of Contrastive Learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of the traditional Negative Sampling, we design a new loss function based on query sampling that can balance two important training targets, Alignment and Uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism, which can help the KGE models focus on hard instances and speed up convergence. The experimental results show that in the limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly-used datasets. After training just a few minutes, the HaLE-trained models are competitive compared to the state-of-the-art models in both low- and high-dimensional conditions.
Event correlation reasoning infers whether a natural language paragraph containing multiple events conforms to human common sense. For example, “Andrew was very drowsy, so he took a long nap, and now he is very alert” is sound and reasonable. In contrast, “Andrew was very drowsy, so he stayed up a long time, now he is very alert” does not comply with human common sense. Such reasoning capability is essential for many downstream tasks, such as script reasoning, abductive reasoning, narrative incoherence, story cloze test, etc. However, conducting event correlation reasoning is challenging due to a lack of large amounts of diverse event-based knowledge and difficulty in capturing correlation among multiple events. In this paper, we propose EventBERT, a pre-trained model to encapsulate eventuality knowledge from unlabeled text. Specifically, we collect a large volume of training examples by identifying natural language paragraphs that describe multiple correlated events and further extracting event spans in an unsupervised manner. We then propose three novel event- and correlation-based learning objectives to pre-train an event correlation model on our created training corpus. Experimental results show EventBERT outperforms strong baselines on four downstream tasks, and achieves state-of-the-art results on most of them. Moreover, it outperforms existing pre-trained models by a large margin, e.g., 6.5 ∼ 23%, in zero-shot learning of these tasks.
Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing Web-scale KGs. Over the course of its development, the label supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Commonly, the label information (positive entity pairs) is used to supervise the process of pulling the aligned entities in each positive pair closer. However, our theoretical analysis suggests that the learning of entity alignment can actually benefit more from pushing unlabeled negative pairs far away from each other than pulling labeled positive pairs close. By leveraging this discovery, we develop the self-supervised learning objective for entity alignment. We present SelfKG with efficient strategies to optimize this objective for aligning entities without label supervision. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG suggests that self-supervised learning offers great potential for entity alignment in KGs. The code and data are available at https://github.com/THUDM/SelfKG.
Question Generation (QG), as a challenging Natural Language Processing task, aims at generating questions based on given answers and context. Existing QG methods mainly focus on building or training models for specific QG datasets. These works are subject to two major limitations: (1) They are dedicated to specific QG formats (e.g., answer-extraction or multi-choice QG), therefore, if we want to address a new format of QG, a re-design of the QG model is required. (2) Optimal performance is only achieved on the dataset they were just trained on. As a result, we have to train and keep various QG models for different QG datasets, which is resource-intensive and ungeneralizable.
To solve the problems, we propose a model named Unified-QG based on lifelong learning techniques, which can continually learn QG tasks across different datasets and formats. Specifically, we first build a format-convert encoding to transform different kinds of QG formats into a unified representation. Then, a method named STRIDER (SimilariTy RegularIzed Difficult Example Replay) is built to alleviate catastrophic forgetting in continual QG learning. Extensive experiments were conducted on 8 QG datasets across 4 QG formats (answer-extraction, answer-abstraction, multi-choice, and boolean QG) to demonstrate the effectiveness of our approach. Experimental results demonstrate that our Unified-QG can effectively and continually adapt to QG tasks when datasets and formats vary. In addition, we verify the ability of a single trained Unified-QG model in improving 8 Question Answering (QA) systems’ performance through generating synthetic QA data.
Unknown unknowns represent a major challenge in reliable image recognition. Existing methods mainly focus on unknown unknowns identification, leveraging human intelligence to gather images that are potentially difficult for the machine. To drive a deeper understanding of unknown unknowns and more effective identification and treatment, this paper focuses on unknown unknowns characterization. We introduce a human-in-the-loop, semantic analysis framework for characterizing unknown unknowns at scale. We engage humans in two tasks that specify what a machine should know and describe what it really knows, respectively, both at the conceptual level, supported by information extraction and machine learning interpretability methods. Data partitioning and sampling techniques are employed to scale out human contributions in handling large data. Through extensive experimentation on scene recognition tasks, we show that our approach provides a rich, descriptive characterization of unknown unknowns and allows for more effective and cost-efficient detection than the state of the art.
Machine knowledge about the world’s entities should include quantity properties, such as heights of buildings, running times of athletes, energy efficiency of car models, energy production of power plants, and more. State-of-the-art knowledge bases (KBs), such as Wikidata, cover many relevant entities but often miss the corresponding quantities. Prior work on extracting quantity facts from web contents focused on high precision for top-ranked outputs, but did not tackle the KB coverage issue. This paper presents a recall-oriented approach which aims to close this gap in knowledge-base coverage. Our method is based on iterative learning for extracting quantity facts, with two novel contributions to boost recall for KB augmentation without sacrificing the quality standards of the knowledge base. The first contribution is a query expansion technique to capture a larger pool of fact candidates. The second contribution is a novel technique for harnessing observations on value distributions for self-consistency. Experiments with extractions from more than 13 million web documents demonstrate the benefits of our method.
Many place-related questions can only be answered by complex spatial reasoning, a task poorly supported by factoid question retrieval. Such reasoning using combinations of spatial and non-spatial criteria pertinent to place-related questions is increasingly possible on linked data knowledge bases. Yet, to enable question answering based on linked knowledge bases, natural language questions must first be re-formulated as formal queries. Here, we first present an enhanced version of YAGO2geo, the geospatially-enabled variant of the YAGO2 knowledge base, by linking and adding more than one million places from OpenStreetMap data to YAGO2. We then propose a novel approach to translate the place-related questions into logical representations, theoretically grounded in the core concepts of spatial information. Next, we use a dynamic template-based approach to generate fully executable GeoSPARQL queries from the logical representations. We test our approach using the Geospatial Gold Standard dataset and report substantial improvements over existing methods.
Reasoning on the knowledge graph (KG) aims to infer new facts from existing ones. Methods based on the relational path have shown strong, interpretable, and transferable reasoning ability. However, paths are naturally limited in capturing local evidence in graphs. In this paper, we introduce a novel relational structure, i.e., relational directed graph (r-digraph), which is composed of overlapped relational paths, to capture the KG’s local evidence. Since the r-digraphs are more complex than paths, how to efficiently construct and effectively learn from them are challenging. Directly encoding the r-digraphs cannot scale well and capturing query-dependent information is hard in r-digraphs. We propose a variant of graph neural network, i.e., RED-GNN, to address the above challenges. Specifically, RED-GNN makes use of dynamic programming to recursively encodes multiple r-digraphs with shared edges, and utilizes query-dependent attention mechanism to select the strongly correlated edges. We demonstrate that RED-GNN is not only efficient but also can achieve significant performance gains in both inductive and transductive reasoning tasks over existing methods. Besides, the learned attention weights in RED-GNN can exhibit interpretable evidence for KG reasoning. 1
Taxonomies are fundamental to many real-world applications in various domains, serving as structural representations of knowledge. To deal with the increasing volume of new concepts needed to be organized as taxonomies, researchers turn to automatically completion of an existing taxonomy with new concepts. In this paper, we propose TaxoEnrich, a new taxonomy completion framework, which effectively leverages both semantic features and structural information in the existing taxonomy and offers a better representation of candidate position to boost the performance of taxonomy completion. Specifically, TaxoEnrichconsists of four components: (1) taxonomy-contextualized embedding which incorporates both semantic meanings of concept and taxonomic relations based on powerful pretrained language models; (2) a taxonomy-aware sequential encoder which learns candidate position representations by encoding the structural information of taxonomy; (3) a query-aware sibling encoder which adaptively aggregates candidate siblings to augment candidate position representations based on their importance to the query-position matching; (4) a query-position matching model which extends existing work with our new candidate position representations. Extensive experiments on four large real-world datasets from different domains show that TaxoEnrichachieves the best performance among all evaluation metrics and outperforms previous state-of-the-art methods by a large margin.
Medication recommendation targets to provide a proper set of medicines according to patients’ diagnoses, which is a critical task in clinics. Currently, the recommendation is manually conducted by doctors. However, for complicated cases, like patients with multiple diseases at the same time, it’s difficult to propose a considerate recommendation even for experienced doctors. This urges the emergence of automatic medication recommendation which can help treat the diagnosed diseases without causing harmful drug-drug interactions. Due to the clinical value, medication recommendation has attracted growing research interests. Existing works mainly formulate medication recommendation as a multi-label classification task to predict the set of medicines. In this paper, we propose the Conditional Generation Net (COGNet) which introduces a novel copy-or-predict mechanism to generate the set of medicines. Given a patient, the proposed model first retrieves his or her historical diagnoses and medication recommendations and mines their relationship with current diagnoses. Then in predicting each medicine, the proposed model decides whether to copy a medicine from previous recommendations or to predict a new one. This process is quite similar to the decision process of human doctors. We validate the proposed model on the public MIMIC data set, and the experimental results show that the proposed model can outperform state-of-the-art approaches.
To facilitate human decisions with credible suggestions, personalized recommender systems should have the ability to generate corresponding explanations while making recommendations. Knowledge graphs (KG), which contain comprehensive information about users and products, are widely used to enable this. By reasoning over a KG in a node-by-node manner, existing explainable models provide a KG-grounded path for each user-recommended item. Such paths serve as an explanation and reflect the historical behavior pattern of the user. However, not all items can be reached following the connections within the constructed KG under finite hops. Hence, previous approaches are constrained by a recall bias in terms of existing connectivity of KG structures. To overcome this, we propose a novel Path Language Modeling Recommendation (PLM-Rec) framework, learning a language model over KG paths consisting of entities and edges. Through path sequence decoding, PLM-Rec unifies recommendation and explanation in a single step and fulfills them simultaneously. As a result, PLM-Rec not only captures the user behaviors but also eliminates the restriction to pre-existing KG connections, thereby alleviating the aforementioned recall bias. Moreover, the proposed technique makes it possible to conduct explainable recommendation even when the KG is sparse or possesses a large number of relations. Experiments and extensive ablation studies on three Amazon e-commerce datasets demonstrate the effectiveness and explainability of the PLM-Rec framework.
Knowledge graphs (KGs) have become a valuable asset for many AI applications. Although some KGs contain plenty of facts, they are widely acknowledged as incomplete. To address this issue, many KG completion methods are proposed. Among them, open KG completion methods leverage the Web to find missing facts. However, noisy data collected from diverse sources may damage the completion accuracy. In this paper, we propose a new trustworthy method that exploits facts for a KG based on multi-sourced noisy data and existing facts in the KG. Specifically, we introduce a graph neural network with a holistic scoring function to judge the plausibility of facts with various value types. We design value alignment networks to resolve the heterogeneity between values and map them to entities even outside the KG. Furthermore, we present a truth inference model that incorporates data source qualities into the fact scoring function, and design a semi-supervised learning way to infer the truths from heterogeneous values. We conduct extensive experiments to compare our method with the state-of-the-arts. The results show that our method achieves superior accuracy not only in completing missing facts but also in discovering new facts.
A typical index at the end of a textbook contains a manually-provided vocabulary of terms related to the content of the textbook. In this paper, we extend our previous work on extraction of knowledge models from digital textbooks. We are taking a more critical look at the content of a textbook index and present a mechanism for classifying index terms according to their domain specificity: a core domain concept, an in-domain concept, a concept from a related domain, and a concept from a foreign domain. We link the extracted models to DBpedia and leverage the aggregated linguistic and structural information from textbooks and DBpedia to construct and prune the domain-specific knowledge graphs. The evaluation experiments demonstrate (1) the ability of the approach to identify (with high accuracy) different levels of domain specificity for automatically extracted concepts, (2) its cross-domain robustness, and (3) the added value of the domain specificity information. These results clearly indicate the improved quality of the refined knowledge graphs and widen their potential applicability.
Providing access to information is the main and most important purpose of the Web. However, despite available easy-to-use tools (e.g., search engines, chatbots, question answering) the accessibility is typically limited by the capability of using the English language. This excludes a huge amount of people. In this work, we discuss Knowledge Graph Question Answering (KGQA) systems that aim at providing natural language access to data stored in Knowledge Graphs (KG). While several KGQA systems have been proposed, only very few have dealt with a language other than English. In this work, we follow our research agenda of enabling speakers of any language to access the knowledge stored in KGs. Because of the lack of native support for many languages, we use machine translation (MT) tools to evaluate KGQA systems regarding questions in languages that are unsupported by a KGQA system. In total, our evaluation is based on 8 different languages (including some that never were evaluated before). For the intensive evaluation, we extend the QALD-9 dataset for KGQA with Wikidata queries and high-quality translations. The extension was done in a crowdsourcing manner by native speakers of the different languages. By using multiple KGQA systems for the evaluation, we were enabled to investigate and answer the main research question: “Can MT be an alternative for multilingual KGQA systems?”. The evaluation results demonstrated that the monolingual KGQA systems can be effectively ported to the new languages with MT tools.
Aspect level sentiment classification (ALSC) is a difficult problem with state-of-the-art models showing less than 80% macro-F1 score on benchmark datasets. Existing models do not incorporate information on aspect-aspect relations in knowledge graphs (KGs), e.g. DBpedia. Two main challenges stem from inaccurate disambiguation of aspects to KG entities, and the inability to learn aspect representations from the large KGs in joint training with ALSC models. We propose AR-BERT, a novel two-level global-local entity embedding scheme that allows efficient joint training of KG-based aspect embeddings and ALSC models. A novel incorrect disambiguation detection technique addresses the problem of inaccuracy in aspect disambiguation. We also introduce the problem of determining mode significance in multi-modal explanation generation, and propose a two step solution. The proposed methods show a consistent improvement of 2.5 − 4.1 percentage points, over the recent BERT-based baselines on benchmark datasets.
Semantic parsing is a key NLP task that maps natural language to structured meaning representations. As in many other NLP tasks, SOTA performance in semantic parsing is now attained by fine-tuning a large pretrained language model (PLM). While effective, this approach is inefficient in the presence of multiple downstream tasks, as a new set of values for all parameters of the PLM needs to be stored for each task separately. Recent work has explored methods for adapting PLMs to downstream tasks while keeping most (or all) of their parameters frozen. We examine two such promising techniques, prefix tuning and bias-term tuning, specifically on semantic parsing. We compare them against each other on two different semantic parsing datasets, and we also compare them against full and partial fine-tuning, both in few-shot and conventional data settings. While prefix tuning is shown to do poorly for semantic parsing tasks off the shelf, we modify it by adding special token embeddings, which results in very strong performance without compromising parameter savings.
Taxonomy is a fundamental type of knowledge graph for a wide range of web applications like searching and recommendation systems. To keep a taxonomy automatically updated with the latest concepts, the taxonomy completion task matches a pair of proper hypernym and hyponym in the original taxonomy with the new concept as its parent and child. Previous solutions utilize term embeddings as input and only evaluate the parent-child relations between the new concept and the hypernym-hyponym pair. Such methods ignore the important sibling relations, and are not applicable in reality since term embeddings are not available for the latest concepts. They also suffer from the relational noise of the “pseudo-leaf” node, which is a null node acting as a node’s hyponym to enable the new concept to be a leaf node. To tackle the above drawbacks, we propose the Quadruple Evaluation Network (QEN), a novel taxonomy completion framework that utilizes easily accessible term descriptions as input, and applies pretrained language model and code attention for accurate inference while reducing online computation. QEN evaluates both parent-child and sibling relations to both enhance the accuracy and reduce the noise brought by pseudo-leaf. Extensive experiments on three real-world datasets in different domains with different sizes and term description sources prove the effectiveness and robustness of QEN on overall performance and especially the performance for adding non-leaf nodes, which largely surpasses previous methods and achieves the new state-of-the-art of the task.1
Structural data well exists in Web applications, such as social networks in social media, citation networks in academic websites, and threads data in online forums. Due to the complex topology, it is difficult to process and make use of the rich information within such data. Graph Neural Networks (GNNs) have shown great advantages on learning representations for structural data. However, the non-transparency of the deep learning models makes it non-trivial to explain and interpret the predictions made by GNNs. Meanwhile, it is also a big challenge to evaluate the GNN explanations, since in many cases, the ground-truth explanations are unavailable.
In this paper, we take insights of Counterfactual and Factual (CF2) reasoning from causal inference theory, to solve both the learning and evaluation problems in explainable GNNs. For generating explanations, we propose a model-agnostic framework by formulating an optimization problem based on both of the two casual perspectives. This distinguishes CF2 from previous explainable GNNs that only consider one of them. Another contribution of the work is the evaluation of GNN explanations. For quantitatively evaluating the generated explanations without the requirement of ground-truth, we design metrics based on Counterfactual and Factual reasoning to evaluate the necessity and sufficiency of the explanations. Experiments show that no matter ground-truth explanations are available or not, CF2 generates better explanations than previous state-of-the-art methods on real-world datasets. Moreover, the statistic analysis justifies the correlation between the performance on ground-truth evaluation and our proposed metrics.
Edges in real-world graphs are typically formed by a variety of factors and carry diverse relation semantics. For example, connections in a social network could indicate friendship, being colleagues, or living in the same neighborhood. However, these latent factors are usually concealed behind mere edge existence due to the data collection and graph formation processes. Despite rapid developments in graph learning over these years, most models take a holistic approach and treat all edges as equal. One major difficulty in disentangling edges is the lack of explicit supervisions. In this work, with close examination of edge patterns, we propose three heuristics and design three corresponding pretext tasks to guide the automatic edge disentanglement. Concretely, these self-supervision tasks are enforced on a designed edge disentanglement module to be trained jointly with the downstream node classification task to encourage automatic edge disentanglement. Channels of the disentanglement module are expected to capture distinguishable relations and neighborhood interactions, and outputs from them are aggregated as node representations. The proposed is easy to be incorporated with various neural architectures, and we conduct experiments on 6 real-world datasets. Empirical results show that it can achieve significant performance gains.
The Unified Medical Language System (UMLS) Metathesaurus construction process mainly relies on lexical algorithms and manual expert curation for integrating over 200 biomedical vocabularies. A lexical-based learning model (LexLM) was developed to predict synonymy among Metathesaurus terms and largely outperforms a rule-based approach (RBA) that approximates the current construction process. However, the LexLM has the potential for being improved further because it only uses lexical information from the source vocabularies, while the RBA also takes advantage of contextual information. We investigate the role of multiple types of contextual information available to the UMLS editors, namely source synonymy (SS), source semantic group (SG), and source hierarchical relations (HR), for the UMLS vocabulary alignment (UVA) problem.
In this paper, we develop multiple variants of context-enriched learning models (ConLMs) by adding to the LexLM the types of contextual information listed above. We represent these context types in context-enriched knowledge graphs (ConKGs) with four variants ConSS, ConSG, ConHR, and ConAll. We train these ConKG embeddings using seven KG embedding techniques. We create the ConLMs by concatenating the ConKG embedding vectors with the word embedding vectors from the LexLM. We evaluate the performance of the ConLMs using the UVA generalization test datasets with hundreds of millions of pairs.
Our extensive experiments show a significant performance improvement from the ConLMs over the LexLM, namely +5.0% in precision (93.75%), +0.69% in recall (93.23%), +2.88% in F1 (93.49%) for the best ConLM. Our experiments also show that the ConAll variant including the three context types takes more time, but does not always perform better than other variants with a single context type. Finally, our experiments show that the pairs of terms with high lexical similarity benefit most from adding contextual information, namely +6.56% in precision (94.97%), +2.13% in recall (93.23%), +4.35% in F1 (94.09%) for the best ConLM. The pairs with lower degrees of lexical similarity also show performance improvement with +0.85% in F1 (96%) for low similarity and +1.31% in F1 (96.34%) for no similarity. These results demonstrate the importance of using contextual information in the UVA problem.
Linked Data Fragments (LDFs) are Web interfaces that enable querying knowledge graphs on the Web. These interfaces, such as SPARQL endpoints or Triple Pattern Fragment servers, differ in the SPARQL expressions they can evaluate and the metadata they provide. So far, federated query processing has focused on federations with a single type of LDF interface, typically SPARQL endpoints. In this work, we address the challenges of SPARQL query processing over federations with heterogeneous LDF interfaces. To this end, we propose an interface-aware framework and illustrate its applicability with a prototypical approach. The results over the FedBench benchmark show a substantial improvement in performance by devising this interface-aware approach that exploits the capabilities of heterogeneous interfaces in federations.
Localizing the source of graph diffusion phenomena, such as misinformation propagation, is an important yet extremely challenging task in the real world. Existing source localization models typically are heavily dependent on the hand-crafted rules and only tailored for certain domain-specific applications. Unfortunately, a large portion of the graph diffusion process for many applications is still unknown to human beings so it is important to have expressive models for learning such underlying rules automatically. Recently, there is a surge of research body on expressive models such as Graph Neural Networks (GNNs) for automatically learning the underlying graph diffusion. However, source localization is instead the inverse of graph diffusion, which is a typical inverse problem in graphs that is well-known to be ill-posed because there can be multiple solutions and hence different from the traditional (semi-)supervised learning settings. This paper aims to establish a generic framework of invertible graph diffusion models for source localization on graphs, namely Invertible Validity-aware Graph Diffusion (IVGD), to handle major challenges including 1) Difficulty to leverage knowledge in graph diffusion models for modeling their inverse processes in an end-to-end fashion, 2) Difficulty to ensure the validity of the inferred sources, and 3) Efficiency and scalability in source inference. Specifically, first, to inversely infer sources of graph diffusion, we propose a graph residual scenario to make existing graph diffusion models invertible with theoretical guarantees; second, we develop a novel error compensation mechanism that learns to offset the errors of the inferred sources. Finally, to ensure the validity of the inferred sources, a new set of validity-aware layers have been devised to project inferred sources to feasible regions by flexibly encoding constraints with unrolled optimization techniques. A linearization technique is proposed to strengthen the efficiency of our proposed layers. The convergence of the proposed IVGD is proven theoretically. Extensive experiments on nine real-world datasets demonstrate that our proposed IVGD outperforms state-of-the-art comparison methods significantly. We have released our code at https://github.com/xianggebenben/IVGD.
Graph contrastive learning (GCL) has emerged as a dominant technique for graph representation learning which maximizes the mutual information between paired graph augmentations that share the same semantics. Unfortunately, it is difficult to preserve semantics well during augmentations in view of the diverse nature of graph data. Currently, data augmentations in GCL broadly fall into three unsatisfactory ways. First, the augmentations can be manually picked per dataset by trial-and-errors. Second, the augmentations can be selected via cumbersome search. Third, the augmentations can be obtained with expensive domain knowledge as guidance. All of these limit the efficiency and more general applicability of existing GCL methods. To circumvent these crucial issues, we propose a Simple framework for GRAph Contrastive lEarning, SimGRACE for brevity, which does not require data augmentations. Specifically, we take original graph as input and GNN model with its perturbed version as two encoders to obtain two correlated views for contrast. SimGRACE is inspired by the observation that graph data can preserve their semantics well during encoder perturbations while not requiring manual trial-and-errors, cumbersome search or expensive domain knowledge for augmentations selection. Also, we explain why SimGRACE can succeed. Furthermore, we devise adversarial training scheme, dubbed AT-SimGRACE, to enhance the robustness of graph contrastive learning and theoretically explain the reasons. Albeit simple, we show that SimGRACE can yield competitive or better performance compared with state-of-the-art methods in terms of generalizability, transferability and robustness, while enjoying unprecedented degree of flexibility and efficiency. The code is available at: https://github.com/junxia97/SimGRACE.
Graphs are widely used for representing pairwise interactions in complex systems. Since such real-world graphs are large and often evergrowing, sampling a small representative subgraph is indispensable for various purposes: simulation, visualization, stream processing, representation learning, crawling, to name a few. However, many complex systems consist of group interactions (e.g., collaborations of researchers and discussions on online Q&A platforms), and thus they can be represented more naturally and accurately by hypergraphs (i.e., sets of sets) than by ordinary graphs.
Motivated by the prevalence of large-scale hypergraphs, we study the problem of representative sampling from real-world hypergraphs, aiming to answer (Q1) what a representative sub-hypergraph is and (Q2) how we can find a representative one rapidly without an extensive search. Regarding Q1, we propose to measure the goodness of a sub-hypergraph by comparing it with the entire hypergraph in terms of ten graph-level, hyperedge-level, and node-level statistics. Regarding Q2, we first analyze the characteristics of six intuitive approaches in 11 real-world hypergraphs. Then, based on the analysis, we propose MiDaS, which draws hyperedges with a bias towards those with high-degree nodes. Through extensive experiments, we demonstrate that MiDaS is (a) Representative: finding overall the most representative samples among 13 considered approaches, (b) Fast: several orders of magnitude faster than the strongest competitors, which performs an extensive search, and (c) Automatic: rapidly searching a proper degree of bias.
Computing a dense subgraph is a fundamental problem in graph mining, with a diverse set of applications ranging from electronic commerce to community detection in social networks. In many of these applications, the underlying context is better modelled as a weighted hypergraph that keeps evolving with time.
This motivates the problem of maintaining the densest subhypergraph of a weighted hypergraph in a dynamic setting, where the input keeps changing via a sequence of updates (hyperedge insertions/deletions). Previously, the only known algorithm for this problem was due to Hu et al. . This algorithm worked only on unweighted hypergraphs, and had an approximation ratio of (1 +ϵ)r2 and an update time of O(poly(r, log n)), where r denotes the maximum rank of the input across all the updates.
We obtain a new algorithm for this problem, which works even when the input hypergraph is weighted. Our algorithm has a significantly improved (near-optimal) approximation ratio of (1 +ϵ) that is independent of r, and a similar update time of O(poly(r, log n)). It is the first (1 +ϵ)-approximation algorithm even for the special case of weighted simple graphs.
To complement our theoretical analysis, we perform experiments with our dynamic algorithm on large-scale, real-world data-sets. Our algorithm significantly outperforms the state of the art  both in terms of accuracy and efficiency.
Recent years have witnessed increasing attention on the application of graph alignment to on-Web tasks, such as knowledge graph integration and social network linking. Despite achieving remarkable performance, prevailing graph alignment models still suffer from noisy supervision, yet how to mitigate the impact of noise in labeled data is still under-explored. The negative sampling based noise discrimination model has been a feasible solution to detect the noisy data and filter them out. However, due to its sensitivity to the sampling distribution, the negative sampling based noise discrimination model would lead to an inaccurate decision boundary. Furthermore, it is difficult to find an abiding threshold to separate the potential positive (benign) and negative (noisy) data in the whole training process. To address these important issues, in this paper, we design a non-sampling discrimination model resorting to the unbiased risk estimation of positive-unlabeled learning to circumvent the harmful impact of negative sampling. We also propose to select the appropriate potential positive data at different training stages by an adaptive filtration threshold enabled by curriculum learning, for maximally improving the performance of alignment model and non-sampling discrimination model. Extensive experiments conducted on several real-world datasets validate the effectiveness of our proposed method.
Given entities and their interactions in the web data, which may have occurred at different time, how can we find communities of entities and track their evolution? In this paper, we approach this important task from graph clustering perspective. Recently, state-of-the-art clustering performance in various domains has been achieved by deep clustering methods. Especially, deep graph clustering (DGC) methods have successfully extended deep clustering to graph-structured data by learning node representations and cluster assignments in a joint optimization framework. Despite some differences in modeling choices (e.g., encoder architectures), existing DGC methods are mainly based on autoencoders and use the same clustering objective with relatively minor adaptations. Also, while many real-world graphs are dynamic, previous DGC methods considered only static graphs. In this work, we develop CGC, a novel end-to-end framework for graph clustering, which fundamentally differs from existing methods. CGC learns node embeddings and cluster assignments in a contrastive graph learning framework, where positive and negative samples are carefully selected in a multi-level scheme such that they reflect hierarchical community structures and network homophily. Also, we extend CGC for time-evolving data, where temporal graph clustering is performed in an incremental learning fashion, with the ability to detect change points. Extensive evaluation on real-world graphs demonstrates that the proposed CGC consistently outperforms existing methods.
Although existing Graph Neural Networks (GNNs) based on message passing achieve state-of-the-art, the over-smoothing issue, node similarity distortion issue and dissatisfactory link prediction performance can’t be ignored. This paper summarizes these issues as the interference between topology and attribute for the first time. By leveraging the recently proposed optimization perspective of GNNs, this interference is analyzed and ascribed to that the learned representation in GNNs essentially compromises between the topology and node attribute. To alleviate the interference, this paper attempts to break this compromise by proposing a novel objective function, which fits node attribute and topology with different representations and introduces mutual exclusion constraints to reduce the redundancy in both representations. The mutual exclusion employs the statistical dependence, which regards the representations from topology and attribute as the observations of two random variables, and is implemented with Hilbert-Schmidt Independence Criterion. Derived from the novel objective function, a novel GNN, i.e., Graph Neural Network Beyond Compromise (GNN-BC), is proposed to iteratively updates the representations of topology and attribute by simultaneously capturing semantic information and removing the common information, and the final representation is the concatenation of them. The performance improvements on node classification and link prediction demonstrate the superiority of GNN-BC on relieving the interference.
The past decades have witnessed the prosperity of graph mining, with a multitude of sophisticated models and algorithms designed for various mining tasks, such as ranking, classification, clustering and anomaly detection. Generally speaking, the vast majority of the existing works aim to answer the following question, that is, given a graph, what is the best way to mine it?
In this paper, we introduce the graph sanitation problem, to answer an orthogonal question. That is, given a mining task and an initial graph, what is the best way to improve the initially provided graph? By learning a better graph as part of the input of the mining model, it is expected to benefit graph mining in a variety of settings, ranging from denoising, imputation to defense. We formulate the graph sanitation problem as a bilevel optimization problem, and further instantiate it by semi-supervised node classification, together with an effective solver named GaSoliNe. Extensive experimental results demonstrate that the proposed method is (1) broadly applicable with respect to various graph neural network models and flexible graph modification strategies, (2) effective in improving the node classification accuracy on both the original and contaminated graphs in various perturbation scenarios. In particular, it brings up to 25% performance improvement over the existing robust graph neural network methods.
Fake News detection has attracted much attention in recent years. Social context based detection methods attempt to model the spreading patterns of fake news by utilizing the collective wisdom from users on social media. This task is challenging for three reasons: (1) There are multiple types of entities and relations in social context, requiring methods to effectively model the heterogeneity. (2) The emergence of news in novel topics in social media causes distribution shifts, which can significantly degrade the performance of fake news detectors. (3) Existing fake news datasets usually lack of great scale, topic diversity and user social relations, impeding the development of this field. To solve these problems, we formulate social context based fake news detection as a heterogeneous graph classification problem, and propose a fake news detection model named Post-User Interaction Network (PSIN), which adopts a divide-and-conquer strategy to model the post-post, user-user and post-user interactions in social context effectively while maintaining their intrinsic characteristics. Moreover,we adopt an adversarial topic discriminator for topic-agnostic feature learning, in order to improve the generalizability of our method for new-emerging topics. Furthermore, we curate a new dataset for fake news detection, which contains over 27,155 news from 5 topics, 5 million posts, 2 million users and their induced social graph with 0.2 billion edges. It has been published on https://github.com/qwerfdsaplking/MC-Fake. Extensive experiments illustrate that our method outperforms SOTA baselines in both in-topic and out-of-topic settings.
Temporal graph representation learning has drawn significant attention for the prevalence of temporal graphs in the real world. However, most existing works resort to taking discrete snapshots of the temporal graph, or are not inductive to deal with new nodes, or do not model the exciting effects which is the ability of events to influence the occurrence of another event. In this work, We propose TREND, a novel framework for temporal graph representation learning, driven by TempoRal Event and Node Dynamics and built upon a Hawkes process-based graph neural network (GNN). TREND presents a few major advantages: (1) it is inductive due to its GNN architecture; (2) it captures the exciting effects between events by the adoption of the Hawkes process; (3) as our main novelty, it captures the individual and collective characteristics of events by integrating both event and node dynamics, driving a more precise modeling of the temporal process. Extensive experiments on four real-world datasets demonstrate the effectiveness of our proposed model.
Graph Convolutional Networks (GCNs) are popular for learning representation of graph data and have a wide range of applications in social networks, recommendation systems, etc. However, training GCN models for large networks is resource intensive and time consuming, which hinders them from real deployment. The existing GCN training methods intended to optimize the sampling of mini-batches for stochastic gradient descent to accelerate training process, which did not reduce the problem size and had limited reduction in computation complexity. In this paper, we argue that a GCN can be trained with a sampled subgraph to produce approximate node representations, which inspires us a novel perspective to accelerate GCN training via network sampling. To this end, we propose a label-centric cumulative sampling (LCS) framework for training GCNs for large graphs. The proposed method constructs a subgraph cumulatively based on probabilistic sampling, and trains the GCN model iteratively to generate approximate node representations. The optimality of LCS is theoretically guaranteed to minimize the bias during node aggregation procedure in GCN training. Extensive experiments based on four real-world network datasets show that the LCS framework accelerates the training for the state-of-the-art GCN models up to 17x without causing noteworthy model accuracy drop.
Cross-Domain Recommendation (CDR) has been popularly studied to utilize different domain knowledge to solve the data sparsity and cold-start problem in recommender systems. In this paper, we focus on the Review-based Non-overlapped Recommendation (RNCDR) problem. The problem is commonly-existed and challenging due to two main aspects, i.e, there are only positive user-item ratings on the target domain and there is no overlapped user across different domains. Most previous CDR approaches cannot solve the RNCDR problem well, since (1) they cannot effectively combine review with other information (e.g., ID or ratings) to obtain expressive user or item embedding, (2) they cannot reduce the domain discrepancy on users and items. To fill this gap, we propose Collaborative Filtering with Attribution Alignment model (CFAA), a cross-domain recommendation framework for the RNCDR problem. CFAA includes two main modules, i.e., rating prediction module and embedding attribution alignment module. The former aims to jointly mine review, one-hot ID, and multi-hot historical ratings to generate expressive user and item embeddings. The later includes vertical attribution alignment and horizontal attribution alignment, tending to reduce the discrepancy based on multiple perspectives. Our empirical study on Douban and Amazon datasets demonstrates that CFAA significantly outperforms the state-of-the-art models under the RNCDR setting.
K-clique counting is a fundamental problem in network analysis which has attracted much attention in recent years. Computing the count of k-cliques in a graph for a large k (e.g., k = 8) is often intractable as the number of k-cliques increases exponentially w.r.t. (with respect to) k. Existing exact k-clique counting algorithms are often hard to handle large dense graphs, while sampling-based solutions either require a huge number of samples or consume very high storage space to achieve a satisfactory accuracy. To overcome these limitations, we propose a new framework to estimate the number of k-cliques which integrates both the exact k-clique counting technique and two novel color-based sampling techniques. The key insight of our framework is that we only apply the exact algorithm to compute the k-clique counts in the sparse regions of a graph, and use the proposed sampling-based techniques to estimate the number of k-cliques in the dense regions of the graph. Specifically, we develop two novel dynamic programming based k-color set sampling techniques to efficiently estimate the k-clique counts, where a k-color set contains k nodes with k different colors. Since a k-color set is often a good approximation of a k-clique in the dense regions of a graph, our sampling-based solutions are extremely efficient and accurate. Moreover, the proposed sampling techniques are space efficient which use near-linear space w.r.t. graph size. We conduct extensive experiments to evaluate our algorithms using 8 real-life graphs. The results show that our best algorithm is at least one order of magnitude faster than the state-of-the-art sampling-based solutions (with the same relative error 0.1%) and can be up to three orders of magnitude faster than the state-of-the-art exact algorithm on large graphs.
Graph representation learning is crucial for many real-world applications (e.g. social relation analysis). A fundamental problem for graph representation learning is how to effectively learn representations without human labeling, which is usually costly and time-consuming. Graph contrastive learning (GCL) addresses this problem by pulling the positive node pairs (or similar nodes) closer while pushing the negative node pairs (or dissimilar nodes) apart in the representation space. Despite the success of the existing GCL methods, they primarily sample node pairs based on the node-level proximity yet the community structures have rarely been taken into consideration. As a result, two nodes from the same community might be sampled as a negative pair. We argue that the community information should be considered to identify node pairs in the same communities, where the nodes insides are semantically similar. To address this issue, we propose a novel Graph Communal Contrastive Learning (gCooL) framework to jointly learn the community partition and learn node representations in an end-to-end fashion. Specifically, the proposed gCooL consists of two components: a Dense Community Aggregation (DeCA) algorithm for community detection and a Reweighted Self-supervised Cross-contrastive (ReSC) training scheme to utilize the community information. Additionally, the real-world graphs are complex and often consist of multiple views. In this paper, we demonstrate that the proposed gCooL can also be naturally adapted to multiplex graphs. Finally, we comprehensively evaluate the proposed gCooL on a variety of real-world graphs. The experimental results show that the gCooL outperforms the state-of-the-art methods.
Graph Convolutional Network (GCN) plays pivotal roles in many real-world applications. Despite the successes of GCN deployment, GCN often exhibits performance disparity with respect to node degrees, resulting in worse predictive accuracy for low-degree nodes. We formulate the problem of mitigating the degree-related performance disparity in GCN from the perspective of the Rawlsian difference principle, which is originated from the theory of distributive justice. Mathematically, we aim to balance the utility between low-degree nodes and high-degree nodes while minimizing the task-specific loss. Specifically, we reveal the root cause of this degree-related unfairness by analyzing the gradients of weight matrices in GCN. Guided by the gradients of weight matrices, we further propose a pre-processing method RawlsGCN-Graph and an in-processing method RawlsGCN-Grad that achieves fair predictive accuracy in low-degree nodes without modification on the GCN architecture or introduction of additional parameters. Extensive experiments on real-world graphs demonstrate the effectiveness of our proposed RawlsGCN methods in significantly reducing degree-related bias while retaining comparable overall performance.
Learning discriminative node representations benefits various downstream tasks in graph analysis such as community detection and node classification. Existing graph representation learning methods (e.g., based on random walk and contrastive learning) are limited to maximizing the local similarity of connected nodes. Such pair-wise learning schemes could fail to capture the global distribution of representations, since it has no explicit constraints on the global geometric properties of representation space. To this end, we propose Geometric Graph Representation Learning (G2R) to learn node representations in an unsupervised manner via maximizing rate reduction. In this way, G2R maps nodes in distinct groups (implicitly stored in the adjacency matrix) into different subspaces, while each subspace is compact and different subspaces are dispersedly distributed. G2R adopts a graph neural network as the encoder and maximizes the rate reduction with the adjacency matrix. Furthermore, we theoretically and empirically demonstrate that rate reduction maximization is equivalent to maximizing the principal angles between different subspaces. Experiments on real-world datasets show that G2R outperforms various baselines on node classification and community detection tasks.
Unsupervised graph representation learning has emerged as a powerful tool to address real-world problems and achieves huge success in the graph learning domain. Graph contrastive learning is one of the unsupervised graph representation learning methods, which recently attracts attention from researchers and has achieved state-of-the-art performances on various tasks. The key to the success of graph contrastive learning is to construct proper contrasting pairs to acquire the underlying structural semantics of the graph. However, this key part is not fully explored currently, most of the ways generating contrasting pairs focus on augmenting or perturbating graph structures to obtain different views of the input graph. But such strategies could degrade the performances via adding noise into the graph, which may narrow down the field of the applications of graph contrastive learning. In this paper, we propose a novel graph contrastive learning method, namely Dual Space Graph Contrastive (DSGC) Learning, to conduct graph contrastive learning among views generated in different spaces including the hyperbolic space and the Euclidean space. Since both spaces have their own advantages to represent graph data in the embedding spaces, we hope to utilize graph contrastive learning to bridge the spaces and leverage advantages from both sides. The comparison experiment results show that DSGC achieves competitive or better performances among all the datasets. In addition, we conduct extensive experiments to analyze the impact of different graph encoders on DSGC, giving insights about how to better leverage the advantages of contrastive learning between different spaces.
Graph Convolutional Networks (GCNs) have recently attracted vast interest and achieved state-of-the-art performance on graphs, but its success could typically hinge on careful training with amounts of expensive and time-consuming labeled data. To alleviate labeled data scarcity, self-training methods have been widely adopted on graphs by labeling high-confidence unlabeled nodes and then adding them to the training step. In this line, we empirically make a thorough study for current self-training methods on graphs. Surprisingly, we find that high-confidence unlabeled nodes are not always useful, and even introduce the distribution shift issue between the original labeled dataset and the augmented dataset by self-training, severely hindering the capability of self-training on graphs. To this end, in this paper, we propose a novel Distribution Recovered Graph Self-Training framework (DR-GST), which could recover the distribution of the original labeled dataset. Specifically, we first prove the equality of loss function in self-training framework under the distribution shift case and the population distribution if each pseudo-labeled node is weighted by a proper coefficient. Considering the intractability of the coefficient, we then propose to replace the coefficient with the information gain after observing the same changing trend between them, where information gain is respectively estimated via both dropout variational inference and dropedge variational inference in DR-GST. However, such a weighted loss function will enlarge the impact of incorrect pseudo labels. As a result, we apply the loss correction method to improve the quality of pseudo labels. Both our theoretical analysis and extensive experiments on five benchmark datasets demonstrate the effectiveness of the proposed DR-GST, as well as each well-designed component in DR-GST.
Graph Neural Networks (GNNs) have shown superior performance in analyzing attributed networks in various web-based applications such as social recommendation and web search. Nevertheless, in high-stake decision-making scenarios such as online fraud detection, there is an increasing societal concern that GNNs could make discriminatory decisions towards certain demographic groups. Despite recent explorations on fair GNNs, these works are tailored for a specific GNN model. However, myriads of GNN variants have been proposed for different applications, and it is costly to fine-tune existing debiasing algorithms for each specific GNN architecture. Different from existing works that debias GNN models, we aim to debias the input attributed network to achieve fairer GNNs through feeding GNNs with less biased data. Specifically, we propose novel definitions and metrics to measure the bias in an attributed network, which leads to the optimization objective to mitigate bias. We then develop a framework EDITS to mitigate the bias in attributed networks while maintaining the performance of GNNs in downstream tasks. EDITS works in a model-agnostic manner, i.e., it is independent of any specific GNN. Experiments demonstrate the validity of the proposed bias metrics and the superiority of EDITS on both bias mitigation and utility maintenance. Open-source implementation: https://github.com/yushundong/EDITS.
Graph Neural Networks (GNNs) show strong expressive power on graph data mining, by aggregating information from neighbors and using the integrated representation in the downstream tasks. The same aggregation methods and parameters for each node in a graph are used to enable the GNNs to utilize the homophily relational data. However, not all graphs are homophilic, even in the same graph, the distributions may vary significantly. Using the same convolution over all nodes may lead to the ignorance of various graph patterns. Furthermore, many existing GNNs integrate node features and structure identically, which ignores the distributions of nodes and further limits the expressive power of GNNs. To solve these problems, we propose Meta Weight Graph Neural Network (MWGNN) to adaptively construct graph convolution layers for different nodes. First, we model the Node Local Distribution (NLD) from node feature, topological structure and positional identity aspects with the Meta-Weight. Then, based on the Meta-Weight, we generate the adaptive graph convolutions to perform a node-specific weighted aggregation and boost the node representations. Finally, we design extensive experiments on real-world and synthetic benchmarks to evaluate the effectiveness of MWGNN. These experiments show the excellent expressive power of MWGNN in dealing with graph data with various distributions.
Given a graph dataset, how can we augment it for accurate graph classification? Graph augmentation is an essential strategy to improve the performance of graph-based tasks, and has been widely utilized for analyzing web and social graphs. However, previous works for graph augmentation either a) involve the target model in the process of augmentation, losing the generalizability to other tasks, or b) rely on simple heuristics that lead to unreliable results. In this work, we introduce five desired properties for effective augmentation. Then, we propose NodeSam (Node Split and Merge) and SubMix (Subgraph Mix), two model-agnostic algorithms for graph augmentation that satisfy all desired properties with different motivations. NodeSam makes a balanced change of the graph structure to minimize the risk of semantic change, while SubMix mixes random subgraphs of multiple graphs to create rich soft labels combining the evidence for different classes. Our experiments on social networks and molecular graphs show that NodeSam and SubMix outperform existing approaches in graph classification.
Continual graph learning is rapidly emerging as an important role in a variety of real-world applications such as online product recommendation systems and social media. While achieving great success, existing works on continual graph learning ignore the information from multiple modalities (e.g., visual and textual features) as well as the rich dynamic structural information hidden in the ever-changing graph data and evolving tasks. However, considering multimodal continual graph learning with evolving topological structures poses great challenges: i) it is unclear how to incorporate the multimodal information into continual graph learning and ii) it is nontrivial to design models that can capture the structure-evolving dynamics in continual graph learning. To tackle these challenges, in this paper we propose a novel Multimodal Structure-evolving Continual Graph Learning (MSCGL) model, which continually learns both the model architecture and the corresponding parameters for Adaptive Multimodal Graph Neural Network (AdaMGNN). To be concrete, our proposed MSCGL model simultaneously takes social information and multimodal information into account to build the multimodal graphs. In order for continually adapting to new tasks without forgetting the old ones, our MSCGL model explores a new strategy with joint optimization of Neural Architecture Search (NAS) and Group Sparse Regularization (GSR) across different tasks. These two parts interact with each other reciprocally, where NAS is expected to explore more promising architectures and GSR is in charge of preserving important information from the previous tasks. We conduct extensive experiments over two real-world multimodal continual graph scenarios to demonstrate the superiority of the proposed MSCGL model. Empirical experiments indicate that both the architectures and weight sharing across different tasks play important roles in affecting the model performances.
User-User interaction recommendation, or interaction recommendation, is an indispensable service in social platforms, where the system automatically predicts with whom a user wants to interact. In real-world social platforms, we observe that user interactions may occur in diverse scenarios, and new scenarios constantly emerge, such as new games or sales promotions. There are two challenges in these emerging scenarios: (1) The behavior of users on the emerging scenarios could be different from existing ones due to the diversity among scenarios; (2) Emerging scenarios may only have scarce user behavioral data for model learning. Towards these two challenges, we present KoMen, a Domain Knowledge Guided Meta-learning framework for Interaction Recommendation. KoMen first learns a set of global model parameters shared among all scenarios and then quickly adapts the parameters for an emerging scenario based on its similarities with the existing ones. There are two highlights of KoMen: (1) KoMen customizes global model parameters by incorporating domain knowledge of the scenarios (e.g., a taxonomy that organizes scenarios by their purposes and functions), which captures scenario inter-dependencies with very limited training. (2) KoMen learns the scenario-specific parameters through a mixture-of-expert architecture, which reduces model variance resulting from data scarcity while still achieving the expressiveness to handle diverse scenarios. Extensive experiments demonstrate that KoMen achieves state-of-the-art performance on a public benchmark dataset and a large-scale real industry dataset. Remarkably, KoMen improves over the best baseline w.r.t. weighted ROC-AUC by 2.14% and 2.03% on the two datasets, respectively. Our code is available at: https://github.com/Veronicium/koMen.
Though Graph Neural Networks (GNNs) have been successful for fraud detection tasks, they suffer from imbalanced labels due to limited fraud compared to the overall userbase. This paper attempts to resolve this label-imbalance problem for GNNs by maximizing the AUC (Area Under ROC Curve) metric since it is unbiased with label distribution. However, maximizing AUC on GNN for fraud detection tasks is intractable due to the potential polluted topological structure caused by intentional noisy edges generated by fraudsters. To alleviate this problem, we propose to decouple the AUC maximization process on GNN into a classifier parameter searching and an edge pruning policy searching, respectively. We propose a model named AO-GNN (Short for AUC-oriented GNN), to achieve AUC maximization on GNN under the aforementioned framework. In the proposed model, an AUC-oriented stochastic gradient is applied for classifier parameter searching, and an AUC-oriented reinforcement learning module supervised by a surrogate reward of AUC is devised for edge pruning policy searching. Experiments on three real-world datasets demonstrate that the proposed AO-GNN patently outperforms state-of-the-art baselines in not only AUC but also other general metrics, e.g. F1-macro, G-means.
Graph contrastive learning is the state-of-the-art unsupervised graph representation learning framework and has shown comparable performance with supervised approaches. However, evaluating whether the graph contrastive learning is robust to adversarial attacks is still an open problem because most existing graph adversarial attacks are supervised models, which means they heavily rely on labels and can only be used to evaluate the graph contrastive learning in a specific scenario. For unsupervised graph representation methods such as graph contrastive learning, it is difficult to acquire labels in real-world scenarios, making traditional supervised graph attack methods difficult to be applied to test their robustness. In this paper, we propose a novel unsupervised gradient-based adversarial attack that does not rely on labels for graph contrastive learning. We compute the gradients of the adjacency matrices of the two views and flip the edges with gradient ascent to maximize the contrastive loss. In this way, we can fully use multiple views generated by the graph contrastive learning models and pick the most informative edges without knowing their labels, and therefore can promisingly support our model adapted to more kinds of downstream tasks. Extensive experiments show that our attack outperforms unsupervised baseline attacks and has comparable performance with supervised attacks in multiple downstream tasks including node classification and link prediction. We further show that our attack can be transferred to other graph representation models as well.
Graph Neural Networks (GNNs) have achieved remarkable success by extending traditional convolution to learning on non-Euclidean data. The key to the GNNs is adopting the neural message-passing paradigm with two stages: aggregation and update. The current design of GNNs considers the topology information in the aggregation stage. However, in the updating stage, all nodes share the same updating function. The identical updating function treats each node embedding as i.i.d. random variables and thus ignores the implicit relationships between neighborhoods, which limits the capacity of the GNNs. The updating function is usually implemented with a linear transformation followed by a non-linear activation function. To make the updating function topology-aware, we inject the topological information into the non-linear activation function and propose Graph-adaptive Rectified Linear Unit (GReLU), which is a new parametric activation function incorporating the neighborhood information in a novel and efficient way. The parameters of GReLU are obtained from a hyperfunction based on both node features and the corresponding adjacent matrix. To reduce the risk of overfitting and the computational cost, we decompose the hyperfunction as two independent components for nodes and features respectively. We conduct comprehensive experiments to show that our plug-and-play GReLU method is efficient and effective given different GNN backbones and various downstream tasks.
Dynamic systems that consist of a set of interacting elements can be abstracted as temporal networks. Recently, higher-order patterns that involve multiple interacting nodes have been found crucial to indicate domain-specific laws of different temporal networks. This posts us the challenge of designing more sophisticated hypergraph models for these higher-order patterns and the associated new learning algorithms. Here, we propose the first model, named HIT, for full-spectrum higher-order pattern prediction in temporal hypergraphs. Particularly, we focus on predicting three types of common but important interaction patterns involving three interacting elements in temporal networks, which could be extended to even higher-order patterns. HIT extracts the structural representation of a node triplet of interest on the temporal hypergraph and uses it to tell what type of, when, and why the interaction expansion could happen in this triplet. HIT could achieve significant improvement (averaged 20% AUC gain to identify the interaction type, uniformly more accurate time estimation) compared to both heuristic and other neural-network-based baselines on 5 real-world large temporal hypergraphs. Moreover, HIT provides a certain degree of interpretability by identifying the most discriminatory structural features on the temporal hypergraphs for predicting different higher-order patterns.
The self-supervised graph representation learning has achieved much success in recent web based research and applications, such as recommendation system, social networks, and anomaly detection. However, existing works suffer from two problems. Firstly, in social networks, the influential neighbors are important, but the overwhelming routine in graph representation-learning utilizes the node-wise similarity metric defined on embedding vectors that cannot exactly capture the subtle local structure and the network proximity. Secondly, existing works implicitly assume a universal distribution across datasets, which presumably leads to sub-optimal models considering the potential distribution shift. To address these problems, in this paper, we learn structural embeddings in which the proximity is characterized by 1-Wasserstein distance. We propose a distributionally robust self-supervised graph neural network framework to learn the representations. More specifically, in our method, the embeddings are computed based on subgraphs centering at the node of interest and represent both the node of interest and its neighbors, which better preserves the local structure of nodes. To make our model end-to-end trainable, we adopt a deep implicit layer to compute the Wasserstein distance, which can be formulated as a differentiable convex optimization problem. Meanwhile, our distributionally robust formulation explicitly constrains the maximal diversity for matched queries and keys. As such, our model is insensitive to the data distributions and has better generalization abilities. Extensive experiments demonstrate that the graph encoder learned by our approach can be utilized for various downstream analyses, including node classification, graph classification, and top-k similarity search. The results show our algorithm outperforms state-of-the-art baselines, and the ablation study validates the effectiveness of our design.
Contrastive learning is an effective unsupervised method in graph representation learning. Recently, the data augmentation based contrastive learning method has been extended from images to graphs. However, most prior works are directly adapted from the models designed for images. Unlike the data augmentation on images, the data augmentation on graphs is far less intuitive and much harder to provide high-quality contrastive samples, which are the key to the performance of contrastive learning models. This leaves much space for improvement over the existing graph contrastive learning frameworks. In this work, by introducing an adversarial graph view and an information regularizer, we propose a simple but effective method, Adversarial Graph Contrastive Learning (ArieL), to extract informative contrastive samples within a reasonable constraint. It consistently outperforms the current graph contrastive learning methods in the node classification task over various real-world datasets and further improves the robustness of graph contrastive learning.
Recently, the rapid diffusion of malicious information in online social networks causes great harm to our society. Therefore, it is of great significance to localize diffusion sources as early as possible to stem the spread of malicious information. This paper proposes a novel sensor-based method, called greedy full-order neighbor localization (denoted as GFNL), to solve this problem under a low infection propagation in line with the real world. More specifically, GFNL includes two main components, i.e., the greedy-based sensor deployment strategy (DS) and direction-path-based source estimation strategy (ES). In more detail, to ensure sensors can observe a propagation information as early as possible, a set of sensors is deployed in a network to minimize the geodesic distance (i.e., the distance of the shortest path) between the candidate set and the sensor set based on DS. Then when a fraction of sensors observe a propagation, ES infers the source based on the idea that the distance of the actual propagation path is proportional to the observed time. Compared with some state-of-the-art methods, comprehensive experiments have proved the superiority and robustness of our proposed GFNL.
In recent years, Graph Neural Networks (GNNs) have shown superior performance on diverse real-world applications. To improve the model capacity, besides designing aggregation operations, GNN topology design is also very important. In general, there are two mainstream GNN topology design manners. The first one is to stack aggregation operations to obtain the higher-level features but easily got performance drop as the network goes deeper. Secondly, the multiple aggregation operations are utilized in each layer which provides adequate and independent feature extraction stage on local neighbors while are costly to obtain the higher-level information. To enjoy the benefits while alleviating the corresponding deficiencies of these two manners, we learn to design the topology of GNNs in a novel feature fusion perspective which is dubbed F2GNN. To be specific, we provide a feature fusion perspective in designing GNN topology and propose a novel framework to unify the existing topology designs with feature selection and fusion strategies. Then we develop a neural architecture search method on top of the unified framework which contains a set of selection and fusion operations in the search space and an improved differentiable search algorithm. The performance gains on diverse datasets, five homophily and three heterophily ones, demonstrate the effectiveness of F2GNN. We further conduct experiments to show that F2GNN can improve the model capacity while alleviating the deficiencies of existing GNN topology design manners, especially alleviating the over-smoothing problem, by utilizing different levels of features adaptively. 1
In recent years, graph neural networks (GNNs) have emerged as a successful tool in a variety of graph-related applications. However, the performance of GNNs can be deteriorated when noisy connections occur in the original graph structures; besides, the dependence on explicit structures prevents GNNs from being applied to general unstructured scenarios. To address these issues, recently emerged deep graph structure learning (GSL) methods propose to jointly optimize the graph structure along with GNN under the supervision of a node classification task. Nonetheless, these methods focus on a supervised learning scenario, which leads to several problems, i.e., the reliance on labels, the bias of edge distribution, and the limitation on application tasks. In this paper, we propose a more practical GSL paradigm, unsupervised graph structure learning, where the learned graph topology is optimized by data itself without any external guidance (i.e., labels). To solve the unsupervised GSL problem, we propose a novel StrUcture Bootstrapping contrastive LearnIng fraMEwork (SUBLIME for abbreviation) with the aid of self-supervised contrastive learning. Specifically, we generate a learning target from the original data as an “anchor graph”, and use a contrastive loss to maximize the agreement between the anchor graph and the learned graph. To provide persistent guidance, we design a novel bootstrapping mechanism that upgrades the anchor graph with learned structures during model learning. We also design a series of graph learners and post-processing schemes to model the structures to learn. Extensive experiments on eight benchmark datasets demonstrate the significant effectiveness of our proposed SUBLIME and high quality of the optimized graphs.
Despite the recent success of Message-passing Graph Neural Networks (MP-GNNs), the strong inductive bias of homophily limits their ability to generalize to heterophilic graphs and leads to the over-smoothing problem. Most existing works attempt to mitigate this issue in the spirit of emphasizing the contribution from similar neighbors and reducing those from dissimilar ones when performing aggregation, where the dissimilarities are utilized passively and their positive effects are ignored, leading to suboptimal performances. Inspired by the idea of attitude polarization in social psychology, that people tend to be more extreme when exposed to an opposite opinion, we propose Polarized Graph Neural Network (Polar-GNN). Specifically, pairwise similarities and dissimilarities of nodes are firstly modeled with node features and topological structure information. And specially, we assign negative weights for those dissimilar ones. Then nodes aggregate the messages on a hyper-sphere through a polarization operation, which effectively exploits both similarities and dissimilarities. Furthermore, we theoretically demonstrate the validity of the proposed operation. Lastly, an elaborately designed loss function is introduced for the hyper-spherical embedding space. Extensive experiments on real-world datasets verify the effectiveness of our model.
Center-based clustering techniques are fundamental to many real-world applications such as data summarization and social network analysis. In this work, we study the problem of fairness aware k-center clustering over large datasets. We are given an input dataset comprising a set of n points, where each point belongs to a specific demographic group characterized by a protected attribute, such as race or gender. The goal is to identify k clusters such that all clusters have considerable representation from all groups and the maximum radius of these clusters is minimized.
The majority of the prior techniques do not scale beyond 100K points for k = 50. To address the scalability challenges, we propose an efficient 2-round algorithm for the MapReduce setting that is guaranteed to be a 9-approximation to the optimal solution. Additionally, we develop a 2-pass streaming algorithm that is efficient and has a low memory footprint. These theoretical results are complemented with an empirical evaluation on million-scale datasets, demonstrating that our techniques are effective to identify high-quality fair clusters and efficient as compared to the state-of-the-art.
Graph embedding techniques are pivotal in real-world machine learning tasks that operate on graph-structured data, such as social recommendation and protein structure modeling. Embeddings are mostly performed on the node level for learning representations of each node. Since the formation of a graph is inevitably affected by certain sensitive node attributes, the node embeddings can inherit such sensitive information and introduce undesirable biases in downstream tasks. Most existing works impose ad-hoc constraints on the node embeddings to restrict their distributions for unbiasedness/fairness, which however compromise the utility of the resulting embeddings. In this paper, we propose a principled new way for unbiased graph embedding by learning node embeddings from an underlying bias-free graph, which is not influenced by sensitive node attributes. Motivated by this new perspective, we propose two complementary methods for uncovering such an underlying graph, with the goal of introducing minimum impact on the utility of the embeddings. Both our theoretical justification and extensive experimental comparisons against state-of-the-art solutions demonstrate the effectiveness of our proposed methods.
Prohibited item detection is an important problem in e-commerce, where the goal is to detect illegal items online for evading risks and stemming crimes. Traditional solutions usually mine evidence from individual instances, while current efforts try employing advanced Graph Neural Networks (GNN) to utilize multiple risk-relevant structures of items. However, it still remains two essential challenges, including weak structure and weak supervision. This work proposes the Risk Graph Structure Learning model (RGSL) for prohibited item detection. RGSL first introduces structure learning into large-scale risk graphs, to reduce noisy connections and add similar pairs. It then designs the pairwise training mechanism, which transforms the detection process as a metric learning from candidates to their similar prohibited items. Furthermore, RGSL generates risk-aware item representations and searches risk-relevant pairs for structure learning iteratively. We test RGSL on three real-world scenarios, and the improvements to baselines are up to 21.91% in AP and 18.28% in MAX-F1. Meanwhile, RGSL has been deployed on an e-commerce platform, and the improvements to traditional solutions are up to 23.59% in ACC@1000 and 6.52% in ACC@10000.
Simplicial complexes are a generalization of graphs that model higher-order relations. In this paper, we introduce simplicial patterns —that we call simplets— and generalize the task of frequent pattern mining from the realm of graphs to that of simplicial complexes. Our task is particularly challenging due to the enormous search space and the need for higher-order isomorphism. We show that finding the occurrences of simplets in a complex can be reduced to a bipartite graph isomorphism problem, in linear time and at most quadratic space. We then propose an anti-monotonic frequency measure that allows us to start the exploration from small simplets and stop expanding a simplet as soon as its frequency falls below the minimum frequency threshold. Equipped with these ideas and a clever data structure, we develop a memory-conscious algorithm that, by carefully exploiting the relationships among the simplices in the complex and among the simplets, achieves efficiency and scalability for our complex mining task. Our algorithm, FreSCo, comes in two flavors: it can compute the exact frequency of the simplets or, more quickly, it can determine whether a simplet is frequent, without having to compute the exact frequency. Experimental results prove the ability of FreSCo to mine frequent simplets in complexes of various size and dimension, and the significance of the simplets with respect to the traditional graph patterns.
Cross Domain Recommendation (CDR) has been popularly studied to alleviate the cold-start and data sparsity problem commonly existed in recommender systems. CDR models can improve the recommendation performance of a target domain by leveraging the data of other source domains. However, most existing CDR models assume information can directly ‘transfer across the bridge’, ignoring the privacy issues. To solve this problem, we propose a novel two stage based privacy-preserving CDR framework (PriCDR). In the first stage, we propose two methods, i.e., Johnson-Lindenstrauss Transform (JLT) and Sparse-aware JLT (SJLT), to publish the rating matrix of the source domain using Differential Privacy (DP). We theoretically analyze the privacy and utility of our proposed DP based rating publishing methods. In the second stage, we propose a novel heterogeneous CDR model (HeteroCDR), which uses deep auto-encoder and deep neural network to model the published source rating matrix and target rating matrix respectively. To this end, PriCDR can not only protect the data privacy of the source domain, but also alleviate the data sparsity of the source domain. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness of PriCDR and HeteroCDR.
Graph neural networks (GNNs) have gained significant success in graph representation learning and become the go-to approach for many graph-based tasks. Despite their effectiveness, the performance of GNNs is known to decline gradually as the number of layers increases. This attenuation is mainly caused by noise propagation, which refers to the useless or negative information propagated (directly or indirectly) from other nodes during the multi-layer graph convolution for node representation learning. This noise increases more severely as the layers of GNNs deepen, which is also a main reason of over-smoothing. In this paper, we propose a new convolution strategy for GNNs to address this problem via suppressing the noise propagation. Specifically, we first find that the feature propagation process of GNNs can be taken as a Markov chain. And then, based on the idea of Markov clustering, we introduce a new graph inflation layer (i.e., using a power function over the distribution) into GNNs to prevent noise propagating from local neighbourhoods to the whole graph with the increase of network layers. Our method is simple in design, which does not require any changes on the original basis and therefore can be easily extended. We conduct extensive experiments on real-world networks and have a stable improved performance as the network depth increases over existing GNNs.
Online social networks are a dominant medium in everyday life to stay in contact with friends and to share information. In Twitter, users can connect with other users by following them, who in turn can follow back. In recent years, researchers studied several properties of social networks and designed random graph models to describe them. Many of these approaches either focus on the generation of undirected graphs or on the creation of directed graphs without modeling the dependencies between reciprocal (i.e., two directed edges of opposite direction between two nodes) and directed edges. We propose an approach to generate directed social network graphs that creates reciprocal and directed edges and considers the correlation between the respective degree sequences.
Our model relies on crawled directed graphs in Twitter, on which information w.r.t. a topic is exchanged or disseminated. While these graphs exhibit a high clustering coefficient and small average distances between random node pairs (which is typical in real-world networks), their degree sequences seem to follow a χ2-distribution rather than power law. To achieve high clustering coefficients, we apply an edge rewiring procedure that preserves the node degrees.
We compare the crawled and the created graphs, and simulate certain algorithms for information dissemination and epidemic spreading on them. The results show that the created graphs exhibit very similar topological and algorithmic properties as the real-world graphs, providing evidence that they can be used as surrogates in social network analysis. Furthermore, our model is highly scalable, which enables us to create graphs of arbitrary size with almost the same properties as the corresponding real-world networks.
In the fraud graph, fraudsters often interact with a large number of benign entities to hide themselves. So, there are not only the homophilic connections formed by the same label nodes (similar nodes), but also the heterophilic connections formed by the different label nodes (dissimilar nodes). However, the existing GNN-based fraud detection methods just enhance the homophily in fraud graph and use the low-pass filter to retain the commonality of node features among the neighbors, which inevitably ignore the difference among neighbor of heterophilic connections. To address this problem, we propose a Graph Neural Network-based Fraud Detector with Homophilic and Heterophilic Interactions (H2-FDetector for short). Firstly, we identify the homophilic and heterophilic connections with the supervision of labeled nodes. Next, we design a new information aggregation strategy to make the homophilic connections propagate similar information and the heterophilic connections propagate difference information. Finally, a prototype prior is introduced to guide the identification of fraudsters. Extensive experiments on two real public benchmark fraud detection tasks demonstrate that our method apparently outperforms state-of-the-art baselines.
Maximal Frequent Subgraph (MFS) mining asks to identify the maximal subgraph that commonly appears in a set of graphs, which has been found valuable in many applications in social science, biology, and other domains. Previous studies focused on reducing the search space of MFSs and discovered the theoretically smallest search space. Despite the success in theory, no practical algorithm can exhaustively search the space as it is huge even for small graphs with only tens of nodes and hundreds of edges. Moreover, deciding whether a subgraph is an MFS needs to solve subgraph monomorphism (SM), an NP-complete problem that introduces extra challenges. Here, we propose a practical MFS mining algorithm that targets large MFSs, named SATMargin. SATMargin adopts random walk in the search space to perform efficient search and utilizes a customized conflict learning Boolean Satisfiability (SAT) algorithm to accelerate SM queries. We design a mechanism that reuses SAT solutions to combine the random walk and the SAT solver effectively. We evaluate SATMargin over synthetic graphs and 6 real-world graph datasets. SATMargin shows superior performance to baselines in finding more and larger MFSs. We further demonstrate the effectiveness of SATMargin in a case study of RNA graphs. The identified frequent subgraph by SATMargin well matches the functional core structure of RNAs previously detected in biological experiments. Our software can be found at https://github.com/MuyiLiu2022/SATMargin-and-Baselines.
The prevalence of graph structures attracts a surge of investigation on graph data, enabling several downstream tasks such as multi-graph classification. However, in the multi-graph setting, graphs usually follow a long-tailed distribution in terms of their sizes, i.e., the number of nodes. In particular, a large fraction of tail graphs usually have small sizes. Though recent graph neural networks (GNNs) can learn powerful graph-level representations, they treat the graphs uniformly and marginalize the tail graphs which suffer from the lack of distinguishable structures, resulting in inferior performance on tail graphs. To alleviate this concern, in this paper we propose a novel graph neural network named SOLT-GNN, to close the representational gap between the head and tail graphs from the perspective of knowledge transfer. In particular, SOLT-GNN capitalizes on the co-occurrence substructures exploitation to extract the transferable patterns from head graphs. Furthermore, a novel relevance prediction function is proposed to memorize the pattern relevance derived from head graphs, in order to predict the complements for tail graphs to attain more comprehensive structures for enrichment. We conduct extensive experiments on five benchmark datasets, and demonstrate that our proposed model can outperform the state-of-the-art baselines.
Listing dense subgraphs in large graphs plays a key task in varieties of network analysis applications like community detection. Clique, as the densest model, has been widely investigated. However, in practice, communities rarely form as cliques for various reasons, e.g., data noise. Therefore, k-plex, – graph with each vertex adjacent to all but at most k vertices, is introduced as a relaxed version of clique. Often, to better simulate cohesive communities, an emphasis is placed on connected k-plexes with small k. In this paper, we continue the research line of listing all maximal k-plexes and maximal k-plexes of prescribed size. Our first contribution is algorithm ListPlex that lists all maximal k-plexes in O*(γD) time for each constant k, where γ is a value related to k but strictly smaller than 2, and D is the degeneracy of the graph that is far less than the vertex number n in real-word graphs. Compared to the trivial bound of 2n, the improvement is significant, and our bound is better than all previously known results. In practice, we further use several techniques to accelerate listing k-plexes of a given size, such as structural-based prune rules, cache-efficient data structures, and parallel techniques. All these together result in a very practical algorithm. Empirical results show that our approach outperforms the state-of-the-art solutions by up to orders of magnitude.
Generative adversarial network (GAN) is widely used for generalized and robust learning on graph data. However, for non-Euclidean graph data, the existing GAN-based graph representation methods generate negative samples by random walk or traverse in discrete space, leading to the information loss of topological properties (e.g. hierarchy and circularity). Moreover, due to the topological heterogeneity (i.e., different densities across the graph structure) of graph data, they suffer from serious topological distortion problems. In this paper, we proposed a novel Curvature Graph Generative Adversarial Networks method, named CurvGAN, which is the first GAN-based graph representation method in the Riemannian geometric manifold. To better preserve the topological properties, we approximate the discrete structure as a continuous Riemannian geometric manifold and generate negative samples efficiently from the wrapped normal distribution. To deal with the topological heterogeneity, we leverage the Ricci curvature for local structures with different topological properties, obtaining to low-distortion representations. Extensive experiments show that CurvGAN consistently and significantly outperforms the state-of-the-art methods across multiple tasks and shows superior robustness and generalization.
Graph classification has a wide range of applications in bioinformatics, social sciences, automated fake news detection, web document classification, and more. In many practical scenarios, including web-scale applications, labels are scarce or hard to obtain. Unsupervised learning is thus a natural paradigm for these settings, but its performance often lags behind that of supervised learning. However, recently contrastive learning (CL) has enabled unsupervised computer vision models to perform comparably to supervised models. Theoretical and empirical works analyzing visual CL frameworks find that leveraging large datasets and task relevant augmentations is essential for CL framework success. Interestingly, graph CL frameworks report high performance while using orders of magnitude smaller data, and employing domain-agnostic graph augmentations (DAGAs) that can corrupt task relevant information.
Motivated by these discrepancies, we seek to determine why existing graph CL frameworks continue to perform well, and identify flawed practices in graph data augmentation and popular graph CL evaluation protocols. We find that DAGA can destroy task-relevant information and harm the model’s ability to learn discriminative representations. We also show that on small benchmark datasets, the inductive bias of graph neural networks can significantly compensate for these limitations, while on larger graph classification tasks commonly-used DAGAs perform poorly. Based on our findings, we propose better practices and sanity checks for future research and applications, including adhering to principles in visual CL when designing context-aware graph augmentations. For example, in graph-based document classification, which can be used for better web search, we show task-relevant augmentations improve accuracy by up to 20.
Graph Neural Networks (GNNs) are widely used on a variety of graph-based machine learning tasks. For node-level tasks, GNNs have strong power to model the homophily property of graphs (i.e., connected nodes are more similar), while their ability to capture heterophily property is often doubtful. This is partially caused by the design of the feature transformation with the same kernel for the nodes in the same hop and the followed aggregation operator. One kernel cannot model the similarity and the dissimilarity (i.e., the positive and negative correlation) between node features simultaneously even though we use attention mechanisms like Graph Attention Network (GAT), since the weight calculated by attention is always a positive value. In this paper, we propose a novel GNN model based on a bi-kernel feature transformation and a selection gate. Two kernels capture homophily and heterophily information respectively, and the gate is introduced to select which kernel we should use for the given node pairs. We conduct extensive experiments on various datasets with different homophily-heterophily properties. The experimental results show consistent and significant improvements against state-of-the-art GNN methods.
We address the problem of learning the Markov order in categorical sequences that represent paths in a network, i.e., sequences of variable lengths where transitions between states are constrained to a known graph. Such data pose challenges for standard Markov order detection methods and demand modeling techniques that explicitly account for the graph constraint. Adopting a multi-order modeling framework for paths, we develop a Bayesian learning technique that (i) detects the correct Markov order more reliably than a competing method based on the likelihood ratio test, (ii) requires considerably less data than methods using AIC or BIC, and (iii) is robust against partial knowledge of the underlying constraints. We further show that a recently published method that uses a likelihood ratio test exhibits a tendency to overfit the true Markov order of paths, which is not the case for our Bayesian technique. Our method is important for data scientists analyzing patterns in categorical sequence data that are subject to (partially) known constraints, e.g. click stream data or other behavioral data on the Web, information propagation in social networks, mobility trajectories, or pathway data in bioinformatics. Addressing the key challenge of model selection, our work is also relevant for the growing body of research that emphasizes the need for higher-order models in network analysis.
Online social networks provide a medium for citizens to form opinions on different societal issues, and a forum for public discussion. They also expose users to viral content, such as breaking news articles. In this paper, we study the interplay between these two aspects: opinion formation and information cascades in online social networks. We present a new model that allows us to quantify how users change their opinion as they are exposed to viral content. Our model is a combination of the popular Friedkin–Johnsen model for opinion dynamics and the independent cascade model for information propagation. We present algorithms for simulating our model, and we provide approximation algorithms for optimizing certain network indices, such as the sum of user opinions or the disagreement–controversy index; our approach can be used to obtain insights into how much viral content can increase these indices in online social networks. Finally, we evaluate our model on real-world datasets. We show experimentally that marketing campaigns and polarizing contents have vastly different effects on the network: while the former have only limited effect on the polarization in the network, the latter can increase the polarization up to 59% even when only 0.5% of the users start sharing a polarizing content. We believe that this finding sheds some light into the growing segregation in today’s online media.
In network analysis, the betweenness centrality of a node informally captures the fraction of shortest paths visiting that node. The computation of the betweenness centrality measure is a fundamental task in the analysis of modern networks, enabling the identification of the most central nodes in such networks. Additionally to being massive, modern networks also contain information about the time at which their events occur. Such networks are often called temporal networks. The temporal information makes the study of the betweenness centrality in temporal networks (i.e., temporal betweenness centrality) much more challenging than in static networks (i.e., networks without temporal information). Moreover, the exact computation of the temporal betweenness centrality is often impractical on even moderately-sized networks, given its extremely high computational cost. A natural approach to reduce such computational cost is to obtain high-quality estimates of the exact values of the temporal betweenness centrality. In this work we present , the first sampling-based approximation algorithm for estimating the temporal betweenness centrality values of the nodes in a temporal network, providing rigorous probabilistic guarantees on the quality of its output. is able to compute the estimates of the temporal betweenness centrality values under two different optimality criteria for the shortest paths of the temporal network. In addition, outputs high-quality estimates with sharp theoretical guarantees leveraging on the empirical Bernstein bound, an advanced concentration inequality. Finally, our experimental evaluation shows that significantly reduces the computational resources required by the exact computation of the temporal betweenness centrality on several real world networks, while reporting high-quality estimates with rigorous guarantees.
A key graph mining primitive is extracting dense structures from graphs, and this has led to interesting notions such as k-cores which subsequently have been employed as building blocks for capturing the structure of complex networks and for designing efficient approximation algorithms for challenging problems such as finding the densest subgraph. In applications such as biological, social, and transportation networks, interactions between objects span multiple aspects. Multilayer (ML) networks have been proposed for accurately modeling such applications. In this paper, we present FirmCore, a new family of dense subgraphs in ML networks, and show that it satisfies many of the nice properties of k-cores in single-layer graphs. Unlike the state of the art core decomposition of ML graphs, FirmCores have a polynomial time algorithm, making them a powerful tool for understanding the structure of massive ML networks. We also extend FirmCore for directed ML graphs. We show that FirmCores and directed FirmCores can be used to obtain efficient approximation algorithms for finding the densest subgraphs of ML graphs and their directed counterparts. Our extensive experiments over several real ML graphs show that our FirmCore decomposition algorithm is significantly more efficient than known algorithms for core decompositions of ML graphs. Furthermore, it returns solutions of matching or better quality for the densest subgraph problem over (possibly directed) ML graphs.
Graph Structure Learning (GSL) recently has attracted considerable attentions in its capacity of optimizing graph structure as well as learning suitable parameters of Graph Neural Networks (GNNs) simultaneously. Current GSL methods mainly learn an optimal graph structure (final view) from single or multiple information sources (basic views), however the theoretical guidance on what is the optimal graph structure is still unexplored. In essence, an optimal graph structure should only contain the information about tasks while compress redundant noise as much as possible, which is defined as ”minimal sufficient structure”, so as to maintain the accurancy and robustness. How to obtain such structure in a principled way? In this paper, we theoretically prove that if we optimize basic views and final view based on mutual information, and keep their performance on labels simultaneously, the final view will be a minimal sufficient structure. With this guidance, we propose a Compact GSL architecture by MI compression, named CoGSL. Specifically, two basic views are extracted from original graph as two inputs of the model, which are refinedly reestimated by a view estimator. Then, we propose an adaptive technique to fuse estimated views into the final view. Furthermore, we maintain the performance of estimated views and the final view and reduce the mutual information of every two views. To comprehensively evaluate the performance of CoGSL, we conduct extensive experiments on several datasets under clean and attacked conditions, which demonstrate the effectiveness and robustness of CoGSL.
We study the problem of supervised contrastive (SupCon) learning on graphs. The SupCon loss has been recently proposed for classification tasks by pulling data points in the same class closer than those of different classes. However, it could be difficult for SupCon to handle datasets with large intra-class variances and high inter-class similarities. This issue is also challenging when it couples with graph structures. To address this, we present the cluster-aware supervised contrastive learning loss (ClusterSCL1) for graph learning tasks. The main idea of ClusterSCL is to retain the structural and attribute properties of a graph in the form of nodes’ cluster distributions during supervised contrastive learning. Specifically, ClusterSCL introduces the strategy of cluster-aware data augmentation and integrates it with the SupCon loss. Extensive experiments on several widely adopted graph benchmarks demonstrate the superiority of ClusterSCL over the cross-entropy, SupCon, and other graph contrastive objectives.
Graph neural network (GNN) has become a popular tool to analyze the graph data. Existing GNNs only focus on networks with first-order dependency, that is, conventional networks following the Markov property. However, many networks in real life own the higher-order dependency, such as click-stream data where the choice of the next page depends not only on the current page but also on previous pages. This kind of sequential data from complex systems (including natural dependencies) are often ignored by existing GNNs which makes them ineffective. To address this problem, we propose for the first time new GNN approaches for higher-order networks in this paper. First, we form sequence fragments by the current node and its predecessor nodes of different orders as candidate higher-order dependencies. When the fragment significantly affects the probability distribution of different successor nodes of the current node, we include it in the higher-order dependency set. We formulize the network with higher-order dependency as an augmented conventional first-order network, and then feed it into GNNs to derive network embeddings. Moreover, we further propose a new end-to-end GNN framework for dealing with higher-order networks directly in the model. Specifically, the higher-order dependency is used as the neighbor aggregation controller when the node is embedded and updated. In the graph convolutional layer, in addition to the first-order neighbor information, we also aggregate the middle node information from the higher-order dependency segment. We finally test the new approaches on three real networks with higher-order dependency, and compare with some state-of-the-art methods. The results show significant improvements of the new approaches which consider higher-order dependency.
Learning low-dimensional representations for Heterogeneous Information Networks (HINs) has drawn increasing attention recently for its effectiveness in real-world applications. Compared with homogeneous networks, HINs are characterized by meta-paths connecting different types of nodes with semantic meanings. Existing methods mainly follow the prototype of independently learning meta-path-based embeddings and integrating them into a unified embedding. However, meta-paths in a HIN are inherently correlated since they reflect different perspectives of the same object. If each meta-path is treated as an isolated semantic data resource and the correlations among them are disregarded, sub-optimality in the both the meta-path based embedding and final embedding will be resulted. To address this issue, we make the first attempt to explicitly model the correlation among meta-paths by proposing Collaborative Knowledge Distillation for Heterogeneous Information Network Embedding (CKD). More specifically, we model the knowledge in each meta-path with two different granularities: regional knowledge and global knowledge. We learn the meta-path-based embeddings by collaboratively distill the knowledge from intra-meta-path and inter-meta-path simultaneously. Experiments conducted on six real-world HIN datasets demonstrates the effectiveness of the CKD method.
We propose the Temporal Walk Centrality, which quantifies the importance of a node by measuring its ability to obtain and distribute information in a temporal network. In contrast to the widely-used betweenness centrality, we assume that information does not necessarily spread on shortest paths but on temporal random walks that satisfy the time constraints of the network. We show that temporal walk centrality can identify nodes playing central roles in dissemination processes that might not be detected by related betweenness concepts and other common static and temporal centrality measures. We propose exact and approximation algorithms with different running times depending on the properties of the temporal network and parameters of our new centrality measure. A technical contribution is a general approach to lift existing algebraic methods for counting walks in static networks to temporal networks. Our experiments on real-world temporal networks show the efficiency and accuracy of our algorithms. Finally, we demonstrate that the rankings by temporal walk centrality often differ significantly from those of other state-of-the-art temporal centralities.
Signed network embedding (SNE) has received considerable attention in recent years. A mainstream idea of SNE is to learn node representations by estimating the ratio of sampling densities. Though achieving promising performance, these methods based on density ratio estimation are limited to the issues of confusing sample, expected error, and fixed priori. To alleviate the above-mentioned issues, in this paper, we propose a novel dual-branch density ratio estimation (DDRE) architecture for SNE. Specifically, DDRE 1) consists of a dual-branch network, dealing with the confusing sample; 2) proposes the expected matrix factorization without sampling to avoid the expected error; and 3) devises an adaptive cross noise sampling to alleviate the fixed priori. We perform sign prediction and node classification experiments on four real-world and three artificial datasets, respectively. Extensive empirical results demonstrate that DDRE not only significantly outperforms the methods based on density ratio estimation but also achieves competitive performance compared with other types of methods such as graph likelihood, generative adversarial networks, and graph convolutional networks. Code is publicly available at https://github.com/WHU-SNA/DDRE.
With the increasing popularity of social media, online interpersonal communication now plays an essential role in people’s everyday information exchange. Whether and how a newcomer can better engage in the community has attracted great interest due to its application in many scenarios. Although some prior works that explore early socialization have obtained salient achievements, they are focusing on sociological surveys based on the small group. To help individuals get through the early socialization period and engage well in online conversations, we study a novel task to foresee whether a newcomer’s message will be responded to by other participants in a multi-party conversation (henceforth Successful New-entry Prediction)1. The task would be an important part of the research in online assistants and social media. To further investigate the key factors indicating such engagement success, we employ an unsupervised neural network, Variational Auto-Encoder (VAE), to examine the topic content and discourse behavior from newcomer’s chatting history and conversation’s ongoing context. Furthermore, two large-scale datasets, from Reddit and Twitter, are collected to support further research on new-entries. Extensive experiments on both Twitter and Reddit datasets show that our model significantly outperforms all the baselines and popular neural models. Additional explainable and visual analyses on new-entry behavior shed light on how to better join in others’ discussions.
Understanding collective decision making at a large-scale, and elucidating how community organization and community dynamics shape collective behavior are at the heart of social science research. In this work we study the behavior of thousands of communities with millions of active members. We define a novel task: predicting which community will undertake an unexpected, large-scale, distributed campaign. To this end, we develop a hybrid model, combining textual cues, community meta-data, and structural properties. We show how this multi-faceted model can accurately predict large-scale collective decision-making in a distributed environment. We demonstrate the applicability of our model through Reddit’s r/place – a large-scale online experiment in which millions of users, self-organized in thousands of communities, clashed and collaborated in an effort to realize their agenda.
Our hybrid model achieves a high F1 prediction score of 0.826. We find that coarse meta-features are as important for prediction accuracy as fine-grained textual cues, while explicit structural features play a smaller role. Interpreting our model, we provide and support various social insights about the unique characteristics of the communities that participated in the r/place experiment.
Our results and analysis shed light on the complex social dynamics that drive collective behavior, and on the factors that propel user coordination. The scale and the unique conditions of the r/place experiment suggest that our findings may apply in broader contexts, such as online activism, (countering) the spread of hate speech and reducing political polarization. The broader applicability of the model is demonstrated through an extensive analysis of the WallStreetBets community, their role in r/place and four years later, in the GameStop short squeeze campaign of 2021.
Crowdsourcing is widely used to solicit judgement from people in diverse applications ranging from evaluating information quality to rating gig worker performance. To encourage the crowd to put in genuine effort in the judgement tasks, various ways to structure and organize these tasks have been explored, though the understandings of how these task design choices influence the crowd’s judgement are still largely lacking. In this paper, using recidivism risk evaluation as an example, we conduct a randomized experiment to examine the effects of two common designs of crowdsourcing judgement tasks—encouraging the crowd to deliberate and providing feedback to the crowd—on the quality, strictness, and fairness of the crowd’s recidivism risk judgements. Our results show that different designs of the judgement tasks significantly affect the strictness of the crowd’s judgements. Moreover, task designs also have the potential to significantly influence how fairly the crowd judges defendants from different racial groups, on those cases where the crowd exhibits substantial in-group bias. Finally, we find that the impacts of task designs on the judgement also vary with the crowd workers’ own characteristics, such as their cognitive reflection levels. Together, these results highlight the importance of obtaining a nuanced understanding on the relationship between task designs and properties of the crowdsourced judgements.
Internet users make numerous decisions online on a daily basis. With the rapid advances in AI recently, AI-assisted decision making—in which an AI model provides decision recommendations and confidence, while the humans make the final decisions—has emerged as a new paradigm of human-AI collaboration. In this paper, we aim at obtaining a quantitative understanding of whether and when would human decision makers adopt the AI model’s recommendations. We define a space of human behavior models by decomposing the human decision maker’s cognitive process in each decision-making task into two components: the utility component (i.e., evaluate the utility of different actions) and the selection component (i.e., select an action to take), and we perform a systematic search in the model space to identify the model that fits real-world human behavior data the best. Our results highlight that in AI-assisted decision making, human decision makers’ utility evaluation and action selection are influenced by their own judgement and confidence on the decision-making task. Further, human decision makers exhibit a tendency to distort the decision confidence in utility evaluations. Finally, we also analyze the differences in humans’ adoption behavior of AI recommendations as the stakes of the decisions vary.
Access to commonsense knowledge is receiving renewed interest for developing neuro-symbolic AI systems, or debugging deep learning models. Little is currently understood about the types of knowledge that can be gathered using existing knowledge elicitation methods. Moreover, these methods fall short of meeting the evolving requirements of several downstream AI tasks. To this end, collecting broad and tacit knowledge, in addition to negative or discriminative knowledge can be highly useful. Addressing this research gap, we developed a novel game with a purpose, ‘FindItOut’, to elicit different types of knowledge from human players through easily configurable game mechanics. We recruited 125 players from a crowdsourcing platform, who played 2430 rounds, resulting in the creation of more than 150k tuples of knowledge. Through an extensive evaluation of these tuples, we show that FindItOut can successfully result in the creation of plural knowledge with a good player experience. We evaluate the efficiency of the game (over 10 × higher than a reference baseline) and the usefulness of the resulting knowledge, through the lens of two downstream tasks — commonsense question answering and the identification of discriminative attributes. Finally, we present a rigorous qualitative analysis of the tuples’ characteristics, that informs the future use of FindItOut across various researcher and practitioner communities.
When annotators label data, a key metric for quality assurance is inter-annotator agreement (IAA): the extent to which annotators agree on their labels. Though many IAA measures exist for simple categorical and ordinal labeling tasks, relatively little work has considered more complex labeling tasks, such as structured, multi-object, and free-text annotations. Krippendorff’s α, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability, but little work has studied its efficacy and consistency across complex annotation tasks.
We investigate the design and evaluation of IAA measures for complex annotation tasks, with evaluation spanning seven diverse tasks: image bounding boxes, image keypoints, text sequence tagging, ranked lists, free text translations, numeric vectors, and syntax trees. We identify the difficulty of interpretability and the complexity of choosing a distance function as key obstacles in applying Krippendorff’s α generally across these tasks. We propose two novel, more interpretable measures, showing they yield more consistent IAA measures across tasks and annotation distance functions.
Simple up/downvotes, arguably the most widely used reaction design across social media platforms, allow users to efficiently express their opinions and quickly evaluate others’ opinions from aggregated votes. However, such design forces users to project their diverse opinions onto dichotomized reactions and provides limited information to readers on why a comment was up/downvoted. We explore user-generated labels (UGLs) as an alternative reaction design to capture the rich context of user reactions to comments. We conducted a between-subjects study with 218 participants to understand how people use and are influenced by UGLs compared to up/downvotes. Specifically, we examine how UGLs affect users’ ability to express and perceive diverse opinions. Participants generated 234 unique labels on diverse aspects of a comment. Leaving more reactions than participants in the up/downvotes condition, participants reported that the ability to express their opinions improved with UGLs. UGLs also enabled participants to better understand the multifacetedness of public evaluation of a comment.
Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function’s resource utilization in real-time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerates functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs.
Flow scheduling is crucial in data centers, as it directly influences user experience of applications. According to different assumptions and design goals, there are four typical flow scheduling problems/solutions: SRPT, LAS, Fair Queueing, and Deadline-Aware scheduling. When implementing these solutions in commodity switches with limited number of queues, they need to set static parameters by measuring traffic in advance, while optimal parameters vary across time and space. This paper proposes a generic framework, namely QCluster, to adapt all scheduling problems for limited number of queues. The key idea of QCluster is to cluster packets with similar weights/properties into the same queue. QCluster is implemented in Tofino switches, and can cluster packets at a speed of 3.2 Tbps. To the best of our knowledge, QCluster is the fastest clustering algorithm. Experimental results in testbed with programmable switches and ns-2 show that QCluster reduces the average flow completion time (FCT) for short flows up to 56.6%, and reduces the overall average FCT up to 21.7% over state-of-the-art. All the source code in ns-2 is available in Github .
Distributed Deep Learning (DDL) is widely used to accelerate deep neural network training for various Web applications. In each iteration of DDL training, each worker synchronizes neural network gradients with other workers. This introduces communication overhead and degrades the scaling performance. In this paper, we propose a recursive model, OSF (Scaling Factor considering Overlap), for estimating the scaling performance of DDL training of neural network models, given the settings of the DDL system. OSF captures two main characteristics of DDL training: the overlap between computation and communication, and the tensor fusion for batching updates. Measurements on a real-world DDL system show that OSF obtains a low estimation error (ranging from 0.5% to 8.4% for different models). Using OSF, we identify the factors that degrade the scaling performance, and propose solutions to effectively mitigate their impacts. Specifically, the proposed adaptive tensor fusion improves the scaling performance by 32.2%∼ 150% compared to the constant tensor fusion buffer size.
Graph Neural Networks (GNNs) have gained growing interest in miscellaneous applications owing to their outstanding ability in extracting latent representation on graph structures. To render GNN-based service for IoT-driven smart applications, the traditional model serving paradigm resorts to the cloud by fully uploading the geo-distributed input data to the remote datacenter. However, our empirical measurements reveal the significant communication overhead of such cloud-based serving and highlight the profound potential in applying the emerging fog computing. To maximize the architectural benefits brought by fog computing, in this paper, we present Fograph, a novel distributed real-time GNN inference framework that leverages diverse resources of multiple fog nodes in proximity to IoT data sources. By introducing heterogeneity-aware execution planning and GNN-specific compression techniques, Fograph tailors its design to well accommodate the unique characteristics of GNN serving in fog environment. Prototype-based evaluation and case study demonstrate that Fograph significantly outperforms the state-of-the-art cloud serving and vanilla fog deployment by up to 5.39 × execution speedup and 6.84 × throughput improvement.
System instance clustering is crucial for large-scale Web services because it can significantly reduce the training overhead of anomaly detection methods. However, the vast number of system instances with massive time points, redundant metrics, and noise bring significant challenges. We propose OmniCluster to accurately and efficiently cluster system instances for large-scale Web services. It combines a one-dimensional convolutional autoencoder (1D-CAE), which extracts the main features of system instances, with a simple, novel, yet effective three-step feature selection strategy. We evaluated OmniCluster using real-world data collected from a top-tier content service provider providing services for one billion+ monthly active users (MAU), proving that OmniCluster achieves high accuracy (NMI=0.9160) and reduces the training overhead of five anomaly detection models by 95.01% on average.
Nowadays, the large online systems are constructed on the basis of microservice architecture. A failure in this architecture may cause a series of failures due to the fault propagation. Thus, the large online systems need to be monitored comprehensively to ensure the service quality. Even though many anomaly detection techniques have been proposed, few of them can be directly applied to a given microservice or cloud server in industrial environment. To settle these challenges, this paper presents SLA-VAE, a semi-supervised learning based active anomaly detection framework using variational auto-encoder. SLA-VAE first defines anomalies based on feature extraction module, introduces semi-supervised VAE to identify anomalies in multivariate time series, and employs active learning to update the online model via a small number of uncertain samples. We conduct experiments on the cloud server data from two different types of game business in Tencent. The results show that SLA-VAE significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online business system.
Deep learning has been used in various domains, including Web services. Convolutional neural networks (CNNs), which are deep learning representatives, are among the most popular neural networks in Web systems. However, CNN employs a high degree of computing. In comparison to the training phase, the inference process is more frequently done on low-power computing equipments. The limited computing resource and high computation pressure limit the effective use of CNN algorithms in industry. Fortunately, a minimal filtering algorithm called Winograd can reduce convolution calculations by minimizing multiplication operations. We find that Winograd convolution can be sped up further by deep reuse technique, which reuses the similar data and computation processes. In this paper, we propose a new inference method, called DREW, which combines deep reuse with Winograd for further accelerating CNNs. DREW handles three difficulties. First, it can detect the similarities from the complex minimal filtering patterns by clustering. Second, it reduces the online clustering cost in a reasonable range. Third, it provides an adjustable method in clustering granularity balancing the performance and accuracy. Experiments show that 1) DREW further accelerates the Winograd convolution by an average of 2.06 × speedup; 2) when DREW is applied to end-to-end Winograd CNN inference, it achieves 1.71 × the average performance speedup with no (<0.4%) accuracy loss; 3) DREW reduces the number of convolution operations to 11% of the original operations on average.
Graph neural networks (GNNs) have achieved state-of-the-art performance in various graph-based tasks. However, as mainstream GNNs are designed based on the neural message passing mechanism, they do not scale well to data size and message passing steps. Although there has been an emerging interest in the design of scalable GNNs, current researches focus on specific GNN design, rather than the general design space, limiting the discovery of potential scalable GNN models. This paper proposes PaSca, a new paradigm and system that offers a principled approach to systemically construct and explore the design space for scalable GNNs, rather than studying individual designs. Through deconstructing the message passing mechanism, PaSca presents a novel Scalable Graph Neural Architecture Paradigm (SGAP), together with a general architecture design space consisting of 150k different designs. Following the paradigm, we implement an auto-search engine that can automatically search well-performing and scalable GNN architectures to balance the trade-off between multiple criteria (e.g., accuracy and efficiency) via multi-objective optimization. Empirical studies on ten benchmark datasets demonstrate that the representative instances (i.e., PaSca-V1, V2, and V3) discovered by our system achieve consistent performance among competitive baselines. Concretely, PaSca-V3 outperforms the state-of-the-art GNN method JK-Net by 0.4% in terms of predictive accuracy on our large industry dataset while achieving up to 28.3 × training speedups.
Deep-learning-based compressor has received interests recently due to much improved compression ratio. However, modern approaches suffer from long execution time. To ease this problem, this paper targets on cutting down the execution time of deep-learning-based compressors. Building history-dependencies sequentially (e.g., recurrent neural networks) is responsible for long inference latency. Instead, we introduce transformer into deep learning compressors to build history-dependencies in parallel. However, existing transformer is too heavy in computation and incompatible to compression tasks.
This paper proposes a fast general-purpose lossless compressor, TRACE, by designing a compression-friendly structure based on a single-layer transformer. We first design a new metric to advise the selection part of compression model structures. Byte-grouping and Shared-ffn schemes are further proposed to fully utilize the capacity of the single-layer transformer. These features allow TRACE to achieve competitive compression ratio and a much faster speed. In addition, we further accelerate the compression procedure by designing a controller to reduce the parameter updating overhead. Experiments show that TRACE achieves an overall ∼ 3x speedup while keeps a comparable compression ratio to the state-of-the-art compressors. The source code for TRACE and links to the datasets are available at https://github.com/mynotwo/A-Fast-Transformer-based-General-Purpose-LosslessCompressor.
Multilingual natural language understanding, which aims to comprehend multilingual documents, is an important task. Existing efforts have been focusing on the analysis of centrally stored text data, but in real practice, multilingual data is usually distributed. Federated learning is a promising paradigm to solve this problem, which trains local models with decentralized data on local clients and aggregates local models on the central server to achieve a good global model. However, existing federated learning methods assume that data are independent and identically distributed (IID), and cannot handle multilingual data, that are usually non-IID with severely skewed distributions: First, multilingual data is stored on local client devices such that there are only monolingual or bilingual data stored on each client. This makes it difficult for local models to know the information of documents in other languages. Second, the distribution over different languages could be skewed. High resource language data is much more abundant than low resource language data. The model trained on such skewed data may focus more on high resource languages but fail to consider the key information of low resource languages. To solve the aforementioned challenges of multilingual federated NLU, we propose a plug-and-play knowledge composition (KC) module, called FedKC, which exchanges knowledge among clients without sharing raw data. Specifically, we propose an effective way to calculate a consistency loss defined based on the shared knowledge across clients, which enables models trained on different clients achieve similar predictions on similar data. Leveraging this consistency loss, joint training is thus conducted on distributed data respecting the privacy constraints. We also analyze the potential risk of FedKC and provide theoretical bound to show that it is difficult to recover data from the corrupted data. We conduct extensive experiments on three public multilingual datasets for three typical NLU tasks, including paraphrase identification, question answering matching, and news classification. The experiment results show that the proposed FedKC can outperform state-of-the-art baselines on the three datasets significantly.
A large-batch training with data parallelism is a widely adopted approach to efficiently train a large deep neural network (DNN) model. Large-batch training, however, often suffers from the problem of the model quality degradation because of its fewer iterations. To alleviate this problem, in general, learning rate (lr) scaling methods have been applied, which increases the learning rate to make an update larger at each iteration. Unfortunately, however, we observe that large-batch training with state-of-the-art lr scaling methods still often degrade the model quality when a batch size crosses a specific limit, rendering such lr methods less useful. To this phenomenon, we hypothesize that existing lr scaling methods overlook the subtle but important differences across “layers” in training, which results in the degradation of the overall model quality. From this hypothesis, we propose a novel approach (LENA) toward the learning rate scaling for large-scale DNN training, employing: (1) a layer-wise adaptive lr scaling to adjust lr for each layer individually, and (2) a layer-wise state-aware warm-up to track the state of the training for each layer and finish its warm-up automatically. The comprehensive evaluation with variations of batch sizes demonstrates that LENA achieves the target accuracy (i.e., the accuracy of single-worker training): (1) within the fewest iterations across different batch sizes (up to 45.2% fewer iterations and 44.7% shorter time than the existing state-of-the-art method), and (2) for training very large-batch sizes, surpassing the limits of all baselines.
Machine learning (ML) is powering a rapidly-increasing number of web applications. As a crucial part of 5G, edge computing facilitates edge artificial intelligence (AI) by ML model training and inference at the network edge on edge servers. Compared with centralized cloud AI, edge AI enables low-latency ML inference which is critical to many delay-sensitive web applications, e.g., web AR/VR, web gaming and Web-of-Things applications. Existing studies of edge AI focused on resource and performance optimization in training and inference, leveraging edge computing merely as a tool to accelerate training and inference processes. However, the unique ability of edge computing to process data with context awareness, a powerful feature for building the web-of-things for smart cities, has not been properly explored. In this paper, we propose a novel framework named Pyramid that unleashes the potential of edge AI by facilitating homogeneous and heterogeneous hierarchical ML inferences. We motivate and present Pyramid with traffic prediction as an illustrative example, and evaluate it through extensive experiments conducted on two real-world datasets. The results demonstrate the superior performance of Pyramid neural networks in hierarchical traffic prediction and weather analysis.
The autonomous database of the next generation aims to apply the reinforcement learning (RL) on tasks like query optimization and performance tuning with little or no human DBAs’ intervention. Despite the promise, to obtain a decent policy model in the domain of database optimization is still challenging — primarily due to the inherent computational overhead involved in the data hungry RL frameworks — in particular on large databases. In the line of mitigating this adverse effect, we propose Mirror in this work. The core to Mirror is a sampling process built in an RL framework together with a transferring process of the policy model from the sampled database to its original counterpart. While being conceptually simple, we identify that the policy transfer between databases involves heavy noise and prediction drifting that cannot be neglectable. Thereby we build a theoretical-guided sampling algorithm in Mirror assisted by a continuous fine-tuning module. The experiments on the PostgreSQL and an industry database PolarDB validate that Mirror has effectively reduced the computational cost while maintaining a satisfactory performance.
SPARQL’s Basic Graph Pattern (BGP) queries are well-researched for query optimisation and RDF indexing techniques. They resemble SQL inner joins and benefit from reorderability of triple patterns in their performance optimisation. But other components of SPARQL such as OPTIONAL, UNION, FILTER, DISTINCT pose more challenges, as they are part of the SPARQL recursive grammar and impose restrictions on the reorderability of triple patterns. These components are important because they help in querying the semi-structured data as opposed to the strictly structured relational data with a stringent schema. In this paper, we use previously published optimisation techniques for BGP-OPTIONAL queries as primitives, and show how they can be used for SPARQL queries with any intermix of UNION, FILTER, and DISTINCT clauses. We mainly focus on the structural aspects of these queries, identify UNION, FILTER, and DISTINCT queries that can use these BGP-OPTIONAL optimisation techniques, and extend some of the previously published theoretical results.
Logs provide first-hand information for engineers to diagnose failures in large-scale online service systems. Log parsing, which transforms semi-structured raw log messages into structured data, is a prerequisite of automated log analysis such as log-based anomaly detection and diagnosis. Almost all existing log parsers follow the general idea of extracting the common part as templates and the dynamic part as parameters. However, these log parsing methods, often neglect the semantic meaning of log messages. Furthermore, high diversity among various log sources also poses an obstacle in the generalization of log parsing across different systems. In this paper, we propose UniParser to capture the common logging behaviours from heterogeneous log data. UniParser utilizes a Token Encoder module and a Context Encoder module to learn the patterns from the log token and its neighbouring context. A Context Similarity module is specially designed to model the commonalities of learned patterns. We have performed extensive experiments on 16 public log datasets and our results show that UniParser outperforms state-of-the-art log parsers by a large margin. 1
Given a sequence of sets with timestamps, where each set includes an arbitrary number of elements, temporal sets prediction aims to predict elements in the consecutive set. Indeed, predicting temporal sets is much more complicated than the conventional predictions of time series and temporal events. Recent studies on temporal sets prediction follow the same pipeline that only learns from each user’s own sequence, which fails to discover the collaborative signals among the sequences of different users. In this paper, we propose a novel element-guided temporal graph neural network to tackle the above issue in temporal sets prediction. Specifically, we first connect sequences of different users via a temporal graph, where nodes contain users and elements, and edges represent user-element interactions with time information. Then, we devise a new message aggregation mechanism to improve the model expressive ability via adaptively learning element-specific representations for each user with the guidance of elements. By performing the element-guided message aggregation among multiple hops, collaborative signals latent in high-order user-element interactions are explicitly encoded. Finally, we present a temporal information utilization module to capture both the semantic and periodic patterns in user sequential behaviors. Experiments on real-world datasets demonstrate that our approach could not only outperform the existing methods with a significant margin but also capture the collaborative signals. Codes and datasets are available at https://github.com/yule-BUAA/ETGNN.
Hypercomplex algebras are well-developed in the area of mathematics. Recently, several hypercomplex recommendation approaches have been proposed and yielded great success. However, two vital issues have not been well-considered in existing hypercomplex recommenders. First, these methods are only designed for specific and low-dimensional hypercomplex algebras (e.g., complex and quaternion algebras), ignoring the exploration and utilization of high-dimensional ones. Second, most recommenders treat every user-item interaction as an isolated data instance, without considering high-order collaborative relationships.
To bridge these gaps, in this paper, we propose a novel recommendation framework named HyperComplex Graph Collaborative Filtering (HCGCF). To study the high-dimensional hypercomplex algebras, we introduce Cayley–Dickson construction which utilizes a recursive process to define hypercomplex algebras and their mathematical operations. Based on Cayley–Dickson construction, we devise a hypercomplex graph convolution operator to learn user and item representations. Specifically, the operator models both the neighborhood summary and interaction relations with neighbors in hypercomplex spaces, effectively exploiting the high-order connectivity in the user-item bipartite graph. To the best of our knowledge, it is the first time that Cayley-Dickson construction and graph convolution techniques have been explicitly discussed and used in hypercomplex recommender systems. Compared with several state-of-the-art recommender baselines, HCGCF achieves superior performance in both click-through rate prediction and top-K recommendation on three real-world datasets.
Recent years have witnessed great success in deep learning-based sequential recommendation (SR), which can provide more timely and accurate recommendations. One of the most effective deep SR architectures is to stack high-performance residual blocks, e.g., prevalent self-attentive and convolutional operations, for capturing long- and short-range dependence of sequential behaviors. By carefully revisiting previous models, we observe: 1) simple architecture modification of gating each residual connection can help us train deeper SR models and yield significant improvements; 2) compared with self-attention mechanism, stacking of convolution layers also can cover each item of the whole sequential behaviors and achieve competitive or even superior performance.
Guided by these findings, it is meaningful to design a deeper hybrid SR model to ensemble the capacity of both self-attentive and convolutional architectures for SR tasks. In this work, we aim to achieve this goal in the automatic algorithm sense, and propose NASR, an efficient neural architecture search (NAS) method that can automatically select the architecture operation on each layer. Specifically, we firstly design a Table-like search space, involving both self-attentive and convolutional-based SR architectures in a flexible manner. In the search phase, we leverage weight-sharing supernets to encode the entire search space, and further propose to factorize the whole supernet into blocks to ensure the potential candidate SR architectures can be fully trained. Owning to lacking supervisions, we train each block-wise supernet with a self-supervised contrastive optimization scheme, in which the training signals are constructed by conducting data augmentation on original sequential behaviors. The empirical studies show that the discovered deep hybrid network architectures can exhibit substantial improvements over compared baselines, indicating the practicality of searching deep hybrid network architectures on SR tasks. Notably, we show the discovered architecture also enjoys good generalizability and transferability among different datasets.
Crowdsourcing aims to enable the assignment of available resources to the completion of tasks at scale. The continued digitization of societal processes translates into increased opportunities for crowdsourcing. For example, crowdsourcing enables the assignment of computational resources of humans, called workers, to tasks that are notoriously hard for computers. In settings faced with malicious actors, detection of such actors holds the potential to increase the robustness of crowdsourcing platform. We propose a framework called Outlier Detection for Streaming Task Assignment that aims to improve robustness by detecting malicious actors. In particular, we model the arrival of workers and the submission of tasks as evolving time series and provide means of detecting malicious actors by means of outlier detection. We propose a novel socially aware Generative Adversarial Network (GAN) based architecture that is capable of contending with the complex distributions found in time series. The architecture includes two GANs that are designed to adversarially train an autoencoder to learn the patterns of distributions in worker and task time series, thus enabling outlier detection based on reconstruction errors. A GAN structure encompasses a game between a generator and a discriminator, where it is desirable that the two can learn to coordinate towards socially optimal outcomes, while avoiding being exploited by selfish opponents. To this end, we propose a novel training approach that incorporates social awareness into the loss functions of the two GANs. Additionally, to improve task assignment efficiency, we propose an efficient greedy algorithm based on degree reduction that transforms task assignment into a bipartite graph matching. Extensive experiments offer insight into the effectiveness and efficiency of the proposed framework.
Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering. However, the bottleneck of VAE lies in the softmax computation over all items, such that it takes linear costs in the number of items to compute the loss and gradient for optimization. This hinders the practical use due to millions of items in real-world scenarios. Importance sampling is an effective approximation method, based on which the sampled softmax has been derived. However, existing methods usually exploit the uniform or popularity sampler as proposal distributions, leading to a large bias of gradient estimation. To this end, we propose to decompose the inner-product-based softmax probability based on the inverted multi-index, leading to sublinear-time and highly accurate sampling. Based on the proposed proposals, we develop a fast Variational AutoEncoder (FastVAE) for collaborative filtering. FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency according to the experiments on three real-world datasets.
Graph Neural Networks (GNNs) have emerged as powerful tools for collaborative filtering. A key challenge of recommendations is to distill long-range collaborative signals from user-item graphs. Typically, GNNs generate embeddings of users/items by propagating and aggregating the messages between local neighbors. Thus, the ability of GNNs to capture long-range dependencies heavily depends on their depths. However, simply training deep GNNs has several bottleneck effects, e.g., over-fitting & over-smoothing, which may lead to unexpected results if GNNs are not well regularized.
Here we present Graph Optimal Transport Networks (GOTNet) to capture long-range dependencies without increasing the depths of GNNs. Specifically, we perform k-Means clustering on nodes’ GNN embeddings to obtain graph-level representations (e.g., centroids). We then compute node-centroid attentions, which enable long-range messages to be communicated among distant but similar nodes. Our non-local attention operators work seamlessly with local operators in original GNNs. As such, GOTNet is able to capture both local and non-local messages in graphs by only using shallow GNNs, which avoids the bottleneck effects of deep GNNs. Experimental results demonstrate that GOTNet achieves better performance compared with state-of-the-art GNNs.
Over the past decades, for One-Class Collaborative Filtering (OCCF), many learning objectives have been researched based on a variety of underlying probabilistic models. From our analysis, we observe that models trained with different OCCF objectives capture distinct aspects of user-item relationships, which in turn produces complementary recommendations. This paper proposes a novel OCCF framework, named as ConCF, that exploits the complementarity from heterogeneous objectives throughout the training process, generating a more generalizable model. ConCF constructs a multi-branch variant of a given target model by adding auxiliary heads, each of which is trained with heterogeneous objectives. Then, it generates consensus by consolidating the various views from the heads, and guides the heads based on the consensus. The heads are collaboratively evolved based on their complementarity throughout the training, which again results in generating more accurate consensus iteratively. After training, we convert the multi-branch architecture back to the original target model by removing the auxiliary heads, thus there is no extra inference cost for the deployment. Our extensive experiments on real-world datasets demonstrate that ConCF significantly improves the generalization of the model by exploiting the complementarity from heterogeneous objectives.
Feature quality has an impactful effect on recommendation performance. Thereby, feature selection is a critical process in developing deep learning-based recommender systems. Most existing deep recommender systems, however, focus on designing sophisticated neural networks, while neglecting the feature selection process. Typically, they just feed all possible features into their proposed deep architectures, or select important features manually by human experts. The former leads to non-trivial embedding parameters and extra inference time, while the latter requires plenty of expert knowledge and human labor effort. In this work, we propose an AutoML framework that can adaptively select the essential feature fields in an automatic manner. Specifically, we first design a differentiable controller network, which is capable of automatically adjusting the probability of selecting a particular feature field; then, only selected feature fields are utilized to retrain the deep recommendation model. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our framework. We conduct further experiments to investigate its properties, including the transferability, key components, and parameter sensitivity.
With the increasing trend of web services over the Internet, developing a robust Quality of Service (QoS) prediction algorithm for recommending services in real-time is becoming a challenge today. Designing an efficient QoS prediction algorithm achieving high accuracy, while supporting faster prediction to enable the algorithm to be integrated into a real-time system, is one of the primary focuses in the domain of Services Computing. The major state-of-the-art QoS prediction methods are yet to efficiently meet both criteria simultaneously, possibly due to the lack of analysis of challenges involved in designing the prediction algorithm. In this paper, we systematically analyze the various challenges associated with the QoS prediction algorithm and propose solution strategies to overcome the challenges, and thereby propose a novel offline framework using deep neural architectures for QoS prediction to achieve our goals. Our framework, on the one hand, handles the sparsity of the dataset, captures the non-linear relationship among data, figures out the correlation between users and services to achieve desirable prediction accuracy. On the other hand, our framework being an offline prediction strategy enables faster responsiveness. We performed extensive experiments on the publicly available WS-DREAM dataset to show the trade-off between prediction performance and prediction time. Furthermore, we observed our framework significantly improved one of the parameters (prediction accuracy or responsiveness) without considerably compromising the other as compared to the state-of-the-art methods.
Recommendation is prevalently studied for implicit feedback recently, but it seriously suffers from the lack of negative samples, which has a significant impact on the training of recommendation models. Existing negative sampling is based on the static or adaptive probability distributions. Sampling from the adaptive probability receives more attention, since it tends to generate more hard examples, to make recommender training faster to converge. However, item sampling becomes much more time-consuming particularly for complex recommendation models. In this paper, we propose an Adaptive Sampling method based on Importance Resampling (AdaSIR for short), which is not only almost equally efficient and accurate for any recommender models, but also can robustly accommodate arbitrary proposal distributions. More concretely, AdaSIR maintains a contextualized sample pool of fixed-size with importance resampling, from which items are only uniformly sampled. Such a simple sampling method can be proved to provide approximately accurate adaptive sampling under some conditions. The sample pool plays two extra important roles in (1) reusing historical hard samples with certain probabilities; (2) estimating the rank of positive samples for weighting, such that recommender training can concentrate more on difficult positive samples. Extensive empirical experiments demonstrate that AdaSIR outperforms state-of-the-art methods in terms of sampling efficiency and effectiveness.
Task-based Virtual Personal Assistants (VPAs) rely on multi-domain Dialogue State Tracking (DST) models to monitor goals throughout a conversation. Previously proposed models show promising results on established benchmarks, but they have difficulty adapting to unseen domains due to domain-specific parameters in their model architectures. We propose a new Similarity-based Multi-domain Dialogue State Tracking model (SM-DST) that uses retrieval-inspired and fine-grained contextual token-level similarity approaches to efficiently and effectively track dialogue state. The key difference with state-of-the-art DST models is that SM-DST has a single model with shared parameters across domains and slots. Because we base SM-DST on similarity it allows the transfer of tracking information between semantically related domains as well as to unseen domains without retraining. Furthermore, we leverage copy mechanisms that consider the system’s response and the dialogue state from previous turn predictions, allowing it to more effectively track dialogue state for complex conversations. We evaluate SM-DST on three variants of the MultiWOZ DST benchmark datasets. The results demonstrate that SM-DST significantly and consistently outperforms state-of-the-art models across all datasets by absolute 5-18% and 3-25% in the few- and zero-shot settings, respectively.
Learning from implicit feedback is one of the most common cases in the application of recommender systems. Generally speaking, interacted examples are considered as positive while negative examples are sampled from uninteracted ones. However, noisy examples are prevalent in real-world implicit feedback. A noisy positive example could be interacted but it actually leads to negative user preference. A noisy negative example which is uninteracted because of user unawareness could also denote potential positive user preference. Conventional training methods overlook these noisy examples, leading to sub-optimal recommendations.
In this work, we propose a general framework to learn robust recommenders from implicit feedback. Through an empirical study, we find that different models make relatively similar predictions on clean examples which denote the real user preference, while the predictions on noisy examples vary much more across different models. Motivated by this observation, we propose denoising with cross-model agreement (DeCA) which minimizes the KL-divergence between the real user preference distributions parameterized by two recommendation models while maximizing the likelihood of data observation. We instantiate DeCA on four representative recommendation models, empirically demonstrating its superiority over normal training and existing denoising methods. Codes are available at https://github.com/wangyu-ustc/DeCA.
Designing data sharing mechanisms providing performance and strong privacy guarantees is a hot topic for the Online Advertising industry. Namely, a prominent proposal discussed under the Improving Web Advertising Business Group at W3C only allows sharing advertising signals through aggregated, differentially private reports of past displays. To study this proposal extensively, an open Privacy-Preserving Machine Learning Challenge took place at AdKDD’21, a premier workshop on Advertising Science with data provided by advertising company Criteo. In this paper, we describe the challenge tasks, the structure of the available datasets, report the challenge results, and enable its full reproducibility. A key finding is that learning models on large, aggregated data in the presence of a small set of unaggregated data points can be surprisingly efficient and cheap. We also run additional experiments to observe the sensitivity of winning methods to different parameters such as privacy budget or quantity of available privileged side information. We conclude that the industry needs either alternate designs for private data sharing or a breakthrough in learning with aggregated data only to keep ad relevance at a reasonable level.
Sequential recommendation models the dynamics of a user’s previous behaviors in order to forecast the next item, and has drawn a lot of attention. Transformer-based approaches, which embed items as vectors and use dot-product self-attention to measure the relationship between items, demonstrate superior capabilities among existing sequential methods. However, users’ real-world sequential behaviors are uncertain rather than deterministic, posing a significant challenge to present techniques. We further suggest that dot-product-based approaches cannot fully capture collaborative transitivity, which can be derived in item-item transitions inside sequences and is beneficial for cold start items. We further argue that BPR loss has no constraint on positive and sampled negative items, which misleads the optimization.
We propose a novel STOchastic Self-Attention (STOSA) to overcome these issues. STOSA, in particular, embeds each item as a stochastic Gaussian distribution, the covariance of which encodes the uncertainty. We devise a novel Wasserstein Self-Attention module to characterize item-item position-wise relationships in sequences, which effectively incorporates uncertainty into model training. Wasserstein attentions also enlighten the collaborative transitivity learning as it satisfies triangle inequality. Moreover, we introduce a novel regularization term to the ranking loss, which assures the dissimilarity between positive and the negative items. Extensive experiments on five real-world benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art baselines, especially on cold start items. The code is available in https://github.com/zfan20/STOSA.
Real-world recommendation datasets have been shown to be subject to selection bias, which can challenge recommendation models to learn real preferences of users, so as to make accurate recommendations. Existing approaches to mitigate selection bias, such as data imputation and inverse propensity score, are sensitive to the quality of the additional imputation or propensity estimation models. To break these limitations, in this work, we propose a novel self-supervised learning (SSL) framework, i.e., Rating Distribution Calibration (RDC), to tackle selection bias without introducing additional models. In addition to the original training objective, we introduce a rating distribution calibration loss. It aims to correct the predicted rating distribution of biased users by taking advantage of that of their similar unbiased users. We empirically evaluate RDC on two real-world datasets and one synthetic dataset. The experimental results show that RDC outperforms the original model as well as the state-of-the-art debiasing approaches by a significant margin.
Recent works have shown the effectiveness of incorporating textual and visual information to tackle the sparsity problem in recommendation scenarios. To fuse these useful heterogeneous modality information, an essential prerequisite is to align these information for modality-robust features learning and semantic understanding. Unfortunately, existing works mainly focus on tackling the learning of common knowledge across modalities, while the specific characteristics of each modality is discarded, which may inevitably degrade the recommendation performance.
To this end, we propose a pretraining framework PAMD, which stands for PretrAining Modality-Disentangled Representations Model. Specifically, PAMD utilizes pretrained VGG19 and Glove to embed items’ both visual and textual modalities into the continuous embedding space. Based on these primitive heterogeneous representations, a disentangled encoder is devised to automatically extract their modality-common characteristics while preserving their modality-specific characteristics. After this, a contrastive learning is further designed to guarantee the consistence and gaps between modality-disentangled representations. To the best of our knowledge, this is the first pretraining framework to learn modality-disentangled representations in recommendation scenarios. Extensive experiments on three public real-world datasets demonstrate the effectiveness of our pretraining solution against a series of state-of-the-art alternatives, which results in the significant performance gain of 4.70%-17.44%.
Recommender system is playing an increasingly important role in online news platforms nowadays. Recently, there is a growing demand for applying reinforcement learning (RL) algorithms to news recommendation aiming to maximize long-term and/or non-differentiable objectives. However, without an interactive simulated environment, it is extremely costly to develop powerful RL agents for news recommendation. In this paper, we build a user simulator, namely MINDSim, for news recommendation. Targeting at new user generation and corresponding behavior simulation, we first construct a hidden space for users using a generative adversarial network, so that new users can be generated by sampling from this hidden space. To capture complex and fast user interest drifts over time, we adopt an encoder-decoder architecture, which takes the clicked news during the simulation as input and outputs the new user interests for the next period of time. Finally, we build the MINDSim simulator using MIcrosoft News Dataset (MIND), and extensive experimental results on this large-scale real-world dataset demonstrate that MINDSim can simulate the behaviors of real users with high quality.
In online advertising, conventional post-click conversion rate (CVR) estimation models are trained using clicked samples. However, during online serving the models need to estimate for all impression ads, leading to the sample selection bias (SSB) issue. Intuitively, providing reliable supervision signals for unclicked ads is a feasible way to alleviate the SSB issue. This paper proposes an uncertainty-regularized knowledge distillation (UKD) framework to debias CVR estimation via distilling knowledge from unclicked ads. A teacher model learns click-adaptive representations and produces pseudo-conversion labels on unclicked ads as supervision signals. Then a student model is trained on both clicked and unclicked ads with knowledge distillation, performing uncertainty modeling to alleviate the inherent noise in pseudo-labels. Experiments on billion-scale datasets show that UKD outperforms previous debiasing methods. Online results verify that UKD achieves significant improvements.
Accurate user interest modeling is important for news recommendation. Most existing methods for news recommendation rely on implicit feedbacks like click for inferring user interests and model training. However, click behaviors usually contain heavy noise, and cannot help infer complicated user interest such as dislike. Besides, the feed recommendation models trained solely on click behaviors cannot optimize other objectives such as user engagement. In this paper, we present a news feed recommendation method that can exploit various kinds of user feedbacks to enhance both user interest modeling and model training. We propose a unified user modeling framework to incorporate various explicit and implicit user feedbacks to infer both positive and negative user interests. In addition, we propose a strong-to-weak attention network that uses the representations of stronger feedbacks to distill positive and negative user interests from implicit weak feedbacks for accurate user interest modeling. Besides, we propose a multi-feedback model training framework to learn an engagement-aware feed recommendation model. Extensive experiments on a real-world dataset show that our approach can effectively improve the model performance in terms of both news clicks and user engagement.
Knowledge graphs (KGs) have been widely used to improve recommendation accuracy. The multi-hop paths on KGs also enable recommendation reasoning, which is considered a crystal type of explainability. In this paper, we propose a reinforcement learning framework for multi-level recommendation reasoning over KGs, which leverages both ontology-view and instance-view KGs to model multi-level user interests. This framework ensures convergence to a more satisfying solution by effectively transferring high-level knowledge to lower levels. Based on the framework, we propose a multi-level reasoning path extraction method, which automatically selects between high-level concepts and low-level ones to form reasoning paths that better reveal user interests. Experiments on three datasets demonstrate the effectiveness of our method.
Authors publish documents in a dynamic manner. Their topic of interest and writing style might shift over time. Tasks such as author classification, author identification or link prediction are difficult to solve in such complex data settings. We propose a new representation learning model, DGEA (for Dynamic Gaussian Embedding of Authors), that is more suited to solve these tasks by capturing this temporal evolution. We formulate a general embedding framework: author representation at time t is a Gaussian distribution that leverages pre-trained document vectors, and that depends on the publications observed until t. The representations should retain some form of multi-topic information and temporal smoothness. We propose two models that fit into this framework. The first one, K-DGEA, uses a first order Markov model optimized with an Expectation Maximization Algorithm with Kalman Equations. The second, R-DGEA, makes use of a Recurrent Neural Network to model the time dependence. We evaluate our method on several quantitative tasks: author identification, classification, and co-authorship prediction, on two datasets written in English. In addition, our model is language agnostic since it only requires pre-trained document embeddings. It outperforms existing baselines by up to 18% on an author classification task on a news articles dataset.
Users’ social connections have recently shown significant benefits to session-based recommendations, and graph neural networks have demonstrated great success in learning the pattern of information flow among users. However, the current paradigm presumes a given social network, which is not necessarily consistent with the fast-evolving shared interests and is expensive to collect. We propose a novel idea to learn the graph structure among users and make recommendations collectively in a coupled framework. This idea raises two challenges, i.e., scalability and effectiveness. We introduce a novel graph-structure learning framework for session-based recommendations (GSL4Rec) for solving both challenges simultaneously. Our framework has a two-stage strategy, i.e., the coarse neighbor screening and the self-adaptive graph structure learning, to enable the exploration of potential links among all users while maintaining a tractable amount of computation for scalability. We also propose a phased heuristic learning strategy to sequentially and synergistically train the graph learning part and recommendation part of GSL4Rec, thus improving the effectiveness by making the model easier to achieve good local optima. Experiments on five public datasets demonstrate that our proposed model significantly outperforms strong baselines, including state-of-the-art social network-based methods.
In Community Question Answering (CQA) websites, expert finding aims at seeking suitable experts to answer questions. The key is to explore the inherent relevance based on the representations of questions and experts. Existing methods usually learn these features from single view information (e.g., question title), which would be not insufficient to fully learn their representations. In this paper, we propose a personalized expert finding method with a multi-view attentive matching mechanism. We design three modules under the multi-view paradigm, including a question encoder, an intra-view encoder, and an inter-view encoder, which aims to comprehend the comprehensive relationships between experts and questions. In the question encoder, we learn the multi-view question features from its title, body and tag views respectively. In the intra-view encoder, we design an interactive attention network to capture the view-specific relevance between the target question and the historical answered questions of experts for all different views. Furthermore, in the inter-view encoder we employ a personalized attention network to aggregate different view information to learn expert/question representations. In this way, the match of the expert and question could be fully captured from the multi-view information via the intra- and inter-view mechanisms. Experimental results on six datasets demonstrate that the proposed method could achieve better performance than existing state-of-the-art methods.
Recommendation system has been a widely studied task both in academia and industry. Previous works mainly focus on homogeneous recommendation and little progress has been made for heterogeneous recommender systems. However, heterogeneous recommendations, e.g., recommending different types of items including products, videos, celebrity shopping notes, among many others, are dominant nowadays. State-of-the-art methods are incapable of leveraging attributes from different types of items and thus suffer from data sparsity problems. And it is indeed quite challenging to represent items with different feature spaces jointly. To tackle this problem, we propose a kernel-based neural network, namely deep unified representation (or DURation) for heterogeneous recommendation, to jointly model unified representations of heterogeneous items while preserving their original feature space topology structures. Theoretically, we prove the representation ability of the proposed model. Besides, we conduct extensive experiments on the real-world datasets. Experimental results demonstrate that with the unified representation, our model achieves remarkable improvement (e.g., 4.1% ~34.9% lift by AUC score and 3.7% lift by online CTR) over existing state-of-the-art models.
Conversational recommendation system (CRS) is able to obtain fine-grained and dynamic user preferences based on interactive dialogue. Previous CRS assumes that the user has a clear target item, which often deviates from the real scenario, that is for many users who resort to CRS, they might not have a clear idea about what they really like. Specifically, the user may have a clear single preference for some attribute types (e.g. brand) of items, while for other attribute types (e.g. color), the user may have multiple preferences or even no clear preferences, which leads to multiple acceptable attribute instances (e.g. black and red) of one attribute type. Therefore, the users could show their preferences over items under multiple combinations of attribute instances rather than a single item with unique combination of all attribute instances. As a result, we first propose a more realistic conversational recommendation learning setting, namely Multi-Interest Multi-round Conversational Recommendation (MIMCR), where users may have multiple interests in attribute instance combinations and accept multiple items with partially overlapped combinations of attribute instances. To effectively cope with the new CRS learning setting, in this paper, we propose a novel learning framework, namely Multiple Choice questions based Multi-Interest Policy Learning (MCMIPL). In order to obtain user preferences more efficiently, the agent generates multiple choice questions rather than binary yes/no ones on specific attribute instance. Furthermore, we propose a union set strategy to select candidate items instead of existing intersection set strategy in order to overcome over-filtering items during the conversation. Finally, we design a Multi-Interest Policy Learning (MIPL) module, which utilizes captured multiple interests of the user to decide next action, either asking attribute instances or recommending items. Extensive experimental results on four datasets demonstrate the superiority of our method for the proposed MIMCR setting.
Explanations in a recommender system assist users make informed decisions among a set of recommended items. Extensive research attention has been devoted to generate natural language explanations to depict how the recommendations are generated and why the users should pay attention to them. However, due to different limitations of those solutions, e.g., template-based or generation-based, it is hard to make the explanations easily perceivable, reliable, and personalized at the same time.
In this work, we develop a graph attentive neural network model that seamlessly integrates user, item, attributes and sentences for extraction-based explanation. The attributes of items are selected as the intermediary to facilitate message passing for user-item specific evaluation of sentence relevance. And to balance individual sentence relevance, overall attribute coverage and content redundancy, we solve an integer linear programming problem to make the final selection of sentences. Extensive empirical evaluations against a set of state-of-the-art baseline methods on two benchmark review datasets demonstrated the generation quality of proposed solution.
Users’ interactions with items are driven by various intents (e.g., preparing for holiday gifts, shopping for fishing equipment, etc.). However, users’ underlying intents are often unobserved/latent, making it challenging to leverage such latent intents for Sequential recommendation (SR). To investigate the benefits of latent intents and leverage them effectively for recommendation, we propose Intent Contrastive Learning (ICL), a general learning paradigm that leverages a latent intent variable into SR. The core idea is to learn users’ intent distribution functions from unlabeled user behavior sequences and optimize SR models with contrastive self-supervised learning (SSL) by considering the learnt intents to improve recommendation. Specifically, we introduce a latent variable to represent users’ intents and learn the distribution function of the latent variable via clustering. We propose to leverage the learnt intents into SR models via contrastive SSL, which maximizes the agreement between a view of sequence and its corresponding intent. The training is alternated between intent representation learning and the SR model optimization steps within the generalized expectation-maximization (EM) framework. Fusing user intent information into SR also improves model robustness. Experiments conducted on four real-world datasets demonstrate the superiority of the proposed learning paradigm, which improves performance, and robustness against data sparsity and noisy interaction issues 1.
Users who come to recommendation platforms are heterogeneous in activity levels. There usually exists a group of core users who visit the platform regularly and consume a large body of content upon each visit, while others are casual users who tend to visit the platform occasionally and consume less each time. As a result, consumption activities from core users often dominate the training data used for learning. As core users can exhibit different activity patterns from casual users, recommender systems trained on historical user activity data usually achieve much worse performance on casual users than core users. To bridge the gap, we propose a model-agnostic framework L2Aug to improve recommendations for casual users through data augmentation, without sacrificing core user experience. L2Aug is powered by a data augmentor that learns to generate augmented interaction sequences, in order to fine-tune and optimize the performance of the recommendation system for casual users. On four real-world public datasets, L2Aug outperforms other treatment methods and achieves the best sequential recommendation performance for both casual and core users. We also test L2Aug in an online simulation environment with real-time feedback to further validate its efficacy, and showcase its flexibility in supporting different augmentation actions.
Sequential recommendation holds the promise of understanding user preference by capturing successive behavior correlations. Existing research focus on designing different models for better fitting the offline datasets. However, the observational data may have been contaminated by the exposure or selection biases, which renders the learned sequential models unreliable. In order to solve this fundamental problem, in this paper, we propose to reformulate the sequential recommendation task with the potential outcome framework, where we are able to clearly understand the data bias mechanism and correct it by re-weighting the training instances with the inverse propensity score (IPS). For more robustness modeling, a clipping strategy is applied to the IPS estimation to reduce the variance of the learning objective. To make our framework more practical, we design a parameterized model to remove the impact of the potential latent confounders. At last, we theoretically analyze the unbiasedness of the proposed framework under both vanilla and clipping IPS estimations. To the best of our knowledge, this is the first work on debiased sequential recommendation. We conduct extensive experiment based on both synthetic and real-world datasets to demonstrate the effectiveness of our framework.
In many personalized recommendation scenarios, the generalization ability of a target task can be improved via learning with additional auxiliary tasks alongside this target task on a multi-task network. However, this method often suffers from a serious optimization imbalance problem. On the one hand, one or more auxiliary tasks might have a larger influence than the target task and even dominate the network weights, resulting in worse recommendation accuracy for the target task. On the other hand, the influence of one or more auxiliary tasks might be too weak to assist the target task. More challenging is that this imbalance dynamically changes throughout the training process and varies across the parts of the same network. We propose a new method: MetaBalance to balance auxiliary losses via directly manipulating their gradients w.r.t the shared parameters in the multi-task network. Specifically, in each training iteration and adaptively for each part of the network, the gradient of an auxiliary loss is carefully reduced or enlarged to have a closer magnitude to the gradient of the target loss, preventing auxiliary tasks from being so strong that dominate the target task or too weak to help the target task. Moreover, the proximity between the gradient magnitudes can be flexibly adjusted to adapt MetaBalance to different scenarios. The experiments show that our proposed method achieves a significant improvement of 8.34% in terms of NDCG@10 upon the strongest baseline on two real-world datasets. The code of our approach can be found at here.1
Effectively representing users lie at the core of modern recommender systems. Since users’ interests naturally exhibit multiple aspects, it is of increasing interest to develop multi-interest frameworks for recommendation, rather than represent each user with an overall embedding. Despite their effectiveness, existing methods solely exploit the encoder (the forward flow) to represent multiple aspects of interests. However, without explicit regularization, the interest embeddings may not be distinct from each other nor semantically reflect representative historical items. Towards this end, we propose the Re4 framework, which leverages the backward flow to reexamine each interest embedding. Specifically, Re4 encapsulates three backward flows, i.e., 1) Re-contrast, which drives each interest embedding to be distinct from other interests using contrastive learning; 2) Re-attend, which ensures the interest-item correlation estimation in the forward flow to be consistent with the criterion used in final recommendation; and 3) Re-construct, which ensures that each interest embedding can semantically reflect the information of representative items that relate to the corresponding interest. We demonstrate the novel forward-backward multi-interest paradigm on ComiRec, and perform extensive experiments on three real-world datasets. Empirical studies validate that Re4 helps to learn learning distinct and effective multi-interest representations.
Session-based recommendation has recently attracted increasing attention from both industry and academic communities. Previous models mostly focus on designing different models to fit the observed data, which can be quite sparse in real-world scenarios. To alleviate this problem, in this paper, we propose a novel generative session-based recommendation framework. The main building block of our idea is to develop a generator to simulate user sequential behaviors, which are leveraged to train and improve the target sequential recommender model. In order to generate high quality samples, we consider two aspects: (1) the rationality as a sequence of user behaviors, and (2) the informativeness for training the target model. To satisfy these requirements, we design a doubly adversarial network. The first adversarial module aims to make the generated samples conform to the underlying patterns of the real user sequential preference (rationality requirement). The second adversarial module is targeted at widening the model experiences by generating samples which can induce larger model losses (informativeness requirement). In our model, the samples are generated based on a reinforcement learning strategy, where the reward is related with both of the above aspects. In order to stable the training process, we introduce a self-paced regularizer to learn the agent in an easy-to-hard manner. We conduct extensive experiments based on real-world datasets to demonstrate the effectiveness of our model.
Most machine learning classifiers only concern classification accuracy, while certain applications (such as medical diagnosis, meteorological forecasting, and computation advertising) require the model to predict the true probability, known as a calibrated estimate. In previous work, researchers have developed several calibration methods to post-process the outputs of a predictor to obtain calibrated values, such as binning and scaling methods. Compared with scaling, binning methods are shown to have distribution-free theoretical guarantees, which motivates us to prefer binning methods for calibration. However, we notice that existing binning methods have several drawbacks: (a) the binning scheme only considers the original prediction values, thus limiting the calibration performance; and (b) the binning approach is non-individual, mapping multiple samples in a bin to the same value, and thus is not suitable for order-sensitive applications. In this paper, we propose a feature-aware binning framework, called Multiple Boosting Calibration Trees (MBCT), along with a multi-view calibration loss to tackle the above issues. Our MBCT optimizes the binning scheme by the tree structures of features, and adopts a linear function in a tree node to achieve individual calibration. Our MBCT is non-monotonic, and has the potential to improve order accuracy, due to its learnable binning scheme and the individual calibration. We conduct comprehensive experiments on three datasets in different fields. Results show that our method outperforms all competing models in terms of both calibration error and order accuracy. We also conduct simulation experiments, justifying that the proposed multi-view calibration loss is a better metric in modeling calibration error. In addition, our approach is deployed in a real-world online advertising platform; an A/B test over two weeks further demonstrates the effectiveness and great business value of our approach.
Conducting experiments with objectives that take significant delays to materialize (e.g. conversions, add-to-cart events, etc.) is challenging. Although the classical “split sample testing” is still valid for the delayed feedback, the experiment will take longer to complete, which also means spending more resources on worse-performing strategies due to their fixed allocation schedules. Alternatively, adaptive approaches such as “multi-armed bandits” are able to effectively reduce the cost of experimentation. But these methods generally cannot handle delayed objectives directly out of the box. This paper presents an adaptive experimentation solution tailored for delayed binary feedback objectives by estimating the real underlying objectives before they materialize and dynamically allocating variants based on the estimates. Experiments show that the proposed method is more efficient for delayed feedback compared to various other approaches and is robust in different settings. In addition, we describe an experimentation product powered by this algorithm. This product is currently deployed in the online experimentation platform of JD.com, a large e-commerce company and a publisher of digital ads.
Modeling user’s long-term and short-term interests is crucial for accurate recommendation. However, since there is no manually annotated label for user interests, existing approaches always follow the paradigm of entangling these two aspects, which may lead to inferior recommendation accuracy and interpretability. In this paper, to address it, we propose a Contrastive learning framework to disentangle Long and Short-term interests for Recommendation (CLSR) with self-supervision. Specifically, we first propose two separate encoders to independently capture user interests of different time scales. We then extract long-term and short-term interests proxies from the interaction sequences, which serve as pseudo labels for user interests. Then pairwise contrastive tasks are designed to supervise the similarity between interest representations and their corresponding interest proxies. Finally, since the importance of long-term and short-term interests is dynamically changing, we propose to adaptively aggregate them through an attention-based network for prediction. We conduct experiments on two large-scale real-world datasets for e-commerce and short-video recommendation. Empirical results show that our CLSR consistently outperforms all state-of-the-art models with significant improvements: GAUC is improved by over 0.01, and NDCG is improved by over 4%. Further counterfactual evaluations demonstrate that stronger disentanglement of long and short-term interests is successfully achieved by CLSR. The code and data are available at https://github.com/tsinghua-fib-lab/CLSR.
With the prosperity of recommender systems, the biases existing in user behaviors, which may lead to inconsistency between user preference and behavior records, have attracted wide attention. Though large efforts have been made to infer user preference from biased data with learning to debias, unfortunately, they mainly focus on the effect of one specific item attribute, e.g., position or modality which may affect users’ click probability on items. However, the comprehensive description for potential interactions between multiple items with various attributes, namely the context bias between items, may not be fully summarized. To that end, in this paper, we design a novel Context Bias aware Recommendation (CBR) model for describing and debiasing the context bias caused by comprehensive interactions between multiple items. Specifically, we first propose a content encoder and a bias encoder based on multi-head self-attention to embed the latent interactions between items. Then, we calculate the biased representation for users based on an attention network, which will be further utilized to infer the negative preference, i.e., the dislikes of users based on the items the user never clicked. Finally, the real user preference will be captured based on the negative preference to estimate the click prediction score. Extensive experiments on a real-world dataset demonstrate the competitiveness of our CBR framework compared with state-of-the-art baseline methods.
Personalized product search that provides users with customized search services is an important task for e-commerce platforms. This task remains a challenge when inferring users’ preferences from few records or even no records, which is also known as the few-shot or zero-shot learning problem. In this paper, we propose a Bayesian Online Meta-Learning Model (BOML), which transfers meta-knowledge, from the inference for other users’ preferences, to help to infer the current user’s interest behind her/his few or even no historical records. To extract meta-knowledge from various inference patterns, our model constructs a mixture of meta-knowledge and transfers the corresponding meta-knowledge to the specific user according to her/his records. Based on the meta-knowledge learned from other similar inferences, our proposed model searches a ranked list of products to meet users’ personalized query intents for those with few search records (i.e., few-shot learning problem) or even no search records (i.e., zero-shot learning problem). Under the records arriving sequentially setting, we propose an online variational inference algorithm to update meta-knowledge over time. Experimental results demonstrate that our proposed BOML outperforms state-of-the-art algorithms.
Sequential recommendation basically aims to capture user evolving preference. Intuitively, a user interacts with an item usually because of some specific feature, and user evolving preference is essentially determined by a series of important features along the time line. However, existing sequential models usually represent each item by a unified embedding, which fails to distinguish item features, let along modeling the feature sequences. To bridge this gap, in this paper, we propose a novel sequential recommender model by learning the key item feature sequences underlying user behaviors, which facilitates more focused model optimization and better recommendation performance. To achieve this goal, we firstly represent each item by explicit or latent features, and then build both soft and hard models to route optimal feature sequences. More specifically, in the soft model, we design a 2D attention mechanism, which simultaneously distinguishes the importances of the items in a sequence and the features for the same item. For the hard model, we regard the feature routing problem as a Markov decision process, and propose a reinforcement learning method to generate feature sequences, which can lead to the lowered negative log-likelihood. In the experiments, we compare our model with the state-of-the-art methods based on real-world datasets, where we can empirically demonstrate 8.2 and 16.1 improvements of our model on NDCG and MRR, respectively.
TikTok currently is the fastest growing social media platform with over 1 billion active monthly users of which the majority is from generation Z. Arguably, its most important success driver is its recommendation system. Despite the importance of TikTok’s algorithm to the platform’s success and content distribution, little work has been done on the empirical analysis of the algorithm. Our work lays the foundation to fill this research gap. Using a sock-puppet audit methodology with a custom algorithm developed by us, we tested and analysed the effect of the language and location used to access TikTok, follow- and like-feature, as well as how the recommended content changes as a user watches certain posts longer than others. We provide evidence that all the tested factors influence the content recommended to TikTok users. Further, we identified that the follow-feature has the strongest influence, followed by the like-feature and video view rate. We also discuss the implications of our findings in the context of the formation of filter bubbles on TikTok and the proliferation of problematic content.
Offering incentives (e.g., coupons at Amazon, discounts at Uber and video bonuses at Tiktok) to user is a common strategy used by online platforms to increase user engagement and platform revenue. Despite its proven effectiveness, these marketing incentives incur an inevitable cost and might result in a low ROI (Return on Investment) if not used properly. On the other hand, different users respond differently to these incentives, for instance, some users never buy certain products without coupons, while others do anyway. Thus, how to select the right amount of incentives (i.e. treatment) to each user under budget constraints is an important research problem with great practical implications. In this paper, we call such problem as a budget-constrained treatment selection (BTS) problem.
The challenge is how to efficiently solve BTS problem on a Large-Scale dataset and achieve improved results over the existing techniques. We propose a novel tree-based treatment selection technique under budget constraints, called Large-Scale Budget-Constrained Causal Forest (LBCF) algorithm, which is also an efficient treatment selection algorithm suitable for modern distributed computing systems. A novel offline evaluation method is also proposed to overcome an intrinsic challenge in assessing solutions’ performance for BTS problem in randomized control trials (RCT) data. We deploy our approach in a real-world scenario on a large-scale video platform, where the platform gives away bonuses in order to increase users’ campaign engagement duration. The simulation analysis, offline and online experiments all show that our method outperforms various tree-based state-of-the-art baselines 1. The proposed approach is currently serving over hundreds of millions of users on the platform and achieves one of the most tremendous improvements over these months.
Recently, graph collaborative filtering methods have been proposed as an effective recommendation approach, which can capture users’ preference over items by modeling the user-item interaction graphs. Despite the effectiveness, these methods suffer from data sparsity in real scenarios. In order to reduce the influence of data sparsity, contrastive learning is adopted in graph collaborative filtering for enhancing the performance. However, these methods typically construct the contrastive pairs by random sampling, which neglect the neighboring relations among users (or items) and fail to fully exploit the potential of contrastive learning for recommendation.
To tackle the above issue, we propose a novel contrastive learning approach, named Neighborhood-enriched Contrastive Learning, named NCL, which explicitly incorporates the potential neighbors into contrastive pairs. Specifically, we introduce the neighbors of a user (or an item) from graph structure and semantic space respectively. For the structural neighbors on the interaction graph, we develop a novel structure-contrastive objective that regards users (or items) and their structural neighbors as positive contrastive pairs. In implementation, the representations of users (or items) and neighbors correspond to the outputs of different GNN layers. Furthermore, to excavate the potential neighbor relation in semantic space, we assume that users with similar representations are within the semantic neighborhood, and incorporate these semantic neighbors into the prototype-contrastive objective. The proposed NCL can be optimized with EM algorithm and generalized to apply to graph collaborative filtering methods. Extensive experiments on five public datasets demonstrate the effectiveness of the proposed NCL, notably with 26% and 17% performance gain over a competitive graph collaborative filtering base model on the Yelp and Amazon-book datasets, respectively. Our implementation code is available at: https://github.com/RUCAIBox/NCL.
Knowledge tracing is the task of understanding student’s knowledge acquisition processes by estimating whether to solve the next question correctly or not. Most deep learning-based methods tackle this problem by identifying hidden representations of knowledge states from learning histories. However, due to the sparse interactions between students and questions, the hidden representations can be easily over-fitted and often fail to capture student’s knowledge states accurately. This paper introduces a contrastive learning framework for knowledge tracing that reveals semantically similar or dissimilar examples of a learning history and stimulates to learn their relationships. To deal with the complexity of knowledge acquisition during learning, we carefully design the components of contrastive learning, such as architectures, data augmentation methods, and hard negatives, taking into account pedagogical rationales. Our extensive experiments on six benchmarks show statistically significant improvements from the previous methods. Further analysis shows how our methods contribute to improving knowledge tracing performances.
The majority of recent work in latent Collaborative Filtering (CF) has focused on developing new model architectures to learn accurate user and item representations. Typically, a standard pairwise loss function (BPR, Triplet, etc.) is used in these models, and little exploration is done on how to optimally extract signals from the available preference information. In the implicit setting, negative examples are sampled, and these losses allocate weights that solely depend on the difference in user distance between observed (positive) and negative item pairs. This can ignore valuable global information from other users and items, and lead to sub-optimal results. Motivated by this problem, we propose a novel loss which first leverages mining to select the most informative pairs, followed by a weighing process to allocate more weight to harder examples. Our weighting process consists of four different components, and incorporates distance information from other users, enabling the model to better position the learned representations. We conduct extensive experiments and demonstrate that our loss can be applied to different types of CF models leading to significant gains with each type. In particular, by applying our loss to the graph convolutional architecture, we achieve new state-of-the-art results on four different datasets. Further analysis shows that through our loss the model is able to learn better user-item representation space compared to other losses. Full code for this work is available here: https://github.com/layer6ai-labs/MCL.
Reinforcement learning has recently become an active topic in recommender system research, where the logged data that records interactions between items and users feedback is used to discover the policy. Much off-policy learning, referring to the procedure of policy optimization with access only to logged feedback data, has been a popular research topic in reinforcement learning. However, the log entries are biased in that the logs over-represent actions favored by the recommender system, as the user feedback contains only partial information limited to the particular items exposed to the user. As a result, the policy learned from such off-line logged data tends to be biased from the true behaviour policy.
In this paper, we are the first to propose a novel off-policy learning augmented by meta-paths for the recommendation. We argue that the Heterogeneous information network (HIN), which provides rich contextual information of items and user aspects, could scale the logged data contribution for unbiased target policy learning. Towards this end, we develop a new HIN augmented target policy model (HINpolicy), which explicitly leverages contextual information to scale the generated reward for target policy. In addition, being equipped with the HINpolicy model, our solution adaptively receives HIN-augmented corrections for counterfactual risk minimization, and ultimately yields an effective policy to maximize the long run rewards for the recommendation. Finally, we extensively evaluate our method through a series of simulations and large-scale real-world datasets, obtaining favorable results compared with state-of-the-art methods.
Recommender systems are incremental in nature. Recent progresses in incremental recommendation rely on capturing the temporal dynamics of users/items from temporal interaction graphs, so that their user/item embeddings can evolve together with the graph structures. However, these methods are faced with two key challenges: 1) model training and/or updating are time-consuming and 2) new users/items cannot be effectively handled. To this end, we propose the fast incremental recommendation (FIRE) method from a graph signal processing perspective. FIRE is non-parametric which does not suffer from the time-consuming back-propagations as in previous learning-based methods, significantly improving the efficiency of model updating. In addition, we encode user/item temporal information and side information by designing new graph filters in FIRE, which can capture the temporal dynamics of users/items and address the cold-start issue for new users/items, respectively. Experimental studies on four popular datasets demonstrate that FIRE can improve the accuracy by a large margin and improve the model updating efficiency by at least 3X compared with the state-of-the-art incremental recommendation algorithms. The Code is available at https://github.com/Yaveng/FIRE.
Most recommender systems optimize the model on observed interaction data, which is affected by the previous exposure mechanism and exhibits many biases like popularity bias. The loss functions, such as the mostly used pointwise Binary Cross-Entropy and pairwise Bayesian Personalized Ranking, are not designed to consider the biases in observed data. As a result, the model optimized on the loss would inherit the data biases, or even worse, amplify the biases. For example, a few popular items take up more and more exposure opportunities, severely hurting the recommendation quality on niche items — known as the notorious Mathew effect.
In this work, we develop a new learning paradigm named Cross Pairwise Ranking (CPR) that achieves unbiased recommendation without knowing the exposure mechanism. Distinct from inverse propensity scoring (IPS), we change the loss term of a sample — we innovatively sample multiple observed interactions once and form the loss as the combination of their predictions. We prove in theory that this way offsets the influence of user/item propensity on the learning, removing the influence of data biases caused by the exposure mechanism. Advantageous to IPS, our proposed CPR ensures unbiased learning for each training instance without the need of setting the propensity scores. Experimental results demonstrate the superiority of CPR over state-of-the-art debiasing solutions in both model generalization and training efficiency. The codes are available at https://github.com/Qcactus/CPR.
Recently, user-oriented auto-encoders (UAEs) have been widely used in recommender systems to learn semantic representations of users based on their historical ratings. However, since latent item variables are not modeled in UAE, it is difficult to utilize the widely available item content information when ratings are sparse. In addition, whenever new items arrive, we need to wait for collecting rating data for these items and retrain the UAE from scratch, which is inefficient in practice. Aiming to address the above two problems simultaneously, we propose a mutually-regularized dual collaborative variational auto-encoder (MD-CVAE) for recommendation. First, by replacing randomly initialized last layer weights of the vanilla UAE with stacked latent item embeddings, MD-CVAE integrates two heterogeneous information sources, i.e., item content and user ratings, into the same principled variational framework where the weights of UAE are regularized by item content such that convergence to a non-optima due to data sparsity can be avoided. In addition, the regularization is mutual in that user ratings can also help the dual item content module learn more recommendation-oriented item content embeddings. Finally, we propose a symmetric inference strategy for MD-CVAE where the first layer weights of the UAE encoder are tied to the latent item embeddings of the UAE decoder. Through this strategy, no retraining is required to recommend newly introduced items. Empirical studies show the effectiveness of MD-CVAE in both normal and cold-start scenarios. Codes are available at https://github.com/yaochenzhu/MD-CVAE.
Recently, deep neural networks such as RNN, CNN and Transformer have been applied in the task of sequential recommendation, which aims to capture the dynamic preference characteristics from logged user behavior data for accurate recommendation. However, in online platforms, logged user behavior data is inevitable to contain noise, and deep recommendation models are easy to overfit on these logged data. To tackle this problem, we borrow the idea of filtering algorithms from signal processing that attenuates the noise in the frequency domain. In our empirical experiments, we find that filtering algorithms can substantially improve representative sequential recommendation models, and integrating simple filtering algorithms (e.g., Band-Stop Filter) with an all-MLP architecture can even outperform competitive Transformer-based models. Motivated by it, we propose FMLP-Rec, an all-MLP model with learnable filters for sequential recommendation task. The all-MLP architecture endows our model with lower time complexity, and the learnable filters can adaptively attenuate the noise information in the frequency domain. Extensive experiments conducted on eight real-world datasets demonstrate the superiority of our proposed method over competitive RNN, CNN, GNN and Transformer-based methods. Our code and data are publicly available at the link: https://github.com/RUCAIBox/FMLP-Rec .
The increasing complexity of the web has attracted a number of solutions to offer optimized versions of web pages that are lighter to process and faster to load. These solutions have been quantitatively evaluated to show significant speed-ups in load times and/or considerable savings in bandwidth/memory consumption. However, while these solutions often produce optimized versions from existing pages, they rarely evaluate the impact of their optimizations on the original content and functionality. Additionally, due to the lack of a unified metric to evaluate the similarity of the pages generated by these solutions in comparison to the original pages, it is not yet possible to fairly compare the results obtained from different user studies campaigns, unless recruiting the exact same users, which is extremely challenging. In this paper, we demonstrate the lack of qualitative evaluation metrics, and propose QLUE (QuaLitative Uniform Evaluation), a tool that automates the qualitative evaluation of web pages generated by web complexity solutions with respect to their original versions using computer vision. QLUE evaluates the content and the functionality of these pages separately using two metrics: QLUE’s Structural Similarity, to assess the former, and QLUE’s Functional Similarity to assess the latter—a task that is proven to be challenging for humans given the complex functional dependencies in modern pages. Our results show that QLUE computes comparable content and functional scores to those provided by humans. Specifically, 90% of a set of 100 pages were given a similarity score between 90% and 100% by human evaluators, while QLUE shows similar scores for more than 75% of the same pages. In terms of time complexity, QLUE shows that it is capable of evaluating an optimized web page in a few minutes.
Interactive recommender systems (RSs) allow users to express intent, preferences and contexts in a rich fashion, often using natural language. One challenge in using such feedback is inferring a user’s semantic intent from the open-ended terms used to describe an item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) , we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in RSs. A novel feature of our approach is its ability to distinguish objective and subjective attributes and associate different senses with different users. Using synthetic and real-world datasets, we show that our CAV representation accurately interprets users’ subjective semantics, and can improve recommendations via interactive critiquing.
Managing uncertainty in preferences is core to creating the next generation of conversational recommender systems (CRS). However, an often-overlooked element of conversational interaction is the role of clarification. Users are notoriously noisy at revealing their preferences, and a common error is being unnecessarily specific, e.g., suggesting ”chicken fingers” when a restaurant with a ”kids menu” was the intended preference. Correcting such errors requires reasoning about the level of generality and specificity of preferences and verifying that the user has expressed the correct level of generality. To this end, we propose a novel clarification-based conversational critiquing framework that allows the system to clarify user preferences as it accepts critiques. To support clarification, we propose the use of distributional embeddings that can capture the specificity and generality of concepts through distributional coverage while facilitating state-of-the-art embedding-based recommendation methods. Specifically, we incorporate Distributional Contrastive Embeddings of critiqueable keyphrases with user preference embeddings in a Variational Autoencoder recommendation framework that we term DCE-VAE. Our experiments show that our proposed DCE-VAE is (1) competitive in terms of general performance in comparison to state-of-the-art recommenders and (2) supports effective clarification-based critiquing in comparison to alternative clarification baselines. In summary, this work adds a new dimension of clarification to enhance the well-known critiquing framework along with a novel data-driven distributional embedding for clarification suggestions that significantly improves the efficacy of user interaction with critiquing-based CRSs.
Recommender systems are modulating what billions of people are exposed to on a daily basis. Typically, these systems are optimized for user engagement signals such as clicks, streams, likes, or a weighted combination of such sets. Despite the pervasiveness of this practice, little research has been done to explore the downstream impacts of optimization choice on users, creators and the ecosystem they are offered in. We used a platform that caters recommendations to millions of people and show in practice what you optimize for can have a large impact on the content users are exposed to, as well as what they end up consuming.
In this work, we use podcast recommendations with two engagement signals: Subscription vs. Plays to show that the choice of user engagement matters. We deployed recommendation models optimized for each signal in production and observed that consumption outcomes substantially defer depending on the target used. Upon further investigation, we observed that users’ patterns of podcast engagement depend on the type of podcast, and each podcast can cater to specific user goals & needs. Optimizing for streams can bias the recommendations towards certain podcast types, undermine users’ aspirational interests and put some show categories at disadvantage. Finally, using calibration we demonstrate that informed balanced recommendations can help address this issue and thereby satisfy diverse user interests.
Music streaming services heavily rely upon recommender systems to acquire, engage, and retain users. One notable component of these services are playlists, which can be dynamically generated in a sequential manner based on the user’s feedback during a listening session. Online learning to rank approaches have recently been shown effective at leveraging such feedback to learn users’ preferences in the space of song features. Nevertheless, these approaches can suffer from slow convergence as a result of their random exploration component and get stuck in local minima as a result of their session-agnostic exploitation component. To overcome these limitations, we propose a novel online learning to rank approach which efficiently explores the space of candidate recommendation models by restricting itself to the orthogonal complement of the subspace of previous underperforming exploration directions. Moreover, to help overcome local minima, we propose a session-aware exploitation component which adaptively leverages the current best model during model updates. Our thorough evaluation using simulated listening sessions from Last.fm demonstrates substantial improvements over state-of-the-art approaches regarding early-stage performance and overall long-term convergence.
The personalized recommendation is an essential part of modern e-commerce, where user’s demands are not only conditioned by their profile but also by their recent browsing behaviors as well as periodical purchases made some time ago. In this paper, we propose a novel framework named Search-based Time-Aware Recommendation (STARec), which captures the evolving demands of users over time through a unified search-based time-aware model. More concretely, we first design a search-based module to retrieve a user’s relevant historical behaviors, which are then mixed up with her recent records to be fed into a time-aware sequential network for capturing her time-sensitive demands. Besides retrieving relevant information from her personal history, we also propose to search and retrieve similar user’s records as an additional reference. All these sequential records are further fused to make the final recommendation. Beyond this framework, we also develop a novel label trick that uses the previous labels (i.e., user’s feedbacks) as the input to better capture the user’s browsing pattern. We conduct extensive experiments on three real-world commercial datasets on click-through-rate prediction tasks against state-of-the-art methods. Experimental results demonstrate the superiority and efficiency of our proposed framework and techniques. Furthermore, results of online experiments on a daily item recommendation platform of Company X show that STARec gains average performance improvement of around 6% and 1.5% in its two main item recommendation scenarios on CTR metric respectively.
In large-scale recommender systems, the user-item networks are generally scale-free or expand exponentially. For the representation of the user and item, the latent features (a.k.a, embeddings) depend on how well the embedding space matches the data distribution. Hyperbolic space offers a spacious room to learn embeddings with its negative curvature and metric properties, which can well fit data with tree-like structures. Recently, several hyperbolic approaches have been proposed to learn high-quality representations for the users and items. However, most of them concentrate upon developing the hyperbolic similitude by designing appropriate projection operations, whereas many advantageous and exciting geometric properties of hyperbolic space have not been explicitly explored. For example, one of the most notable properties of hyperbolic space is that its capacity space increases exponentially with the radius, which indicates the area far away from the hyperbolic origin is much more embeddable. Regarding the geometric properties of hyperbolic space, we bring up a Hyperbolic Regularization powered Collaborative Filtering (HRCF) and design a geometric-aware hyperbolic regularizer. Specifically, the proposal boosts optimization procedure via the root alignment and origin-aware penalty, which is simple yet impressively effective. Through theoretical analysis, we further show that our proposal is able to tackle the over-smoothing problem caused by the hyperbolic aggregation and also brings the models a better discriminative ability. We conduct extensive empirical analysis, comparing our proposal against a large set of baselines on several public benchmarks. The empirical results show that our approach achieves highly competitive performance and surpasses both the leading Euclidean and hyperbolic baselines by considerable margins. Further analysis verifies the rationality and effectiveness of the proposal for robust, deeper, and lightweight neural graph collaborative filtering.
Multi-task learning (MTL) has been widely utilized in various industrial scenarios, such as recommender systems and search engines. MTL can improve learning efficiency and prediction accuracy by exploiting commonalities and differences across tasks. However, MTL is sensitive to relationships among tasks and may have performance degradation in real-world applications, because existing neural-based MTL models often share the same network structures and original input features. To address this issue, we propose a novel multi-task learning model based on Prototype Feature Extraction (PFE) to balance task-specific objectives and inter-task relationships. PFE is a novel component to disentangle features for multiple tasks. To better extract features from original inputs before gating networks, we introduce a new concept, namely prototype feature center, to disentangle features for multiple tasks. The extracted prototype features fuse various features from different tasks to better learn inter-task relationships. PFE updates prototype feature centers and prototype features iteratively. Our model utilizes the learned prototype features and task-specific experts for MTL. We implement PFE on two public datasets. Empirical results show that PFE outperforms state-of-the-art MTL models by extracting prototype features. Furthermore, we deploy PFE in a real-world recommender system (one of the world’s top-tier short video sharing platforms) to showcase that PFE can be widely applied in industrial scenarios.
Motivated by the recent successes of deep generative models used for collaborative filtering, we propose a novel framework of VAE for collaborative filtering using multiple experts and stochastic expert selection, which allows the model to learn a richer and more complex latent representation of user preferences. In our method, individual experts are sampled stochastically at each user-item interaction which can effectively utilize the variability among multiple experts. While we propose this framework in the context of collaborative filtering, the proposed stochastic expert technique can be used to enhance VAEs in general beyond the application of collaborative filtering. Hence, this novel technique can be of independent interest. We comprehensively evaluate our proposed method, Stochastic-Expert Variational Autoencoder (SE-VAE) on numerical experiments on the real-world benchmark datasets from MovieLens and Netflix and show that it consistently outperforms the existing state-of-the-art methods across all metrics. Our proposed stochastic expert framework is generic and adaptable to any VAE architecture. The experimental results show that the adaptations to various architectures provided performance gains over the existing methods.
Email services use spam filtering algorithms (SFAs) to filter emails that are unwanted by the user. However, at times, the emails perceived by an SFA as unwanted may be important to the user. Such incorrect decisions can have significant implications if SFAs treat emails of user interest as spam on a large scale. This is particularly important during national elections. To study whether the SFAs of popular email services have any biases in treating the campaign emails, we conducted a large-scale study of the campaign emails of the US elections 2020 by subscribing to a large number of Presidential, Senate, and House candidates using over a hundred email accounts on Gmail, Outlook, and Yahoo. We analyzed the biases in the SFAs towards the left and the right candidates and further studied the impact of the interactions (such as reading or marking emails as spam) of email recipients on these biases. We observed that the SFAs of different email services indeed exhibit biases towards different political affiliations.
The prevalence and perniciousness of fake news has been a critical issue on the Internet, which stimulates the development of automatic fake news detection in turn. In this paper, we focus on the evidence-based fake news detection, where several evidences are utilized to probe the veracity of news (i.e., a claim). Most previous methods first employ sequential models to embed the semantic information and then capture the claim-evidence interaction based on different attention mechanisms. Despite their effectiveness, they still suffer from two main weaknesses. Firstly, due to the inherent drawbacks of sequential models, they fail to integrate the relevant information that is scattered far apart in evidences for veracity checking. Secondly, they neglect much redundant information contained in evidences that may be useless or even harmful. To solve these problems, we propose a unified Graph-based sEmantic sTructure mining framework, namely GET in short. Specifically, different from the existing work that treats claims and evidences as sequences, we model them as graph-structured data and capture the long-distance semantic dependency among dispersed relevant snippets via neighborhood propagation. After obtaining contextual semantic information, our model reduces information redundancy by performing graph structure learning. Finally, the fine-grained semantic representations are fed into the downstream claim-evidence interaction module for predictions. Comprehensive experiments have demonstrated the superiority of GET over the state-of-the-arts.
A growing number of people are seeking healthcare advice online. Usually, they diagnose their medical conditions based on the symptoms they are experiencing, which is also known as self-diagnosis. From the machine learning perspective, online disease diagnosis is a sequential feature (symptom) selection and classification problem. Reinforcement learning (RL) methods are the standard approaches to this type of tasks. Generally, they perform well when the feature space is small, but frequently become inefficient in tasks with a large number of features, such as the self-diagnosis. To address the challenge, we propose a non-RL Bipartite Scalable framework for Online Disease diAgnosis, called BSODA. BSODA is composed of two cooperative branches that handle symptom-inquiry and disease-diagnosis, respectively. The inquiry branch determines which symptom to collect next by an information-theoretic reward. We employ a Product-of-Experts encoder to significantly improve the handling of partial observations of a large number of features. Besides, we propose several approximation methods to substantially reduce the computational cost of the reward to a level that is acceptable for online services. Additionally, we leverage the diagnosis model to estimate the reward more precisely. For the diagnosis branch, we use a knowledge-guided self-attention model to perform predictions. In particular, BSODA determines when to stop inquiry and output predictions using both the inquiry and diagnosis models. We demonstrate that BSODA outperforms the state-of-the-art methods on several public datasets. Moreover, we propose a novel evaluation method to test the transferability of symptom checking methods from synthetic to real-world tasks. Compared to existing RL baselines, BSODA is more effectively scalable to large search spaces.
Digital advertising is the most popular way for content monetization on the Internet. Publishers spawn new websites, and older ones change hands with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively answer questions such as: Which entities monetize what websites? What categories of websites does an average entity typically monetize on and how diverse are these websites? How has this website administration ecosystem changed across time?
In this paper, we propose a novel, graph-based methodology to detect administration of websites on the Web, by exploiting the ad-related publisher-specific IDs. We apply our methodology across the top 1 million websites and study the characteristics of the created graphs of website administration. Our findings show that approximately 90% of the websites are associated each with a single publisher, and that small publishers tend to manage less popular websites. We perform a historical analysis of up to 8 million websites, and find a new, constantly rising number of (intermediary) publishers that control and monetize traffic from hundreds of websites, seeking a share of the ad-market pie. We also observe that over time, websites tend to move from big to smaller administrators.
We study the interaction between the consumption of digital services via mobile devices and urbanization levels, using measurement data collected in an operational network serving the whole territory of France. We unveil that such an interaction follows a power law, or, in other words, there exists an emergent behavior that prompts subscribers living in increasingly extended and populated urban areas to exhibit a surging individual consumption of mobile traffic. The result holds for the global traffic, but is also consistently observed across a range of mobile services, although with varying intensity. An unprecedented longitudinal analysis of the phenomenon unveils how the imbalance in the per-capita mobile data traffic usage across cities of different size has grown steadily and substantially in the 2014–2019 time frame in France. Our study raises questions on the presence of second-level digital divides in developed countries, and paves the road to further investigations.
Social media platforms are performing “soft moderation” by attaching warning labels to misinformation to reduce dissemination of, and engagement with, such content. This study investigates the warning labels that Twitter placed on Donald Trump’s false tweets about the 2020 US Presidential election. It specifically studies their relation to misinformation spread, and the magnitude and nature of user engagement. We categorize the warning labels by type –“veracity labels” calling out falsity and “contextual labels” providing more information. In addition, we categorize labels by their rebuttal strength and textual overlap (linguistic, topical) with the underlying tweet. We look at user interactions (liking, retweeting, quote tweeting, and replying), the content of user replies, and the type of user involved (partisanship and Twitter activity level) according to various standard metrics. Using appropriate statistical tools, we find that, overall, label placement did not change the propensity of users to share and engage with labeled content, but the falsity of content did. However, we show that the presence of textual overlap in labels did reduce user interactions, while stronger rebuttals reduced the toxicity in comments. We also find that users were more likely to discuss their positions on the underlying tweets in replies when the labels contained rebuttals. When false content was labeled, results show that liberals engaged more than conservatives. Labels also increased the engagement of more passive Twitter users. This case study has direct implications for the design of effective soft moderation and related policies.
Population-level disease prediction estimates the number of potential patients of particular diseases in some location at a future time based on (frequently updated) historical disease statistics. Existing approaches often assume the existing disease statistics are reliable and will not change. However, in practice, data collection is often time-consuming and has time delays, with both historical and current disease statistics being updated continuously. In this work, we propose a real-time population-level disease prediction model which captures data latency (PopNet) and incorporates the updated data for improved predictions. To achieve this goal, PopNet models real-time data and updated data using two separate systems, each capturing spatial and temporal effects using hybrid graph attention networks and recurrent neural networks. PopNet then fuses the two systems using both spatial and temporal latency-aware attentions in an end-to-end manner. We evaluate PopNet on real-world disease datasets and show that PopNet consistently outperforms all baseline disease prediction and general spatial-temporal prediction models, achieving up to 47% lower root mean squared error and 24% lower mean absolute error compared with the best baselines.
User-generated text on social media is a promising avenue for public health surveillance and has been actively explored for its feasibility in the early identification of depression. Existing methods in the identification of depression have shown promising results; however, these methods were all focused on treating the identification as a binary classification problem. To date, there has been little effort towards identifying users’ depression severity level and disregard the inherent ordinal nature across these fine-grain levels. This paper aims to make early identification of depression severity levels on social media data. To accomplish this, we built a new dataset based on the inherent ordinal nature over depression severity levels using clinical depression standards on Reddit posts. The posts were classified into 4 depression severity levels covering the clinical depression standards on social media. Accordingly, we reformulate the early identification of depression as an ordinal classification task over clinical depression standards such as Beck’s Depression Inventory and the Depressive Disorder Annotation scheme to identify depression severity levels. With these, we propose a hierarchical attention method optimized to factor in the increasing depression severity levels through a soft probability distribution. We experimented using two datasets (a public dataset having more than one post from each user and our built dataset with a single user post) using real-world Reddit posts that have been classified according to questionnaires built by clinical experts and demonstrated that our method outperforms state-of-the-art models. Finally, we conclude by analyzing the minimum number of posts required to identify depression severity level followed by a discussion of empirical and practical considerations of our study.
In a user-generated text such as on social media platforms and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts where users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically only studies health mentions in Twitter, with limited coverage of disease or symptom terms, ignore user behavior information, and other ways people use disease or symptom terms. To advance the HMC research, we present a Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for the HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: namely personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMCNET that combines a target keyword (disease or symptom term) identification and user behavior hierarchically to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-Score of 0.75 (an increase of 11% over the state-of-the-art) and shows that our new dataset poses a strong challenge to the existing HMC methods.
WhatsApp is a popular messaging app used by over a billion users around the globe. Due to this popularity, understanding misbehavior on WhatsApp is an important issue. The sending of unwanted junk messages by unknown contacts via WhatsApp remains understudied by researchers, in part because of the end-to-end encryption offered by the platform. We address this gap by studying junk messaging on a multilingual dataset of 2.6M messages sent to 5K public WhatsApp groups in India. We characterise both junk content and senders. We find that nearly 1 in 10 messages is unwanted content sent by junk senders, and a number of unique strategies are employed to reflect challenges faced on WhatsApp, e.g., the need to change phone numbers regularly. We finally experiment with on-device classification to automate the detection of junk, whilst respecting end-to-end encryption.
With the rise of social media, users from across the world are able to connect and converse with each other online. While these connections have facilitated a growth in knowledge, online discussions can also end in acrimonious conflict. Previous computational studies have focused on creating online conflict detection models from inferred labels, primarily examine disagreement but not acrimony, and do not examine the conflict’s emergence. Social science studies have investigated offline conflict, which can differ from its online form, and rarely examines its emergence. The current research aims to understand how online conflicts arise in online personal conversations. Our ground truth is a Facebook tool that allows group members to report conflict to administrators. We contrast discussions ending with a conflict report with paired non-conflict discussions from the same post. We study both user characteristics (e.g., historical user-to-user interactions) and conversation dynamics (e.g., changes in emotional intensity over the course of the conversation). We use logistic regression to identify the features that predict conflict. User characteristics such as the commenter’s gender and previous involvement in negative online activity are strong indicators of conflict. Conversational dynamics, such as an increase in person-oriented discussion, are also important signals of conflict. These results help us understand how conflicts emerge and suggest better detection models and ways to alert group administrators and members early on to mediate the conversation.
The COVID-19 pandemic has been the single most important global agenda in the past two years. In addition to its health and economic impacts, it has affected people’s psychological states, including a rise in depression and domestic violence. We traced how the overall emotional states of individual Twitter users changed before and after the pandemic. Our data, including more than 9 million tweets posted by 9,493 users, suggest that the threat posed by the virus did not upset the emotional equilibrium of social media. In early 2020, COVID-related tweets skyrocketed in number and were filled with negative emotions; however, this emotional outburst was short-lived. We found that users who had expressed positive emotions in the pre-COVID period remained positive after the initial outbreak, while the opposite was true for those who regularly expressed negative emotions. Individuals achieved such emotional consistency by selectively focusing on emotion-reinforcing topics. The implications are discussed in light of an emotionally motivated confirmation bias, which we conceptualize as emotion bubbles that demonstrate the public’s resilience to a global health risk.
Moderators and automated methods enforce bans on malicious users who engage in disruptive behavior. However, malicious users can easily create a new account to evade such bans. Previous research has focused on other forms of online deception, like the simultaneous operation of multiple accounts by the same entities (sockpuppetry), impersonation of other individuals, and studying the effects of de-platforming individuals and communities. Here we conduct the first data-driven study of ban evasion, i.e., the act of circumventing bans on an online platform, leading to temporally disjoint operation of accounts by the same user.
We curate a novel dataset of 8,551 ban evasion pairs (parent, child) identified on Wikipedia and contrast their behavior with benign users and non-evading malicious users. We find that evasion child accounts demonstrate similarities with respect to their banned parent accounts on several behavioral axes — from similarity in usernames and edited pages to similarity in content added to the platform and its psycholinguistic attributes. We reveal key behavioral attributes of accounts that are likely to evade bans. Based on the insights from the analyses, we train logistic regression classifiers to detect and predict ban evasion at three different points in the ban evasion lifecycle. Results demonstrate the effectiveness of our methods in predicting future evaders (AUC = 0.78), early detection of ban evasion (AUC = 0.85), and matching child accounts with parent accounts (MRR = 0.97). Our work can aid moderators by reducing their workload and identifying evasion pairs faster and more efficiently than current manual and heuristic-based approaches.
Social biases on Wikipedia, a widely-read global platform, could greatly influence public opinion. While prior research has examined man/woman gender bias in biography articles, possible influences of other demographic attributes limit conclusions. In this work, we present a methodology for analyzing Wikipedia pages about people that isolates dimensions of interest (e.g., gender), from other attributes (e.g., occupation). Given a target corpus for analysis (e.g. biographies about women), we present a method for constructing a comparison corpus that matches the target corpus in as many attributes as possible, except the target one. We develop evaluation metrics to measure how well the comparison corpus aligns with the target corpus and then examine how articles about gender and racial minorities (cis. women, non-binary people, transgender women, and transgender men; African American, Asian American, and Hispanic/Latinx American people) differ from other articles. In addition to identifying suspect social biases, our results show that failing to control for covariates can result in different conclusions and veil biases. Our contributions include methodology that facilitates further analyses of bias in Wikipedia articles, findings that can aid Wikipedia editors in reducing biases, and a framework and evaluation metrics to guide future work in this area.
Fact verification is a challenging task that requires the retrieval of multiple pieces of evidence from a reliable corpus for verifying the truthfulness of a claim. Although the current methods have achieved satisfactory performance, they still suffer from one or more of the following three problems: (1) unable to extract sufficient contextual information from the evidence sentences; (2) containing redundant evidence information and (3) incapable of capturing the interaction between claim and evidence. To tackle the problems, we propose an evidence fusion network called EvidenceNet. The proposed EvidenceNet model captures global contextual information from various levels of evidence information for deep understanding. Moreover, a gating mechanism is designed to filter out redundant information in evidence. In addition, a symmetrical interaction attention mechanism is also proposed for identifying the interaction between claim and evidence. We conduct extensive experiments based on the FEVER dataset. The experimental results have shown that the proposed EvidenceNet model outperforms the current fact verification methods and achieves the state-of-the-art performance.
Recently, almost all conferences have moved to virtual mode due to the pandemic-induced restrictions on travel and social gathering. Contrary to in-person conferences, virtual conferences face the challenge of efficiently scheduling talks, accounting for the availability of participants from different timezones and their interests in attending different talks. A natural objective for conference organizers is to maximize efficiency, e.g., total expected audience participation across all talks. However, we show that optimizing for efficiency alone can result in an unfair virtual conference schedule, where individual utilities for participants and speakers can be highly unequal. To address this, we formally define fairness notions for participants and speakers, and derive suitable objectives to account for them. As the efficiency and fairness objectives can be in conflict with each other, we propose a joint optimization framework that allows conference organizers to design schedules that balance (i.e., allow trade-offs) among efficiency, participant fairness and speaker fairness objectives. While the optimization problem can be solved using integer programming to schedule smaller conferences, we provide two scalable techniques to cater to bigger conferences. Extensive evaluations over multiple real-world datasets show the efficacy and flexibility of our proposed approaches.
Given a social issue that needs to be solved, decision-makers need to listen to the crowd opinions and preferences. However, existing online voting systems with limited capabilities cannot conduct such investigations. Our idea is that decision-makers can collect many human opinions from crowds on the web and then prioritize them for social decision-making. A solution of the prioritization entails collecting a large amount of pairwise preference comparisons from crowds and utilizing the aggregated preference labels as the collective preferences on the opinions. In practice, because there is a large number of combinations of all candidate opinion pairs, we can only collect a small number of labels for a small subset of pairs. How to utilize only a small number of pairwise crowd preferences on the opinions to estimate collective preferences is the problem. Existing works on preference aggregation methods for general scenarios utilize only the pairwise preference labels. In our scenario, additional contextual information, such as the text contents of the opinions, can potentially promote the aggregation performance. Therefore, we propose preference aggregation approaches that can effectively incorporate contextual information by externally or internally building the relations between the opinion contexts and preference scores. We propose approaches for both the homogeneous and heterogeneous settings of modeling the evaluators. The experiments conducted on real datasets collected from real-world crowdsourcing platform show that our approaches can generate better aggregation results than the baselines for estimating collective preferences, especially when there are only a small number of preference labels available.
Search engine results can include misinformation that is inaccurate, misleading, or even harmful. But people may not recognize or realize false information results when searching online. We suspect that the percentage of misinformation search results (misinformation density) may influence people’s search activities, learning outcomes, and search experience. We conducted a zoom-mediated “lab” user study to examine this matter. The experiment used a between-subjects design. We asked 60 participants to finish two health information search tasks using search engines with High, Medium, or Low misinformation density levels. To create these experimental settings, we trained task-dependent text classifiers to manipulate the number of correct and misinformation results displayed on SERPs. We collected participants’ search activities, responses to pre-task and post-task surveys, and answers to task-related factual questions before and after searching.
Our results indicate that search result misinformation density strongly affects users’ search behavior. High misinformation density made people search more frequently, use longer queries, and click on more results. However, such increased search activities did not lead to better search outcomes. Participants using the High misinformation density search engine answered factual questions less accurately and learned very limitedly from a search session than the two other systems. Moreover, participants in systems with a balanced amount of correct and misinformation results (Medium) could learn factual knowledge as effectively as others in a system with little misinformation (Low). Surprisingly, participants using different misinformation density systems did not rate their perceived goodness of search systems with significant differences, indicating that search engine misinformation may adversely but imperceptibly affect people and society. Our findings have disclosed the effects of misinformation density on health information search and offered insights to improve online health information search.
Analyzing the causal impact of different policies in reducing the spread of COVID-19 is of critical importance. The main challenge here is the existence of unobserved confounders (e.g., vigilance of residents) which influence both the presence of policies and the spread of COVID-19. Besides, as the confounders may be time-varying, it is even more difficult to capture them. Fortunately, the increasing prevalence of web data from various online applications provides an important resource of time-varying observational data, and enhances the opportunity to capture the confounders from them, e.g., the vigilance of residents over time can be reflected by the popularity of Google searches about COVID-19 at different time periods. In this paper, we study the problem of assessing the causal effects of different COVID-19 related policies on the outbreak dynamics in different counties at any given time period. To this end, we integrate COVID-19 related observational data covering different U.S. counties over time, and then develop a neural network based causal effect estimation framework which learns the representations of time-varying (unobserved) confounders from the observational data. Experimental results indicate the effectiveness of our proposed framework in quantifying the causal impact of policies at different granularities, ranging from a category of policies with a certain goal to a specific policy type. Compared with baseline methods, our assessment of policies is more consistent with existing epidemiological studies of COVID-19. Besides, our assessment also provides insights for future policy-making.
Social network platforms have enabled large-scale measurement of user-to-user networks such as friendships. Less studied is user sentiment about their networks, such as a user’s satisfaction with their number of friends. We surveyed over 85,000 Facebook users about how satisfied they were with their number of friends on Facebook, connecting these responses to their on-platform activity. As suggested in prior work, we’d expect users who are not satisfied with their friend count to have a higher probability of experiencing the friendship paradox: “your friends have more friends than you”. However in our sample, among users with more than 3,500 friends, no user experiences the friendship paradox. Instead, we still observe that those users with more friends would prefer to have even more friends. The friendship paradox also contributes to local perception bias, defined as the difference between the average number of friends among a user’s friends and the average friend count in the population. Users with a positive perception bias – their friends have more friends than others – are less satisfied with their friend count. We then introduce a weighted perception bias metric that considers the fact that different friends have different effects on an individual’s perception. We find this new weighted perception bias better distinguishes friend count satisfaction outcomes for users with high friend count when compared to the original perception bias metric. We conclude with modeling the behavior interactions via a machine learning model, demonstrating the heterogeneity in the interactions across users with different perception biases. Altogether, these findings offer more insights on users’ friend count satisfaction, which may provide guidelines to improve the user experience and promote healthy interactions.
While major strides have been made towards gender equality in public life, serious inequality remains in the domestic sphere, especially around parenting. The present study analyses discussions about parenting on Reddit (i.e., a content aggregation website) to explore audience effects and gender stereotypes. It suggests a novel method to study topical variation in individuals’ language when interacting with different audiences. Comments posted in 2020 were collected from three parenting subreddits (i.e., topical communities), described as being for fathers (r/Daddit), mothers (r/Mommit), and all parents (r/Parenting). Users posting on r/Parenting and r/Daddit or on r/Parenting and r/Mommit were assumed to identify as fathers or mothers, respectively, allowing gender comparison. Users’ comments on r/Parenting (to a mixed-gender audience) were compared with their comments to single-gender audiences on r/Daddit or r/Mommit using Latent Dirichlet Allocation (LDA) topic modelling.
Results show that the most discussed topic among parents is about education and family advice, a topic mainly discussed in the mixed-gender subreddit and more by fathers than mothers. The topic model also indicates that, when it comes to the basic needs of children (sleep, food, and medical care), mothers seem to be more concerned regardless of the audience. In contrast, topics such as birth and pregnancy announcements and physical appearance are more discussed by fathers in the father-centric subreddit. Overall, findings seem to show that mothers are generally more concerned about the practical sides of parenting while fathers’ expressed concerns are more contextual: with other fathers, there seems to be a desire to show their fatherhood and be recognized for it while they discuss education with mothers. These results demonstrate that concerns expressed by parents on Reddit are context-sensitive but also consistent with gender stereotypes, potentially reflecting a persistent gendered and unequal division of labour in parenting.
Conspiracy theories are increasingly a subject of research interest as society grapples with their rapid growth in areas such as politics or public health. Previous work has established YouTube as one of the most popular sites for people to host and discuss different theories. In this paper, we present an analysis of monetization methods of conspiracy theorist YouTube creators and the types of advertisers potentially targeting this content. We collect 184,218 ad impressions from 6,347 unique advertisers found on conspiracy-focused channels and mainstream YouTube content. We classify the ads into business categories and compare their prevalence between conspiracy and mainstream content. We also identify common offsite monetization methods. In comparison with mainstream content, conspiracy videos had similar levels of ads from well-known brands, but an almost eleven times higher prevalence of likely predatory or deceptive ads. Additionally, we found that conspiracy channels were more than twice as likely as mainstream channels to use offsite monetization methods, and 53% of the demonetized channels we observed were linking to third-party sites for alternative monetization opportunities. Our results indicate that conspiracy theorists on YouTube had many potential avenues to generate revenue, and that predatory ads were more frequently served for conspiracy videos.
Recommender systems typically suggest to users content similar to what they consumed in the past. If a user happens to be exposed to strongly polarized content, she might subsequently receive recommendations which may steer her towards more and more radicalized content, eventually being trapped in what we call a “radicalization pathway”. In this paper, we study the problem of mitigating radicalization pathways using a graph-based approach. Specifically, we model the set of recommendations of a “what-to-watch-next” recommender as a d-regular directed graph where nodes correspond to content items, links to recommendations, and paths to possible user sessions.
We measure the “segregation” score of a node representing radicalized content as the expected length of a random walk from that node to any node representing non-radicalized content. High segregation scores are associated to larger chances to get users trapped in radicalization pathways. Hence, we define the problem of reducing the prevalence of radicalization pathways by selecting a small number of edges to “rewire”, so to minimize the maximum of segregation scores among all radicalized nodes, while maintaining the relevance of the recommendations.
We prove that the problem of finding the optimal set of recommendations to rewire is NP-hard and NP-hard to approximate within any factor. Therefore, we turn our attention to heuristics, and propose an efficient yet effective greedy algorithm based on the absorbing random walk theory. Our experiments on real-world datasets in the context of video and news recommendations confirm the effectiveness of our proposal.
Online forums that allow participatory engagement between users have been transformative for public discussion of important issues. However, debates on such forums can sometimes escalate into full blown exchanges of hate or misinformation. An important tool in understanding and tackling such problems is to be able to infer the argumentative relation of whether a reply is supporting or attacking the post it is replying to. This so called polarity prediction task is difficult because replies may be based on external context beyond a post and the reply whose polarity is being predicted. We propose GraphNLI, a novel graph-based deep learning architecture that uses graph walk techniques to capture the wider context of a discussion thread in a principled fashion. Specifically, we propose methods to perform root-seeking graph walks that start from a post and captures its surrounding context to generate additional embeddings for the post. We then use these embeddings to predict the polarity relation between a reply and the post it is replying to. We evaluate the performance of our models on a curated debate dataset from Kialo, an online debating platform. Our model outperforms relevant baselines, including S-BERT, with an overall accuracy of 83%.
Zero-shot stance detection (ZSSD) is challenging as it requires detecting the stance of previously unseen targets during the inference stage. Being able to detect the target-related transferable stance features from the training data is arguably an important step in ZSSD. Generally speaking, stance features can be grouped into target-invariant and target-specific categories. Target-invariant stance features carry the same stance regardless of the targets they are associated with. On the contrary, target-specific stance features only co-occur with certain targets. As such, it is important to distinguish these two types of stance features when learning stance features of unseen targets. To this end, in this paper, we revisit ZSSD from a novel perspective by developing an effective approach to distinguish the types (target-invariant/-specific) of stance features, so as to better learn transferable stance features. To be specific, inspired by self-supervised learning, we frame the stance-feature-type identification as a pretext task in ZSSD. Furthermore, we devise a novel hierarchical contrastive learning strategy to capture the correlation and difference between target-invariant and -specific features and further among different stance labels. This essentially allows the model to exploit transferable stance features more effectively for representing the stance of previously unseen targets. Extensive experiments on three benchmark datasets show that the proposed framework achieves the state-of-the-art performance in ZSSD.
Ordinary least-squares estimation is proved to be the best linear unbiased estimator according to the Gauss-Markov theorem. In the last two decades, however, some researchers criticized that least-squares was substantially inaccurate in fitting power-law distributions; such criticism has caused a strong bias in research community. In this paper, we conduct extensive experiments to rebut that such criticism is complete nonsense. Specifically, we sample different sizes of discrete and continuous data from power-law models, showing that even though the long-tailed noises are sampled from power-law models, they cannot be treated as power-law data. We define the correct way to bin continuous power-law data into data points and propose an average strategy for least-squares to fit power-law distributions. Experiments on both simulated and real-world data show that our proposed method fits power-law data perfectly. We uncover a fundamental flaw in the popular method proposed by Clauset et al. : it tends to discard the majority of power-law data and fit the long-tailed noises. Experiments also show that the reverse cumulative distribution function is a bad idea to plot power-law data in practice because it usually hides the true probability distribution of data. We hope that our research can clean up the bias about least-squares fitting power-law distributions.
Source code can be found at https://github.com/xszhong/LSavg.
Bloom filter is an efficient data structure for filtering negative keys (keys not in a given set) with substantially small space. However, in real-world applications, there widely exist vulnerable negative keys, which will bring high costs if not being properly filtered, especially when positive keys are added/deleted dynamically. To address the problem, we propose SeeSaw Counting Filter (SSCF), which is innovated with encapsulating the vulnerable negative keys into a unified counter array named seesaw counter array, and dynamically modulating (or varying) the applied hash functions to guard the encapsulated keys from being misidentified. Moreover, we propose ada-SSCF to handle the scenarios where the vulnerable negative keys cannot be obtained in advance. We extensively evaluate our SSCF, which shows that SSCF outperforms the cutting-edge filters by 3 × on averages regarding accuracy while ensuring a low operation latency. All source codes are in .
Recommender systems provide essential web services by learning users’ personal preferences from collected data. However, in many cases, systems also need to forget some training data. From the perspective of privacy, users desire a tool to erase the impacts of their sensitive data from the trained models. From the perspective of utility, if a system’s utility is damaged by some bad data, the system needs to forget such data to regain utility. While unlearning is very important, it has not been well-considered in existing recommender systems. Although there are some researches have studied the problem of machine unlearning, existing methods can not be directly applied to recommendation as they are unable to consider the collaborative information.
In this paper, we propose RecEraser, a general and efficient machine unlearning framework tailored to recommendation tasks. The main idea of RecEraser is to divide the training set into multiple shards and train submodels with these shards. Specifically, to keep the collaborative information of the data, we first design three novel data partition algorithms to divide training data into balanced groups. We then further propose an adaptive aggregation method to improve the global model utility. Experimental results on three public benchmarks show that RecEraser can not only achieve efficient unlearning but also outperform the state-of-the-art unlearning methods in terms of model utility. The source code can be found at https://github.com/chenchongthu/Recommendation-Unlearning
Recently, prompt-tuning has achieved promising results for specific few-shot classification tasks. The core idea of prompt-tuning is to insert text pieces (i.e., templates) into the input and transform a classification task into a masked language modeling problem. However, for relation extraction, determining an appropriate prompt template requires domain expertise, and it is cumbersome and time-consuming to obtain a suitable label word. Furthermore, there exists abundant semantic and prior knowledge among the relation labels that cannot be ignored. To this end, we focus on incorporating knowledge among relation labels into prompt-tuning for relation extraction and propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt). Specifically, we inject latent knowledge contained in relation labels into prompt construction with learnable virtual type words and answer words. Then, we synergistically optimize their representation with structured constraints. Extensive experimental results on five datasets with standard and low-resource settings demonstrate the effectiveness of our approach. Our code and datasets are available in GitHub1 for reproducibility.
Rumors spread through the Internet, especially on Twitter, have harmed social stability and residents’ daily lives. Recently, in addition to utilizing the text features of posts for rumor detection, the structural information of rumor propagation trees has also been valued. Most rumors with salient features can be quickly locked by graph models dominated by cross entropy loss. However, these conventional models may lead to poor generalization, and lack robustness in the face of noise and adversarial rumors, or even the conversational structures that is deliberately perturbed (e.g., adding or deleting some comments). In this paper, we propose a novel Graph Adversarial Contrastive Learning (GACL) method to fight these complex cases, where the contrastive learning is introduced as part of the loss function for explicitly perceiving differences between conversational threads of the same class and different classes. At the same time, an Adversarial Feature Transformation (AFT) module is designed to produce conflicting samples for pressurizing model to mine event-invariant features. These adversarial samples are also used as hard negative samples in contrastive learning to make the model more robust and effective. Experimental results on three public benchmark datasets prove that our GACL method achieves better results than other state-of-the-art models.
False rumors are known to have detrimental effects on society. To prevent the spread of false rumors, social media platforms such as Twitter must detect them early. In this work, we develop a novel probabilistic mixture model that classifies true vs. false rumors based on the underlying spreading process. Specifically, our model is the first to formalize the self-exciting nature of true vs. false retweeting processes. This results in a novel mixture marked Hawkes model (MMHM). Owing to this, our model obviates the need for feature engineering; instead, it directly models the spreading process in order to make inferences of whether online rumors are incorrect. Our evaluation is based on 13,650 retweet cascades of both true. vs. false rumors from Twitter. Our model recognizes false rumors with a balanced accuracy of 64.97 % and an AUC of 69.46 %. It outperforms state-of-the-art baselines (both neural and feature engineering) by a considerable margin but while being fully interpretable. Our work has direct implications for practitioners: it leverages the spreading process as an implicit quality signal and, based on it, detects false content.
In this work, we develop a Graph Neural Network (GNN) framework for the problem of personalized visualization recommendation. The GNN-based framework first represents the large corpus of datasets and visualizations from users as a large heterogeneous graph. Then, it decomposes a visualization into its data and visual components, and then jointly models each of them as a large graph to obtain embeddings of the users, attributes (across all datasets in the corpus), and visual-configurations. From these user-specific embeddings of the attributes and visual-configurations, we can predict the probability of any visualization arising from a specific user. Finally, the experiments demonstrated the effectiveness of using graph neural networks for automatic and personalized recommendation of visualizations to specific users based on their data and visual (design choice) preferences. To the best of our knowledge, this is the first such work to develop and leverage GNNs for this problem.
Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge of contents in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct the topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage the partial (or incomplete) information about the topic structure as guidance to find out the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely-linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns terms into either one of the known sub-topics or novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates the high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines for a downstream task.
Social recommendation, which leverages social connections to construct Recommender Systems (RS), plays an important role in alleviating information overload. Recently, Graph Neural Networks (GNNs) have received increasing attention due to their great capacity for graph data. Since data in RS is essentially in the structure of graphs, GNN-based RS is flourishing. However, existing works lack in-depth thinking of social recommendations. These methods contain implicit assumptions that are not well analyzed in practice. To tackle these problems, we conduct statistical analyses on widely used social recommendation datasets. We design metrics to evaluate the social information, which can provide guidance about whether and how we should use this information in the RS task. Based on these analyses, we propose a Distillation Enhanced SocIal Graph Network (DESIGN). We train a model that integrates information from the user-item interaction graph and the user-user social graph and train two auxiliary models that only use one of the above graphs respectively. These models are trained simultaneously, where the knowledge distillation technique restricts the training process and makes them learn from each other. Our extensive experiments show that our model significantly and consistently outperforms the state-of-the-art competitors on real-world datasets.
While controllable text generation has received attention due to the recent advances in large-scale pre-trained language models, there is a lack of research that focuses on story-specific controllability. To address this, we present Story Control via Supervised Contrastive learning model (SCSC), to create a story conditioned on genre. For this, we design a supervised contrastive objective combined with log-likelihood objective, to capture the intrinsic differences among the stories in different genres. The results of our automated evaluation and user study demonstrate that the proposed method is effective in genre-controlled story generation.
The proliferation of social media and online communication platforms has made social interactions more accessible, leading to a significant expansion of research into language use with a particular focus on toxic behavior and hate speech. Few studies, however, have focused on the tacit information that may imply a negative intention and the perspective that impacts the interpretation of such intention. Conversation is a joint activity that relies on coordination between what one party expresses and how the other party construes what has been expressed. Thus, how a message is perceived becomes equally important regardless of whether the sent message includes any form of explicit attack or offense. This study focuses on identifying the implicit attacks and negative intentions in text-based conversation from the reader’s point of view. We focus on questions in conversations and investigate the underlying perceived intention. We introduce our dataset that includes questions, intention polarity, and type of attacks. We conduct a meta-analysis on the data to demonstrate how a question may be used as a means of attack and how different perspectives can lead to multiple interpretations. We also report benchmark results of several models for detecting instances of tacit attacks in questions with the aim of avoiding latent or manifest conflict in conversations.
In a debate, two debaters with opposite stances put forward arguments to fight for their viewpoints. Debaters organize their arguments to support their proposition and attack opponents’ points. The common purpose of debating is to persuade the opponents and the audiences to agree with the mentioned propositions. Previous works have investigated the issue of identifying which debater is more persuasive. However, modeling the interaction of arguments between rounds is rarely discussed. In this paper, we focus on assessing the overall performance of debaters in a multi-round debate on online forums. To predict the winner in a multi-round debate, we propose a novel neural model that is aimed at capturing the interaction of arguments by exploiting raw text, structure information, argumentative discourse units (ADUs), and the relations among ADUs. Experimental results show that our model achieves competitive performance compared with the existing models, and is capable of extracting essential argument relations during a multi-round debate by leveraging argumentative structure and attention mechanism.
Researchers using social media data want to understand the discussions occurring in and about their respective fields. These domain experts often turn to topic models to help them see the entire landscape of the conversation, but unsupervised topic models often produce topic sets that miss topics experts expect or want to see. To solve this problem, we propose Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large domain-specific social media data sets in mind. The input to GTM is a set of topics that are of interest to the user and a small number of words or phrases that belong to those topics. These seed topics are used to guide the topic generation process, and can be augmented interactively, expanding the seed word list as the model provides new relevant words for different topics. GTM uses a novel initialization and a new sampling algorithm called Generalized Polya Urn (GPU) seed word sampling to produce a topic set that includes expanded seed topics, as well as new unsupervised topics. We demonstrate the robustness of GTM on open-ended responses from a public opinion survey and four domain-specific Twitter data sets.
Predicting the popularity of online videos has many real-world applications, such as recommendation, precise advertising, and edge caching strategies. Despite many efforts have been dedicated to the online video popularity prediction, there still exist several challenges: (1) The meta-data from online videos is usually sparse and noisy, which makes it difficult to learn a stable and robust representation. (2) The influence of content features and temporal features in different life cycles of online videos is dynamically changing, so it is necessary to build a model that can capture the dynamics. (3) Besides, there is a great need to interpret the predictive behavior of the model to assist administrators of video platforms in the subsequent decision-making.
In this paper, we propose a Knowledge-based Temporal Fusion Network (KTFN) that incorporates knowledge graph representation to address the aforementioned challenges in the task of online video popularity prediction. To be more specific, we design a Tree Attention Network (TAN) to learn the embedding of online video entities in knowledge graphs via selectively aggregating local neighborhood information, thus enabling our model to learn the importance of different entities under the same relation. Besides, an Attention-based Long Short-Term Memory (ALSTM) is utilized to learn the temporal feature representation. Finally, we propose an Adaptively Temporal Feature Fusion (ATFF) scheme to adaptively fuse content features and temporal features, in which a learnable exponential decay function with the global attention mechanism is constructed. We collect two large-scale real-world datasets from the server logs of two popular Chinese online video platforms, and experimental results on the two datasets have demonstrated the superiority and interpretability of KTFN.
As an emotion plays an important role in people’s everyday lives and is often mirrored in their social media use, extensive research has been conducted to characterize and model emotions from social media data. However, prior research has not sufficiently considered trends of social media use—the increasing use of images and the decreasing use of text—nor identified the features of images in social media that are likely to be different from those in non-social media. Our study aims to fill this gap by (1) considering the notion of visual hints that depict contextual information of images, (2) presenting their characteristics in positive or negative emotions, and (3) demonstrating their effectiveness in emotion prediction modeling through an in-depth analysis of their relationship with the text in the same posts. The results of our experiments showed that our visual hint-based model achieved 20% improvement in emotion prediction, compared with the baseline. In particular, the performance of our model was comparable with that of the text-based model, highlighting not only a strong relationship between visual hints of the image and emotion, but also the potential of using only images for emotion prediction which well reflects current and future trends of social media use.
Cross-modal learning is essential to enable accurate fake news detection due to the fast-growing multimodal contents in online social communities. A fundamental challenge of multimodal fake news detection lies in the inherent ambiguity across different content modalities, i.e., decisions made from unimodalities may disagree with each other, which may lead to inferior multimodal fake news detection. To address this issue, we formulate the cross-modal ambiguity learning problem from an information-theoretic perspective and propose CAFE — an ambiguity-aware multimodal fake news detection method. CAFE consists of 1) a cross-modal alignment module to transform the heterogeneous unimodality features into a shared semantic space, 2) a cross-modal ambiguity learning module to estimate the ambiguity between different modalities, and 3) a cross-modal fusion module to capture the cross-modal correlations. CAFE improves fake news detection accuracy by judiciously and adaptively aggregating unimodal features and cross-modal correlations, i.e., relying on unimodal features when cross-modal ambiguity is weak and referring to cross-modal correlations when cross-modal ambiguity is strong. Experimental studies on two widely used datasets (Twitter and Weibo) demonstrate that CAFE outperforms state-of-the-art fake news detection methods by 2.2-18.9% and 1.7-11.4% on accuracy, respectively.
Self-supervised learning, especially contrastive learning, has made an outstanding contribution to the development of many deep learning research fields. Recently, researchers in the acoustic signal processing field noticed its success and leveraged contrastive learning for better music representation. Typically, existing approaches maximize the similarity between two distorted audio segments sampled from the same music. In other words, they ensure a semantic agreement at the music level. However, those coarse-grained methods neglect some inessential or noisy elements at the frame level, which may be detrimental to the model to learn the effective representation of music. Towards this end, this paper proposes a novel Positive-nEgative frame mask for Music Representation based on the contrastive learning framework, abbreviated as PEMR. Concretely, PEMR incorporates a Positive-Negative Mask Generation module, which leverages transformer blocks to generate frame masks on Log-Mel spectrogram. We can generate self-augmented negative and positive samples by masking important components or inessential components, respectively. We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music. We conduct experiments on four public datasets. The experimental results of two music-related downstream tasks, music classification and cover song identification, demonstrate the generalization ability and transferability of music representation learned by PEMR.
Named Entity Recognition (NER) is a crucial natural language understanding task for many down-stream tasks such as question answering and retrieval. Despite significant progress in developing NER models for multiple languages and domains, scaling to emerging and/or low-resource domains still remains challenging, due to the costly nature of acquiring training data. We propose CycleNER, an unsupervised approach based on cycle-consistency training that uses two functions: (i) sentence-to-entity – S2E and (ii) entity-to-sentence – E2S, to carry out the NER task. CycleNER does not require annotations but a set of sentences with no entity labels and another independent set of entity examples. Through cycle-consistency training, the output from one function is used as input for the other (e.g. S2E → E2S) to align the representation spaces of both functions and therefore enable unsupervised training. Evaluation on several domains comparing CycleNER against supervised and unsupervised competitors shows that CycleNER achieves highly competitive performance with only a few thousand input sentences. We demonstrate competitive performance against supervised models, achieving 73% of supervised performance without any annotations on CoNLL03, while significantly outperforming unsupervised approaches.
Psychological stress has become a wider-spread and serious health issue in modern society. Detecting stressors that cause the stress could enable people to take effective actions to manage the stress. Previous work relied on the stressor dictionary built upon words from the stressor-related categories in the LIWC (Linguistic Inquiry and Word Count), and focused on stress categories that appear frequently on social media. In this paper, we build a meta-learning based stress category detection framework, which can learn how to distinguish a new stress category with very little data through learning on frequently appeared categories without relying on any lexicon. It is comprised of three modules, i.e., encoder module, induction module, and relation module. The encoder module focuses on learning category-relevant representation of each tweet with Dependency Graph Convolutional Network and tweet attention. The induction module deploys Mixture of Experts mechanism to integrate and summarize a representation for each category. The relation module is adopted to measure the correlation between each pair of query tweets and categories. Through the three modules and the meta-training process, we can then obtain a model which learns to learn how to identify stress categories and can directly be employed to a new category with little labelled data. Our experimental results show that the proposed framework can achieve 75.3 accuracy with 3 labeled data for the rarely appeared stress categories. We also build a stress category dataset consisting of 12 stress categories with 1,553 manually labeled stressful microblogs which can help train AI models to assist psychological stress diagnosis.
We consider the task of sequencing tracks on music streaming platforms where the goal is to maximise not only user satisfaction, but also artist- and platform-centric objectives, needed to ensure long-term health and sustainability of the platform. Grounding the work across four objectives: Sat, Discovery, Exposure and Boost, we highlight the need and the potential to trade-off performance across these objectives, and propose Mostra, a Set Transformer-based encoder-decoder architecture equipped with submodular multi-objective beam search decoding. The proposed model affords system designers the power to balance multiple goals, and dynamically control the impact on one objective to satisfy other objectives. Through extensive experiments on data from a large-scale music streaming platform, we present insights on the trade-offs that exist across different objectives, and demonstrate that the proposed framework leads to a superior, just-in-time balancing across the various metrics of interest.
Current popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. Nevertheless, real-world text datasets contain a significant amount of spelling errors and improperly punctuated variants where the performance of these models would quickly deteriorate. Moreover, existing text normalization methods are prohibitively expensive to execute over web-scale datasets, can hardly process noisy texts from social networks, or require annotations to learn the corrections in a supervised manner. In this paper, we present Flan (Fast LSH Algorithm for Text Normalization), a scalable randomized algorithm to clean and canonicalize massive text data. Our approach suggests corrections based on the morphology of the words, where lexically similar words are considered the same with high probability. We efficiently handle the pairwise word-to-word comparisons via locality sensitive hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, does not rely on feature engineering, and does not require any annotation. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of Flan. Based on recent advances in densified Minhash, our approach requires much less computational time compared to baseline text normalization techniques on large-scale Twitter and Reddit datasets. In a human evaluation of the quality of the normalization, Flan achieves 5% and 14% improvement against the baselines over the Reddit and Twitter datasets, respectively. Our method also improves performance on Twitter sentiment classification applications and the perturbed GLUE benchmark datasets, where we introduce random errors into the text.
In this paper, we propose and test three content-based hypotheses that significantly increase message virality. We measure virality as the retweet counts of messages in a pair of real-world Twitter datasets A large dataset - UK Brexit with 51 million tweets from 2.8 million users between June 1, 2015 and May 12, 2019 and a smaller dataset - Nord Stream 2 with 516,000 tweets from 250,000 users between October 1, 2019 and October 15, 2019. We hypothesize, test and conclude that messages incorporating “negativity bias”, “causal arguments” and “threats to personal or societal core values of target audiences” singularly and jointly increase message virality on social media.
Web data and computational models can play important roles in analyzing cultural trends. The current study presents an analysis of 23,492 sneaker images and metadata collected from a global reselling shop, StockX.com. Based on data encompassing 22 years from 1999 to 2020, we propose a sneaker design index that helps track changes in the design characteristics of sneakers using a contrastive learning method. Our data suggest that sneaker designs have been employing brighter colors and lower hue and saturation values over time. We also observe how popular brands have continued to build their unique identities in shape-related design space. The embedding analysis also predicts which sneakers will likely see a high premium in the reselling market, suggesting viable algorithm-driven investment and design strategies. The current work is one of the first publicly available studies to analyze product design evolution over a long historical period and has implications for the novel use of Web data to understand cultural patterns that are otherwise difficult to assess.
In e-commerce websites, multiple related product recommendations are usually organized into “widgets”, each given a name, as a recommendation caption, to describe the products within. These recommendation captions are usually manually crafted and generic in nature, making it difficult to attach meaningful and informative names at scale. As a result, the captions are inadequate in helping customers to better understand the connection between the multiple recommendations and make faster product discovery.
We propose an Adaptive Multiple-Product Summarization framework (AmpSum) that automatically and adaptively generates widget captions based on different recommended products. The multiplicity of products to be summarized in a widget caption is particularly novel. The lack of well-developed labels motivates us to design a weakly supervised learning approach with distant supervision to bootstrap the model learning from pseudo labels, and then fine-tune the model with a small amount of manual labels. To validate the efficacy of this method, we conduct extensive experiments on several product categories of Amazon data. The results demonstrate that our proposed framework consistently outperforms state-of-the-art baselines over 9.47-29.14% on ROUGE and 27.31% on METEOR. With case studies, we illustrate how AmpSum could adaptively generate summarization based on different product recommendations.
Finding special items, like heavy hitters, top-k, and persistent items, has always been a hot issue in data stream processing for web analysis. While data streams nowadays are usually high-dimensional, most prior works focus on special items according to a certain primary dimension and yield little insight into the correlations between dimensions. Therefore, we propose to find special quadratic elements to reveal close correlations. Based on the items mentioned above, we extend our problem to three applications related to heavy hitters, top-k, and persistent items, and design a generic framework DUET to process them. Besides, we analyze the error bound of our algorithm and conduct extensive experiments on four data sets. Our experimental results show that DUET can achieve 3.5 times higher throughput and three orders of magnitude lower average relative error compared with cutting-edge algorithms.
User Satisfaction Estimation (USE) is an important yet challenging task in goal-oriented conversational systems. Whether the user is satisfied with the system largely depends on the fulfillment of the user’s needs, which can be implicitly reflected by users’ dialogue acts. However, existing studies often neglect the sequential transitions of dialogue act or rely heavily on annotated dialogue act labels when utilizing dialogue acts to facilitate USE. In this paper, we propose a novel framework, namely USDA, to incorporate the sequential dynamics of dialogue acts for predicting user satisfaction, by jointly learning User Satisfaction Estimation and Dialogue Act Recognition tasks. In specific, we first employ a Hierarchical Transformer to encode the whole dialogue context, with two task-adaptive pre-training strategies to be a second-phase in-domain pre-training for enhancing the dialogue modeling ability. In terms of the availability of dialogue act labels, we further develop two variants of USDA to capture the dialogue act information in either supervised or unsupervised manners. Finally, USDA leverages the sequential transitions of both content and act features in the dialogue to predict the user satisfaction. Experimental results on four benchmark goal-oriented dialogue datasets across different applications show that the proposed method substantially and consistently outperforms existing methods on USE, and validate the important role of dialogue act sequences in USE.
Multi-task learning aims to solve multiple machine learning tasks at the same time, with good solutions being both generalizable and Pareto optimal. A multi-task deep learning model consists of a shared representation learned to capture task commonalities, and task-specific sub-networks capturing the specificities of each task. In this work, we offer insights on the under-explored trade-off between minimizing task training conflicts in multi-task learning and improving multi-task generalization, i.e. the generalization capability of the shared presentation across all tasks. The trade-off can be viewed as the tension between multi-objective optimization and shared representation learning: As a multi-objective optimization problem, sufficient parameterization is needed for mitigating task conflicts in a constrained solution space; However, from a representation learning perspective, over-parameterizing the task-specific sub-networks may give the model too many ”degrees of freedom” and impedes the generalizability of the shared representation.
Specifically, we first present insights on understanding the parameterization effect of multi-task deep learning models and empirically show that larger models are not necessarily better in terms of multi-task generalization. A delicate balance between mitigating task training conflicts vs. improving generalizability of the shared presentation learning is needed to achieve optimal performance across multiple tasks. Motivated by our findings, we then propose the use of a under-parameterized self-auxiliary head alongside each task-specific sub-network during training, which automatically balances the aforementioned trade-off. As the auxiliary heads are small in size and are discarded during inference time, the proposed method incurs minimal training cost and no additional serving cost. We conduct experiments with the proposed self-auxiliaries on two public datasets and live experiments on one of the largest industrial recommendation platforms serving billions of users. The results demonstrate the effectiveness of the proposed method in improving the predictive performance across multiple tasks in multi-task models.
With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite the decent performance, existing text-video retrieval models in vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS ) to post-process the learned embeddings. Unfortunately, separate compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision on the cross-view video quantization model, where contrastive learning at different levels can be mutually promoted. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ.
Although neural networks have achieved great successes in various machine learning tasks, people can hardly know what neural networks learn from data due to their black-box nature. The lack of such explainability is one of the limitations of neural networks when applied in domains, e.g., healthcare and finance, that demand transparency and accountability. Moreover, explainability is beneficial for guiding a neural network to learn the causal patterns that can extrapolate out-of-distribution (OOD) data, which is critical in real-world applications and has surged as a hot research topic.
In order to improve the explainability of neural networks, we propose a novel method—Explainable Neural Rule Learning (denoted as ENRL), with the aim to integrate the expressiveness of neural networks and the explainability of rule-based systems. Specifically, we first design several operator modules and guide them to behave as certain relational operators via self-supervised learning. With input feature fields and learnable context values serving as arguments, these operator modules are used as predicates to constitute the atomic propositions. Then we employ neural logical operations to combine atomic propositions into a collection of rules. Finally, we design a voting mechanism for these rules so that they collaboratively make up our predictive model. Thus, rule learning is transformed to neural architecture search, that is, to choose the appropriate arrangements of feature fields and operator modules. After searching for a specific architecture and learning the involved modules, the resulting neural network explicitly expresses some rules and thus possesses explainability. Therefore, we can predict for each input instance according to rules it satisfies, which at the same time explains how the neural network makes that decision. We conduct a series of experiments on both synthetic and real-world datasets to evaluate ENRL. Compared with conventional neural networks, ENRL achieves competitive in-distribution performance while providing the extra benefits of explainability. Meanwhile, ENRL significantly alleviates performance drop on OOD test data, implying the effectiveness of rule learning. Codes are provided at https://github.com/Shuriken13/ENRL.
Joint aspect category sentiment analysis (ACSA) and rating prediction (RP) is a newly proposed task (namely ASAP) that integrates the characteristics of both fine-grained and coarse-grained sentiment analysis. However, the prior joint models for the ASAP task only consider the shallow interaction between the two granularities. In this work, we gain the inspiration from human intuition, presenting an innovative from-fine-to-coarse reasoning framework for better joint task performance. Our system advances mainly in three aspects. First, we additionally make use of the category label text features, co-encoding them with the input document texts, allowing to accurately capture the key clues of each category. Second, we build a fine-to-coarse hierarchical label graph, modeling the aspect categories and the overall rating as a hierarchical structure for full interaction of the two granularities. Third, we propose to perform global iterative reasoning with a cross-collaboration between the hierarchical label graph and the context graphs, enabling sufficient communication between categories and review contexts. Based on the ASAP dataset, experimental results demonstrate that our proposed framework outperforms state-of-the-art baselines by large margins. Further in-depth analyses prove that our method is effective on addressing both the unbalanced data distribution and the long-text issue.
We tackle the longstanding question of checking hypotheses on the social Web. In particular, we address the challenges that arise in the context of testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of not being representative of the input data. We develop GroupTest, a framework for group testing that addresses both challenges. We formulate CoverTest, a generic top-n problem that seeks n user groups satisfying one-sample, two-sample, or multiple-sample tests, and maximizing data coverage. We show the hardness of CoverTest and develop a greedy algorithm with a provable approximation guarantee as well as a faster heuristic-based algorithm based on α-investing. Our extensive experiments on four real-world datasets demonstrate the necessity to optimize coverage for sound data-driven discoveries, and the efficiency of our heuristic-based algorithm.
A geospatial database is today at the core of an ever increasing number of services. Building and maintaining it remains challenging due to the need to merge information from multiple providers. Entity Resolution (ER) consists of finding entity mentions from different sources that refer to the same real world entity. In geospatial ER, entities are often represented using different schemes and are subject to incomplete information and inaccurate location, making ER and deduplication daunting tasks. While tremendous advances have been made in traditional entity resolution and natural language processing, geospatial data integration approaches still heavily rely on static similarity measures and human-designed rules. In order to achieve automatic linking of geospatial data, a unified representation of entities with heterogeneous attributes and their geographical context, is needed. To this end, we propose Geo-ER1, a joint framework that combines Transformer-based language models, that have been successfully applied in ER, with a novel learning-based architecture to represent the geospatial character of the entity. Different from existing solutions, Geo-ER does not rely on pre-defined rules and is able to capture information from surrounding entities in order to make context-based, accurate predictions. Extensive experiments on eight real world datasets demonstrate the effectiveness of our solution over state-of-the-art methods. Moreover, Geo-ER proves to be robust in settings where there is no available training data for a specific city.
Machine learning is widely used in e-commerce to analyze clickstream sessions and then to allocate marketing resources. Traditional neural learning can model long-term dependencies in clickstream data, yet it ignores the different shopping phases (i. e., goal-directed search vs. browsing) in user behavior as theorized by marketing research. In this paper, we develop a novel, theory-informed machine learning model to account for different shopping phases as defined in marketing theory. Specifically, we formalize a tailored attentive deep Markov model called ClickstreamDMM for predicting the risk of user exits without purchase in e-commerce web sessions. Our ClickstreamDMM combines (1) an attention network to learn long-term dependencies in clickstream data and (2) a latent variable model to capture different shopping phases (i. e., goal-directed search vs. browsing). Due to the interpretable structure, our ClickstreamDMM allows marketers to generate new insights on how shopping phases relate to actual purchase behavior. We evaluate our model using real-world clickstream data from a leading e-commerce platform consisting of 26,279 sessions with 250,287 page clicks. Thereby, we demonstrate that our model is effective in predicting user exits without purchase: compared to existing baselines, it achieves an improvement by 11.5 % in AUROC and 12.7 % in AUPRC. Overall, our model enables e-commerce platforms to detect users at the risk of exiting without purchase. Based on it, e-commerce platforms can then intervene with marketing resources to steer users toward purchasing.
Knowledge graph embedding aims to learn representations of entities and relations in low-dimensional space. Recently, extensive studies combine the characteristics of knowledge graphs with different geometric spaces, including Euclidean space, complex space, hyperbolic space and others, which achieves significant progress in representation learning. However, existing methods are subject to at least one of the following limitations: 1) ignoring the uncertainty, 2) incapability of complex relation patterns. To address the above issues simultaneously, we propose a novel model named DiriE, which embeds entities as Dirichlet distributions and relations as multinomial distributions. DiriE employs Bayesian inference to measure the relations between entities and learns binary embeddings of knowledge graphs for modeling complex relation patterns. Additionally, we propose a two-step negative triple generation method that generates negative triples of both entities and relations. We conduct a solid theoretical analysis to demonstrate the effectiveness and robustness of our method, including the expressiveness of complex relation patterns and the ability to model uncertainty. Furthermore, extensive experiments show that our method outperforms state-of-the-art methods in link prediction on benchmark datasets.
Auxiliary information, such as reviews, have been widely adopted to improve collaborative filtering (CF) algorithms, e.g., to boost the accuracy and provide explanations. However, most of the existing methods cannot distinguish between co-appearance and causality when learning from the reviews, so that they may rely on spurious correlations rather than causal relations in the recommendation — leading to poor generalization performance and unconvincing explanations. In this paper, we propose a Recommendation via Review Rationalization (R3) method including 1) a rationale generator to extract rationales from reviews to alleviate the effects of spurious correlations; 2) a rationale predictor to predict user ratings on items only from generated rationales; and 3) a correlation predictor upon both rationales and correlational features to ensure conditional independence between spurious correlations and rating predictions given causal rationales. Extensive experiments on real-world datasets show that the proposed method can achieve better generalization performance than state-of-the-art CF methods and provide causal-aware explanations even when the test data distribution changes.
Deep learning inspired by differential equations is a recent research trend and has marked the state of the art performance for many machine learning tasks. Among them, time-series modeling with neural controlled differential equations (NCDEs) is considered as a breakthrough. In many cases, NCDE-based models not only provide better accuracy than recurrent neural networks (RNNs) but also make it possible to process irregular time-series. In this work, we enhance NCDEs by redesigning their core part, i.e., generating a continuous path from a discrete time-series input. NCDEs typically use interpolation algorithms to convert discrete time-series samples to continuous paths. However, we propose to i) generate another latent continuous path using an encoder-decoder architecture, which corresponds to the interpolation process of NCDEs, i.e., our neural network-based interpolation vs. the existing explicit interpolation, and ii) exploit the generative characteristic of the decoder, i.e., extrapolation beyond the time domain of original data if needed. Therefore, our NCDE design can use both the interpolated and the extrapolated information for downstream machine learning tasks. In our experiments with 5 real-world datasets and 12 baselines, our extrapolation and interpolation-based NCDEs outperform existing baselines by non-trivial margins.
As recommendation is essentially a comparative (or ranking) process, a good explanation should illustrate to users why an item is believed to be better than another, i.e., comparative explanations about the recommended items. Ideally, after reading the explanations, a user should reach the same ranking of items as the system’s. Unfortunately, little research attention has yet been paid on such comparative explanations.
In this work, we develop an extract-and-refine architecture to explain the relative comparisons among a set of ranked items from a recommender system. For each recommended item, we first extract one sentence from its associated reviews that best suits the desired comparison against a set of reference items. Then this extracted sentence is further articulated with respect to the target user through a generative model to better explain why the item is recommended. We design a new explanation quality metric based on BLEU to guide the end-to-end training of the extraction and refinement components, which avoids generation of generic content. Extensive offline evaluations on two large recommendation benchmark datasets and serious user studies against an array of state-of-the-art explainable recommendation algorithms demonstrate the necessity of comparative explanations and the effectiveness of our solution.
Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important research topic which has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to a variety of web layout patterns. Limited work has focused on modeling the web layout for extracting the text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their neighboring tokens through graph attention. Second, we construct rich attention patterns between HTML tokens and text tokens, which leverages the web layout for effective attention weight computation. We conduct an extensive set of experiments on SWDE and Common Crawl benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.
Digging informative knowledge and analyzing contents from the internet is a challenging task as web data may contain new concepts that are lack of sufficient labeled data as well as could be multimodal. Few-shot learning (FSL) has attracted significant research attention for dealing with scarcely labeled concepts. However, existing FSL algorithms have assumed a uniform task setting such that all samples in a few-shot task share a common feature space. Yet in the real web applications, it is usually the case that a task may involve multiple input feature spaces due to the heterogeneity of source data, that is, the few labeled samples in a task may be further divided and belong to different feature spaces, namely hybrid few-shot learning (hFSL). The hFSL setting results in a hybrid number of shots per class in each space and aggravates the data scarcity challenge as the number of training samples per class in each space is reduced. To alleviate these challenges, we propose the Task-adaptive Topological Transduction Network, namely TopoNet, which trains a heterogeneous graph-based transductive meta-learner that can combine information from both labeled and unlabeled data to enrich the knowledge about the task-specific data distribution and multi-space relationships. Specifically, we model the underlying data relationships of the few-shot task in a node-heterogeneous multi-relation graph, and then the meta-learner adapts to each task’s multi-space relationships as well as its inter- and intra-class data relationships, through an edge-enhanced heterogeneous graph neural network. Our experiments compared with existing approaches demonstrate the effectiveness of our method.
Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations including the inability of modeling word ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of both accurate and efficient inference methods for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, there have not been standard approaches to deploy PLMs for topic discovery as better alternatives to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms and meanwhile serve as meaningful summaries of the documents. Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.1
Automatic extraction of product attributes from their textual descriptions is essential for online shopper experience. One inherent challenge of this task is the emerging nature of e-commerce products — we see new types of products with their unique set of new attributes constantly. Most prior works on this matter mine new values for a set of known attributes but cannot handle new attributes that arose from constantly changing data. In this work, we study the attribute mining problem in an open-world setting to extract novel attributes and their values. Instead of providing comprehensive training data, the user only needs to provide a few examples for a few known attribute types as weak supervision. We propose a principled framework that first generates attribute value candidates and then groups them into clusters of attributes. The candidate generation step probes a pre-trained language model to extract phrases from product titles. Then, an attribute-aware fine-tuning method optimizes a multitask objective and shapes the language model representation to be attribute-discriminative. Finally, we discover new attributes and values through the self-ensemble of our framework, which handles the open-world challenge. We run extensive experiments on a large distantly annotated development set and a gold standard human-annotated test set that we collected. Our model significantly outperforms strong baselines and can generalize to unseen attributes and product types.
Large-scale multi-label text classification (LMTC) aims to associate a document with its relevant labels from a large candidate set. Most existing LMTC approaches rely on massive human-annotated training data, which are often costly to obtain and suffer from a long-tailed label distribution (i.e., many labels occur only a few times in the training set). In this paper, we study LMTC under the zero-shot setting, which does not require any annotated documents with labels and only relies on label surface names and descriptions. To train a classifier that calculates the similarity score between a document and a label, we propose a novel metadata-induced contrastive learning (MICoL) method. Different from previous text-based contrastive learning techniques, MICoL exploits document metadata (e.g., authors, venues, and references of research papers), which are widely available on the Web, to derive similar document–document pairs. Experimental results on two large-scale datasets show that: (1) MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines; (2) MICoL is on par with the state-of-the-art supervised metadata-aware LMTC method trained on 10K–200K labeled documents; and (3) MICoL tends to predict more infrequent labels than supervised methods, thus alleviates the deteriorated performance on long-tailed labels.
Probabilistic time-series forecasting enables reliable decision making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information from these data sources for accurate and well-calibrated forecasts is an important but challenging problem. Most previous works on multi-view time-series forecasting aggregate features from each data view by simple summation or concatenation and do not explicitly model uncertainty for each data view. We propose a general probabilistic multi-view forecasting framework CAMul, which can learn representations and uncertainty from diverse data sources. It integrates the information and uncertainty from each data view in a dynamic context-specific manner, assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-art probabilistic forecasting models by over 25% in accuracy and calibration.
Online controlled experiments, in which different variants of a product are compared based on an Overall Evaluation Criterion (OEC), have emerged as a gold standard for decision making in online services. It is vital that the OEC is aligned with the overall goal of stakeholders for effective decision making. However, this is a challenge when the overall goal is not immediately observable. For instance, we might want to understand the effect of deploying a feature on long-term retention, where the outcome (retention) is not observable at the end of an A/B test.
In this work, we examine long-term user engagement outcomes as a time-to-event problem and demonstrate the use of survival models for estimating long-term effects. We then discuss the practical challenges in using time-to-event metrics for decision making in online experiments. We propose a simple churn-based time-to-inactivity metric and describe a framework for developing & validating modeled metrics using survival models for predicting long-term retention. Then, we present a case study and provide practical guidelines on developing and evaluating a time-to-churn metric on a large scale real-world dataset of online experiments. Finally, we compare the proposed approach to existing alternatives in terms of sensitivity and directionality.
Although billions of COVID-19 vaccines have been administered, too many people remain hesitant. Misinformation about the COVID-19 vaccines, propagating on social media, is believed to drive hesitancy towards vaccination. However, exposure to misinformation does not necessarily indicate misinformation adoption. In this paper we describe a novel framework for identifying the stance towards misinformation, relying on attitude consistency and its properties. The interactions between attitude consistency, adoption or rejection of misinformation and the content of microblogs are exploited in a novel neural architecture, where the stance towards misinformation is organized in a knowledge graph. This new neural framework is enabling the identification of stance towards misinformation about COVID-19 vaccines with state-of-the-art results. The experiments are performed on a new dataset of misinformation towards COVID-19 vaccines, called CoVaxLies, collected from recent Twitter discourse. Because CoVaxLies provides a taxonomy of the misinformation about COVID-19 vaccines, we are able to show which type of misinformation is mostly adopted and which is mostly rejected.
With the prevalence of live broadcast business nowadays, a new type of recommendation service, called live broadcast recommendation, is widely used in many mobile e-commerce Apps. Different from classical item recommendation, live broadcast recommendation is to automatically recommend user anchors instead of items considering the interactions among triple-objects (i.e., users, anchors, items) rather than binary interactions between users and items. Existing methods based on binary objects, ranging from early matrix factorization to recently emerged deep learning, obtain objects’ embeddings by mapping from pre-existing features. Directly applying these techniques would lead to limited performance, as they are failing to encode collaborative signals among triple-objects. In this paper, we propose a novel TWo-side Interactive NetworkS (TWINS) for live broadcast recommendation. In order to fully use both static and dynamic information on user and anchor sides, we combine a product-based neural network with a recurrent neural network to learn the embedding of each object. In addition, instead of directly measuring the similarity, TWINS effectively injects the collaborative effects into the embedding process in an explicit manner by modeling interactive patterns between the user’s browsing history and the anchor’s broadcast history in both item and anchor aspects. Furthermore, we design a novel co-retrieval technique to select key items among massive historic records efficiently. Offline experiments on real large-scale data show the superior performance of the proposed TWINS, compared to representative methods; and further results of online experiments on Diantao App show that TWINS gains average performance improvement of around 8% on ACTR metric, 3% on UCTR metric, 3.5% on UCVR metric.
Graph neural network-based recommendation systems are blossoming recently, and its core component is aggregation methods that determine neighbor embedding learning. Prior arts usually focus on how to aggregate information from the perspective of spatial structure information, but temporal information about neighbors is left insufficiently explored.
In this work, we propose a spatiotemporal aggregation method STAM to efficiently incorporate temporal information into neighbor embedding learning. STAM generates spatiotemporal neighbor embeddings from the perspectives of spatial structure information and temporal information, facilitating the development of aggregation methods from spatial to spatiotemporal. STAM utilizes the Scaled Dot-Product Attention to capture temporal orders of one-hop neighbors and employs multi-head attention to perform joint attention over different latent subspaces. We utilize STAM for GNN-based recommendation to learn users and items embeddings. Extensive experiments demonstrate that STAM brings significant improvements on GNN-based recommendation compared with spatial-based aggregation methods, e.g., 24% for MovieLens, 8% for Amazon, and 13% for Taobao in terms of MRR@20.
Recommender System (RS) is ubiquitous on today’s Internet to provide multifaceted personalized information services. While an enormous success has been made in pushing forward high-accuracy recommendations, the other side of the coin — the recommendation explainability — needs to be better handled for pursuing persuasiveness, especially for the era of deep learning based recommendation. A few research efforts investigate interpretable recommendation from the feature and result levels. Compared with them, model-level explanation, which unfolds the reasoning process of recommendation through transparent models, still remains underexplored and deserves more attention.
In this paper, we propose a model-based explainable recommendation approach, i.e., NS-ICF, which stands for Neuro-Symbolic Interpretable Collaborative Filtering. Thanks to the recent advance on neuro-symbolic computation for automatic rule learning, NS-ICF learns interpretable recommendation rules (consisting of user and item attributes) based on neural networks with two innovations: (1) a three-tower architecture tailored for the user and item sides in the RS domain; (2) fusing the powerful personalized representations of users and items to achieve adaptive rule weights and without sacrificing interpretability. Comprehensive experiments on public datasets demonstrate NS-ICF is comparable to state-of-the-art deep recommendation models and is transparent for its unique neuro-symbolic architecture.
Multi-Task Learning (MTL) has attracted increasing attention in recommender systems. A crucial challenge in MTL is to learn suitable shared parameters among tasks and to avoid negative transfer of information. The most recent sparse sharing models use independent parameter masks, which only activate useful parameters for a task, to choose the useful subnet for each task. However, as all the subnets are optimized in parallel for each task independently, it is faced with the problem of conflict between parameter gradient updates (i.e, parameter conflict problem). To address this challenge, we propose a novel Contrastive Sharing Recommendation model in MTL learning (CSRec). Each task in CSRec learns from the subnet by the independent parameter mask as in sparse sharing models, but a contrastive mask is carefully designed to evaluate the contribution of the parameter to a specific task. The conflict parameter will be optimized relying more on the task which is more impacted by the parameter. Besides, we adopt an alternating training strategy in CSRec, making it possible to self-adaptively update the conflict parameters by fair competitions. We conduct extensive experiments on three real-world large scale datasets, i.e., Tencent Kandian, Ali-CCP and Census-income, showing better effectiveness of our model over state-of-the-art methods for both offline and online MTL recommendation scenarios.
Graph neural networks (GNNs) have been widely adopted for semi-supervised learning on graphs. A recent study shows that the graph random neural network (GRAND) model can generate state-of-the-art performance for this problem. However, it is difficult for GRAND to handle large-scale graphs since its effectiveness relies on computationally expensive data augmentation procedures. In this work, we present a scalable and high-performance GNN framework GRAND+ for semi-supervised graph learning. To address the above issue, we develop a generalized forward push (GFPush) algorithm in GRAND+ to pre-compute a general propagation matrix and employ it to perform graph data augmentation in a mini-batch manner. We show that both the low time and space complexities of GFPush enable GRAND+ to efficiently scale to large graphs. Furthermore, we introduce a confidence-aware consistency loss into the model optimization of GRAND+, facilitating GRAND+’s generalization superiority. We conduct extensive experiments on seven public datasets of different sizes. The results demonstrate that GRAND+ 1) is able to scale to large graphs and costs less running time than existing scalable GNNs, and 2) can offer consistent accuracy improvements over both full-batch and scalable GNNs across all datasets.
Recently, there has been growing interest in the ability of Transformer-based models to produce meaningful embeddings of text with several applications, such as text similarity. Despite significant progress in the field, the explanations for similarity predictions remain challenging, especially in unsupervised settings. In this work, we present an unsupervised technique for explaining paragraph similarities inferred by pre-trained BERT models. By looking at a pair of paragraphs, our technique identifies important words that dictate each paragraph’s semantics, matches between the words in both paragraphs, and retrieves the most important pairs that explain the similarity between the two. The method, which has been assessed by extensive human evaluations and demonstrated on datasets comprising long and complex paragraphs, has shown great promise, providing accurate interpretations that correlate better with human perceptions.
Urban dockless e-scooter sharing (DES) has become a popular Web-of-Things (WoT) service and widely adopted globally. Despite its early commercial success, conventional mobility demand and supply prediction based on machine learning and subsequent redistribution may favor advantaged socio-economic communities and tourist regions, at the expense of reducing mobility accessibility and resource allocation for historically disadvantaged communities. To address this unfairness, we propose a socially-Equitable Interactive Graph information fusion-based mobility flow prediction system for Dockless E-scooter Sharing (EIGDES). By considering city regions as nodes connected by trips, EIGDES learns and captures the complex interactions across spatial and temporal graph features through a novel interactive graph information dissemination and fusion structure. We further design a novel model learning objective with metrics that capture both the mobility distributions and the socio-economic factors, ensuring spatial fairness in the communities’ resource accessibility and their experienced DES prediction accuracy. Through its integration with the optimization regularizer, EIGDES jointly learns the DES flow patterns and socio-economic factors, and returns socially-equitable flow predictions. Our in-depth experimental study upon more than 2,122,270 DES trips from three metropolitan cities in North America has demonstrated EIGDES’s effectiveness in accurate prediction of DES flow patterns with substantial reduction of mobility unfairness.
Nowadays, the surge of Internet contents and the need for high Quality of Experience (QoE) put the backbone network under unprecedented pressure. The emerging edge caching solutions help ease the pressure by caching contents closer to users. However, these solutions suffer from two challenges: 1) a low hit ratio due to edges’ high density and small coverages. 2) unbalanced edges’ workloads caused by dynamic requests and heterogeneous edge capacities. In this paper, we formulate a typical cooperative edge caching problem and propose the MagNet, a decentralized and cooperative edge caching system to address these two challenges. The proposed MagNet system consists of two innovative mechanisms: 1) the Automatic Content Congregating (ACC), which utilizes a neural embedding algorithm to capture underlying patterns of historical traces to cluster contents into some types. The ACC then can guide requests to their optimal edges according to their types so that contents congregate automatically in different edges by type. This process forms a virtuous cycle between edges and requests, driving a high hit ratio. 2) the Mutual Assistance Group (MAG), which lets idle edges share overloaded edges’ workloads by forming temporary groups promptly. To evaluate the performance of MagNet, we conduct experiments to compare it with classical, Machine Learning (ML)-based and cooperative caching solutions using the real-world trace. The results show that the MagNet can improve the hit ratio from 40% and 60% to 75% for non-cooperative and cooperative solutions, respectively, and significantly improve the balance of edges’ workloads.
The rapid growth of video traffic imposes significant challenges on content delivery over the Internet. Meanwhile, edge computing is developed to accelerate video transmission as well as release the traffic load of origin servers. Although some related techniques (e.g., transcoding and prefetching) are proposed to improve edge services, they cannot fully utilize cached videos. Therefore, we propose a Learning-based Fuzzy Bitrate Matching scheme (LFBM) at the edge for adaptive video streaming, which utilizes the capacity of network and edge servers. In accordance with user requests, cache states and network conditions, LFBM utilizes reinforcement learning to make a decision, either fetching the video of the exact bitrate from the origin server or responding with a different representation from the edge server. In the simulation, compared with the baseline, LFBM improves cache hit ratio by 128%. Besides, compared with the scheme without fuzzy bitrate matching, it improves Quality of Experience (QoE) by 45%. Moreover, the real-network experiments further demonstrate the effectiveness of LFBM. It increases the hit ratio by 84% compared with the baseline and improves the QoE by 51% compared with the scheme without fuzzy bitrate matching.
Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libs and provides quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications to different roles in the DL lib ecosystem.
Satellite imagery depicts the earth’s surface remotely and provides comprehensive information for many applications, such as land use monitoring and urban planning. Existing studies on unsupervised representation learning for satellite images only take into account the images’ geographic information, ignoring human activity factors. To bridge this gap, we propose using Point-of-Interest (POI) data to capture human factors and design a contrastive learning-based framework to consolidate the representation of satellite imagery with POI information. Also, we design an attention model that merges the representations from the geographic and POI perspectives adaptively. On the basis of real-world datasets collected from Beijing, we evaluate our method for predicting socioeconomic indicators. The results show that the representation containing POI information outperforms the geographic representation in estimating commercial activity-related indicators. Our proposed framework can estimate the socioeconomic indicators with an R2 of 0.874 and outperforms the baseline methods.
In applications of Web of Things or Web of Events, a massive volume of multi-dimensional streaming data are automatically and continuously generated from different sources, such as GPS, sensors, and other measurement devices, which are essentially imprecise (inaccurate and/or uncertain). It is challenging to monitor and get insights over imprecise and low-level streaming data, in order to capture potentially important data changing trends and to initiate prompt responses. In this work, we investigate solutions for conducting multi-dimensional and multi-granularity probabilistic regression for the imprecise streaming data. The probabilistic nature of streaming data poses big computational challenges to the regression and its aggregation. In this paper, we study a series of techniques on multi-dimensional probabilistic regression, including aggregation, sketching, popular path materialization, and exception-driven querying. Extensive experiments on real and synthetic datasets show the efficiency and scalability of our proposals.
Online content sharing is a widely used feature in Android apps. In this paper, we observe a new Fake-Share attack that adversaries can abuse existing content sharing services to manipulate the displayed source of shared content to bypass the content review of targeted Online Social Apps (OSAs) and induce users to click on the shared fraudulent content. We show that seven popular content-sharing services (including WeChat, AliPay, and KakaoTalk) are vulnerable to such an attack. To detect this kind of attack and explore whether adversaries have leveraged it in the wild, we propose DeFash, a multi-granularity detection tool including static analysis and dynamic verification. The extensive in-the-lab and in-the-wild experiments demonstrate that DeFash is effective in detecting such attacks. We have identified 51 real-world apps involved in Fake-Share attacks. We have further harvested over 24K Sharing Identification Information (SIIs) that can be abused by attackers. It is hence urgent for our community to take actions to detect and mitigate this kind of attack.
Network traffic data facilitates understanding the Internet of Things (IoT) behaviors and improving IoT service quality in the real world. However, large-scale IoT traffic data is rarely accessible, and privacy issues also impede realistic data sharing even with anonymous personal identifiable information. Researchers propose to generate synthetic IoT traffic but fail to cover the multiple services provided by widespread real-world IoT devices. In this work, we take the first step to generate large-scale IoT traffic via a knowledge-enhanced generative adversarial network (GAN) framework, which introduces both the semantic knowledge (e.g., location and environment information) and the network structure knowledge for various IoT devices via a knowledge graph. We use a condition mechanism to incorporate the knowledge and device category for IoT traffic generation. Then, we adopt LSTM and a self-attention mechanism to capture the temporal correlation in the traffic series. Extensive experiment results show that the synthetic IoT traffic datasets generated by our proposed model outperform state-of-art baselines in terms of data fidelity and applications. Moreover, our proposed model is able to generate realistic data by only training on small real datasets with knowledge enhanced.
Split learning (SL) is a promising distributed learning framework that enables to utilize the huge data and parallel computing resources of mobile devices. SL is built upon a model-split architecture, wherein a server stores an upper model segment that is shared by different mobile clients storing its lower model segments. Without exchanging raw data, SL achieves high accuracy and fast convergence by only uploading smashed data from clients and downloading global gradients from the server. Nonetheless, the original implementation of SL sequentially serves multiple clients, incurring high latency with many clients. A parallel implementation of SL has great potential in reducing latency, yet existing parallel SL algorithms resort to compromising scalability and/or convergence speed. Motivated by this, the goal of this article is to develop a scalable parallel SL algorithm with fast convergence and low latency. As a first step, we identify that the fundamental bottleneck of existing parallel SL comes from the model-split and parallel computing architectures, under which the server-client model updates are often imbalanced, and the client models are prone to detach from the server’s model. To fix this problem, by carefully integrating local parallelism, federated learning, and mixup augmentation techniques, we propose a novel parallel SL framework, coined LocFedMix-SL. Simulation results corroborate that LocFedMix-SL achieves improved scalability, convergence speed, and latency, compared to sequential SL as well as the state-of-the-art parallel SL algorithms such as SplitFed and LocSplitFed.
Owing to the benefit of light weight, containers have become a promising enabler for cloud native computing. Container images composed of applications and dependencies support flexible service deployment and migration. Rapid adoption and integration of containers generate millions of images to be stored. Additionally, non-local images have to be frequently downloaded from the registry, resulting in huge amounts of traffic. Content Addressable Storage (CAS) has been adopted for saving storage and networking by enabling identical layers sharing across images. However, according to our measurements, the implication of CAS is significantly limited as layers are rarely fully identical in practice. In this paper, we propose to reconstruct the docker images to raise the number of identical layers and thereby reduce storage and network consumption. We explore the layered structure of images and define the commutativity of files to assure image validity. The image reconstruction is formulated as an integer nonlinear programming problem. Inspired by the observed similarity of layers, we design a similarity-aware online image reconstruction algorithm. Extensive evaluations are conducted to verify the performance of the proposed approach.
This paper studies the task of detecting bots for online mobile games. Considering the fact of lacking labeled cheating samples and restricted available data in the real detection systems, we aim to study the finger operations captured by screen sensors to infer the potential bots in an unsupervised way. In detail, we introduce a Transformer-style detection model, namely FingFormer. It studies the finger operations in the format of graph structure in order to capture the spatial and temporal relatedness between the two hands’ operations. To optimize the model in an unsupervised way, we introduce two contrastive learning strategies to refine both finger moving patterns and players’ operation habits. We conduct extensive experiments under different experimental environments, including the synthetic dataset, the offline dataset, as well as the large-scale online data flow from three mobile games. The multi-facet experiments illustrate the proposed model is both effective and general to detect the bots for different mobile games.
Because of the large number of online games available nowadays, online game recommender systems are necessary for users and online game platforms. The former can discover more potential online games of their interests, and the latter can attract users to dwell longer in the platform. This paper investigates the characteristics of user behaviors with respect to the online games on the Steam platform. Based on the observations, we argue that a satisfying recommender system for online games is able to characterize: personalization, game contextualization and social connection. However, simultaneously solving all is rather challenging for game recommendation. Firstly, personalization for game recommendation requires the incorporation of the dwelling time of engaged games, which are ignored in existing methods. Secondly, game contextualization should reflect the complex and high-order properties of those relations. Last but not least, it is problematic to use social connections directly for game recommendations due to the massive noise within social connections. To this end, we propose a Social-aware Contextualized Graph Neural Recommender System (SCGRec), which harnesses three perspectives to improve game recommendation. We conduct a comprehensive analysis of users’ online game behaviors, which motivates the necessity of handling those three characteristics in the online game recommendation.
With an increasing popularity, Multiplayer Online Battle Arena (MOBA) games where two opposing teams compete against each other, have played a major role in E-sports tournaments. Among game analysis, real-time winning prediction is an important but challenging problem, which is mainly due to the complicated coupling of the overall Confrontation1, the excessive noise of the player’s Movement, and unclear optimization goals. Existing research is difficult to solve this problem in a dynamic, comprehensive and systematic way. In this study, we design a unified framework, namely Winning Tracker (WT), for solving this problem. Specifically, offense and defense extractors are developed to extract the Confrontation of both sides. A well-designed trajectory representation algorithm is applied to extracting individual’s Movement information. Moreover, we design a hierarchical attention mechanism to capture team-level strategies and facilitate the interpretability of the framework. To optimize accurately, we adopt a multi-task learning method to design short-term and long-term goals, which are used to represent immediate state and make end-state prediction respectively. Intensive experiments on a real-world data set demonstrate that our proposed method WT outperforms state-of-the-art algorithms. Furthermore, our work has been practically deployed in real MOBA games, and provided case studies reflecting its outstanding commercial value.
Players of online games generate rich behavioral data during gaming. Based on these data, game developers can build a range of data science applications, such as bot detection and social recommendation, to improve the gaming experience. However, the development of such applications requires data cleansing, training sample labeling, feature engineering, and model development, which makes the use of such applications in small and medium-sized game studios still uncommon. While acquiring supervised learning data is costly, unlabeled behavioral logs are often continuously and automatically generated in games. Thus we resort to unsupervised representation learning of player behavioral data to optimize intelligent services in games. Behavioral data has many unique properties, including semantic complexity, excessive length, etc. A worth noting property within raw player behavioral data is that a lot of it is task-irrelevant. For these data characteristics, we introduce a BPE-enhanced compression method and propose a novel adaptive masking strategy called Masking by Token Confidence (MTC) for the Masked Language Modeling (MLM) pre-training task. MTC is designed to increase the masking probabilities of task-relevant tokens. Experiments on four downstream tasks and successful deployment in a world-renowned Massively Multiplayer Online Role-Playing Game (MMORPG) prove the effectiveness of the MTC strategy1.
Mobile cloud gaming enables high-end games on constrained devices by streaming the game content from powerful servers through mobile networks. Mobile networks suffer from highly variable bandwidth, latency, and losses that affect the gaming experience. This paper introduces , an end-to-end cloud gaming framework to minimize the impact of network conditions on the user experience. relies on an end-to-end distortion model adapting the video source rate and the amount of frame-level redundancy based on the measured network conditions. As a result, it minimizes the motion-to-photon (MTP) latency while protecting the frames from losses. We fully implement and evaluate its performance against the state-of-the-art techniques and latest research in real-time mobile cloud gaming transmission on a physical testbed over emulated and real wireless networks. consistently balances MTP latency (<140 ms) and visual quality (>31dB) even in highly variable environments. A user experiment confirms that maximizes the user experience with high perceived video quality, playability, and low user load.
Estimating a team’s win probability at any given point of a game is a common task for any sport, including esports, and is important for valuing player actions, assessing profitable bets, and engaging fans with interesting metrics. Past studies of win probability in esports have relied on data extracted from matches held in well-structured and organized professional tournaments. In these tournaments, players play on set teams, oftentimes where players are well acquainted with all participants. However, there has been little study of win probability modeling in casual gaming environments – those where players are randomly matched – even though these environments form the bulk of gaming hours played. Using interpretable win probability models trained on large CSGO data sets, we improve upon the current state of the art in Counter-Strike: Global Offensive (CSGO) win probability prediction. We identify important features, such as team HP and equipment value, across different skill levels. We also find a small benefit to using Elo-based player skill estimates in predicting win probability. Furthermore, we discuss how our win probability models can be used to investigate the problem of player-leaving in competitive matchmaking.
This paper presents a personalized character recommendation system for Multiplayer Online Battle Arena (MOBA) games which are considered as one of the most popular online video game genres around the world. When playing MOBA games, players go through a draft stage, where they alternately select a virtual character to play. When drafting, players select characters by not only considering their character preferences, but also the synergy and competence of their team’s character combination. However, the complexity of drafting induces difficulties for beginners to choose the appropriate characters based on the characters of their team while considering their own champion preferences. To alleviate this problem, we propose DraftRec, a novel hierarchical model which recommends characters by considering each player’s champion preferences and the interaction between the players. DraftRec consists of two networks: the player network and the match network. The player network captures the individual player’s champion preference, and the match network integrates the complex relationship between the players and their respective champions. We train and evaluate our model from a manually collected 280,000 matches of League of Legends and a publicly available 50,000 matches of Dota2. Empirically, our method achieved state-of-the-art performance in character recommendation and match outcome prediction task. Furthermore, a comprehensive user survey confirms that DraftRec provides convincing and satisfying recommendations to the real-world players. Our code and dataset are available at https://github.com/dojeon-ai/DraftRec .
In this study, we aim to shine a light on the tracking ecosystem in mobile games on Android and understand how different monetization models can impact user privacy. We introduce a pipeline that collects both free and paid games and we use the static analysis provided by the Exodus audit platform to detect the trackers present in them. We analyse a total of 6,751 games, including 396 paid games. Our results show that paying for a game does not necessarily shield users from data collection. We find that 87% of free games include at least one tracker, compared to 65% of paid games that do. On average, free games have 3.4 times more trackers and request twice more dangerous permissions than paid games. We also notice that the genre of the game and its targeted audience impact the number of trackers. Games in the Casual category presents the most trackers while those in the Educational one have the least.
The development of hypertext and the World Wide Web is most frequently explained by reference to changes in underlying technologies — Moore's Law giving rise to faster computers, more ample memory, increased bandwidth, inexpensive color displays. That story is true, but it is not complete: hypertext and the Web are also built on a foundation of ideas. Specifically, I believe the Web we know arose from ideas rooted in the disasters of the short twentieth century, 1914–1989. The experience of these disasters differed in the Americas and in Eurasia, and this distinction helps explain many long-standing tensions in research and practice alike.
During the last three decades, the Web has been growing considerably in terms of number of available resources, traffic, types of media, usages, etc. In parallel, with 30+ editions, the WebConf series (ex. WWW, soon-to-be ACM WebConf) has witnessed how academia has been dealing with the Web as an object of research. In this study, we focus on the small story within the great one of the Web. In particular, by analysing the WebConf’s accepted papers and yearly events, we review how the conference has evolved across these decades and “driven” the evolution of the Web.
One of the most important developments in the history of the Web was the development of the status update. Although social media has been approached by a number of critical theorists as an instrument of control and surveillance, it should be remembered that social media began as liberatory technology harnessed by social movements. In this essay, we trace the origin of the status update for spreading news from protest-driven community networks like Indymedia and text messages for protest coordination via TxtMob. In fact, the use of status updates by Indymedia and the wider anti-globalization movement prefigured their usage in Tahrir Square and the Black Lives Matter movement in the USA. This historical link goes through Twitter itself, as the early Twitter engineers were veterans of Indymedia. There is still much to learn from Indymedia: Framing social media as invented and then harnessed by social movements may even provide innovative solutions to issues of content moderation and censorship. Exploring the origin of social media in social movements provides a perspective on the history of the Web from the tradition of the oppressed.
In this paper we summarized the main historical steps in making the Web, its foundational principles and its evolution. First we mention some of the influences and streams of thought that interacted to bring the Web about. Then we recall that its birthplace, the CERN, had a need for a global hypertext system and at the same time was the perfect microcosm to provide a cradle for the Web. We stress how this invention required to strike a balance between the integration of and the departure from the existing and emerging paradigms of the day. We then review the pillars of the Web architecture and the features that made the Web so viral compared to competitors. Finally we survey the multiple mutations the Web underwent no sooner it was born, evolving in multiple directions. We conclude on the fact the Web is now an architecture, an artefact, a science object and a research and development object, and of which we haven’t seen the full potential yet.
Algorithmic discrimination is one of the significant concerns in applying machine learning models to a real-world system. Many researchers have focused on developing fair machine learning algorithms without discrimination based on legally protected attributes. However, the existing research has barely explored various security issues that can occur while evaluating model fairness and verifying fair models. In this study, we propose a fairness audit framework that assesses the fairness of ML algorithms while addressing potential security issues such as data privacy, model secrecy, and trustworthiness. To this end, our proposed framework utilizes confidential computing and builds a chain of trust through enclave attestation primitives combined with public scrutiny and state-of-the-art software-based security techniques, enabling fair ML models to be securely certified and clients to verify a certified one. Our micro-benchmarks on various ML models and real-world datasets show the feasibility of the fairness certification implemented with Intel SGX in practice. In addition, we analyze the impact of data poisoning, which is an additional threat during data collection for fairness auditing. Based on the analysis, we illustrate the theoretical curves of fairness gap and minimal group size and the empirical results of fairness certification on poisoned datasets.
Generating a deep learning-based fake video has become no longer rocket science. The advancement of automated Deepfake (DF) generation tools that mimic certain targets has rendered society vulnerable to fake news or misinformation propagation. In real-world scenarios, DF videos are compressed to low-quality (LQ) videos, taking up less storage space and facilitating dissemination through the web and social media. Such LQ DF videos are much more challenging to detect than high-quality (HQ) DF videos. To address this challenge, we rethink the design of standard deep learning-based DF detectors, specifically exploiting feature extraction to enhance the features of LQ images. We propose a novel LQ DF detection architecture, multi-scale Branch Zooming Network (BZNet), which adopts an unsupervised super-resolution (SR) technique and utilizes multi-scale images for training. We train our BZNet only using highly compressed LQ images and experiment under a realistic setting, where HQ training data are not readily accessible. Extensive experiments on the FaceForensics++ LQ and GAN-generated datasets demonstrate that our BZNet architecture improves the detection accuracy of existing CNN-based classifiers by 4.21% on average. Furthermore, we evaluate our method against a real-world Deepfake-in-the-Wild dataset collected from the internet, which contains 200 videos featuring 50 celebrities worldwide, outperforming the state-of-the-art methods by 4.13%.
We design and evaluate VICTOR, an easy-to-apply module on top of a recommender system to mitigate misinformation. VICTOR takes an elegant, implicit approach to deliver fake-news verifications, such that readers of fake news can continuously access more verified news articles about fake-news events without explicit correction. We frame fake-news intervention within VICTOR as a graph-based question-answering (QA) task, with Q as a fake-news article and A as the corresponding verified articles. Specifically, VICTOR adopts reinforcement learning: it first considers fake-news readers’ preferences supported by underlying news recommender systems and then directs their reading sequence towards the verified news articles. To verify the performance of VICTOR, we collect and organize VERI, a new dataset consisting of real-news articles, user browsing logs, and fake-real news pairs for a large number of misinformation events. We evaluate zero-shot and few-shot VICTOR on VERI to simulate the never-exposed-ever and seen-before conditions of users while reading a piece of fake news. Results demonstrate that compared to baselines, VICTOR proactively delivers 6% more verified articles with a diversity increase of 7.5% to over 68% of at-risk users who have been exposed to fake news. Moreover, we conduct a field user study in which 165 participants evaluated fake news articles. Participants in the VICTOR condition show better exposure rates, proposal rates, and click rates on verified news articles than those in the other two conditions. Altogether, our work demonstrates the potentials of VICTOR, i.e., combat fake news by delivering verified information implicitly.
The learning-to-rank problem aims at ranking items to maximize exposure of those most relevant to a user query. A desirable property of such ranking systems is to guarantee some notion of fairness among specified item groups. While fairness has recently been considered in the context of learning-to-rank systems, current methods cannot provide guarantees on the fairness of the predicted rankings. This paper addresses this gap and introduces Smart Predict and Optimize for Fair Ranking (SPOFR), an integrated optimization and learning framework for fairness-constrained learning to rank. The end-to-end SPOFR framework includes a constrained optimization sub-model and produces ranking policies that are guaranteed to satisfy fairness constraints, while allowing for fine control of the fairness-utility tradeoff. SPOFR is shown to significantly improve on current state-of-the-art fair learning-to-rank systems with respect to established performance metrics.
Trust is an important component of human-AI relationships and plays a major role in shaping the reliance of users on online algorithmic decision support systems. With recent advances in natural language processing, text and voice-based conversational interfaces have provided users with new ways of interacting with such systems. Despite the growing applications of conversational user interfaces (CUIs), little is currently understood about the suitability of such interfaces for decision support and how CUIs inspire trust among humans engaging with decision support systems. In this work, we aim to address this gap and answer the following question: to what extent can a conversational interface build user trust in decision support systems in comparison to a conventional graphical user interface? To this end, we built a text-based conversational interface, and a conventional web-based graphical user interface. These served as the means for users to interact with an online decision support system to help them find housing, given a fixed set of constraints. To understand how the accuracy of the decision support system moderates user behavior and trust across the two interfaces, we considered an accurate and inaccurate system. We carried out a 2 × 2 between-subjects study (N = 240) on the Prolific crowdsourcing platform. Our findings show that the conversational interface was significantly more effective in building user trust and satisfaction in the online housing recommendation system when compared to the conventional web interface. Our results highlight the potential impact of conversational interfaces for trust development in decision support systems.
Network algorithms play a critical role in various applications, such as recommendations, diffusion maximization, and web search. In this paper, we focus on the fairness of such algorithms and in particular of PageRank. PageRank fairness refers to a fair allocation of the PageRank weights among the nodes. We consider the effect of the network structure on PageRank fairness. Concretely, we provide analytical formulas for computing the effect of edge additions on fairness and for the conditions that an edge must satisfy so that its addition improves fairness. We also provide analytical formulas for evaluating the role of existing edges in fairness. We use our findings to propose efficient linear time link recommendation algorithms for maximizing fairness, and we evaluate them on real datasets. Our approach can be seen as an effort towards making the network itself fairer as opposed to making fairer the network algorithms, or their outputs.
Recent studies have shown Graph Neural Networks (GNNs) are extremely vulnerable to attribute inference attacks. To tackle this challenge, existing privacy-preserving GNNs research assumes that the sensitive attributes of all users are known beforehand. However, due to different privacy preferences, some users (i.e., private users) may prefer not to reveal sensitive information that others (i.e., non-private users) would not mind disclosing. For example, in social networks, male users are typically less sensitive to their age information than female users. The age disclosure of male users can lead to the age information of female users in the network exposed. This is partly because social media users are connected, the homophily property and message-passing mechanism of GNNs can exacerbate individual privacy leakage. In this work, we study a novel and practical problem of learning privacy-preserving GNNs with partially observed sensitive attributes.
In particular, we propose a novel privacy-preserving GCN model coined DP-GCN, which effectively protects private users’ sensitive information that has been revealed by non-private users in the same network. DP-GCN consists of two modules: First, Disentangled Representation Learning Module (DRL), which disentangles the original non-sensitive attributes into sensitive and non-sensitive latent representations that are orthogonal to each other. Second, Node Classification Module (NCL), which trains the GCN to classify unlabeled nodes in the graph with non-sensitive latent representations. Experimental results on five benchmark datasets demonstrate the effectiveness of DP-GCN in preserving private users’ sensitive information while maintaining high node classification accuracy.
Modern recommender systems learn user representations from historical interactions, which suffer from the problem of user feature shifts, such as an income increase. Historical interactions will inject out-of-date information into the representation in conflict with the latest user feature, leading to improper recommendations. In this work, we consider the Out-Of-Distribution (OOD) recommendation problem in an OOD environment with user feature shifts. To pursue high fidelity, we set additional objectives for representation learning as: 1) strong OOD generalization and 2) fast OOD adaptation.
This work formulates and solves the problem from a causal view. We formulate the user feature shift as an intervention and OOD recommendation as post-intervention inference of the interaction probability. Towards the learning objectives, we embrace causal modeling of the generation procedure from user features to interactions. However, the unobserved user features cannot be ignored, which make the estimation of the interaction probability intractable. We thus devise a new Variational Auto-Encoder for causal modeling by incorporating an encoder to infer unobserved user features from historical interactions. We further perform counterfactual inference to mitigate the effect of out-of-date interactions. Moreover, a decoder is used to model the interaction generation procedure and perform post-intervention inference. Fast adaptation is inherent owing to the reuse of partial user representations. Lastly, we devise an extension to encode fine-grained causal relationships from user features to preference. Empirical results on three datasets validate the strong OOD generalization and fast adaptation abilities of the proposed method.
Fair learning has received a lot of attention in recent years since machine learning models can be unfair in automated decision-making systems with respect to sensitive attributes such as gender, race, etc. However, to mitigate the discrimination on the sensitive attributes and train a fair model, most fair learning methods have required to get access to the sensitive attributes in training or validation phases. In this study, we propose a privacy-preserving training algorithm for a fair support vector machine classifier based on Homomorphic Encryption (HE), where the privacy of both sensitive information and model secrecy can be preserved. The expensive computational costs of HE can be significantly improved by protecting only the sensitive information, introducing refined formulation and low-rank approximation using shared eigenvectors. Through experiments on the synthetic and real-world data, we demonstrate the effectiveness of our algorithm in terms of accuracy and fairness and show that our method significantly outperforms other privacy-preserving solutions in terms of better trade-offs between accuracy and fairness. To the best of our knowledge, our algorithm is the first privacy-preserving fair learning algorithm using HE.
In the United States, regulations have been established in the past to oversee political advertising in TV and radio. The laws governing these marketplaces were enacted with the fundamental premise that important political information is provided to voters through advertising, and politicians should be able to easily inform the public. Today, online advertising constitutes a large fraction of all political ad spending, but lawmakers have not been able to keep up with this rapid change. In the online advertising marketplace, ads are typically allocated to the highest bidder through an auction. Auction mechanisms provide benefits to platforms in terms of revenue maximization and automation, but they operate very differently to offline advertising, and existing approaches to regulation cannot be easily implemented in auction-based environments. We first provide a theoretical model and deliver key insights that can be used to regulate online ad auctions for political ads, and analyze the implications of the proposed interventions empirically. We characterize the optimal auction mechanisms where the regulator takes into account both the ad revenues collected and societal objectives (such as the share of ads allocated to politicians, or the prices paid by them). We use bid data generated from Twitter’s political advertising database to analyze the implications of implementing these changes. The results suggest that achieving favorable societal outcomes at a small revenue cost is possible through easily implementable, simple regulatory interventions.
Perturbation-based techniques are promising for explaining black-box machine learning models due to their effectiveness and ease of implementation. However, prior works have faced the problem of Out-of-Distribution (OoD) — an artifact of randomly perturbed data becoming inconsistent with the original dataset, degrading the reliability of generated explanations, which is still under-explored according to our best knowledge. This work addresses the OoD issue by designing a simple yet effective module that can quantify the affinity between the perturbed data and the original dataset distribution. Specifically, we penalize the influences of unreliable OoD data for the perturbed samples by integrating the inlier scores and prediction results of the target models, thereby making the final explanations more robust. Our solution is shown to be compatible with the most popular perturbation-based XAI algorithms: RISE, OCCLUSION, and LIME. Extensive experiments confirmed that our methods exhibit superior performance in most cases with computational and cognitive metrics. In particular, we point out the degradation problem of RISE algorithm for the first time. With our design, the performance of RISE can be boosted significantly. Besides, our solution also resolves a fundamental problem with a faithfulness indicator, a commonly used evaluation metric of XAI algorithms that appears sensitive to the OoD issue.
Modern recommender systems have evolved rapidly along with deep learning models that are well-optimized for overall performance, especially those trained under Empirical Risk Minimization (ERM). However, a recommendation algorithm that focuses solely on the average performance may reinforce the exposure bias and exacerbate the “rich-get-richer” effect, leading to unfair user experience. In a simulation study, we demonstrate that such performance gap among various user groups is enlarged by an ERM-trained recommender in the long-term. To mitigate such amplification effects, we propose to optimize for the worst-case performance under the Distributionally Robust Optimization (DRO) framework, with the goal of improving long-term fairness for disadvantaged subgroups. In addition, we propose a simple-yet-effective streaming optimization improvement called Streaming-DRO (S-DRO), which effectively reduces loss variances for recommendation problems with sparse and long-tailed data distributions. Our results on two large-scale datasets suggest that (1) DRO is a flexible and effective technique for improving worst-case performance, and (2) Streaming-DRO outperforms vanilla DRO and other strong baselines by improving the worst-case and overall performance at the same time.
Human face images represent a rich set of visual information for online social media platforms to optimize the machine learning (ML)/AI models in their data-driven facial applications (e.g., face detection, face recognition). However, there exists a growing privacy concern from social media users to share their online face images that will be annotated by unknown crowd workers and analyzed by ML/AI researchers in the model training and optimization process. In this paper, we focus on a privacy-aware face recognition problem where the goal is to empower the facial applications to train their face recognition models with images shared by social media users while protecting the identity of the users. Our problem is motivated by the limitation of current privacy-aware face recognition approaches that mainly prevent algorithmic attacks by manipulating face images but largely ignore the potential privacy leakage related to human activities (e.g., crowdsourcing annotation). To address such limitations, we develop FaceCrowd, a web crowdsourcing based face partition approach to improve the performance of current face recognition models by designing a novel crowdsourced partial face graph generated from privacy-preserved social media face images. We evaluate the performance of FaceCrowd using two real-world human face datasets that consist of large-scale human face images. The results show that FaceCrowd not only improves the accuracy of the face recognition models but also effectively protects the identity information of the social media users who share their face images.
This paper focuses on a critical problem of explainable multimodal COVID-19 misinformation detection where the goal is to accurately detect misleading information in multimodal COVID-19 news articles and provide the reason or evidence that can explain the detection results. Our work is motivated by the lack of judicious study of the association between different modalities (e.g., text and image) of the COVID-19 news content in current solutions. In this paper, we present a generative approach to detect multimodal COVID-19 misinformation by investigating the cross-modal association between the visual and textual content that is deeply embedded in the multimodal news content. Two critical challenges exist in developing our solution: 1) how to accurately assess the consistency between the visual and textual content of a multimodal COVID-19 news article? 2) How to effectively retrieve useful information from the unreliable user comments to explain the misinformation detection results? To address the above challenges, we develop a duo-generative explainable misinformation detection (DGExplain) framework that explicitly explores the cross-modal association between the news content in different modalities and effectively exploits user comments to detect and explain misinformation in multimodal COVID-19 news articles. We evaluate DGExplain on two real-world multimodal COVID-19 news datasets. Evaluation results demonstrate that DGExplain significantly outperforms state-of-the-art baselines in terms of the accuracy of multimodal COVID-19 misinformation detection and the explainability of detection explanations.
With social media being a major force in information consumption, accelerated propagation of fake news has presented new challenges for platforms to distinguish between legitimate and fake news. Effective fake news detection is a non-trivial task due to the diverse nature of news domains and expensive annotation costs. In this work, we address the limitations of existing automated fake news detection models by incorporating auxiliary information (e.g., user comments and user-news interactions) into a novel reinforcement learning-based model called REinforced Adaptive Learning Fake News Detection (REAL-FND). REAL-FND exploits cross-domain and within-domain knowledge that makes it robust in a target domain, despite being trained in a different source domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model, especially when limited labeled data is available in the target domain.
Microblogging platforms like Twitter have been heavily leveraged to report and exchange information about natural disasters. The real-time data on these sites is highly helpful in gaining situational awareness and planning aid efforts. However, disaster-related messages are immersed in a high volume of irrelevant information. The situational data of disaster events also vary greatly in terms of information types ranging from general situational awareness (caution, infrastructure damage, casualties) to individual needs or not related to the crisis. It thus requires efficient methods to handle data overload and prioritize various types of information. This paper proposes an interpretable classification-summarization framework that first classifies tweets into different disaster-related categories and then summarizes those tweets. Unlike existing work, our classification model can provide explanations or rationales for its decisions. In the summarization phase, we employ an Integer Linear Programming (ILP) based optimization technique along with the help of rationales to generate summaries of event categories. Extensive evaluation on large-scale disaster events shows (a). our model can classify tweets into disaster-related categories with an 85% Macro F1 score and high interpretability (b). the summarizer achieves (5-25%) improvement in terms of ROUGE-1 F-score over most state-of-the-art approaches.
Hateful meme detection is a new multimodal task that has gained significant traction in academic and industry research communities. Recently, researchers have applied pre-trained visual-linguistic models to perform the multimodal classification task, and some of these solutions have yielded promising results. However, what these visual-linguistic models learn for the hateful meme classification task remains unclear. For instance, it is unclear if these models are able to capture the derogatory or slurs references in multimodality (i.e., image and text) of the hateful memes. To fill this research gap, this paper propose three research questions to improve our understanding of these visual-linguistic models performing the hateful meme classification task. We found that the image modality contributes more to the hateful meme classification task, and the visual-linguistic models are able to perform visual-text slurs grounding to a certain extent. Our error analysis also shows that the visual-linguistic models have acquired biases, which resulted in false-positive predictions.
Social media has become an indispensable channel for political communication. However, the political discourse is increasingly characterized by hate speech, which affects not only the reputation of individual politicians but also the functioning of society at large. In this work, we empirically analyze how the amount of hate speech in replies to posts from politicians on Twitter depends on personal characteristics, such as their party affiliation, gender, and ethnicity. For this purpose, we employ Twitter’s Historical API to collect every tweet posted by members of the 117th U. S. Congress for an observation period of more than six months. Additionally, we gather replies for each tweet and use machine learning to predict the amount of hate speech they embed. Subsequently, we implement hierarchical regression models to analyze whether politicians with certain characteristics receive more hate speech. We find that tweets are particularly likely to receive hate speech in replies if they are authored by (i) persons of color from the Democratic party, (ii) white Republicans, and (iii) women. Furthermore, our analysis reveals that more negative sentiment (in the source tweet) is associated with more hate speech (in replies). However, the association varies across parties: negative sentiment attracts more hate speech for Democrats (vs. Republicans). Altogether, our empirical findings imply significant differences in how politicians are treated on social media depending on their party affiliation, gender, and ethnicity.
Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but, moreover, an understanding of the specific searches where the content is surfaced. The problem of identifying which queries expose a given piece of content in the ranked results is an important and relatively underexplored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25. We then improve upon this approach through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis of various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
Despite the tremendous efforts by social media platforms and fact-check services for fake news detection, fake news and misinformation still spread wildly on social media platforms (e.g., Twitter). Consequently, fake news mitigation strategies are urgently needed. Most of the existing work on fake news mitigation focuses on the overall mitigation on a whole social network while ignoring developing concrete mitigation strategies to deter individual users from sharing fake news. In this paper, we propose a novel veracity-aware and event-driven recommendation model to recommend personalised corrective true news to individual users for effectively debunking fake news. Our proposed model Rec4Mit (Recommendation for Mitigation) not only effectively captures a user’s current reading preference with a focus on which event, e.g., US election, from her/his recent reading history containing true and/or fake news, but also accurately predicts the veracity (true or fake) of candidate news. As a result, Rec4Mit can recommend the most suitable true news to best match the user’s preference as well as to mitigate fake news. In particular, for those users who have read fake news of a certain event, Rec4Mit is able to recommend the corresponding true news of the same event. Extensive experiments on real-world datasets show Rec4Mit significantly outperforms the state-of-the-art news recommendation methods in terms of the capability to recommend personalized true news for fake news mitigation.
Individuals can be misled by fake news and spread it unintentionally without knowing it is false. This phenomenon has been frequently observed but has not been investigated. Our aim in this work is to assess the intent of fake news spreaders. To distinguish between intentional versus unintentional spreading, we study the psychological explanations of unintentional spreading. With this foundation, we then propose an influence graph, using which we assess the intent of fake news spreaders. Our extensive experiments show that the assessed intent can help significantly differentiate between intentional and unintentional fake news spreaders. Furthermore, the estimated intent can significantly improve the current techniques that detect fake news. To our best knowledge, this is the first work to model individuals’ intent in fake news spreading.
In traditional (desktop) e-commerce search, a customer issues a specific query and the system returns a ranked list of products in order of relevance to the query. An increasingly popular alternative in e-commerce search is to issue a voice-query to a smart speaker (e.g., Amazon Echo) powered by a voice assistant (VA, e.g., Alexa). In this situation, the VA usually spells out the details of only one product, an explanation citing the reason for its selection, and a default action of adding the product to the customer’s cart. This reduced autonomy of the customer in the choice of a product during voice-search makes it necessary for a VA to be far more responsible and trustworthy in its explanation and default action.
In this paper, we ask whether the explanation presented for a product selection by the Alexa VA installed on an Amazon Echo device is consistent with human understanding as well as with the observations on other traditional mediums (e.g., desktop e-commerce search). Through a user survey, we find that in 81% cases the interpretation of ‘a top result’ by the users is different from that of Alexa. While investigating for the fairness of the default action, we observe that over a set of as many as 1000 queries, in ≈ 68% cases, there exist one or more products which are more relevant (as per Amazon’s own desktop search results) than the product chosen by Alexa. Finally, we conducted a survey over 30 queries for which the Alexa-selected product was different from the top desktop search result, and observed that in ≈ 73% cases, the participants preferred the top desktop search result as opposed to the product chosen by Alexa. Our results raise several concerns and necessitates more discussions around the related fairness and interpretability issues of VAs for e-commerce search.
While false rumors pose a threat to the successful overcoming of the COVID-19 pandemic, an understanding of how rumors diffuse in online social networks is – even for non-crisis situations – still in its infancy. Here we analyze a large sample consisting of COVID-19 rumor cascades from Twitter that have been fact-checked by third-party organizations. The data comprises N = 10,610 rumor cascades that have been retweeted more than 24 million times. We investigate whether COVID-19 misinformation spreads more viral than the truth and whether the differences in the diffusion of true vs. false rumors can be explained by the moral emotions they carry. We observe that, on average, COVID-19 misinformation is more likely to go viral than truthful information. However, the veracity effect is moderated by moral emotions: false rumors are more viral than the truth if the source tweets embed a high number of other-condemning emotion words, whereas a higher number of self-conscious emotion words is linked to a less viral spread. The effects are pronounced both for health misinformation and false political rumors. These findings offer insights into how true vs. false rumors spread and highlight the importance of considering emotions from the moral emotion families in social media content.
The COVID-19 pandemic has necessitated rapid top-down dissemination of reliable and actionable information. This presents unique challenges in engaging low-literate communities that live in poverty and lack access to the Internet. We describe the design and deployment of a voice-based social media platform, accessible over simple phones, for actively engaging such communities in Pakistan with reliable COVID information. We developed three strategies to overcome users’ hesitation, mistrust, and skepticism in engaging with COVID content. Users were: (1) encouraged to listen to reliable COVID advisory, (2) incentivized to share authentic content with others, and (3) prompted to critically think about COVID-related information behaviors. Using a mixed-methods evaluation, we show that users approached with all three strategies had a significantly higher engagement with COVID content compared to others. We discuss how new designs of social media can enable users to engage with and propagate authentic information.
In this paper, we highlight the use of Instagram for social activism, taking 2019 Hong Kong protests as a case study. Instagram focuses on image content and provides users with few features to share or repost, limiting information propagation. Nevertheless, users who are politically active offline also share their activism on Instagram. We first evaluate the effect of protests on social media activity for protesters and non-protesters over two significant protests. Protesters’ exposure to protest-related posts is much higher than non-protesters, and their network activity follows the protest schedule. They are also much more active on posts related to the protest that they participate in than the other protest. We then analyze the images posted by the users. Users predominantly use symbols related to protests and share personal thoughts on its primary actors. Users primarily share content to raise their network’s awareness, and the content choice is directly affected by Instagram’s intrinsic interaction modalities.
Many information access and machine learning systems, including recommender systems, lack transparency and accountability. High-quality recommendation explanations are of great significance to enhance the transparency and interpretability of such systems. However, evaluating the quality of recommendation explanations is still challenging due to the lack of human-annotated data and benchmarks. In this paper, we present a large explanation dataset named RecoExp, which contains thousands of crowdsourced ratings of perceived quality in explaining recommendations. To measure explainability in a comprehensive and interpretable manner, we propose ExpScore, a novel machine learning-based metric that incorporates the definition of explainability from various perspectives (e.g., relevance, readability, subjectivity, and sentiment polarity). Experiments demonstrate that ExpScore not only vastly outperforms existing metrics and but also keeps itself explainable. Both the RecoExp dataset and open-source implementation of ExpScore will be released for the whole community. These resources and our findings can serve as forces of public good for scholars as well as recommender systems users.
Typical recommender systems try to mimic the past behaviors of users to make future recommendations. For example, in food recommendations, they tend to recommend the foods the user prefers. While the recommended foods may be easily accepted by the user, it cannot improve the user’s dietary habits for a specific goal such as weight control. In this paper, we build a food recommendation system that can be used on the web or in a mobile app to help users meet their goals on body weight, while also taking into account their health information (BMI) and the nutrition information of foods (calories). Instead of applying dietary guidelines as constraints, we build recommendation models from the successful behaviors of comparable users: the weight loss model is trained using the historical food consumption data of similar users who successfully lost weight. By combining such a goal-oriented recommendation model with a general model, the recommendations can be smoothly tuned toward the goal without disruptive food changes. We tested the approach on real data collected from a popular weight management app. It is shown that our recommendation approach can better predict the foods for test periods where the user truly meets the goal, than the typical existing approaches.
Malicious accounts spreading misinformation has led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate these contents rapidly. This is because adapting to new domains requires human intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level where the stance of the user does not align with the news source or article credibility. We propose a framework to use a detection model self-trained on the initial weak labels with uncertainty sampling based on entropy in predictions of the model to identify potentially inaccurate labels and correct for them using self-supervision or relabeling. The framework will incorporate social context of the post in terms of the community of its associated user for surfacing inaccurate labels towards building a large-scale dataset with minimum human effort. To provide labeled datasets with distinction of misleading narratives where information might be missing significant context or has inaccurate ancillary details, the proposed framework will use the few labeled samples as class prototypes to separate high confidence samples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. The approach is demonstrated for providing a large-scale misinformation dataset on COVID-19 vaccines.