Anomaly mining finds high-stakes applications in various real-world domains such as cybersecurity, finance, and environmental monitoring, to name a few. It has therefore been studied widely, and a large body of detection techniques exists. Today, many real-world settings necessitate detection at speed for streaming/evolving data, and/or detection at scale for massive datasets stored in a distributed environment. Despite the plethora of detection algorithms, selecting an algorithm to use on a new task and setting the values of its hyperparameter(s), known as the model selection problem, is an open challenge for unsupervised anomaly detection. This issue is only exacerbated by the recent advent of detectors based on deep neural networks, which exhibit a long list of hyperparameters. The challenge stems from two main factors: the lack of labeled data and the lack of a widely accepted anomaly loss function. Toward automation, one can explore internal evaluation strategies, or capitalize on the experience from historical detection tasks through meta-learning. However, the problem remains far from solved. In deployment, many real-world use cases of anomaly detection require the anomalies flagged by a detector to be screened or audited by a human expert, typically for vetting purposes, where taking automatic actions can be costly (e.g., directly charging a flagged medical provider with fraud). While the vast majority of the literature focuses on novel detection algorithms, as humans are often involved with(in) the process, anomaly mining also concerns various human-centric problems that go beyond mere detection, namely explanation [5, 6], human interaction, and fairness. These aspects of the field are under-studied and pose many open challenges.
Computational Linguistics and Natural Language Processing have changed considerably in the past few decades. Early research focused on representing and using linguistic knowledge in computational processes such as parsers, while these days the field focuses on practically useful tasks such as information retrieval and chatbots. Currently, our Deep Learning models have little to do with linguistic theory.
For example, the Oracle Digital Assistant is built on top of generic "Foundation" Deep Learning models. An intermediate Focusing step adapts these models to specific enterprise domains. Transfer Learning is then used to refocus these models onto specific customer-oriented tasks such as Intent Classification and Named Entity Recognition, as well as onto more advanced models such as text-to-SQL sequence-to-sequence models. These technologies have revolutionised the application of NLP to practical problems with commercial relevance, enabling us to build better systems faster and cheaper than ever before.
Linguistic insights aren't gone from the field, however; they play a critical role in data manufacturing and evaluation. This talk explains how we use hundreds of different evaluations to understand the strengths and weaknesses of our models in the Oracle Digital Assistant, and how we automatically use these in hyper-parameter tuning. It also describes areas where additional research is still required before we can claim that NLP has become an engineering field.
As predictive models are increasingly being deployed in high-stakes decision making (e.g., loan approvals), there has been growing interest in developing post hoc techniques that provide recourse to individuals adversely impacted by predicted outcomes. For example, when an individual is denied a loan by a predictive model deployed by a bank, they should be informed about the reasons for this decision and what can be done to reverse it. While several approaches have been proposed to tackle the problem of generating recourses, these techniques rely heavily on various restrictive assumptions. For instance, they generate recourses under the assumption that the underlying predictive models do not change. In practice, however, models are often updated for a variety of reasons, including data distribution shifts. There is little to no research that systematically investigates and addresses these limitations.
In this talk, I will discuss some of our recent work that sheds light on and addresses the aforementioned challenges, thereby paving the way for making algorithmic recourse practicable and reliable. First, I will present theoretical and empirical results which demonstrate that the recourses generated by state-of-the-art approaches are often invalidated due to model updates. Next, I will introduce a novel algorithmic framework based on adversarial training to generate recourses that remain valid even if the underlying models are updated. I will conclude the talk by presenting theoretical and empirical evidence for the efficacy of our solutions, and also discussing other open problems in the burgeoning field of algorithmic recourse.
Product retrieval systems have served as the main entry point for customers to discover and purchase products online. With increasing concerns about the transparency and accountability of AI systems, studies on explainable information retrieval have received more and more attention in the research community. Interestingly, in the domain of e-commerce, despite extensive studies on explainable product recommendation, the study of explainable product search is still at an early stage. In this paper, we study how to construct effective explainable product search by comparing model-agnostic explanation paradigms with model-intrinsic paradigms and analyzing the important factors that determine the performance of product search explanations. We propose an explainable product search model with model-intrinsic interpretability and conduct a crowdsourcing study to compare it with the state-of-the-art explainable product search model with model-agnostic interpretability. We observe that both paradigms have their own advantages, and that the effectiveness of search explanations on different properties is affected by different factors. For example, explanation fidelity is more important for users' overall satisfaction with the system, while explanation novelty may be more useful in attracting user purchases. These findings could have important implications for the future study and design of explainable product search engines.
Information-seeking conversations between users and Conversational Search Agents (CSAs) consist of multiple turns of interaction. While users initiate a search session, ideally a CSA should sometimes take the lead in the conversation by eliciting feedback from the user, offering query suggestions or asking query clarifications (i.e., mixed initiative). This creates the potential for more engaging conversational searches, but substantially increases the complexity of modelling and evaluating such scenarios, due to the large interaction space coupled with the trade-offs between the costs and benefits of the different interactions. In this paper, we present a model for conversational search from which we instantiate different observed conversational search strategies, where the agent elicits feedback either: (i) Feedback-First, or (ii) Feedback-After. Using 49 TREC WebTrack Topics, we performed an analysis comparing how well these different strategies combine with different mixed-initiative approaches: (i) Query Suggestions vs. (ii) Query Clarifications. Our analysis reveals that there is no superior or dominant combination; instead, it shows that query clarifications are better when asked first, while query suggestions are better when asked after presenting results. We also show that the best strategy and approach depend on the trade-offs between the relative costs of querying and giving feedback, the performance of the initial query, the number of assessments per query, and the total amount of gain required. While this work highlights the complexities and challenges involved in analyzing CSAs, it provides the foundations for evaluating conversational strategies and conversational search agents in batch/offline settings.
Organizations often seek to extend their data by integrating available datasets originating from external sources. While there are many tools that recommend how to perform the integration for given datasets, the selection of which datasets to integrate is often challenging in itself. First, the relevant candidates must be efficiently identified among irrelevant ones. Next, relevant datasets need to be evaluated with respect to issues such as low quality or a poor match to the target data and schema. Last, jointly integrating multiple datasets may have significant benefits, such as increased completeness and information gain, but may also greatly complicate the task due to dependencies in the integration process.
To assist administrators in this task, we quantify to what extent an integration of multiple datasets is valuable as an extension of an initial dataset, and formalize the computational problem of finding the most valuable subset to integrate by this measure. We formally analyze the problem, showing that it is NP-hard; we nevertheless introduce efficient heuristic algorithms, which our experiments show to be near-optimal in practice and highly effective in finding the most valuable integration.
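Since the most-valuable-subset problem is NP-hard, a natural heuristic is greedy selection by marginal value gain. The sketch below is a minimal illustration under an assumed black-box value oracle; the paper's actual value measure and algorithms are not reproduced here, and `greedy_select` and the toy attribute-coverage value are hypothetical names.

```python
def greedy_select(candidates, value, budget):
    """Greedily add the dataset with the largest marginal value gain.

    `value` is an assumed black-box oracle scoring a set of candidate
    datasets; it stands in for the paper's integration-value measure.
    """
    chosen = set()
    for _ in range(budget):
        best, best_gain = None, 0.0
        for c in candidates - chosen:
            gain = value(chosen | {c}) - value(chosen)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no remaining candidate adds value
            break
        chosen.add(best)
    return chosen

# Toy value function: number of target attributes covered by the subset.
attrs = {"A": {1, 2}, "B": {2, 3}, "C": {3}}
value = lambda s: len(set().union(*(attrs[d] for d in s))) if s else 0
chosen = greedy_select(set(attrs), value, budget=2)
```

With budget 2 the greedy choice covers all three attributes, illustrating how marginal gains discount candidates that add little beyond what is already integrated.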
Given as input a set of objects, each represented as a vector of features in a feature space, the skyline problem is to determine the subset of objects that are not dominated by any other input object. An example application is finding the best hotel(s) with respect to features such as location, price, and cleanliness.
The use of the crowd for solving this problem is useful when a score for items according to their features is not available. Yet the crowd can give inconsistent answers. In this paper we study the computation of the skyline when the comparisons between objects are performed by humans. We model the problem using the threshold model [Ajtai et al., TALG 2015], in which the comparison of two objects may create errors/inconsistencies if the objects are close to each other. We provide algorithms for the problem, analyze the number of human comparisons they require, and prove lower bounds. We also evaluate the effectiveness and efficiency of our algorithms using synthetic and real-world data.
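As a hedged illustration of the threshold model (not the paper's algorithm), the sketch below computes a skyline, with smaller values better on every feature, using a comparator that may err whenever two values fall within a threshold of each other; all names are illustrative.

```python
import random

def noisy_less(a, b, theta=0.05):
    """Crowd comparison of one feature: when |a - b| < theta the answer
    may be flipped, modeling the threshold error model."""
    if abs(a - b) < theta and random.random() < 0.5:
        return not (a < b)
    return a < b

def dominates(u, v, cmp=noisy_less):
    """u dominates v (smaller is better) if v is never judged strictly
    better and u is judged strictly better on at least one feature."""
    better = False
    for a, b in zip(u, v):
        if cmp(b, a):          # v judged better on this feature
            return False
        if cmp(a, b):
            better = True
    return better

def skyline(points):
    """Objects not dominated by any other object, per the noisy judge."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For hotels described by, say, (price, distance) pairs, `skyline` keeps the candidates that no other hotel beats on both features; because the comparator is noisy, near-ties can survive or drop out inconsistently, which is exactly the difficulty the paper's algorithms must handle.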
With local differential privacy (LDP), users can privatize their data, and thus guarantee privacy properties, before transmitting it to the server (a.k.a. the aggregator). One primary objective of LDP is frequency (or histogram) estimation, in which the aggregator estimates the number of users holding each possible value. In practice, when a rich study of a population is desired, the interest is in multiple attributes of the population, that is to say, in multidimensional data (d ≥ 2). However, contrary to the problem of frequency estimation for a single attribute (the majority of prior work), the multidimensional setting demands particular attention to the privacy budget, which can grow extremely quickly due to the composition theorem. To the authors' knowledge, two solutions stand out for this task: 1) splitting the privacy budget across attributes, i.e., sending each value with ε/d-LDP (Spl), and 2) randomly sampling a single attribute and spending the whole privacy budget to send it with ε-LDP (Smp). Although Smp adds an additional sampling error, it has proven to provide higher data utility than the former Spl solution. However, we argue that aggregators (who are also seen as attackers) are aware of the sampled attribute and its LDP value, which is protected by a "less strict" e^ε probability bound (rather than e^{ε/d}). We therefore propose a solution named Random Sampling plus Fake Data (RS+FD), which creates uncertainty over the sampled attribute by generating fake data for each non-sampled attribute; RS+FD further benefits from amplification by sampling. We theoretically and experimentally validate our proposed solution on both synthetic and real-world datasets, showing that RS+FD achieves nearly the same or better utility than the state-of-the-art Smp solution.
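A minimal sketch of the RS+FD idea, assuming generalized randomized response (GRR) as the underlying ε-LDP mechanism; the paper's exact protocols and analysis are not reproduced, and the function names are illustrative.

```python
import math, random

def grr(true_value, domain, eps):
    """Generalized randomized response: report the true value with
    probability p = e^eps / (e^eps + k - 1), else a uniform other value."""
    k = len(domain)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if random.random() < p:
        return true_value
    return random.choice([v for v in domain if v != true_value])

def rs_fd_report(record, domains, eps):
    """Sketch of RS+FD: privatize one sampled attribute with the full
    budget and emit uniformly random fake values for the rest, so the
    aggregator cannot tell which attribute was actually sampled."""
    sampled = random.randrange(len(record))
    report = []
    for j, value in enumerate(record):
        if j == sampled:
            report.append(grr(value, domains[j], eps))
        else:
            report.append(random.choice(domains[j]))  # fake data
    return report
```

Because every attribute slot carries a plausible value, the report hides which attribute consumed the real privacy budget, which is the source of the added uncertainty the abstract describes.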
Video accessibility is crucial for blind screen-reader users, as online videos play an increasingly essential role in education, employment, and entertainment. While quite a few techniques and guidelines focus on creating accessible videos, there is a dearth of research that attempts to characterize the accessibility of existing videos. Therefore, in this paper, we define and investigate a diverse set of video- and audio-based accessibility features in an effort to characterize accessible and inaccessible videos. As ground truth for our investigation, we built a custom dataset of 600 videos, in which each video was assigned an accessibility score based on the number of its wins in a Swiss-system tournament, where human annotators performed pairwise accessibility comparisons of videos. In contrast to existing accessibility research, where assessments are typically done by blind users, we recruited sighted users for our effort, since videos comprise a special case in which sight could be required to better judge whether any particular scene in a video is accessible or not. Subsequently, by examining the extent of association between the accessibility features and the accessibility scores, we could determine the features that significantly (positively or negatively) impact video accessibility and therefore serve as good indicators for assessing the accessibility of videos. Using the custom dataset, we also trained machine learning models that leveraged our handcrafted features to either classify an arbitrary video as accessible/inaccessible or predict an accessibility score for the video. Evaluation of our models yielded an F1 score of 0.675 for binary classification and a mean absolute error of 0.53 for score prediction, thereby demonstrating their potential in video accessibility assessment while also illuminating their current limitations and the need for further research in this area.
We present Gradient Activation Maps (GAM), a machinery for explaining predictions made by visual similarity and classification models. By gleaning localized gradient and activation information from multiple network layers, GAM offers improved visual explanations compared to existing alternatives. The algorithmic advantages of GAM are explained in detail and validated empirically, where we show that GAM outperforms its alternatives across various tasks and datasets.
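GAM's exact multi-layer machinery is not reproduced here, but the general family of gradient-activation explanations it belongs to can be sketched as follows: weight a layer's activation channels by their pooled gradients and keep the positive evidence (the scheme popularized by Grad-CAM); all names are illustrative.

```python
import numpy as np

def gradient_activation_map(activations, gradients):
    """Generic gradient-activation explanation map (in the spirit of
    Grad-CAM; GAM's multi-layer aggregation is richer than this).

    activations, gradients: arrays of shape (channels, H, W) taken from
    one network layer for the examined prediction.
    """
    weights = gradients.mean(axis=(1, 2))             # pool gradients per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum
    cam = np.maximum(cam, 0.0)                        # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                              # normalize to [0, 1]
    return cam
```

The resulting H×W heatmap highlights the spatial regions whose activations push the prediction up, which is the kind of localized visual explanation the abstract refers to.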
We present the Variational Bayesian Network (VBN), a novel Bayesian entity representation learning model that utilizes hierarchical and relational side information and is particularly useful for modeling entities in the "long tail'', where data is scarce. VBN provides better modeling for long-tail entities via two complementary mechanisms. First, VBN employs informative hierarchical priors that enable information propagation between entities sharing common ancestors; additionally, VBN models explicit relations between entities that enforce complementary structure and consistency, guiding the learned representations towards a more meaningful arrangement in space. Second, VBN represents entities by densities (rather than vectors), hence modeling uncertainty, which plays a complementary role in coping with data scarcity. Finally, we propose a scalable Variational Bayes optimization algorithm that enables fast approximate Bayesian inference. We evaluate the effectiveness of VBN on linguistic, recommendation, and medical inference tasks. Our findings show that VBN outperforms existing methods across multiple datasets, especially in the long tail.
Recently, several Knowledge Graph Embedding (KGE) approaches have been devised to represent entities and relations in a dense vector space and have been employed in downstream tasks such as link prediction. A few KGE techniques address interpretability, i.e., mapping the connectivity patterns of relations (symmetric/asymmetric, inverse, and composition) to a geometric interpretation such as rotation. Other approaches model the representations in higher-dimensional spaces, such as four-dimensional (4D) space, to enhance the ability to infer connectivity patterns (i.e., expressiveness). However, modeling relations and entities in 4D space often comes at the cost of interpretability. We propose HopfE, a novel KGE approach that aims to achieve interpretability of inferred relations in four-dimensional space. HopfE models the structural embeddings in 3D Euclidean space. Next, we map the entity embedding vector from 3D Euclidean space to a 4D hypersphere using the inverse Hopf fibration, in which we embed the semantic information from the KG ontology. Thus, HopfE considers the structural and semantic properties of the entities without losing expressivity or interpretability. Our empirical results on four well-known benchmarks achieve state-of-the-art performance for KG completion.
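The geometry underlying HopfE can be illustrated numerically: the sketch below implements the classical Hopf map S³ → S² and one section of its inverse (valid away from the south pole). This shows only the fibration itself, not HopfE's embedding scheme.

```python
import math

def hopf_map(a, b, c, d):
    """Hopf map S^3 -> S^2, viewing (a + bi, c + di) as a point in C^2."""
    z1, z2 = complex(a, b), complex(c, d)
    w = z1 * z2.conjugate()
    return (2 * w.real, 2 * w.imag, abs(z1) ** 2 - abs(z2) ** 2)

def inverse_hopf(x, y, z):
    """One preimage on S^3 of a point (x, y, z) on S^2, for z != -1.

    Choosing z1 real gives a standard section of the fibration; the full
    preimage of each point is the circle (e^{it} z1, e^{it} z2)."""
    z1 = math.sqrt((1 + z) / 2)
    z2 = complex(x, -y) / (2 * z1)
    return (z1, 0.0, z2.real, z2.imag)
```

For any point on S² with z ≠ -1, `inverse_hopf` returns a unit 4-vector that `hopf_map` sends back to the same point, which is the 3D-to-4D lifting the abstract refers to.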
In the classical influence maximization problem, we aim to select a set of nodes, called seeds, to start an efficient information diffusion process. More precisely, the goal is to select seeds such that the expected number of nodes reached by the diffusion process is maximized. In this work we study a variant of this problem in which an unknown (up to a probability distribution) set of nodes, referred to as co-existing seeds, joins in starting the diffusion process even if not selected. This setting allows us to model situations in which some nodes are willing to act as "voluntary seeds'' even if not chosen by the campaign organizer. This may, for example, be due to the positive nature of the information campaign (e.g., public health awareness programs, HIV prevention, financial aid programs), or due to external social driving effects (e.g., nodes are friends of selected seeds in real life or in other social media).
In this setting, we study two types of optimization problems. While the first aims to maximize the expected number of reached nodes, the second endeavors to maximize the expected increment in the number of reached nodes compared to a non-intervention strategy. The problems (particularly the second one) are motivated by cooperative game theory. For various probability distributions on co-existing seeds, we obtain several algorithms with approximation guarantees, as well as hardness and hardness-of-approximation results. We conclude with experiments that demonstrate the usefulness of our approach when co-existing seeds are present.
We study the problem of recommending relevant products to users in relatively resource-scarce markets by leveraging data from similar, resource-rich auxiliary markets. We hypothesize that data from one market can be used to improve performance in another. Only a few studies have been conducted in this area, partly due to the lack of publicly available experimental data. To this end, we collect and release XMarket, a large dataset covering 18 local markets and 16 different product categories, featuring 52.5 million user-item interactions.
We introduce and formalize the problem of cross-market product recommendation, i.e., market adaptation. We explore different market-adaptation techniques inspired by state-of-the-art domain-adaptation and meta-learning approaches and propose a novel neural approach for market adaptation, named FOREC. Our model follows a three-step procedure - pre-training, forking, and fine-tuning - in order to fully utilize the data from an auxiliary market as well as the target market. We conduct extensive experiments studying the impact of market adaptation on different pairs of markets. Our proposed approach demonstrates robust effectiveness, consistently improving the performance on target markets compared to competitive baselines selected for our analysis. In particular, FOREC improves on average 24% and up to 50% in terms of nDCG@10, compared to the NMF baseline. Our analysis and experiments suggest specific future directions in this research area. We release our data and code for academic purposes.
The ever-increasing complexity of machine learning techniques, used more and more in practice, gives rise to the need to explain the outcomes of these models, which are often used as black boxes. Explainable AI approaches are either numerical, feature-based approaches aiming to quantify the contribution of each feature to a prediction, or symbolic approaches providing certain forms of symbolic explanations such as counterfactuals. This paper proposes a generic, model-agnostic approach named ASTERYX that generates both symbolic explanations and score-based ones. Our approach is declarative and based on encoding the model to be explained in an equivalent symbolic representation. The latter serves to generate, in particular, two types of symbolic explanations: sufficient reasons and counterfactuals. We then associate scores reflecting the relevance of the explanations and the features with respect to some properties. Our experimental results show the feasibility of the proposed approach and its effectiveness in providing symbolic and score-based explanations.
Spamming reviews are prevalent in review systems, manipulating seller reputation and misleading customers. Graph neural networks (GNNs) exploit relational patterns in review graphs to achieve state-of-the-art detection accuracy. Detection can influence a large number of real-world entities, and it is ethical to treat different groups of entities as equally as possible. However, due to the skewed distributions of the graphs, GNNs can fail to meet diverse fairness criteria designed for different parties. We formulate linear systems over the input features and the adjacency matrix of the review graphs to certify multiple fairness criteria. When the criteria are competing, we relax the certification and design a multi-objective optimization (MOO) algorithm to explore multiple efficient trade-offs, so that no objective can be improved without harming another. We prove that the algorithm converges to a Pareto-efficient solution using duality and the implicit function theorem. Since there can be exponentially many trade-offs among the criteria, we propose a data-driven stochastic search algorithm to approximate Pareto fronts consisting of multiple efficient trade-offs. Experimentally, we show that the algorithms converge to solutions that dominate baselines based on fairness regularization and adversarial training.
This paper introduces ScarceGAN, which focuses on the identification of extremely rare or scarce samples from multi-dimensional longitudinal telemetry data with a small and weak label prior. We specifically address: (i) severe scarcity of the positive class, stemming both from the underlying organic skew in the data and from extremely limited labels; (ii) the multi-class nature of the negative samples, with uneven density distributions and partially overlapping feature distributions; and (iii) massively unlabelled data, leading to a tiny and weak prior on both the positive and negative classes, and the possibility of unseen or unknown behavior in the unlabelled set, especially in the negative class. Although related to PU learning problems, we contend that knowledge (or lack of it) of the negative class can be leveraged to better learn its complement (i.e., the positive class) in a semi-supervised manner. To this end, ScarceGAN re-formulates the semi-supervised GAN by accommodating weakly labelled multi-class negative samples alongside the available positive samples. It relaxes the supervised discriminator's constraint on exact differentiation between negative samples by introducing a 'leeway' term for samples with a noisy prior. We propose modifications to the cost objectives of the discriminator, in both the supervised and unsupervised paths, as well as that of the generator. For identifying risky players in skill gaming, this formulation as a whole gives us a recall of over 85% (a ~60% jump over the vanilla semi-supervised GAN) on our scarce class, with very minimal verbosity in the unknown space. Furthermore, ScarceGAN outperforms the recall benchmarks established by recent GAN-based specialized models for positive imbalanced class identification and establishes a new benchmark in identifying one of the rare attack classes (0.09%) in the intrusion dataset from the KDDCUP99 challenge.
We establish ScarceGAN as a new competitive benchmark framework for rare-class identification in longitudinal telemetry data.
Motivated by a network fault detection problem, we study how the recall of a decision tree classifier can be boosted without sacrificing too much precision. This problem is relevant and novel in the context of transfer learning (TL), in which few target-domain training samples are available. We define a geometric optimization problem for boosting the recall of a decision tree classifier and show it is NP-hard. To solve it efficiently, we propose several near-linear-time heuristics and experimentally validate them in the context of TL. Our evaluation includes 7 public datasets as well as 6 network fault datasets, and we compare our heuristics with several existing TL algorithms as well as exact mixed integer linear programming (MILP) solutions to our optimization problem. We find that our heuristics boost recall similarly to optimal MILP solutions, yet require several orders of magnitude less compute time. In many cases the F1 score of our approach is competitive with, and often better than, other TL algorithms. Moreover, our approach can serve as a building block for applying transfer learning to more powerful ensemble methods, such as random forests.
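The recall/precision trade-off can be illustrated with a much simpler surrogate than the paper's geometric problem: on validation data, pick the largest score threshold that still reaches a target recall. This is a hedged sketch with assumed names; the paper instead expands decision-tree leaf regions.

```python
import math

def threshold_for_recall(scores, labels, target_recall):
    """Largest decision threshold whose recall on validation data meets
    `target_recall`; keeping the threshold as high as possible gives up
    the least precision while boosting recall."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    if not positives:
        return 0.0
    needed = math.ceil(target_recall * len(positives))
    return positives[needed - 1]

# Scores 0.9, 0.7, 0.4 belong to positives; recall 2/3 needs threshold 0.7.
t = threshold_for_recall([0.9, 0.8, 0.7, 0.4, 0.2],
                         [1, 0, 1, 1, 0], target_recall=2 / 3)
```

Classifying everything at or above the returned threshold as positive guarantees the target recall on the validation set while admitting as few extra false positives as this one-dimensional relaxation allows.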
Transformers have shown great potential for modeling long-term dependencies in natural language processing and computer vision. However, few studies have applied transformers to graphs, which is challenging due to the poor scalability of the attention mechanism and the under-exploration of graph inductive biases. To bridge this gap, we propose a Lite Graph Transformer (LiteGT) that learns on arbitrary graphs efficiently. First, a node sampling strategy is proposed to sparsify the nodes considered in self-attention in only O(N log N) time. Second, we devise two kernelization approaches to form two-branch attention blocks, which not only leverage graph-specific topology information but also reduce computation further to O((1/2) N log N). Third, the nodes are updated with different attention schemes during training, largely mitigating over-smoothing as the model deepens. Extensive experiments demonstrate that LiteGT achieves competitive performance on both node classification and link prediction on datasets with millions of nodes. Specifically, the Jaccard + Sampling + Dim.-reducing setting cuts computation by more than 100x and halves the model size without performance degradation.
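The node-sampling idea can be sketched in a few lines of NumPy: each query attends only to a sampled subset of keys, so the score matrix is N×s instead of N×N. This is a hedged illustration only; LiteGT's actual sampling strategy and kernelized two-branch blocks are more involved.

```python
import numpy as np

def sampled_attention(x, num_samples, rng):
    """Self-attention over a sampled subset of nodes.

    x: (N, d) node features, used here as queries/keys/values alike
    for simplicity (a real model would use learned projections)."""
    n, d = x.shape
    idx = rng.choice(n, size=min(num_samples, n), replace=False)
    keys = x[idx]                                   # (s, d) sampled keys
    scores = x @ keys.T / np.sqrt(d)                # (N, s) instead of (N, N)
    scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ keys                              # (N, d) updated features
```

With s on the order of log N sampled keys per block, the per-layer cost drops from O(N²) toward the O(N log N) regime the paper targets.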
While batch evaluation plays a central part in Information Retrieval (IR) research, most evaluation metrics are based on user models that mainly focus on browsing and clicking behaviors. As users' perceived satisfaction may also be affected by their search intent, constructing different user models for different search intents may help design better evaluation metrics. However, user intents are usually unobservable in practice. Since query reformulation behaviors may reflect search intent to a certain extent and highly correlate with users' perceived satisfaction for a specific query, these observable factors may be beneficial for the design of evaluation metrics. How to incorporate the search intent behind query reformulation into user behavior and satisfaction models remains under-investigated. To investigate the relationships among query reformulations, search intent, and user satisfaction, we explore a publicly available web search dataset and find that query reformulations can be a good proxy for inferring user intent; reformulating actions may therefore be beneficial for designing better web search effectiveness metrics. We then propose a group of Reformulation-Aware Metrics (RAMs) to improve existing click-model-based metrics. Experimental results on two public session datasets show that RAMs have significantly higher correlations with user satisfaction than existing evaluation metrics. In a robustness test, we find that RAMs achieve good performance when only a small proportion of satisfaction training labels is available. We further show that, once trained, RAMs can be directly applied to a new dataset for offline evaluation. This work shows the possibility of designing better evaluation metrics by incorporating fine-grained search context factors.
Question Answering (QA), a popular and promising technique for intelligent information access, faces a dilemma about data, as do most other AI techniques. On the one hand, modern QA methods rely on deep learning models, which are typically data-hungry; it is therefore desirable to collect and fuse all available QA datasets in a common site for developing a powerful QA model. On the other hand, real-world QA datasets are typically distributed as isolated islands belonging to different parties. Due to the increasing awareness of data privacy, it is almost impossible to integrate the scattered data, or the cost is prohibitive. A possible solution to this dilemma is federated learning, a privacy-preserving machine learning technique over distributed datasets. In this work, we propose to adopt federated learning for QA, with special concern for the statistical heterogeneity of QA data. Here, heterogeneity refers to the fact that annotated QA data in practice are typically not independent and identically distributed (non-IID) and have unbalanced sizes. Traditional federated learning methods may sacrifice the accuracy of individual models in this heterogeneous situation. To tackle this problem, we propose a novel federated matching framework for QA, named FedMatch, with a backbone-patch architecture. The shared backbone distills the common knowledge of all participants, while the private patch is a compact and efficient module that retains domain information for each participant. To facilitate evaluation, we build a benchmark collection based on several QA datasets from different domains to simulate the heterogeneous situation in practice. Empirical studies demonstrate that our model achieves significant improvements over the baselines on all datasets.
Most existing gradient-based meta-learning approaches to few-shot learning assume that all tasks share the same input feature space. In real-world scenarios, however, the input structures of tasks can differ: tasks may vary in the number of input modalities or data types. Existing meta-learners cannot handle such a heterogeneous task distribution (HTD), as there is not only global meta-knowledge shared across tasks but also type-specific knowledge that distinguishes each type of task. To deal with task heterogeneity and promote fast within-task adaptation for each type of task, we propose HetMAML, a task-heterogeneous model-agnostic meta-learning framework that captures both type-specific and globally shared knowledge and balances knowledge customization against generalization. Specifically, we design a multi-channel backbone module that encodes the input of each task type into a same-length sequence of modality-specific embeddings. We then propose a task-aware iterative feature aggregation network that automatically takes into account the context of task-specific input structures and adaptively projects the heterogeneous input spaces into the same lower-dimensional embedding space of concepts. Our experiments on six task-heterogeneous datasets demonstrate that HetMAML successfully leverages type-specific and globally shared meta-parameters for heterogeneous tasks and achieves fast within-task adaptation for each type of task.
Deep reinforcement learning enables an agent to capture users' interests through dynamic interactions with the environment. It uses a reward function to learn users' interests and to control the learning process, which has attracted great interest in recommendation research. However, most reward functions are manually designed; they are either too unrealistic or too imprecise to reflect the variety, dimensionality, and non-linearity of the recommendation problem. This impedes the agent from learning an optimal policy in highly dynamic online recommendation scenarios. To address this issue, we propose a generative inverse reinforcement learning approach that avoids the need to define an elaborate reward function. In particular, we model recommendation as an automatic policy learning problem. We first generate policies based on observed users' preferences and then evaluate the learned policy with a measurement based on a discriminative actor-critic network. We conduct experiments on an online platform, VirtualTB, and demonstrate the feasibility and effectiveness of our proposed approach via comparisons with several state-of-the-art methods.
In this work, we propose a robust road network representation learning framework called Toast, which serves as a cornerstone for boosting the performance of numerous demanding transport planning tasks. Specifically, we first propose a traffic-context-aware skip-gram module that incorporates auxiliary tasks of predicting the traffic context of a target road segment. Furthermore, we propose a trajectory-enhanced Transformer module that utilizes trajectory data to extract traveling semantics on road networks. Apart from producing effective road segment representations, this module also enables us to obtain route representations. With these two modules, we can learn representations that capture multi-faceted characteristics of road networks and can be applied in both road-segment-based and trajectory-based applications. Finally, we design a benchmark containing four typical transport planning tasks to evaluate the usefulness of Toast, and comprehensive experiments verify that Toast consistently outperforms the state-of-the-art baselines across all tasks.
User response prediction, which aims to predict the probability that a user will provide a predefined positive response in a given context, such as clicking on an ad or purchasing an item, is crucial to many industrial applications such as online advertising, recommender systems, and search ranking. For these and many other machine learning tasks, feature engineering is an indispensable part of success, and cross features are a significant type of feature transformation. However, due to the high dimensionality and extreme sparsity of the data collected in these tasks, handcrafting cross features is inevitably time-consuming. Prior studies in user response prediction leveraged feature interactions by enhancing feature vectors with products of features to model second-order or higher-order cross features, either explicitly or implicitly. However, these existing methods can be hindered by learning insufficient cross features due to architectural limitations, or by modeling all high-order feature interactions with equal weights. Different features should contribute differently to the prediction, and not all cross features have the same predictive power.
This work aims to fill this gap by proposing a novel architecture, the Deep Cross Attentional Product Network (DCAP), which retains the cross network's benefit of modeling high-order feature interactions explicitly at the vector-wise level. By computing the inner or outer product between attentional feature embeddings and the original input embeddings as each layer's output, we can model cross features of increasing order as the network's depth grows. We concatenate the outputs from all layers, which further helps the model capture rich information about cross features of different orders. Beyond that, inspired by the multi-head attention mechanism and the Product Neural Network (PNN), the model can differentiate the importance of different cross features in each network layer, allowing practitioners to perform a more in-depth analysis of user behaviors. Additionally, our proposed model can easily be implemented and trained in parallel. We conduct comprehensive experiments on three real-world datasets. The results robustly demonstrate that DCAP achieves superior prediction performance compared with state-of-the-art models. Public code is available at https://github.com/zachstarkk/DCAP.
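To illustrate how depth raises the order of the modeled crosses, here is a heavily simplified sketch: each layer mixes the current state with a random stand-in for the attention weights and multiplies it elementwise with the raw field embeddings, so the l-th layer encodes (l+1)-order interactions, and the pooled per-layer outputs are concatenated. This is not the DCAP architecture itself (which uses learned multi-head attention and inner/outer products); every name, shape, and weight here is illustrative.

```python
import numpy as np

def cross_layers(X, n_layers=3, seed=0):
    # X: (num_fields, d) field embeddings. Each layer multiplies a mixed
    # (attention-like) version of the current state with the original
    # embeddings elementwise, so depth l encodes (l+1)-order crosses.
    rng = np.random.default_rng(seed)
    f, d = X.shape
    H = X.copy()
    outputs = []
    for _ in range(n_layers):
        W = rng.normal(scale=1.0 / np.sqrt(f), size=(f, f))  # stand-in for attention weights
        H = (W @ H) * X                  # vector-wise interaction with the raw inputs
        outputs.append(H.mean(axis=0))   # pool each layer's output
    return np.concatenate(outputs)       # concatenation across interaction orders
```

Concatenating all layer outputs, rather than keeping only the last, is what lets a downstream predictor weigh crosses of different orders separately.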
Sequential recommendation aims to recommend items that a target user will interact with in the near future based on previously interacted items. While modeling temporal dynamics is crucial for sequential recommendation, most existing studies concentrate solely on the user side while overlooking sequential patterns on the counterpart, i.e., the item side. Although a few studies investigate the dynamics of both sides, they do not fully exploit the complex user-item interactions from a global perspective to derive dynamic user and item representations. In this paper, we devise a novel Dynamic Representation Learning model for Sequential Recommendation (DRL-SRe). To better model user-item interactions and characterize the dynamics of both sides, the proposed model builds a global user-item interaction graph for each time slice and exploits time-sliced graph neural networks to learn user and item representations. Moreover, to capture fine-grained temporal information, we propose an auxiliary temporal prediction task over consecutive time slices based on temporal point processes. Comprehensive experiments on three public real-world datasets demonstrate that DRL-SRe outperforms state-of-the-art sequential recommendation models by a large margin.
Spoken Language Understanding (SLU), a core component of task-oriented dialogue systems, demands short inference latency because users expect prompt responses. Non-autoregressive SLU models clearly increase inference speed but suffer from the uncoordinated-slot problem caused by the lack of sequential dependency information among slot chunks. To address this shortcoming, we propose a novel non-autoregressive SLU model named Layered-Refine Transformer, which contains a Slot Label Generation (SLG) task and a Layered Refine Mechanism (LRM). SLG is defined as generating the next slot label given the token sequence and the previously generated slot labels. With SLG, the non-autoregressive model can efficiently obtain dependency information during training while spending no extra time at inference. LRM predicts preliminary SLU results from the Transformer's intermediate states and utilizes them to guide the final prediction. Experiments on two public datasets indicate that our model significantly improves SLU performance (by 1.5% in overall accuracy) while substantially speeding up inference (by more than 10 times) over the state-of-the-art baseline.
Collaborative filtering (CF) is a long-standing problem of recommender systems. Many novel methods have been proposed, ranging from classical matrix factorization to recent graph convolutional network-based approaches. After recent fierce debates, researchers started to focus on linear graph convolutional networks (GCNs) with a layer combination, which show state-of-the-art accuracy on many datasets. In this work, we extend them based on neural ordinary differential equations (NODEs), because the linear GCN concept can be interpreted as a differential equation, and present Learnable-Time ODE-based Collaborative Filtering (LT-OCF). The main novelty of our method is that, after redesigning linear GCNs on top of the NODE regime, i) we learn the optimal architecture rather than relying on manually designed ones, ii) we learn smooth ODE solutions that are considered suitable for CF, and iii) we test with various ODE solvers that internally build a diverse set of neural network connections. We also present a novel training method specialized to our method. In our experiments with three benchmark datasets, our method consistently outperforms existing methods on various evaluation metrics. A further important finding is that our best accuracy was achieved with dense connections.
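The continuous-depth reading of linear GCN propagation can be made concrete with a small sketch: treating propagation as the ODE dE/dt = (Ã − I)E, where Ã is the normalized adjacency, an explicit-Euler solve with step size 1 recovers ordinary layer-wise propagation E ← ÃE. This is only the general NODE view that the abstract invokes, not the LT-OCF method itself; all names are illustrative.

```python
import numpy as np

def normalized_adj(A):
    # Symmetrically normalized adjacency D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def ode_propagate(A_norm, E0, t_end=1.0, steps=4):
    # Explicit-Euler solve of dE/dt = (A_norm - I) E, the continuous-depth
    # reading of linear GCN propagation; `steps` controls the discretization.
    h = t_end / steps
    E = E0.copy()
    for _ in range(steps):
        E = E + h * (A_norm @ E - E)
    return E
```

With steps=1 and t_end=1 this is exactly one discrete GCN layer; smaller steps give the smoother trajectories that the NODE formulation makes learnable.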
Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, an evaluation metric for comparing relevance judgments across annotation settings has become a necessity. Traditional metrics, such as κ, Krippendorff's α, and Φ, mainly focus on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter the "reliable but useless" problem when employed to compare different annotation settings (e.g., binary judgment vs. 4-grade judgment). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across annotation settings and therefore suffer from limitations, such as requiring result ranking lists from different systems. Hence, how to design an evaluation metric that compares relevance judgments under different grade settings needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates the discriminative ability of relevance judgments in separating ranking lists of various qualities. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across various grade settings. Compared with existing metrics (e.g., Krippendorff's α, Φ, and DP), it provides reliable evaluation results with affordable additional annotation effort.
Given an input dataset (i.e., a set of tuples), query definability in Ontology-based Data Management (OBDM) amounts to finding a query over the ontology whose certain answers coincide with the tuples in the given dataset. We refer to such a query as a characterization of the dataset with respect to the OBDM system. Our first contribution is to propose approximations of perfect characterizations in terms of recall (complete characterizations) and precision (sound characterizations). A second contribution is to present a thorough complexity analysis of three computational problems, namely verification (check whether a given query is a perfect, or an approximated characterization of a given dataset), existence (check whether a perfect, or a best approximated characterization of a given dataset exists), and computation (compute a perfect, or best approximated characterization of a given dataset).
We introduce the novel and challenging task of answering Points-of-interest (POI) recommendation questions, using a collection of reviews that describe candidate answer entities (POIs). We harvest a QA dataset that contains 47,124 paragraph-sized user questions from travelers seeking POI recommendations for hotels, attractions and restaurants. Each question can have thousands of candidate entities to choose from and each candidate is associated with a collection of unstructured reviews. Questions can include requirements based on physical location, budget, timings as well as other subjective considerations related to ambience, quality of service etc. Our dataset requires reasoning over a large number of candidate answer entities (over 5300 per question on average) and we find that running commonly used neural architectures for QA is prohibitively expensive. Further, commonly used retriever-ranker based methods also do not work well for our task due to the nature of review-documents. Thus, as a first attempt at addressing some of the novel challenges of reasoning-at-scale posed by our task, we present a task specific baseline model that uses a three-stage cluster-select-rerank architecture. The model first clusters text for each entity to identify exemplar sentences describing an entity. It then uses a neural information retrieval (IR) module to select a set of potential entities from the large candidate set. A reranker uses a deeper attention-based architecture to pick the best answers from the selected entities. This strategy performs better than a pure retrieval or a pure attention-based reasoning approach yielding nearly 25% relative improvement in Hits@3 over both approaches. To the best of our knowledge we are the first to present an unstructured QA-style task for POI-recommendation, using real-world tourism questions and POI-reviews.
The ongoing COVID-19 pandemic has dramatically changed people's daily lives. A robust forecasting model for COVID-19 infections is essential for governments and institutions to plan and perform timely, accurate interventions. Mainstream solutions for COVID-19 prediction fit reported data only, considering observed cases alone. However, neglecting the facts that positive samples are incomplete and that much about the novel disease remains unknown is prone to cause severe error accumulation, especially in long-term predictions. To fully understand the spreading patterns of the virus, we propose an encoder-decoder framework: (i) in the encoder, we embed historical case data into multiple expose-infection ranges and learn message passing between time slices and across ranges, with coarse-grained human mobility data incorporated; (ii) in the decoder, we decode the embedded features based on reported cases as well as deaths to jointly consider the effects of both observed and hidden data. We model the spreading of the disease in over 60 counties of California and New York, two of the most populous metropolitan areas in the US. The proposed framework significantly outperforms state-of-the-art baselines on the JHU COVID-19 dataset on both weekly and daily prediction tasks. We design detailed ablation studies to verify the effectiveness of each key module and find that the model works not only with the assistance of mobility data but also with cases and deaths alone, which implies broad application scenarios.
Graph Neural Networks (GNNs), which generalize deep neural networks to graph-structured data, have achieved great success in modeling graphs. However, as an extension of deep learning to graphs, GNNs lack explainability, which largely limits their adoption in scenarios that demand model transparency. Though many efforts have been made to improve the explainability of deep learning, they mainly focus on i.i.d. data and cannot be directly applied to explain the predictions of GNNs, because GNNs utilize both node features and graph topology to make predictions. There are only very few works on the explainability of GNNs, and they focus on post-hoc explanations. Since post-hoc explanations are not directly obtained from the GNNs, they can be biased and misrepresent the true explanations. Therefore, in this paper, we study the novel problem of self-explainable GNNs, which can simultaneously give predictions and explanations. We propose a new framework that finds the K nearest labeled nodes for each unlabeled node to provide explainable node classification, where the nearest labeled nodes are found by an interpretable similarity module in terms of both node similarity and local structure similarity. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of the proposed framework for explainable node classification.
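The K-nearest-labeled-nodes idea can be illustrated with a toy sketch. The actual framework learns its similarity module end to end; here, as an assumption for illustration only, we hard-code a mix of cosine feature similarity and Jaccard similarity of neighbor sets, classify by majority vote over the top-k labeled nodes, and return those nodes as the explanation. All names and the `alpha` weighting are hypothetical.

```python
import math
from collections import Counter

def cosine(x, y):
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny) if nx and ny else 0.0

def jaccard(s, t):
    # Local-structure similarity via overlap of neighbor sets.
    return len(s & t) / len(s | t) if s | t else 0.0

def explain_and_classify(node, labeled, feats, adj, k=2, alpha=0.5):
    # Score each labeled node by a mix of feature and structural similarity;
    # the top-k labeled nodes double as the explanation for the prediction.
    scores = []
    for u, label in labeled.items():
        s = alpha * cosine(feats[node], feats[u]) + (1 - alpha) * jaccard(adj[node], adj[u])
        scores.append((s, u, label))
    scores.sort(reverse=True)
    top = scores[:k]
    pred = Counter(label for _, _, label in top).most_common(1)[0][0]
    return pred, [u for _, u, _ in top]
```

The point of the self-explainable design is that the returned neighbor list is the prediction's evidence, rather than a post-hoc rationalization computed after the fact.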
Core decomposition is a fundamental operator in network analysis. In this paper, we study the problem of computing the distance-generalized core decomposition of a network. A distance-generalized core, also termed a (k, h)-core, is a maximal subgraph in which every vertex has at least k other vertices at distance no larger than h. The state-of-the-art algorithm for this problem is based on a peeling technique that iteratively removes the vertex (denoted by v) with the smallest h-hop degree from the graph. The h-hop degree of a vertex v is the number of other vertices reachable from v within h hops. Such a peeling algorithm, however, needs to frequently recompute the h-hop degrees of v's neighbors after deleting v, which is typically very costly for large h. To overcome this limitation, we propose an efficient peeling algorithm based on a novel h-hop degree updating technique. Instead of recomputing the h-hop degrees, our algorithm dynamically maintains the h-hop degrees of all vertices by exploring only a very small subgraph after peeling a vertex. We show that this h-hop degree updating procedure can be efficiently implemented with an elegant bitmap technique. In addition, we propose a sampling-based algorithm and a parallelization technique to further improve efficiency. Finally, we conduct extensive experiments on 12 real-world graphs to evaluate our algorithms. The results show that, when h≥3, our exact and sampling-based algorithms achieve up to 10x and 100x speedup over the state-of-the-art algorithm, respectively.
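The baseline the abstract improves upon can be sketched in plain Python: compute h-hop degrees by bounded-depth BFS and repeatedly remove vertices whose h-hop degree falls below k. This naive version recomputes h-hop degrees from scratch, which is exactly the cost the paper's updating technique avoids; `adj` is a dict from vertex to neighbor set, and all names are illustrative.

```python
from collections import deque  # deque not strictly needed; lists suffice for bounded BFS

def h_hop_degree(adj, v, h):
    # Number of vertices reachable from v within h hops (excluding v itself).
    seen = {v}
    frontier = [v]
    for _ in range(h):
        nxt = []
        for u in frontier:
            for w in adj.get(u, ()):
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return len(seen) - 1

def kh_core(adj, k, h):
    # Naive peeling: repeatedly remove vertices whose h-hop degree < k
    # until a fixpoint; the survivors form the (k, h)-core.
    adj = {v: set(ns) for v, ns in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v in adj and h_hop_degree(adj, v, h) < k:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return set(adj)
```

For h=1 this reduces to classical k-core peeling; for larger h each BFS grows, which is why recomputation after every deletion becomes the bottleneck.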
Multi-task learning has attracted much attention in recent years; its goal is to learn multiple tasks by exploiting the similarities and differences between them. Previous research on multi-task learning mainly focuses on flexible methods for feature sharing (e.g., soft sharing) under resource-sufficient settings (e.g., on GPU servers). However, in many real-world applications, multi-task learning models must be deployed on resource-constrained platforms (e.g., mobile devices). The high resource requirements of soft-sharing methods can make them hard to deploy on such devices. In this paper, we study the problem of Resource-efficient Multi-Task Learning (MTL), where the goal is to design a resource-friendly model suited to resource-constrained inference environments, e.g., security cameras or mobile devices. We formulate the Resource-efficient MTL problem as a fine-grained filter sharing problem, i.e., learning how to share filters at any given convolutional layer among multiple tasks. We propose a novel solution for parameter sharing, called FiShNet. Unlike soft-sharing approaches, where the computational cost per task grows with the number of other tasks, FiShNet achieves accuracy comparable to soft-sharing approaches while consuming only a constant computational cost per task. Unlike hard-sharing approaches, where the parameter sharing structures are hand-picked, FiShNet learns how to share parameters directly from the training data at a finer granularity. We evaluate FiShNet on a number of problem settings and datasets for multi-task learning and show that it achieves high accuracy compared with state-of-the-art methods while requiring only a fraction of the computational resources.
The way the influence of a set of users diffuses through a social network has been widely studied in recent decades. Most of the work has focused on maximizing the spread of influence or the diffusion of information (e.g., a viral marketing message) starting from a set of initial nodes called seeds. Unfortunately, malicious users can exploit these algorithms to spread negative messages consisting of racist or hateful content, misinformation, or fake news. We consider a scenario in which a malicious entity, the attacker, spreads a negative message, and another entity, the defender, tries to mitigate its effects by spreading a message that invalidates the former with evidence that its content is wrong. The attacker has the advantage of playing first, knowing that the defender will play afterward, while the defender has the advantage of observing the attacker's spread. We define two optimization problems: the attacker, who is aware of the defender and her budget, selects a set of seeds to maximize the number of influenced nodes; once the attacker's diffusion process is finished, the defender selects her own seeds with the aim of minimizing the number of nodes that remain influenced by the attacker.
Deep learning models have been studied to forecast human events using vast volumes of data, yet they still cannot be trusted in certain applications such as healthcare and disaster assistance due to their lack of interpretability. Providing explanations for event predictions not only helps practitioners understand the underlying mechanism of prediction behavior but also enhances the robustness of event analysis. Improving the transparency of event prediction models is challenging for the following reasons: (i) event data contain multilevel features, which makes it challenging to cross-utilize different levels of data; (ii) features across different levels and time steps are heterogeneous and dependent; and (iii) static model-level interpretations cannot be easily adapted to event forecasting given the dynamic and temporal characteristics of the data. Recent interpretation methods have proven their capabilities in tasks that deal with graph-structured or relational data. In this paper, we present a Contextualized Multilevel Feature learning framework, CMF, for interpretable temporal event prediction. It consists of a predictor for forecasting events of interest and an explanation module for interpreting model predictions. We design a new context-based feature fusion method to integrate multiple levels of heterogeneous features. We also introduce a temporal explanation module to determine the sequences of text and subgraphs that play crucial roles in a prediction. We conduct extensive experiments on several real-world datasets of political and epidemic events and demonstrate that the proposed method is competitive with state-of-the-art models while possessing favorable interpretation capabilities.
Network alignment seeks to discover the hidden underlying correspondence between nodes across two (or more) networks given their network structure. However, most existing network alignment methods assume additional constraints to guide the alignment, such as a set of seed node-node correspondences across the networks or the existence of side information. Instead, we seek to develop a general unsupervised network alignment algorithm that makes no such assumptions. Recently, network embedding has proven effective in many network analysis tasks, but embeddings of different networks are not aligned. Thus, we present our Deep Adversarial Network Alignment (DANA) framework, which first uses deep adversarial learning to discover complex mappings that align the embedding distributions of the two networks. Then, using the learned mapping functions, DANA performs an efficient nearest-neighbor node alignment. Furthermore, we present an unsupervised heuristic for model selection in DANA. We perform experiments on real-world datasets to show the effectiveness of our framework at first aligning the graph embedding distributions and then discovering node alignments that outperform existing methods.
Data repair, i.e., the identification and correction of errors in data, is a central component of the Data Science cycle. As such, significant research effort has been devoted to automating the repair process. Yet it still requires substantial manual labor from data scientists, who tweak and optimize repair modules (consuming up to 80% of their time, according to surveys).
To this end, we propose in this paper a novel framework for explaining the results of any data repair module. Explanations involve identifying the table cells and database constraints having the strongest influence on the process. Influence, in turn, is quantified through the game-theoretic notion of Shapley values, commonly used for explaining Machine Learning classifier results. The main technical challenge is that exact computation of Shapley values incurs exponential time. We consequently devise and optimize novel approximation algorithms, and analyze them both theoretically and empirically. Our results show the efficiency of our approach when compared to the alternative of adapting existing Shapley value computation techniques to the data repair settings.
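Shapley values and their sampling-based approximation can be illustrated with a generic permutation-sampling sketch. This is not the paper's optimized algorithms, only the textbook Monte Carlo estimator such work builds on; `value_fn` is a stand-in for the influence measure over sets of cells or constraints, and all names are illustrative.

```python
import random

def shapley_mc(players, value_fn, n_samples=2000, seed=0):
    # Permutation-sampling approximation of Shapley values: for each random
    # ordering, credit each player with its marginal contribution when added.
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        perm = list(players)
        rng.shuffle(perm)
        coalition = set()
        prev = value_fn(coalition)
        for p in perm:
            coalition.add(p)
            cur = value_fn(coalition)
            phi[p] += cur - prev
            prev = cur
    return {p: v / n_samples for p, v in phi.items()}
```

Exact Shapley computation sums over all 2^n coalitions; sampling permutations trades that exponential cost for a variance that shrinks with the number of samples, which is the kind of trade-off the approximation algorithms above optimize for the repair setting.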
Hierarchical community detection, which aims at discovering the hierarchical structure of a graph, attracts increasing attention due to its wide range of applications. However, because parametrizing the community tree is difficult, existing methods mainly rely on heuristic algorithms, which are limited by low accuracy and an inability to handle new observations. As far as we know, leveraging deep learning techniques to better discover hierarchical communities remains almost entirely unexplored in the literature. In this paper, we present the first deep learning framework, called ReinCom, for hierarchical community detection. To address the challenge of parametrizing the community tree, we propose a novel growing-up process where, at each step, we first partition nodes into the community tree and then adjust the tree according to the partition results. To learn an optimal growing-up process, we propose an embedding agent and a community agent to implement the two sub-steps, respectively. Furthermore, we propose an online learning strategy for new observations on the graph. Empirical results show that our proposed model has better modeling effectiveness than state-of-the-art methods; for example, in terms of modularity, ReinCom performs 33% better than previous community detection methods. Besides, with the aid of the learned node embeddings, we devise a graph visualization algorithm that consistently reflects the latent hierarchical structure of a graph.
Variational AutoEncoder (VAE) is a popular deep generative framework with a solid theoretical basis, and there have been many research efforts to improve it. Among existing works, the recently proposed deterministic Regularized AutoEncoder (RAE) provides a new scheme for generative modeling. RAE fixes the variance of the inferred Gaussian approximate posterior distribution as a hyperparameter and substitutes the stochastic encoder by injecting noise into the input of a deterministic decoder. However, the deterministic RAE has three limitations: 1) it needs to fit the variance; 2) it requires ex-post density estimation to ensure sample quality; and 3) it employs an additional gradient regularization to ensure training smoothness. This raises an interesting research question: can we maintain the flexibility of variational inference while simplifying VAE, and at the same time ensure a smooth training process to obtain good generative performance? Based on this motivation, we propose a novel Semi-deterministic and Contrastive Variational Graph autoencoder (SCVG) for item recommendation. The core design of SCVG is to learn the variance of the approximate Gaussian posterior in a semi-deterministic manner by aggregating the inferred mean vectors of other connected nodes via a graph convolution operation. We analyze the expressive power of SCVG with respect to the Weisfeiler-Lehman graph isomorphism test, and we derive the simplified form of its evidence lower bound. Besides, we introduce an efficient contrastive regularization in place of gradient regularization. We empirically show that the contrastive regularization makes the learned user/item latent representations more personalized and helps smooth the training process. We conduct extensive experiments on three real-world datasets to show the superiority of our model over state-of-the-art methods for item recommendation. Code is available at https://github.com/syxkason/SCVG.
Graph Neural Networks have recently become a prevailing paradigm for various high-impact graph analytical problems. Existing efforts can be mainly categorized into spectral-based and spatial-based methods. The major challenge for the former is to find an appropriate graph filter to distill discriminative information from input signals for learning. Recently, myriad explorations have been made to achieve better graph filters, e.g., the Graph Convolutional Network (GCN), which leverages Chebyshev polynomial truncation to approximate graph filters and bridges these two families of methods. Nevertheless, recent studies have shown that GCN and its variants essentially employ fixed low-pass filters to perform information denoising. Thus their learning capability is rather limited, and they may over-smooth node representations at deeper layers. To tackle these problems, we develop a novel graph neural network framework, AdaGNN, with a well-designed adaptive frequency response filter. At its core, AdaGNN leverages a simple but elegant trainable filter that spans multiple layers to capture the varying importance of different frequency components for node representation learning. The inherent differences among feature channels are also well captured by the filter. As such, AdaGNN has stronger expressiveness and naturally alleviates the over-smoothing problem. We empirically validate the effectiveness of the proposed framework on various benchmark datasets. Theoretical analysis is also provided to show the superiority of AdaGNN. The open-source implementation can be found at https://github.com/yushundong/AdaGNN.
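The idea of a trainable per-channel frequency response can be sketched as follows: each layer subtracts a channel-specific fraction of the signal's high-frequency (Laplacian) component, X ← X − LXΦ, where the diagonal Φ would be learned in practice. This follows the general form suggested by the abstract, with fixed illustrative values standing in for trained parameters; it is a sketch, not the AdaGNN implementation.

```python
import numpy as np

def laplacian(A):
    # Unnormalized graph Laplacian L = D - A.
    d = A.sum(axis=1)
    return np.diag(d) - A

def adaptive_filter(A, X, phis):
    # Each layer applies X <- X - L @ X @ diag(phi): phi[c] controls how much
    # high-frequency content is removed from feature channel c (trainable in
    # the real model; fixed here for illustration).
    L = laplacian(A)
    for phi in phis:
        X = X - L @ X @ np.diag(phi)
    return X
```

With phi near 0 a channel passes through unfiltered; larger phi smooths it more aggressively, which is how a per-channel, per-layer filter can avoid applying one fixed low-pass response to every feature.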
Time series have wide applications in the real world and are known to be difficult to forecast. Since a time series' statistical properties change over time, its distribution also changes temporally, which causes a severe distribution shift problem for existing methods. However, modeling time series from the distribution perspective remains unexplored. In this paper, we term this phenomenon Temporal Covariate Shift (TCS) and propose Adaptive RNNs (AdaRNN) to tackle it by building an adaptive model that generalizes well on unseen test data. AdaRNN is sequentially composed of two novel algorithms. First, we propose Temporal Distribution Characterization to better characterize the distribution information in the time series. Second, we propose Temporal Distribution Matching to reduce the distribution mismatch in the time series and learn an adaptive model. AdaRNN is a general framework into which flexible distribution distances can be integrated. Experiments on human activity recognition, air quality prediction, and financial analysis show that AdaRNN outperforms the latest methods by 2.6% in classification accuracy and significantly reduces RMSE by 9.0%. We also show that the Temporal Distribution Matching algorithm can be extended to the Transformer architecture to boost its performance.
Online advertising is an important revenue source for many IT companies. In the search advertising scenario, advertisement text that meets the needs of the search query is more attractive to the user. However, manually creating query-variant advertisement texts for massive numbers of items is expensive. Traditional text generation methods tend to focus on high-frequency general search needs while ignoring diverse, low-frequency personalized search needs. In this paper, we propose the query-variant advertisement text generation task, which aims to generate candidate advertisement texts for different web search queries with various needs, based on queries and item keywords. To avoid ignoring low-frequency needs, we propose a dynamic association mechanism that expands the receptive field based on external knowledge and obtains associated words to be added to the input. These associated words serve as bridges that transfer the model's ability from familiar high-frequency words to unfamiliar low-frequency words. With association, the model can make use of the various personalized needs in queries and generate query-variant advertisement texts. Both automatic and human evaluations show that our model generates more attractive advertisement text than the baselines.
Computational argumentation, and especially argument mining combined with retrieval, is enjoying increasing popularity. In contrast to standard search engines that focus on finding documents relevant to a query, argument retrieval aims at finding the best supporting and attacking premises for a query claim, e.g., from a predefined collection of arguments. Here, a claim is the central part of an argument, representing the standpoint of a speaker with the goal of persuading the audience, and a premise serves as evidence for the claim. In addition to the actual retrieval process, existing work has focused on (1) classifying the polarity of arguments as supporting or opposing, (2) classifying arguments by their frames (such as economic or environmental), and (3) clustering similar arguments by their meaning to avoid repetitions in the result list. For experiments, either hand-made argument collections or arguments extracted from debate portals have been used. In this paper, we extend existing work on argument clustering, making the following contributions: First, we introduce a novel pipeline for clustering arguments. While previous work classified arguments by polarity, frame, or meaning individually, our pipeline incorporates all three, allowing a more systematic presentation of arguments. Second, we introduce a new dataset consisting of 365 argument graphs comprising more than 11,000 high-quality arguments that, contrary to previous datasets, have been generated, displayed, and verified by journalists and published in newspapers. A thorough evaluation on this dataset provides a first baseline for future work.
To model the evolution of user preferences, user/item embeddings should be learned from time-ordered item purchasing sequences, which is defined as the Sequential Recommendation (SR) problem. Existing methods leverage sequential patterns to model item transitions. However, most of them ignore crucial temporal collaborative signals, which are latent in evolving user-item interactions and coexist with sequential patterns. Therefore, we propose to unify sequential patterns and temporal collaborative signals to improve the quality of recommendation, which is rather challenging: first, it is hard to simultaneously encode sequential patterns and collaborative signals; second, it is non-trivial to express the temporal effects of collaborative signals.
Hence, we design a new framework, Temporal Graph Sequential Recommender (TGSRec), upon our defined continuous-time bipartite graph. We propose a novel Temporal Collaborative Transformer (TCT) layer in TGSRec, which advances the self-attention mechanism by adopting a novel collaborative attention. The TCT layer can simultaneously capture collaborative signals from both users and items while considering temporal dynamics inside sequential patterns. We propagate the information learned by the TCT layer over the temporal graph to unify sequential patterns and temporal collaborative signals. Empirical results on five datasets show that TGSRec significantly outperforms other baselines, with average absolute improvements of up to 22.5% and 22.1% in Recall@10 and MRR, respectively.
Privacy-preserving machine learning has drawn increasing attention recently, especially as various privacy regulations come into force. In this situation, Federated Learning (FL) appears to facilitate privacy-preserving joint modeling among multiple parties. Although many federated algorithms have been extensively studied, there is still a lack of secure and practical gradient tree boosting models (e.g., XGB) in the literature. In this paper, we aim to build large-scale secure XGB under the vertically federated learning setting. We guarantee data privacy from three aspects: (1) we employ secure multi-party computation techniques to avoid leaking intermediate information during training, (2) we store the output model in a distributed manner in order to minimize information release, and (3) we provide a novel algorithm for secure XGB prediction with the distributed model. Furthermore, by proposing secure permutation protocols, we improve training efficiency and make the framework scale to large datasets. We conduct extensive experiments on both public and real-world datasets, and the results demonstrate that our proposed XGB models provide not only competitive accuracy but also practical performance.
Attributed Graph Clustering (AGC) and Attributed Hypergraph Clustering (AHC) are important topics in graph mining with many applications. For AGC, amongst the unsupervised methods that combine the graph structure with node attributes, graph convolution has been shown to achieve impressive results. However, the effects of graph convolution on AGC have not yet been adequately studied. In this paper, we show that graph convolution attempts to find the best trade-off between node attribute distance and the number of inter-cluster edges. On the one hand, we show that compared to clustering node attributes directly, graph convolution produces a greater distance between node attributes in the same cluster and a smaller distance between node attributes in different clusters (which is detrimental for clustering). On the other hand, we show that graph convolution benefits clustering by considerably reducing the number of edges among different clusters. We then extend our result on AGC to AHC and leverage the hypergraph convolution to propose an unsupervised, fast, and memory-efficient algorithm (GRAC) for AHC, which achieves excellent performance on popular supervised clustering measures.
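The claimed trade-off can be seen on a toy example. The sketch below (an illustration of graph convolution as neighborhood averaging, not the paper's GRAC algorithm) smooths node attributes with X' = D⁻¹(A + I)X on a four-node graph with two clusters: the intra-cluster attribute distance grows while the inter-cluster gap shrinks:

```python
# One step of graph convolution as neighborhood averaging, X' = D^-1 (A + I) X.
def smooth(adj, feats):
    n = len(feats)
    out = []
    for i in range(n):
        nbrs = [i] + [j for j in range(n) if adj[i][j]]  # self-loop + neighbors
        out.append(sum(feats[j] for j in nbrs) / len(nbrs))
    return out

# Two clusters {0,1} and {2,3} connected by one inter-cluster edge (1,2).
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
feats = [0.0, 0.1, 1.0, 1.1]   # scalar attribute per node
print(smooth(adj, feats))       # nodes 1 and 2 are pulled toward each other
```

After smoothing, nodes 0 and 1 (same cluster) are farther apart than before, while the gap between nodes 1 and 2 (different clusters) shrinks, matching both effects the paper describes.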
Unsupervised domain adaptation (UDA) methods aim to transfer knowledge from a labeled source domain to an unlabeled target domain. Most existing UDA methods try to learn domain-invariant features so that a classifier trained with source labels can automatically be adapted to the target domain. However, recent works have shown the limitations of these methods when label distributions differ between the source and target domains. Especially in partial domain adaptation (PDA), where the source domain holds many individual labels (private labels) that do not appear in the target domain, domain-invariant features can cause catastrophic performance degradation. In this paper, based on the originally favorable underlying structures of the two domains, we learn two kinds of target features, i.e., source-approximate features and target-approximate features, instead of domain-invariant features. The source-approximate features utilize the consistency of the two domains to estimate the distribution of the source private labels. The target-approximate features enhance feature discrimination in the target domain while detecting hard (outlier) target samples. A novel Coupled Approximation Neural Network (CANN) is proposed to co-train the source-approximate and target-approximate features with two parallel sub-networks that do not share parameters. We apply CANN to three prevalent transfer learning benchmark datasets, Office-Home, Office-31, and Visda2017, under both UDA and PDA settings. The results show that CANN outperforms all baselines by a large margin in PDA and also performs best in UDA.
User behavior has been validated to be effective in revealing personalized preferences for commercial recommendations. However, few user-item interactions can be collected for new users, which leaves a void in representing their interests, i.e., the cold-start dilemma. In this paper, a two-tower framework, namely the model-agnostic interest learning (MAIL) framework, is proposed to address the cold-start recommendation (CSR) problem for recommender systems. In MAIL, one tower is constructed to tackle CSR from a zero-shot view, and the other tower focuses on the general ranking task. Specifically, the zero-shot tower first performs cross-modal reconstruction with dual auto-encoders to obtain virtual behavior data from highly aligned hidden features for new users; the ranking tower can then output recommendations based on the data completed by the zero-shot tower. Practically, the ranking tower in MAIL is model-agnostic and can be implemented with any embedding-based deep model. Based on the co-training of the two towers, MAIL presents an end-to-end method for recommender systems that yields an incremental performance improvement. The proposed method has been successfully deployed on the live recommendation system of NetEase Cloud Music, achieving a click-through rate improvement of 13% to 15% for millions of users. Offline experiments on real-world datasets also show its superior performance in CSR. Our code is available.
Practical recommender systems experience a cold-start problem when the observed user-item interactions in the history are insufficient. Meta learning, especially gradient-based meta learning, can be adopted to tackle this problem by learning initial model parameters, thus allowing fast adaptation to a specific task from limited data examples. Though it brings significant performance improvements, it commonly suffers from two critical issues: incompatibility with mainstream industrial deployment and heavy computational burdens, both due to the inner-loop gradient operation. These two issues make such methods hard to apply in practical recommender systems. To enjoy the benefits of the meta learning framework while mitigating these problems, we propose a recommendation framework called Contextual Modulation Meta Learning (CMML). CMML is composed of fully feed-forward operations, so it is computationally efficient and completely compatible with mainstream industrial deployment. CMML consists of three components: a context encoder that generates a context embedding to represent a specific task, a hybrid context generator that aggregates specific user-item features with task-level context, and a contextual modulation network that modulates the recommendation model to adapt effectively. We validate our approach in both scenario-specific and user-specific cold-start settings on various real-world datasets, showing that CMML achieves comparable or even better performance than gradient-based methods with higher computational efficiency and better interpretability.
Recent conversational recommender systems (CRS) provide a promising solution for accurately capturing a user's preferences by communicating with users in natural language to interactively guide them while proactively eliciting their current interests. Previous research has mainly focused on either learning a supervised model with semantic features extracted from the user's responses, or training a policy network to control the dialogue state. However, none of it has considered the issue of popularity bias in a CRS. This paper proposes a human-in-the-loop popularity debiasing framework that integrates real-time semantic understanding of open-ended user utterances as well as historical records, while also effectively managing the dialogue with the user. This allows the CRS to balance recommendation performance against item popularity so as to avoid the well-known "long-tail" effect. We demonstrate the effectiveness of our approach via experiments on two conversational recommendation datasets, and the results confirm that our proposed approach achieves high-accuracy recommendation while mitigating popularity bias.
Anchor graphs are a popular tool for label prediction on sparsely labeled data. In anchor graphs, labels of labeled data are propagated to unlabeled data via anchor points, which are the centers of k-means clusters. Anchor graph-based label prediction determines local weights between data points and anchor points by exploiting Nesterov's method to obtain the graph's adjacency matrix, and it inverts a matrix obtained from the adjacency matrix to predict labels. This approach, however, incurs high computation cost since (1) Nesterov's method is applied to all closest anchor points to compute local weights, and (2) the cost of the matrix inversion is cubic in the number of anchor points. We propose an approach that efficiently performs anchor graph-based label prediction thanks to two key advances: (1) it prunes unnecessary anchor points so they are not passed to Nesterov's method, and (2) it applies the conjugate gradient method when computing labels of data points to avoid matrix inversion. In addition, we propose to exploit basis vectors computed by SVD as anchor points to improve label prediction accuracy. Experiments show that our approach outperforms previous approaches in terms of both efficiency and accuracy.
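To illustrate the second advance, the sketch below solves a small symmetric positive-definite system with a hand-written conjugate gradient loop instead of an explicit matrix inverse; the matrix and vector are made-up toy values, not the paper's anchor-graph matrices:

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A without inverting A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual b - A x (x starts at zero)
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]      # toy SPD matrix
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
# A x reproduces b without ever forming A^-1
print([round(sum(A[i][j] * x[j] for j in range(2)), 6) for i in range(2)])  # [1.0, 2.0]
```

Each iteration costs one matrix-vector product, so for sparse anchor matrices this scales far better than the cubic-cost inversion the paper avoids.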
Adapting neural networks to unseen tasks with few training samples on resource-constrained devices benefits various Internet-of-Things applications. Such neural networks should learn the new tasks in few shots and be compact in size. Meta-learning enables few-shot learning, yet the meta-trained networks can be over-parameterised. However, naively combining standard compression techniques like network pruning with meta-learning jeopardises the ability for fast adaptation. In this work, we propose adaptation-aware network pruning (ANP), a novel pruning scheme that works with existing meta-learning methods to yield a compact network capable of fast adaptation. ANP uses a weight-importance metric based on the sensitivity of the meta-objective rather than the conventional loss function, and adopts derivative approximation and layer-wise pruning techniques to reduce the overhead of computing the new importance metric. Evaluations on few-shot classification benchmarks show that ANP can prune meta-trained convolutional and residual networks by 85% without affecting their fast adaptation.
One of the core problems in large-scale recommendations is to retrieve top relevant candidates accurately and efficiently, preferably in sub-linear time. Previous approaches are mostly based on a two-step procedure: first learn an inner-product model, and then use some approximate nearest neighbor (ANN) search algorithm to find top candidates. In this paper, we present Deep Retrieval (DR), to learn a retrievable structure directly with user-item interaction data (e.g. clicks) without resorting to the Euclidean space assumption in ANN algorithms. DR's structure encodes all candidate items into a discrete latent space. Those latent codes for the candidates are model parameters and learnt together with other neural network parameters to maximize the same objective function. With the model learnt, a beam search over the structure is performed to retrieve the top candidates for reranking. Empirically, we first demonstrate that DR, with sub-linear computational complexity, can achieve almost the same accuracy as the brute-force baseline on two public datasets. Moreover, we show that, in a live production recommendation system, a deployed DR approach significantly outperforms a well-tuned ANN baseline in terms of engagement metrics. To the best of our knowledge, DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.
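The beam search over DR's layered code structure can be sketched as follows. This toy version assumes path-independent per-layer log-scores for simplicity (in DR the score of a code depends on the chosen prefix), so it is an illustration rather than the deployed algorithm:

```python
def beam_search(scores, beam_size=2):
    """Beam search over a layered discrete structure: scores[d][k] is an
    (assumed) log-score for choosing code k at layer d; keep the best
    `beam_size` code prefixes at each layer."""
    beams = [((), 0.0)]                      # (path, cumulative log-score)
    for layer in scores:
        cand = [(path + (k,), s + layer[k])
                for path, s in beams for k in range(len(layer))]
        beams = sorted(cand, key=lambda t: -t[1])[:beam_size]
    return beams

# 3 layers, 3 codes per layer (toy log-scores); each retrieved path is a
# latent code shared by a bucket of candidate items.
scores = [[0.0, -1.0, -2.0], [-0.5, 0.0, -1.5], [0.0, -0.2, -3.0]]
print(beam_search(scores))  # best path is (0, 1, 0) with score 0.0
```

The cost is linear in the number of layers times the beam width, which is what gives the sub-linear retrieval complexity relative to the full item set.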
AI chatbots can offer suggestions to help humans answer questions by reducing text entry effort and providing relevant knowledge for unfamiliar questions. We study whether chatbot suggestions can help people answer knowledge-demanding questions in a conversation and influence response quality and efficiency. We conducted a large-scale crowdsourcing user study and evaluated 20 hybrid system variants and a human-only baseline. The hybrid systems used four chatbots of varied response quality and differed in the number of suggestions and whether to preset the message box with top suggestions.
Experimental results show that chatbot suggestions---even using poor-performing chatbots---have consistently improved response efficiency. Compared with the human-only setting, hybrid systems have reduced response time by 12%--35% and keystrokes by 33%--60%, and users have adopted a suggestion for the final response without any changes in 44%--68% of the cases. In contrast, crowd workers in the human-only setting typed most of the response texts and copied 5% of the answers from other sites.
However, we also found that chatbot suggestions did not always help response quality. Specifically, in hybrid systems equipped with poor-performing chatbots, users responded with lower-quality answers than those in the human-only setting. It seems that users did not simply ignore poor suggestions and compose responses as they would have without seeing them. Besides, presetting the message box improved reply efficiency without hurting response quality. We did not find that showing more suggestions consistently helped or hurt response quality or efficiency. Our study reveals how and when AI chatbot suggestions can help people answer questions in hybrid conversational systems.
Knowledge graphs (KG) model relationships between entities as labeled edges (or facts). They are mostly constructed using a suite of automated extractors, thereby inherently leading to uncertainty in the extracted facts. Modeling the uncertainty as probabilistic confidence scores results in a probabilistic knowledge graph. Graph queries over such probabilistic KGs require answer computation along with the computation of result probabilities, i.e., probabilistic inference. We propose a system, HAPPI (How Provenance of Probabilistic Inference), to handle such query processing and inference. Complying with the standard provenance semiring model, we propose a novel commutative semiring to symbolically compute the probability of a query result. These provenance-polynomial-like symbolic expressions encode fine-grained information about the probability computation process. We leverage this encoding to efficiently compute and maintain result probabilities even as the underlying KG changes. Focusing on conjunctive basic graph pattern queries, we observe that HAPPI is more efficient than knowledge compilation for answering commonly occurring queries with a lower range of probability-derivation complexity. We propose an adaptive system that leverages the strengths of both HAPPI and compilation-based techniques, not only to perform efficient probabilistic inference and compute its provenance, but also to maintain them incrementally.
EXtreme Multi-label Learning (XML) aims to predict, for each instance, its most relevant subset of labels from an extremely large label space, often exceeding one million labels in many real applications. In XML scenarios, the labels exhibit a long-tail distribution, where a significant number of labels appear in very few instances; these are referred to as tail labels. Unfortunately, due to the lack of positive instances, the tail labels are intractable to learn as well as to predict. Several previous studies even suggested that tail labels can be directly removed based on their label frequencies. We argue that such a crude principle may discard many significant tail labels, because predictive accuracy is not strictly consistent with label frequency, especially for tail labels. In this paper, we are interested in finding a reasonable principle for determining whether a tail label should be removed that does not depend solely on label frequency. To this end, we investigate a method named Nearest Neighbor Positive Proportion Score (N2P2S) to score the tail labels using annotations of instance neighbors. Extensive empirical results indicate that the proposed N2P2S can effectively screen the tail labels, and many preserved tail labels can be learned and accurately predicted even with very few positive instances.
Spatio-temporal data are semantically valuable information used in various analytical tasks to identify spatially relevant and temporally limited correlations within a domain. The increasing availability of data acquired from multiple, typically highly heterogeneous sources is attracting more and more attention. However, these sources often lack interconnecting shared keys, making their integration a challenging problem. For example, publicly available parking data that consist of point data on parking facilities with fluctuating occupancy and static location data on parking spaces cannot be directly correlated. Both data sets describe two different aspects from distinct sources in which parking spaces and fluctuating occupancy are part of the same semantic model object. Especially for ad hoc analytical tasks on integrated models, these missing relationships cannot be handled with join operations as is usual in relational databases. The reason lies in the lack of equijoin relationships (e.g., comparing strings for equality) and the additional overhead of loading data before processing. This paper addresses the optimization problem of finding suitable join partners in the absence of equijoin relations for heterogeneous spatio-temporal data, applicable to ad hoc analytics. We propose a graph-based approach that achieves good recall and performance scaling by hierarchically separating the semantics along spatial, temporal, and domain-specific dimensions. We evaluate our approach on public data, showing that it is suitable for many standard join scenarios and highlighting its limitations.
We propose a zero-shot learning relation classification (ZSLRC) framework that improves on the state of the art through its ability to recognize novel relations that were not present in the training data. The zero-shot learning approach mimics the way humans learn and recognize new concepts with no prior knowledge. To achieve this, ZSLRC uses advanced prototypical networks that are modified to utilize weighted side (auxiliary) information. ZSLRC's side information is built from keywords, hypernyms of named entities, and labels and their synonyms. ZSLRC also includes an automatic hypernym extraction framework that acquires hypernyms of various named entities directly from the web. ZSLRC improves on state-of-the-art few-shot learning relation classification methods that rely on labeled training data and is therefore applicable more widely, even in real-world scenarios where some relations have no corresponding labeled examples for training. We present results from extensive experiments on two public datasets (NYT and FewRel) and show that ZSLRC significantly outperforms state-of-the-art methods on supervised learning, few-shot learning, and zero-shot learning tasks. Our experimental results also demonstrate the effectiveness and robustness of our proposed model.
In competitive search settings such as the Web, many documents' authors (publishers) opt to have their documents highly ranked for some queries. To this end, they modify the documents --- specifically, their content --- in response to induced rankings. Thus, the search engine affects the content in the corpus via its ranking decisions. We present a first study of the ability of search engines to drive pre-defined, targeted, content effects in the corpus using simple techniques. The first is based on the herding phenomenon --- a celebrated result from the economics literature --- and the second is based on biasing the relevance ranking function. The types of content effects we study are either topical or touch on specific document properties --- length and inclusion of query terms. Analysis of ranking competitions we organized between incentivized publishers shows that the types of content effects we target can indeed be attained by applying our suggested techniques. These findings have important implications with regard to the role of search engines in shaping the corpus.
Empirical studies and theoretical models both highlight burstiness as a common temporal pattern in online behavior. A key driver for burstiness is the self-exciting nature of online interactions. For example, posts in online groups often incite posts in response. Such temporal dependencies are easily lost when interaction data is aggregated in snapshots that are subsequently analyzed independently. An alternative is to model individual interactions as a multi-dimensional self-exciting process, thus enforcing both temporal and network dependencies. Point processes, however, are challenging to employ for large real-world datasets as fitting them incurs super-linear cost in the number of events. How can we efficiently detect online groups exhibiting bursty self-exciting temporal behavior in large real-world datasets?
We propose a bursty group detection framework, called MYRON, which explicitly models self-exciting behavior within groups while also accounting for network-wide baseline activity. MYRON imposes bursty temporal structure within a scalable tensor factorization framework to decouple within-group interactions as interpretable factors. Our framework can incorporate different "shapes" of temporal burstiness via wavelet decomposition or kernels for self-exciting behavior. Our evaluation on both synthetic and real-world data demonstrates MYRON's utility in community detection. It is up to 40% more effective in detecting ground-truth groups compared to state-of-the-art baselines. In addition, MYRON is able to uncover interpretable bursty patterns of behavior from user-photo interactions in Flickr.
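The self-exciting behavior modeled here can be illustrated with the intensity of a one-dimensional Hawkes process, λ(t) = μ + Σ_{tᵢ<t} α·e^{−β(t−tᵢ)}; the parameter values below are arbitrary toy choices, and this sketch is not part of the MYRON framework:

```python
import math

def hawkes_intensity(t, events, mu=0.2, alpha=0.8, beta=1.0):
    """Intensity of a 1-D Hawkes process: a constant baseline mu plus an
    exponentially decaying excitation from each past event."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

events = [1.0, 1.1, 1.2, 5.0]          # a burst near t=1, an isolated event at t=5
print(hawkes_intensity(1.3, events))    # elevated right after the burst
print(hawkes_intensity(4.0, events))    # decayed back toward the baseline mu
```

Because each event raises the intensity of further events, bursts beget bursts, which is exactly the temporal signature the framework searches for within groups.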
Machine learning models achieve state-of-the-art performance on many supervised learning tasks. However, prior evidence suggests that these models may learn to rely on "shortcut" biases or spurious correlations (intuitively, correlations that hold in training but not at test time) for good predictive performance. Such models cannot be trusted to provide accurate predictions in deployment environments. While viewing the problem through a causal lens is known to be useful, the seamless integration of causation techniques into machine learning pipelines remains cumbersome and expensive. In this work, we study and extend a causal pre-training debiasing technique called causal bootstrapping (CB) under five practical confounded data generation-acquisition scenarios (with known and unknown confounding). Under these settings, we systematically investigate the effect of confounding bias on deep learning model performance, demonstrating models' propensity to rely on shortcut biases when these biases are not properly accounted for. We demonstrate that such a causal pre-training technique can significantly outperform existing base practices in mitigating confounding bias on real-world domain generalization benchmarking tasks. This systematic investigation underlines the importance of accounting for the underlying data-generating mechanisms and of fortifying data-preprocessing pipelines with a causal framework to develop methods robust to confounding biases.
Paper-publication venue prediction aims to predict candidate publication venues that effectively suit a given submission. This technology is developing rapidly with the popularity of machine learning models. However, most previous methods ignore the structural information of papers, whereas modeling them as graphs naturally overcomes this drawback. Meanwhile, they use either hand-crafted or bag-of-words features to represent papers, ignoring features that involve high-level semantics. Moreover, existing methods assume that the venue where a paper is published is the only correct venue for data annotation, which is unrealistic: one paper can be relevant to many venues. In this paper, we attempt to address these problems and develop a novel prediction model, namely Venue Prediction with Abstract-Level Graph (VPALG), which can serve as an effective decision-making tool for venue selection. Specifically, to achieve more discriminative paper abstract representations, we construct each abstract as a semantic graph and apply a dual-attention message passing neural network for representation learning. The proposed model can then be trained on the learned abstract representations with their labels and generalized via self-training. Empirically, we employ the PubMed dataset and further collect two new datasets from top journals and conferences in computer science. Experimental results indicate the superior performance of VPALG, consistently outperforming existing baseline methods.
Explainable machine learning methods have attracted increased interest in recent years. In this work, we pose and study the niche detection problem, which imposes an explainable lens on the classical problem of co-clustering interactions across two modes. In the niche detection problem, our goal is to identify niches, or co-clusters with node-attribute oriented explanations. Niche detection is applicable to many social content consumption scenarios, where an end goal is to describe and distill high-level insights about user-content associations: not only that certain users like certain types of content, but rather the types of users and content, explained via node attributes. Some examples are an e-commerce platform with who-buys-what interactions and user and product attributes, or a mobile call platform with who-calls-whom interactions and user attributes. Discovering and characterizing niches has powerful implications for user behavior understanding, as well as marketing and targeted content production. Unlike prior works, ours focuses on the intersection of explainable methods and co-clustering. First, we formalize the niche detection problem and discuss preliminaries. Next, we design an end-to-end framework, NED, which operates in two steps: discovering co-clusters of user behaviors based on interaction densities, and explaining them using attributes of involved nodes. Finally, we show experimental results on several public datasets, as well as a large-scale industrial dataset from Snapchat, demonstrating that NED improves in both co-clustering (20% accuracy) and explanation-related objectives (12% average precision) compared to state-of-the-art methods.
Few-shot relation extraction (FSRE) aims to predict the relation for a pair of entities in a sentence by exploiting a few labeled instances for each relation type. Current methods mainly rely on meta-learning to learn generalized representations by optimizing the network parameters over various collections of tasks sampled from training data. However, these methods may suffer from two main issues: 1) insufficient supervision for meta-learning to learn discriminative representations from very few training instances, which are sampled from a large amount of base class data; and 2) spurious correlations between entities and relation types, due to a biased training procedure that focuses more on entity pairs than on context. To learn more discriminative and unbiased representations for FSRE, this paper proposes a two-stage approach via supervised contrastive learning and sentence- and entity-level prototypical networks. In the first (pre-training) stage, we introduce a supervised contrastive pre-training method, which yields more discriminative representations by learning from the entire set of training instances, such that semantically related representations are close to each other and far apart otherwise. In the second (meta-learning) stage, we propose a novel sentence- and entity-level prototypical network equipped with a fine-grained feature-wise fusion strategy to learn unbiased representations, where the networks are initialized with the parameters trained in the first stage. Specifically, the proposed network consists of a sentence branch and an entity branch, taking entire sentences and entity mentions as inputs, respectively. The entity branch explicitly captures the correlation between entity pairs and relations, and then dynamically adjusts the sentence branch's prediction distributions. By doing so, the spurious-correlation issue caused by biased training samples can be properly mitigated.
Extensive experiments on two FSRE benchmarks demonstrate the effectiveness of our approach.
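The supervised contrastive objective used in the pre-training stage follows the general supervised contrastive (SupCon) form: same-relation instances are pulled together and all others pushed apart. A minimal numpy sketch of that generic loss (illustrative, not the authors' implementation; `tau` is the temperature):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.5):
    """Supervised contrastive loss: pull same-class embeddings
    together, push different-class embeddings apart."""
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    n = len(labels)
    sim = z @ z.T / tau
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)            # exclude self-pairs
    # row-wise log-softmax over all other samples
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # average log-probability of positives per anchor
    per_anchor = np.where(pos, log_prob, 0.0).sum(1) / pos.sum(1)
    return -per_anchor.mean()
```

Lower values indicate that same-class embeddings sit closer together than different-class ones.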
Conventional deep learning-based Relation Classification (RC) methods heavily rely on large-scale training datasets and fail to generalize to unseen classes when training data is scant. This work concentrates on RC in few-shot scenarios, in which models classify unlabelled samples given only a few labeled samples. Existing few-shot RC models treat the dataset as a series of individual instances and have not fully utilized the interaction information among them. Interaction information helps indicate important areas and produce discriminative representations. This paper therefore proposes a novel interactive attention network (IAN) that uses inter-instance and intra-instance interaction information to classify relations. Inter-instance interaction information is first introduced to alleviate the low-resource problem by capturing the semantic relevance between an instance pair. Intra-instance interaction information is then introduced to address the ambiguous-relation classification issue by extracting the entity information within an instance. Extensive numerical experiments demonstrate that the proposed method improves the accuracy of downstream tasks.
Outcome estimation of treatments for individual targets is a crucial foundation for decision making based on causal relations. Most existing outcome estimation methods deal with binary or multiple-choice treatments; however, in some applications the number of interventions can be very large, while the treatments themselves carry rich information. In this study, we consider one important instance of such cases: the outcome estimation problem for graph-structured treatments such as drugs. Due to the large number of possible interventions, the counterfactual nature of observational data, which already appears in conventional treatment effect estimation, becomes an even more serious issue in this problem. Our proposed method GraphITE (pronounced 'graphite') obtains representations of the graph-structured treatments using graph neural networks, and mitigates observation biases by using an HSIC regularization that increases the independence between the representations of the targets and the treatments. In contrast with existing methods, which cannot deal with "zero-shot" treatments that are not included in the observational data, GraphITE can efficiently handle them thanks to its capability of incorporating graph-structured treatments. Experiments on two real-world datasets show that GraphITE outperforms baselines, especially in cases with a large number of treatments.
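The HSIC regularization mentioned above penalizes statistical dependence between two representations. A compact numpy sketch of the standard (biased) HSIC estimator with RBF kernels, as one might add it to a loss; the bandwidth `sigma` is an illustrative choice:

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    # pairwise squared distances -> Gaussian kernel matrix
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased HSIC estimator: (1/(n-1)^2) * tr(K H L H),
    where H is the centering matrix. Near zero iff x and y
    are (approximately) independent."""
    n = len(x)
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Adding `lambda * hsic(target_repr, treatment_repr)` to a training loss pushes the two representations toward independence, which is the role the regularizer plays in GraphITE.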
Representation learning has always played an important role in information retrieval (IR) systems. Most retrieval models, including recent neural approaches, use representations to calculate similarities between queries and documents to find relevant information from a corpus. Recent models use large-scale pre-trained language models for query representation. The typical use of these models, however, has a major limitation in that they generate only a single representation for a query, which may have multiple intents or facets. The focus of this paper is to address this limitation by considering neural models that support multiple intent representations for each query. Specifically, we propose the NMIR (Neural Multiple Intent Representations) model that can generate semantically different query intents and their appropriate representations. We evaluate our model on query facet generation using a large-scale dataset of real user queries sampled from the Bing search logs. We also provide an extrinsic evaluation of the proposed model using a clarifying question selection task. The results show that NMIR significantly outperforms competitive baselines.
Decision-making often requires accurate estimation of treatment effects from observational data. This is challenging as outcomes of alternative decisions are not observed and have to be estimated. Previous methods estimate outcomes based on unconfoundedness but neglect any constraints that unconfoundedness imposes on the outcomes. In this paper, we propose a novel regularization framework for estimating average treatment effects that exploits unconfoundedness. To this end, we formalize unconfoundedness as an orthogonality constraint, which ensures that the outcomes are orthogonal to the treatment assignment. This orthogonality constraint is then included in the loss function via a regularization. Based on our regularization framework, we develop deep orthogonal networks for unconfounded treatments (DONUT), which learn outcomes that are orthogonal to the treatment assignment. Using a variety of benchmark datasets for estimating average treatment effects, we demonstrate that DONUT outperforms the state-of-the-art substantially.
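The orthogonality constraint can be illustrated as a single penalty term: the squared sample covariance between the treatment residual and the outcome residual. A toy sketch of the idea (not the actual DONUT loss or architecture; `propensity` denotes an assumed estimated propensity score):

```python
import numpy as np

def orthogonality_penalty(y, y_hat, t, propensity):
    """Squared sample covariance between the treatment residual
    (t - propensity) and the outcome residual (y - y_hat); it is
    zero when fitted outcomes are 'orthogonal' to the assignment."""
    psi = np.mean((t - propensity) * (y - y_hat))
    return psi ** 2
```

In a regularization framework like DONUT's, a term of this shape would be added to the usual outcome-prediction loss with a weighting coefficient.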
Advertising is critical to many online e-commerce platforms such as eBay and Amazon. One of the important signals these platforms rely upon is click-through rate (CTR) prediction. The recent popularity of multi-modal sharing platforms such as TikTok has led to increased interest in online micro-videos. It is therefore useful to consider micro-videos to help a merchant better target micro-video advertising and find users' favourites to enhance the user experience. Existing works on CTR prediction largely exploit unimodal content to learn item representations; relatively little effort has been made to leverage multi-modal information exchange between users and items. We propose a model that exploits temporal user-item interactions to guide representation learning with multi-modal features and then predicts the user click rate on a micro-video item. We design a Hypergraph Click-Through Rate prediction framework (HyperCTR) built upon the hyperedge notion of hypergraph neural networks, which can yield modal-specific representations of users and micro-videos to better capture user preferences. We construct a time-aware user-item bipartite network with multi-modal information and enrich the representation of each user and item with the generated interest-based user hypergraph and item hypergraph. Through extensive experiments on three public datasets, we demonstrate that our proposed model significantly outperforms various state-of-the-art methods.
Stock trend prediction plays a crucial role in quantitative investing. Given a prediction task at a certain granularity (e.g., daily trend), a large portion of existing studies merely leverage market data of the same granularity (e.g., daily market data). In financial investment scenarios, however, there exists a wealth of finer-grained information (e.g., high-frequency data) that contains more detailed investment signals beyond the original-granularity data. This motivates us to investigate how to leverage multi-granularity market data to enhance the accuracy of stock trend prediction. Straightforward methods, such as concatenating finer-grained data as features or fusing them with a model based on finer-grained features, may not lead to more precise stock trend prediction due to some unique challenges. First, the granularity inconsistency between the target trend and finer-grained data can substantially increase optimization difficulty, for example through the relative sparsity of the target trend compared with the higher dimensionality of finer-grained features. Moreover, the continuously changing financial market state can result in varying efficacy of heterogeneous multi-granularity information, which consequently requires a dynamic approach for properly fusing them. In this paper, we propose the Contrastive Multi-Granularity Learning Framework (CMLF) to address these challenges. In particular, we first design two novel contrastive learning objectives at the pre-training stage to address the inconsistency issue by constructing additional self-supervised signals based on the inherent character of stock data. We also design a gate mechanism based on market-aware technical indicators to adaptively fuse the multi-granularity features at each time step. Extensive experiments on three real-world datasets show significant improvements of our approach over state-of-the-art baselines on stock trend prediction and on profitability in real investing scenarios.
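The market-aware gate mechanism described above can be sketched as a scalar gate, computed from technical indicators at each time step, that mixes coarse- and fine-grained features. A toy numpy version (names and shapes are illustrative, not the CMLF implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def market_aware_fusion(coarse, fine, indicators, w, b=0.0):
    """Mix coarse- and fine-grained features with a per-time-step
    gate driven by technical indicators.
    coarse, fine: (T, d) feature sequences; indicators: (T, m)."""
    g = sigmoid(indicators @ w + b)          # (T,) gate in (0, 1)
    return g[:, None] * fine + (1 - g)[:, None] * coarse
```

When the gate saturates toward 1 the fused features follow the fine-grained signal; toward 0 they fall back to the coarse-grained one.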
Hard interaction learning between source sequences and their next targets, which arises in a myriad of sequential prediction tasks, is challenging. During training, most existing methods focus on explicit hard interactions caused by wrong responses. However, a model might produce correct responses by capturing only a subset of the learnable patterns, which results in implicit hard interactions with the unlearned patterns. As such, its generalization performance is weakened. The problem becomes more serious in sequential prediction due to interference from a substantial number of similar candidate targets.
To this end, we propose a Hardness Aware Interaction Learning framework (HAIL) that mainly consists of two base sequential learning networks and mutual exclusivity distillation (MED). The base networks are initialized differently to learn distinctive view patterns and thus gain different training experiences. These experiences, in the form of the unlikelihood of correct responses, are exchanged through MED, which provides mutual-exclusivity knowledge for identifying implicit hard interactions. Moreover, we deduce that the unlikelihood term essentially introduces additional gradients that push the learning of the patterns behind correct responses. Our framework can easily be extended to more peer base networks. Evaluation is conducted on four datasets covering cyber and physical spaces. The experimental results demonstrate that our framework outperforms several state-of-the-art methods in terms of top-k based metrics.
Graph Convolutional Networks (GCNs) are powerful representation learning methods for non-Euclidean data. Compared with Euclidean data, labeling non-Euclidean data is more expensive. Meanwhile, most existing GCNs utilize only the few labeled data and ignore most of the unlabeled data. To address this issue, we design a novel end-to-end Iterative Feature Clustering Graph Convolutional Networks (IFC-GCN) that enhances the standard GCN with an Iterative Feature Clustering (IFC) module. The proposed IFC module constrains node features iteratively based on predicted pseudo labels and feature clustering. Further, we design an EM-like framework for IFC-GCN training, which improves network performance by alternately rectifying the pseudo labels and the node features. Theoretical analysis and experimental results show that the proposed IFC module can effectively modify the node features. Experimental results on public datasets demonstrate that IFC-GCN outperforms state-of-the-art methods on the semi-supervised node classification task.
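The IFC idea — derive pseudo-labels by clustering node features, then rectify the features toward their cluster centroids — can be sketched independently of the GCN backbone. A simplified, deterministic toy version (the initialization scheme and mixing weight `mix` are illustrative choices, not the paper's):

```python
import numpy as np

def feature_clustering_step(X, k, iters=10, mix=0.5):
    """One simplified IFC-style step: k-means pseudo-labels, then
    pull each node's features toward its cluster centroid.
    Deterministic init: evenly spaced rows of X as initial centers."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                    # pseudo-labels
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    X_rect = (1 - mix) * X + mix * centers[labels]   # rectification
    return X_rect, labels
```

In an EM-like training loop, a step like this would alternate with GCN updates that refine both the features and the pseudo-labels.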
The cross-device user matching task is to identify the behavior logs (i.e., behavior sequences) on multiple devices that belong to one real person. Due to their anonymous and long-term properties, most previous methods for learning behavior embeddings cannot effectively capture two important features of the sequences, namely high-order connections and long-range dependencies. To this end, we propose a novel framework called Two-tier Graph Contextual Embedding (TGCE) to solve both problems simultaneously. In the first tier, we construct behavior evolutionary graphs (BEGs) for behavior sequences and design an order-preserving neighbor aggregation network to collectively model transitions of behaviors with their neighbors. As repeated behaviors can be grouped into single nodes, our model jointly encodes the neighboring environments around behaviors in a collective way, enriching the behavior embeddings. In the second tier, we further build scaled shortcut graphs (SSGs) by refining BEGs with random-walk-based edge addition; a position-aware graph attention network is then imposed on the SSGs to facilitate fast information propagation. As distant graph nodes can be directly connected by shortcut edges, we can further capture long-range dependencies. By stacking the two graph tiers, our approach obtains graph contextual embeddings for behaviors to further improve user matching. Experimental results on the benchmark dataset show that our model outperforms various baselines on the user matching task. Our code is released at https://github.com/13061051/TGCE_2021.
Signed networks are social networks with both positive and negative links. Many theories and algorithms have been developed to model such networks (e.g., balance theory). However, previous work mainly focuses on unipartite signed networks, where all nodes have the same type. Signed bipartite networks differ from classical signed networks in that they contain two different node sets and signed links between the two sets. Signed bipartite networks are commonly found in many fields, including business, politics, and academia, but have been less studied. In this work, we first define the signed relationships within the same set of nodes, providing a new perspective for analyzing signed bipartite networks. We then conduct a comprehensive analysis of balance theory from two perspectives on several real-world datasets. Notably, in a peer-review dataset, we find that the ratio of balanced isomorphisms in signed bipartite networks increased after the rebuttal phase. Guided by these two perspectives, we propose novel Signed Bipartite Graph Neural Networks (SBGNNs) to learn node embeddings for signed bipartite networks. SBGNNs follow the message-passing scheme of most GNNs, but we design new message, aggregation, and update functions for signed bipartite networks. We validate the effectiveness of our model on four real-world datasets on the link sign prediction task, the main machine learning task for signed networks. Experimental results show that our SBGNN model achieves significant improvements over strong baseline methods, including feature-based and network embedding methods.
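A sign-aware message-passing layer can be illustrated by aggregating positively and negatively linked neighbors separately before combining them. A toy numpy sketch (a unipartite simplification, not the bipartite SBGNN architecture):

```python
import numpy as np

def signed_aggregate(H, A_pos, A_neg):
    """One sign-aware layer: mean-aggregate positive and negative
    neighbors separately, then combine with the node's own features.
    H: (n, d) node features; A_pos, A_neg: (n, n) 0/1 adjacencies."""
    deg_p = np.maximum(A_pos.sum(1, keepdims=True), 1)
    deg_n = np.maximum(A_neg.sum(1, keepdims=True), 1)
    m_pos = A_pos @ H / deg_p        # messages from friends
    m_neg = A_neg @ H / deg_n        # messages from foes
    # negate the negative-neighbor message, in the spirit of balance theory
    return np.tanh(np.concatenate([H, m_pos, -m_neg], axis=1))
```

The design choice here is the separate aggregation paths for the two link signs; the SBGNN paper additionally specializes the message, aggregation, and update functions to the bipartite setting.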
The arrival of 5G networks has greatly promoted the growth of content delivery services (CDSs). Understanding and predicting the spatio-temporal distribution of CDSs benefits mobile users, Internet content providers, and carriers. Conventional methods for predicting the spatio-temporal distribution of CDSs are mostly base-station (BS) centric, leading to weak generalization and coarse spatial granularity. To improve the spatial accuracy and generalization of modeling, we propose user-centric methods for CDS spatio-temporal analysis. With geocoding and spatio-temporal graph modeling algorithms, CDS records collected from mobile devices are modeled as dynamic graphs with spatio-temporal attributes. Moreover, we propose a spatio-temporal-social multi-feature extraction framework for spatially fine-grained CDS hot-spot prediction. Specifically, an edge-enhanced graph convolutional block is designed to encode CDS information based on social relations and spatial dependence features. Besides, we introduce Long Short-Term Memory (LSTM) to further capture the temporal dependence. Experiments on two real-world CDS datasets verify the effectiveness of the proposed framework, and ablation studies evaluate the importance of each feature.
Pre-trained contextualized representations offer great success for many downstream tasks, including document ranking. The multilingual versions of such pre-trained representations provide the possibility of jointly learning many languages with the same model. Although such joint training is expected to yield large gains, in the case of cross-lingual information retrieval (CLIR), models under a multilingual setting do not achieve the same level of performance as those under a monolingual setting. We hypothesize that the performance drop is due to the translation gap between queries and documents. In the monolingual retrieval task, because of the shared lexical inputs, it is easier for the model to identify the query terms that occur in documents. However, in multilingual pre-trained models, where words in different languages are projected into the same hyperspace, the model tends to "translate" query terms into related terms - i.e., terms that appear in a similar context - in addition to, or sometimes rather than, synonyms in the target language. This property makes it difficult for the model to connect terms that co-occur in both query and document. To address this issue, we propose a novel Mixed Attention Transformer (MAT) that incorporates external word-level knowledge, such as a dictionary or translation table. We design a sandwich-like architecture to embed MAT into recent transformer-based deep neural models. By encoding the translation knowledge into an attention matrix, a model with MAT is able to focus on mutually translated words in the input sequence. Experimental results demonstrate the effectiveness of the external knowledge and the significant improvement of the MAT-embedded neural reranking model on the CLIR task.
Conversation generation, a challenging task in Natural Language Generation (NLG), has attracted increasing attention in recent years. A number of recent works adopted sequence-to-sequence structures along with external knowledge, successfully enhancing the quality of generated conversations. Nevertheless, few works have utilized knowledge extracted from similar conversations for utterance generation. Taking conversations in the customer service and court debate domains as examples, it is evident that essential entities/phrases, as well as their associated logic and inter-relationships, can be extracted and borrowed from similar conversation instances. Such information can provide useful signals for improving conversation generation. In this paper, we propose a novel reading-and-memory framework called Deep Reading Memory Network (DRMN), which is capable of remembering useful information from similar conversations to improve utterance generation. We apply our model to two large-scale conversation datasets from the justice and e-commerce fields. Experiments show that the proposed model outperforms state-of-the-art approaches.
Recommender systems play a crucial role in modern e-commerce platforms. Due to the lack of historical interactions between users and items, cold-start recommendation is a challenging problem. To alleviate the cold-start issue, most existing methods introduce content and contextual information as auxiliary information. Nevertheless, these methods assume the recommended items behave steadily over time, while in a typical e-commerce scenario, items generally perform very differently throughout their life period. In such a situation, it would be beneficial to consider the long-term return from the item perspective, which is usually ignored in conventional methods. Reinforcement learning (RL) naturally fits such a long-term optimization problem, in which the recommender can identify high-potential items and proactively allocate more user impressions to boost their growth, thereby improving the multi-period cumulative gains. Inspired by this idea, we model the process as a Partially Observable and Controllable Markov Decision Process (POC-MDP) and propose an actor-critic RL framework (RL-LTV) to incorporate item lifetime values (LTV) into the recommendation. In RL-LTV, the critic studies historical trajectories of items and predicts the future LTV of fresh items, while the actor suggests a score-based policy that maximizes the expected future LTV. Scores suggested by the actor are then combined with classical ranking scores in a dual-rank framework, so that the recommendation is balanced with the LTV consideration. Our method outperforms a strong live baseline with relative improvements of 8.67% and 18.03% on IPV and GMV of cold-start items, on one of the largest e-commerce platforms.
Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class of practical importance, but have not received much attention in research. This work presents EXAQT, the first end-to-end system for answering complex temporal questions that have multiple entities and predicates, and associated temporal conditions. EXAQT answers natural language questions over KGs in two stages, one geared towards high recall, the other towards precision at top ranks. The first step computes question-relevant compact subgraphs within the KG, and judiciously enhances them with pertinent temporal facts, using Group Steiner Trees and fine-tuned BERT models. The second step constructs relational graph convolutional networks (R-GCNs) from the first step's output, and enhances the R-GCNs with time-aware entity embeddings and attention over temporal relations. We evaluate EXAQT on TimeQuestions, a large dataset of 16k temporal questions we compiled from a variety of general purpose KG-QA benchmarks. Results show that EXAQT outperforms three state-of-the-art systems for answering complex questions over KGs, thereby justifying specialized treatment of temporal QA.
While graph neural networks (GNNs) emerge as the state-of-the-art representation learning methods on graphs, they often require a large amount of labeled data to achieve satisfactory performance, which is often expensive or unavailable. To relieve the label scarcity issue, some pre-training strategies have been devised for GNNs, to learn transferable knowledge from the universal structural properties of the graph. However, existing pre-training strategies are only designed for homogeneous graphs, in which each node and edge belongs to the same type. In contrast, a heterogeneous graph embodies rich semantics, as multiple types of nodes interact with each other via different kinds of edges, which are neglected by existing strategies. In this paper, we propose a novel Contrastive Pre-Training strategy of GNNs on Heterogeneous Graphs (CPT-HG), to capture both the semantic and structural properties in a self-supervised manner. Specifically, we design semantic-aware pre-training tasks at both the relation- and subgraph-levels, and further enhance their representativeness by employing contrastive learning. We conduct extensive experiments on three real-world heterogeneous graphs, and promising results demonstrate the superior ability of our CPT-HG to transfer knowledge to various downstream tasks via pre-training.
Graph neural networks (GNNs) have received tremendous attention due to their power in learning effective representations for graphs. Most GNNs follow a message-passing scheme where node representations are updated by aggregating and transforming information from the neighborhood. Meanwhile, they adopt the same strategy for aggregating information from different feature dimensions. However, as suggested by social dimension theory and spectral embedding, there are potential benefits to treating the dimensions differently during the aggregation process. In this work, we investigate how to enable heterogeneous contributions of feature dimensions in GNNs. In particular, we propose a general graph feature gating network (GFGN) based on the graph signal denoising problem, and correspondingly introduce three graph filters under GFGN to allow different levels of contributions from feature dimensions. Extensive experiments on various real-world datasets demonstrate the effectiveness and robustness of the proposed frameworks.
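The dimension-wise gating idea can be sketched as a per-node, per-dimension gate that modulates how much of the neighborhood aggregate enters each feature dimension. A toy numpy version (illustrative, not one of the three GFGN filters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_aggregate(H, A, W_gate):
    """One gated aggregation step.
    H: (n, d) node features; A: (n, n) adjacency; W_gate: (d, d)."""
    deg = np.maximum(A.sum(1, keepdims=True), 1)
    neigh = A @ H / deg                  # mean neighborhood aggregation
    g = sigmoid(H @ W_gate)              # per-node, per-dimension gate
    return g * neigh + (1 - g) * H       # gated convex combination
```

Because the gate lies in (0, 1) elementwise, each output feature is a convex combination of the node's own value and its neighborhood aggregate, with a different mixing weight per dimension.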
Recent advances in Deep Neural Networks (DNNs) have dramatically improved the accuracy of DNN inference but also introduce larger latency. In this paper, we investigate how to utilize early exit, a method that allows inference to exit at earlier exit points at the cost of an acceptable amount of accuracy. Scheduling the optimal exit point on a per-instance basis is challenging because the realized performance (i.e., confidence and latency) of each exit point is random and its statistics vary across scenarios. Moreover, the performance has dependencies among the exit points, further complicating the problem. Therefore, the optimal exit scheduling decision cannot be known in advance but must be learned in an online fashion. To this end, we propose Dynamic Early Exit (DEE), a real-time online learning algorithm based on contextual bandit analysis. DEE observes the performance at each exit point as context and decides whether to exit or keep processing. Unlike standard contextual bandit analyses, the rewards of the decisions in our problem are temporally dependent. Furthermore, the performance of earlier exit points is inevitably explored more than that of later ones, which poses an unbalanced exploration-exploitation trade-off. DEE addresses these challenges, and its regret per inference asymptotically approaches zero. We compare DEE with four benchmark schemes in real-world experiments. The results show that DEE can improve overall performance by up to 98.1% compared to the best benchmark scheme.
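The exit mechanics can be sketched with a fixed confidence threshold; DEE's contribution is precisely to replace such a fixed rule with an online-learned, context-dependent decision. A toy sketch (all names are illustrative):

```python
def early_exit_inference(exit_probs, exit_latency, threshold=0.8):
    """Walk the exit points in order and stop at the first whose
    top-class confidence clears the threshold (always exiting at
    the last point). Returns (prediction, exit index, latency).
    exit_probs: per-exit class-probability lists;
    exit_latency: per-exit incremental latency."""
    total_latency = 0.0
    for k, (probs, lat) in enumerate(zip(exit_probs, exit_latency)):
        total_latency += lat
        conf = max(probs)
        if conf >= threshold or k == len(exit_probs) - 1:
            return probs.index(conf), k, total_latency
```

A fixed threshold ignores the temporal dependence and unbalanced exploration that the abstract describes; a bandit-based scheduler would instead choose adaptively whether continuing to the next exit is worth its extra latency.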
Unsupervised domain adaptation is the problem of transferring knowledge extracted from a labeled source domain to an unlabeled target domain. To achieve discriminative domain adaptation, recent studies take advantage of target-sample pseudo-labels to impose class-aware distribution alignment across the source and target domains. Still, they have shortcomings, such as making decisions based on inaccurate pseudo-labeled samples that mislead the adaptation process. In this paper, we propose a progressive deep feature alignment method, called Norma, to tackle class-aware unsupervised domain adaptation for image classification by enforcing intra-class compactness and inter-class discrepancy through a hybrid learning process. To this end, Norma's optimization process is defined by a novel triplet loss that not only addresses soft prototype alignment but also pushes away multiple negative centroids. Also, to extract maximally discriminative domain knowledge per iteration, we propose a joint positive and negative learning procedure along with uncertainty-guided progressive pseudo-labeling based on prototype-based clustering and conditional probability. Our experimental results on several benchmarks demonstrate that Norma outperforms the state-of-the-art methods.
Determining the semantic concepts of columns in tabular data is useful for many applications, ranging from data integration, cleaning, and search to feature engineering and model building in machine learning. Several prior works have proposed supervised learning-based or heuristic-based approaches to semantic type annotation. These techniques suffer from poor generalizability over a large number of concepts or examples. Recent neural network-based supervised learning methods generalize to different datasets but require large amounts of curated training data and also present scalability issues. Furthermore, none of the known methods works well for numerical data. We present C2, a system that maps each column to a concept through a maximum likelihood estimation approach based on ensembles. It is able to effectively utilize vast amounts of, albeit somewhat noisy, openly available table corpora, in addition to two popular knowledge graphs (Wikidata and DBpedia), to perform effective and efficient concept annotation for tabular data. Specifically, we utilize a collection of 32 million openly available web tables from several sources. We also present efficient indexing techniques for categorical string, numeric, and mixed-type data, and novel techniques for utilizing table context. We demonstrate the effectiveness and efficiency of C2 over available techniques on 9 real-world datasets containing a wide variety of concepts.
Query reformulation (QR) is a key technique for overcoming the lexical chasm problem in information retrieval (IR) systems. In particular, when searching for jargon, people tend to use descriptive queries, such as "a medical examination of the colon" rather than "colonoscopy," or they often use the two interchangeably. Thus, transforming users' descriptive queries into appropriate jargon queries helps retrieve more relevant documents. In this paper, we propose a new graph-based QR system that uses a dictionary and does not require human-labeled data. Given a descriptive query, our system predicts the corresponding jargon word over a graph consisting of pairs of a headword and its description in the dictionary. First, we train a graph neural network to represent the relational properties between words and to infer a jargon word using compositional information from the descriptive query's words. Moreover, we propose a graph search model that finds the target node in real time using the relevance scores of neighboring nodes. By adding this fast graph search model to the front of the proposed system, we reduce the reformulation time significantly. Experimental results on two datasets show that the proposed method can effectively reformulate descriptive queries into the corresponding jargon words as well as improve retrieval performance under several search frameworks.
To speed up the training of massive deep neural network (DNN) models, distributed training has been widely studied. In general, centralized training, one type of distributed training, suffers from the communication bottleneck between the parameter server (PS) and workers. On the other hand, decentralized training suffers from increased parameter variance among workers, which causes slower model convergence. Addressing this dilemma, in this work, we propose a novel centralized training algorithm, ALADDIN, employing "asymmetric" communication between the PS and workers to relieve the PS bottleneck, and novel updating strategies for both local and global parameters to mitigate the increased-variance problem. Through a convergence analysis, we show that the convergence rate of ALADDIN is O(1/√(nk)) on non-convex problems, where n is the number of workers and k is the number of training iterations. The empirical evaluation using ResNet-50 and VGG-16 models demonstrates that (1) ALADDIN shows significantly better training throughput, with up to 191% and 34% improvements over a synchronous algorithm and a state-of-the-art decentralized algorithm, respectively, (2) models trained by ALADDIN converge to accuracies comparable to those of the synchronous algorithm within the shortest time, and (3) the convergence of ALADDIN is robust under various heterogeneous environments.
The notion of word embedding plays a fundamental role in natural language processing (NLP). However, pre-training word embeddings for a very large-scale vocabulary is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus (and their mutual information with words), one can construct high-quality word embeddings with negligible errors. Mutual information between contexts and words can be encoded canonically as a sampling state; thus, Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective WEQ method, which is capable of extracting word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11 to 13 times faster than well-established methods. By comparing with well-known methods such as matrix factorization, word2vec, GloVe, and fastText, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks while maintaining run-time and resource advantages over all these baselines.
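For context, the classical matrix-factorization view of word embeddings that methods like WEQ accelerate: factorize a positive-PMI word-context matrix with truncated SVD. A toy numpy sketch (standard PPMI-SVD, not the WEQ algorithm itself):

```python
import numpy as np

def ppmi_svd_embeddings(cooc, dim):
    """Word embeddings from a word-context co-occurrence matrix via
    positive pointwise mutual information and truncated SVD."""
    total = cooc.sum()
    pw = cooc.sum(1, keepdims=True) / total      # word marginals
    pc = cooc.sum(0, keepdims=True) / total      # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (pw * pc))
    ppmi = np.maximum(pmi, 0)                    # clip negatives (and -inf)
    ppmi[~np.isfinite(ppmi)] = 0.0
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    # scale singular vectors by sqrt of singular values, a common choice
    return U[:, :dim] * np.sqrt(S[:dim])
```

Words with similar context distributions end up with similar embedding rows, which is the property the downstream-task comparisons in the abstract rely on.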
Recently, early exit networks, which dynamically adjust model complexity at inference time, have achieved remarkable performance and efficiency across various applications. So far, many researchers have focused on reducing the redundancy of the input sample or the model architecture. However, they have not resolved the performance drop of early classifiers, which make predictions with insufficient high-level feature information. Consequently, the performance degradation of early classifiers has a devastating effect on the entire network sharing the backbone. Thus, in this paper, we propose an Efficient Multi-Scale Feature Generation Adaptive Network (EMGNet), which not only reduces the redundancy of the architecture but also generates multi-scale features to improve the performance of the early exit network. Our approach renders multi-scale feature generation highly efficient by sharing weights in the center of the convolution kernel. Also, our gating network effectively learns to automatically determine the proper multi-scale feature ratio required for each convolution layer at different locations of the network. We demonstrate that our proposed model outperforms the state-of-the-art adaptive networks on CIFAR10, CIFAR100, and ImageNet datasets. The implementation code is available at https://github.com/lee-gwang/EMGNet
Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications. Most stopping rules for one-phase TAR workflows lack valid statistical guarantees, which has discouraged their use in some legal contexts. Drawing on the theory of quantile estimation, we provide the first broadly applicable and statistically valid sample-based stopping rules for one-phase TAR. We further show theoretically and empirically that overshooting a recall target, which has been treated as innocuous or desirable in past evaluations of stopping rules, is a major source of excess cost in one-phase TAR workflows. Counterintuitively, incurring a larger sampling cost to reduce excess recall leads to lower total cost in almost all scenarios.
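The paper's stopping rules are built on quantile estimation; the sketch below instead uses a simpler normal-approximation lower confidence bound on sample-estimated recall, purely to convey the shape of a sample-based rule, and is not the authors' procedure:

```python
import math

def should_stop(found_in_sample, relevant_in_sample, target=0.8, z=1.645):
    """Stop a one-phase review when a one-sided (approx. 95%) lower
    confidence bound on recall, estimated from a labelled random
    sample of documents, reaches the recall target.
    found_in_sample: sampled relevant docs already retrieved by TAR;
    relevant_in_sample: all relevant docs in the sample."""
    if relevant_in_sample == 0:
        return False
    p = found_in_sample / relevant_in_sample        # sample recall estimate
    se = math.sqrt(p * (1.0 - p) / relevant_in_sample)
    return p - z * se >= target
```

The paper's point about overshoot applies here too: a larger sample tightens the bound, so review can stop closer to the target instead of far past it.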
Ambiguous Label Learning (ALL), an emerging paradigm of weakly supervised learning, aims to induce a prediction model from training data with ambiguous supervision, where each training instance is annotated with a set of candidate labels, only one of which is valid. To handle this task, existing shallow methods mainly disambiguate the candidate labels by leveraging various regularization techniques. Inspired by the great success of deep generative adversarial networks, we apply adversarial learning to perform effective candidate label disambiguation from a new instance-pivoted perspective. Specifically, for each ALL instance, we recombine its feature representation with each of its candidate labels to generate a set of candidate instances, of which only one is real and all others are fake. We formulate a unified adversarial objective with respect to three players: a discriminator, a generator, and a classifier. The discriminator is used to detect the fake candidate instances, so that the classifier can be trained without them. With this insight, we develop a novel ALL method, Adversarial Ambiguous Label Learning with Candidate Instance Detection (A2L2CID). Theoretically, we show that there is a global equilibrium point among the three players. Empirically, extensive experimental results indicate that A2L2CID outperforms state-of-the-art ALL methods.
Machine learning techniques have shown promise in predicting clinical deterioration of hospitalized patients based on electronic health records (EHRs). However, building accurate early warning systems (EWS) remains challenging in practice. EHRs are heterogeneous, comprising both static and time-series data. Moreover, missing values are prevalent in both kinds of data, and the missingness of certain data can itself be correlated with clinical outcomes. This paper proposes a novel approach for integrating static and time-series clinical data in deep recurrent models through multi-modal fusion. Furthermore, we exploit the correlation between static and time-series data through cross-modal imputation in an integrated recurrent model. We apply the proposed approaches to a dataset extracted from the EHRs of 20,700 hospitalizations of adult oncology patients in a research hospital. The experiments demonstrate that the proposed approaches outperform state-of-the-art models in predictive accuracy when generating early warnings for clinical deterioration. A case study further establishes the efficacy of the predictive model for early warning systems under realistic clinical settings.
Graph Neural Networks (GNNs) have achieved great success in downstream applications due to their ability to learn node representations. However, in many applications graphs are not static: they evolve through changes such as adjustments to node attributes or graph structure, which require node representations to be updated accordingly. It is non-trivial to apply current GNNs to update node representations in a scalable manner. Recent research proposes two types of solutions. The first, sampling neighbors for the influenced nodes, requires expensive processing for each node. The second, reducing repeated computation by merging shared neighbors, cannot speed up the updating process when the influenced nodes do not share neighbors. Most importantly, both solutions ignore the hidden representations computed at previous time steps, which can be reused to accelerate representation updating. In this paper, we propose a general cache-based GNN system to accelerate representation updating. Specifically, we cache a set of hidden representations computed at previous time steps and reuse them at the next step. To identify valuable hidden representations, we first estimate the number of hidden representations and their combinations that can be reused. Second, we formulate the k-assembler problem, which selects k representations to maximize the time saved in the next updating pass. Experiments on three real-world graphs show that the cache-based GNN system significantly speeds up representation updating for various GNNs.
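A toy version of the caching idea, assuming a single mean-aggregation GCN layer (the paper's k-assembler selection is omitted): cache the layer outputs, and on an attribute change recompute only the nodes whose neighbourhood contains a changed node.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """Mean-aggregate neighbours (no self-loops), project, ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.maximum(((adj @ h) / np.maximum(deg, 1.0)) @ w, 0.0)

def update_with_cache(adj, x, w, cache, changed):
    """Reuse cached layer outputs; recompute only nodes affected by
    the changed node attributes (the changed nodes' in-neighbours)."""
    out = cache.copy()
    affected = set(changed)
    for v in changed:
        affected |= set(np.nonzero(adj[:, v])[0])   # nodes that aggregate v
    for v in affected:
        neigh = np.nonzero(adj[v])[0]
        agg = x[neigh].mean(axis=0) if len(neigh) else np.zeros(x.shape[1])
        out[v] = np.maximum(agg @ w, 0.0)
    return out
```

With a localized change, only a small frontier of nodes is touched instead of the whole graph.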
In this paper, we study the privacy-preserving task assignment problem in spatial crowdsourcing, where the locations of both workers and tasks, prior to their release to the server, are perturbed with Geo-Indistinguishability (a differential privacy notion for location-based systems). Different from the previously studied online setting, where each task is assigned immediately upon arrival, we target the batch-based setting, where the server maximizes the number of successfully assigned tasks after a batch of tasks arrives. To achieve this goal, we propose the k-Switch solution, which first divides the workers into small groups based on the perturbed worker-task distances, and then uses Homomorphic Encryption (HE) based secure computation to enhance the task assignment. Furthermore, we expedite HE-based computation by limiting the size of each group to at most k. Extensive experiments demonstrate that, in terms of the number of successfully assigned tasks, k-Switch improves on batch-based baselines by 5.9X and on the existing online solution by 1.74X, without privacy leakage.
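Geo-Indistinguishability is typically achieved with the planar Laplace mechanism: perturb the location by a uniformly random angle and a radius drawn from a Gamma(2, 1/ε) distribution, which yields density proportional to e^(-ε·distance). A sketch (the metres-to-degrees conversion is a rough small-area approximation):

```python
import numpy as np

def geo_indistinguishable(lat, lon, eps, rng):
    """Planar Laplace perturbation of a (lat, lon) point.
    eps has units of 1/metre: smaller eps means more noise."""
    theta = rng.uniform(0.0, 2.0 * np.pi)           # uniform direction
    r = rng.gamma(shape=2.0, scale=1.0 / eps)       # planar Laplace radius
    dlat = r * np.cos(theta) / 111_320.0            # metres per degree latitude
    dlon = r * np.sin(theta) / (111_320.0 * np.cos(np.radians(lat)))
    return lat + dlat, lon + dlon
```

The server then works only with these perturbed locations, which is why k-Switch needs HE-based computation to refine assignments.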
The discovery and prediction of block access patterns in hybrid storage systems is of crucial importance for effective tier management. Existing methods are usually based on heuristics and cannot handle complex patterns. This work introduces transformers to block access pattern prediction. We note that block accesses in tier management systems are aggregated temporally and spatially into multivariate time series of block access frequency, so the runtime requirements are relaxed, making complex models applicable for deployment. However, the enormous number of rarely accessed blocks in storage systems, combined with the structure of traditional transformer models, would result in millions of redundant parameters and make such models impractical to deploy. We therefore incorporate Tensor-Train Decomposition (TTD) into the transformer and propose the Compressed Full Tensor Transformer (CFTT), in which all linear layers of the vanilla transformer are replaced with tensor-train layers. Weights of the input and output layers are shared to further reduce parameters and implicitly reuse knowledge. CFTT significantly reduces model size and computation cost, which is critical for saving storage space and inference time. Extensive experiments are conducted on synthetic and real-world datasets. The results demonstrate that transformers stably achieve state-of-the-art performance in terms of top-k hit rates. Moreover, the proposed CFTT compresses transformers 16× to 461× and speeds up inference 5× without sacrificing performance on the whole, which facilitates its application to tier management in hybrid storage systems.
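A tensor-train layer stores a large weight tensor as a chain of small 3-way cores. As background for how such layers are obtained, the exact (untruncated) TT-SVD decomposition, which TT layers then truncate to low rank, can be sketched as:

```python
import numpy as np

def tt_svd(a):
    """Exact TT-SVD: split a d-way tensor into a chain of 3-way cores
    (rank_prev, mode, rank_next) via a sweep of SVDs."""
    shape, d = a.shape, a.ndim
    cores, r, mat = [], 1, a
    for k in range(d - 1):
        mat = mat.reshape(r * shape[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        rank = s.size                       # full rank: no truncation here
        cores.append(u.reshape(r, shape[k], rank))
        mat, r = s[:, None] * vt, rank
    cores.append(mat.reshape(r, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the chain of cores back into the full tensor."""
    res = cores[0]
    for core in cores[1:]:
        res = np.tensordot(res, core, axes=([-1], [0]))
    return res.reshape([c.shape[1] for c in cores])
```

In a TT linear layer, a weight matrix is first reshaped into a higher-order tensor (e.g. 256×256 into 4×4×4×4×4×4×4×4), so at low TT rank the cores hold far fewer parameters than the dense matrix.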
Modern deep neural networks (DNNs) have greatly facilitated the development of sequential recommender systems by achieving state-of-the-art performance on various sequential recommendation tasks. Given a sequence of interacted items, existing DNN-based sequential recommenders commonly embed each item into a unique vector to support subsequent computations of user interest. However, due to the potentially large number of items, the over-parameterised item embedding matrix of a sequential recommender has become a memory bottleneck for efficient deployment in resource-constrained environments, e.g., smartphones and other edge devices. Furthermore, we observe that the widely-used multi-head self-attention, though effective in modelling sequential dependencies among items, relies heavily on redundant attention units to fully capture both global and local item-item transition patterns within a sequence.
In this paper, we introduce a novel lightweight self-attentive network (LSAN) for sequential recommendation. To aggressively compress the original embedding matrix, LSAN leverages compositional embeddings, where each item embedding is composed by merging a group of selected base embedding vectors drawn from substantially smaller embedding matrices. Meanwhile, to account for the intrinsic dynamics of each item, we further propose a temporal context-aware embedding composition scheme. In addition, we develop an innovative twin-attention network that alleviates the redundancy of traditional multi-head self-attention while retaining full capacity for capturing long- and short-term (i.e., global and local) item dependencies. Comprehensive experiments demonstrate that LSAN significantly advances the accuracy and memory efficiency of existing sequential recommenders.
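One common realisation of compositional embeddings (not necessarily LSAN's exact scheme) is the quotient-remainder trick, where two small tables replace one table of size `num_items`:

```python
import numpy as np

class QREmbedding:
    """Compositional (quotient-remainder) item embeddings: tables of
    sizes ceil(num_items / m) and m replace one table of size num_items.
    Each item's embedding is the sum of its quotient and remainder rows."""
    def __init__(self, num_items, m, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.m = m
        self.q_table = rng.normal(size=(int(np.ceil(num_items / m)), dim))
        self.r_table = rng.normal(size=(m, dim))

    def __call__(self, item_id):
        return self.q_table[item_id // self.m] + self.r_table[item_id % self.m]
```

For 1,000 items with m = 32 and dim = 8 this stores (32 + 32) × 8 = 512 parameters instead of 8,000, while still giving every item a distinct composed vector.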
We study the problem of learning to cluster data points using an oracle that can answer same-cluster queries. Unlike previous approaches, we do not assume that the total number of clusters is known in advance, nor do we require the true clusters to be consistent with a predefined objective function such as K-means. These relaxations are critical from a practical perspective yet make the problem more challenging. We propose two algorithms with provable theoretical guarantees and verify their effectiveness via an extensive set of experiments on both synthetic and real-world data.
Hypergraphs have become a popular choice for modelling complex, non-pairwise, higher-order interactions in recommender systems. However, compared with traditional graph-based methods, the constructed hypergraphs are usually much sparser, creating a dilemma between the benefits of hypergraphs and the difficulty of modelling them. Moreover, existing sequential hypergraph recommenders overlook temporal modelling of user relationships, neglecting rich social signals in the recommendation data. To tackle these shortcomings, we propose a novel architecture, the Hyperbolic Hypergraph representation learning method for Sequential Recommendation (H2SeqRec), with a pre-training phase. Specifically, we design three self-supervised tasks to obtain pre-trained item embeddings that are fed or fused into the subsequent recommendation architecture (with two ways to use the pre-trained embeddings). In the recommendation phase, we learn multi-scale item embeddings via a hierarchical structure to capture information over multiple time spans. To alleviate the negative impact of sparse hypergraphs, we use a hyperbolic-space hypergraph convolutional neural network to learn dynamic item embeddings. We also design an item enhancement module that captures dynamic social information at each timestamp to improve effectiveness. Extensive experiments on two real-world datasets demonstrate the effectiveness and strong performance of the model.
In collaborative filtering, making full use of social information is an important way to improve recommendation quality, and it has been proved effective because a user's behavior is influenced by her friends. However, existing works leverage social relationships to aggregate user features from friends' historical behavior sequences in a user-level indirect paradigm. A significant defect of the indirect paradigm is that it ignores the temporal relationships between behavior events across users. In this paper, we propose a novel time-aware sequential recommendation framework called Social Temporal Excitation Networks (STEN), which introduces temporal point processes to model the fine-grained impact of friends' behaviors on a user's dynamic interests in an event-level direct paradigm. Moreover, we propose to decompose the temporal effect in sequential recommendation into a social mutual temporal effect and an ego temporal effect. Specifically, we employ a social heterogeneous graph embedding layer to refine user representations via structural information. To enhance temporal information propagation, STEN directly extracts the fine-grained temporal mutual influence of friends' behaviors through the mutually exciting temporal network, while the user's dynamic interests are captured through the self-exciting temporal network. Extensive experiments on three real-world datasets show that STEN outperforms state-of-the-art baseline methods. Moreover, STEN provides event-level recommendation explainability, which is also illustrated experimentally.
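The mutually and self-exciting networks build on Hawkes-style temporal point processes, whose conditional intensity adds exponentially decaying kicks from past events. A minimal sketch with hypothetical parameters (the baseline rate `mu`, excitation weights, and decay `beta` are illustrative, not the paper's learned values):

```python
import math

def hawkes_intensity(t, own_events, friend_events, mu=0.1,
                     alpha_self=0.5, alpha_social=0.3, beta=1.0):
    """Conditional intensity at time t: a baseline rate plus
    exponentially decaying self-excitation from the user's own past
    events and mutual excitation from friends' past events."""
    lam = mu
    lam += sum(alpha_self * math.exp(-beta * (t - ti))
               for ti in own_events if ti < t)
    lam += sum(alpha_social * math.exp(-beta * (t - tj))
               for tj in friend_events if tj < t)
    return lam
```

Recent events raise the intensity most, which is what lets such a model attribute a user's action to a specific recent friend event at the event level.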
Nowadays, it is common for a person to possess different identities on multiple social platforms. Social network alignment aims to match the identities that belong to the same person across different networks. Recently, unsupervised network alignment methods have received significant attention since no identity anchors are required. However, to capture the relevance between identities, existing unsupervised methods generally rely heavily on user profiles, which are often unobtainable or unreliable in real-world scenarios. In this paper, we propose an unsupervised alignment framework named Large-Scale Network Alignment (LSNA) that integrates network information and reduces the dependence on user profiles. The embedding module of LSNA, named Cross Network Embedding Model (CNEM), integrates topology information and network correlation to jointly guide the embedding process. Moreover, to adapt LSNA to large-scale networks, we propose a network disassembling strategy that divides the costly large-scale network alignment problem into multiple tractable sub-problems. The proposed method is evaluated on multiple real-world social network datasets, and the results demonstrate that it outperforms state-of-the-art methods.
The Grammatical Error Correction (GEC) task is often framed as a low-resource machine translation task that translates sentences from an ungrammatical language into a grammatical one. The state-of-the-art approach to GEC, the transformer-based neural machine translation model, takes the input sentence as a token sequence without structural information and may be misled by strange ungrammatical contexts. In response, to place more attention on a given token's correct collocations rather than on misleading tokens, we propose dependent self-attention, which relatively increases the attention scores between correct collocations according to the dependency distance between tokens. However, because the source sentence is ungrammatical in GEC, correct collocations can hardly be extracted by a normal dependency parser. We therefore propose a dependency parser for ungrammatical sentences to obtain dependency distances between tokens. Our method achieves competitive results on the BEA-2019 shared task, the CoNLL-2014 shared task, and the JFLEG test sets.
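One simple way to realise a dependency-distance bias in self-attention (a sketch, not necessarily the paper's exact formulation) is to subtract a scaled pairwise-distance matrix from the attention logits before the softmax, so tokens that are close in the dependency tree attend to each other more:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dependent_self_attention(q, k, v, dep_dist, gamma=0.5):
    """Scaled dot-product attention whose logits are penalised by
    the pairwise dependency distance between tokens; gamma controls
    how strongly distant tokens are down-weighted."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) - gamma * dep_dist
    return softmax(scores) @ v
```

With `gamma = 0` this reduces to ordinary single-head attention; larger `gamma` pulls attention toward syntactic collocations.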
Hashing technology has been widely used in image retrieval due to its computational and storage efficiency. Recently, deep unsupervised hashing methods have attracted increasing attention due to the high cost of human annotation in the real world and the superiority of deep learning. However, most deep unsupervised hashing methods pre-compute a similarity matrix to model pairwise relationships in the pre-trained feature space, and then use this matrix to guide hash learning, treating most data pairs equivalently. This process has two defects: 1) the pre-computed similarity matrix is fixed and disconnected from the hash learning process, so it cannot explore the underlying semantic information; 2) informative data pairs may be buried by the large number of less-informative ones. To solve these problems, we propose a Deep Self-Adaptive Hashing (DSAH) model that adaptively captures semantic information with two special designs: Adaptive Neighbor Discovery (AND) and Pairwise Information Content (PIC). First, we adopt AND to construct an initial neighborhood-based similarity matrix and then refine it with a novel update strategy to further investigate the semantic structure behind the learned representation. Second, we measure the priority of data pairs with PIC and assign them adaptive weights, relying on the assumption that more dissimilar data pairs carry more discriminative information for hash learning. Extensive experiments on several datasets demonstrate that these two techniques enable the deep hashing model to achieve superior performance.
Various data mining tasks have been proposed to study Community Question Answering (CQA) platforms like Stack Overflow. The relatedness among some of these tasks means they can provide useful learning signals to one another via Multi-Task Learning (MTL). However, due to the high heterogeneity of these tasks, few existing works manage to solve them jointly in a unified framework. To tackle this challenge, we develop a multi-relational-graph-based MTL model called Heterogeneous Multi-Task Graph Isomorphism Network (HMTGIN), which efficiently solves heterogeneous CQA tasks. In each training forward pass, HMTGIN embeds the input CQA forum graph using an extension of the Graph Isomorphism Network with skip connections. The embeddings are then shared across all task-specific output layers to compute the respective losses. Moreover, two cross-task constraints based on domain knowledge about the tasks' relationships are used to regularize the joint learning. At evaluation time, the shared embeddings are used by the task-specific output layers to make the corresponding predictions. To the best of our knowledge, HMTGIN is the first MTL model capable of tackling CQA tasks from the perspective of multi-relational graphs. To evaluate HMTGIN's effectiveness, we build a novel large-scale multi-relational CQA graph dataset with over two million nodes from Stack Overflow. Extensive experiments show that (1) HMTGIN is superior to all baselines on five tasks, and (2) the proposed MTL strategy and cross-task constraints bring substantial advantages.
Identifying the dynamic functions of different urban zones enables a variety of smart city applications, such as intelligent urban planning, real-time traffic scheduling, and precise community management. Traditional urban function research based on government administrative zoning is often conducted at a coarse resolution with fixed splits and ignores the reshaping of zones by city growth. To solve this problem, we propose a two-stage framework to represent the high-definition distribution of urban functions across the city by analyzing continuous human traces extracted from dense, widespread, and full-time cellular data. In the representation stage, we embed the locations of base stations by modeling user movements with staying and transfer events, taking into account the dynamic trip purposes in continuous human traces. In the annotation stage, we first divide the city into the finest unit zones, each of which covers at least one base station. By clustering the base stations, we then group the unit zones into functional zones. Last, we annotate the functional zones based on local point-of-interest (POI) information. In experiments, we evaluate the proposed high-definition function study on two tasks: (i) in-zone crowd flow prediction and (ii) zone-enhanced POI recommendation. The results demonstrate the advantage of the proposed method in both the effectiveness of the city split and the quality of the function annotation.
Social summarization aims to produce a concise summary that describes the core content of a collection of posts on a specific topic. Existing methods tend to produce sparse or ambiguous representations of posts because they use only short and informal text content. Recent research uses social relations to improve the diversity of summaries, yet it models social relations as a regularization term, which offers poor flexibility and generalization. Such methods cannot capture the deep semantic and social interactions among posts, so summaries still suffer from redundancy. We propose to use Social Context and Multi-Granularity Relations (SCMGR) to improve unsupervised social summarization. It learns more informative representations of posts, considering both text semantics and social structure, without any annotated data. First, we design two sociologically motivated meta-paths to construct a social context graph among posts and adopt a graph convolutional network to aggregate social context information from neighbors. Second, we design a multi-granularity relation decoder that captures deeper semantic and social interactions from post-word and post-post aspects, respectively, providing guidance for summary selection from both semantic and social-structure perspectives. Finally, a sparse reconstruction-based extractor selects the posts that best reconstruct the original content and social network structure as summaries. Our approach improves the coverage and diversity of summaries. Experimental results on both English and Chinese corpora prove the effectiveness of our model.
For reliability, machine learning models in some areas, e.g., finance and healthcare, are required to be both accurate and globally interpretable. Among these areas, credit risk assessment is a major application of machine learning in which financial institutions evaluate users' credit and detect default or fraud. Simple white-box models, such as Logistic Regression (LR), are usually used for credit risk assessment but are not powerful enough to model complex nonlinear interactions among features. In contrast, complex black-box models are powerful at modeling but lack interpretability, especially global interpretability. Fortunately, automatic feature crossing is a promising way to find cross features that make simple classifiers more accurate without heavy handcrafted feature engineering. However, existing automatic feature crossing methods are inefficient for credit risk assessment, whose data usually contain hundreds of feature fields.
In this work, we find that the local interpretations of a specific feature in Deep Neural Networks (DNNs) are usually inconsistent across samples. We demonstrate that this is caused by nonlinear feature interactions in the hidden layers of the DNN. We can therefore mine feature interactions from a DNN and use them as cross features in LR, which makes cross-feature mining more efficient. Accordingly, we propose a novel automatic feature crossing method called DNN2LR. The final model generated by DNN2LR, an LR model empowered with cross features, is a white-box model. We conduct experiments on both public and business datasets from real-world credit risk assessment applications, which show that DNN2LR outperforms both the conventional models used for credit assessment and several feature crossing methods. Moreover, compared with the state-of-the-art feature crossing method AutoCross, DNN2LR accelerates the search by roughly 10 to 40 times on financial credit assessment datasets, which contain hundreds of feature fields.
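Conceptually, a cross feature is just the conjunction of two categorical fields, which can then be one-hot encoded for an LR model. A sketch with hypothetical field names (in DNN2LR, only the pairs surfaced by the DNN's interaction mining would be kept):

```python
from itertools import combinations

def cross_features(sample, pairs=None):
    """Build cross features from a dict of categorical fields.
    If no field pairs are given, cross every pair of fields."""
    if pairs is None:
        pairs = combinations(sorted(sample), 2)
    return {f"{a}x{b}": f"{sample[a]}_{sample[b]}" for a, b in pairs}
```

Each distinct cross value becomes one binary LR feature, so the model stays a readable white box while capturing a pairwise nonlinearity.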
Sequential recommendation systems seek to learn users' preferences in order to predict their next actions based on recently engaged items. Users' static behavior takes a long time to form, whereas short-term interactions with items usually reflect immediate needs and are more variable. RNN-based models are constrained by a strong order assumption and struggle to model complex and changeable data flexibly, while most CNN-based models are limited by fixed convolutional kernels. All these methods are suboptimal for modeling the dynamics of item-to-item transitions: it is difficult to describe items with complex relations and to extract fine-grained user preferences from the interaction sequence. To address these issues, we propose a knowledge-aware sequential recommender with an attention-enhanced dynamic convolutional network (KAeDCN). Our model combines a dynamic convolutional network with attention mechanisms to capture changing dependencies in the sequence. Meanwhile, we enhance item representations with Knowledge Graph (KG) information through an information fusion module to capture fine-grained user preferences. Experiments on four public datasets demonstrate that KAeDCN outperforms most state-of-the-art sequential recommenders. Furthermore, the experimental results also show that KAeDCN effectively enhances item representations and improves the extraction of sequential dependencies.
With the popularity of mobile devices equipped with positioning hardware, enormous amounts of trajectory data are now easy to obtain. This development has promoted the study of extracting movement patterns from the trajectories of moving objects. One such pattern is the convoy, a group of objects moving together for a period of time. Existing convoy mining algorithms incur a large time cost because they run a density-based clustering algorithm over all objects globally. In this paper, we propose an efficient convoy mining algorithm (ECMA) that adopts a divide-and-conquer methodology. A block-based partition model (BP-Model) is designed to divide objects into multiple maximized connected nonempty block areas (MOBAs). The convoy mining problem is then solved by processing each MOBA sequentially, which significantly reduces the time cost. In our experiments on real-world datasets, ECMA is more efficient than existing convoy mining algorithms.
Recently, micro-video sharing platforms such as Kuaishou and TikTok have become a major source of information in people's lives. Given the large traffic volume, short video lifespan, and streaming nature of these services, it has become increasingly pressing to improve existing recommender systems to accommodate these challenges in a cost-effective way. In this paper, we propose a novel concept-aware denoising graph neural network (named Conde) for micro-video recommendation. Conde consists of a three-phase graph convolution process to derive user and micro-video representations: warm-up propagation, graph denoising, and preference refinement. A heterogeneous tripartite graph is constructed by connecting user nodes with video nodes, and video nodes with associated concept nodes extracted from the captions and comments of the videos. To address the noisy information in the graph, we introduce a user-oriented graph denoising phase that extracts a subgraph better reflecting the user's preference. Although this paper focuses on micro-video recommendation, we also show that our method generalizes to other types of tasks, and we therefore conduct additional empirical studies on a well-known public e-commerce dataset. The experimental results suggest that Conde achieves significantly better recommendation performance than the existing state-of-the-art solutions.
Hyperparameter optimization (HPO), which aims to automatically search for optimal hyperparameter configurations, has attracted increasing attention in the machine learning community. HPO generally suffers from high search costs on large-scale real-world datasets, since training a model under a given hyperparameter configuration is time-consuming. Existing works suggest sampling subsets uniformly to represent the full dataset for HPO, but they ignore the complex and dynamic distributions of real-world data and do not explore hyperparameter transfer. To tackle this problem, we propose a novel meta hyperparameter optimization model with an adversarial proxy-subset sampling strategy (Meta-HPO), which transfers hyperparameters optimized on sampled proxy subsets to the full dataset and further adapts to new data in an out-of-sample updating manner. In particular, a perturbation-aware adversarial sampling strategy is designed to select the proxy subsets that most influence model performance. With the searched hyperparameter configurations and corresponding performance scores on the proxy subsets, we propose a meta transfer framework, named "hp-learner'', that connects the distribution of a dataset with its optimal hyperparameter configuration. Meta-HPO provides a flexible and efficient hyperparameter optimization algorithm. Extensive experiments on real-world datasets validate the advantages of the proposed Meta-HPO model over existing state-of-the-art benchmarks.
Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm in which users communicate with the search system via natural language dialogue. Evaluating such systems is very challenging, since search results are presented as natural language sentences: given the unlimited number of possible responses, collecting relevance assessments for all of them is infeasible. In this paper, we propose POSSCORE, a simple yet effective automatic evaluation method for conversational search. The proposed embedding-based metric takes into account the part of speech (POS) of the terms in the response. To the best of our knowledge, our work is the first to systematically demonstrate the importance of incorporating syntactic information, such as POS labels, into conversational search evaluation. Experimental results demonstrate that our metrics correlate with human preference, achieving significant improvements over state-of-the-art baseline metrics.
This paper is concerned with computing the semantic difference between versions of large-scale ontological knowledge bases using a uniform interpolation (UI) approach. The semantic difference between two versions of an ontology is the set of axioms entailed by one version but not the other, reflecting the evolutionary changes in the ontology's content. In general, computing such axioms is not computationally feasible, since there are infinitely many of them. UI is an advanced reasoning technique that creates restricted views of ontologies; it provides an effective means for computing a finite representation of the difference between two ontologies. While existing UI methods are designed for languages that are either more or less expressive than the description logic ELH, the underlying language of typical large-scale ontologies, in this paper we introduce a practical UI method tailored to computing the semantic difference in large-scale ELH-ontologies. The method is terminating and sound, and can always compute UI results, possibly including fresh definer symbols. Two case studies on different versions of the SNOMED CT terminology show that the method overcomes major limitations of existing UI methods and can be used to reveal modeling changes that have occurred over successive releases of SNOMED CT.
Graph embedding converts a graph into a multi-dimensional space in which the graph's structural information or properties are maximally preserved. It is an effective and efficient way to give users a deeper understanding of what lies behind the data, and it can thus benefit many useful applications. However, most graph embedding methods suffer from high computation and space costs. In this paper, we present a simple graph embedding method that directly embeds the graph into its Euclidean distance space. This method does not require the learned representations to be low dimensional, but it has several good characteristics. We find that the centrality of nodes and edges can be read off from the positions of nodes and the lengths of edges in the embedded graph, and that edge length is closely related to the density of regions in the graph. We then apply this embedding method to graph analytics tasks such as community detection, graph compression, and wormhole detection. Our evaluation shows the effectiveness and efficiency of the method and argues that it yields a promising approach to graph analytics.
Deep learning (DL) algorithms have played a major role in achieving state-of-the-art (SOTA) performance in various learning applications, including computer vision, natural language processing, and recommendation systems (RSs). However, these methods rely on vast amounts of data and do not perform as well when data is limited. Moreover, some of these applications (e.g., RSs) suffer from additional issues such as data sparsity and the cold-start problem. While recent research on RSs has used DL models based on side information (SI) (e.g., product reviews, film plots) to tackle these challenges, we propose boosting neural network (BNN), a new DL framework for capturing complex patterns that requires only a limited amount of data. Unlike conventional boosting, BNN does not sum the predictions generated by its components. Instead, it uses these predictions as new SI features, which enhances accuracy. Our framework can be applied to many problems, including classification, regression, and ranking. In this paper, we demonstrate BNN's use on a classification task. Comprehensive experiments on three real-world datasets demonstrate BNN's ability to outperform existing SOTA models on classification tasks (e.g., click-through rate prediction).
In recent years, researchers have attempted to exploit online social information to alleviate data sparsity in collaborative filtering, based on the rationale that social networks offer insights into users' behavioral patterns. However, because they overlook inter-dependent knowledge across items (e.g., knowledge graph dependencies between products), existing social recommender systems are insufficient for distilling heterogeneous collaborative signals from both the user and item sides. In this work, we propose the Self-Supervised Metagraph Informax Network (SMIN), which investigates the potential of jointly incorporating social- and knowledge-aware relational structures into the user preference representation framework. To model relation heterogeneity, we design a metapath-guided heterogeneous graph neural network to aggregate feature embeddings from different types of meta-relations across users and items, empowering SMIN to maintain dedicated representations for multifaceted user- and item-wise dependencies. Additionally, to inject high-order collaborative signals into recommendation, we generalize the mutual information learning paradigm from vector space to self-supervised graph-based collaborative filtering. This enables expressive modeling of user-item interaction patterns by exploring global-level collaborative relations and the underlying isomorphic transformation property of graph topology. Experimental results on several real-world datasets demonstrate the effectiveness of our model over various state-of-the-art recommendation methods. Further analysis provides insights into the performance superiority of our new recommendation framework. We release our source code at https://github.com/SocialRecsys/SMIN.
Community detection, which aims to group graph nodes into clusters with dense internal connections, is a fundamental graph mining task. Recently, it has been studied on heterogeneous graphs, which contain multiple types of nodes and edges and pose great challenges for modeling the high-order relationships between nodes. With the surge of graph embedding techniques, they have also been adopted for community detection. A notable line of work uses meta-paths to capture high-order relationships between nodes and encode them into node embeddings to facilitate community detection. However, defining meaningful meta-paths requires substantial domain knowledge, which largely limits their applications, especially on schema-rich heterogeneous graphs such as knowledge graphs. To alleviate this issue, in this paper we propose to exploit context paths to capture high-order relationships between nodes, and build a Context Path-based Graph Neural Network (CP-GNN) model. It recursively encodes high-order relationships between nodes into the node embeddings, using attention mechanisms to discriminate the importance of different relationships. By maximizing the expected co-occurrence of nodes connected by context paths, the model learns node embeddings that both preserve the high-order relationships between nodes and are helpful for community detection. Extensive experimental results on four real-world datasets show that CP-GNN outperforms state-of-the-art community detection methods.
Generating novel molecules with desired properties is a fundamental problem in modern drug discovery. It is challenging because it requires optimizing the given objectives while obeying the rules of chemical valence. An effective approach is to combine molecular graphs with deep generative models. However, recent high-performance generative models remain computationally expensive. In this paper, we propose GF-VAE, a flow-based variational autoencoder (VAE) model for molecular graph generation. Specifically, the model equips a VAE with a lightweight flow model as its decoder: the encoder accelerates the training of the decoder, while the decoder in turn optimizes the performance of the encoder. Thanks to the invertibility of the flow model, generation is easily accomplished by reversing the decoder. Additionally, the generated molecules are post-processed by validity correction. GF-VAE thus inherits the advantages of both VAE- and flow-based methods. We validate our model on molecule generation and reconstruction, smoothness of the learned latent space, property optimization, and constrained property optimization. The results show that our model achieves state-of-the-art performance on these tasks. Moreover, on two classical datasets GF-VAE achieves 31.3% and 62.9% improvements in time performance, respectively, over the state-of-the-art model.
Research on analyzing graphs with Graph Neural Networks (GNNs) has been receiving increasing attention because of the great expressive power of graphs. GNNs map the adjacency matrix and node features to node representations by passing messages along edges at each convolution layer. However, the messages passed through GNNs are not always beneficial for all parts of a graph. Specifically, as the data distribution varies over the graph, the receptive field (the farthest nodes from which a node can obtain information) needed to gather information also varies. Existing GNNs treat all parts of the graph uniformly, which makes it difficult to adaptively pass the most informative messages to each part. To solve this problem, we propose two regularization terms that consider message passing locally: (1) Intra-Energy Reg and (2) Inter-Energy Reg. Through experiments and theoretical discussion, we first show that the speed of smoothing varies enormously across parts of a graph and that each part's topology affects how it smooths. With Intra-Energy Reg, we strengthen message passing within each part, which helps gather more useful information. With Inter-Energy Reg, we improve the ability of GNNs to distinguish different nodes. With the two proposed regularization terms, GNNs can filter the most useful information adaptively, learn more robustly, and gain higher expressiveness. Moreover, the proposed LEReg can be easily applied to other GNN models in a plug-and-play fashion. Extensive experiments on several benchmarks verify that GNNs with LEReg outperform or match state-of-the-art methods. The effectiveness and efficiency are also visualized empirically with elaborate experiments.
Graph Neural Networks (GNNs) have risen to prominence in learning representations for graph-structured data. A single GNN layer typically consists of a feature transformation and a feature aggregation operation. The former normally uses feed-forward networks to transform features, while the latter aggregates the transformed features over the graph. Numerous recent works have proposed GNN models with different designs of the aggregation operation. In this work, we establish mathematically that the aggregation processes in a group of representative GNN models, including GCN, GAT, PPNP, and APPNP, can be regarded as (approximately) solving a graph denoising problem with a smoothness assumption. This unified view across GNNs not only provides a new perspective for understanding a variety of aggregation operations but also enables us to develop a unified graph neural network framework, UGNN. To demonstrate its promising potential, we instantiate a novel GNN model, ADA-UGNN, derived from UGNN, to handle graphs with adaptive smoothness across nodes. Comprehensive experiments show the effectiveness of ADA-UGNN.
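The graph-denoising view can be made concrete with a minimal sketch (illustrative, not the UGNN code): gradient descent on the smoothness-regularized objective ||F − X||²_F + c·tr(FᵀLF) produces aggregation-like smoothing steps, and converges to the closed-form minimizer (I + cL)⁻¹X.

```python
import numpy as np

def denoise(x, lap, c=1.0, step=0.1, iters=200):
    """Gradient descent on ||F - X||_F^2 + c * tr(F^T L F): each step mixes a
    node's own signal with a neighbor-smoothed signal, mirroring a GNN
    aggregation layer."""
    f = x.copy()
    for _ in range(iters):
        grad = 2 * (f - x) + 2 * c * lap @ f
        f -= step * grad
    return f

# Tiny 3-node path graph, combinatorial Laplacian L = D - A.
a = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
lap = np.diag(a.sum(1)) - a
x = np.array([[1.0], [0.0], [-1.0]])                  # a 1-d node signal

f_iter = denoise(x, lap, c=1.0)
f_closed = np.linalg.solve(np.eye(3) + 1.0 * lap, x)  # exact minimizer
print(np.allclose(f_iter, f_closed, atol=1e-6))       # True
```

Truncating the iteration after a few steps is what makes the correspondence to a finite stack of aggregation layers only approximate.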
Designing pre-training objectives that more closely resemble the downstream tasks of pre-trained language models can lead to better performance at the fine-tuning stage, especially in ad-hoc retrieval. Existing pre-training approaches tailored for IR have tried to incorporate weakly supervised signals, such as query-likelihood-based sampling, to construct pseudo query-document pairs from a raw textual corpus. However, these signals rely heavily on the sampling method; for example, the query likelihood model may introduce much noise into the constructed pre-training data. In this paper, we propose to leverage large-scale hyperlinks and anchor texts to pre-train a language model for ad-hoc retrieval. Since anchor texts are created by webmasters and can usually summarize the target document, they help build more accurate and reliable pre-training samples than a specific sampling algorithm. Considering different views of downstream ad-hoc retrieval, we devise four pre-training tasks based on the hyperlinks. We then pre-train a Transformer model to predict pair-wise preferences, jointly with the Masked Language Model objective. Experimental results on two large-scale ad-hoc retrieval datasets show significant improvements of our model over existing methods.
We consider the cross-modal task of producing color representations for text phrases. Motivated by the fact that a significant fraction of user queries on an image search engine follow an (attribute, object) structure, we propose a generative adversarial network that generates color profiles for such bigrams. We design our pipeline to learn composition, i.e., the ability to combine seen attributes and objects into unseen pairs. We propose a novel dataset curation pipeline built from existing public sources. We describe how a set of phrases of interest can be compiled using a graph propagation technique and then mapped to images. While this dataset is specialized to our investigation of color, the method can be extended to other visual dimensions where composition is of interest. We provide detailed ablation studies that test the behavior of our GAN architecture with loss functions from the contrastive learning literature. We show that the generative model achieves a lower Fréchet Inception Distance than discriminative ones and therefore predicts color profiles that better match those of real images. Finally, we demonstrate improved performance in image retrieval and classification, indicating the crucial role that color plays in these downstream tasks.
Information Retrieval evaluation has traditionally focused on defining principled ways of assessing the relevance of a ranked list of documents with respect to a query. Several methods extend this type of evaluation beyond relevance, making it possible to evaluate different aspects of a document ranking (e.g., relevance, usefulness, or credibility) with a single measure (multi-aspect evaluation). However, these methods either (i) are tailor-made for specific aspects and do not extend to other types or numbers of aspects, or (ii) have theoretical anomalies, e.g., assigning the maximum score to a ranking in which all documents are labelled with the lowest grade with respect to all aspects (e.g., not relevant, not credible, etc.).
We present a theoretically principled multi-aspect evaluation method that can be used for any number, and any type, of aspects. A thorough empirical evaluation using up to 5 aspects and a total of 425 runs officially submitted to 10 TREC tracks shows that our method is more discriminative than the state-of-the-art and overcomes its theoretical limitations.
Collaborative filtering (CF) is a widely studied research topic in recommender systems. Learning a CF model generally depends on three major components: the interaction encoder, the loss function, and negative sampling. While many existing studies focus on designing more powerful interaction encoders, the impacts of loss functions and negative sampling ratios have not yet been well explored. In this work, we show that the choice of loss function and negative sampling ratio is equally important. More specifically, we propose the cosine contrastive loss (CCL) and incorporate it into a simple unified CF model dubbed SimpleX. Extensive experiments have been conducted on 10 benchmark datasets against 28 existing CF models in total. Surprisingly, the results show that, with our CCL loss and a large negative sampling ratio, SimpleX surpasses most sophisticated state-of-the-art models by a large margin (e.g., up to 48.5% improvement in NDCG@20 over LightGCN). We believe that SimpleX can not only serve as a simple strong baseline to foster future research on CF, but also shed light on a potential research direction towards improving loss functions and negative sampling.
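The exact CCL formulation is given in the paper; as a rough sketch of a cosine contrastive loss of this flavor, one positive term pulls the positive item's cosine similarity toward 1, while margin-thresholded negative terms push negatives down. The margin and negative-weight values below are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def cosine_contrastive_loss(user, pos, negs, margin=0.5, neg_weight=1.0):
    """Sketch of a cosine contrastive loss: (1 - cos(u, i+)) plus the mean of
    max(0, cos(u, i-) - margin) over negatives; negatives already below the
    margin contribute zero loss."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos_term = 1.0 - cos(user, pos)
    neg_term = np.mean([max(0.0, cos(user, j) - margin) for j in negs])
    return pos_term + neg_weight * neg_term

u = np.array([1.0, 0.0])
i_pos = np.array([1.0, 0.0])                 # perfectly aligned positive
i_negs = [np.array([0.0, 1.0])]              # orthogonal negative, below margin
print(cosine_contrastive_loss(u, i_pos, i_negs))  # 0.0: both terms vanish
```

With a large negative sample, only "hard" negatives (cosine above the margin) contribute gradient, which is one intuition for why the margin matters.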
With the recent success of graph convolutional networks (GCNs), they have been widely applied for recommendation and have achieved impressive performance gains. The core of GCNs lies in their message passing mechanism for aggregating neighborhood information. However, we observe that message passing largely slows down the convergence of GCNs during training, especially for large-scale recommender systems, which hinders their wide adoption. LightGCN makes an early attempt to simplify GCNs for collaborative filtering by omitting feature transformations and nonlinear activations. In this paper, we take a step further and propose an ultra-simplified formulation of GCNs (dubbed UltraGCN), which skips infinite layers of message passing for efficient recommendation. Instead of explicit message passing, UltraGCN directly approximates the limit of infinite-layer graph convolutions via a constraint loss. Meanwhile, UltraGCN allows for more appropriate edge weight assignments and flexible adjustment of the relative importance of different types of relationships. This yields a simple yet effective UltraGCN model that is easy to implement and efficient to train. Experimental results on four benchmark datasets show that UltraGCN not only outperforms state-of-the-art GCN models but also achieves more than a 10x speedup over LightGCN.
Entity alignment (EA) aims to find equivalent entities in different knowledge graphs (KGs), a crucial step in integrating multiple KGs. However, most existing EA methods have poor scalability and cannot cope with large-scale datasets. We identify three issues leading to such high time-space complexity in existing EA methods: (1) inefficient graph encoders, (2) the dilemma of negative sampling, and (3) "catastrophic forgetting" in semi-supervised learning. To address these challenges, we propose a novel EA method with three new components that enable high Performance, high Scalability, and high Robustness (PSR): (1) a simplified graph encoder with relational graph sampling, (2) a symmetric negative-free alignment loss, and (3) incremental semi-supervised learning. Furthermore, we conduct detailed experiments on several public datasets to examine the effectiveness and efficiency of our proposed method. The experimental results show that PSR not only surpasses the previous SOTA in performance but also has impressive scalability and robustness.
Many social phenomena are triggered by public opinion, which is formed through opinion exchange among individuals. To date, from an engineering point of view, a large body of work has studied how to manipulate individual opinions so as to guide public opinion towards a desired state. Recently, Abebe et al. (KDD 2018) initiated the study of interventions at the level of susceptibility, rather than interventions that directly modify individual opinions themselves. For their model, Chan et al. (The Web Conference 2019) designed a local search algorithm that finds an optimal solution in polynomial time. However, the solution obtained from this model might not be implementable in real-world scenarios: since the model does not account for the magnitude of changes to susceptibility, changing agents' susceptibility values according to the solution could be too costly.
In this paper, we study an opinion optimization model that can limit the magnitude of changes to susceptibility in various forms. First, we introduce a novel opinion optimization model, in which the initial susceptibility values are given as additional input and the feasible region is defined by the ℓp-ball centered at the initial susceptibility vector. For the proposed model, we design a projected gradient method that scales to millions of agents. Finally, we conduct thorough experiments on a variety of real-world social networks and demonstrate that the proposed algorithm outperforms baseline methods.
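For the case p = 2, where the projection onto the ℓ2-ball has a closed form, a projected gradient iteration can be sketched as follows. This is a generic sketch, not the paper's algorithm; the toy objective and all names are illustrative.

```python
import numpy as np

def project_l2_ball(x, center, radius):
    """Euclidean projection of x onto the ball {y : ||y - center||_2 <= radius}."""
    diff = x - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return x
    return center + radius * diff / norm

def projected_gradient(grad, x0, center, radius, step=0.1, iters=100):
    """Projected gradient descent: take a gradient step, then project back
    into the feasible l2-ball around the initial vector."""
    x = x0.copy()
    for _ in range(iters):
        x = project_l2_ball(x - step * grad(x), center, radius)
    return x

# Toy objective f(x) = ||x - t||^2 with target t outside the feasible ball:
# the minimizer is the boundary point of the ball closest to t.
t = np.array([3.0, 0.0])
center = np.zeros(2)
x = projected_gradient(lambda v: 2 * (v - t), center.copy(), center, radius=1.0)
print(x)  # ≈ [1. 0.]
```

For general p, the projection has no simple closed form (p = 1 needs a soft-thresholding search), which is part of what makes the general model harder.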
Neural architecture search (NAS) has achieved remarkable results in deep neural network design. Differentiable architecture search converts the search over discrete architectures into a hyperparameter optimization problem that can be solved by gradient descent. However, questions have been raised regarding the effectiveness and generalizability of gradient methods for solving non-convex architecture hyperparameter optimization problems. In this paper, we propose L2NAS, which learns to intelligently optimize and update architecture hyperparameters via an actor neural network based on the distribution of high-performing architectures in the search history. We introduce a quantile-driven training procedure that efficiently trains L2NAS in an actor-critic framework via continuous-action reinforcement learning. Experiments show that L2NAS achieves state-of-the-art results on the NAS-Bench-201 benchmark as well as the DARTS and Once-for-All MobileNetV3 search spaces. We also show that the search policies generated by L2NAS are generalizable and transferable across different training datasets with minimal fine-tuning.
Automatic detection of clickbait and incongruent news headlines is crucial to maintaining the reliability of the Web and has attracted much research attention. However, most existing methods perform poorly when a news headline contains contextually important cardinal values, such as a quantity or an amount. In this work, we focus on this particular case and propose a neural attention-based solution that uses a novel cardinal part-of-speech (POS) tag-pattern-based hierarchical attention network, named POSHAN, to learn effective representations of sentences in a news article. In addition, we investigate a novel cardinal-phrase-guided attention mechanism, which uses word embeddings of the contextually important cardinal value and its neighbouring words. In experiments conducted on two publicly available datasets, we observe that the proposed method gives appropriate significance to cardinal values and outperforms all baselines. An ablation study of POSHAN shows that the cardinal POS-tag-pattern-based hierarchical attention is very effective for cases in which the headline contains cardinal values.
The web is full of guidance on a wide variety of tasks, from changing the oil in your car to baking an apple pie. However, as content is created independently, a single task could have thousands of corresponding procedural texts. This makes it difficult for users to view the bigger picture and understand the multiple ways the task could be accomplished. In this work we propose an unsupervised learning approach for summarizing multiple procedural texts into an intuitive graph representation, allowing users to easily explore commonalities and differences. We demonstrate our approach on recipes, a prominent example of procedural texts. User studies show that our representation is intuitive and coherent and that it has the potential to help users with several sensemaking tasks, including adapting recipes for a novice cook and finding creative ways to spice up a dish.
Given a source node s and a target node t in a graph G, the Personalized PageRank (PPR) from s to t is the probability that a random walk starting from s terminates at t. PPR is a classic measure of relevance between nodes in a graph and has been applied in numerous real-world systems. However, existing techniques for PPR queries are not robust to dynamic real-world graphs, which typically evolve at varying speeds. Their performance degrades significantly either at a lower graph evolving rate (e.g., many more queries than updates) or at a higher one.
To address the above deficiencies, we propose Agenda to efficiently process, with strong approximation guarantees, the single-source PPR (SSPPR) queries on dynamically evolving graphs with various evolving speeds. Compared with previous methods, Agenda has significantly better workload robustness, while ensuring the same result accuracy. Agenda also has theoretically-guaranteed small query and update costs. Experiments on up to billion-edge scale graphs show that Agenda significantly outperforms state-of-the-art methods for various query/update workloads, while maintaining better or comparable approximation accuracies.
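As background for the queries Agenda serves, the PPR value itself can be estimated by simple Monte Carlo simulation of α-decay random walks. This is a textbook baseline, not Agenda's method; the decay parameter, walk count, and dangling-node handling below are illustrative choices.

```python
import random

def ppr_monte_carlo(adj, s, t, alpha=0.2, walks=20000, seed=0):
    """Estimate ppr(s, t): the probability that an alpha-decay random walk
    from s terminates at t. At each step the walk stops with probability
    alpha; otherwise it moves to a uniformly random out-neighbor."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        u = s
        while rng.random() >= alpha:
            if not adj[u]:            # dangling node: let the walk stop here
                break
            u = rng.choice(adj[u])
        hits += u == t
    return hits / walks

# Two-node cycle 0 <-> 1: after k non-stop steps the walk sits at k mod 2, so
# ppr(0, 0) = sum over even k of (1-alpha)^k * alpha = alpha / (1 - (1-alpha)^2).
est = ppr_monte_carlo([[1], [0]], 0, 0)
print(est)  # close to 0.2 / (1 - 0.8**2) ≈ 0.556
```

Index-based methods such as the ones the paper builds on trade this per-query simulation cost for precomputed structures that must then be kept fresh under updates, which is exactly the workload-robustness tension the abstract describes.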
Modeling information cascades in a social network through the lens of the ideological leaning of its users can help in understanding phenomena such as misinformation propagation and confirmation bias, and in devising techniques to mitigate their toxic effects.
In this paper, we propose a stochastic model that learns the ideological leaning of each user in a multidimensional ideological space by analyzing how politically salient content propagates. In particular, our model assumes that information propagates from one user to another if both users are interested in the topic and are ideologically aligned with each other. To infer the parameters of our model, we devise a gradient-based optimization procedure that maximizes the likelihood of an observed set of information cascades. Our experiments on real-world political discussions on Twitter and Reddit confirm that our model learns the political stance of social media users in a multidimensional ideological space.
A search engine generally applies a single search strategy to every user query. The search combines many component processes (e.g., indexing, query expansion, the search-weighting model, document ranking) and their hyperparameters, whose values are optimized on past queries and then applied to all future queries. However, even an optimized system may perform poorly on some queries on which another system would perform better. A selective search strategy aims to select the most appropriate combination of components and hyperparameter values for each individual query. The number of candidate combinations is huge; to adapt best to any query, the ideal system would use many of them, but in the real world it would be too costly to use and maintain thousands of configurations. A trade-off must therefore be found between performance and cost. In this paper, we describe a risk-sensitive approach to optimizing the set of configurations included in a selective search strategy, which solves the problem of which, and how many, configurations to include in the system. We show that using 20 configurations yields significantly greater effectiveness than current approaches when tested on three TREC reference collections (about 23% compared to L2R documents and about 10% compared to other selective approaches), and that it offers an appropriate trade-off between system complexity and system effectiveness.
Several real-world scenarios, such as remote control and sensing, involve action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to such an extent that they fail to learn anything substantial. This paper formally describes Markov Decision Processes (MDPs) with stochastic delays and shows that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with a significantly simplified cost structure. We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built on this framework achieves near-optimal rewards in environments with stochastic delays in actions and observations. The delay-resolved deep Q-network (DRDQN) algorithm is benchmarked on a variety of environments comprising multi-step and stochastic delays, and it outperforms the currently established algorithms both in achieving near-optimal rewards and in minimizing the computational overhead.
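The delayed-to-standard reduction rests on augmenting the state with the queue of pending actions, which restores the Markov property. A minimal sketch of this idea for a fixed action delay (not the paper's DRDQN implementation; the wrapper and toy environment are invented for illustration):

```python
from collections import deque

class DelayedEnv:
    """Augmented-state wrapper: actions take effect after `delay` steps.
    The pair (observation, pending-action queue) is Markov again, which is
    the core idea behind reducing a delayed MDP to a standard one."""
    def __init__(self, env, delay, noop=0):
        self.env, self.delay, self.noop = env, delay, noop

    def reset(self):
        self.pending = deque([self.noop] * self.delay)
        return (self.env.reset(), tuple(self.pending))

    def step(self, action):
        self.pending.append(action)           # queue the newly chosen action
        applied = self.pending.popleft()      # apply the one chosen `delay` steps ago
        obs, reward = self.env.step(applied)
        return (obs, tuple(self.pending)), reward

class Counter:
    """Toy environment: state is a counter, reward equals the counter value."""
    def reset(self):
        self.x = 0
        return self.x
    def step(self, a):
        self.x += a
        return self.x, self.x

env = DelayedEnv(Counter(), delay=2)
env.reset()
_, r1 = env.step(1)   # a queued noop lands; the "1" waits in the queue
_, r2 = env.step(1)   # another noop lands
_, r3 = env.step(0)   # the first queued "1" finally takes effect
print(r1, r2, r3)     # 0 0 1
```

Stochastic delays, as in the paper, require tracking which pending actions have landed rather than a fixed-length queue, but the augmentation principle is the same.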
Modern recommender systems usually embed users and items into a learned vector space representation. Similarity in this space is used to generate recommendations, and recommendation methods are agnostic to the structure of the embedding space. Motivated by the need for recommender systems to be more transparent and controllable, we postulate that it is beneficial to assign meaning to some dimensions of the user and item representations. Disentanglement is one technique commonly used for this purpose. We present a novel supervised disentangling approach for recommendation tasks. Our model learns embeddings in which attributes of interest are disentangled, while requiring only a very small number of labeled items at training time. The model can then generate interactive and critiquable recommendations for all users, without requiring any labels at recommendation time and without sacrificing recommendation performance. Our approach thus provides users with levers to manipulate, critique, and fine-tune recommendations, and gives insight into why particular recommendations are made. Given only user-item interactions at recommendation time, we show that it identifies user tastes with respect to the disentangled attributes, allowing users to manipulate recommendations along these attributes.
In many high-stakes applications of machine learning models, outputting only predictions or providing statistical confidence is usually insufficient to gain trust from end users, who often prefer a transparent reasoning paradigm. Despite recent encouraging developments in deep networks for sequential data modeling, the underlying rationales of their predictions are difficult to explain due to their highly recursive functions. Thus, in this paper, we aim to develop a sequence modeling approach that explains its own predictions by breaking input sequences down into evidencing segments (i.e., sub-sequences) in its reasoning. To this end, we build our model upon convolutional neural networks, which, in their vanilla form, associate local receptive fields with outputs in an obscure manner. To unveil this association, we resort to case-based reasoning and design prototype modules whose units (i.e., prototypes) resemble exemplar segments in the problem domain. Each prediction is obtained by combining comparisons between the prototypes and the segments of an input. To enhance interpretability, we propose a training objective that delicately adapts the distribution of prototypes to the data distribution in latent spaces, and design an algorithm to map prototypes to human-understandable segments. Through extensive experiments in a variety of domains, we demonstrate that our model generally achieves high interpretability, together with accuracy competitive with state-of-the-art approaches.
A typical assumption in supervised machine learning is that the training (source) and test (target) datasets follow exactly the same distribution. This assumption is, however, often violated in uncertain real-world applications, which motivates the study of learning under covariate shift. In this setting, the naive use of adaptive hyperparameter optimization methods such as Bayesian optimization does not work as desired, since it does not address the distributional shift between datasets. In this work, we consider a novel hyperparameter optimization problem under multi-source covariate shift, whose goal is to find the optimal hyperparameters for a target task of interest using only unlabeled data from the target task and labeled data from multiple source tasks. To conduct efficient hyperparameter optimization for the target task, it is essential to estimate the target objective using only the available information. To this end, we construct a variance-reduced estimator that unbiasedly approximates the target objective with a desirable variance property. Building on the proposed estimator, we provide a general and tractable hyperparameter optimization procedure that works well in our setting with a no-regret guarantee. The experiments demonstrate that the proposed framework broadens the applications of automated hyperparameter optimization.
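A standard building block for estimating a target objective from source samples is importance weighting by the density ratio p_target/p_source; the paper's estimator adds variance reduction on top, which this minimal sketch omits. All names and the toy setup are illustrative.

```python
import numpy as np

def iw_target_risk(losses, p_source, p_target):
    """Importance-weighted estimate of the target-distribution risk using only
    source-distribution samples: E_T[loss] = E_S[(p_T / p_S) * loss]."""
    w = p_target / p_source
    return np.mean(w * losses)

# Discrete toy check on support {0, 1}. With an exactly uniform source sample,
# the weighted average reproduces the target expectation exactly.
xs = np.array([0] * 500 + [1] * 500)       # exactly uniform source sample
losses = xs.astype(float)                  # toy loss: 0 on x=0, 1 on x=1
p_s = np.full(len(xs), 0.5)                # source density at each sample
p_t = np.where(xs == 0, 0.8, 0.2)          # target density at each sample
print(iw_target_risk(losses, p_s, p_t))    # ≈ 0.2, the true target risk
```

Plain importance weighting is unbiased but can have huge variance when the densities differ sharply, which is precisely why a variance-reduced estimator matters for reliable hyperparameter comparison.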
How can we predict missing values in multi-dimensional data (or tensors) more accurately? Tensor completion is crucial in many applications such as personalized recommendation, image and video restoration, and link prediction in social networks. Many tensor factorization and neural-network-based tensor completion algorithms have been developed to predict missing entries in partially observed tensors. However, they can produce inaccurate estimations, as real-world tensors are very sparse and these methods tend to overfit the small amount of data. Here, we overcome these shortcomings by presenting a data augmentation technique for tensors. In this paper, we propose DAIN, a general data augmentation framework that enhances the prediction accuracy of neural tensor completion methods. Specifically, DAIN first trains a neural model and computes tensor cell importances with influence functions. It then aggregates the cell importances to calculate the importance of each entity (i.e., an index of a dimension). Finally, DAIN augments the tensor by weighted sampling of entity importances and a value predictor. Extensive experimental results show that DAIN outperforms all data augmentation baselines in terms of enhancing the imputation accuracy of neural tensor completion on four diverse real-world tensors. Ablation studies substantiate the effectiveness of each component of DAIN. Furthermore, we show that DAIN scales near-linearly to large datasets.
Neural text matching models have been widely used in community question answering, information retrieval, and dialogue. However, these models, designed for short texts, cannot adequately address the long-form text matching problem, because many contexts in long-form texts cannot be directly aligned with each other, and it is difficult for existing models to capture the key matching signals from such noisy data. Besides, these models are computationally expensive because they simply use all textual data indiscriminately. To tackle both the effectiveness and the efficiency problems, we propose a novel hierarchical noise filtering model, Match-Ignition. The main idea is to plug the well-known PageRank algorithm into the Transformer to identify and filter both sentence-level and word-level noisy information in the matching process. Noisy sentences are usually easy to detect, since previous work has shown that their similarity can be explicitly evaluated by word overlap, so we directly use PageRank to filter such information based on a sentence similarity graph. Unlike sentences, words rely on their contexts to express concrete meanings, so we propose to jointly learn the filtering and matching processes to capture the critical word-level matching signals. Specifically, a word graph is first built based on the attention scores in each self-attention block of the Transformer, and key words are then selected by applying PageRank to this graph. In this way, noisy words are filtered out layer by layer during matching. Experimental results show that Match-Ignition outperforms both SOTA short text matching models and recent long-form text matching models. We also conduct detailed analysis showing that Match-Ignition efficiently captures important sentences and words, facilitating the long-form text matching process.
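The sentence-level filtering idea, PageRank over a word-overlap similarity graph, can be sketched in a few lines. This is an illustrative sketch, not Match-Ignition's implementation; the Jaccard overlap measure and the damping value are assumptions.

```python
import numpy as np

def pagerank(adj_matrix, damping=0.85, iters=100):
    """Power-iteration PageRank on a weighted (similarity) adjacency matrix."""
    n = adj_matrix.shape[0]
    col_sums = adj_matrix.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                # guard against empty columns
    p = adj_matrix / col_sums                    # column-stochastic transition
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * p @ r
    return r

def sentence_scores(sentences):
    """Score sentences by PageRank centrality on a word-overlap (Jaccard)
    similarity graph; low-scoring sentences are candidates for filtering."""
    words = [set(s.lower().split()) for s in sentences]
    n = len(words)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and (words[i] | words[j]):
                sim[i, j] = len(words[i] & words[j]) / len(words[i] | words[j])
    return pagerank(sim)

docs = ["the cat sat on the mat",
        "a cat on a mat",
        "quantum chromodynamics lagrangian"]     # off-topic, low-centrality
scores = sentence_scores(docs)
print(scores.argmin())  # 2: the off-topic sentence gets the lowest score
```

The word-level counterpart in the paper replaces the overlap graph with attention-score graphs inside each Transformer block, so filtering is learned jointly with matching rather than computed once up front.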
Explainable classification is essential in high-impact settings where practitioners require evidence to support their decisions. However, state-of-the-art deep learning models lack transparency in how they make their predictions. One increasingly popular solution is attribution-based explainability, which finds the impact of input features on the model's predictions. While this is popular for computer vision, little has been done to explain deep time series classifiers. In this work, we study this problem and propose PERT, a novel perturbation-based explainability method designed to explain deep classifiers' decisions on time series. PERT extends beyond recent perturbation methods to generate a saliency map that assigns importance values to the timesteps of the instance-of-interest.
First, PERT uses a novel Prioritized Replacement Selector to learn which alternative time series from a larger dataset are most useful to perform this perturbation. Second, PERT mixes the instance with the replacements using a Guided Perturbation Strategy, which learns to what degree each timestep can be perturbed without altering the classifier's final prediction. These two steps jointly learn to identify the fewest and most impactful timesteps that explain the classifier's prediction. We evaluate PERT using three metrics on nine popular datasets with two black-box models. We find that PERT consistently outperforms all five state-of-the-art methods. Using a case study, we also demonstrate that PERT succeeds in finding the relevant regions of the input time series.
Knowledge graph embedding plays an important role in knowledge representation, reasoning, and data mining applications. However, for multiple cross-domain knowledge graphs, state-of-the-art embedding models cannot make full use of the data from different knowledge domains while preserving the privacy of exchanged data. In addition, a centralized embedding model may not scale to extensive real-world knowledge graphs. Therefore, we propose a novel decentralized scalable learning framework, Federated Knowledge Graphs Embedding (FKGE), where embeddings from different knowledge graphs can be learnt in an asynchronous and peer-to-peer manner while being privacy-preserving. FKGE exploits adversarial generation between pairs of knowledge graphs to translate identical entities and relations of different domains into nearby embedding spaces. In order to protect the privacy of the training data, FKGE further implements a privacy-preserving neural network structure to guarantee no raw data leakage. We conduct extensive experiments to evaluate FKGE on 11 knowledge graphs, demonstrating a significant and consistent improvement in model quality, with up to 17.85% and 7.90% increases in performance on triple classification and link prediction tasks.
With the rise of social media users and the general shift of communication from traditional media to online platforms, the spread of harmful content (e.g., hate speech, misinformation, fake news) has been exacerbated. Harmful content in the form of hate speech causes a person distress or harm, having a negative impact on individual mental health, with even more detrimental effects on the psychology of children and teenagers. In this paper, we propose an end-to-end solution with real-time capabilities to detect harmful content and mitigate its spread over the network. Our main contribution is Sparse Shield, a novel method that out-scales existing state-of-the-art methods for network immunization. We also propose a novel architecture for harmful speech mitigation that maximizes the impact of immunization. Our solution aims to identify a set of users for whom harmful content is moved to the bottom of the user feed, rather than censoring users. By immunizing certain network nodes in this manner, we minimize the negative impact on the network and the interference with and limitation of individual freedoms: the information is not hidden, but rather not as easy to reach without an explicit search. Our analysis is based on graphs built on real-world data collected from Twitter; these graphs reflect real user behavior. We perform extensive scalability experiments to prove the superiority of our method over existing state-of-the-art network immunization techniques. We also perform extensive experiments to showcase that Sparse Shield outperforms existing techniques on the task of harmful speech mitigation on a real-world dataset.
While depression is rated as the leading contributor to global disability, early detection of depression is a non-trivial task. Existing depression detection mechanisms harvesting social media data suffer from two major limitations. First, existing solutions rely heavily on the amount, quality, and variety of content types (textual, visual, etc.) posted by users to make accurate inferences, and therefore suffer from the cold-start problem when coping with users with limited training data (e.g., most existing works exclude users with fewer than 25 tweets). Second, existing approaches ignore the social impact or indications from users' social circles that can be leveraged to enhance the inference results. In this paper, we present MentalSpot, a social-contagion-based depression early-screening framework using meta-learning. Specifically, we first construct a social-contagion-driven data repository, PsycheNet, filling the void of social-circle-based depression datasets. We design a triplet network to extract users' embeddings based on the similarities of the linguistic features extracted from written texts. Afterwards, for each target user, we employ dynamic mean shift pruning to select her top-k homogeneous friends in the metric space, whose texts are then leveraged to train a friend-based depression detection model. Extensive experiments show that MentalSpot outperforms the state of the art in terms of all effectiveness metrics, especially for users with very few tweets. Specifically, by using only five tweets per user, MentalSpot yields an F1 score that state-of-the-art methods would otherwise achieve only with at least twenty tweets. Our approach represents a step forward in addressing the cold-start problem that deep learning techniques struggle with in their applications to psychiatric diagnosis.
The principal beneficiaries of this study are healthcare professionals in medical institutions to determine timely and targeted interventions in a clinical setting. This study also supports non-profit groups in reaching out to people with mental health issues, helping in a global health task that cannot be fully covered by clinicians.
While reinforcement learning has achieved considerable successes in recent years, state-of-the-art models are often still limited by the size of state and action spaces. Model-free reinforcement learning approaches use some form of state representations, and the latest work has explored embedding techniques for actions, both with the aim of achieving better generalization and applicability. However, these approaches consider only states or actions, ignoring the interaction between them when generating embedded representations. In this work, we establish the theoretical foundations for the validity of training a reinforcement learning agent using embedded states and actions. We then propose a new approach for jointly learning embeddings for states and actions that combines aspects of model-free and model-based reinforcement learning and can be applied in both discrete and continuous domains. Specifically, we use a model of the environment to obtain embeddings for states and actions and present a generic architecture that leverages these to learn a policy. In this way, the embedded representations obtained via our approach enable better generalization over both states and actions by capturing similarities in the embedding spaces. Evaluations of our approach on several gaming, robotic control, and recommender system tasks show that it significantly outperforms state-of-the-art models in both discrete and continuous domains with large state/action spaces, thus confirming its efficacy.
Static malware detection is important for protection against malware, as it allows malicious files to be detected prior to execution. It is also especially suitable for machine learning-based approaches. Recently, gradient boosting decision tree (GBDT) models, e.g., LightGBM (a popular implementation of GBDT), have shown outstanding performance for malware detection. However, as malware programs are known to evolve rapidly, malware classification models trained on the (source) training data often fail to generalize to the target domain, i.e., the deployed environment. To handle the underlying data distribution drifts, unsupervised domain adaptation techniques have been proposed for machine learning models, including deep learning models. However, unsupervised domain adaptation for GBDT has remained challenging. In this paper, we adapt the adversarial learning framework for unsupervised domain adaptation to enable GBDT to learn domain-invariant features and alleviate performance degradation in the target domain. In addition, to fully exploit the unlabelled target data, we merge them into the training dataset after pseudo-labelling. We propose a new weighting scheme integrated into GBDT for sampling instances in each boosting round to reduce the negative impact of wrongly labelled target instances. Experiments on two large malware datasets demonstrate the superiority of our proposed method.
In this paper, we explore the problem of developing personalized chatbots. A personalized chatbot is designed as a digital chatting assistant for a user. The key characteristic of a personalized chatbot is that it should have a personality consistent with the corresponding user. It can talk the same way as the user when it is delegated to respond to others' messages. Many methods have been proposed to assign a personality to dialogue chatbots, but most of them utilize explicit user profiles, including several persona descriptions or key-value-based personal information. In a practical scenario, however, users might be reluctant to write detailed persona descriptions, and obtaining a large number of explicit user profiles requires tremendous manual labour. To tackle this problem, we present a retrieval-based personalized chatbot model, namely IMPChat, which learns an implicit user profile from the user's dialogue history. We argue that the implicit user profile is superior to the explicit user profile regarding accessibility and flexibility. IMPChat aims to learn an implicit user profile by modeling the user's personalized language style and personalized preferences separately. To learn a user's personalized language style, we elaborately build language models from shallow to deep using the user's historical responses; to model a user's personalized preferences, we explore the conditional relations underneath each post-response pair of the user. The personalized preferences are dynamic and context-aware: when aggregating the personalized preferences, we assign higher weights to those historical pairs that are topically related to the current query. We match each response candidate against the personalized language style and the personalized preferences, respectively, and fuse the two matching signals to determine the final ranking score. We conduct comprehensive experiments on two large datasets, and the results show that our method outperforms all baseline models.
The conventional solution to learning-to-rank problems ranks individual documents greedily by prediction scores. Recently emerged re-ranking models, which take initial lists as input, aim to capture document interdependencies and directly generate optimal ordered lists. Typically, a re-ranking model is learned from a set of labeled data and can achieve favorable performance on average. However, it can be suboptimal for individual queries because the available training data is usually highly imbalanced. This problem is challenging due to the absence of informative data for some queries and, furthermore, the lack of a good data augmentation policy.
In this paper, we propose a novel method named Learning to Augment (LTA), which mitigates the imbalance issue through learning to augment the initial lists for re-ranking models. Specifically, we first design a data generation model based on Gaussian Mixture Variational Autoencoder (GMVAE) for generating informative data. GMVAE imposes a mixture of Gaussians on the latent space, which allows it to cluster queries in an unsupervised manner and then generate new data with different query types using the learned components. Then, to obtain a good augmentation strategy (instead of heuristics), we design a teacher model that consists of two intelligent agents to determine how to generate new data for a given list and how to rank both the raw data and generated data to produce augmented lists, respectively. The teacher model leverages the feedback from the re-ranking model to optimize its augmentation policy by means of reinforcement learning. Our method offers a general learning paradigm that is applicable to both supervised and reinforced re-ranking models. Experimental results on benchmark learning to rank datasets show that our proposed method can significantly improve the performance of re-ranking models.
Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dχ-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications. More importantly, we further propose privacy-adaptive LM pretraining methods and show that our approach can boost the utility of BERT dramatically while retaining the same level of privacy protection. We also quantify the level of privacy preservation and provide guidance on privacy configuration. Our experiments and findings lay the groundwork for future explorations of privacy-preserving NLU with pretrained LMs.
Faceted search systems enable users to filter results by selecting values along different dimensions or facets. Traditionally, facets have corresponded to properties of information items that are part of the document metadata. Recently, faceted search systems have begun to use machine learning to automatically associate documents with facet-values that are more subjective and abstract. Examples include search systems that support topic-based filtering of research articles, concept-based filtering of medical documents, and tag-based filtering of images. While machine learning can be used to infer facet-values when the collection is too large for manual annotation, machine-learned classifiers make mistakes. In such cases, it is desirable to have a scrutable system that explains why a filtered result is relevant to a facet-value. Such explanations are missing from current systems. In this paper, we investigate how explainability features can help users interpret results filtered using machine-learned facets. We consider two explainability features: (1) showing prediction confidence values and (2) highlighting rationale sentences that played an influential role in predicting a facet-value. We report on a crowdsourced study involving 200 participants. Participants were asked to scrutinize movie plot summaries predicted to satisfy multiple genres and indicate their agreement or disagreement with the system. Participants were exposed to four interface conditions. We found that both explainability features had a positive impact on participants' perceptions and performance. While both features helped, the sentence-highlighting feature played a more instrumental role in enabling participants to reject false positive cases. We discuss implications for designing tools to help users scrutinize automatically assigned facet-values.
We study the problem of formally verifying individual fairness of decision tree ensembles, as well as training tree models that maximize both accuracy and individual fairness. In our approach, fairness verification and fairness-aware training both rely on a notion of stability of a classifier, which is a generalization of the standard notion of robustness to input perturbations used in adversarial machine learning. Our verification and training methods leverage abstract interpretation, a well-established mathematical framework for designing computable, correct, and precise approximations of potentially infinite behaviors. We implemented our fairness-aware learning method by building on a tool for adversarial training of decision trees. We evaluated it in practice on the reference datasets in the literature on fairness in machine learning. The experimental results show that our approach is able to train tree models exhibiting a high degree of individual fairness compared to natural state-of-the-art CART trees and random forests. Moreover, as a by-product, these fairness-aware decision trees turn out to be significantly more compact, which naturally enhances their interpretability.
Frequently Asked Questions (FAQ) are a form of semi-structured data that provides users with commonly requested information and enables several natural language processing tasks. Given the plethora of such question-answer pairs on the Web, there is an opportunity to automatically build large FAQ collections for any domain, such as COVID-19 or plastic surgery. These collections can be used by several information-seeking portals and applications, such as AI chatbots. Automatically identifying and extracting such high-utility question-answer pairs is a challenging endeavor that has received little research attention. For a question-answer pair to be useful to a broad audience, it must (i) provide general information -- not be specific to the Web site or Web page where it is hosted -- and (ii) be self-contained -- not have references to other entities on the page or missing terms (ellipses) that render the question-answer pair ambiguous. Although identifying general, self-contained questions may seem like a straightforward binary classification problem, the limited availability of training data for this task and the countless domains make building machine learning models challenging. Existing efforts in extracting FAQs from the Web typically focus on FAQ retrieval without much regard for the utility of the extracted FAQ. We propose QuAX: a framework for extracting high-utility (i.e., general and self-contained) domain-specific FAQ lists from the Web. QuAX receives a set of keywords from a user and works in a pipelined fashion to find relevant web pages and extract general and self-contained question-answer pairs. We experimentally show how QuAX generates high-utility FAQ collections with little, domain-agnostic training data, and how the individual stages of the pipeline improve on the corresponding state-of-the-art.
In the literature, various link-based similarity measures such as Adamic/Adar (in short, Ada), SimRank, and random walk with restart (RWR) have been proposed. Contrary to SimRank and RWR, Ada is a non-recursive measure that exploits the local graph structure in similarity computation. Motivated by Ada's promising results in various graph-related tasks, along with the fact that SimRank is a recursive generalization of the co-citation measure, in this paper we propose AdaSim, a recursive similarity measure based on the Ada philosophy. AdaSim provides accuracy identical to that of Ada on the first iteration, and it is applicable to both directed and undirected graphs. To accelerate our iterative form, we also propose a matrix form that is dramatically faster while providing the exact AdaSim scores. We conduct extensive experiments with five real-world datasets to evaluate both the effectiveness and efficiency of AdaSim in comparison with existing similarity measures and graph embedding methods in the task of similarity computation of nodes. Our experimental results show that 1) AdaSim significantly improves the effectiveness of Ada and outperforms the other competitors, 2) its efficiency is comparable to that of SimRank* while being better than the others, 3) AdaSim is not sensitive to parameter tuning, and 4) similarity measures are better than embedding methods for computing the similarity of nodes.
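As a reference point, the non-recursive Ada measure that AdaSim generalizes can be computed in a few lines (adjacency given as neighbour sets; the guard against degree-1 neighbours avoids division by log 1 = 0):

```python
import math

def adamic_adar(adj, u, v):
    # Ada(u, v) = sum over common neighbours w of 1 / log(deg(w));
    # a shared neighbour counts for less the more connected it is
    score = 0.0
    for w in adj[u] & adj[v]:
        if len(adj[w]) > 1:            # guard: log(1) = 0
            score += 1.0 / math.log(len(adj[w]))
    return score
```

AdaSim applies this weighting philosophy recursively, analogously to how SimRank recursively generalizes co-citation.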
The application of graph neural networks (GNNs) to the domain of electrical power grids has high potential impact on smart grid monitoring. Even though there is a natural correspondence of power flow to message-passing in GNNs, their performance on power grids is not well understood. We argue that there is a gap: GNN research is driven by benchmarks whose graphs differ from power grids in several important aspects. Additionally, inductive learning of GNNs across multiple power grid topologies has not been explored with real-world data.
We address this gap by means of (i) defining power grid graph datasets in inductive settings, (ii) an exploratory analysis of graph properties, and (iii) an empirical study of the concrete learning task of state estimation on real-world power grids. Our results show that GNNs are more robust to noise, with up to 400% lower error compared to baselines. Furthermore, due to the unique properties of electrical grids, we do not observe the well-known over-smoothing phenomenon of GNNs and find the best-performing models to be exceptionally deep, with up to 13 layers. This is in stark contrast to existing benchmark datasets, where the consensus is that 2--3 layer GNNs perform best. Our results demonstrate that a key challenge in this domain is to effectively handle long-range dependence.
Establishing a robust dialogue policy with low computation cost is challenging, especially for multi-domain task-oriented dialogue management, due to the high complexity of state and action spaces. Previous works, mostly using deterministic policy optimization, attain only moderate performance. Meanwhile, the state-of-the-art result, which uses an end-to-end approach, is computationally demanding, since it relies on a large-scale language model based on the generative pre-trained transformer-2 (GPT-2). In this study, a new learning procedure consisting of three learning stages is presented to improve multi-domain dialogue management with corrective guidance. First, behavior cloning with an auxiliary task is developed to build a robust pre-trained model by mitigating the causal confusion problem in imitation learning. Next, the pre-trained model is rectified by using reinforcement learning via proximal policy optimization. Lastly, a human-in-the-loop learning strategy is employed to enhance the agent's performance by directly providing corrective feedback from a rule-based agent, so that the agent is prevented from being trapped in confounded states. The experiments on end-to-end evaluation show that the proposed learning method achieves a state-of-the-art result by performing nearly identically to the rule-based agent. This method outperforms the second-place entry in the 9th Dialog System Technology Challenge (DSTC9) track 2, which uses GPT-2 as the core model in dialogue management.
What is the value of an individual model in an ensemble of binary classifiers? We answer this question by introducing a class of transferable utility cooperative games called ensemble games. In machine learning ensembles, pre-trained models cooperate to make classification decisions. To quantify the importance of models in these ensemble games, we define Troupe - an efficient algorithm that allocates payoffs based on approximate Shapley values of the classifiers. We argue that the Shapley value of models in these games is an effective decision metric for choosing a high-performing subset of models from the ensemble. Our analytical findings prove that our Shapley value estimation scheme is precise and scalable; its performance increases with the size of the dataset and ensemble. Empirical results on real-world graph classification tasks demonstrate that our algorithm produces high-quality estimates of the Shapley value. We find that Shapley values can be utilized for ensemble pruning and that adversarial models receive a low valuation. Complex classifiers are frequently found to be responsible for both correct and incorrect classification decisions.
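A permutation-sampling approximation of this kind of model valuation can be sketched as follows; the payoff here is majority-vote accuracy and all names are illustrative assumptions, so Troupe's exact estimator may differ:

```python
import random

def shapley_estimates(models, X, y, samples=200, seed=0):
    # Monte Carlo Shapley values: average each model's marginal
    # contribution to the coalition payoff over random permutations
    rng = random.Random(seed)
    preds = [[int(m(x)) for x in X] for m in models]

    def payoff(coalition):
        # payoff of a coalition = accuracy of its majority vote
        if not coalition:
            return 0.0
        correct = 0
        for i, label in enumerate(y):
            votes = sum(preds[m][i] for m in coalition)
            pred = int(2 * votes >= len(coalition))
            correct += int(pred == label)
        return correct / len(y)

    phi = [0.0] * len(models)
    order = list(range(len(models)))
    for _ in range(samples):
        rng.shuffle(order)
        coalition, prev = [], 0.0
        for m in order:
            coalition.append(m)
            cur = payoff(coalition)
            phi[m] += cur - prev
            prev = cur
    return [p / samples for p in phi]
```

An adversarial model that contributes no correct votes receives a low estimate, which is what makes such values usable for ensemble pruning.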
Counting the number of occurrences of small connected subgraphs, called temporal motifs, has become a fundamental primitive for the analysis of temporal networks, whose edges are annotated with the time of the event they represent. One of the main complications in studying temporal motifs is the large number of motifs that can be built even with a limited number of vertices or edges. As a consequence, since in many applications motifs are employed for exploratory analyses, the user needs to iteratively select and analyze several motifs that represent different aspects of the network, resulting in an inefficient, time-consuming process. This problem is exacerbated in large networks, where the analysis of even a single motif is computationally demanding. As a solution, in this work we propose and study the problem of simultaneously counting the number of occurrences of multiple temporal motifs, all corresponding to the same (static) topology (e.g., a triangle). Given that for large temporal networks computing the exact counts is unfeasible, we propose odeN, a sampling-based algorithm that provides an accurate approximation of all the counts of the motifs. We provide analytical bounds on the number of samples required by odeN to compute rigorous, probabilistic, relative approximations. Our extensive experimental evaluation shows that odeN enables the approximation of the counts of motifs in temporal networks in a fraction of the time needed by state-of-the-art methods, and that it also reports more accurate approximations than such methods.
The study of biological or physical processes often results in long sequences of temporally-ordered values, aka time series (TS). Changes in the observed processes, e.g. as a cause of natural events or internal state changes, result in changes of the measured values. Time series segmentation (TSS) tries to find such changes in TS to deduce changes in the underlying process. TSS is typically approached as an unsupervised learning problem aiming at the identification of segments distinguishable by some statistical property. We present ClaSP, a novel and highly accurate method for TSS. ClaSP hierarchically splits a TS into two parts, where each split point is determined by training a binary TS classifier for each possible split point and selecting the one with highest accuracy, i.e., the one that is best at identifying subsequences to be from either of the partitions. In our experimental evaluation using a benchmark of 98 datasets, we show that ClaSP outperforms the state-of-the-art in terms of accuracy and is also faster than the second best method. We highlight properties of ClaSP using several real-life time series.
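The core scoring idea, evaluating each candidate split by how well a self-supervised classifier separates the two sides, can be sketched as follows. This is a toy variant: we use window mean/std features and leave-one-out 1-NN accuracy in place of ClaSP's actual classifier, and the function names are our own:

```python
import numpy as np

def window_features(ts, width):
    # crude subsequence representation: sliding-window mean and std
    W = np.lib.stride_tricks.sliding_window_view(ts, width)
    return np.column_stack([W.mean(axis=1), W.std(axis=1)])

def split_score(X, split):
    # self-supervised labelling: windows before/after the candidate split
    # get labels 0/1; score = leave-one-out 1-NN accuracy of that labelling
    y = (np.arange(len(X)) >= split).astype(int)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return (y[D.argmin(axis=1)] == y).mean()

def best_split(ts, width=10, margin=20):
    # evaluate every admissible split and keep the best-separating one
    X = window_features(ts, width)
    return max(range(margin, len(X) - margin), key=lambda s: split_score(X, s))
```

A split is scored highly exactly when subsequences from the two sides are statistically distinguishable, which is the segmentation criterion described above; applying the procedure recursively to each side yields a hierarchical segmentation.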
Fine-grained population distribution data is of great importance for many applications, e.g., urban planning, traffic scheduling, epidemic modeling, and risk control. However, due to the limitations of data collection, including infrastructure density, user privacy, and business security, such fine-grained data is hard to collect, and usually only coarse-grained data is available. Thus, obtaining fine-grained population distribution from coarse-grained distribution becomes an important problem. To tackle this problem, existing methods mainly rely on sufficient fine-grained ground truth for training, which is often unavailable for the majority of cities. That limits the applications of these methods and brings the necessity of transferring knowledge from data-sufficient source cities to data-scarce target cities.
In the knowledge transfer scenario, we employ a single fine-grained reference ground truth in the target city, which is easy to obtain via remote sensing or questionnaires, to inform the large-scale urban structure and support knowledge transfer to the target city. By this approach, we transform the fine-grained population mapping problem into a one-shot transfer learning problem.
In this paper, we propose a novel one-shot transfer learning framework, PSRNet, to transfer spatial-temporal knowledge across cities from three views. From the view of network structure, we build a dense-connection-based population mapping network with temporal feature enhancement to capture the complicated spatial-temporal correlation between population distributions of different granularities. From the view of data, we design a generative model to synthesize fine-grained population samples with the POI distribution and the single fine-grained ground truth in the data-scarce target city. From the view of optimization, after combining the above structure and data, we propose a pixel-level adversarial domain adaptation mechanism for universal feature extraction and knowledge transfer during training with scarce ground truth for supervision.
Experiments on real-life datasets from 4 cities demonstrate that PSRNet has significant advantages over 8 state-of-the-art baselines, reducing RMSE and MAE by more than 25%. Our code and datasets are released on GitHub (https://github.com/erzhuoshao/PSRNet-CIKM).
Being able to reply with a related, fluent, and informative response is an indispensable requirement for building high-quality conversational agents. In order to generate better responses, some approaches have been proposed, such as feeding extra information by collecting large-scale datasets with human annotations, designing neural conversational models (NCMs) with complex architectures and loss functions, or filtering out untrustworthy samples based on a dialogue attribute, e.g., Relatedness or Genericness. In this paper, we follow the third research branch and present a data filtering method for open-domain dialogues, which identifies untrustworthy samples from training data with a quality measure that linearly combines seven dialogue attributes. The attribute weights are obtained via Bayesian Optimization (BayesOpt), which iteratively optimizes an objective function for dialogue generation on the validation set. Then we score training samples with the quality measure, sort them in descending order, and filter out those at the bottom. Furthermore, to accelerate the "filter-train-evaluate'' iterations involved in BayesOpt on large-scale datasets, we propose a training framework that integrates maximum likelihood estimation (MLE) and a negative training method (NEG). The training method updates the parameters of a trained NCM on two small sets with newly maintained and removed samples, respectively. Specifically, MLE is applied to maximize the log-likelihood of newly maintained samples, while NEG is used to minimize the log-likelihood of newly removed ones. Experimental results on two datasets show that our method can effectively identify untrustworthy samples, and that NCMs trained on the filtered datasets achieve better performance.
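The scoring-and-filtering step can be expressed compactly; a sketch with placeholder attribute functions and given weights, whereas in the method described above the weights would come from Bayesian Optimization:

```python
def filter_samples(samples, attribute_fns, weights, drop_ratio=0.2):
    # quality measure = linear combination of dialogue-attribute scores;
    # sort by quality (descending) and drop the bottom fraction
    def quality(sample):
        return sum(w * f(sample) for w, f in zip(weights, attribute_fns))
    ranked = sorted(samples, key=quality, reverse=True)
    n_drop = int(drop_ratio * len(samples))
    return ranked[: len(samples) - n_drop]
```

Each "filter-train-evaluate" iteration would call this with a new weight vector proposed by the optimizer, retrain, and evaluate.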
Recently, the graph neural network (GNN) has shown great power in matrix completion by formulating a rating matrix as a bipartite graph and then predicting the link between the corresponding user and item nodes. The majority of GNN-based matrix completion methods are based on the Graph Autoencoder (GAE), which takes the one-hot index as input, maps a user (or item) index to a learnable embedding, applies a GNN to learn node-specific representations based on these learnable embeddings, and finally aggregates the representations of the target user and its corresponding item nodes to predict missing links. However, without node content (i.e., side information) for training, the user- (or item-) specific representation cannot be learned in the inductive setting; that is, a model trained on one group of users (or items) cannot adapt to new users (or items). To this end, we propose an inductive matrix completion method using GAE (IMC-GAE), which utilizes the GAE to learn both the user-specific (or item-specific) representations for personalized recommendation and local graph patterns for inductive matrix completion. Specifically, we design two informative node features and employ a layer-wise node dropout scheme in GAE to learn local graph patterns that can be generalized to unseen data. The main contribution of our paper is the capability to efficiently learn local graph patterns in GAE, with good scalability and superior expressiveness compared to previous GNN-based matrix completion methods. Furthermore, extensive experiments demonstrate that our model achieves state-of-the-art performance on several matrix completion benchmarks.
Graph convolutional networks (GCNs) have recently enabled a popular class of algorithms for collaborative filtering (CF). Nevertheless, the theoretical underpinnings of their empirical successes remain elusive. In this paper, we endeavor to obtain a better understanding of GCN-based CF methods through the lens of graph signal processing. By identifying the critical role of smoothness, a key concept in graph signal processing, we develop a unified graph convolution-based framework for CF. We prove that many existing CF methods are special cases of this framework, including neighborhood-based methods, low-rank matrix factorization, linear auto-encoders, and LightGCN, corresponding to different low-pass filters. Based on our framework, we then present a simple and computationally efficient CF baseline, which we refer to as Graph Filter based Collaborative Filtering (GF-CF). Given an implicit feedback matrix, GF-CF can be obtained in closed form instead of through expensive training with back-propagation. Experiments show that GF-CF achieves competitive or better performance against deep learning-based methods on three well-known datasets, notably with a 70% performance gain over LightGCN on the Amazon-book dataset.
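A minimal sketch of the closed-form idea, assuming a toy binary feedback matrix. This shows only a linear-filter flavor of the approach (a normalized item-item co-occurrence acting as a low-pass graph filter); GF-CF as described also combines such a term with an ideal low-pass filter built from top singular vectors.

```python
import numpy as np

def linear_filter_scores(R):
    """R: binary user-item matrix. Returns smoothed preference scores
    computed in closed form, with no trained parameters."""
    item_deg = np.maximum(R.sum(axis=0), 1.0)   # item degrees
    d_inv_sqrt = 1.0 / np.sqrt(item_deg)
    # Normalized item-item co-occurrence acts as a low-pass graph filter.
    P = (R * d_inv_sqrt).T @ (R * d_inv_sqrt)
    return R @ P

R = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
scores = linear_filter_scores(R)
# Rank unseen items for user 2 (mask out items already interacted with).
unseen = np.where(R[2] == 0, scores[2], -np.inf)
top_item = int(np.argmax(unseen))
```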
Knowledge Graph (KG) representation learning aims to encode both entities and relations into a continuous low-dimensional vector space. Most existing methods concentrate only on learning representations from structural triples in Euclidean space, which cannot fully exploit the rich semantic information with hierarchical structure in KGs. In this paper, we propose a novel DataType-aware hyperbolic knowledge representation learning model called DT-GCN, which has the advantage of fully embedding attribute values of different data types. We refine data types into five primitive modalities: integer, double, Boolean, temporal, and textual. For each modality, an encoder is specifically designed to learn its embedding. In addition, we define a unified space based on Euclidean, spherical, and hyperbolic space, a continuous-curvature space that combines the advantages of the three. Extensive experiments on both synthetic and real-world datasets show that our model is consistently better than state-of-the-art models, improving average performance over the best baseline by 2.19% on node classification and 3.46% on link prediction. Ablation experiments demonstrate the advantages of embedding data-type information and leveraging the unified space.
To defend against fake news, researchers have developed various text-based methods. These methods can be grouped into 1) pattern-based methods, which focus on patterns shared among fake news posts rather than the claim itself; and 2) fact-based methods, which retrieve evidence from external sources to verify the claim's veracity without considering patterns. The two groups of methods, which have different preferences for textual clues, actually play complementary roles in detecting fake news. However, few works consider their integration. In this paper, we study the problem of integrating pattern- and fact-based models into one framework by modeling their preference differences, i.e., making the pattern- and fact-based models focus on their respective preferred parts of a post while mitigating interference from non-preferred parts as much as possible. To this end, we build a Preference-aware Fake News Detection Framework (Pref-FEND), which learns the respective preferences of pattern- and fact-based models for joint detection. We first design a heterogeneous dynamic graph convolutional network to generate the respective preference maps, and then use these maps to guide the joint learning of pattern- and fact-based models for final prediction. Experiments on two real-world datasets show that Pref-FEND effectively captures model preferences and improves the performance of models based on patterns, facts, or both.
News recommendation plays an indispensable role in how users acquire daily news. Previous studies make great efforts to model high-order feature interactions between users and items, applying various neural models (e.g., RNNs, GNNs). However, we find that few efforts have been made to obtain better representations for news. Most previous methods simply adopt pre-trained word embeddings to represent news, and they also suffer from the cold-start problem for users.
In this work, we propose a new textual content representation method for recommendation that builds a word graph, named WG4Rec. Three types of word associations are adopted in WG4Rec for content representation and user preference modeling: 1) semantic similarity according to pre-trained word vectors, 2) co-occurrence in documents, and 3) co-clicks by users across documents. As extra information can easily be unified by adding nodes and edges to the word graph, WG4Rec is flexible in using cross-platform and cross-domain context for recommendation to alleviate the cold-start issue. To the best of our knowledge, this is the first attempt to use these relationships for news recommendation to better model textual content and adopt cross-platform information. Experimental results on two large-scale real-world datasets show that WG4Rec significantly outperforms state-of-the-art algorithms, especially for cold users in the online environment. Moreover, WG4Rec achieves better performance when cross-platform information is utilized.
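The three word-association types can be sketched as a small graph-construction routine. The edge labels and the exact co-click semantics below are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict
from itertools import combinations

def build_word_graph(docs, clicks, similar_pairs):
    """docs: list of token lists; clicks: {user: [doc indices]};
    similar_pairs: precomputed semantically similar word pairs."""
    graph = defaultdict(set)
    def add_edge(u, v, etype):
        graph[u].add((v, etype)); graph[v].add((u, etype))
    # 1) semantic similarity from pre-trained vectors (precomputed here)
    for u, v in similar_pairs:
        add_edge(u, v, "semantic")
    # 2) co-occurrence within the same document
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            add_edge(u, v, "co-occur")
    # 3) co-click: words from different documents clicked by the same user
    for user, doc_ids in clicks.items():
        for i, j in combinations(doc_ids, 2):
            for u in set(docs[i]):
                for v in set(docs[j]):
                    if u != v:
                        add_edge(u, v, "co-click")
    return graph

docs = [["market", "stocks"], ["election", "votes"]]
clicks = {"u1": [0, 1]}
graph = build_word_graph(docs, clicks, [("stocks", "shares")])
```

Adding cross-platform or cross-domain context then amounts to inserting more nodes and edges of the appropriate type into the same graph.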
Reinforcement learning-based portfolio management has recently attracted extensive attention. However, deep reinforcement learning methods are unexplainable and considered potentially risky, making them difficult for users to trust and regulate. To address these problems, we propose an eXplainable reinforcement learning framework for Portfolio Management, named XPM, which is efficient, concise, and can provide faithful explanations for network outputs. Specifically, we first design a policy network for portfolio management that uses a temporal convolutional network (TCN) to extract temporal features from the multiple time series in a portfolio. Then, we employ global average pooling (GAP) and a fully connected layer to integrate the global feature maps and handle asset correlations. Finally, we apply softmax to determine the output portfolio weights. To build explainability into our model, we employ an explainable artificial intelligence method, class activation mapping (CAM), to explain the network outputs by computing an activation map for an asset of interest. The map highlights the important assets and time intervals in the input state. In this way, end users can understand which parts of the portfolio's recent price movements lead the network to invest in the target asset. Experimental results show that XPM outperforms current state-of-the-art portfolio management methods on NASDAQ and NYSE markets, and can provide faithful and informative explanations to end users.
Graph contrastive representation learning aims to learn discriminative node representations by contrasting positive and negative samples. It helps models learn more generalized representations that achieve better performance on downstream tasks, and it has aroused increasing research interest in recent years. Simultaneously, signed graphs consisting of both positive and negative links have become ubiquitous with the growing popularity of social media. However, existing works on graph contrastive representation learning are proposed only for unsigned graphs (containing only positive links), and it remains unexplored how they could be applied to signed graphs, given the distinct semantics and complex relations between positive and negative links. Therefore, we propose a novel Signed Graph Contrastive Learning model (SGCL) to bridge this gap, which to the best of our knowledge is the first work to employ graph contrastive representation learning on signed graphs. Concretely, we design two types of graph augmentations specific to signed graphs based on a fundamental theory of signed social networks, balance theory. In addition, inter-view and intra-view contrastive learning are proposed to learn discriminative node representations from the perspectives of graph augmentations and signed structures, respectively. Experimental results demonstrate the superiority of the proposed model over state-of-the-art methods on both real-world social datasets and online game datasets.
Automatic Speech Scoring (ASS) is the computer-assisted evaluation of a candidate's speaking proficiency in a language. ASS systems face many challenges, such as open grammar, variable pronunciations, and unstructured or semi-structured content. Recent deep learning approaches have shown some promise in this domain. However, most of these approaches focus on extracting features from a single audio response, so they suffer from the lack of speaker-specific context required to model such a complex task. We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modelling. Our technique exploits the fact that oral proficiency tests rate multiple responses for each candidate: we extract context vectors from these responses and feed them as additional speaker-specific context to our network when scoring a particular response. We compare our technique with strong baselines and find that such modelling improves the model's average performance by 6.92% (maximum = 12.86%, minimum = 4.51%). We further provide both quantitative and qualitative insights into the importance of this additional context for ASS.
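The conditioning idea can be sketched in a few lines: the candidate's other responses supply a context vector that is combined with the features of the response being scored. The mean-pooled context and the linear scorer below are toy stand-ins, not the paper's hierarchical network.

```python
import numpy as np

def score_response(response_vec, other_response_vecs, w):
    """Score one response conditioned on the speaker's other responses."""
    context = np.mean(other_response_vecs, axis=0)   # speaker-specific context
    features = np.concatenate([response_vec, context])
    return float(features @ w)                       # linear scorer stand-in

# Toy pre-extracted response vectors for one candidate.
responses = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([0.0, 0.0])]
w = np.ones(4)
# Score response 0 conditioned on the speaker's other two responses.
s = score_response(responses[0], responses[1:], w)
```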
This paper describes SciClops, a method to help combat online scientific misinformation. Although automated fact-checking methods have gained significant attention recently, they require pre-existing ground-truth evidence, which, in the scientific context, is sparse and scattered across a constantly evolving scientific literature. Existing methods do not exploit this literature, which can effectively contextualize and combat science-related fallacies. Furthermore, these methods rarely involve human intervention, which is essential in the convoluted and critical domain of scientific misinformation.
SciClops involves three main steps to process scientific claims found in online news articles and social media postings: extraction, clustering, and contextualization. First, scientific claims are extracted using a domain-specific, fine-tuned transformer model. Second, similar claims extracted from heterogeneous sources are clustered together with related scientific literature using a method that exploits both their content and the connections among them. Third, check-worthy claims, broadcast by popular yet unreliable sources, are highlighted together with an enhanced fact-checking context that includes related verified claims, news articles, and scientific papers. Extensive experiments show that SciClops handles these three steps effectively and assists non-expert fact-checkers in the verification of complex scientific claims, outperforming commercial fact-checking systems.
Label representation aims to generate a so-called verbalizer for an input text, with broad applications in text classification, event detection, question answering, and other fields. Previous works on label representation, especially in the few-shot setting, mainly define the verbalizers manually, which is accurate but time-consuming. Other models fail to correctly produce antonymous verbalizers for two semantically opposite classes. In this paper, we therefore propose a metric sentiment learning framework (MSeLF) to generate the verbalizers automatically, capturing the sentiment differences between the verbalizers accurately. MSeLF consists of two major components: the contrastive mapping learning (CML) module and the equal-gradient verbalizer acquisition (EVA) module. CML learns a transformation matrix that projects the initial word embeddings to antonym-aware embeddings by enlarging the distance between antonyms. In the antonym-aware embedding space, EVA then takes a pair of antonymous words as verbalizers for the two opposite classes and applies a sentiment transition vector to generate verbalizers for the intermediate classes. We use the generated verbalizers for downstream few-shot text classification on two publicly available fine-grained datasets. The results indicate that our proposal outperforms state-of-the-art baselines in terms of accuracy. In addition, we find that CML can be used as a flexible plug-in component in other verbalizer acquisition approaches.
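One plausible reading of the equal-gradient step can be sketched as interpolation along the transition vector between a pair of antonymous verbalizers, picking the nearest vocabulary word at each equally spaced point. The vocabulary and 2-D vectors below are toy values, and the nearest-neighbor rule is an assumption for illustration.

```python
import numpy as np

def equal_gradient_verbalizers(vocab, neg_word, pos_word, n_classes):
    """Generate verbalizers for n_classes ordered sentiment classes by
    stepping along the transition vector between two antonyms."""
    neg, pos = vocab[neg_word], vocab[pos_word]
    step = (pos - neg) / (n_classes - 1)     # sentiment transition vector
    verbalizers = []
    for i in range(n_classes):
        target = neg + i * step
        best = min(vocab, key=lambda w: np.linalg.norm(vocab[w] - target))
        verbalizers.append(best)
    return verbalizers

# Toy antonym-aware embeddings (in the paper these come from CML).
vocab = {
    "terrible": np.array([0.0, 0.0]),
    "bad":      np.array([1.1, 0.0]),
    "okay":     np.array([2.0, 0.1]),
    "good":     np.array([3.0, 0.0]),
    "great":    np.array([4.0, 0.0]),
}
labels = equal_gradient_verbalizers(vocab, "terrible", "great", 5)
```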
Session-based recommendation predicts an anonymous user's next action based on the user's historical actions in the current session. However, the cold-start problem of having only a limited number of actions at the beginning of an anonymous session makes it difficult to model the user's behavior, i.e., hard to capture the user's varied and dynamic preferences within the session. This severely affects the accuracy of session-based recommendation. Although some existing meta-learning approaches alleviate the cold-start problem by borrowing preferences from other users, they are still weak at modeling the behavior of the current user. To tackle this challenge, we propose CBML, a novel Cluster-Based Meta-Learning model for session-based recommendation. Specifically, we adopt a soft-clustering method and design a parameter gate to better transfer shared knowledge across similar sessions while preserving the characteristics of each session. In addition, we apply two self-attention blocks to capture the transition patterns of sessions at both the item and feature levels. Finally, comprehensive experiments on two real-world datasets demonstrate the superior performance of CBML over existing approaches.
The semi-supervised multi-label classification problem primarily deals with Euclidean data, such as text with a 1D grid of tokens and images with a 2D grid of pixels. However, non-Euclidean graph-structured data naturally and constantly appears in semi-supervised multi-label learning tasks across domains such as social networks, citation networks, and protein-protein interaction (PPI) networks. Moreover, popular existing node embedding methods, such as Graph Neural Networks (GNNs), focus on graphs where each node carries a single label and tend to neglect label correlations in the multi-label setting, so naive adaptation proves empirically ineffective. Graph representation learning for the semi-supervised multi-label learning task is therefore crucial and challenging. In this work, we incorporate the idea of label embedding into our proposed model to capture both network topology and higher-order multi-label correlations. The label embeddings are generated along with the node embeddings based on the topological structure and serve as the prototype center for each class. Moreover, the similarity between label embeddings and node embeddings can be used as a confidence vector to guide the label smoothing process, formulated as a margin ranking optimization problem to learn second-order relations between labels. Extensive experiments on real-world datasets from various domains demonstrate that our model significantly outperforms state-of-the-art models on node-level tasks.
Human mobility recovery is of great importance for a wide range of location-based services. However, recovering human mobility is not trivial because of three challenges: 1) complex transition patterns among locations; 2) multi-level periodicity and shifting periodicity of human mobility; and 3) sparsity of the collected trajectory data. In this paper, we propose PeriodicMove, a neural attention model based on graph neural networks for human mobility recovery from long, sparse trajectories. In PeriodicMove, we first construct a directed graph for each trajectory and capture complex location transition patterns using a graph neural network. Then, we design two attention mechanisms that capture the multi-level periodicity and the shifting periodicity of human mobility, respectively. Finally, a spatial-aware loss function is proposed to incorporate spatial proximity into model optimization, which alleviates the data sparsity problem. We perform extensive experiments, and the evaluation results demonstrate that PeriodicMove yields significant improvements over the competitors on two representative real-life mobility datasets. In addition, by providing high-quality mobility data, our model can benefit a variety of mobility-oriented downstream applications.
Tables are a widely-used format for data curation. The diversity of domains, layouts, and content of tables makes knowledge extraction challenging. Understanding table layouts is an important step for automatically harvesting knowledge from tabular data. Since table cells are spatially organized into regions, correctly identifying such regions and inferring their functional roles, referred to as functional block detection, is a critical part of understanding table layouts. Earlier functional block detection approaches fail to leverage spatial relationships and higher-level structure, either depending on cell-level predictions or relying on data types as signals for identifying blocks. In this paper, we introduce a flexible functional block detection method by applying agglomerative clustering techniques which merge smaller blocks into larger blocks using two merging strategies. Our proposed method uses cell embeddings with a customized dissimilarity function which utilizes local and margin distances, as well as block coherence metrics to capture cell, block, and table scoped features. Given the diversity of tables in real-world corpora, we also introduce a sampling-based approach for automatically tuning distance thresholds for each table. Experimental results show that our method improves over the earlier state-of-the-art method in terms of several evaluation metrics.
Cohesive substructure identification is a fundamental task in graph analytics. Recently, the problem of dense subgraph maximization has attracted significant attention; it aims at enlarging a dense subgraph pattern using a few new edge insertions, e.g., k-core maximization. As a more cohesive subgraph than the k-core, the k-truss requires that each edge participate in at least k-2 triangles within the subgraph. However, the problem of k-truss maximization has not yet been studied. In this paper, we motivate and formulate the new problem of budget-constrained k-truss maximization: given a budget of b edges and an integer k≥2, find and insert b new edges into a graph G such that the resulting k-truss of G is maximized. We theoretically prove the NP-hardness of the k-truss maximization problem. To tackle it efficiently, we analyze the non-submodularity of the k-truss newcomers function and develop non-conventional heuristic strategies for edge insertion. We first identify high-quality candidate edges with regard to (k-1)-light subgraphs and propose a greedy algorithm using per-edge insertion. Beyond further improving efficiency by pruning disqualified candidate edges, we develop a component-based dynamic programming algorithm that balances the budget assignment and inserts multiple edges simultaneously into all (k-1)-light components to enlarge the k-truss maximally. Extensive experiments on nine real-world graphs demonstrate the efficiency and effectiveness of our proposed methods.
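The k-truss condition above (every edge in at least k-2 triangles) can be checked with a simple peeling procedure. The sketch below computes the k-truss of a small graph; it illustrates the definition only and is unrelated to the maximization algorithms the paper proposes.

```python
from itertools import combinations

def k_truss(edges, k):
    """Repeatedly remove every edge supported by fewer than k-2 triangles
    until none remain; the surviving edges form the k-truss."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u, v in [(u, v) for u in adj for v in adj[u] if u < v]:
            support = len(adj[u] & adj[v])     # triangles through (u, v)
            if support < k - 2:
                adj[u].discard(v); adj[v].discard(u)
                changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}

# A 4-clique on {1,2,3,4} plus a pendant edge (4,5): each clique edge lies
# in 2 triangles, so the 4-truss keeps the clique and drops the pendant edge.
edges = list(combinations([1, 2, 3, 4], 2)) + [(4, 5)]
truss = k_truss(edges, 4)
```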
The formulation of a claim rests at the core of argument mining. Demarcating between a claim and a non-claim is arduous for both humans and machines, owing to the latent linguistic variance between the two and the inadequacy of extensive definition-based formalization. Furthermore, the growth of online social media has produced an explosion of unsolicited information on the web, presented as informal text. To account for the above, in this paper we propose DESYR, a framework that addresses these issues for informal web-based text by combining hierarchical representation learning (a dependency-inspired Poincaré embedding), definition-based alignment, and feature projection. We do away with fine-tuning compute-heavy language models in favor of a lighter, more domain-centric approach. Experimental results indicate that DESYR improves on the state-of-the-art systems across four benchmark claim datasets, most of which were constructed from informal texts: an increase of 3 claim-F1 points on the LESA-Twitter dataset; 1 claim-F1 point and 9 macro-F1 points on the Online Comments (OC) dataset; 24 claim-F1 points and 17 macro-F1 points on the Web Discourse (WD) dataset; and 8 claim-F1 points and 5 macro-F1 points on the Micro Texts (MT) dataset. We also perform an extensive analysis of the results, and we release a 100-D pre-trained version of our Poincaré variant along with the source code.
Multivariate time-series data are gaining popularity in various urban applications, such as emergency management and public health. Segmentation algorithms mostly focus on identifying discrete events with changing phases in such data. Consider, for example, a power-outage scenario during a hurricane, where each time-series records the number of power failures in a county over time. Segments in such time-series correspond to different phases: the hurricane starts, counties face severe damage, and the hurricane ends. Disaster management domain experts typically want to identify the most affected counties (the time-series of interest) during these phases, which is useful for retrospective analysis and for decisions about allocating resources to those regions to lessen the damage. However, identifying these actionable counties directly (either by simple visualization or by inspecting the segmentation algorithm) is typically hard. Hence, we introduce and formalize a novel problem, RaTSS (Rationalization for Time-Series Segmentation), which aims to find such time-series (rationalizations) that are actionable for the segmentation. We also propose an algorithm, Find-RaTSS, to find them for any black-box segmentation. We show that Find-RaTSS outperforms non-trivial baselines on synthetic and real data and provides actionable insights in multiple urban domains, especially disasters and public health.
By providing explanations that help users and system designers to better understand models and make decisions, explainable recommendation has become an important research problem. In this paper, we propose Counterfactual Explainable Recommendation (CountER), which draws on the insights of counterfactual reasoning from causal inference for explainable recommendation. CountER is able to formulate the complexity and the strength of explanations, and it adopts a counterfactual learning framework to seek simple (low complexity) and effective (high strength) explanations for model decisions. Technically, for each item recommended to each user, CountER formulates a joint optimization problem to generate minimal changes to the item aspects so as to create a counterfactual item, such that the recommendation decision on the counterfactual item is reversed. These altered aspects constitute the explanation of why the original item was recommended. The counterfactual explanation helps both users, for better understanding, and system designers, for better model debugging.
Another contribution of this work is the evaluation of explainable recommendation, which has long been a challenging task. Fortunately, counterfactual explanations are well suited to standard quantitative evaluation. To measure explanation quality, we design two types of evaluation metrics: one from the user's perspective (i.e., why the user likes the item), and the other from the model's perspective (i.e., why the item is recommended by the model). We apply our counterfactual learning algorithm to a black-box recommender system and evaluate the generated explanations on five real-world datasets. Results show that our model generates more accurate and effective explanations than state-of-the-art explainable recommendation models. Source code is available at https://github.com/chrisjtan/counter.
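The counterfactual intuition behind CountER can be illustrated with a toy search: change item aspects until a scoring model no longer recommends the item, and report the changed aspects as the explanation. The linear scorer, the aspect names, and the greedy one-aspect-at-a-time rule below are stand-ins for illustration; the paper instead solves a joint optimization over aspect changes.

```python
def counterfactual_explanation(user_pref, item_aspects, threshold):
    """Greedily zero out aspects until the score drops below the
    recommendation threshold; the zeroed aspects form the explanation."""
    aspects = dict(item_aspects)
    changed = []
    score = lambda a: sum(user_pref[k] * v for k, v in a.items())
    while score(aspects) >= threshold and len(changed) < len(aspects):
        # Remove the aspect contributing most to the score.
        top = max(aspects, key=lambda k: user_pref[k] * aspects[k])
        aspects[top] = 0.0
        changed.append(top)
    return changed

# Toy user preferences and item aspect qualities.
user_pref = {"battery": 0.9, "screen": 0.3, "price": 0.1}
item_aspects = {"battery": 0.8, "screen": 0.5, "price": 0.9}
explanation = counterfactual_explanation(user_pref, item_aspects, 0.5)
```

Here removing the single aspect "battery" is enough to reverse the recommendation, so it alone explains why the item was recommended.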
Node injection attacks on Graph Neural Networks (GNNs) are an emerging and practical attack scenario in which the attacker injects malicious nodes, rather than modifying original nodes or edges, to degrade the performance of GNNs. However, existing node injection attacks ignore extremely limited scenarios: the number of injected nodes may be so large that they become perceptible to the target GNN. In this paper, we focus on an extremely limited scenario of single-node injection evasion attack, i.e., the attacker is only allowed to inject one single node during the test phase to hurt the GNN's performance. The discreteness of the network structure and the coupling between network structure and node features bring great challenges to this extremely limited scenario. We first propose an optimization-based method to explore the performance upper bound of the single-node injection evasion attack. Experimental results show that 100%, 98.60%, and 94.98% of the nodes on three public datasets are successfully attacked even when injecting only one node with one edge, confirming the feasibility of the attack. However, such an optimization-based method needs to be re-optimized for each attack, which is computationally unbearable. To solve this dilemma, we further propose a Generalizable Node Injection Attack model, G-NIA, which improves attack efficiency while preserving attack performance. Experiments are conducted across three well-known GNNs. Our proposed G-NIA significantly outperforms state-of-the-art baselines and is 500 times faster than the optimization-based method at inference time.
Online social networks have become a crucial medium for disseminating the latest political, commercial, and social information. Users with high visibility are often selected as seeds to spread information and drive its adoption in target groups. We study how gender differences and similarities can impact the information spreading process. Using a large-scale Instagram dataset and a small-scale Facebook dataset, we first conduct a multi-faceted analysis that takes interaction type, directionality, and frequency into account. To this end, we explore a variety of existing and new single-hop and multi-hop centrality measures. Our analysis reveals that males and females interact differently depending on the interaction type, e.g., likes or comments, and that they exhibit different support and promotion patterns. We complement prior work showing that females do not reach top visibility (often referred to as the glass ceiling effect) by jointly factoring in connectivity and interaction intensity, which were previously discussed mainly in isolation.
Inspired by these observations, we propose a novel seeding framework, called Disparity Seeding, which aims to maximize spread while reaching a target user group, e.g., a certain percentage of females, thereby promoting the influence of under-represented groups. Disparity Seeding ranks influential users with two gender-aware measures, the Target HI-index and the Embedding index. Extensive simulations comparing Disparity Seeding with target-agnostic algorithms show that it meets the target percentage while effectively maximizing spread. Disparity Seeding can be generalized to counter other types of inequality, e.g., by race, and to proactively promote minorities in society.
Graph representation learning has become the de facto standard when dealing with graph-structured data. Using powerful tools from deep learning and graph neural networks, recent works have applied graph representation learning to time-evolving dynamic graphs and shown promising results. However, all previous dynamic graph models require labeled samples for training, which may be costly to acquire in practice. Self-supervision offers a principled way to utilize unlabeled data and has achieved great success in the computer vision community. In this paper, we propose debiased dynamic graph contrastive learning (DDGCL), the first self-supervised representation learning framework for dynamic graphs. The proposed model extends the contrastive learning idea to dynamic graphs by contrasting two nearby temporal views of the same node identity with a time-dependent similarity critic. Inspired by recent theoretical developments in contrastive learning, we propose a novel debiased GAN-type contrastive loss as the learning objective, which corrects the sampling bias that occurs when constructing negative samples. We conduct extensive experiments on benchmark datasets, testing the DDGCL framework under two self-supervision schemes: pretraining-and-finetuning and multi-task learning. The results show that with a simple time-aware GNN encoder, downstream task performance is significantly improved under either scheme, closely matching or even outperforming state-of-the-art dynamic graph models with more sophisticated encoder architectures. Further empirical evaluation suggests that the proposed approach offers more performance improvement than self-supervision mechanisms previously established for static graphs.
Learning effective representations for recipes is essential in food studies for recommendation, classification, and other applications. In contrast to work on learning textual or cross-modal embeddings for recipes, the structural relationships among recipes and food items are less explored. In this paper, we formalize the problem of recipe representation learning with networks, incorporating both textual features and structural relational features into recipe representations. Specifically, we first present RecipeNet, a new large-scale corpus of recipe data, to facilitate network-based food studies and recipe representation learning research. We then propose a novel heterogeneous recipe network embedding model, rn2vec, to learn recipe representations. The proposed model captures textual, structural, and nutritional information through several neural network modules, including a textual CNN, an inner-ingredients transformer, and a graph neural network with hierarchical attention. We further design a combined objective function of node classification and link prediction to jointly optimize the model. Extensive experiments show that our model outperforms state-of-the-art baselines on two classic food study tasks. The dataset and code are available at https://github.com/meettyj/rn2vec.
Knowledge graphs are commonly incorporated into recommender systems to improve overall performance. Due to the generality and scale of a knowledge graph, most knowledge relationships are not helpful for a given target user-item prediction. To exploit the knowledge graph for capturing target-specific knowledge relationships in recommender systems, we need to distill it to preserve the useful information and refine the knowledge to capture users' preferences. To address these issues, we propose Knowledge-aware Conditional Attention Networks (KCAN), an end-to-end model that incorporates a knowledge graph into a recommender system. Specifically, we first use knowledge-aware attention propagation to obtain node representations, which captures the global semantic similarity on the user-item network and the knowledge graph. Then, given a target, i.e., a user-item pair, we automatically distill the knowledge graph into a target-specific subgraph based on the knowledge-aware attention. Afterward, by applying conditional attention aggregation on the subgraph, we refine the knowledge graph to obtain target-specific node representations. The model thus gains both representability and personalization, improving overall performance. Experimental results on real-world datasets demonstrate the effectiveness of our framework over state-of-the-art algorithms.
Recent studies suggest that financial networks play an essential role in asset valuation and investment decisions. Unlike road networks, financial networks are neither given nor static, posing significant challenges in learning meaningful networks and promoting their applications in price prediction. In this paper, we first apply the attention mechanism to connect the "dots" (firms) and learn dynamic network structures among stocks over time. Next, an end-to-end graph neural network pipeline diffuses and propagates the firms' accounting fundamentals through the learned networks and ultimately predicts stock future returns. The proposed model reduces the prediction errors by 6% compared to the state-of-the-art models. Our results are robust under different assessment measures. We also show that portfolios based on our model outperform the S&P-500 index by 34% in terms of Sharpe Ratio, suggesting that our model is better at capturing the dynamic inter-connection among firms and identifying stocks with fast recovery from major events. Further investigation of the learned networks reveals that the network structure aligns closely with market conditions. Finally, through an ablation study, we investigate alternative versions of our model and the contribution of each component.
In counterfactual learning to rank (CLTR), user interactions are used as a source of supervision. Since user interactions come with bias, an important focus of research in this field lies in developing methods to correct for the bias of interactions. Inverse propensity scoring (IPS) is a popular method suitable for correcting position bias. Affine correction (AC) is a generalization of IPS that corrects for both position bias and trust bias. IPS and AC provably remove bias, conditioned on an accurate estimation of the bias parameters. Estimating the bias parameters, in turn, requires an accurate estimation of the relevance probabilities. This cyclic dependency introduces practical limitations in terms of sensitivity, convergence, and efficiency.
We propose a new correction method for position and trust bias in CLTR in which, unlike the existing methods, the correction does not rely on relevance estimation. Our proposed method, mixture-based correction (MBC), is based on the assumption that the distribution of the CTRs over the items being ranked is a mixture of two distributions: the distribution of CTRs for relevant items and the distribution of CTRs for non-relevant items. We prove that our method is unbiased. The validity of our proof is not conditioned on accurate bias parameter estimation. Our experiments show that MBC, when used in different bias settings and accompanied by different LTR algorithms, outperforms AC, the state-of-the-art method for correcting position and trust bias, in some settings, while performing on par in other settings. Furthermore, MBC is orders of magnitude more efficient than AC in terms of the training time.
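The mixture assumption at the core of MBC can be illustrated with a small sketch: fit a two-component one-dimensional mixture to observed CTRs via EM and read off per-item responsibilities. The Gaussian choice of component distributions and the function name are ours for illustration; they are not the paper's exact estimator.

```python
import numpy as np

def fit_ctr_mixture(ctrs, iters=200):
    """Fit a two-component 1-D Gaussian mixture to observed CTRs with EM.

    Illustrative sketch of the MBC assumption: CTRs over ranked items mix
    a 'non-relevant' and a 'relevant' distribution. (Gaussian components
    are an assumption of this sketch, not of MBC itself.)"""
    x = np.asarray(ctrs, dtype=float)
    # initialize the two components around the lower/upper quartiles
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])
    sigma = np.array([x.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each CTR
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and spreads
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return pi, mu, sigma, resp
```

Items whose CTR is more likely under the higher-mean component would then be treated as relevant, without any relevance-model estimation step.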
With the number of users in crowdsourcing increasing rapidly, task matching services are attracting more and more attention. However, they also raise many security concerns, one of which is the leakage of sensitive information. Privacy-preserving task matching techniques can protect the private information of task requesters and workers. However, existing privacy-preserving task matching schemes are constructed on a central server, and may therefore suffer from the potential wrongdoing of a malicious server. In addition, most of them only provide exact task matching, which means that they cannot tolerate keyword spelling errors, leading to a decline in task matching accuracy. In this paper, we propose a Reliable and Privacy-preserving Task Matching scheme (RPTM) for crowdsourcing. To guarantee the reliability of task matching results, RPTM employs smart contracts to ensure that its operations are faithfully performed. However, this alone may still disclose the privacy of users due to the transparency of the blockchain. To deal with this problem, RPTM performs the task matching service without compromising the privacy of task requesters and workers by leveraging a novel integer vector encryption scheme. Moreover, RPTM supports multi-keyword fuzzy matching by exploiting locality-sensitive hashing and Bloom filters, which can tolerate keyword spelling errors and different expression formats. Extensive analysis and experiments on an EOS testnet show that RPTM is efficient and secure.
Sequential recommendation, which aims to predict a user's next interaction based on his or her previous behaviors, has attracted great attention. Recent studies mainly employ deep recurrent neural networks or self-attention networks to capture dynamic user preferences. However, existing methods merely focus on modeling users' clear interests in interacted items. We argue that for an interaction, the user may also have ambiguous interests in items that are semantically related to the interacted one. To comprehensively capture user preferences, it is beneficial to discover potential interests from historical interactions at a broader itemset level. Therefore, in this paper, we propose a knowledge graph enhanced sequential recommendation model named KGIE, which focuses on enhancing user interest modeling with knowledge-enriched itemsets by incorporating the knowledge graph. Specifically, in addition to item-level interest modeling with interacted items, we further construct knowledge-enriched itemsets that are extracted via high-order knowledge associations with the interacted items. To capture personalized itemset-level interests, we design an attentive aggregation unit that combines item embeddings considering both inherent and contextual personalization signals. Furthermore, to balance the contributions of the two levels of interest modeling, we adaptively learn high-level preference representations with a gating fusion unit. Extensive experiments on three real-world datasets demonstrate our model's superior performance over state-of-the-art methods as well as its recommendation interpretability.
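A gating fusion unit of the kind described above is commonly parameterized as a learned sigmoid gate that interpolates the two interest levels. The sketch below is a generic formulation, not KGIE's exact one; the names `gated_fusion`, `W`, and `b` are ours.

```python
import numpy as np

def gated_fusion(item_repr, itemset_repr, W, b):
    """Gated interpolation of item-level and itemset-level interests.

    A common pattern (assumed here, may differ from KGIE's exact design):
    W has shape (d, 2d), b has shape (d,); g is a per-dimension gate in (0, 1)."""
    z = W @ np.concatenate([item_repr, itemset_repr]) + b
    g = 1.0 / (1.0 + np.exp(-z))          # sigmoid gate
    return g * item_repr + (1.0 - g) * itemset_repr
```

With zero-initialized parameters the gate is 0.5 everywhere, so the two levels contribute equally; training shifts the balance per dimension.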
Information diffusion prediction aims to forecast how information items spread among a set of users. Recently, neural networks have been widely used to model information diffusion, owing to the great successes of deep learning. However, in real-world information diffusion scenarios, users are likely to behave differently toward information items from different topics. Existing neural methods fail to model the topic-specific diffusion patterns and dependencies, which have been shown to be useful in conventional non-neural methods. In this paper, we propose the Topic-aware Attention Network (TAN) to take advantage of both topic-specific diffusion modeling and deep learning techniques. We jointly model the text content of information items and cascade sequences by incorporating topical context and user/position dependencies into user representations via attention mechanisms. A time-decayed aggregation module is further employed to integrate user representations into cascade representations, which can encode the topic-specific diffusion dependencies independently. Experimental results on diffusion prediction tasks over three realistic cascade datasets show that our model achieves a relative improvement of up to 9% over the best-performing baseline in terms of Hits@10.
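The time-decayed aggregation idea can be sketched generically: weight each participating user's representation by an exponentially decaying function of how long ago they joined the cascade. This is our minimal rendering of the concept, not TAN's exact module; the decay rate `lam` is a hypothetical parameter.

```python
import numpy as np

def time_decayed_aggregate(user_reprs, timestamps, t_now, lam=0.1):
    """Aggregate user representations into one cascade representation,
    down-weighting older participations exponentially.

    Generic sketch of time-decayed aggregation (not TAN's exact design)."""
    w = np.exp(-lam * (t_now - np.asarray(timestamps, dtype=float)))
    w = w / w.sum()                      # normalized decay weights
    return (np.asarray(user_reprs, dtype=float) * w[:, None]).sum(axis=0)
```

With a large decay rate, the cascade representation is dominated by the most recent participants; with `lam=0` it reduces to a plain average.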
Knowledge graphs, which consist of entities and their relations, have become a popular way to store structured knowledge. Knowledge graph embedding (KGE), which derives a representation for each entity and relation, has been widely used to capture the semantics of the information in knowledge graphs, and has demonstrated great success in many downstream applications, such as the extraction of similar entities in response to a query entity. However, existing KGE methods do not work well on large-scale knowledge graphs due to constraints on storage and inference efficiency. In this paper, we propose a lightweight KGE model, LightKG, which significantly reduces the storage as well as the running time needed for inference. Instead of storing a continuous vector for every entity, LightKG only needs to store a few codebooks, each of which contains codewords that serve as representatives among the embeddings, together with the indices recording the codeword selections for entities. Hence LightKG achieves highly efficient storage. The downstream querying process is also significantly accelerated, as the relevance score between a query and an entity can be computed via a quick look-up in a table of scores between the query and the codewords. The storage and inference efficiency of LightKG is achieved by its novel design: LightKG is an end-to-end framework that automatically infers codebooks and codewords and generates an approximated embedding for each entity. A residual module is included in LightKG to induce diversity among codebooks, and a continuous function is adopted to approximate codeword selection, which is non-differentiable. In addition, to further improve the performance of KGE, we propose a novel dynamic negative sampling method based on quantization, which can be applied to the proposed LightKG or other KGE methods.
We conduct extensive experiments on five public datasets. The experiments show that LightKG is search and memory efficient with high approximate search accuracy. Also, the dynamic negative sampling can dramatically improve model performance with over 19% improvement on average.
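A toy illustration of the codebook idea: quantize entity embeddings per sub-space (product-quantization style) and score a query against all entities via small per-codebook lookup tables instead of full vectors. LightKG learns its codebooks end-to-end with a residual module and soft codeword selection, which this sketch omits; all names here are ours.

```python
import numpy as np

def build_codebooks(emb, num_books=4, num_words=16, iters=10, seed=0):
    """Toy product quantization of entity embeddings: k-means per sub-space.

    A sketch of codebooks/codewords/indices (not LightKG's learned variant)."""
    rng = np.random.default_rng(seed)
    n, d = emb.shape
    sub = d // num_books
    books, codes = [], np.zeros((n, num_books), dtype=int)
    for b in range(num_books):
        x = emb[:, b * sub:(b + 1) * sub]
        centers = x[rng.choice(n, num_words, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
            for k in range(num_words):
                if (assign == k).any():
                    centers[k] = x[assign == k].mean(axis=0)
        books.append(centers)
        codes[:, b] = assign
    return books, codes

def score(query, books, codes):
    """Dot-product scores for all entities via per-book lookup tables."""
    num_books = len(books)
    sub = len(query) // num_books
    # one small table per codebook: score of the query against each codeword
    tables = [books[b] @ query[b * sub:(b + 1) * sub] for b in range(num_books)]
    return sum(tables[b][codes[:, b]] for b in range(num_books))
```

The storage drops from one d-dimensional vector per entity to a handful of small codebooks plus integer indices, and scoring becomes table look-ups.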
Facility Relocation (FR), an effort to reallocate the placement of facilities to adapt to changes in urban planning and population distribution, has a remarkable impact on many application areas. Existing solutions to the FR problem either focus on relocating one facility (i.e., 1-FR) or fail to guarantee the result quality when relocating k>1 facilities (i.e., k-FR). As the k-FR problem is NP-hard and its objective is neither submodular nor non-decreasing, the traditional hill-climbing approximation algorithm cannot be directly applied. In light of this, we propose to transform k-FR into another facility placement problem whose objective is submodular and non-decreasing. We theoretically prove that the optimal solutions of the two problems are equivalent. Accordingly, we are able to present the first approximate solution to k-FR, namely FR2FP. Our extensive comparison of FR2FP with the state-of-the-art heuristic solution shows that FR2FP, although it provides an approximation guarantee, does not necessarily give superior results to the heuristic solution. This comparison motivates and, more importantly, directs us to present an advanced approximate solution, namely FR2FP-ex. An extensive experimental study over both real-world and synthetic datasets verifies that FR2FP-ex demonstrates the best result quality. In addition, we also unveil exactly the scenarios in which the state-of-the-art heuristic fails to provide satisfactory results in practice.
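The transformation matters because monotone non-decreasing submodular objectives admit the classic greedy hill-climb with a (1 - 1/e) approximation guarantee. The sketch below shows such a greedy placement step with a stock coverage-style benefit function of our own choosing; it is not the paper's FR2FP objective or algorithm.

```python
def greedy_placement(candidates, clients, dist, k):
    """Greedy hill-climb for a monotone submodular placement objective.

    Each client is served by its nearest chosen facility; we maximize the
    total served benefit. The 1/(1+d) benefit is a toy, assumed choice."""
    def benefit(chosen):
        if not chosen:
            return 0.0
        return sum(max(1.0 / (1.0 + dist(c, f)) for f in chosen)
                   for c in clients)
    chosen = []
    for _ in range(k):
        # pick the facility with the largest marginal gain
        best = max((f for f in candidates if f not in chosen),
                   key=lambda f: benefit(chosen + [f]))
        chosen.append(best)
    return chosen
```

Because the max-of-benefits objective is monotone and submodular, each greedy step's marginal gain can only shrink, which is exactly what the (1 - 1/e) guarantee relies on.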
NN-Descent is a classic k-NN graph construction approach. It is still widely employed in machine learning, computer vision, and information retrieval tasks due to its efficiency and genericness. However, the current design only works well on CPUs. In this paper, NN-Descent is redesigned to fit the GPU architecture. A new graph update strategy called selective update is proposed. It significantly reduces the data exchange between GPU cores and GPU global memory, which is the processing bottleneck under the GPU computation architecture. This redesign fully exploits the parallelism of the GPU hardware while preserving the genericness and simplicity of NN-Descent. Moreover, a procedure that allows k-NN graphs to be merged efficiently on the GPU is proposed. It makes the construction of high-quality k-NN graphs for out-of-GPU-memory datasets tractable. Our approach is 100-250× faster than single-thread NN-Descent and 2.5-5× faster than existing GPU-based approaches, as tested on million- as well as billion-scale datasets.
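For reference, the core CPU-side NN-Descent loop (propose neighbors-of-neighbors as candidates, keep the k closest) can be sketched as follows. This is our minimal rendering of the classic algorithm; the paper's GPU-specific selective update strategy is omitted.

```python
import numpy as np

def nn_descent(data, k=5, iters=5, seed=0):
    """Minimal CPU sketch of NN-Descent for squared-Euclidean distance.

    Starts from random k-NN lists, then repeatedly tries neighbors of
    current neighbors as candidates, keeping the k closest found so far."""
    rng = np.random.default_rng(seed)
    n = len(data)
    dist = lambda i, j: float(((data[i] - data[j]) ** 2).sum())
    # random initial k-NN lists: sorted [(distance, neighbor_id), ...]
    knn = []
    for i in range(n):
        cand = rng.choice([j for j in range(n) if j != i], k, replace=False)
        knn.append(sorted((dist(i, j), j) for j in cand))
    for _ in range(iters):
        updated = False
        for i in range(n):
            # candidates: neighbors of current neighbors, excluding self
            cand = {j2 for _, j in knn[i] for _, j2 in knn[j]} - {i}
            ids = {j for _, j in knn[i]}
            for j2 in cand - ids:
                d = dist(i, j2)
                if d < knn[i][-1][0]:       # closer than the current worst
                    knn[i][-1] = (d, j2)
                    knn[i].sort()
                    updated = True
        if not updated:                      # converged: no list changed
            break
    return knn
```

The appeal of the method is visible even in this sketch: it needs only a distance function, so it stays generic across data types.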
Tree similarity join is useful for analyzing tree-structured data. The traditional threshold-based tree similarity join requires a similarity threshold, which is usually difficult for users to specify. To remedy this issue, we advocate the problem of top-k tree similarity join. Given a collection of trees and a parameter k, the top-k tree similarity join aims to find the k tree pairs with minimum tree edit distance (TED). Although we show that this problem can be solved by utilizing the threshold-based join, the efficiency is unsatisfactory. In this paper, we propose an efficient algorithm, namely TopKTJoin, which generates candidate tree pairs incrementally using an inverted index. We also derive a TED lower bound for the unseen tree pairs. Together with the TED value of the k-th best join result seen so far, this gives us a chance to terminate the algorithm early without missing any correct results. To further improve efficiency, we propose two optimization techniques concerning the index structure and the verification mechanism. We conduct comprehensive performance studies on real and synthetic datasets. The experimental results demonstrate that TopKTJoin significantly outperforms the baseline method.
News recommendation is of vital importance for alleviating information overload. Recent research shows that precise modeling of news content and user interests is critical for news recommendation. Existing methods usually utilize information such as news titles, abstracts, and entities to predict the Click-Through Rate (CTR), or add auxiliary tasks to a multi-task learning framework. However, none of them directly incorporate the predicted news popularity and the degree of users' attention to popular news into the CTR prediction. Meanwhile, multiple interests may arise throughout a user's browsing history, so it is hard to represent user interests via a single user vector. In this paper, we propose PENR, a Popularity-Enhanced News Recommendation method, which integrates a popularity prediction task to improve the performance of the news encoder. The news popularity score is predicted and added to the final CTR, while news popularity is also utilized to model the degree of users' tendency to follow hot news. Moreover, user interests are modeled from different perspectives via a subspace projection method that assembles the browsing history into multiple subspaces. In this way, we capture users' multi-view interest representations. Experiments on a real-world dataset validate the effectiveness of our PENR approach.
Fraud activities in e-commerce, such as spam reviews and fake shopping behaviors, significantly mislead customers' decision making, damage platforms' reputations, and reduce enterprises' revenue. In recent years, GNN-based models have been widely adopted for fraud detection tasks and have shown better performance than conventional rule-based methods and feature-based models. Most GNN-based models focus on homogeneous graphs, usually including user-to-user or item-to-item connections; such graphs eliminate certain types of connections, such as user-item connections. In addition, GNN-based models aggregate neighborhood information under the assumption that neighbors share similar structure and content. However, in fraud detection tasks, two major inconsistency issues arise: severe structure-inconsistency due to extremely unbalanced positive and negative samples, and content-inconsistency due to differences between item categories. To address these issues, we propose a Community-based Framework with ATtention mechanism for large-scale Heterogeneous graphs (C-FATH). To utilize the entire heterogeneous graph, we model directly on the heterogeneous graph and combine it with homogeneous graphs. Structure-inconsistent nodes are filtered out by introducing community information when constructing neighborhoods, and content-inconsistent nodes are selected with lower probability by a similarity-based sampling strategy. Further, the model is trained in a multi-task manner in which each node type (e.g. user, item, device, order, and review) is associated with a specific loss function. Comprehensive experiments are conducted on two public review datasets and two large-scale datasets from JD.com, and the experimental results demonstrate the effectiveness and scalability of the proposed C-FATH compared to state-of-the-art approaches.
Few-Shot Event Classification (FSEC) aims at developing a model for event prediction that can generalize to new event types with a limited amount of annotated data. Existing FSEC studies have achieved high accuracy on different benchmarks. However, we find they suffer from trigger biases that signify the statistical homogeneity between some trigger words and target event types, which we summarize as trigger overlapping and trigger separability. These biases can result in the context-bypassing problem, i.e., correct classifications can be obtained by looking only at the trigger words while ignoring the entire context. Therefore, existing models can be weak in generalizing to unseen data in real scenarios. To further uncover the trigger biases and assess the generalization ability of the models, we propose two new sampling methods, Trigger-Uniform Sampling (TUS) and COnfusion Sampling (COS), for constructing meta tasks during evaluation. In addition, to cope with the context-bypassing problem in FSEC models, we introduce adversarial training and trigger reconstruction techniques. Experiments show these techniques not only improve performance but also enhance the generalization ability of the models.
Knowledge graphs (KGs) are of great importance in various artificial intelligence systems, such as question answering, relation extraction, and recommendation. Nevertheless, most real-world KGs are highly incomplete, with many missing relations between entities. To discover new triples (i.e., head entity, relation, tail entity), many KG completion algorithms have been proposed in recent years. However, a vast majority of existing studies require a large number of training triples for each relation, which contradicts the fact that the frequency distribution of relations in KGs often follows a long tail, meaning a majority of relations have only very few triples. Meanwhile, since most existing large-scale KGs are constructed automatically by extracting information from crowd-sourced data using heuristic algorithms, plenty of errors are inevitably incorporated due to the lack of human verification, which greatly reduces KG completion performance. To tackle these issues, in this paper we study the novel problem of error-aware few-shot KG completion and present a principled KG completion framework, REFORM. Specifically, we formulate the problem under the few-shot learning framework, and our goal is to accumulate meta-knowledge across different meta-tasks and generalize the accumulated knowledge to the meta-test task for error-aware few-shot KG completion. To address the challenges resulting from insufficient training samples and inevitable errors, we propose three essential modules in each meta-task: a neighbor encoder, cross-relation aggregation, and error mitigation. Extensive experiments on three widely used KG datasets demonstrate the superiority of the proposed framework REFORM over competitive baseline methods.
In open-domain dialogue systems, knowledge information such as unstructured persona profiles, text descriptions, and structured knowledge graphs can help incorporate abundant background facts for delivering more engaging and informative responses. Existing studies attempt to model a general posterior distribution over candidate knowledge by considering the entire response utterance as a whole at the beginning of the decoding process for knowledge selection. However, a single smooth distribution can fail to model the variability of knowledge selection patterns over different decoding steps, and can make the knowledge expression less consistent. To remedy this issue, we propose an adaptive posterior knowledge selection framework, which sequentially introduces a series of discriminative distributions to dynamically control when and what knowledge should be used at specific decoding steps. The adaptive distributions can also capture knowledge-relevant semantic dependencies between adjacent words to refine response generation. In particular, for knowledge graph-grounded dialogue generation, we further incorporate the adaptive distributions into the generative word distributions to help express the knowledge entity words. The experimental results show that our methods outperform strong baseline systems by large margins.
Chinese characters are often composed of subcharacter components that are themselves semantically informative, and the component-level internal semantic features of a Chinese character inherently carry additional information that benefits the semantic representation of the character. Therefore, several studies have utilized subcharacter component information (e.g. radicals, fine-grained components, and stroke n-grams) to improve Chinese character representations.
However, we argue that the best way of modeling and encoding a Chinese character has not been fully explored. To improve the representation of a Chinese character, existing methods introduce more component-level internal semantic features, but also more semantically irrelevant subcharacter component information, and these irrelevant subcharacter components act as noise when representing a Chinese character. Moreover, existing methods cannot discriminate the importance of the introduced subcharacter components, and accordingly cannot filter out the introduced noisy subcharacter component information.
In this paper, we first decompose Chinese characters into components according to their formations, then model a Chinese character and its decomposed components as a graph structure named the Chinese character formation graph. The Chinese character formation graph preserves the azimuth relationships among subcharacter components, and is advantageous for explicitly modeling the component-level internal semantic features of a Chinese character. Further, we propose a novel model, the Chinese Character Formation Graph Attention Network (FGAT), which is able to discriminate the importance of the introduced subcharacter components and extract component-level internal semantic features of a Chinese character efficiently. To demonstrate the effectiveness of our research, we have conducted extensive experiments. The experimental results show that our model achieves better results than state-of-the-art (SOTA) approaches.
Cognitive diagnosis is a crucial task in the field of educational measurement and psychology, which aims to periodically mine and analyze a student's level of knowledge during his or her learning process. While a number of approaches and tools have been developed to diagnose the learning states of students, they do not fully learn the relationships between students, exercises, and knowledge concepts in the learning system, or do not exploit the fact that it is easier to complete a diagnosis when focusing on a small subset of knowledge concepts rather than all of them. To address these limitations, we develop CDGK, an artificial neural network based model for cognitive diagnosis. Our method not only captures non-linear interactions between exercise features, student scores, and students' mastery of each knowledge concept, but also aggregates knowledge concepts by converting them into a graph structure and considering only the leaf nodes of the knowledge concept tree, which reduces the dimensionality of the model without accuracy loss. In our evaluation on two real-world datasets, CDGK outperforms the state-of-the-art related approaches in terms of accuracy, reasonableness, and interpretability.
Medication recommendation aiming at accurate prescription is a significant clinical application that assists caregivers in the professional practice of medicine, and obtaining informative patient representations plays an important role in building effective recommendation models. Attentive multi-hop reading over a Memory Neural Network (MemNN) that stores knowledge from previous admissions is widely applied to derive contextual patterns for accurate patient representations. However, regular attentive reading may repeatedly attend to the same slots of the MemNN. Although the coverage mechanism has been proposed to tackle this problem, it is based on the assumption of a one-to-one alignment between source information and target outputs, which medical records do not follow. In pursuit of a valuable model for medication recommendation, we propose Multi-hop Reading with Selective Coverage (MRSC). MRSC first conducts information selection on the MemNN based on the coverage of each slot. It then incorporates coverage into the attention calculation during multi-hop reading over the MemNN, ensuring that all important historical records are fully utilized by balancing attention within the selected information. Experiments on a real-world clinical dataset demonstrate that MRSC successfully derives informative patient representations for recommendation by conducting selection on the MemNN and limiting attention adjustment to the selected information.
Counterfactual explanations are minimum changes of a given input that alter the original prediction of a machine learning model, usually from an undesirable prediction to a desirable one. Previous works frame this problem as constrained cost minimization, where the cost is defined as an L1/L2 distance (or variants) over multiple features to measure the change. In real-life applications, features of different types are hardly comparable, and it is difficult to measure the changes of heterogeneous features with a single cost function. Moreover, existing approaches do not support interactive exploration of counterfactual explanations. To address these issues, we propose skyline counterfactual explanations, which define the skyline of counterfactual explanations as the set of all non-dominated changes. We solve this problem as multi-objective optimization over actionable features. This approach does not require any cost function over heterogeneous features. With the skyline, the user can interactively and incrementally refine their goals regarding the features and magnitudes to be changed, especially when lacking the prior knowledge to express their needs precisely. Intensive experimental results on three real-life datasets demonstrate that the skyline method provides a friendly way to find interesting counterfactual explanations, and achieves superior results compared to state-of-the-art methods.
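The notion of non-dominated changes is the standard Pareto one: a candidate change is kept only if no other candidate is at least as good on every objective and strictly better on at least one. A minimal sketch, assuming minimization on every objective:

```python
def skyline(points):
    """Return the non-dominated points (minimization on every objective).

    A counterfactual change p is dominated if another change q is no worse
    on every objective and strictly better on at least one."""
    def dominates(q, p):
        return (all(a <= b for a, b in zip(q, p))
                and any(a < b for a, b in zip(q, p)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

This quadratic filter is enough for small candidate sets; the user can then interactively narrow the surviving skyline points rather than tune a single cost function.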
Graph Neural Networks (GNNs) have achieved significant success in learning better representations by performing feature propagation and transformation iteratively to leverage neighborhood information. Nevertheless, iterative propagation restricts the information of higher-layer neighborhoods to be transported through and fused with that of lower-layer neighborhoods, which unavoidably results in feature smoothing between neighborhoods in different layers and can thus compromise performance, especially on heterophily networks. Furthermore, most deep GNNs only recognize the importance of higher-layer neighborhoods but have yet to fully explore the importance of multi-hop dependency within the context of different-layer neighborhoods in learning better representations. In this work, we first theoretically analyze the feature smoothing between neighborhoods in different layers and empirically demonstrate the variance of the homophily level across neighborhoods at different layers. Motivated by these analyses, we further propose a tree decomposition method to disentangle neighborhoods in different layers to alleviate feature smoothing among these layers. Moreover, we characterize the multi-hop dependency via graph diffusion within our tree decomposition formulation to construct the Tree Decomposed Graph Neural Network (TDGNN), which can flexibly incorporate information from large receptive fields and aggregate this information utilizing the multi-hop dependency. Comprehensive experiments demonstrate the superior performance of TDGNN on both homophily and heterophily networks under a variety of node classification settings. Extensive parameter analysis highlights the ability of TDGNN to prevent over-smoothing and to incorporate features from shallow layers with deeper multi-hop dependencies, which provides new insights towards deeper graph neural networks.
With the increasing popularity of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in various domains, such as image classification and object detection, and have achieved stunning success with accuracy far beyond that of traditional statistical methods. To exploit the potential of CNN models, a huge amount of research and industry effort has been devoted to optimizing CNNs. Among these endeavors, CNN architecture design has attracted tremendous attention because of its great potential for improving model accuracy or reducing model complexity. However, existing work either introduces repeated training overhead in the search process or lacks an interpretable metric to guide the design.
To clear these hurdles, we propose 3D-Receptive Field (3DRF), an explainable and easy-to-compute metric, to estimate the quality of a CNN architecture and guide the search process of designs. To validate the effectiveness of 3DRF, we build a static optimizer to improve CNN architectures at both the stage level and the kernel level. Our optimizer not only provides a clear and reproducible procedure but also mitigates unnecessary training effort in the architecture search process. Extensive experiments and studies show that the models generated by our optimizer can achieve up to 5.47% accuracy improvement and up to 65.38% parameter reduction, compared with state-of-the-art CNN structures like MobileNet and ResNet.
Forecasting incident occurrences (e.g. crime, EMS calls, traffic accidents) is a crucial task for emergency service providers and transportation agencies in performing response time optimization and dynamic fleet management. However, such events are by nature rare and sparse, which causes a label imbalance problem and inferior performance for models that rely on data sufficiency. Existing studies circumvent, rather than truly solve, this issue by defining the incident prediction problem in a coarse-grained temporal (e.g. daily) setting, which leaves the proposed models insensitive to fine-grained dynamics and of limited use for real-world decision making. In this paper, we tackle the temporally fine-grained incident prediction problem in a sparse setting by explicitly exploiting the behind-the-scenes chain-like triggering mechanism. Moreover, this chain effect is rooted in multiple domains (i.e. spatial, categorical), which further entangle with the temporal dimension and happen to be time-variant. Specifically, we propose a novel deep learning framework, namely Spatio-Temporal-Categorical Graph Neural Networks (STC-GNN), to handle the multidimensional and dynamic chain effect for fine-grained multi-incident co-prediction. Extensive experiments on three real-world city-level incident datasets verify the insightfulness of our perspective and the effectiveness of the proposed model.
The pervasiveness of GPS-enabled devices and wireless communication technologies has fueled the market of Spatial Crowdsourcing (SC), which consists of location-based tasks that require workers to be physically at specific locations to complete them. In this work, we study the problem of Worker-Churn-based Task Assignment in SC, where tasks are assigned by taking workers' churn into consideration. In particular, we aim to achieve the highest total reward of task assignments based on worker churn prediction. To solve the problem, we propose a two-phase framework, which consists of a worker churn prediction phase and a task assignment phase. In the first phase, we use an LSTM-based model to extract latent features of workers from historical data and then estimate the idle time intervals of workers. In the assignment phase, we design an efficient greedy algorithm and a Kuhn-Munkres (KM)-based algorithm that can achieve the optimal task assignment. Extensive experiments offer insight into the effectiveness and efficiency of the proposed solutions.
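The optimum that a Kuhn-Munkres (Hungarian) algorithm computes in polynomial time can be illustrated on tiny instances by exhaustive search over permutations. This is a stand-in sketch, not the paper's algorithm; `reward` is a hypothetical worker-by-task reward matrix.

```python
from itertools import permutations

def best_assignment(reward):
    """Optimal one-to-one worker-task assignment by brute force.

    reward[w][t] is the reward of assigning worker w to task t; the
    Kuhn-Munkres algorithm finds the same optimum in O(n^3) at scale."""
    n = len(reward)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(reward[w][perm[w]] for w in range(n))
        if total > best:
            best, best_perm = total, perm
    return best, best_perm
```

In the two-phase framework, the predicted idle intervals would shape these rewards: a worker predicted to churn soon contributes reward only for tasks it can complete in time.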
Zero-shot learning (ZSL) aims to recognize unseen classes based on the knowledge of seen classes. Previous methods focused on learning direct embeddings from global features to the semantic space in the hope of transferring knowledge from seen classes to unseen classes. However, an unseen class shares local visual features with a set of seen classes, and leveraging global visual features makes the knowledge transfer ineffective. To tackle this problem, we propose a Region Semantically Aligned Network (RSAN), which maps local features of unseen classes to their semantic attributes. Instead of using global features, which are obtained by an average pooling layer after an image encoder, we directly utilize the output of the image encoder, which maintains local information of the image. Concretely, we obtain each attribute from a specific region of the output and exploit these attributes for recognition. As a result, the knowledge of seen classes can be successfully transferred to unseen classes in a region-based manner. In addition, we regularize the image encoder through attribute regression with semantic knowledge to extract robust and attribute-related visual features. Experiments on several standard ZSL datasets reveal the benefit of the proposed RSAN method, which outperforms state-of-the-art methods.
Graph classification is an important problem with applications across many domains, such as chemistry and bioinformatics, for which graph neural networks (GNNs) have been the state-of-the-art (SOTA) methods. GNNs are designed to learn node-level representations based on neighborhood aggregation schemes; to obtain graph-level representations, pooling methods are applied after the aggregation operation in existing GNN models to generate coarse-grained graphs. However, because the applications of graph classification are highly diverse, the performance of existing pooling methods varies across different graphs. In other words, designing a universal pooling architecture that performs well in most cases is challenging, leading to a demand for data-specific pooling methods in real-world applications. To address this problem, we propose to use neural architecture search (NAS) to search for adaptive pooling architectures for graph classification. First, we design a unified framework consisting of four modules: Aggregation, Pooling, Readout, and Merge, which can cover existing human-designed pooling methods for graph classification. Based on this framework, a novel search space is designed by incorporating popular operations from human-designed architectures. Then, to enable efficient search, a coarsening strategy is proposed to continuously relax the search space, so that a differentiable search method can be adopted. Extensive experiments on six real-world datasets from three domains demonstrate the effectiveness and efficiency of the proposed framework.
Automating architecture design for recommendation tasks has become a trending topic because it saves expert effort and promises better performance. Neural Architecture Search (NAS) has been introduced in recent works to discover powerful CTR prediction model architectures. A CTR prediction model usually consists of three components: an embedding layer, an interaction layer, and a deep neural network. However, existing automation works focus on searching a single component while leaving the other components hand-crafted. Such isolated searching causes incompatibility among components and leads to weak generalization ability. Moreover, there is no unified framework for searching integrated CTR prediction model architectures. This paper presents the Automatic Integrated Architecture Searcher (AutoIAS), a framework that provides a practical and general method to find the optimal CTR prediction model architecture automatically. In AutoIAS, we unify existing interaction-based CTR prediction model architectures and propose an integrated search space for a complete CTR prediction model. We utilize a supernet to predict the performance of sub-architectures, and the supernet is trained with Knowledge Distillation (KD) to enhance consistency among sub-architectures. To efficiently explore the search space, we design an architecture generator network that explicitly models the architecture dependencies among components and generates a conditioned architecture distribution for each component. Experiments on public datasets show the outstanding performance and generalization ability of AutoIAS. An ablation study shows the effectiveness of the KD-based supernet training method and the architecture generator network.
Instance type information is particularly relevant for reasoning and for obtaining further information about entities in knowledge graphs (KGs). However, during automated or pay-as-you-go KG construction, instance types might be incomplete or missing for some entities. Previous work focused mostly on representing entities and relations as embeddings based on the statements in the KG. While the computed embeddings encode semantic descriptions and preserve the relationships between entities, the focus of these methods is often not on predicting schema knowledge, but on predicting missing statements between instances to complete the KG. To fill this gap, we propose an approach that first learns a KG representation suitable for predicting instance type assertions. Our solution then implements a neural network architecture to predict instance types based on the learned representation. Results show that our entity representations are much more separable with respect to their associations with classes in the KG than those of existing methods. For this reason, the performance of predicting instance types on a large number of KGs, in particular on cross-domain KGs with a high variety of classes, is significantly better in terms of F1-score than in previous work.
Ontology authoring is a complicated and error-prone process, since the knowledge being modeled is expressed in logic-based formalisms in which the logical consequences of the knowledge have to be foreseen. To make that process easier, competency questions (CQs), i.e. questions expressed in natural language, are often stated to track both the correctness and completeness of the ontology at a given time. However, CQs have to be translated into a formal query language, such as SPARQL-OWL, to query the ontology. Since the translation step is time-consuming and requires familiarity with the query language, in this paper we propose an automatic method named SeeQuery, which recommends SPARQL-OWL queries as translations of CQs stated against a given ontology. It consists of a pipeline of transformations based on template matching and filling, motivated by the largest publicly available CQ-to-SPARQL-OWL datasets to date. We provide a detailed description of SeeQuery and evaluate the method on a separate set of two ontologies with their CQs. It is, to date, the only automatic method available for recommending SPARQL-OWL queries from CQs. The source code of SeeQuery is available at: https://github.com/dwisniewski/SeeQuery.
Conversational recommender systems elicit user preferences via interactive conversations. By introducing conversational key-terms, existing conversational recommenders can effectively reduce the need for the extensive exploration required by a traditional interactive recommender. However, existing approaches that elicit user preferences via key-terms still have limitations. First, the key-term data of the items needs to be carefully labeled, which requires substantial human effort. Second, the number of human-labeled key-terms is limited and their granularity is fixed, while the elicited user preferences usually move from coarse-grained to fine-grained over the course of a conversation. In this paper, we propose a clustering-of-conversational-bandits algorithm. To avoid human labeling effort and automatically learn key-terms at the proper granularity, we cluster the items online and generate meaningful key-terms for them during the conversational interactions. Our algorithm is general and can also be applied to user clustering when feedback from multiple users is available, which further leads to more accurate learning and generation of conversational key-terms. We analyze the regret bound of our learning algorithm. In the empirical evaluations, without using any human-labeled key-terms, our algorithm effectively generates meaningful coarse-to-fine-grained key-terms and performs as well as or better than the state-of-the-art baseline.
Knowledge graph completion (KGC) has become a focus of attention across the deep learning community owing to its contribution to numerous downstream tasks. Although recent years have witnessed a surge of work on KGC, existing methods are still insufficient to accurately capture complex relations, since they adopt single, static representations. In this work, we propose a novel Disentangled Knowledge Graph Attention Network (DisenKGAT) for KGC, which leverages both micro-disentanglement and macro-disentanglement to exploit the representations behind knowledge graphs (KGs). To achieve micro-disentanglement, we put forward a novel relation-aware aggregation to learn diverse component representations. For macro-disentanglement, we leverage mutual information as a regularizer to enhance independence. With the assistance of disentanglement, our model is able to generate adaptive representations for the given scenario. Besides, our model has strong robustness and the flexibility to adapt to various score functions. Extensive experiments on public benchmark datasets validate the superiority of DisenKGAT over existing methods in terms of both accuracy and explainability.
Adaptive traffic signal control plays a significant role in the construction of smart cities. This task is challenging because of many essential factors, such as cooperation among neighboring intersections and dynamic traffic scenarios. First, to facilitate cooperation among traffic signals, existing work adopts graph neural networks to incorporate the temporal and spatial influences of surrounding intersections into the target intersection, where spatial and temporal information are used separately. However, a drawback of these methods is that the spatial-temporal correlations are not adequately exploited to obtain a better control scheme. Second, in a dynamic traffic environment, the historical state of the intersection is also critical for predicting future signal switching. Previous work mainly relies on the current intersection state, neglecting the fact that traffic flow changes continuously in both space and time, and fails to exploit the historical state.
In this paper, we propose a novel neural network framework named DynSTGAT, which integrates the dynamic historical state into a new spatial-temporal graph attention network to address the above two problems. More specifically, our DynSTGAT model employs a novel multi-head graph attention mechanism, which aims to adequately exploit the joint relations of spatial-temporal information. Then, to efficiently utilize the historical state information of the intersection, we design a sequence model with a temporal convolutional network (TCN) to capture the historical information and merge it with the spatial information to improve performance. Extensive experiments conducted in the multi-intersection scenario on synthetic and real-world data confirm that our method achieves superior performance in terms of travel time and throughput compared with state-of-the-art methods.
User behavior sequences contain rich information about user interests and are exploited to predict a user's future clicks in sequential recommendation. Existing approaches, especially recently proposed deep learning models, often embed a sequence of clicked items into a single vector, i.e., a point in vector space, which offers limited expressiveness for complex distributions of user interests with multi-modality and heterogeneous concentration. In this paper, we propose a new representation model, named Seq2Bubbles, for sequential user behaviors, which embeds an input sequence into a set of bubbles, each represented by a center vector and a radius vector in embedding space. The bubble embedding can effectively identify and accommodate multi-modal user interests and diverse concentration levels. Furthermore, we design an efficient scheme to compute the distance between a target item and the bubble embedding of a user sequence for next-item recommendation. We also develop a self-supervised contrastive loss based on our bubble embeddings as an effective regularization approach. Extensive experiments on four benchmark datasets demonstrate that our bubble embedding consistently outperforms state-of-the-art sequential recommendation models.
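To make the bubble idea concrete: one plausible item-to-user distance under such an embedding is how far the target item falls outside its nearest bubble (zero if it lies inside one). This is an illustrative sketch with scalar radii and made-up numbers, not the paper's exact scheme.

```python
import math

# Hypothetical bubble embedding of one user: each bubble is a center plus a radius.
centers = [(0.0, 0.0), (5.0, 5.0)]
radii = [1.0, 2.0]

def item_to_user_distance(item, centers, radii):
    """Distance from an item to the surface of the closest bubble; 0 if the
    item falls inside any bubble (the interest 'covers' the item)."""
    outside = [math.dist(item, c) - r for c, r in zip(centers, radii)]
    return max(0.0, min(outside))

print(item_to_user_distance((0.5, 0.0), centers, radii))  # 0.0: inside the first bubble
print(item_to_user_distance((3.0, 0.0), centers, radii))  # 2.0: 3 from first center, minus radius 1
```

A per-dimension radius vector, as in the abstract, would simply replace the scalar radius with an axis-aligned ellipsoid test.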
Graph neural networks (GNNs) have recently achieved huge success in collaborative filtering (CF) thanks to useful graph structure information. However, users continuously interact with items, which causes user-item interaction graphs to change over time and well-trained GNN models to quickly become out-of-date. Naive solutions such as periodic retraining lose important temporal information and are computationally expensive. Recent works that leverage recurrent neural networks to keep GNNs up-to-date may suffer from the "catastrophic forgetting'' issue and experience a cold start with new users and items. To this end, we propose the incremental graph convolutional network (IGCN), a pure graph convolutional network (GCN) based method to update GNN models when new user-item interactions become available. IGCN consists of two main components: 1) a historical feature generation layer, which generates the initial user/item embeddings via model-agnostic meta-learning and ensures good initial states and fast model adaptation; 2) a temporal feature learning layer, which first aggregates features from the local neighborhood to update the embedding of each user/item within each subgraph via a graph convolutional network, and then fuses the user/item embeddings from the last and current subgraphs via an incremental temporal convolutional network. Experimental studies on real-world datasets show that IGCN can outperform state-of-the-art CF algorithms in sequential recommendation tasks.
Session-based recommendation targets next-item prediction by exploiting user behaviors within a short time period. Compared with other recommendation paradigms, session-based recommendation suffers more from data sparsity due to the very limited short-term interactions. Self-supervised learning, which can discover ground-truth samples from the raw data, holds vast potential to tackle this problem. However, existing self-supervised recommendation models mainly rely on item/segment dropout to augment data, which is not fit for session-based recommendation because the dropout leads to even sparser data, creating unserviceable self-supervision signals. In this paper, for informative session-based data augmentation, we combine self-supervised learning with co-training and develop a framework to enhance session-based recommendation. Technically, we first exploit the session-based graph to derive two augmented views that exhibit the internal and external connectivities of sessions, and then we build two distinct graph encoders over the two views, which recursively leverage the different connectivity information to generate ground-truth samples that supervise each other through contrastive learning. In contrast to the dropout strategy, the proposed self-supervised graph co-training preserves the complete session information and fulfills genuine data augmentation. Extensive experiments on multiple benchmark datasets show that session-based recommendation can be remarkably enhanced under the regime of self-supervised graph co-training, achieving state-of-the-art performance.
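The contrastive supervision between the two graph encoders is, in spirit, an InfoNCE-style objective: a session encoded by one view should score higher against the same session encoded by the other view than against other sessions. A generic single-anchor sketch follows; the similarity values are made up and the paper's exact loss may differ.

```python
import math

def info_nce(sim_positive, sims_all, tau=0.2):
    """Softmax cross-entropy over similarities: sims_all contains the positive
    similarity plus the negatives; tau is a temperature that sharpens the
    distribution. Loss shrinks as the positive dominates the negatives."""
    denom = sum(math.exp(s / tau) for s in sims_all)
    return -math.log(math.exp(sim_positive / tau) / denom)

# The positive pair (0.9) dominates the negatives (0.1, -0.3), so the loss is small.
loss = info_nce(0.9, [0.9, 0.1, -0.3])
print(loss)
```

In the co-training setting, each encoder would supply the positives (and selected negatives) for the other, which is what distinguishes the scheme from single-view contrastive learning.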
Node mapping between large graphs (also known as network alignment) plays a key preprocessing role in joint-graph data mining applications such as social link prediction and cross-platform recommendation. Most existing approaches attempt to perform alignment at the granularity of entire graphs, but handling whole graphs limits scalability, and noisy nodes/edges in the graphs can hurt effectiveness. Based on the observation that potential node mappings tend to appear near known corresponding nodes, we propose iMAP, a novel sub-graph-expansion-based alignment framework that incrementally constructs meaningful sub-graphs and performs alignment on each sub-graph pair iteratively, which reduces unnecessary computation on the original raw networks and improves effectiveness by excluding possible noise. Specifically, iMAP initially builds a candidate sub-graph around known matched nodes. In each following iteration, iMAP trains an alignment model to infer the node mapping relationship between sub-graphs, from which the sub-graphs are further extended and refined. In addition, we design a Graph Neural Network (GNN) based model named MAP that operates on each sub-graph pair in the iMAP framework. MAP utilizes trainable Multi-Layer Perceptron (MLP) prediction heads for similarity computation and employs a mixed loss function consisting of a ranking loss for contrastive learning and a cross-entropy loss for classification. Extensive experiments conducted on real social networks demonstrate the superior efficiency and effectiveness (above 12% improvement) of our proposed method compared to several state-of-the-art methods.
PathSim is a widely used meta-path-based similarity in heterogeneous information networks. Numerous applications rely on the computation of PathSim, including similarity search and clustering. Computing PathSim scores on large graphs is computationally challenging due to high time and storage complexity. In this paper, we propose to transform the problem of approximating the ground-truth PathSim scores into a learning problem. We design an encoder-decoder-based framework, NeuPath, in which the algorithmic structure of PathSim is taken into account. Specifically, the encoder module identifies the top-T optimized path instances, which can approximate the ground-truth PathSim, and maps each path instance to an embedding vector. The decoder transforms each embedding vector into a scalar, which serves as the similarity score. We perform extensive experiments on two real-world datasets from different domains, ACM and IMDB. Our results demonstrate that NeuPath outperforms state-of-the-art baselines on the PathSim approximation task and the similarity search task.
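For reference, PathSim itself is defined over the commuting matrix M of a symmetric meta-path, where M[i][j] counts path instances between nodes i and j: PathSim(x, y) = 2 M[x][y] / (M[x][x] + M[y][y]). A tiny worked example with a made-up author-paper incidence matrix and the meta-path Author-Paper-Author:

```python
# W[i][j] = 1 iff author i wrote paper j; the commuting matrix of the
# Author-Paper-Author meta-path is M = W W^T, so M[i][j] counts co-authored papers.
W = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def pathsim(M, x, y):
    """PathSim normalizes shared path instances by each node's own visibility."""
    return 2 * M[x][y] / (M[x][x] + M[y][y])

M = matmul(W, [list(r) for r in zip(*W)])  # M = W W^T
print(pathsim(M, 0, 1))  # 0.5: one shared paper, two papers each
```

The cost NeuPath sidesteps is exactly the materialization of M, which grows quadratically in the number of nodes and is dense for long meta-paths.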
The World Wide Web contains rich, up-to-date information for knowledge graph construction. However, most current relation extraction techniques are designed for free text and thus do not handle semi-structured web content well. In this paper, we propose a novel multi-phase machine reading framework, called WebKE. It processes web content at different granularities by first detecting areas of interest at the DOM-tree node level and then extracting relational triples for each area. We also propose HTMLBERT as an encoder for the web content. It is a pre-trained markup language model that fully leverages visual layout information and the DOM-tree structure, without the need for hand-engineered features. Experimental results show that the proposed approach outperforms state-of-the-art methods by a considerable margin. The source code is available at https://github.com/redreamality/webke.
This paper presents a three-tier modality alignment approach to learning a text-image joint embedding, coined JEMA, for cross-modal retrieval of cooking recipes and food images. The first tier improves the recipe text embedding by optimizing LSTM networks with term-extraction and ranking-enhanced sequence patterns, and optimizes the image embedding by combining the ResNeXt-101 image encoder with a category embedding using wideResNet-50 with word2vec. The second tier optimizes the textual-visual joint embedding loss function using a double batch-hard triplet loss with soft-margin optimization. The third tier incorporates two types of cross-modality alignments as auxiliary loss regularizations to further reduce alignment errors in the joint learning of the two modality-specific embedding functions. The category-based cross-modal alignment aligns the image category with the recipe category as a loss regularization on the joint embedding. The cross-modal discriminator-based alignment adds a visual-textual embedding distribution alignment to further regularize the joint embedding loss. Extensive experiments with the one-million-recipe benchmark dataset Recipe1M demonstrate that the proposed JEMA approach outperforms state-of-the-art cross-modal embedding methods for both image-to-recipe and recipe-to-image retrieval.
Incorporating review information into a recommender system has been demonstrated to be an effective way to boost recommendation performance. Previous research mainly focuses on designing advanced architectures to better profile users and items. However, review information in practice can be highly sparse and imbalanced, which poses great challenges for effective user/item representation and satisfactory performance enhancement. To alleviate this problem, we propose in this paper to improve review-based recommendation by counterfactually augmenting the training samples. We focus on a common setting, feature-aware recommendation, and the main building block of our idea lies in the counterfactual question: "what would the user's decision be if her feature-level preference had been different?''. When augmenting the training samples, we actively change the user preference (i.e. perform an intervention) and predict the user feedback on the items with pre-trained recommender models. Instead of changing the user preference randomly, we design a learning-based method to discover the samples that are most effective for model optimization. To improve sample quality, we equip our model with two strategies: constrained feature perturbation and frequency-based sampling. Since the sample generation model may not be perfect, we theoretically analyze the relation between the model prediction error and the number of generated samples. As a byproduct, our framework can explain user pair-wise preferences, which is complementary to traditional point-wise explanations. Extensive experiments demonstrate that our model can significantly improve the performance of state-of-the-art methods.
Recent studies have shown that graph neural networks (GNNs) are vulnerable to unnoticeable adversarial perturbations, which largely confines their deployment in many safety-critical domains. Robust graph structure learning has been proposed to improve GNN performance in the face of adversarial attacks. In particular, low-rank methods have been utilized to purify perturbed graphs. However, these methods are mostly computationally expensive, with O(n^3) time complexity and O(n^2) space complexity. We propose LRGNN, a fast and robust graph structure learning framework that exploits the low-rank property as prior knowledge to speed up optimization. To eliminate adversarial perturbations, LRGNN decomposes the adjacency matrix into a low-rank component and a sparse one, and learns by minimizing the rank of the first part while suppressing the second. A sparse variant is introduced to further reduce the memory footprint. Experimental results under various attack settings show that LRGNN achieves robustness comparable to the state of the art far more efficiently, with a significant advantage on large-scale graphs.
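The low-rank-plus-sparse decomposition can be illustrated with a toy robust-PCA-style alternation (not LRGNN's actual optimizer): repeatedly fit the low-rank part by truncated SVD and push the residual through entrywise soft-thresholding, so that a few large adversarial entries end up in the sparse part. The rank, threshold, and synthetic graph below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((20, 2))
A = U @ U.T            # clean rank-2 "adjacency"
A[3, 7] += 10.0
A[7, 3] += 10.0        # injected adversarial edge perturbation

def lowrank_sparse_split(A, rank=2, lam=1.0, iters=20):
    """Alternate: L <- best rank-r approximation of A - S (truncated SVD);
    S <- entrywise soft-threshold of the residual A - L."""
    S = np.zeros_like(A)
    for _ in range(iters):
        Uo, s, Vt = np.linalg.svd(A - S, full_matrices=False)
        L = (Uo[:, :rank] * s[:rank]) @ Vt[:rank]
        R = A - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, S

L, S = lowrank_sparse_split(A)
# The injected perturbation should be concentrated in the sparse part S,
# leaving L close to the clean low-rank structure.
```

The O(n^3) cost mentioned above comes from the full SVD in each iteration, which is exactly the step a scalable method like LRGNN must avoid or approximate.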
Cross-domain recommendation is a promising way to alleviate data sparsity by transferring knowledge from an auxiliary domain to a target domain. However, most existing works focus on utilizing the users shared across domains, while ignoring the domain-specific users who form the majority in real-world circumstances. In this paper, we propose a novel cross-domain learning approach, Relation Expansion based Cross-Domain Recommendation (ReCDR), to improve recommendation accuracy on domains with small user overlap. ReCDR first models the interactions in each domain as a local graph. It then forms a shared network by expanding relationships using pre-trained node similarities. On the enhanced graph, ReCDR adopts a hierarchical attention mechanism. The output embedding is finally combined with the local features to balance the result for the dual-target task. The proposed model is thoroughly evaluated on three real-world datasets. Experiments demonstrate superior performance compared to state-of-the-art methods.
Heterogeneous graphs (HGs), consisting of multiple types of nodes and links, can characterize a variety of real-world complex systems. Recently, heterogeneous graph neural networks (HGNNs), powerful graph embedding methods that aggregate heterogeneous structure and attribute information, have earned a lot of attention. Despite the ability of HGNNs to capture rich semantics that reveal different aspects of nodes, they still stay at a coarse-grained level that simply exploits structural characteristics. In fact, the rich unstructured text content of nodes also carries latent but finer-grained semantics arising from multi-facet topic-aware factors, which fundamentally manifest why nodes of different types connect and form a specific heterogeneous structure. However, little effort has been devoted to factorizing them.
In this paper, we propose a Topic-aware Heterogeneous Graph Neural Network, named THGNN, to hierarchically mine topic-aware semantics for learning multi-facet node representations for link prediction in HGs. Specifically, our model applies an alternating two-step aggregation mechanism, consisting of intra-metapath decomposition and inter-metapath mergence, which can distinctively aggregate rich heterogeneous information according to the inferred topic-aware factors and preserve hierarchical semantics. Furthermore, a topic prior guidance module is designed to maintain the quality of the multi-facet topic-aware embeddings using global knowledge from the unstructured text content in HGs, which helps to improve both performance and interpretability. Experimental results on three real-world HGs demonstrate that our proposed model effectively outperforms state-of-the-art methods on the link prediction task and show the potential interpretability of the learned multi-facet topic-aware representations.
The largest portion of urban congestion is caused by 'phantom' traffic jams, which cause significant travel delays, fuel waste, and air pollution. They frequently occur in high-density traffic without any obvious accidents or roadworks. The root cause of 'phantom' traffic jams in one-lane traffic is the sudden change in velocity of some vehicles (i.e. harsh driving behavior (HDB)), which may generate a chain reaction with accumulating impact throughout the vehicles along the lane. This paper makes the first attempt to address this notorious problem in a one-lane traffic environment through velocity control of autonomous vehicles. Specifically, we propose a velocity control framework called PATROL (sPAtial-temporal ReinfOrcement Learning). First, we design a spatial-temporal graph inside the reinforcement learning model to process and extract information (e.g. velocity and distance differences) about multiple vehicles ahead across several historical time steps in the interactive environment. Then, we propose an attention mechanism to characterize vehicle interactions and an LSTM structure to capture the vehicles' driving patterns over time. Finally, we modify the reward function used in previous velocity control works to enable the autonomous driving agent to predict the HDB of preceding vehicles and smoothly adjust its velocity, which alleviates the chain reaction caused by HDB. We conduct extensive experiments to demonstrate the effectiveness and superiority of PATROL in alleviating 'phantom' traffic jams in simulation environments. Further, on a real-world velocity control dataset, our method significantly outperforms existing methods in terms of driving safety, comfort, and efficiency.
Graph Convolutional Networks (GCNs) have been widely used in graph learning tasks. However, GCN-based models are inherently coupled training frameworks that repetitively conduct recursive neighborhood aggregation, which leads to high computational and memory overheads when processing large-scale graphs. To tackle these issues, we present Node2Grids, a cost-efficient uncoupled training framework that leverages independently mapped data for obtaining embeddings. Instead of directly processing coupled nodes as GCNs do, Node2Grids maps the coupled graph data into independent grid-like data, which can be fed into uncoupled models such as Convolutional Neural Networks (CNNs). This simple but effective strategy significantly saves memory and computational resources while achieving results comparable with leading GCN-based models. Specifically, to support a general and convenient mapping approach, Node2Grids selects the most influential neighbors, fused with central-node information, to construct the grid-like data. To further improve downstream-task efficiency, a simple CNN-based network is employed to capture the significant information from the mapped grid-like data. Moreover, a grid-level attention mechanism is implemented, which implicitly assigns different weights to the extracted grids. In addition to typical transductive and inductive learning tasks, we also verify our framework on million-scale graphs to demonstrate its superior cost performance against state-of-the-art GCN-based approaches. The code is available on GitHub.
The rapid rise of real-time-bidding-based online advertising has brought significant economic benefits and attracted extensive research attention. From the perspective of an advertiser, it is crucial to perform accurate utility and cost estimation for each individual auction in order to achieve cost-effective advertising. These problems are known as the click-through rate (CTR) prediction task and the market price modeling task, respectively. However, existing approaches treat CTR prediction and market price modeling as two independent tasks optimized without regard to each other, resulting in suboptimal performance. Moreover, they do not make full use of unlabeled data from losing bids during estimation, which makes them suffer from the sample selection bias issue. To address these limitations, we propose the Multi-task Advertising Estimator (MTAE), an end-to-end joint optimization framework that performs CTR prediction and market price modeling simultaneously. Through multi-task learning, both estimation tasks can take advantage of knowledge transfer to achieve improved feature representations and generalization ability. In addition, we leverage the abundant bid price signals in the full-volume bid request data and introduce an auxiliary task of predicting the winning probability into the framework for unbiased learning. Through extensive experiments on two large-scale real-world public datasets, we demonstrate that our proposed approach achieves significant improvements over state-of-the-art models under various performance metrics.
Knowledge graph (KG) embedding aims to encode entities and relations into low-dimensional vector spaces, which, in turn, can support various machine learning models on KG-related tasks with good performance. However, existing KG embedding methods fail to consider the influence of the embedding space, which leaves them unsatisfactory in practical applications. In this study, we try to improve the expressiveness of the embedding space from the perspective of the metric. Specifically, we first point out the implications of the Minkowski metric used in KG embedding and then give a quantitative analysis. To address its limitations, we introduce a new metric, named the Cycle metric, based on the oscillation property of a periodic function. Furthermore, we find that the function's period has a significant influence on the expressiveness of the embedding space: given a fully trained model, the smaller the period, the better the expressive ability. Finally, to validate these findings, we propose a new model, named CyclE, by combining the Cycle metric with popular KG embedding models. Comprehensive experimental results show that the Cycle metric is more appropriate than the Minkowski metric for KG embedding.
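The abstract does not give the Cycle metric's closed form; the sketch below uses an illustrative periodic distance, |sin(π(x−y)/T)|, purely to contrast an oscillating metric with a Minkowski one. It vanishes whenever coordinates differ by a multiple of the period T, so one embedding point can score many targets, and a smaller T makes the function oscillate faster.

```python
import numpy as np

def cycle_distance(x, y, period=1.0):
    # Illustrative oscillating "distance": zero whenever each coordinate
    # difference is a multiple of the period. NOT the paper's exact metric.
    return np.abs(np.sin(np.pi * (x - y) / period)).sum()

def minkowski_distance(x, y, p=2):
    # Standard Minkowski (L_p) metric for comparison.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
```

Under the Minkowski metric, x and y must coincide for zero distance; under the periodic form above, x = y + T·k suffices, which is the oscillation property the abstract refers to.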
Knowledge graph (KG) representation learning, which aims to encode entities and relations into low-dimensional spaces, has been widely used in KG completion and link prediction. Although existing KG representation learning models have shown promising performance, the theoretical mechanism behind them is much less well understood. It is challenging to accurately portray the internal connections between models and to systematically build a competitive model. To overcome this problem, we propose a unified KG representation learning framework, called GrpKG, that models KG representation learning from a generic groupoid perspective. We discover that many existing models are essentially the same in the sense of groupoid isomorphism, and we further provide transformation methods between different models. Moreover, we explore applications of GrpKG in model classification as well as other tasks. Experiments on several benchmark datasets validate the effectiveness and superiority of our framework by comparing two proposed models (GrpQ8 and GrpM2) with state-of-the-art models.
Learning in a sparse-reward setting is a well-known challenge in Reinforcement Learning (RL). In the single-agent domain, this challenge can be addressed by introducing exploration bonuses driven by intrinsic motivation to encourage agents to visit unseen states. However, naively applying these methods in cooperative Multi-Agent Reinforcement Learning (MARL) settings with sparse rewards leads to problems such as misinterpreting environmental knowledge and a lack of collaboration among agents. Motivated by this, we propose the Curiosity and Influence-based Explore (CIExplore) method, which includes a new form of intrinsic reward and an internal counterfactual advantage function. Concretely, the intrinsic reward combines a joint curiosity reward and an influence reward. The former is the variance of outputs across an ensemble of prediction models that take the joint observations and actions of all agents as inputs to predict the joint observations at the next time step; the latter quantifies the influence of one agent's behavior on the other agents' state-value functions. Given that the joint curiosity reward is shared by all agents, we compute an internal counterfactual advantage function to address the resulting intrinsic reward assignment problem. We demonstrate the efficacy of CIExplore in multi-agent grid-world environments and show that it is compatible with both on-policy and off-policy MARL algorithms and scales to complex settings with more agents or greater environment randomness.
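The joint curiosity reward described above (variance of an ensemble's next-observation predictions) is straightforward to sketch. The ensemble members here are placeholder callables; any trained prediction networks could stand in for them.

```python
import numpy as np

def curiosity_reward(ensemble, joint_obs_action):
    """Joint curiosity bonus: variance across an ensemble of models that
    predict the next joint observation from the joint observation-action."""
    preds = np.stack([model(joint_obs_action) for model in ensemble])
    # High disagreement between models => unfamiliar state => large bonus.
    return preds.var(axis=0).mean()
```

In familiar states all ensemble members agree and the bonus vanishes; in novel states their predictions diverge, rewarding exploration.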
Entity alignment aims to match synonymous entities across different knowledge graphs, a fundamental task for knowledge integration. Recently, researchers have devoted much effort to leveraging the rich information within relations to enhance entity alignment. They explicitly incorporate relations into entity representation and alignment, demonstrating remarkable results. However, influenced by the semantic assumptions of early works, these methods represent a relation by combining all the entities it connects, ignoring the semantic independence between entities and relations. Moreover, since they perform alignment by comparing embedding similarity, they fail to consider graph-level alignment and tend to find locally false correspondences.
In this paper, we propose Entity and Relation Matching Consensus (ERMC), a two-stage matching schema based on graph matching consensus that jointly models and aligns entities and relations while retaining their semantic independence. In the first stage, we design a bidirectional relation-aware graph convolutional network to jointly learn entity and relation embeddings on the triadic graph via a novel message passing mechanism. Then, we jointly align entities and relations by computing a graph-level matching consensus. In the second stage, we introduce a refinement strategy to detect and correct false alignments from the first stage. Experimental results on three real-world multilingual datasets demonstrate that ERMC outperforms several state-of-the-art models on both entity alignment and relation alignment tasks.
Top-N recommendation, which aims to learn users' ranking-based preferences, has long been a fundamental problem in a wide range of applications. Traditional models usually motivate themselves by designing complex or tailored architectures based on different assumptions. However, the training data of a recommender system can be extremely sparse and imbalanced, which poses great challenges for boosting recommendation performance. To alleviate this problem, we propose to reformulate the recommendation task within the causal inference framework, which enables us to counterfactually simulate users' ranking-based preferences to handle the data scarcity problem. The core of our model lies in the counterfactual question: "what would the user's decision be if the recommended items had been different?". To answer this question, we first formulate the recommendation process with a series of structural equation models (SEMs), whose parameters are optimized based on the observed data. Then, we actively construct many recommendation lists (called interventions in causal inference terminology) that are not recorded in the dataset, and simulate user feedback according to the learned SEMs to generate new training samples. Instead of randomly intervening on the recommendation list, we design a learning-based method to discover more informative training samples. Considering that the learned SEMs may not be perfect, we finally provide a theoretical analysis of the relation between the number of generated samples and the model's prediction error, based on which a heuristic method is designed to control the negative effect of the prediction error. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our framework.
This paper investigates online active learning in the setting of class-imbalanced data streams, where labels may be queried with a limited budget. In this setup, conventional learning is biased towards the majority classes, which consequently harms performance. To address this issue, imbalanced learning techniques adopt both asymmetric losses and asymmetric queries to tackle the imbalance. Although this approach is effective, it may not guarantee performance in an adversarial setting, where the actual labels are unknown and may be chosen by an adversary.
To learn a promising hypothesis in class-imbalanced and adversarial environments, we propose an asymmetric min-max optimization framework for online classification. The derived algorithm can simultaneously track the imbalance and bound the choices of an adversary. Despite this promising result, the algorithm assumes that a label is provided for every input, whereas labels are scarce and labeling is expensive in real-world applications. To this end, we design a confidence-based sampling strategy to query the most informative labels within a budget. We theoretically analyze this algorithm in terms of its mistake bound and two asymmetric measures. Empirically, we evaluate the algorithms on multiple real-world imbalanced tasks and achieve promising results across various application domains.
Streaming spatial-textual data that contains geographic and textual information, e.g., geo-tagged tweets, is growing at an unprecedented rate. As one of the basic operations, continuous spatial-textual queries that continuously retrieve real-time results over large-scale spatial-textual streams call for efficient distributed processing. However, existing proposals are either spatial-aware only or exploit textual information only superficially for pruning. We propose a distributed system, called HASTE, for hybrid and adaptive processing of streaming spatial-textual data. The novelty lies in three aspects: (1) we propose a novel method to reduce the workload beforehand by dividing objects and queries into mutually exclusive types; (2) we develop a novel load partitioning strategy and a novel cost model that consider both spatial and textual properties; and (3) we design a multi-level load adjustment strategy that adaptively copes with different degrees of load imbalance. We report on extensive experiments with real-world data that offer insight into the performance of the solution and show that it outperforms state-of-the-art proposals.
Search and recommendation are the two most common approaches people use to obtain information. They share the same goal: satisfying the user's information need at the right time. Many Internet platforms and apps already provide both search and recommendation services, showing the demand and the opportunity to handle both tasks simultaneously. However, most platforms consider the two tasks independently: they train separate search and recommendation models without exploiting the relatedness and dependency between them. In this paper, we argue that jointly modeling the two tasks benefits both and ultimately improves overall user satisfaction. We investigate the interactions between the two tasks in the information content service domain. We propose first integrating the user's behaviors in search and recommendation into a heterogeneous behavior sequence, and then using a joint model to handle both tasks based on the unified sequence. More specifically, we design the Unified Information SEarch and Recommendation model (USER), which mines user interests from the integrated sequence and accomplishes the two tasks in a unified way. Experiments on a dataset from a real-world information content service platform verify that our model outperforms separate search and recommendation baselines.
As a fundamental task of document content analysis, keyphrase extraction (KE) aims at predicting a set of lexical units that conveys the core information of a document. In this paper, we study the problem of KE in talent recruitment. This problem is critical for the development of a variety of intelligent recruitment services, such as person-job fit, market trend analysis, and course recommendation. However, unlike traditional textual data, texts from the recruitment domain, such as resumes and job postings, often have the unique characteristics of abbreviation and succinctness, resulting in many keyphrases consisting of non-consecutive words that are hard for existing KE methods to fully capture. To this end, we propose an interactive neural network approach, INKE, for facilitating KE in talent recruitment. Specifically, we first introduce a novel keyphrase indicator that captures explicit hint information for each keyphrase. Then, we design a dynamically initialized decoder that can generate keyphrases in an interactive manner. Moreover, we propose a hierarchical reinforcement learning algorithm to enhance the interaction between hint information capture and keyphrase generation. Finally, extensive experiments on real-world data clearly validate the effectiveness and interpretability of INKE compared with state-of-the-art baselines.
Entity resolution is the task of identifying records in different datasets that refer to the same real-world entity. In sensitive domains (e.g., financial accounts, hospital health records), entity resolution must meet privacy requirements to avoid revealing sensitive information, such as personally identifiable information, to untrusted parties. Existing solutions are either too algorithm-specific or come with an implicit trade-off between the accuracy of the computation, privacy, and run-time efficiency. We propose AMMPERE, an abstract computation model for performing universal privacy-preserving entity resolution. AMMPERE offers abstractions that encapsulate multiple algorithmic and platform-agnostic approaches using variants of Jaccard similarity to perform private data matching and entity resolution. Specifically, we show that two parties can perform entity resolution over their data without leaking sensitive information. We rigorously compare and analyze the feasibility, performance overhead, and privacy-preserving properties of these approaches on the Sharemind multi-party computation (MPC) platform as well as on PALISADE, a lattice-based homomorphic encryption library. The AMMPERE system demonstrates the efficacy of privacy-preserving entity resolution on real-world data while providing a precise characterization of the cost induced by preventing information leakage.
Recent years have witnessed a revolution in Spatial Crowdsourcing (SC), in which people with mobile connectivity perform spatio-temporal tasks that involve travel to specified locations. In this paper, we identify and study in depth a new multi-center-based task allocation problem in SC, where multiple allocation centers exist. In particular, we aim to maximize the total number of allocated tasks while minimizing the difference in the number of tasks allocated across centers. To solve the problem, we propose a two-phase framework, called Task Allocation with Geographic Partition, consisting of a geographic partition phase and a task allocation phase. The first phase divides the whole study area around the allocation centers, using both a basic Voronoi diagram-based algorithm and an adaptive weighted Voronoi diagram-based algorithm. In the allocation phase, we utilize a reinforcement learning method to achieve the task allocation, where a graph neural network with an attention mechanism learns the embeddings of allocation centers, delivery points, and workers. Extensive experiments give insight into the effectiveness and efficiency of the proposed solutions.
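The geographic partition phase can be sketched as a nearest-center assignment. The weighted variant below (dividing distances by per-center weights so that heavier centers claim larger cells) is an illustrative stand-in for the paper's adaptive weighted Voronoi algorithm, not its exact formulation.

```python
import numpy as np

def voronoi_partition(points, centers, weights=None):
    """Assign each task/worker location to an allocation center.
    weights=None gives the basic Voronoi partition; per-center weights
    grow or shrink cells, which can be tuned to balance load."""
    if weights is None:
        weights = np.ones(len(centers))
    # Distance from every point to every center, scaled by the center weight.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d / weights[None, :], axis=1)
```

An adaptive scheme would iteratively raise the weight of under-loaded centers and re-partition until the allocated task counts even out.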
The broad adoption of electronic health record (EHR) systems and advances in deep learning technology have motivated the development of health risk prediction models, which mainly depend on the expressiveness and temporal modeling capacity of deep neural networks (DNNs) to improve prediction performance. Some further augment the prediction with external knowledge; however, a great deal of EHR information is inevitably lost during knowledge mapping. In addition, the predictions made by existing models usually lack reliable interpretation, which undermines their reliability in guiding clinical decision-making. To address these challenges, we propose MedRetriever, an effective and flexible framework that leverages unstructured medical text collected from authoritative websites to augment health risk prediction and to provide understandable interpretation. Moreover, MedRetriever explicitly takes the target disease documents into consideration, which provides key guidance for the model to learn in a target-driven direction, i.e., from the target disease to the input EHR. Specifically, MedRetriever can flexibly choose its backbone from major predictive models to learn an EHR embedding for each visit. After that, the EHR embedding and the features of the target disease documents are aggregated into a query by self-attention to retrieve highly relevant text segments from the medical text pool, which is stored in a dynamically updated text memory. Finally, the comprehensive EHR embedding and the text memory are used for prediction and interpretation. We evaluate MedRetriever against nine state-of-the-art approaches across three real-world EHR datasets; it consistently achieves the best performance on AUC and recall metrics and outperforms the best baseline by at least 4.8% in recall on all three test datasets. Furthermore, we conduct case studies to show the easy-to-understand interpretations produced by MedRetriever.
Dynamic community detection (or graph clustering) in temporal networks has attracted much attention because it is promising for revealing the underlying mechanisms of complex real-world systems. Current methods are criticized for the independence of graph representation learning and graph clustering, considerable noise during temporal information smoothing, and high time complexity. We propose a Robust Temporal Smoothing Clustering (RTSC) method, which involves joint graph representation learning and graph clustering, to solve these problems. RTSC can be formulated as a constrained multi-objective optimization problem. Specifically, three successive snapshots are first projected into the same subspace via graph embedding. We then use the embedding matrices to learn a common low-rank block-diagonal matrix that contains the current clustering information, together with specific noise matrices under a sparsity constraint to remove noise at each time step. To efficiently solve the challenging optimization problem, we also propose an optimization procedure based on the augmented Lagrangian multiplier (ALM) scheme. Experimental results on six artificial datasets and four real-world dynamic network datasets indicate that RTSC performs better than six state-of-the-art algorithms for dynamic clustering in temporal networks.
Time series prediction has great practical value in a wide range of real-world scenarios such as stock markets and retail. Existing methods typically face a model aging issue caused by concept drift: the model's performance degrades over time. Undoubtedly, model aging can cause serious damage in practical usage; for example, wrong stock price predictions may cause catastrophic losses in the financial domain. Therefore, it is essential to address model aging so as to ensure the predictor's future performance. In this paper, we propose a novel solution to this issue. First, we uncover a theoretical connection between complex concept drift in time series data and the gradients of deep neural networks. Based on this, we propose a novel framework called learning to learn the future. Specifically, we develop a learning method that models the concept drift during the inference stage, which helps the model generalize well into the future. Furthermore, to mitigate the impact of noise and the randomness of time series data, we enhance the framework by leveraging similar series in concept drift modeling. To the best of our knowledge, our approach is the first general solution to the model aging issue in time series prediction. We conduct extensive experiments on three real-world datasets, which validate the effectiveness of our framework; for instance, it achieves a relative improvement of 33% in stock price prediction over state-of-the-art methods.
In this paper, we tackle the cross-lingual language-to-vision (CLLV) retrieval task. Given a text query in one language, CLLV retrieval seeks to retrieve relevant images/videos from a database based on the visual content of the images/videos and their captions in another language. Since CLLV retrieval bridges both the modal gap and the language gap, it makes many international cross-modal applications feasible. To tackle CLLV retrieval, we propose an assorted attention network (A2N) that synchronously overcomes the language gap, bridges the modal gap, and fuses the features of the two modalities in an elegant and effective manner. It represents each text query as a set of word features, and represents each image/video as a set of its caption's word features in another language together with a set of its local visual features. The relevance between the text query and the image/video is then obtained by matching the set of query word features against the two sets of image/video features. To enhance the effectiveness of this matching, A2N merges the query's word features and the image/video's visual and word features into an assorted set and conducts a self-attention operation over its items. On one hand, benefiting from the attention between the query's word features and the image/video's visual features, important word or visual features of the image/video can be emphasized. On the other hand, benefiting from the attention between the image/video's visual features and its caption's word features, the visual content and the textual information can be fused more effectively. Systematic experiments conducted on four datasets demonstrate the effectiveness of the proposed A2N in the CLLV retrieval task.
Over the past decade, we have witnessed rapid progress in compact representation learning for fast image retrieval. In the unsupervised scenario, product quantization (PQ) is one of the most promising methods for generating compact image representations for fast and accurate retrieval. Inspired by the great success of deep neural networks (DNNs) in computer vision, many works have attempted to integrate PQ into DNNs for end-to-end supervised training. Nevertheless, in existing deep PQ methods, data samples from different classes share the same codebook, so they might become entangled with each other in the feature space. Meanwhile, existing deep PQ methods relying on triplet or pairwise losses require a huge number of training triplets or pairs, which are computationally expensive and scale poorly.
In this work, we propose a multiple exemplars learning (MEL) approach to improve retrieval accuracy and training efficiency. For each class, we learn a class-specific codebook consisting of multiple exemplars to partition the class-specific feature space. Since the feature space as well as the codebook is class-specific, samples of different classes are disentangled in the feature space. We incorporate the proposed MEL into a convolutional neural network, supporting end-to-end training. Moreover, we propose the MEL loss, which trains the network considerably more efficiently than existing deep product quantization approaches based on pairwise or triplet losses. Systematic experiments conducted on two public benchmarks demonstrate the effectiveness and efficiency of our method.
Graph Neural Networks (GNNs) have achieved significant success in handling graph-structured data such as knowledge graphs, citation networks, and molecular structures. However, most of them are shallow structures because of the over-smoothing problem: the representations of nodes become indistinguishable when many layers are stacked. Several recent studies have tried to design deep GNNs with powerful expressive ability by enlarging the receptive fields to aggregate information from high-order neighbors, but deep models may give rise to overfitting. In this paper, we propose a novel insight for aggregating more useful information based on multiple views, which does not require deep structures. Specifically, we first design two complementary views that describe the global topology and the feature similarity of nodes. We then devise an attention strategy to fuse node representations, yielding the Multi-View Graph Convolutional Network (MV-GCN). Further, we introduce a self-supervised technique that learns node representations by contrastive learning on the different views, which can learn distinctive node embeddings from large amounts of unlabeled data, yielding the Multi-View Contrastive Graph Convolutional Network (MV-CGC). Finally, we conduct extensive experiments on six public datasets for node classification, which prove the superiority of the two proposed models over state-of-the-art methods.
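The attention-based fusion of the two views can be sketched as a softmax-weighted sum of per-view embeddings. The scoring vector `att_vec` below is an illustrative trainable parameter, not necessarily the paper's exact parameterization.

```python
import numpy as np

def fuse_views(h_topo, h_feat, att_vec):
    """Fuse a node's topology-view and feature-view embeddings with a
    learned attention vector that scores each view."""
    views = np.stack([h_topo, h_feat])   # (2, dim)
    scores = views @ att_vec             # one scalar score per view
    w = np.exp(scores - scores.max())
    w = w / w.sum()                      # softmax over the two views
    return w @ views                     # attention-weighted combination
```

With a zero scoring vector both views receive equal weight; training pushes the weights toward whichever view is more informative for the node.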
Entity alignment (EA) is the task of detecting equivalent entities across different knowledge graphs (KGs). Although this problem has been intensively studied in the last few years, the majority of state-of-the-art methods heavily rely on labeled data, which is difficult to obtain in practice. This calls for the study of EA with scarce supervision. To resolve this issue, we put forward a reinforced active entity alignment framework that selects the entities to be manually labeled, with the aim of enhancing alignment performance with minimal labeling effort. Under this framework, we further devise an unsupervised contrastive loss to contrast different views of entity representations and augment the limited supervision signals by exploiting the vast unlabeled data. We empirically evaluate our proposal on eight popular KG pairs, and the results demonstrate that our proposed model and its components consistently boost alignment performance under scarce supervision.
Recently, the Information Retrieval community has witnessed fast-paced advances in Dense Retrieval (DR), which performs first-stage retrieval with embedding-based search. Despite the impressive ranking performance, previous studies usually adopt brute-force search to acquire candidates, which is prohibitive in practical Web search scenarios due to its tremendous memory usage and time cost. To overcome these problems, vector compression methods have been adopted in many practical embedding-based retrieval applications; one of the most popular is Product Quantization (PQ). However, although existing vector compression methods, including PQ, can improve the efficiency of DR, they incur severely decayed retrieval performance due to the separation of encoding and compression. To tackle this problem, we present JPQ, which stands for Joint optimization of query encoding and Product Quantization. It trains the query encoder and the PQ index jointly in an end-to-end manner based on three optimization strategies, namely a ranking-oriented loss, PQ centroid optimization, and end-to-end negative sampling. We evaluate JPQ on two publicly available retrieval benchmarks. Experimental results show that JPQ significantly outperforms popular vector compression methods. Compared with previous DR models that use brute-force search, JPQ almost matches the best retrieval performance with a 30x smaller index, and the compressed index further brings a 10x speedup on CPU and a 2x speedup on GPU in query latency.
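Plain product quantization, which JPQ builds on, can be sketched in a few lines: split each vector into subvectors and quantize each against its own small codebook. The codebooks below are hand-picked for illustration; in practice they are learned (e.g., by k-means, or end-to-end with the encoder in JPQ).

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product quantization: split x into len(codebooks) subvectors and
    replace each with the id of its nearest centroid in that subspace."""
    subs = np.split(x, len(codebooks))
    return np.array([np.argmin(np.linalg.norm(cb - s, axis=1))
                     for cb, s in zip(codebooks, subs)])

def pq_decode(codes, codebooks):
    # Reconstruct an approximate vector from the stored centroid ids.
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])
```

Storing a handful of small centroid ids instead of the full float vector is what yields the large index compression the abstract reports; JPQ's contribution is training the encoder and these centroids jointly rather than compressing after the fact.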
Fraud detection in e-commerce, which is critical to protecting the capital safety of users and financial corporations, aims at determining whether an online transaction or other activity is fraudulent. This problem has previously been addressed by various fully supervised learning methods. However, the true labels for training a supervised fraud detection model are difficult to collect in many real-world cases. To circumvent this issue, a series of automatic annotation techniques are employed instead, generating multiple noisy annotations for each unknown activity. In order to utilize these low-quality, multi-sourced annotations to achieve reliable detection results, we propose an iterative two-stage fraud detection framework for multi-sourced, extremely noisy annotations. In the label aggregation stage, multi-sourced labels are integrated by voting with adaptive weights; in the label correction stage, the correctness of the aggregated labels is estimated with the help of a handful of exactly labeled data, and the results are used to train a robust fraud detector. The two stages benefit from each other, and iterative execution leads to steadily improved detection results. Our method is therefore termed "Label Aggregation and Correction" (LAC). Experimentally, we collect millions of transaction records from Alipay in two different fraud detection scenarios, i.e., credit card theft and promotion abuse fraud. Compared with state-of-the-art counterparts, our method achieves at least 0.019 and 0.117 improvements in average AUC on the two collected datasets, which clearly demonstrates its effectiveness.
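The adaptive-weight voting of the label aggregation stage can be sketched as follows. Weighting each annotation source by its accuracy on a handful of exactly labeled examples is one simple choice for illustration, not necessarily the paper's exact weighting scheme.

```python
import numpy as np

def aggregate_labels(noisy, gold_idx, gold_labels):
    """Aggregate multi-sourced binary annotations by weighted voting.
    noisy: (n_sources, n_samples) 0/1 annotations.
    gold_idx/gold_labels: a few exactly labeled samples used to weight sources."""
    noisy = np.asarray(noisy)
    # Each source's weight = its accuracy on the exactly labeled subset.
    acc = (noisy[:, gold_idx] == gold_labels).mean(axis=1)
    weights = acc / acc.sum()
    scores = weights @ noisy            # weighted fraction of sources voting 1
    return (scores >= 0.5).astype(int)
```

A subsequent correction stage would then estimate which aggregated labels are still wrong and retrain the detector, iterating the two stages.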
As a well-established probabilistic method, topic models seek to uncover latent semantics from plain text. Beyond their textual content, we observe that documents are often compared in listwise rankings based on that content. For instance, countries worldwide are compared in international rankings of electricity production based on their national reports. Such document comparisons constitute additional information that reveals documents' relative similarities, and incorporating them into topic modeling could yield comparative topics that help to differentiate and rank documents. Furthermore, based on different comparison criteria, the observed document comparisons usually cover multiple aspects, each expressing a distinct ranked list. For example, a country may rank high in electricity production but fall behind others in life expectancy or government budget; each comparison criterion, or aspect, yields a distinct ranking. Considering such multiple aspects of comparison based on different ranking criteria allows us to derive one set of topics that informs heterogeneous document similarities. We propose a generative topic model aimed at learning topics that are well aligned with multi-aspect listwise comparisons. Experiments on public datasets demonstrate the advantage of the proposed method over baselines in jointly modeling topics and ranked lists.
Relation prediction is a fundamental task in network analysis that aims to predict the relationship between two nodes. It thus differs from the traditional link prediction problem, which predicts whether a link exists between a pair of nodes and can be viewed as a binary classification task. In a heterogeneous information network (HIN), which contains multiple types of nodes and multiple relations between nodes, the relation prediction task is more challenging. In addition, the HIN might have missing relation types on some edges and missing node types on some nodes, which makes the problem even harder.
In this work, we propose RPGNN, a novel relation prediction model based on graph neural networks (GNNs) and multi-task learning, to solve this problem. Existing GNN models for HIN representation learning usually focus on node classification/clustering tasks. They require the type information of all edges and nodes and typically learn a weight matrix for each type, thus requiring a large number of parameters on HINs with rich schemas. In contrast, our model directly encodes and learns relations in the HIN and avoids the need for type information during message passing in the GNN. Hence, our model is more robust to missing types in the relation prediction task on HINs. Experiments on real HINs show that our model consistently achieves better performance than several state-of-the-art HIN representation learning methods.
With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But as datasets are spread over a plethora of Web sites, finding data that are relevant for a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model to recognize sites that contain datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed bandit based algorithm to balance the trade-offs of different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.
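The bandit component that balances the trade-offs of different discovery actions can be sketched with a standard UCB1 rule. UCB1 is one classical choice used here for illustration; the paper's exact bandit algorithm may differ.

```python
import math

def ucb1_select(counts, rewards):
    """UCB1 choice over discovery actions: exploit actions that have found
    datasets before, while still exploring rarely tried ones.
    counts[a]  = times action a was taken; rewards[a] = total reward so far."""
    total = sum(counts)
    best, best_score = 0, float("-inf")
    for a, (n, r) in enumerate(zip(counts, rewards)):
        if n == 0:
            return a  # try every action at least once
        # Empirical mean plus an exploration bonus that shrinks with n.
        score = r / n + math.sqrt(2 * math.log(total) / n)
        if score > best_score:
            best, best_score = a, score
    return best
```

After each crawl step, the chosen action's count and reward (e.g., 1 if a new dataset-bearing site was found) are updated and selection repeats.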
Prescription (aka Rx) drugs can easily be overprescribed and lead to drug abuse or opioid overdose. Accordingly, state-run prescription drug monitoring programs (PDMPs) in the United States have been developed to reduce overprescribing. However, PDMPs have limited capability in detecting patients' potential overprescribing behaviors, impairing their effectiveness in preventing drug abuse and overdose in patients. Although a few machine-learning-based methods have been proposed for detecting overprescribing, they usually ignore patients' prescribing behavior and their performance is not satisfactory. In light of this, we propose a novel model, RxNet, for overprescribing detection in PDMPs. RxNet builds a dynamic heterogeneous graph to model Rx refills, which are essentially prescribing and dispensing (P&D) relationships among various Rx entities (e.g., patients) whose representations are encoded by a graph neural network. In addition, to explore the dynamic Rx-refill behavior and medical condition variation of patients, an RxLSTM network is designed to update the representations of patients. Based on the output of RxLSTM, a dosing-adaptive network is leveraged to extract and recalibrate dosing patterns and obtain refined patient representations, which are finally utilized for overprescribing detection. Extensive experimental results on one year of Ohio PDMP data demonstrate that RxNet consistently outperforms state-of-the-art methods in predicting patients at high risk of opioid overdose and drug abuse, with average improvements of 5.7% and 7.3% in F1 score, respectively.
Differentiable architecture search (DARTS) is widely considered prone to overfitting the validation set, which leads to performance degradation. We first employ a series of exploratory experiments to verify that neither high-strength regularization of the architecture parameters nor a warmup training scheme can effectively solve this problem. Based on the insights from these experiments, we conjecture that the performance of DARTS does not depend on well-trained supernet weights, and argue that the architecture parameters should be trained with the gradients obtained in the early stage, rather than the final stage, of training. This argument is then verified by exchanging the learning rate schemes of the weights and the architecture parameters. Experimental results show that this simple swap of learning rates can effectively resolve the degradation and achieve competitive performance. Further empirical evidence suggests that the degradation is not simply a matter of validation-set overfitting, but exhibits links between the degradation and the operation selection bias within the bilevel optimization dynamics. We demonstrate the generality of this bias and propose to exploit it to achieve an operation-magnitude-based selective stop.
With the prevalence of social media, there has recently been a proliferation of recommenders that shift their focus from individual modeling to group recommendation. Since a group's preference is a mixture of various predilections from its members, the fundamental challenge of group recommendation is to model the correlations among members. Existing methods mostly adopt heuristic or attention-based preference aggregation strategies to synthesize group preferences. However, these models mainly focus on pairwise connections between users and ignore the complex high-order interactions within and beyond groups. Besides, group recommendation suffers severely from data sparsity owing to extremely sparse group-item interactions. In this paper, we propose a self-supervised hypergraph learning framework for group recommendation with two goals: (1) capturing the intra- and inter-group interactions among users; (2) alleviating the data sparsity issue with the raw data itself. Technically, for (1), a hierarchical hypergraph convolutional network based on user- and group-level hypergraphs is developed to model the complex tuplewise correlations among users within and beyond groups. For (2), we design a double-scale node dropout strategy to create self-supervision signals that regularize user representations at different granularities against the sparsity issue. Experimental analysis on multiple benchmark datasets demonstrates the superiority of the proposed model and also elucidates the rationality of the hypergraph modeling and the double-scale self-supervision.
Next Point-of-Interest (POI) recommendation plays an important role in location-based services. State-of-the-art methods utilize recurrent neural networks (RNNs) to model users' check-in sequences and have shown promising results. However, they tend to recommend POIs similar to those the user has often visited; as a result, users become bored with obvious recommendations. To address this issue, we propose the Serendipity-oriented Next POI Recommendation model (SNPR), a supervised multi-task learning problem, with the objective of recommending POIs that are both unexpected and relevant. To this end, we define quantitative serendipity as a trade-off of relevance and unexpectedness in the context of next POI recommendation, and design a dedicated neural network with a Transformer to capture the complex interdependencies between POIs in a user's check-in sequence. Extensive experimental results show that our model improves relevance significantly, while its unexpectedness outperforms state-of-the-art serendipity-oriented recommendation methods.
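The relevance/unexpectedness trade-off above can be illustrated with a minimal scoring function; the linear combination and the weight `alpha` are assumptions for exposition, not SNPR's exact formulation.

```python
def serendipity_score(relevance, unexpectedness, alpha=0.5):
    """Combine relevance and unexpectedness into one serendipity
    score via a convex trade-off.  Illustrative sketch only: the
    linear form and alpha are assumptions, not the paper's model.

    relevance, unexpectedness: scores in [0, 1].
    alpha: weight on relevance; (1 - alpha) goes to unexpectedness.
    """
    if not (0.0 <= alpha <= 1.0):
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * relevance + (1.0 - alpha) * unexpectedness
```

Tuning `alpha` toward 1 recovers a purely relevance-driven recommender, while lower values favor surprising POIs.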
Link-based similarity computation arises in many real applications, including web search, clustering, and recommender systems. Many similarity measures have been proposed recently, but they share an undesirable drawback, the "path missing" issue: the paths between objects are not fully considered during similarity computation. For example, SimRank considers only in-coming paths of equal length from a common "center" object, and a large portion of other paths are neglected entirely. A comprehensive measure could be built by tallying all possible paths between objects, but fetching the similarities would require traversing a large number of paths, which increases the computational difficulty. In this paper, we propose a comprehensive similarity measure, RG-SimRank (Random surfer Graph-based SimRank), which resolves the "path missing" issue while inheriting the philosophy of SimRank. We build a random surfer graph by allowing the surfer to stay at the current object, or to move to other objects against in-links or along out-links. RG-SimRank applies SimRank to the random surfer graph instead of the original network; it therefore has the same form as SimRank and inherits its optimization techniques for similarity computation. We prove that RG-SimRank considers all possible paths of any direction and any length, and that it provides a general solution for assessing similarities under which many existing similarity measures become special cases. Other similarity measures besides SimRank can also be enhanced similarly using the random surfer graph. Extensive experiments on real datasets demonstrate the performance of the proposed approach.
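As background for RG-SimRank, plain SimRank can be computed by fixed-point iteration; RG-SimRank runs the same recursion on the random surfer graph instead of the original network. A minimal dense sketch for small graphs:

```python
import numpy as np

def simrank(adj, c=0.8, iters=10):
    """Standard SimRank by fixed-point iteration on a directed graph.

    adj[i][j] = 1 means an edge i -> j.  s(a, a) = 1; otherwise
    s(a, b) averages similarity over pairs of in-neighbors, scaled
    by the decay c.  Dense O(iters * n^2 * d^2) sketch, suitable
    only for small graphs.
    """
    a = np.asarray(adj, dtype=float)
    n = a.shape[0]
    s = np.eye(n)
    in_deg = a.sum(axis=0)
    for _ in range(iters):
        new = np.zeros((n, n))
        for u in range(n):
            for v in range(n):
                if u == v:
                    new[u, v] = 1.0
                    continue
                if in_deg[u] == 0 or in_deg[v] == 0:
                    continue  # no in-neighbors: similarity stays 0
                iu = np.nonzero(a[:, u])[0]
                iv = np.nonzero(a[:, v])[0]
                # Average current similarity over in-neighbor pairs.
                new[u, v] = c * s[np.ix_(iu, iv)].sum() / (len(iu) * len(iv))
        s = new
    return s
```

On a graph where node 2 points to both 0 and 1, nodes 0 and 1 share their single in-neighbor, so s(0, 1) converges to c.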
With the increasing demand for personalized learning, knowledge tracing, which traces students' knowledge states based on their historical practice, has become important. Factor analysis methods mainly use two kinds of factors, separately related to students and to questions, to model students' knowledge states. These methods use students' total numbers of attempts to model their learning progress and hardly highlight the impact of the most recent relevant practice. Besides, current factor analysis methods ignore the rich information contained in questions. In this paper, we propose the Multi-Factors Aware Dual-Attentional model (MF-DAKT), which enriches question representations and utilizes multiple factors to model students' learning progress based on a dual-attentional mechanism. More specifically, we propose a novel student-related factor that records students' most recent attempts on relevant concepts, highlighting the impact of recent exercises. To enrich question representations, we use a pre-training method to incorporate two kinds of question information: question relations and difficulty levels. We also add a regularization term on question difficulty to constrain the pre-trained question representations during fine-tuning for predicting students' performance. Moreover, we apply a dual-attentional mechanism to differentiate the contributions of factors and factor interactions to the final prediction in different practice records. Finally, we conduct experiments on several real-world datasets; the results show that MF-DAKT outperforms existing knowledge tracing methods. We also conduct several studies to validate the effects of each component of MF-DAKT.
Vertical federated learning (VFL) attracts increasing attention due to the emerging demands of multi-party collaborative modeling and concerns about privacy leakage. A complete list of metrics to evaluate VFL algorithms should include model applicability, privacy security, communication cost, and computation efficiency, where privacy security is especially important to VFL. However, to the best of our knowledge, there does not exist a VFL algorithm satisfying all of these criteria well. To address this challenging problem, in this paper we reveal that zeroth-order optimization (ZOO) is a desirable companion for VFL. Specifically, ZOO can 1) improve the model applicability of VFL frameworks, 2) prevent VFL frameworks from privacy leakage under curious, colluding, and malicious threat models, and 3) support inexpensive communication and efficient computation. Based on this, we propose a novel and practical VFL framework with black-box models, which is inseparably interconnected with the promising properties of ZOO. We believe it takes a stride toward designing a practical VFL framework that matches all the criteria. Under this framework, we propose two novel asynchronous zeroth-order algorithms for vertical federated learning (AsyREVEL) with different smoothing techniques. We theoretically derive the convergence rates of the AsyREVEL algorithms under nonconvex conditions. More importantly, we prove the privacy security of our proposed framework under existing VFL attacks at different levels. Extensive experiments on benchmark datasets demonstrate the favorable model applicability, satisfactory privacy security, inexpensive communication, efficient computation, scalability, and losslessness of our framework.
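The core of ZOO is that gradients are estimated from function values alone, so each party's model can remain a black box. A generic two-point Gaussian-smoothing estimator (a textbook sketch, not AsyREVEL itself) looks like:

```python
import numpy as np

def zoo_gradient(f, x, mu=1e-4, num_dirs=200, seed=0):
    """Two-point zeroth-order gradient estimate with Gaussian
    smoothing.  Only evaluations of f are used, never its gradient,
    which is what lets a federated framework treat each party's
    model as a black box.  Generic textbook estimator, shown for
    illustration only.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        # Finite difference of f along the random direction u;
        # its expectation approximates the true gradient.
        grad += (f(x + mu * u) - f(x)) / mu * u
    return grad / num_dirs
```

On a smooth function the estimate concentrates around the true gradient as the number of random directions grows.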
Knowledge-Enhanced Pre-trained Language Models (KEPLMs) improve the language understanding abilities of deep language models by leveraging rich semantic knowledge from knowledge graphs, beyond plain pre-training texts. However, previous efforts mostly use homogeneous knowledge (especially structured relation triples in knowledge graphs) to enhance the context-aware representations of entity mentions, and their performance may be limited by the coverage of knowledge graphs. Also, it is unclear whether these KEPLMs truly understand the injected semantic knowledge, due to the "black-box" training mechanism. In this paper, we propose a novel KEPLM named HORNET, which integrates Heterogeneous knowledge from various structured and unstructured sources into the RoBERTa NETwork and hence takes full advantage of both linguistic and factual knowledge simultaneously. Specifically, we design a hybrid attention heterogeneous graph convolution network (HaHGCN) to learn heterogeneous knowledge representations based on structured relation triples from knowledge graphs and unstructured entity description texts. Meanwhile, we propose explicit dual knowledge understanding tasks to induce a more effective infusion of the heterogeneous knowledge, promoting our model's learning of the complicated mappings from the knowledge graph embedding space to the deep context-aware embedding space and vice versa. Experiments show that our HORNET model outperforms various KEPLM baselines on knowledge-aware tasks including knowledge probing, entity typing, and relation extraction. Our model also achieves substantial improvements on several GLUE benchmark datasets compared to other KEPLMs.
Node classification is an important yet challenging task in various network applications, and many effective methods have been developed for a single network. In cross-network scenarios, however, neither single-network embedding nor traditional domain adaptation can directly solve the task. Existing approaches combine network embedding and domain adaptation for cross-network node classification. However, they focus only on domain-invariant features, ignoring the individual features of each network, and they utilize only 1-hop neighborhood information (local consistency), ignoring global consistency information. To tackle these problems, in this paper we propose a novel model, the Adversarial Separation Network (ASN), to learn effective node representations across source and target networks. We explicitly separate domain-private and domain-shared information: two domain-private encoders extract the domain-specific features of each network, and a shared encoder extracts the domain-invariant features shared across networks. Moreover, in each encoder, we combine local and global consistency to capture network topology information more comprehensively. ASN integrates deep network embedding with adversarial domain adaptation to reduce the distribution discrepancy across domains. Extensive experiments on real-world datasets show that our proposed model achieves state-of-the-art performance on cross-network node classification tasks compared with existing algorithms.
Human interactions with items are constantly being logged, which enables advanced representation learning and facilitates various tasks. Instead of generating static embeddings at the end of training, several temporal embedding methods were recently proposed to learn user and item embeddings as functions of time, where each entity has a trajectory of embedding vectors aiming to encode the full dynamics. However, these methods may not be optimal for encoding dynamical behaviors on interaction graphs, in that they cannot generate "fully" temporal embeddings and do not consider information propagation. In this paper, we tackle these issues and propose CoPE (Continuous Propagation and Evolution). We use an ordinary-differential-equation-based graph neural network to model information propagation and more sophisticated evolution patterns. We train CoPE on sequences of interactions with the help of meta-learning to ensure fast adaptation to the most recent interactions. We evaluate CoPE on three tasks and demonstrate its effectiveness.
Knowledge graph (KG) embedding aims to encode both entities and relations into a continuous vector space. Most existing methods require that all entities be observed during training, ignoring the evolving nature of KGs. Major recent efforts on this issue embed new entities by aggregating neighborhood information from existing entities and relations with a Graph Neural Network (GNN). However, these methods rely on the neighbors seen during training and struggle to embed new entities with insufficient triplets or triplets of the unseen-to-unseen form. To relieve this problem, we propose a two-stage learning model referred to as the Hyper-Relation Feature Learning Network (HRFN) for effective out-of-knowledge-base embedding. In the first stage, HRFN learns pre-representations for emerging entities using hyper-relation features meta-learned from the training set. A novel feature aggregating network that combines an entity-centered Graph Convolutional Network (GCN) and a relation-centered GCN is proposed to aggregate information from both the new entities themselves and their neighbors. In the second stage, a transductive learning network is employed to learn finer-grained embeddings based on the aforementioned pre-representations of the new entities. Experimental results on the link prediction task demonstrate the superiority of our model. Further analysis also validates the effectiveness and efficiency of pre-representing emerging entities with hyper-relation features.
In many real-world applications, bipartite graphs are naturally used to model relationships between two types of entities. Community discovery over bipartite graphs is a fundamental problem and has attracted much attention recently. However, all existing studies overlook the weight (e.g., influence or importance) of vertices in forming the community, thus missing useful properties of the community. In this paper, we propose a novel cohesive subgraph model named the Pareto-optimal (α, β)-community, which is the first to consider both structural cohesiveness and vertex weights on bipartite graphs. The proposed model follows the concept of the (α, β)-core by imposing degree constraints on each type of vertex, and integrates Pareto-optimality in modelling the weight information from the two different types of vertices. An online query algorithm is developed to retrieve Pareto-optimal (α, β)-communities with a time complexity of O(p · m), where p is the number of resulting communities and m is the number of edges in the bipartite graph G. To support efficient query processing over large graphs, we also develop index-based approaches. A complete index I is proposed, and the query algorithm based on I achieves query processing time linear in the result size (i.e., the algorithm is optimal). Nevertheless, the index I incurs prohibitively expensive space complexity. To strike a balance between query efficiency and space complexity, a space-efficient compact index 𝕀 is proposed, and computation-sharing strategies are devised to improve the efficiency of its construction. Extensive experiments on 9 real-world graphs validate both the effectiveness and the efficiency of our query processing algorithms and indexing techniques.
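The degree constraints underlying the model follow the (α, β)-core: every vertex of one type must keep degree at least α and every vertex of the other type at least β. A minimal peeling sketch (vertex weights and Pareto-optimality omitted):

```python
def alpha_beta_core(edges, alpha, beta):
    """Compute the (alpha, beta)-core of a bipartite graph by
    iterative peeling: repeatedly drop left vertices with degree
    < alpha and right vertices with degree < beta until stable.
    Sketch of the degree-constraint part only; the weight/Pareto
    machinery of the full model is not shown.

    edges: iterable of (left_vertex, right_vertex) pairs.
    Returns the surviving (left, right) vertex sets.
    """
    left, right = {}, {}
    for u, v in edges:
        left.setdefault(u, set()).add(v)
        right.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        # Peel left vertices violating the alpha constraint.
        for u in [u for u, nb in left.items() if len(nb) < alpha]:
            for v in left.pop(u):
                right[v].discard(u)
            changed = True
        # Peel right vertices violating the beta constraint.
        for v in [v for v, nb in right.items() if len(nb) < beta]:
            for u in right.pop(v):
                left[u].discard(v)
            changed = True
    return set(left), set(right)
```

Peeling one side can push the other side below its threshold, which is why the loop repeats until no vertex changes.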
The spectral radius of the non-backtracking matrix of an undirected graph plays an important role in various dynamic processes running on the graph. For example, its reciprocal provides an excellent approximation of epidemic and edge percolation thresholds. In this paper, we study the problem of minimizing the spectral radius of the non-backtracking matrix of a graph with n nodes and m edges by deleting k selected edges. We show that the objective function of this combinatorial optimization problem is monotone but not submodular. Since any straightforward approach to solving the optimization problem is computationally infeasible, we present an effective, scalable approximation algorithm with complexity O(n + km). Extensive experimental results on a large set of real-world networks verify the effectiveness and efficiency of our algorithm, and demonstrate that it outperforms several baseline schemes.
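For concreteness, the non-backtracking (Hashimoto) matrix indexes the directed versions of each edge and allows a transition from (u→v) to (v→w) only when w ≠ u. A dense sketch of its spectral radius for small graphs (not the paper's scalable O(n + km) algorithm):

```python
import numpy as np

def nb_spectral_radius(edges):
    """Spectral radius of the non-backtracking (Hashimoto) matrix
    of an undirected graph.  Each undirected edge {u, v} yields two
    directed darts; B[(u->v), (v->w)] = 1 whenever w != u.  Dense
    eigensolver sketch, O((2m)^3), for small graphs only.

    edges: iterable of undirected edges (u, v).
    """
    darts = []
    for u, v in edges:
        darts.append((u, v))
        darts.append((v, u))
    idx = {d: i for i, d in enumerate(darts)}
    b = np.zeros((len(darts), len(darts)))
    for (u, v) in darts:
        for (x, w) in darts:
            # Continue the walk from v without backtracking to u.
            if x == v and w != u:
                b[idx[(u, v)], idx[(x, w)]] = 1.0
    return max(abs(np.linalg.eigvals(b)))
```

For a d-regular graph the non-backtracking spectral radius is d - 1, which gives easy sanity checks: 1 for a triangle (2-regular) and 2 for K4 (3-regular).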
The proliferation of web platforms has created incentives for online abuse. Many graph-based anomaly detection techniques have been proposed to identify suspicious accounts and behaviors. However, most of them detect anomalies only after users have performed many such behaviors. Their performance is substantially hindered when observed user data are limited at an early stage, and improving early detection is key to minimizing financial loss. In this work, we propose Eland, a novel framework that uses action sequence augmentation for early anomaly detection. Eland utilizes a sequence predictor to predict the next actions of every user and exploits the mutual enhancement between action sequence augmentation and user-action graph anomaly detection. Experiments on three real-world datasets show that Eland improves the performance of a variety of graph-based anomaly detection methods. With Eland, anomaly detection performance at an earlier stage is better, by up to 15% in area under the ROC curve, than that of non-augmented methods requiring significantly more observed data.
In this paper, we revisit the decades-old clustering method k-means. The chicken-and-egg loop in traditional k-means is replaced by a pure stochastic optimization procedure, undertaken from the perspective of each individual sample. Different from existing incremental k-means, an individual sample is tentatively joined to a new cluster to evaluate its distance to the corresponding new centroid, in which the contribution from this sample is accounted for. The sample is actually moved to this new cluster only if the reallocation makes the sample closer to the new centroid than it is to the current one. Compared with traditional k-means and other variants, this new procedure allows the clustering to converge faster to a better local minimum. This fundamental modification of the k-means loop leads to the redefinition of a family of k-means variants, such as hierarchical k-means and sequential k-means. As an extension, a new target function that minimizes the summation of pairwise distances within clusters is presented; under the l2-norm, it can be solved with the same stochastic optimization procedure. The redefined traditional, hierarchical, and sequential k-means all show considerable performance improvements over their traditional counterparts under different settings and on various types of datasets.
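The reallocation rule above can be sketched per sample: tentatively join each other cluster, recompute that cluster's centroid with the sample included, and move only if the sample ends up closer than it is to its current centroid. A minimal sketch under Euclidean distance, with the data-structure names below as assumptions:

```python
import numpy as np

def stochastic_kmeans_step(x, assign, sums, counts, i):
    """One per-sample reallocation step of a stochastic k-means
    loop.  Sample i is tentatively joined to each other cluster;
    the candidate centroid is recomputed *with* the sample's own
    contribution included, and the move is kept only if it brings
    the sample closer than its current centroid.  A sketch of the
    rule, not the authors' full implementation.

    x: (n, d) data; assign: (n,) cluster id per sample;
    sums: (k, d) per-cluster feature sums; counts: (k,) sizes.
    Updates assign/sums/counts in place; returns sample i's cluster.
    """
    cur = assign[i]
    cur_centroid = sums[cur] / counts[cur]
    best, best_dist = cur, np.linalg.norm(x[i] - cur_centroid)
    for c in range(len(counts)):
        if c == cur:
            continue
        # Centroid cluster c would have if sample i joined it.
        cand = (sums[c] + x[i]) / (counts[c] + 1)
        d = np.linalg.norm(x[i] - cand)
        if d < best_dist:
            best, best_dist = c, d
    if best != cur:
        # Commit the reallocation and keep the sums/counts exact.
        sums[cur] -= x[i]; counts[cur] -= 1
        sums[best] += x[i]; counts[best] += 1
        assign[i] = best
    return assign[i]
```

Maintaining per-cluster sums and counts makes each tentative centroid an O(d) update rather than a full recomputation.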
Knowledge graph (KG) reasoning is a significant method for KG completion. To enhance the explainability of KG reasoning, some studies adopt reinforcement learning (RL) to perform multi-hop reasoning. However, RL-based reasoning methods are severely limited by few-shot relations (relations with only a few triplets). To tackle this problem, recent studies introduce meta-learning into RL-based methods to improve reasoning performance. However, the generalization abilities of their models are limited by low reasoning accuracy on hard relations (e.g., language and title). To overcome this problem, we propose a novel model called THML (Two-level Hardness-aware Meta-reinforcement Learning). Specifically, the model contains the following two components: (1) a hardness-aware meta-reinforcement learning method that predicts the missing element by training on hardness-aware batches; (2) a two-level hardness-aware sampling scheme that effectively generates new hardness-aware batches at the relation level and the relation-cluster level. The generalization ability of our model is significantly improved by alternately repeating these two components. The experimental results demonstrate that THML notably outperforms state-of-the-art approaches in few-shot scenarios.
Natural language question answering over knowledge graphs is an important and interesting task, as it enables common users to obtain accurate answers in an easy and intuitive manner. However, it remains a challenge to bridge the gap between unstructured questions and structured knowledge graphs. To address the problem, a natural approach is to build a structured query that represents the input question; evaluating the structured query over the knowledge graph then produces answers to the question. Distinct from existing methods based on semantic parsing or templates, we propose an effective approach, qaSQP, powered by a novel notion, the structural query pattern. Given an input question, we first generate its query sketch, which is compatible with the underlying structure of the knowledge graph. Then, we complete the query graph by labeling the nodes and edges under the guidance of the structural query pattern. Finally, answers can be retrieved by executing the constructed query graph over the knowledge graph. To improve the overall performance of question answering, we further propose a mutual optimization technique. Evaluations on three question-answering benchmarks show that our proposed approach significantly outperforms state-of-the-art methods.
Recent trends of incorporating LSTM network with different attention mechanisms in time series forecasting have led researchers to consider the attention module as an essential component. While existing studies revealed the effectiveness of attention mechanism with some visualization experiments, the underlying rationale behind their outstanding performance on learning long-term dependencies remains hitherto obscure. In this paper, we aim to elaborate on this fundamental question by conducting a thorough investigation of the memory property for LSTM network with attention mechanism. We present a theoretical analysis of LSTM integrated with attention mechanism, and demonstrate that it is capable of generating an adaptive decay rate which dynamically controls the memory decay according to the obtained attention score. In particular, our theory shows that attention mechanism brings significantly slower decays than the exponential decay rate of a standard LSTM. Experimental results on four real-world time series datasets demonstrate the superiority of the attention mechanism for maintaining long-term memory when compared to the state-of-the-art methods, and further corroborate our theoretical analysis.
Image captioning is a cross-modal problem combining computer vision and natural language processing. A typical image captioning model uses a convolutional neural network to extract the features of an image and then uses a Long Short-Term Memory (LSTM) network to transform the feature representations. However, this approach suffers from problems such as the absence of high-level semantics in the visual network and exposure bias in the language network. To overcome these problems, this paper proposes a novel image captioning model that combines relationship awareness and reinforcement learning. First, we design a relationship-aware network as the visual network to mine the latent relationships between objects in an image. Then, a context semantic relational network is proposed to improve the accuracy of image captioning. The context semantic network can generate feature representations for arbitrary pixel positions in an image without association with any specific visual concept. Subsequently, the high-level context semantics are used as external knowledge to guide the language network in generating sentences. Finally, a policy gradient training algorithm is designed to simplify the state value function in reinforcement learning. We verify the effectiveness of the model on the MS-COCO and Flickr30K datasets. The experimental results show that the proposed model achieves state-of-the-art results.
A Graph Convolutional Network (GCN) stacks several layers, and in each layer performs a PROPagation operation (PROP) and a TRANsformation operation (TRAN) for learning node representations over graph-structured data. Though powerful, GCNs tend to suffer a performance drop when the model gets deep. Previous works focus on PROPs to study and mitigate this issue, but the role of TRANs is barely investigated. In this work, we study the performance degradation of GCNs by experimentally examining how stacking only TRANs or only PROPs works. We find that TRANs contribute significantly, or even more than PROPs, to the declining performance, and moreover that they tend to amplify node-wise feature variance in GCNs, causing a variance inflammation that we identify as a key factor behind the performance drop. Motivated by these observations, we propose a variance-controlling technique termed Node Normalization (NodeNorm), which scales each node's features using its own standard deviation. Experimental results validate the effectiveness of NodeNorm in addressing the performance degradation of GCNs. Specifically, it enables deep GCNs to outperform shallow ones in cases where deep models are needed, and to achieve comparable results with shallow ones on 6 benchmark datasets. NodeNorm is a generic plug-in and generalizes well to other GNN architectures. Code is publicly available at https://github.com/miafei/NodeNorm.
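A minimal sketch of the normalization described above: each node's feature vector is divided by its own standard deviation, computed across the feature dimension. The `eps` stabilizer and the exponent `p` are common knobs in such normalizers and are assumptions here, not taken from the paper.

```python
import numpy as np

def node_norm(h, eps=1e-6, p=1.0):
    """Scale each node's feature vector by its own standard
    deviation across the feature dimension, controlling node-wise
    variance.  Sketch of the idea; eps and p are assumed knobs.

    h: (num_nodes, dim) node feature matrix.
    """
    # Per-node std over the feature dimension, kept as a column.
    std = h.std(axis=1, keepdims=True)
    return h / np.power(std + eps, p)
```

With `p=1`, every node whose features are not constant comes out with (approximately) unit feature variance, which is the variance-controlling effect the abstract describes.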
COVID-19 has caused lasting damage to almost every domain of public health, society, and the economy. To monitor the pandemic trend, existing studies rely on aggregating traditional statistical models and epidemic spread theory; in other words, historical COVID-19 statistics and population mobility data are the essential knowledge for monitoring the pandemic trend. However, these solutions can barely provide precise predictions and satisfactory explanations for long-term disease surveillance, while ubiquitous social media resources can be the key enabler for solving this problem. For example, serious discussions may occur on social media before and after breaking events take place. To take advantage of social media data, we propose a novel framework, the Social Media enhAnced pandemic suRveillance Technique (SMART), which is composed of two modules: (i) an information extraction module that constructs heterogeneous knowledge graphs based on the extracted events and the relationships among them; (ii) a time series prediction module that provides both short-term and long-term forecasts of confirmed cases and fatalities at the state level in the United States and discovers risk factors for COVID-19 interventions. Extensive experiments show that our method substantially outperforms state-of-the-art baselines by 7.3% and 7.4% in confirmed case and fatality prediction, respectively.
Personalized search plays a crucial role in improving user search experience owing to its ability to build user profiles based on historical behaviors. Previous studies have made great progress in extracting personal signals from the query log and learning user representations. However, neural personalized search is extremely dependent on sufficient data to train the user model. Data sparsity is an inevitable challenge for existing methods to learn high-quality user representations. Moreover, the overemphasis on final ranking quality leads to rough data representations and impairs the generalizability of the model. To tackle these issues, we propose a Personalized Search framework with Self-supervised Learning (PSSL) to enhance data representations. Specifically, we adopt a contrastive sampling method to extract paired self-supervised information from sequences of user behaviors in query logs. Four auxiliary tasks are designed to pre-train the sentence encoder and the sequence encoder used in the ranking model. They are optimized by contrastive loss which aims to close the distance between similar user sequences, queries, and documents. Experimental results on two datasets demonstrate that our proposed model PSSL achieves state-of-the-art performance compared with existing baselines.
Click-through rate (CTR) prediction is a critical task for many applications, as its accuracy has a direct impact on user experience and platform revenue. In recent years, CTR prediction has been widely studied in both academia and industry, resulting in a wide variety of CTR prediction models. Unfortunately, there is still a lack of standardized benchmarks and uniform evaluation protocols for CTR prediction research. This leads to non-reproducible or even inconsistent experimental results among existing studies, which largely limit the practical value and potential impact of their research. In this work, we aim to perform open benchmarking for CTR prediction and present a rigorous comparison of different models in a reproducible manner. To this end, we ran over 7,000 experiments for more than 12,000 GPU hours in total to re-evaluate 24 existing models on multiple dataset settings. Surprisingly, our experiments show that with sufficient hyper-parameter search and model tuning, many deep models have smaller differences than expected. The results also reveal that making real progress on the modeling of CTR prediction is indeed a very challenging research task. We believe that our benchmarking work could not only allow researchers to gauge the effectiveness of new models conveniently but also make them fairly compare with the state of the arts. We have publicly released the benchmarking tools, evaluation protocols, and experimental settings of our work to promote reproducible research in this field.
The development of extractive summarization models for long-form documents is hindered by two factors: 1) the computational cost of the summarization model increases dramatically with the sheer size of the input document; 2) the discourse structural information in the long-form document has not been fully exploited. To address these two deficiencies, we propose HEROES, a novel extractive summarization model for summarizing long-form documents with rich discourse structural information. In particular, the HEROES model consists of two modules: 1) a content ranking module that ranks and selects salient sections and sentences to compose a short digest, which serves as the input to the summarization module; 2) an extractive summarization module based on a heterogeneous graph with nodes from different discourse levels and elaborately designed edge connections that reflect the discourse hierarchy of the document and restrain semantic drift across section boundaries. Experimental results on benchmark datasets show that HEROES achieves significantly better performance than various strong baselines.
Context information in search sessions has proven to be useful for capturing user search intent. Existing studies explored user behavior sequences in sessions in different ways to enhance query suggestion or document ranking. However, a user behavior sequence has often been viewed as a definite and exact signal reflecting a user's behavior. In reality, it is highly variable: user's queries for the same intent can vary, and different documents can be clicked. To learn a more robust representation of the user behavior sequence, we propose a method based on contrastive learning, which takes into account the possible variations in user's behavior sequences. Specifically, we propose three data augmentation strategies to generate similar variants of user behavior sequences and contrast them with other sequences. In so doing, the model is forced to be more robust regarding the possible variations. The optimized sequence representation is incorporated into document ranking. Experiments on two real query log datasets show that our proposed model outperforms the state-of-the-art methods significantly, which demonstrates the effectiveness of our method for context-aware document ranking.
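The augmentation-plus-contrast recipe described above can be sketched as follows; the concrete augmentations (mask, crop, reorder) and the InfoNCE-style loss are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_items(seq, p=0.3):
    # Replace a random fraction of items with a [MASK] token (-1 here).
    return [(-1 if rng.random() < p else x) for x in seq]

def crop(seq, ratio=0.7):
    # Keep a random contiguous sub-sequence.
    n = max(1, int(len(seq) * ratio))
    start = rng.integers(0, len(seq) - n + 1)
    return seq[start:start + n]

def reorder(seq, ratio=0.3):
    # Shuffle a random contiguous span.
    seq = list(seq)
    n = max(1, int(len(seq) * ratio))
    start = rng.integers(0, len(seq) - n + 1)
    span = seq[start:start + n]
    idx = rng.permutation(n)
    seq[start:start + n] = [span[i] for i in idx]
    return seq

def info_nce(z_a, z_b, temperature=0.1):
    # z_a, z_b: (batch, dim) encodings of two views of the same sequences;
    # row i of z_a and row i of z_b form a positive pair.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # diagonal = positives
```

Minimizing the loss pulls the two augmented views of the same behavior sequence together while pushing other sequences in the batch apart.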
Discovering causal structure from temporal data is an important problem in many fields in science. Existing methods usually suffer from several limitations such as assuming linear dependencies among features, limiting to discrete time series, and/or assuming stationarity, i.e., causal dependencies are repeated with the same time lag and strength at all time points. In this paper, we propose an algorithm called the μ-PC that addresses these limitations. It is based on the theory of μ-separation and extends the well-known PC algorithm to the time domain. To be applicable to both discrete and continuous time series, we develop a conditional independence testing technique for time series by leveraging the Recurrent Marked Temporal Point Process (RMTPP) model. Experiments using both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithm.
Approximate subgraph matching, an important primitive for many applications like question answering, community detection, and motif discovery, often involves large labeled graphs such as knowledge graphs, social networks, and protein sequences. Effective methods for extracting matching subgraphs, in terms of label and structural similarities to a query, should offer accuracy, computational efficiency, and robustness to noise. In this paper, we propose VerSaChI for finding the top-k most similar subgraphs based on 2-hop label and structural overlap similarity with the query. The similarity is characterized using Chebyshev's inequality to compute the chi-square statistical significance for measuring the degree of matching of the subgraphs. Experiments on real-life graph datasets showcase significant improvements in accuracy compared to state-of-the-art methods, as well as robustness to noise.
Financial reports filed by various companies discuss compliance, risks, and future plans, such as goals and new projects, which directly impact their stock price. Quick consumption of such information is critical for financial analysts and investors to make stock buy/sell decisions and for equity evaluations. Hence, we study the problem of extractive summarization of 10-K reports. Recently, Transformer-based summarization models have become very popular. However, the lack of in-domain labeled summarization data is a major roadblock to training such finance-specific summarization models. We also show that zero-shot inference with such pretrained models is not effective either. In this paper, we address this challenge by modeling 10-K report summarization in a goal-directed setting, where we leverage summaries with labeled goal-related data for the stock buy/sell classification goal. Further, we provide improvements by considering a multi-task learning method with an industry classification auxiliary task. Intrinsic evaluation as well as extrinsic evaluation on the stock buy/sell classification and portfolio construction tasks shows that our proposed method significantly outperforms strong baselines.
Given a time-evolving tensor stream with missing values, how can we accurately discover latent factors in an online manner to predict missing values? Online tensor factorization is a crucial task with many important applications including the analysis of climate, network traffic, and epidemic disease. However, existing online methods have disregarded temporal locality and thus have limited accuracy.
In this paper, we propose STF (Streaming Tensor Factorization), an accurate online tensor factorization method for real-world temporal tensor streams with missing values. We exploit an attention-based temporal regularization to learn inherent temporal patterns of the streams. We also propose an efficient online learning algorithm which allows each row of the temporal factor matrix to be updated from both past and future information. Extensive experiments show that the proposed method achieves state-of-the-art accuracy and quickly processes each tensor slice.
Link prediction is one of the key problems for graph-structured data. With the advancement of graph neural networks, graph autoencoders (GAEs) and variational graph autoencoders (VGAEs) have been proposed to learn graph embeddings in an unsupervised way. It has been shown that these methods are effective for link prediction tasks. However, they do not work well for link prediction when a node whose degree is zero (i.e., an isolated node) is involved. We have found that GAEs/VGAEs make embeddings of isolated nodes close to zero regardless of their content features. In this paper, we propose a novel Variational Graph Normalized AutoEncoder (VGNAE) that utilizes L2-normalization to derive better embeddings for isolated nodes. We show that our VGNAEs outperform the existing state-of-the-art models for link prediction tasks. The code is available at https://github.com/SeongJinAhn/VGNAE.
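The normalization idea can be illustrated with a minimal sketch; the `scale` hyperparameter and the toy embeddings below are hypothetical, not values from the paper:

```python
import numpy as np

def l2_normalize(x, scale=1.8, eps=1e-12):
    # Project each row onto a sphere of radius `scale`; `scale` is a
    # hypothetical hyperparameter, not a value from the paper.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return scale * x / np.maximum(norms, eps)

# Toy illustration: an isolated node whose raw embedding has collapsed
# toward zero vs. a well-connected node with a large embedding.
raw = np.array([[1e-6, 2e-6],
                [3.0, 4.0]])
normalized = l2_normalize(raw)
# Both rows now share the same magnitude, so the isolated node's
# direction (its content signal) is no longer drowned out.
```

Because only the direction of each embedding survives, an isolated node's content features remain usable for scoring candidate links.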
Large-scale recommender systems are integral parts of many services. With the recent rapid growth of accessible data, the need for efficient training methods has arisen. Given the high computational cost of training state-of-the-art graph neural network (GNN) based models, it is infeasible to train them from scratch with every new set of interactions. In this work, we present a novel framework for incrementally training GNN-based models. Our framework takes advantage of an experience replay technique built on top of a structurally aware reservoir sampling method tailored for this setting. This framework addresses catastrophic forgetting, allowing the model to preserve its understanding of users' long-term behavioral patterns while adapting to new trends. Our experiments demonstrate the superior performance of our framework on numerous datasets when combined with state-of-the-art GNN-based models.
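The paper's reservoir sampler is structurally aware; as a hedged illustration of the underlying primitive only, plain uniform reservoir sampling (Algorithm R) over a stream of interactions looks like:

```python
import random

def reservoir_update(reservoir, k, item, n_seen):
    # Algorithm R: after processing n_seen items, every item seen so far
    # has probability k / n_seen of being in the reservoir.
    if len(reservoir) < k:
        reservoir.append(item)
    else:
        j = random.randrange(n_seen)  # uniform index in [0, n_seen)
        if j < k:
            reservoir[j] = item

# Stream 1000 interactions into a reservoir of size 10.
random.seed(0)
res = []
for n, interaction in enumerate(range(1000), start=1):
    reservoir_update(res, 10, interaction, n)
```

A structurally aware variant would bias the replacement probability by graph properties rather than drawing `j` uniformly.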
Pre-trained Transformer-based word embeddings are now widely used in text mining, where they are known to significantly improve supervised tasks such as text classification, named entity recognition, and question answering. Since Transformer models create several different embeddings for the same input, one at each layer of their architecture, various studies have already tried to identify which of these embeddings contribute most to the success of the above-mentioned tasks. In contrast, the same performance analysis has not yet been carried out in the unsupervised setting. In this paper, we evaluate the effectiveness of Transformer models on the important task of text clustering. In particular, we present a clustering ensemble approach that harnesses all the network's layers. Numerical experiments carried out on real datasets with different Transformer models show the effectiveness of the proposed method compared to several baselines.
Incremental contrast pattern mining (CPM) is an important task in various fields such as network traffic analysis, medical diagnosis, and customer behavior analysis. Due to increases in the speed and dimension of data streams, a major challenge for CPM is dealing with the huge number of generated candidate patterns. While there are some works on incremental CPM, their approaches do not scale to dense and high-dimensional data streams, and the problem of CPM over an evolving dataset is an open challenge. In this work, we focus on extracting the most specific set of contrast patterns (CPs) to discover significant changes between two data streams. We devise a novel algorithm to extract CPs using previously mined patterns instead of generating all patterns in each window from scratch. Our experimental results on a wide variety of datasets demonstrate the advantages of our approach over the state of the art in terms of efficiency.
Collaborative filtering (CF) methods are making an impact on our daily lives in a wide range of applications, including recommender systems and personalization. Latent factor methods, e.g., matrix factorization (MF), have been the state-of-the-art in CF, however they lack interpretability and do not provide a straightforward explanation for their predictions. Explainability is gaining momentum in recommender systems for accountability, and because a good explanation can swing an undecided user. Most recent explainable recommendation methods require auxiliary data such as review text or item content on top of item ratings. In this paper, we address the case where no additional data are available and propose augmenting the classical MF framework for CF with a prior that encodes each user's embedding as a sparse linear combination of item embeddings, and vice versa for each item embedding. Our XPL-CF approach automatically reveals these user-item relationships, which underpin the latent factors and explain how the resulting recommendations are formed. We showcase the effectiveness of XPL-CF on real data from various application domains. We also evaluate the explainability of the user-item relationship obtained from XPL-CF through numeric evaluation and case study examples.
Recommender systems (RSs) employ user-item feedback, e.g., ratings, to match customers to personalized lists of products. Approaches to top-k recommendation mainly rely on Learning-To-Rank algorithms and, among them, the most widely adopted is Bayesian Personalized Ranking (BPR), which is based on a pair-wise optimization approach. Recently, BPR has been found vulnerable to adversarial perturbations of its model parameters. Adversarial Personalized Ranking (APR) mitigates this issue by robustifying BPR via an adversarial training procedure. The empirical accuracy improvements of APR over BPR have led to its wide use in several recommender models. However, a key overlooked aspect has been the beyond-accuracy performance of APR, i.e., novelty, coverage, and amplification of popularity bias, considering that recent results suggest that BPR, the building block of APR, is sensitive to the intensification of biases and the reduction of recommendation novelty. In this work, we model the learning characteristics of the BPR and APR optimization frameworks to give mathematical evidence that, when the feedback data have a tailed distribution, APR amplifies the popularity bias more than BPR due to an unbalanced number of positive updates received by short-head items. Using matrix factorization (MF), we empirically validate the theoretical results through preliminary experiments on two public datasets comparing BPR-MF and APR-MF on accuracy and beyond-accuracy metrics. The experimental results consistently show the degradation of novelty and coverage measures and a worrying amplification of bias.
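For readers unfamiliar with BPR's pair-wise objective, here is a minimal sketch of its per-triple loss (the factor vectors are toy values, not the paper's setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(u, i_pos, i_neg):
    # BPR maximizes ln sigma(x_ui - x_uj): the predicted score of an
    # observed item i_pos should exceed that of an unobserved item i_neg.
    x_ui = u @ i_pos
    x_uj = u @ i_neg
    return -np.log(sigmoid(x_ui - x_uj))

u = np.array([0.5, 1.0])                       # toy user factors
good = bpr_loss(u, np.array([1.0, 1.0]), np.array([0.1, 0.1]))
bad = bpr_loss(u, np.array([0.1, 0.1]), np.array([1.0, 1.0]))
# The loss is smaller when the positive item already outranks the negative.
```

Because each gradient step involves one positive item, popular (short-head) items receive positive updates far more often under tailed feedback, which is the imbalance the analysis above builds on.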
Query Performance Prediction (QPP) is focused on estimating the difficulty of satisfying a user query for a certain retrieval method. While most state-of-the-art QPP methods are based on term frequency and corpus statistics, more recent work in this area has started to explore the utility of pretrained neural embeddings, neural architectures, and contextual embeddings. Such approaches extract features from pretrained or contextual embeddings for the sake of training a supervised performance predictor. In this paper, we adopt contextual embeddings for performance prediction, but distinguish ourselves from the state of the art by proposing to directly fine-tune a contextual embedding, i.e., BERT, specifically for the task of query performance prediction. As such, our work allows the fine-tuned contextual representations to estimate the performance of a query based on the association between the representation of the query and the retrieved documents. We compare the performance of our approach with the state of the art on the MS MARCO passage retrieval corpus and its three associated query sets: (1) the MS MARCO development set, (2) TREC DL 2019, and (3) TREC DL 2020. We show that our approach not only delivers significantly improved prediction performance compared to all the state-of-the-art methods but also, unlike past neural predictors, exhibits significantly lower latency, making it practical to use.
Over the last few years, contextualized pre-trained transformer models such as BERT have provided substantial improvements on information retrieval tasks. Traditional sparse retrieval methods such as BM25 rely on high-dimensional, sparse, bag-of-words query representations to retrieve documents. On the other hand, recent approaches based on pre-trained transformer models such as BERT fine-tune dense low-dimensional contextualized representations of queries and documents in embedding space. While these dense retrievers enjoy substantial retrieval effectiveness improvements compared to sparse retrievers, they are computationally intensive, require substantial GPU resources, and are known to be more expensive in both time and resources. In addition, sparse retrievers have been shown to retrieve complementary information with respect to dense retrievers, leading to proposals for hybrid retrievers. These hybrid retrievers leverage low-cost, exact-matching-based sparse retrievers along with dense retrievers to bridge the semantic gaps between query and documents. In this work, we address this trade-off between the cost and utility of sparse vs. dense retrievers by proposing a classifier that selects a suitable retrieval strategy (i.e., sparse vs. dense vs. hybrid) for individual queries. Routing queries that can be answered with sparse retrievers away from the GPU decreases the number of GPU calls; consequently, utility is maintained while query latency decreases. Although we use fewer computational resources and less time, we still achieve improved performance. Our classifier can select between sparse and dense retrieval strategies based on the query alone.
We conduct experiments on the MS MARCO passage dataset demonstrating an improved range of efficiency/effectiveness trade-offs between purely sparse, purely dense or hybrid retrieval strategies, allowing an appropriate strategy to be selected based on a target latency and resource budget.
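A hand-set rule over pre-retrieval features can illustrate the kind of decision such a classifier makes; the features, thresholds, and strategy labels below are hypothetical stand-ins for the learned model:

```python
def choose_strategy(query, vocab_df, df_threshold=50):
    # vocab_df maps a term to its document frequency in the corpus.
    # The rule below is a hand-set stand-in for the learned classifier.
    terms = query.lower().split()
    rare = sum(1 for t in terms if vocab_df.get(t, 0) < df_threshold)
    rare_frac = rare / max(len(terms), 1)
    if rare_frac > 0.5:
        return "sparse"   # mostly rare terms: exact matching likely suffices
    if rare_frac == 0.0:
        return "dense"    # all common terms: semantic matching helps
    return "hybrid"       # mixed evidence: combine both retrievers
```

In the actual approach, a trained classifier replaces this rule, but the routing principle is the same: queries answerable by the cheap sparse retriever never touch the GPU.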
Online shopping is gaining popularity. Traditional retailers with physical stores adjust to this trend by allowing their customers to shop online as well as offline, in-store. Increasingly, customers can browse and purchase products across multiple shopping channels. Understanding how customer behavior relates to the availability of multiple shopping channels is an important prerequisite for many downstream machine learning tasks, such as recommendation and purchase prediction. However, previous work in this domain is limited to analyzing single-channel behavior only.
In this paper, we provide the first insights into multi-channel customer behavior in retail based on a large sample of 2.8 million transactions originating from 300,000 customers of a food retailer in Europe. Our analysis reveals significant differences in customer behavior across online and offline channels, for example with respect to the repeat ratio of item purchases and basket size. Based on these findings, we investigate the performance of a next basket recommendation model under multi-channel settings. We find that the recommendation performance differs significantly for customers based on their choice of shopping channel, which strongly indicates that future research on recommenders in this area should take into account the particular characteristics of multi-channel retail shopping.
Information overload on the Internet is now ubiquitous, which makes the role of recommender systems more important. In recommender systems, the interests of users and the popularity of items are not static but can change drastically; thus, modeling the temporal dynamics of user-item interactions is crucial. The recently proposed Neural Ordinary Differential Equation (NODE) method can model the temporal mechanism of a system with neural networks. Using the ODE-LSTM method, which unites the ability of NODE to handle continuous time with that of LSTM to handle sequential data, we achieve significant improvements on the recommendation task on several real-world datasets with time irregularity. To handle sessions with different timestamps in ODE-LSTM, we propose a collective timeline technique that contributes substantially to the performance improvement. Moreover, we find that reducing the scale of time intervals in sessions significantly improves recommendation performance.
Modern-day recommender systems are often based on learning representations in a latent vector space that encode user and item preferences. In these models, each user/item is represented by a single vector and user-item interactions are modeled by some function over the corresponding vectors. This paradigm is common to a large body of collaborative filtering models that repeatedly demonstrated superior results. In this work, we break away from this paradigm and present ACF: Anchor-based Collaborative Filtering. Instead of learning unique vectors for each user and each item, ACF learns a spanning set of anchor-vectors that commonly serve both users and items. In ACF, each anchor corresponds to a unique "taste" and users/items are represented as a convex combination over the spanning set of anchors. Additionally, ACF employs two novel constraints: (1) an exclusiveness constraint on item-to-anchor relations that encourages each item to pick a single representative anchor, and (2) an inclusiveness constraint on anchor-to-item relations that encourages full utilization of all the anchors. We compare ACF with other state-of-the-art alternatives and demonstrate its effectiveness on multiple datasets.
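The convex-combination representation can be sketched in a few lines; the anchors and weights below are random toy values, and the exclusiveness/inclusiveness constraints are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_users, n_anchors, dim = 5, 3, 4
anchors = rng.standard_normal((n_anchors, dim))      # shared "taste" vectors
user_logits = rng.standard_normal((n_users, n_anchors))

# Convex combination: weights are non-negative and sum to one, so every
# user vector lies inside the convex hull spanned by the anchors.
weights = softmax(user_logits)
user_vecs = weights @ anchors
```

Items would be represented the same way over the same anchor set, which is what lets anchors serve both sides of the interaction.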
Transformer-based language models significantly advanced the state-of-the-art in many linguistic tasks. As this revolution continues, the ability to explain model predictions has become a major area of interest for the NLP community. In this work, we present Gradient Self-Attention Maps (Grad-SAM) - a novel gradient-based method that analyzes self-attention units and identifies the input elements that explain the model's prediction the best. Extensive evaluations on various benchmarks show that Grad-SAM obtains significant improvements over state-of-the-art alternatives.
For summarization in niche domains, there is often not enough data to fine-tune a large pre-trained model. To alleviate this few-shot problem, we design several auxiliary tasks to assist the main task of abstractive summarization. In this paper, we employ BART as the base sequence-to-sequence model and incorporate the main and auxiliary tasks under a multi-task framework. We transform all the tasks into the format of machine reading comprehension. Moreover, we utilize task-specific adapters to effectively share knowledge across tasks and an adaptive weight mechanism to adjust the contribution of auxiliary tasks to the main task. Experiments show the effectiveness of our method on few-shot datasets. We also propose to first pre-train the model on unlabeled datasets, after which the methods proposed in this paper can further improve model performance.
The quality of search engine results returned for health-related questions is critical, since a searcher may directly trust any suggestion in the top results. We analyze search questions that mention diseases/symptoms and remedies that are potential health-related misbeliefs. Using lists of medical and alternative-medicine terms, we extract health-related search questions from 1.5 billion questions submitted to Yandex. As an initial study, we sample 30 frequent questions that contain a disease-remedy pair like "Can hepatitis be cured with milk thistle?". For each question, we carefully identify a ground-truth answer in the medical literature and annotate the top-10 Yandex search result snippets as confirming the belief, rejecting it, or giving no answer. Our analysis shows that about 44% of the snippets (which users may simply interpret as definitive answers!) confirm untrue beliefs, and only a few include health risk warnings about using toxic plants.
Extracting event temporal relations is an important task for natural language understanding. Many methods have been proposed for supervised event temporal relation extraction, which typically requires a large amount of human-annotated data for model training. However, data annotation for this task is very time-consuming and challenging. To this end, we study the problem of semi-supervised event temporal relation extraction. Self-training, a widely used semi-supervised learning method, can be applied to this problem; however, it suffers from noisy pseudo-labeling. In this paper, we propose an uncertainty-aware self-training framework (UAST) to quantify model uncertainty and cope with pseudo-labeling errors. Specifically, UAST utilizes (1) an Uncertainty Estimation module to compute the model uncertainty for pseudo-labeling unlabeled data; (2) a Sample Selection with Exploration module to select informative samples based on uncertainty estimates; and (3) an Uncertainty-Aware Learning module to explicitly incorporate the model uncertainty into the self-training process. Experimental results indicate that our approach significantly outperforms previous state-of-the-art methods.
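One common way to quantify model uncertainty for pseudo-label selection, offered here as an assumption rather than UAST's exact formulation, is predictive entropy over several stochastic forward passes (e.g. MC dropout):

```python
import numpy as np

def predictive_entropy(mc_probs):
    # mc_probs: (n_passes, n_samples, n_classes) class probabilities from
    # several stochastic forward passes of the same model.
    mean_p = mc_probs.mean(axis=0)
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)

def select_confident(mc_probs, k):
    # Keep the k pseudo-labeled samples with the lowest uncertainty.
    return np.argsort(predictive_entropy(mc_probs))[:k]

# Two samples: the passes agree on sample 0 but disagree on sample 1.
p = np.array([
    [[0.9, 0.1], [0.9, 0.1]],
    [[0.9, 0.1], [0.1, 0.9]],
])
```

Samples on which the stochastic passes disagree get high entropy and are excluded (or down-weighted) when the model retrains on its own pseudo-labels.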
Variants of Graph Neural Networks (GNNs) for representation learning have been proposed recently and have achieved fruitful results in various fields. Among them, the Graph Attention Network (GAT) first employed a self-attention strategy to learn attention weights for each edge in the spatial domain. However, learning attention over edges focuses only on local graph information and greatly increases computational cost. In this paper, we introduce the attention mechanism in the spectral domain of graphs and present the Spectral Graph Attention Network (SpGAT), which learns representations for different frequency components using weighted filters and graph wavelet bases. In this way, SpGAT can better capture global patterns of graphs efficiently, with far fewer learned parameters than GAT. Further, to reduce the computational cost that the eigen-decomposition imposes on SpGAT, we propose a fast approximation variant, SpGAT-Cheby. We thoroughly evaluate the performance of SpGAT and SpGAT-Cheby on semi-supervised node classification tasks and verify the effectiveness of the learned attention in the spectral domain.
Multi-task learning has been studied by many researchers; it assumes that different tasks can share a common low-rank latent subspace, meaning that learning multiple tasks jointly is better than learning them independently. In this paper, we propose two novel multi-task learning formulations based on two regularization terms, which learn the optimal shared latent subspace by minimizing exactly the k smallest singular values. The proposed regularization terms are tighter approximations of rank minimization than the trace norm. However, solving the exact rank minimization problem is NP-hard. We therefore design a novel re-weighted iterative strategy to solve our models, which tactically handles the exact rank minimization problem by setting a large penalty parameter. Experimental results on benchmark datasets demonstrate that our methods can correctly recover the low-rank structure shared across tasks and outperform related multi-task learning methods.
The economic policy uncertainty (EPU) index is one of the important text-based indexes in the finance and economics fields. EPU indexes for more than 26 countries have been constructed to reflect policy uncertainty in country-level economic environments and serve as important leading economic indicators. The EPU indexes are calculated from the number of news articles containing manually selected keywords related to the economy, uncertainty, and policy. We find that keyword-based EPU indexes contain noise, which influences their explainability and predictability. In our experimental dataset, over 40% of news articles with the selected keywords are not related to the EPU. Instead of using keywords only, our proposed models take contextual information into account and perform well at identifying articles unrelated to the EPU. The noise-free EPU index outperforms the keyword-based EPU index in both explainability and predictability.
Stock price volatility reflects the risk of a stock, influences the risk of an investor's portfolio, and is a crucial part of pricing derivative securities. Researchers have turned to predicting stock volatility with different kinds of textual data. However, most of them focus on word information only; few touch on capturing the numeral information in textual data, which provides fine-grained clues for financial document understanding. In this paper, we present a novel dataset, ECNum, for understanding the numerals in transcripts of earnings conference calls. We propose a simple but efficient method, the Numeral-Aware Model (NAM), for enhancing the numeral-understanding capacity of neural network models. We employ the distilled information in the stock volatility forecasting task and achieve the best performance compared to previous works in short-term scenarios.
Numeral information plays an important role in narratives of several domains such as medicine, engineering, and finance. Previous works focus on foundational explorations of numeracy and show that fine-grained numeracy is a challenging task. In machine reading comprehension, our statistics show that only a few numeral-related questions appear in previous datasets, indicating that few benchmark datasets are designed for numeracy learning. In this paper, we present a Numeral-related Question Answering Dataset, NQuAD, for fine-grained numeracy, and propose several baselines for future work. We compare NQuAD with three machine reading comprehension datasets and show that NQuAD is more challenging than the numeral-related questions in other datasets. NQuAD is published under the CC BY-NC-SA 4.0 license for academic purposes.
The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomenon and the number of samples in the deep neural network setting. In particular, we propose a construction that augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent and observe a smooth descent into the overparameterized region for our construction, both with respect to the model size and with respect to the number of epochs.
Machine learning models have been widely used for fraud detection, but developing and maintaining these models often suffers from significant limitations in terms of training data scarcity and constrained resources. To address these issues, in this paper we leverage the vulnerability of machine learning models to adversarial attacks and design a novel model, AdvRFD, that Adversarially Reprograms an ImageNet classification neural network for the Fraud Detection task. AdvRFD first embeds transaction features into a host image to construct new ImageNet data, and then learns a universal perturbation to be added to all inputs, such that the outputs of the pretrained model can be mapped to the final detection decisions for all transactions. Extensive experiments on two transaction datasets, from Ethereum and credit cards, demonstrate that AdvRFD is effective at detecting fraud using limited data and resources.
Many payment platforms hold large-scale marketing campaigns, which allocate incentives to encourage users to pay through their applications. To maximize the return on investment, incentive allocations are commonly solved in a two-stage procedure. After training a response estimation model to estimate the users' mobile payment probabilities (MPP), a linear programming process is applied to obtain the optimal incentive allocation. However, the large amount of biased data in the training set, generated by the previous biased allocation policy, causes a biased estimation. This bias deteriorates the performance of the response model and misleads the linear programming process, dramatically degrading the performance of the resulting allocation policy. To overcome this obstacle, we propose a bias correction adversarial network. Our method leverages the small set of unbiased data obtained under a full-randomized allocation policy to train an unbiased model and then uses it to reduce the bias with adversarial learning. Offline and online experimental results demonstrate that our method outperforms state-of-the-art approaches and significantly improves the performance of the resulting allocation policy in a real-world marketing campaign.
Table search aims to retrieve a list of tables given a user's query. Previous methods only consider the textual information of tables and the structural information is rarely used. In this paper, we propose to model the complex relations in the table corpus as one or more graphs and then utilize graph neural networks to learn representations of queries and tables. We show that the text-based table retrieval methods can be further improved by graph-based predictions which fuse multiple field-level information.
Fake news videos are being actively produced and uploaded to YouTube to attract public attention. In this paper, we propose a topic-agnostic fake news video detection model based on adversarial learning and topic modeling. The proposed model estimates the topic distributions of a video's title/description and comments via topic modeling, and identifies differences in stance from the difference between the two distributions. It then constructs an adversarial neural network to extract topic-agnostic features effectively. The proposed model can effectively detect topic changes for stance analysis and easily shift among various topics. In this study, it achieves an F1-score 2.68 percentage points higher than previous models in fake news video detection.
The user identity linkage (UIL) task aims to infer identical users across different social networks/platforms. Existing models leverage labeled inter-network linkages or high-quality user attributes to make predictions. Nevertheless, such information is often difficult or even impossible to obtain in real-world applications. To this end, in this paper we focus on an Anonymized User Identity Linkage (AUIL) problem in which neither labeled anchor users nor attributes are available. To handle such a practical and challenging task, we propose a novel and concise unsupervised embedding method, VCNE, that utilizes network structural information. Concretely, considering the inherent structural diversity in the AUIL problem, we introduce a variational cross-network embedding learning framework that jointly learns Gaussian embeddings, instead of the deterministic embeddings of existing methods, in vector space. Multi-faceted experiments on both real-world and synthetic datasets demonstrate that VCNE not only outperforms all baselines by a large margin but is also more robust to different levels of diversity and sparsity in the networks.
Point-of-Interest (POI) recommendation is an important task in location-based social networks; it facilitates modeling the relations between users and locations. Recently, researchers have successfully recommended POIs using long- and short-term interests. However, these methods fail to capture periodic interest well: people tend to visit similar places at similar times or in similar areas. Existing models try to acquire such periodicity from a user's mobility status or time slot, which limits how well periodic interest is modeled. To this end, we propose to learn spatial-temporal periodic interest. Specifically, in the long-term module, we learn temporal periodic interest at daily granularity and then utilize intra-level attention to form long-term interest. In the short-term module, we construct various short-term sequences to acquire spatial-temporal periodic interest at hourly, areal, and hourly-areal granularities, respectively. Finally, we apply inter-level attention to automatically integrate the multiple interests. Experiments on two real-world datasets demonstrate the state-of-the-art performance of our method.
Long sequence time-series forecasting (LSTF) has become increasingly popular for its wide range of applications. Though superior models have been proposed to enhance the prediction effectiveness and efficiency, it is reckless to neglect or underestimate one of the most natural and basic temporal properties of time series: history has inertia. In this paper, we introduce a new baseline for LSTF, named historical inertia (HI). In HI, the most recent historical data points in the input time series are adopted as the prediction results. We experimentally evaluate HI on 4 public real-world datasets and 2 LSTF tasks. The results demonstrate that up to 82% relative improvement over state-of-the-art works can be achieved. We further discuss why HI works and potential ways of benefiting from it.
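As described, the HI baseline simply reuses the tail of the input window as the forecast; a minimal sketch (the function name is ours):

```python
import numpy as np

def historical_inertia(series, horizon):
    """Forecast the next `horizon` values by copying the most recent
    `horizon` observations of the input time series."""
    series = np.asarray(series, dtype=float)
    assert len(series) >= horizon, "input window must cover the horizon"
    return series[-horizon:].copy()
```

Despite having no learned parameters, such a baseline is what the reported relative improvements are measured against.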
Traditional oversampling methods are generally employed to handle class imbalance in datasets. This oversampling approach is independent of the classifier and thus does not offer an end-to-end solution. To overcome this, we propose a three-player adversarial game-based end-to-end method that uses a domain-constrained mixture of generators, a discriminator, and a multi-class classifier. Rather than adversarial minority oversampling, we propose an adversarial oversampling (AO) and a data-space oversampling (DO) approach. In AO, the generator updates by fooling both the classifier and the discriminator, whereas in DO it updates by favoring the classifier and fooling the discriminator. When updating the classifier, AO considers both the real and the synthetically generated samples, whereas DO favors the real samples and fools a subset of class-specific generated samples. To mitigate the classifier's bias towards the majority class, minority samples are oversampled at a fractional rate. This implementation is shown to provide more robust classification boundaries. The effectiveness of our proposed method has been validated on high-dimensional, highly imbalanced, and large-scale multi-class tabular datasets. The results, as measured by average class-specific accuracy (ACSA), clearly indicate that the proposed method provides better classification accuracy (improvements ranging from 0.7% to 49.27%) compared to the baseline classifier.
Open-domain conversational QA (ODCQA) calls for effective question rewriting (QR), as the questions in a conversation typically lack the context the QA model needs to interpret them. In this paper, we compare two types of QR approaches, generative and expansive QR, in end-to-end ODCQA systems on the recently released QReCC and OR-QuAC benchmarks. While it is common practice to apply the same QR approach to both the retriever and the reader in a QA system, our results show that such a strategy is generally suboptimal and suggest that expansive QR is better for the sparse retriever while generative QR is better for the reader. Furthermore, while conversation history modeling with dense representations outperforms QR, we show the advantages of applying both jointly, as QR boosts performance especially when only a limited number of history turns is considered.
In general, graph neural networks (GNNs) adopt the message-passing scheme to capture the information of a node (i.e., nodal attributes and local graph structure) by iteratively transforming and aggregating the features of its neighbors. Nonetheless, recent studies show that the performance of GNNs can be easily hampered by abnormal or malicious nodes due to the vulnerability of neighborhood aggregation. It is thus necessary to learn anomaly-resistant GNNs without prior knowledge of ground-truth anomalies, given that labeling anomalies is costly and requires intensive domain knowledge. Though removing anomalies with unsupervised anomaly detection methods could be a possible solution, it may degrade GNN performance on target tasks due to the non-differentiable gap between the two learning procedures. To keep GNNs effective on anomaly-contaminated graphs, in this paper we propose a new framework named RARE-GNN (Reinforced Anomaly-REsistant Graph Neural Networks), which detects anomalies in the input graph and learns anomaly-resistant GNNs simultaneously. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework.
Explicitly modeling emotions in dialogue generation has important applications, such as building empathetic personal companions. In this study, we consider the task of expressing a specific emotion in dialogue generation. Previous approaches take the emotion as a training signal, which may be ignored during inference. Here, we propose a search-based emotional dialogue system by simulated annealing (SA). Specifically, we first define a scoring function that combines contextual coherence and emotional correctness. Then, SA iteratively edits a general response and searches for a generation with a high score. In this way, we enforce the presence of the desired emotion. We evaluate our system on the NLPCC2017 dataset. The proposed method shows an improvement of about 12% in emotion accuracy compared with the previous state-of-the-art method, without hurting generation quality (measured by BLEU).
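A generic simulated-annealing loop of the kind the abstract describes might look like the sketch below; the `score` and `propose_edit` interfaces are assumptions for illustration, not the paper's actual components:

```python
import math
import random

def anneal(initial, score, propose_edit, steps=200, t0=1.0, seed=0):
    """Iteratively edit a candidate, accepting worse edits with a
    temperature-controlled probability, and return the best-scoring one."""
    rng = random.Random(seed)
    current = best = initial
    for step in range(steps):
        t = max(t0 * (1 - step / steps), 1e-6)  # linear cooling schedule
        cand = propose_edit(current, rng)
        delta = score(cand) - score(current)
        # always accept improvements; accept worse edits with prob exp(delta/t)
        if delta > 0 or rng.random() < math.exp(delta / t):
            current = cand
        if score(current) > score(best):
            best = current
    return best
```

For dialogue, `score` would combine contextual coherence and emotional correctness as described above, and `propose_edit` would perform local word-level edits of the response.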
Change point detection is widely used for finding transitions between states of data generation within a time series. Methods for change point detection currently assume this transition is instantaneous and therefore focus on finding a single point of data to classify as a change point. However, this assumption is flawed because many time series actually display short periods of transitions between different states of data generation. Previous work has shown Bayesian Online Change Point Detection (BOCPD) to be the most effective method for change point detection on a wide range of different time series. This paper explores adapting the change point detection algorithms to detect abrupt changes over short periods of time. We design a segment-based mechanism to examine a window of data points within a time series, rather than a single data point, to determine if the window captures abrupt change. We test our segment-based Bayesian change detection algorithm on 36 different time series and compare it to the original BOCPD algorithm. Our results show that, for some of these 36 time series, the segment-based approach for detecting abrupt changes can much more accurately identify change points based on standard metrics.
Recently, deep learning methods have become mainstream in code search since they do better at capturing semantic correlations between code snippets and search queries and have promising performance. However, code snippets have diverse information from different dimensions, such as business logic, specific algorithm, and hardware communication, so it is hard for a single code representation module to cover all the perspectives. On the other hand, as a specific query may focus on one or several perspectives, it is difficult for a single query representation module to represent different user intents. In this paper, we propose MuCoS, a multi-model ensemble learning architecture for semantic code search. It combines several individual learners, each of which emphasizes a specific perspective of code snippets. We train the individual learners on different datasets which contain different perspectives of code information, and we use a data augmentation strategy to get these different datasets. Then we ensemble the learners to capture comprehensive features of code snippets. The experiments show that MuCoS has better results than the existing state-of-the-art methods. Our source code and data are anonymously available at https://github.com/Xzh0u/MuCoS.
To address the sample selection bias between the training and test data, previous research works focus on reweighing biased training data to match the test data and then building classification models on the reweighed training data. However, how to achieve fairness in the built classification models is under-explored. In this paper, we propose a framework for robust and fair learning under sample selection bias. Our framework adopts the reweighing estimation approach for bias correction and the minimax robust estimation approach for achieving robustness on prediction accuracy. Moreover, during the minimax optimization, fairness is achieved under the worst case, which guarantees the model's fairness on test data. We further develop two algorithms to handle sample selection bias for the cases where test data is available and where it is unavailable.
There is an urgent call to detect and prevent "biased data" at the earliest possible stage of the data pipelines used to build automated decision-making systems. In this paper, we focus on controlling data bias in entity resolution (ER) tasks, which aim to discover and unify records/descriptions from different data sources that refer to the same real-world entity. We formally define the ER problem with fairness constraints ensuring that all groups of entities have similar chances to be resolved. Then, we introduce FairER, a greedy algorithm for solving this problem for fairness criteria based on equal matching decisions. Our experiments show that FairER achieves similar or higher accuracy than two baseline methods over 7 datasets, while guaranteeing minimal bias.
The popularity of online social coding (SC) platforms such as GitHub is growing due to their social functionalities and tremendous support during the product development lifecycle. The rich information of experts' contributions on repositories can be leveraged to recruit experts for new/existing projects. In this paper, we define the problem of collaborative experts finding in SC platforms. Given a project, we model an SC platform as an attributed heterogeneous network, learn latent representations of network entities in an end-to-end manner and utilize them to discover collaborative experts to complete a project. Extensive experiments on real-world datasets from GitHub indicate the superiority of the proposed approach over the state-of-the-art in terms of a range of performance measures.
The steadily rising number of datasets is making it increasingly difficult for researchers and practitioners to be aware of all datasets, particularly of the most relevant datasets for a given research problem. To this end, dataset search engines have been proposed. However, they are based on users' keywords and, thus, have difficulty determining precisely fitting datasets for complex research problems. In this paper, we propose a system that recommends suitable datasets based on a given research problem description. The recommendation task is designed as a domain-specific text classification task. As shown in a comprehensive offline evaluation using various state-of-the-art models, as well as 88,000 paper abstracts and 265,000 citation contexts as research problem descriptions, we obtain an F1-score of 0.75. In an additional user study, we show that users in real-world settings are 88% satisfied in all test cases. We therefore see promising future directions for dataset recommendation.
The sequential patterns within user interactions are pivotal for representing the user's preference and capturing latent relationships among items. Recent advancements in sequence modeling by Transformers have led the community to devise more effective encoders for sequential recommendation. Most existing sequential methods assume users are deterministic. However, item-item transitions might fluctuate significantly across several item aspects and exhibit randomness in user interests. This stochastic characteristic creates a solid demand to include uncertainty in representing sequences and items. Additionally, modeling sequences and items with uncertainty expands users' and items' interaction spaces, further alleviating cold-start problems.
In this work, we propose a Distribution-based Transformer for Sequential Recommendation (DT4SR), which injects uncertainty into sequential modeling. We describe items and sequences, together with their uncertainty, as elliptical Gaussian distributions, and adopt the Wasserstein distance to measure the similarity between distributions. We devise two novel Transformers for modeling mean and covariance, which guarantees the positive-definite property of the distributions. The proposed method significantly outperforms state-of-the-art methods, and experiments on three benchmark datasets also demonstrate its effectiveness in alleviating cold-start issues. The code is available at https://github.com/DyGRec/DT4SR.
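For Gaussians with diagonal covariance, the 2-Wasserstein distance used to compare such distributions has a simple closed form; a sketch (the parameterization by standard-deviation vectors is our assumption, and DT4SR's exact formulation may differ):

```python
import numpy as np

def w2_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, diag(sigma1^2)) and
    N(mu2, diag(sigma2^2)):
    W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2."""
    mu1, sigma1, mu2, sigma2 = map(np.asarray, (mu1, sigma1, mu2, sigma2))
    return float(np.sqrt(np.sum((mu1 - mu2) ** 2)
                         + np.sum((sigma1 - sigma2) ** 2)))
```

Unlike KL divergence, this distance stays well-defined even when the distributions have disjoint high-density regions, which is one reason it is attractive for embedding similarity.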
Large datasets in NLP tend to suffer from noisy labels due to erroneous automatic and human annotation procedures. We study the problem of text classification with label noise and aim to capture this noise through an auxiliary noise model over the classifier. We first assign a probability score to each training sample of having a clean or noisy label, using a two-component beta mixture model fitted on the training losses at an early epoch. Using this, we jointly train the classifier and the noise model through a novel de-noising loss having two components: (i) cross-entropy of the noise model prediction with the input label, and (ii) cross-entropy of the classifier prediction with the input label, weighted by the probability of the sample having a clean label. Our empirical evaluation on two text classification tasks and two types of label noise, random and input-conditional, shows that our approach can improve classification accuracy and prevent over-fitting to the noise.
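The per-sample de-noising loss described above can be sketched as follows; the probability vectors and the clean-label weight `p_clean` (from the beta-mixture step) are assumed interfaces, and the names are ours:

```python
import math

def denoising_loss(noise_probs, clf_probs, label, p_clean):
    """Two-component loss: (i) cross-entropy of the noise-model prediction
    with the (possibly noisy) label, plus (ii) the classifier's
    cross-entropy with the label, weighted by the probability `p_clean`
    that the sample's label is clean."""
    ce_noise = -math.log(noise_probs[label])
    ce_clf = -math.log(clf_probs[label])
    return ce_noise + p_clean * ce_clf
```

Samples judged likely noisy (small `p_clean`) thus contribute little to the classifier's gradient while still training the noise model.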
FlinkCEP is the Complex Event Processing (CEP) API of the Flink Big Data platform. The high expressive power of the language of FlinkCEP comes at the cost of cumbersome parameterization of the queried patterns, acting as a barrier for FlinkCEP's adoption. Moreover, properly configuring a FlinkCEP program to run over a computer cluster requires advanced skills on modern hardware administration which non-expert programmers do not possess. In this work (i) we build a novel, logical CEP operator that receives CEP pattern queries in the form of extended regular expressions and seamlessly re-writes them to FlinkCEP programs, (ii) we build a CEP Optimizer that automatically decides good job configurations for these FlinkCEP programs. We also present an experimental evaluation which demonstrates the significant benefits of our approach.
Pre-trained language models have achieved noticeable performance on the intent detection task. However, due to assigning an identical weight to each sample, they suffer from the overfitting of simple samples and the failure to learn complex samples well. To handle this problem, we propose a density-based dynamic curriculum learning model. Our model defines the sample's difficulty level according to their eigenvectors' density. In this way, we exploit the overall distribution of all samples' eigenvectors simultaneously. Then we apply a dynamic curriculum learning strategy, which pays distinct attention to samples of various difficulty levels and alters the proportion of samples during the training process. Through the above operation, simple samples are well-trained, and complex samples are enhanced. Experiments on three open datasets verify that the proposed density-based algorithm can distinguish simple and complex samples significantly. Besides, our model obtains obvious improvement over the strong baselines.
Information Extraction from visual documents enables convenient and intelligent assistance to end users. We present a Neighborhood-based Information Extraction (NIE) approach that uses contextual language models and pays attention to the local neighborhood context in the visual documents to improve information extraction accuracy. We collect two different visual document datasets and show that our approach outperforms the state-of-the-art global context-based IE technique. In fact, NIE outperforms existing approaches in both small and large model sizes. Our on-device implementation of NIE on a mobile platform that generally requires small models showcases NIE's usefulness in practical real-world applications.
In recent years, inductive graph embedding models, viz., graph neural networks (GNNs), have become increasingly accurate at link prediction (LP) in online social networks. The performance of such models depends strongly on the input node features, which vary across networks and applications. Selecting appropriate node features remains application-dependent and generally an open question. Moreover, owing to privacy and ethical issues, the use of personalized node features is often restricted. In fact, many publicly available datasets from online social networks do not contain any node features (e.g., demography). In this work, we provide a comprehensive experimental analysis which shows that harnessing a transductive technique (e.g., Node2Vec) for obtaining initial node representations, after which an inductive node embedding technique takes over, leads to substantial improvements in link prediction accuracy. We demonstrate that, for a wide variety of GNN variants, node representation vectors obtained from Node2Vec serve as high-quality input features to GNNs, thereby improving LP performance.
Recommender Systems (RS) tend to recommend more popular items instead of the relevant long-tail items. Mitigating such popularity bias is crucial to ensure that less popular but relevant items are part of the recommendation list shown to the user. In this work, we study the phenomenon of popularity bias in session-based RS (SRS) obtained via deep learning (DL) models. We observe that DL models trained on the historical user-item interactions in session logs (having long-tailed item-click distributions) tend to amplify popularity bias. To understand the source of this bias amplification, we consider potential sources of bias at two distinct stages in the modeling process: i. the data-generation stage (user-item interactions captured as session logs), ii. the DL model training stage. We highlight that the popularity of an item has a causal effect on i. user-item interactions via conformity bias, as well as ii. item ranking from DL models via biased training process due to class (target item) imbalance. While most existing approaches in literature address only one of these effects, we consider a comprehensive causal inference framework that identifies and mitigates the effects at both stages. Through extensive empirical evaluation on simulated and real-world datasets, we show that our approach improves upon several strong baselines from literature for popularity bias and long-tailed classification. Ablation studies show the advantage of our comprehensive causal analysis to identify and handle bias in data generation as well as training stages.
There exists a high variability in mobility data volumes across different regions, which deteriorates the performance of spatial recommender systems that rely on region-specific data. In this paper, we propose a novel transfer learning framework called Reformd, for continuous-time location prediction for regions with sparse checkin data. Specifically, we model user-specific checkin-sequences in a region using a marked temporal point process (MTPP) with normalizing flows to learn the inter-checkin time and geo-distributions. Later, we transfer the model parameters of spatial and temporal flows trained on a data-rich origin region for the next check-in and time prediction in a target region with scarce checkin data. We capture the evolving region-specific checkin dynamics for MTPP and spatial-temporal flows by maximizing the joint likelihood of next checkin with three channels (1) checkin-category prediction, (2) checkin-time prediction, and (3) travel distance prediction. Extensive experiments on different user mobility datasets across the U.S. and Japan show that our model significantly outperforms state-of-the-art methods for modeling continuous-time sequences. Moreover, we also show that Reformd can be easily adapted for product recommendations i.e., sequences without any spatial component.
Classification datasets are often biased in observations, leaving only a few observations for minority classes. Our key contribution is detecting and reducing Under-represented (U-) and Over-represented (O-) artifacts arising from dataset imbalance, by proposing a Counterfactual Generative Smoothing approach on both feature-space and data-space, namely CGS_f and CGS_d. Our technical contribution is smoothing majority and minority observations by sampling a majority seed and transferring it to the minority. Our proposed approaches not only outperform the state of the art on both synthetic and real-life datasets, but also effectively reduce both artifact types.
Recommender systems typically operate on high-dimensional sparse user-item matrices. Matrix completion is a very challenging task: predicting one user's interests based on millions of other users, each of whom has seen only a small subset of thousands of items. We propose a Global-Local Kernel-based matrix completion framework, named GLocal-K, that aims to generalise and represent a high-dimensional sparse user-item matrix in a low-dimensional space with a small number of important features. GLocal-K consists of two major stages. First, we pre-train an autoencoder with a local kernelised weight matrix, which transforms the data from one space into the feature space using a 2d-RBF kernel. Then, the pre-trained autoencoder is fine-tuned with the rating matrix produced by a convolution-based global kernel, which captures the characteristics of each item. We apply our GLocal-K model in the extreme low-resource setting, which includes only a user-item rating matrix with no side information. Our model outperforms the state-of-the-art baselines on three collaborative filtering benchmarks: ML-100K, ML-1M, and Douban.
Log anomaly detection, which focuses on detecting anomalous log records, has become an active research problem because of its importance in developing stable and sustainable systems. Currently, many unsupervised log anomaly detection approaches are developed to address the challenge of limited anomalous samples. However, collecting enough data to train an unsupervised model is not practical when the system is newly deployed online. To tackle this challenge, we propose a transferable log anomaly detection (LogTAD) framework that leverages the adversarial domain adaptation technique to make log data from different systems have a similar distribution, so that the detection model is able to detect anomalies from multiple systems. Experimental results show that LogTAD can achieve high accuracy on cross-system anomaly detection by using a small number of logs from the new system.
Error correction is one of the most crucial and time-consuming steps of data preprocessing. State-of-the-art error correction systems leverage various signals, such as predefined data constraints or user-provided correction examples, to fix erroneous values in a semi-supervised manner. While these approaches reduce human involvement to a few labeled tuples, they still need supervision to fix data errors. In this paper, we propose a novel error correction approach to automatically fix data errors of dirty datasets. Our approach pretrains a set of error corrector models on correction examples extracted from the Wikipedia page revision history. It then fine-tunes these models on the dirty dataset at hand without any required user labels. Finally, our approach aggregates the fine-tuned error corrector models to find the actual correction of each data error. As our experiments show, our approach automatically fixes a large portion of data errors of various dirty datasets with high precision.
We propose a semi-supervised learning method called Cformer for automatic clustering of text documents in cases where clusters are described by a small number of labeled examples, while the majority of training examples are unlabeled. We motivate this setting with an application in contextual programmatic advertising, a type of content placement on news pages that does not exploit personal information about visitors but relies on the availability of a high-quality clustering computed on the basis of a small number of labeled samples.
To enable text clustering with little training data, Cformer leverages the teacher-student architecture of Meta Pseudo Labels. In addition to unlabeled data, Cformer uses a small amount of labeled data to describe the target clusters. Our experimental results confirm that the performance of the proposed model improves the state-of-the-art if a reasonable amount of labeled data is available. The models are comparatively small and suitable for deployment in constrained environments with limited computing resources. The source code is available at https://github.com/Aha6988/Cformer.
The rise of online map services drives data owners to outsource spatial data to potentially untrusted database providers. Query results are provided along with verification objects that allow confirming their authenticity. Such authentication schemes have been proposed for several spatial and geometric queries, as well as for median queries in one dimension. However, to date, no authentication mechanism exists for centerpoint queries, which return a point lying in the middle of other points in multidimensional space. In this paper, we propose an authentication scheme for centerpoint queries, grounded on the algorithm for centerpoint queries on a finite planar set of points and authenticated aggregation R-trees and accompanying authenticated aggregation queries. We also provide methods for finding the centerpoint of a subset of the complete data set, and implement a range-based method. Our solution has a worst-case time-complexity of O(n log n) and space-complexity of O(n). Our experimental study confirms these claims.
Recently, self-attentive models have shown promise in sequential recommendation, given their potential to capture user long-term preferences and short-term dynamics simultaneously. Despite their success, we argue that self-attention modules, as a non-local operator, often fail to capture short-term user dynamics accurately due to a lack of inductive local bias. To examine our hypothesis, we conduct an analytical experiment on controlled 'short-term' scenarios. We observe a significant performance gap between self-attentive recommenders with and without local constraints, which implies that short-term user dynamics are not sufficiently learned by existing self-attentive recommenders. Motivated by this observation, we propose a simple framework, Locker, that augments self-attentive recommenders in a plug-and-play fashion. By combining the proposed local encoders with existing global attention heads, Locker enhances short-term user dynamics modeling, while retaining the long-term semantics captured by standard self-attentive encoders. We investigate Locker with five different local methods, outperforming state-of-the-art self-attentive recommenders on three datasets by 17.19% (NDCG@20) on average.
Traffic prediction is a classical spatial-temporal prediction problem with many real-world applications. In general, existing traffic prediction methods capture the complex spatial-temporal features through either an iterative or a non-iterative mechanism. However, the iterative mechanism often causes prediction error accumulation, while the non-iterative mechanism struggles to capture dynamic propagation information. The shortcomings of both mechanisms lead to poor performance in long-term prediction tasks. Targeting the shortcomings of existing methods, in this paper we propose a novel deep learning framework called Long-term Traffic Prediction based on Hybrid Model (LTPHM), which simulates the dynamic transmission of traffic information on the road network by connecting the prediction values of the current step with the next step. Each spatial-temporal module uses graph convolution (GCN) with an adaptive matrix to capture spatial dependence. Besides, we use Gated Dilated Convolution Networks (GDCN) and Gated Linear Unit convolution networks (GLU) to capture temporal dependence. Since LTPHM integrates the advantages of both iterative and non-iterative prediction, it can efficiently capture complex and dynamic spatial-temporal features, especially long-range temporal sequences. Experiments on three real-world traffic datasets demonstrate the effectiveness of our proposed model.
As the source of side information, knowledge graph (KG) plays a critical role in recommender systems. Recently, graph neural networks (GNN) have shown their technical advancements at boosting recommendation performances. Existing GNN-based models mainly focus on aggregation technique and regularization allocation, ignoring the rich entity-aware information hidden in the relation network of KG. In this paper, we explore the relational semantics at the granularity of entities behind a user-item interaction by leveraging knowledge graph, named Entity-aware Collaborative Relation Network (ECRN). Technically, we construct multiple meta-paths from users to entities based on the user-item interaction and item-entity connectivity to obtain user representation, while designing a relation-aware self-attention mechanism to aggregate collaborative signals of items. Empirical results on three benchmarks show that ECRN significantly outperforms state-of-the-art baselines.
Nearest neighbor search is a fundamental problem in data management and analytics with vast applications. However, a seminal paper by Beyer et al. demonstrated the curse of dimensionality: under certain conditions in high dimensionality, all data points tend to be equidistant, and the nearest neighbor problem becomes meaningless. This influential work has spawned a series of investigations of the concentration phenomenon, which, for the most part, are limited to vector spaces. In this paper, we extend this investigation to sequence data, which have no inherent notion of dimensions or attributes. As similarity measures we consider the commonly used edit distance and longest common subsequence. We perform theoretical analysis and prove conditions under which sequences concentrate. We also conduct experiments on synthetic data to verify the theoretical findings. Rather than the curse of dimensionality that previous studies demonstrate, we thus demonstrate a curse of length for sequential data.
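The concentration phenomenon studied here can be illustrated with a small synthetic experiment: draw random sequences, compute edit distances from a query, and watch the relative contrast between the farthest and nearest neighbor shrink as the length grows. A minimal sketch; the alphabet, sample sizes, and the `relative_contrast` measure are illustrative choices, not the paper's experimental setup:

```python
import random

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,           # deletion
                         cur[j - 1] + 1,        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def relative_contrast(length, n_points=50, alphabet="ACGT", seed=0):
    # Relative contrast (d_max - d_min) / d_min of one query against a
    # random dataset; values shrinking toward 0 indicate concentration.
    rng = random.Random(seed)
    mk = lambda: "".join(rng.choice(alphabet) for _ in range(length))
    query = mk()
    dists = [edit_distance(query, mk()) for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

contrast_short = relative_contrast(8)
contrast_long = relative_contrast(128)
```

With random sequences over a small alphabet, the longer sequences exhibit markedly lower relative contrast, mirroring the "curse of length" the abstract describes.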
Practical news feed platforms generate a hybrid list of news articles and advertising items (e.g., products, services, or information), and many platforms optimize the positions of news articles and advertisements independently. However, as we show in this study, they should be arranged with careful consideration of each other, since user behavior toward advertisements is significantly affected by the news articles. This paper investigates the effect of news articles on users' ad consumption and shows the dependency between news and ad effectiveness. We conducted a service log analysis and showed that sessions with high-quality news article exposure had more ad consumption than those with low-quality news article exposure. Based on this result, we hypothesized that exposure to high-quality articles leads to a high ad consumption rate. Thus, we conducted million-scale A/B testing to investigate the effect of high-quality articles on ad consumption, in which we prioritized high-quality articles in the ranking for the treatment group. The A/B test showed that the treatment group's ad consumption, such as the number of clicks, conversions, and sales, increased significantly while the number of article clicks decreased. Stratified analysis further showed that users who prefer social or economic topics had higher ad consumption. These insights regarding news articles and advertisements will help optimize news and ad effectiveness in rankings, considering their mutual influence.
Pre-Trained Models (PTMs) can learn general knowledge representations and perform well in Natural Language Processing (NLP) tasks. For the Chinese language, several PTMs have been developed; however, most existing methods concentrate on modern Chinese and are not ideal for processing classical Chinese due to the differences in grammar and semantics between the two forms. In this paper, in order to process both forms of Chinese uniformly, we propose a novel Classical and Modern Chinese pre-trained language model (CANCN-BERT), an extension of BERT with the advantage of effectively processing both classical and modern Chinese. Form-aware pre-training tasks are elaborately designed to train our model, so as to better adapt it to classical and modern Chinese corpora. Moreover, we define a joint model, proposing dedicated optimization methods through different paths under the control of a switch mechanism. Our model merges characteristics of both classical and modern Chinese, which can adequately and efficiently enhance the representation ability for both forms. Extensive experiments show that our model outperforms baseline models on processing classical and modern Chinese, achieving significant and consistent improvements. The results of ablation experiments also demonstrate the effectiveness of each module.
As data sources become ever more numerous, classification of multi-view data represented by heterogeneous features arises in many data mining applications. Most existing methods either directly concatenate all views or tackle each view separately, neglecting the correlation and diversity among views. Moreover, they often involve an extra hyper-parameter that needs to be manually tuned, limiting the applicability of the models. In this paper, we present a robust supervised learning framework for multi-view classification, seeking a better representation and fusion of multiple views. Specifically, our framework discriminates different views with adaptively optimized view-wise weight factors and coalesces them to learn a joint projection subspace compatible across multiple views in an adaptive-weighting manner, thereby avoiding the intractable hyper-parameter. Meanwhile, the consensus and complementary information of the original views can be naturally integrated into the learned subspace, in turn enhancing the discrimination of the subspace for subsequent classification. An efficient convergent algorithm is developed to iteratively optimize the formulated framework. Experiments on real datasets demonstrate the effectiveness and superiority of the proposed method.
Anomaly detection on graphs plays a significant role in various domains, including cybersecurity, e-commerce, and financial fraud detection. However, existing graph anomaly detection methods usually consider graphs at a single scale, which limits their capability to capture anomalous patterns from different perspectives. Towards this end, we introduce a novel graph anomaly detection framework, namely ANEMONE, to simultaneously identify anomalies at multiple graph scales. Concretely, ANEMONE first leverages a graph neural network backbone encoder with multi-scale contrastive learning objectives to capture the pattern distribution of graph data by learning the agreements between instances at the patch and context levels concurrently. Then, our method employs a statistical anomaly estimator to evaluate the abnormality of each node according to the degree of agreement from multiple perspectives. Experiments on three benchmark datasets demonstrate the superiority of our method.
Targeted Opinion Word Extraction (TOWE) is a subtask of aspect-based sentiment analysis, which aims to identify the corresponding opinion terms for given opinion targets in a review. To solve the TOWE task, recent works mainly focus on learning a target-aware context representation that infuses target information into the context representation by using various neural networks. However, it has been unclear how to encode the target information into BERT, a powerful pre-trained language model. In this paper, we propose a novel TOWE model, RABERT (Relation-Aware BERT), that can fully utilize BERT to obtain target-aware context representations. To introduce the target information into BERT layers clearly, we design a simple but effective encoding method that adds target markers indicating the opinion targets to the sentence. In addition, we find that neighbor word information is also important for extracting the opinion terms. Therefore, RABERT employs a target-sentence relation network and a neighbor-aware relation network to consider both the opinion target and the neighbor word information. Our experimental results on four benchmark datasets show that RABERT significantly outperforms the other baselines and achieves state-of-the-art performance. We also demonstrate the effectiveness of each component of RABERT in further analysis.
There are many natural questions that are best answered with a list. We address the problem of answering such questions using lists that occur on the Web, i.e., List Question Answering (ListQA). The diverse formats of lists on the Web make this task challenging. We describe state-of-the-art methods for list extraction and ranking that also consider the text surrounding the lists as context. Due to the lack of realistic public datasets for ListQA, we present three novel datasets that together are realistic, reproducible, and test out-of-domain generalization. We benchmark the above steps on these datasets, with and without context. On the hardest setting (realistic and out-of-domain), we achieve an end-to-end Precision@1 of 51.28% and HITs@5 of 79.38%, demonstrating the difficulty of the task and quantifying the immediate opportunity for improvement. We highlight some future directions through error analysis and release the datasets for further research.
App usage prediction is important for smartphone system optimization to enhance user experience. Existing modeling approaches utilize historical app usage logs along with a wide range of semantic information to predict app usage; however, they are only effective in certain scenarios and cannot be generalized across different situations. This paper addresses this problem by developing the Contextual and Semantic Embedding model for App Usage Prediction (CoSEM), which leverages the integration of 1) semantic information embedding and 2) contextual information embedding based on the historical app usage of individuals. Extensive experiments show that the combination of semantic information and historical app usage information enables our model to outperform the baselines on three real-world datasets, achieving MRR scores of over 0.55, 0.57, and 0.86, and Hit rates of more than 0.71, 0.75, and 0.95, respectively.
Passage retrievers based on neural language models have recently achieved significant performance improvements in ranking tasks. Such ranking models have the advantage of capturing the contextual features of queries and documents better than traditional keyword-based methods. However, these deep learning-based models are limited by the large amounts of training data required. We propose a new fine-tuning method based on the masked language model (MLM) objective typically used in pre-trained language models. Our model improves ranking performance using the MLM while requiring less training data via data augmentation. The proposed approach applies self-supervised learning to information retrieval without needing additional, expensive labeled data. In addition, because masking important terms during the fine-tuning stage can undermine ranking performance, the importance of each term and sentence in a passage is calculated using the BM25 scheme and applied to the fine-tuning task so that more important terms are masked less often. Our model is trained on the MS MARCO re-ranking dataset and achieves the state-of-the-art MRR@10 on the leaderboard among non-ensemble methods.
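The importance-weighted masking idea can be sketched as follows: compute a BM25-style IDF weight per term, then mask each token with a probability that decreases with its importance. This is a simplified stand-in, not the paper's exact scheme; plain IDF instead of full BM25 scoring, whitespace tokenization, and the `base_rate` knob are all assumptions:

```python
import math
import random

def idf_weights(corpus):
    # BM25-style inverse document frequency for every term in the corpus.
    n = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

def importance_aware_mask(tokens, weights, base_rate=0.3, seed=0):
    # Mask each token with a probability that shrinks as its importance
    # grows, so high-IDF (important) terms are masked less often.
    rng = random.Random(seed)
    max_w = max(weights.values())
    return ["[MASK]" if rng.random() < base_rate * (1 - weights.get(t, 0) / max_w)
            else t for t in tokens]

corpus = ["the cat sat", "the dog ran", "the bird flew"]
weights = idf_weights(corpus)
```

A rare term like "cat" receives a higher weight than the ubiquitous "the", so at any masking rate it is protected more often.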
Transformer-based rankers have shown state-of-the-art performance. However, their self-attention operation is mostly unable to process long sequences. One common approach to training these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of the documents. To address this problem, we propose query-driven segment selection from long documents to build training data. The segment selector provides relevant samples with more accurate labels and non-relevant samples that are harder to predict. The experimental results show that a basic BERT-based ranker trained with the proposed segment selector significantly outperforms one trained on heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that can process longer input sequences. Our findings open up a new direction for designing efficient transformer-based rankers.
An SQL assertion is a declarative statement about data that must always be satisfied in any database state. Assertions were introduced in the SQL-92 standard, but no commercial DBMS has implemented them so far. Some approaches have been proposed to incrementally determine whether a transaction violates an SQL assertion, but they assume that transactions are applied in isolation, hence not considering concurrent transaction executions that collaborate to violate an assertion. This is the main obstacle to a commercial implementation. To handle this problem, we have developed a technique for efficiently serializing concurrent transactions that might interact to violate an SQL assertion.
Interactive image retrieval involves users searching a collection of images to satisfy their subjective information needs. However, even large image collections are finite and therefore may not be able to satisfy users. An alternate approach would be to explore a generative adversarial network (GAN) and model users' search intents directly in terms of the latent space used by the GAN to generate images. In this article, we present a simulation study exploring the performance of Gaussian Process bandits in the context of interactive GAN exploration. We used recent advances in interpretable GAN controls to investigate the scalability of different approaches in terms of image space dimensionality. While we present several experiments with promising results, none of the approaches tested scale sufficiently well to explore the entire GAN image space.
Data often accumulates in tabular format with many attribute items, and prediction using machine learning adds value to such data for business. However, existing machine learning methods for tabular data take only attribute values as input, which limits accuracy. Therefore, we propose an inference method that inputs both the attribute values and values aggregated from the tabular data, grouped by each attribute item. In an experiment, we compared our proposed method with AutoGluon-Tabular on AutoML benchmark datasets. Our proposed method achieved the highest accuracy on 21 out of 39 datasets.
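The idea of feeding aggregated table values alongside raw attribute values can be sketched with a simple group-by feature: attach to each row the mean of a numeric column over all rows sharing a categorical attribute. This is an illustrative reading of the approach, not the paper's implementation; the function and column names are hypothetical:

```python
from collections import defaultdict

def add_group_aggregates(rows, key, value):
    # Augment each row with the mean of `value` over all rows sharing
    # the same `key` attribute (a simple aggregated-table feature).
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[key]] += r[value]
        counts[r[key]] += 1
    out = []
    for r in rows:
        enriched = dict(r)
        enriched[f"{value}_mean_by_{key}"] = sums[r[key]] / counts[r[key]]
        out.append(enriched)
    return out

rows = [{"shop": "a", "sales": 10}, {"shop": "a", "sales": 20},
        {"shop": "b", "sales": 5}]
enriched = add_group_aggregates(rows, "shop", "sales")
```

The enriched rows carry both the original attribute values and the group-level statistic, which a downstream model can consume as extra input features.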
The problem of graph alignment is to find corresponding nodes between a pair of graphs. Past work has treated the problem in a monolithic fashion, with the graph as input and the alignment as output, offering limited opportunities to adapt the algorithm to task requirements or input graph characteristics. Recently, node embedding techniques have been utilized for graph alignment. In this paper, we study two state-of-the-art graph alignment algorithms utilizing node representations, CONE-Align and GRASP, and describe them in terms of an overarching modular framework. In a targeted experimental study, we exploit this modularity to develop enhanced algorithm variants that are more effective in the alignment task.
Supervised link prediction aims at finding missing links in a network by learning, directly from the data, suitable criteria for classifying links as existent or non-existent. Recently, along this line, subgraph-based methods that learn a function mapping subgraph patterns to link existence have seen great success. However, these approaches still have drawbacks. First, the construction of the subgraph relies on an arbitrary, often ineffective node selection. Second, the inability of such approaches to adaptively evaluate node importance reduces flexibility in node feature aggregation, an important step in subgraph classification. To address these issues, a novel graph-classification-based link-prediction model is proposed: Attention and Re-weighting based subgraph Classification for Link prediction (ARCLink). ARCLink first extracts a subgraph around the two nodes whose link is to be predicted, by network re-weighting, i.e., attributing a weight in the range 0-1 to all links of the original network, and then learns a function mapping the subgraph to a continuous vector for classification, thus revealing the nature (non-existence/existence) of the unknown link. For learning the mapping function, ARCLink generates a vector representation of the extracted subgraph by hierarchically aggregating node features according to node importance. In contrast to previous studies that either fully ignore node importance or compute it with fixed schemes, ARCLink learns node importance adaptively by employing an attention mechanism. Through extensive experiments, ARCLink was validated on a series of real-world networks against state-of-the-art link prediction methods, consistently demonstrating its superior performance.
The recent advent of cross-lingual embeddings, such as multilingual BERT (mBERT), provides a strong baseline for zero-shot cross-lingual transfer. There is also increasing research attention on reducing the alignment discrepancy of cross-lingual embeddings between source and target languages by generating code-switched sentences, substituting randomly selected words in the source language with their counterparts in the target language. Although these approaches improve performance, naively code-switched sentences have inherent limitations. In this paper, we propose SCOPA, a novel technique to improve the performance of zero-shot cross-lingual transfer. Instead of using the embeddings of code-switched sentences directly, SCOPA mixes them softly with the embeddings of the original sentences. In addition, SCOPA utilizes an additional pairwise alignment objective, which aligns the vector differences of word pairs instead of word-level embeddings, in order to transfer contextualized information between different languages while preserving language-specific information. Experiments on the PAWS-X and MLDoc datasets show the effectiveness of SCOPA.
Fact verification datasets are typically constructed using crowdsourcing techniques due to the lack of text sources with veracity labels. However, the crowdsourcing process often produces undesired biases in data that cause models to learn spurious patterns. In this paper, we propose CrossAug, a contrastive data augmentation method for debiasing fact verification models. Specifically, we employ a two-stage augmentation pipeline to generate new claims and evidence from existing samples. The generated samples are then paired cross-wise with the original pair, forming contrastive samples that encourage the model to rely less on spurious patterns and learn more robust representations. Experimental results show that our method outperforms the previous state-of-the-art debiasing technique by 3.6% on the debiased extension of the FEVER dataset, with a total performance boost of 10.13% over the baseline. Furthermore, we evaluate our approach in data-scarce settings, where models can be more susceptible to biases due to the lack of training data. Experimental results demonstrate that our approach is also effective at debiasing in these low-resource conditions, exceeding the baseline performance on the Symmetric dataset with just 1% of the original data.
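The cross-wise pairing step might be sketched as follows: given an original (claim, evidence) pair and generated counterparts, form four samples by crossing originals with generations, flipping labels on the mixed pairs so the same surface text appears under both labels. This is a hedged reading of the pipeline; the exact label assignment in CrossAug may differ:

```python
def cross_pair(claim, evidence, new_claim, new_evidence, label, flipped_label):
    # Pair original and generated claims/evidence cross-wise, producing
    # contrastive samples in which near-identical text carries different
    # labels, discouraging reliance on claim-only spurious cues.
    return [
        (claim, evidence, label),                  # original pair
        (new_claim, new_evidence, label),          # both generated
        (claim, new_evidence, flipped_label),      # mixed pair
        (new_claim, evidence, flipped_label),      # mixed pair
    ]

samples = cross_pair("c", "e", "c_neg", "e_neg", "SUPPORTS", "REFUTES")
```

Each original pair thus expands into four training samples, two per label, forming the contrastive structure the abstract describes.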
Knowledge Distillation (KD), which transfers the knowledge of a well-trained large model (teacher) to a small model (student), has become an important area of research for the practical deployment of recommender systems. Recently, Relaxed Ranking Distillation (RRD) has shown that distilling the ranking information in the recommendation list significantly improves performance. However, the method still has limitations in that 1) it does not fully utilize the prediction errors of the student model, which makes training less efficient, and 2) it only distills user-side ranking information, which provides an insufficient view under sparse implicit feedback. This paper presents the Dual Correction strategy for Distillation (DCD), which transfers ranking information from the teacher model to the student model more efficiently. Most importantly, DCD uses the discrepancy between the teacher's and the student's predictions to decide which knowledge to distill. By doing so, DCD essentially provides learning guidance tailored to "correcting" what the student model has failed to predict accurately. This process is applied to transfer ranking information from both the user side and the item side to address sparse implicit user feedback. Our experiments show that the proposed method outperforms state-of-the-art baselines, and ablation studies validate the effectiveness of each component.
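The discrepancy-based selection can be sketched as comparing teacher and student rankings and picking the items with the largest rank gaps as correction targets. A minimal illustration; the rank-gap criterion and the `k` cutoff are simplifying assumptions, not the paper's exact formulation:

```python
def discrepancy_correction_targets(teacher_scores, student_scores, k=1):
    # Rank items under both models; items the student ranks far below
    # the teacher become "push up" targets, and vice versa ("push down").
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for pos, i in enumerate(order):
            r[i] = pos
        return r
    rt, rs = ranks(teacher_scores), ranks(student_scores)
    gap = [rs[i] - rt[i] for i in range(len(rt))]  # >0: student under-ranks
    by_gap = sorted(range(len(gap)), key=lambda i: -gap[i])
    return by_gap[:k], by_gap[-k:]  # push-up ids, push-down ids

push_up, push_down = discrepancy_correction_targets(
    [0.9, 0.1, 0.5, 0.3],   # teacher ranks item 0 first
    [0.1, 0.9, 0.5, 0.3],   # student ranks item 1 first
    k=1)
```

Here the student badly under-ranks item 0 and over-ranks item 1, so those two items receive the correction signal rather than the already-agreed-upon items.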
To effectively classify graph instances, graph neural networks need the capability to capture the part-whole relationships existing in a graph. A capsule is a group of neurons representing complicated properties of entities, and capsules have shown their advantages in traditional convolutional neural networks. This paper proposes a novel Capsule Graph Neural Network that uses the EM routing mechanism (CapsGNNEM) to generate high-quality graph embeddings. Experimental results on a number of real-world graph datasets demonstrate that the proposed CapsGNNEM outperforms nine state-of-the-art models in graph classification tasks.
Nowadays, it is common for one natural person to join multiple social networks to enjoy different types of services. User identity linkage (UIL), which aims to link identical identities across different social platforms, has attracted increasing research interest recently. Most existing approaches focus on sophisticated architecture engineering of the linkage model but ignore the challenge of hubness in the post-processing nearest neighbor search phase. Hubness manifests as some identities in one social platform, called hubs, being extraordinarily close to identities in the other platform, which degrades alignment performance. Different from existing heuristic methods, in this paper we propose a hubness-aware user identity linkage model, HAUIL, to smoothly learn hubless linkage signals. A carefully designed objective function is presented to explicitly mitigate the hubness information from the pre-learned linkage guidance. HAUIL can be easily adapted to most existing UIL models. Empirically, we evaluate HAUIL over multiple publicly available datasets, and the experimental results demonstrate its superiority.
Graph-based algorithms have drawn much attention thanks to their impressive success in semi-supervised setups. For better model performance, previous studies have learned to transform the topology of the input graph. However, these works only focus on optimizing the original nodes and edges, leaving the direction of augmenting existing data insufficiently explored. In this paper, we propose a novel heuristic pre-processing technique, namely Local Label Consistency Strengthening (LLCS), which automatically expands new nodes and edges to refine the label consistency within a dense subgraph. Our framework can effectively benefit downstream models by substantially enlarging the original training set with high-quality generated labeled data and refining the original graph topology. To justify the generality and practicality of LLCS, we couple it with the popular graph convolution network and graph attention network to perform extensive evaluations on three standard datasets. In all setups tested, our method boosts the average accuracy by a large margin of 4.7% and consistently outperforms the state-of-the-art.
Graph Convolutional Networks (GCNs) have become the prevailing approach to efficiently learning representations from graph-structured data. Current GCN models adopt a neighborhood aggregation mechanism based on two primary operations, aggregation and combination. The workload of these two processes is determined by the input graph structure, making the graph input the bottleneck of GCN processing. Meanwhile, a large amount of task-irrelevant information in the graphs can hurt the model's generalization performance. This opens the opportunity to study how to remove redundancy from graphs. In this paper, we aim to accelerate GCN models by removing the task-irrelevant edges in the graph. We present AdaptiveGCN, an efficient and supervised graph sparsification framework. AdaptiveGCN adopts an edge predictor module that learns edge selection strategies from downstream task feedback signals for each GCN layer separately and adaptively during training, then performs inference with only the selected edges at test time to speed up GCN computation. The experimental results indicate that AdaptiveGCN yields 43% (on CPU) and 39% (on GPU) average GCN model speed-up with comparable model performance on public graph learning benchmarks.
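The edge-selection idea can be sketched as scoring edges with a lightweight predictor and keeping only the top-scoring fraction for message passing. In AdaptiveGCN the predictor is trained per layer from task feedback; here an untrained endpoint-feature dot product serves as an illustrative stand-in:

```python
def edge_scores(edges, node_feats):
    # Score each edge by the dot product of its endpoint features
    # (an untrained stand-in for a learned edge-predictor module).
    return [sum(a * b for a, b in zip(node_feats[u], node_feats[v]))
            for u, v in edges]

def sparsify_edges(edges, scores, keep_ratio=0.5):
    # Keep only the top-scoring fraction of edges for message passing.
    k = max(1, int(len(edges) * keep_ratio))
    ranked = sorted(zip(edges, scores), key=lambda p: -p[1])
    return [e for e, _ in ranked[:k]]

feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
edges = [(0, 1), (0, 2)]
kept = sparsify_edges(edges, edge_scores(edges, feats), keep_ratio=0.5)
```

Since aggregation cost scales with the number of edges, dropping the low-scoring half directly reduces per-layer work, which is the source of the reported speed-ups.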
Cross-modal retrieval technology can help people quickly achieve mutual information between cooking recipes and food images. Both the embeddings of the image and the recipe consist of multiple representation subspaces. We argue that multiple aspects of the recipe are related to multiple regions of the food image. It is challenging to improve cross-modal retrieval quality by making full use of the implicit connections between the multiple subspaces of recipes and images. In this paper, we propose a multi-subspace implicit alignment cross-modal retrieval framework for recipes and images. Our framework learns multi-subspace information about cooking recipes and food images with multi-head attention networks; the implicit alignment at the subspace level helps narrow the semantic gap between recipe embeddings and food image embeddings; triplet loss and adversarial loss are combined to support cross-modal learning. The experimental results show that our framework significantly outperforms state-of-the-art methods in terms of MedR and R@K on Recipe1M.
Cross-modal retrieval is a classic task in the multimedia community, which aims to search for semantically similar results from different modalities. The core of cross-modal retrieval is to learn the most correlated features in a common feature space for the multi-modal data so that the similarity can be directly measured. In this paper, we propose a novel model using optimal transport for bridging the heterogeneity gap in cross-modal retrieval tasks. Specifically, we calculate the optimal transport plans between feature distributions of different modalities and then minimize the transport cost by optimizing the feature embedding functions. In this way, the feature distributions of multi-modal data can be well aligned in the common feature space. In addition, our model combines the complementary losses in different levels: 1) semantic level, 2) distributional level, and 3) pairwise level for improving cross-modal retrieval performance. In extensive experiments, our method outperforms many other cross-modal retrieval methods, which proves the efficacy of using optimal transport in cross-modal retrieval tasks.
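The distribution-alignment step relies on entropy-regularized optimal transport, which is typically computed with Sinkhorn iterations: scale a kernel matrix alternately so its rows and columns match the two (here uniform) marginals, then read off the transport plan and cost. A small pure-Python sketch; in practice this would run on GPU tensors inside the training loop:

```python
import math

def sinkhorn(cost, n_iters=200, eps=0.1):
    # Entropy-regularized optimal transport between two uniform
    # distributions, via Sinkhorn scaling of the kernel K = exp(-C/eps).
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale to match the row and column marginals.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    transport_cost = sum(plan[i][j] * cost[i][j]
                         for i in range(n) for j in range(m))
    return plan, transport_cost

# Two matched pairs of cross-modal features: cheap on the diagonal.
plan, cost = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

Minimizing `transport_cost` with respect to the embedding networks then pulls the two modalities' feature distributions together, which is the alignment mechanism the abstract describes.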
We study the task of span-level emotion cause analysis (SECA), which is focused on identifying the specific emotion cause span(s) triggering a certain emotion in the text. Compared to the popular clause-level emotion cause analysis (CECA), it is a finer-grained emotion cause analysis (ECA) task. In this paper, we design a BERT-based graph attention network for emotion cause span identification. The proposed model takes advantage of the structure of BERT to capture the relationship information between emotion and text, and utilizes a graph attention network to model the structural information of the text. Our SECA method can also be easily used for extracting clause-level emotion causes for CECA. Experimental results show that the proposed method consistently outperforms state-of-the-art ECA methods on the benchmark emotion cause dataset.
This paper addresses the task of span-level emotion cause analysis (SECA). It is a finer-grained emotion cause analysis (ECA) task, which aims to identify the specific emotion cause span(s) behind certain emotions in text. We formalize SECA as a sequence tagging task and design several variants of neural network-based sequence tagging models to extract specific emotion cause span(s) in the given context. These models combine different types of encoding and decoding approaches. Furthermore, to make our models more "emotionally sensitive", we utilize the multi-head attention mechanism to enhance the representation of the context. Experimental evaluations conducted on two benchmark datasets demonstrate the effectiveness of the proposed models.
OpenStreetMap (OSM) is a free and openly-editable database of geographic information. Over the years, OSM has evolved into the world's largest open knowledge base of geospatial data, and protecting OSM from the risk of vandalized and falsified information has become paramount to ensuring its continued success. However, despite the increasing usage of OSM and a wide interest in vandalism detection on open knowledge bases such as Wikipedia and Wikidata, OSM has not attracted as much attention from the research community, partially due to a lack of publicly available vandalism corpus. In this paper, we report on the construction of the first OSM vandalism corpus, and release it publicly. We describe a user embedding approach to create OSM user embeddings and add embedding features to a machine learning model to improve vandalism detection in OSM. We validate the model against our vandalism corpus, and observe solid improvements in key metrics. The validated model is deployed into production for vandalism detection on Daylight Map.
Methods that learn representations of nodes in a graph play an important role in network analysis. Most existing methods of graph representation learning embed each node in a graph as a single vector in a low-dimensional continuous space. However, these methods have a crucial limitation: they do not model the uncertainty of the representation. In this work, inspired by Adversarial Variational Bayes (AVB), we propose GraphAVB, a probabilistic generative model to learn node representations that preserve connectivity patterns and capture the uncertainties in the graph. Unlike Graph2Gauss, which embeds each node as a Gaussian distribution, we represent each node as an implicit distribution parameterized by a neural network in the latent space, which is more flexible and expressive for capturing the complex uncertainties in real-world graph-structured datasets. To perform the designed variational inference algorithm with neural samplers, we introduce an auxiliary discriminative network that infers the log probability ratio terms in the objective function and allows us to cast maximizing the objective function as a two-player game. Experimental results on multiple real-world graph datasets demonstrate the effectiveness of GraphAVB, which outperforms many competitive baselines on the task of link prediction. The superior performance of GraphAVB also demonstrates that downstream tasks can benefit from the captured uncertainty.
Most existing aspect-based sentiment analysis (ABSA) research efforts are devoted to extracting the aspect-dependent sentiment features from the sentence towards the given aspect. However, it is observed that about 60% of the testing aspects in commonly used public datasets are unknown to the training set. That is, some sentiment features carry the same polarity regardless of the aspects they are associated with (aspect-invariant sentiment), which props up the high accuracy of existing ABSA models when inevitably inferring sentiment polarities for those unknown testing aspects. Therefore, in this paper, we revisit ABSA from a novel perspective by deploying a novel supervised contrastive learning framework to leverage the correlation and difference among different sentiment polarities and between different sentiment patterns (aspect-invariant/-dependent). This allows improving the sentiment prediction for (unknown) testing aspects in the light of distinguishing the roles of valuable sentiment features. Experimental results on 5 benchmark datasets show that our proposed approach substantially outperforms state-of-the-art baselines in ABSA. We further extend existing neural network-based ABSA models with our proposed framework and achieve improved performance.
Spatial-temporal prediction is a critical problem for intelligent transportation, helpful for tasks such as traffic control and accident prevention. Previous studies rely on large-scale traffic data collected from sensors. However, it is unlikely that sensors can be deployed in all regions due to device and maintenance costs. This paper addresses the problem via outdoor cellular traffic distilled from over two billion daily records of a telecom company, since outdoor cellular traffic induced by user mobility is highly correlated with transportation traffic. We study road intersections in urban areas and aim to predict the future outdoor cellular traffic of all intersections given their historic traffic. Furthermore, we propose a new model for multivariate spatial-temporal prediction, mainly consisting of two extended graph attention networks (GATs). The first GAT explores correlations among multivariate cellular traffic, while the second incorporates the attention mechanism into graph propagation to capture spatial dependencies more efficiently. Experiments show that the proposed model significantly outperforms state-of-the-art methods on our dataset.
Existing data-driven methods handle short text generation well. However, when applied to long-text generation scenarios such as story generation or commercial advertising text generation, these methods may produce illogical and uncontrollable texts. To address these issues, we propose a graph-based grouping planner (GGP) following the idea of first-plan-then-generate. Specifically, given a collection of key phrases, GGP first encodes these phrases into an instance-level sequential representation and a corpus-level graph-based representation separately. With these two synergistic representations, we then regroup the phrases into a fine-grained plan, based on which we generate the final long text. We conduct experiments on three long text generation datasets, and the results reveal that GGP significantly outperforms baselines, showing that GGP can control long text generation by deciding what to say and in what order.
In interactive IR (IIR), users often pursue different goals (e.g. exploring a new topic, finding a specific known item) at different search iterations and thus may evaluate system performance differently. Without a state-aware approach, it would be extremely difficult to simulate and achieve real-time adaptive search evaluation and recommendation. To address this gap, our work identifies users' task states from interactive search sessions and meta-evaluates a series of online and offline evaluation metrics under varying states, based on a user study dataset consisting of 1548 unique query segments from 450 search sessions. Our results indicate that: 1) users' individual task states can be identified and predicted from search behaviors and implicit feedback; 2) the effectiveness of mainstream evaluation measures (measured by their respective correlations with user satisfaction) varies significantly across task states. This study demonstrates the implicit heterogeneity in user-oriented IR evaluation and connects studies on complex search tasks with evaluation techniques. It also informs future research on the design of state-specific, adaptive user models and evaluation metrics.
Recently, Deep Neural Networks (DNNs) have made remarkable progress in text classification but still require large amounts of labeled data. To train high-performing models at minimal annotation cost, active learning selects and labels the most informative samples, yet measuring the informativeness of samples for DNNs remains challenging. In this paper, inspired by the piece-wise linear interpretability of DNNs, we propose a novel Active Learning with DivErse iNterpretations (ALDEN) approach. Using local interpretations in DNNs, ALDEN identifies the linearly separable regions of samples. It then selects samples according to the diversity of their local interpretations and queries their labels. For text classification, we choose the word with the most diverse interpretations to represent the whole sentence. Extensive experiments demonstrate that ALDEN consistently outperforms several state-of-the-art deep active learning methods.
Existing systems dealing with online complaints provide a final decision without explanations. We propose to analyse the complaint text of internet fraud in a fine-grained manner. Since a complaint text includes multiple clauses with various functions, we propose to identify the role of each clause and classify them into different types of fraud elements. We construct a large labeled dataset originating from a real finance service platform. We build an element identification model on top of BERT and propose two additional modules that exploit the context of the complaint text for better element label classification, namely a global context encoder and a label refiner. Experimental results show the effectiveness of our model.
While demographic attributes, such as age, gender, and location, have been extensively studied, previous studies usually combine different sources of data, such as the user's biography, pictures, posts, and network, to obtain reasonable inference accuracy. However, it is not always practical to collect all these forms of data. Therefore, in this paper, we consider methods for inferring age that use only Twitter posts (tweet text and emojis). We propose a hierarchical attention neural model that integrates independent linguistic knowledge gained from text and emojis when making a prediction. This hierarchical model captures the intra-post relationships between these different post components, as well as the inter-post relationships of a user's posts. Our empirical evaluation on a dataset generated from Wikidata demonstrates that our model achieves better performance than state-of-the-art models, and still performs well when the number of posts per user in the training data is reduced.
Understanding inactive users is key to user growth and engagement for many Internet companies. However, learning inactive users' representations and preferences is challenging because many of their features are missing and positive responses or labels are insufficient. In this paper, we propose a cross-domain learning approach that recommends customized items specifically to inactive users by leveraging knowledge of active users. In particular, we represent users, whether active or inactive, by their friends' browsing behaviors, using a graph neural network (GNN) layer atop a heterogeneous graph defined on social networks (user-user friendships) and browsing behaviors (user-page clicks). We jointly optimize the learning tasks of active users in the source domain and inactive users in the target domain based on domain-invariant features extracted from the embedding of our GNN layer; these features are learned to benefit both tasks and are indiscriminate with respect to the shift between the domains. Extensive experiments show that our approach captures the preferences of inactive users well on both public data and real-world data at Alipay.
Federated learning aims to protect users' privacy while performing data analysis across different participants. However, it is challenging to guarantee training efficiency on heterogeneous systems due to varying computational capabilities and communication bottlenecks. In this work, we propose FedSkel to enable computation-efficient and communication-efficient federated learning on edge devices by updating only the model's essential parts, named skeleton networks. FedSkel is evaluated on real edge devices with imbalanced datasets. Experimental results show that it achieves up to 5.52x speedups for CONV layers' back-propagation and 1.82x speedups for the whole training process, and reduces communication cost by 64.8%, with negligible accuracy loss.
Users usually consult the manufacturer or the internet when they encounter operation questions about an electronics product. In this paper, we propose representing an operation question as a procedure graph and formulate the problem of operation diagnosis as two sub-tasks on top of the graph, namely error node detection and error node correction. We construct the first benchmark for this task and propose a transformer-based model that integrates external knowledge and context information to enhance performance. Experimental results show the effectiveness of our proposed model.
Two-tower neural networks are popular in review-aware recommender systems, in which two encoders are separately employed to learn representations for users and items from reviews. However, such an architecture isolates the information exchange between the two encoders, resulting in suboptimal recommendation accuracy. To this end, we propose a novel two-tower style Neural Recommendation with Cross-modality Mutual Attention (NRCMA), which bridges the user encoder and the item encoder across reviews and ratings, in order to select informative words and reviews and learn better representations for users and items. Extensive experiments on three benchmark datasets demonstrate that cross-modality mutual attention is beneficial to two-tower neural networks and that NRCMA consistently outperforms state-of-the-art review-aware item recommendation techniques.
Financial documents often contain rich domain information, such as named entities, which can indicate the documents' classification categories. Existing classification methods either ignore such financial domain information, achieving suboptimal performance, or train document representations in supervised ways at expensive data labeling cost. In this paper, we propose to leverage domain information to improve classification performance for financial documents via a graph representation learning model, namely G-MoCo, based on unsupervised graph momentum contrast. With G-MoCo, we extract latent features from massive unlabeled raw data and then use the learned representations for document classification. Compared with state-of-the-art baselines, representations learned by our method improve performance by significant margins on a financial document dataset and three non-financial public graph datasets.
Label Smoothing is a widely used technique in many areas. It can prevent the network from becoming over-confident, but it assumes that the prior distribution over all classes is uniform. We abandon this assumption and propose a new smoothing method, called Smoothing with Fake Label, which shares a part of the prediction probability with a new fake class. Our experimental results show that the method can increase the performance of models on most tasks and outperforms Label Smoothing on text classification and cross-lingual transfer tasks.
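As a minimal illustration of the idea described above (an illustrative sketch, not the authors' code; the function names and the choice of eps are assumptions), the fake-label target can be built alongside the classic label smoothing target for comparison:

```python
import numpy as np

def smooth_with_fake_label(true_class: int, num_classes: int, eps: float = 0.1) -> np.ndarray:
    """Target over num_classes + 1 entries: (1 - eps) on the true class,
    eps on an appended fake class (instead of spreading it over real classes)."""
    target = np.zeros(num_classes + 1)
    target[true_class] = 1.0 - eps
    target[-1] = eps  # the fake class absorbs all the smoothing mass
    return target

def label_smoothing(true_class: int, num_classes: int, eps: float = 0.1) -> np.ndarray:
    """Classic label smoothing for comparison: eps spread uniformly over all classes."""
    target = np.full(num_classes, eps / num_classes)
    target[true_class] += 1.0 - eps
    return target
```

The key difference is that the fake-label variant does not force a uniform prior over the real classes; the smoothed mass lives in a dedicated extra output.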
Maximum inner product search (MIPS), combined with hashing, has become a standard solution to similarity search problems, often achieving an order-of-magnitude speedup over nearest neighbor search (NNS) under similar settings. Motivated by the work along this line, in this paper we develop a sparse binary hashing method for MIPS that preserves pairwise similarities with the support of two asymmetric hash functions. We propose a simple and efficient algorithm that learns one hash function for the query database and another for the search database. We evaluate the proposed method on image retrieval tasks over four benchmark datasets. The empirical results demonstrate the algorithm's promising potential for practical applications in terms of search accuracy and scalability.
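The asymmetric treatment of queries and database points can be illustrated with the standard norm-augmentation reduction from MIPS to NNS (this is not the paper's learned sparse binary hashes, just the classic transform that motivates asymmetric query/database functions; all names are ours):

```python
import numpy as np

def augment_data(X: np.ndarray) -> np.ndarray:
    """Append sqrt(M^2 - ||x||^2) so every database row ends up with norm M
    (M = the maximum row norm). Equal norms make inner-product ranking
    coincide with nearest-neighbour ranking."""
    norms = np.linalg.norm(X, axis=1)
    M = norms.max()
    extra = np.sqrt(np.maximum(M**2 - norms**2, 0.0))
    return np.hstack([X, extra[:, None]])

def augment_query(q: np.ndarray) -> np.ndarray:
    """Append a zero, so <Q(q), P(x)> = <q, x> exactly."""
    return np.append(q, 0.0)

def mips(X: np.ndarray, q: np.ndarray) -> int:
    """Exact MIPS via the asymmetric transforms (in practice, the augmented
    vectors would be hashed/indexed rather than scanned)."""
    Xa, qa = augment_data(X), augment_query(q)
    return int(np.argmax(Xa @ qa))
```

Because the data transform and the query transform differ, the same point is mapped differently depending on its role, which is the essence of asymmetric hashing schemes.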
Representation learning on static graph-structured data has shown a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, in which the edges often change over time. The embeddings of such temporal networks should encode both graph-structured information and the temporally evolving pattern. Existing approaches to learning temporally evolving network representations fail to capture this temporal interdependence. In this paper, we propose Toffee, a novel approach to temporal network representation learning based on tensor decomposition. Our method exploits the tensor-tensor product operator to encode cross-time information, so that periodic changes in evolving networks can be captured. Experimental results demonstrate that Toffee outperforms existing methods on multiple real-world temporal networks, generating more effective embeddings for link prediction.
Dense retrieval, the use of contextualised language models such as BERT to identify documents from a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate using the ANN scores to rank the candidate documents, in order to decrease the number of candidates that are fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut the candidate set down to only 200 documents, we can still obtain an effective ranking without statistically significant differences in effectiveness, while achieving a 2x speedup in efficiency.
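The prune-then-rerank idea above can be sketched generically (a hypothetical helper, not ColBERT's implementation; the scorer names and the cutoff value are assumptions taken from the abstract):

```python
def prune_and_rerank(candidates, approx_score, exact_score, cutoff=200):
    """Rank candidates by a cheap approximate scorer (e.g. ANN scores),
    keep only the top `cutoff`, then re-rank that shortlist with the
    expensive exact scorer. Full scoring cost drops from len(candidates)
    to `cutoff` evaluations."""
    shortlist = sorted(candidates, key=approx_score, reverse=True)[:cutoff]
    return sorted(shortlist, key=exact_score, reverse=True)
```

As long as the approximate scores correlate well with the exact ones, the true top documents survive the cut, which is what the reported lack of significant effectiveness differences suggests.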
Gene Network Graphs (GNGs) are built from biomedical data. Deriving structural information from these graphs remains a prime area of research in biomedical and health informatics. In this paper, we propose Gene Vectors Embodied Deep Attentional Factorization Machines (GDFM) for gene-to-gene interaction prediction. We first initialize GDFM with vector embeddings learned from gene locality configuration and an expression equivalence criterion that preserves their innate similar traits. GDFM uses an attention-based mechanism over different positions to learn the sequence representation before calculating the pairwise factorized interactions. We further use hidden layers, batch normalization, and dropout to stabilize the performance of our deep architecture. An extensive comparison with several state-of-the-art approaches on the Ecoli and Yeast datasets for gene-gene interaction prediction shows the significance of our proposed framework.
Initialization plays a critical role in the training of deep neural networks (DNNs). Existing initialization strategies mainly focus on stabilizing the training process to mitigate gradient vanishing/explosion problems, but they give little consideration to enhancing generalization ability. The Information Bottleneck (IB) theory is a well-known framework for explaining the generalization of DNNs. Guided by the insights of IB theory, we design two criteria for better DNN initialization, and further design a neuron campaign initialization algorithm to efficiently select a good initialization for a neural network on a given dataset. Experiments on the MNIST dataset show that our method leads to better generalization performance with faster convergence.
The ability to skip songs is a core feature of modern online streaming services. Its introduction has led to a new music listening paradigm and has changed the way users interact with these services. Understanding users' skipping activity during listening sessions has thus acquired considerable importance, because this implicit feedback signal can be considered a measure of user satisfaction (or dissatisfaction or lack of interest), affecting engagement with the platforms. Prior work has mainly focused on analysing skipping activity at the individual song level. In this work, we investigate behaviours during entire listening sessions with regard to users' session-based skipping activity. To this end, we propose a data transformation and clustering-based approach to identify and categorise skipping types. Experimental results on a real-world music streaming dataset (Spotify) indicate four main types of session skipping behaviour. A subsequent analysis of short, medium, and long listening sessions demonstrates that these session skipping types are consistent across sessions of varying length. Furthermore, we discuss their distributional differences under various listening contexts, i.e. day type (weekday vs. weekend), time of day, and playlist type.
In this paper, we propose a general framework for mitigating the disparities of predicted classes with respect to secondary attributes within the data (e.g., race, gender, etc.). Our method learns a multi-objective function that, in addition to the primary objective of predicting the primary class labels from the data, employs a clustering-based heuristic to minimize the disparities of the class label distribution with respect to the cluster memberships, under the assumption that each cluster should ideally map to a distinct combination of attribute values. Experiments demonstrate effective mitigation of such biases on a benchmark dataset without the use of annotations of secondary attribute values (the zero-shot case) or with a small number of attribute value annotations (the few-shot case).
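One way to realize such a multi-objective function is to add a cluster-disparity penalty to the primary loss; the sketch below is our own illustration of that heuristic (the penalty form, the weight `lam`, and all names are assumptions, not the paper's formulation):

```python
import numpy as np

def disparity_penalty(probs: np.ndarray, cluster_ids: np.ndarray) -> float:
    """Spread, across clusters, of the mean predicted class distribution.
    Zero when every cluster's predicted label distribution matches the
    overall one, i.e. when predictions carry no cluster-specific skew."""
    rates = np.stack([probs[cluster_ids == c].mean(axis=0)
                      for c in np.unique(cluster_ids)])
    return float(((rates - rates.mean(axis=0)) ** 2).sum())

def multi_objective_loss(probs, y, cluster_ids, lam=1.0):
    """Primary cross entropy plus lam times the clustering-based penalty."""
    ce = -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()
    return ce + lam * disparity_penalty(probs, cluster_ids)
```

Since the clusters stand in for unobserved attribute combinations, minimizing the penalty pushes the class label distribution to be similar across them without needing attribute annotations.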
Fake news spreads widely on social media across various domains, leading to real-world threats in areas such as politics, disaster response, and finance. Most existing approaches focus on single-domain fake news detection (SFND), which leads to unsatisfying performance when these methods are applied to multi-domain fake news detection. As an emerging field, multi-domain fake news detection (MFND) is attracting increasing attention. However, data distributions, such as word frequency and propagation patterns, vary from domain to domain, a phenomenon known as domain shift. Facing serious domain shift, existing fake news detection techniques perform poorly in multi-domain scenarios, so a specialized model for MFND is needed. In this paper, we first build a benchmark fake news dataset for MFND with annotated domain labels, namely Weibo21, which consists of 4,488 fake news and 4,640 real news items from 9 different domains. We further propose an effective Multi-domain Fake News Detection Model (MDFEND) that utilizes a domain gate to aggregate multiple representations extracted by a mixture of experts. Experiments show that MDFEND significantly improves the performance of multi-domain fake news detection. Our dataset and code are available at https://github.com/kennqiang/MDFEND-Weibo21.
Temporal Graph Functional Dependencies (TGFDs) are a class of data quality rules imposing topological and attribute-dependency constraints over a period of time. To make TGFDs usable in practice, we study the TGFD discovery problem and show that the satisfiability, implication, and validation problems for k-bounded TGFDs are in PTIME. We introduce the TGFDMiner algorithm, which discovers minimal, frequent TGFDs. Our evaluation shows the efficiency and effectiveness of TGFDMiner and the utility of TGFDs.
Visual question answering (VQA) is a challenging problem in machine perception, which requires a deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while powerful yet elegant models like graph neural networks (GNNs) have shown great power in reasoning over graph-structured data. In this work, we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features, which is seamlessly integrated with a textual question encoder to generate answers through question-graph conditioning. Moreover, to alleviate the training difficulties of CE-GAT for VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning. Finally, we evaluate the framework on one of the largest available VQA datasets with ground-truth scene graphs (namely, GQA), achieving an accuracy of 77.87%, compared with 63.17% for the state of the art, the neural state machine (NSM). Notably, by leveraging existing scene graphs, our framework is much lighter than end-to-end VQA methods (e.g., about 95.3% fewer parameters than a typical NSM).
Despite their high accuracy, deep neural networks (DNNs) are vulnerable to adversarial examples. Adversarial training is currently the mainstream defense against adversarial examples. However, given the unknown nature of adversarial attacks in real life, this approach has fundamental limitations in practical use, as it is impossible to obtain sufficient adversarial examples for training. In this paper, we propose RanTrain, a simple training approach that augments the original DNN model and training data with a background class of random noise images, without requiring any adversarial examples. Experiments show that RanTrain works effectively across different datasets and DNN structures and significantly increases the robustness of DNNs to adversarial examples.
Existing commercial search engines often struggle to represent different perspectives of a search query. Argument retrieval systems address this limitation of search engines and provide both positive (PRO) and negative (CON) perspectives about a user's information need on a controversial topic (e.g., climate change). The effectiveness of such argument retrieval systems is typically evaluated based on topical relevance and argument quality, without taking into account the often differing number of documents shown for the argument stances (PRO or CON). Therefore, systems may retrieve relevant passages, but with a biased exposure of arguments. In this work, we analyze a range of non-stochastic fairness-aware ranking and diversity metrics to evaluate the extent to which argument stances are fairly exposed in argument retrieval systems.
Using the official runs of the Touché argument retrieval task at CLEF 2020, as well as synthetic data to control the amount and order of argument stances in the rankings, we show that the systems with the best effectiveness in terms of topical relevance are not necessarily the most fair or the most diverse in terms of argument stance. The relationships we found between (un)fairness and diversity metrics shed light on how to evaluate group fairness -- in addition to topical relevance -- in argument retrieval settings.
Prediction bias is a well-known problem in classification algorithms, which tend to be skewed towards more represented classes. This phenomenon is even more pronounced in multi-label scenarios, where the number of underrepresented classes is usually larger. In light of this, we present the Prediction Bias Coefficient (PBC), a novel measure that assesses the bias induced by label imbalance in multi-label classification. The approach leverages Spearman's rank correlation coefficient between the label frequencies and the F-scores obtained for each label individually. After describing the theoretical properties of the proposed indicator, we illustrate its behaviour on a classification task performed with state-of-the-art methods on two real-world datasets, and compare it experimentally with other metrics from the literature.
In location-based search, a user's click behavior is naturally bonded with trilateral spatiotemporal information, i.e., the locations of historical user requests, the locations of the corresponding clicked items, and the occurrence times of historical clicks. Appropriate modeling of this trilateral spatiotemporal click behavior sequence is key to the success of any location-based search service. Though abundant and helpful, existing user behavior modeling methods are insufficient for modeling the rich patterns in the trilateral spatiotemporal sequence, in that they ignore the interplay among the request's geographic information, the item's geographic information, and the click time. In this work, we systematically study the user behavior modeling problem in location-based search. We propose TRISAN, short for Trilateral Spatiotemporal Attention Network, a novel attention-based neural model that incorporates temporal relatedness into the modeling of both the item's and the request's geographic closeness through a fusion mechanism. In addition, we propose to model geographic closeness both by distance and by semantic similarity. Extensive experiments demonstrate that the proposed method outperforms existing methods by a large margin and that every part of our modeling strategy contributes to its final success.
The impact of ranking systems on humans is receiving increasing attention. In this paper, we consider a class of algorithms known as reputation-based ranking systems, which rank items based on a reputation score automatically computed for each user. Recent literature introduced the concept of reputation independence, which considers a sensitive attribute of the users (such as gender or age) and makes the reputation scores independent of that attribute. Here, we show that when independence is introduced with respect to one sensitive attribute, the reputation scores can still be biased with respect to another. To overcome this issue, we propose an approach that attains equity in the reputation score computation, independently of any sensitive attribute that characterizes the users.
Astounding results from transformer models with Vision-and-Language Pretraining (VLP) on joint vision-and-language downstream tasks have intrigued the multi-modal community. On the one hand, these models are usually so large that they are difficult to fine-tune and to serve in real-time online applications. On the other hand, compressing the original transformer block ignores the difference in information between modalities, which leads to a sharp decline in retrieval accuracy.
In this work, we present a light and effective compression method for cross-modal retrieval models. By adopting a novel random replacement strategy and knowledge distillation, our module learns the knowledge of the teacher while reducing the number of parameters by nearly half. Furthermore, our compression method achieves nearly 130x acceleration with acceptable accuracy. To overcome the sharp decline in retrieval performance caused by compression, we introduce a co-attention interaction module to capture both modality-specific and interaction information. Experiments show that a multi-modal co-attention block is more suitable for cross-modal retrieval tasks than the source transformer encoder block.
Variant calling is a fundamental task that is performed to identify variants in an individual's genome compared to a reference human genome. This task can enable better understanding of an individual's risk of disease and eventually lead to new innovations in precision medicine and drug discovery. However, variant calling on a large number of human genome sequences requires significant computing and storage resources. While access to such resources is possible today (e.g., through cloud computing), reducing the cost of analyzing genomes has become a major challenge. Motivated by these reasons, we address the problem of accelerating the variant calling pipeline on a large number of human genome sequences using a commodity cluster. We propose a novel approach that synergistically combines data and task parallelism for different stages of the variant calling pipeline across different sequences with minimal synchronization. Our approach employs futures to enable asynchronous computations in order to improve the overall cluster utilization and thereby accelerate the variant calling pipeline. On a 16-node cluster, we observed that our approach was 3X-4.7X faster than the state-of-the-art Big Data Genomics software.
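The futures-based pipelining idea can be sketched with Python's standard `concurrent.futures`; the stage names below are hypothetical placeholders for real pipeline stages, and this is an illustration of the pattern, not the authors' system:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pipeline stages; real stages would invoke aligners/callers.
def align(seq):       return f"aligned({seq})"
def sort_reads(bam):  return f"sorted({bam})"
def call_variants(b): return f"variants({b})"

def run_pipeline(sequences, workers=4):
    """Chain each sequence's stages through futures so that stages of
    different sequences overlap on the cluster/pool; the only
    synchronization point is the final result() per sequence."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        finals = []
        for seq in sequences:
            f1 = pool.submit(align, seq)
            # default-arg capture (f=f1) binds the future at submit time
            f2 = pool.submit(lambda f=f1: sort_reads(f.result()))
            f3 = pool.submit(lambda f=f2: call_variants(f.result()))
            finals.append(f3)
        return [f.result() for f in finals]
```

Because each stage only blocks on its own predecessor, stages of different sequences execute concurrently, which is how asynchronous futures raise overall utilization.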
Tagging-based methods are among the mainstream methods for relational triple extraction. However, most of them suffer greatly from the class imbalance issue. Here we propose a novel tagging-based model that addresses this issue from two aspects. First, at the model level, we propose a three-step extraction framework that greatly reduces the total number of samples, which implicitly decreases the severity of the issue. Second, at the intra-model level, we propose a confidence-threshold-based cross entropy loss that can directly neglect some samples in the major classes. We evaluate the proposed model on NYT and WebNLG. Extensive experiments show that it addresses the issue effectively and achieves state-of-the-art results on both datasets. The source code of our model is available at: https://github.com/neukg/ConCasRTE.
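A confidence-threshold cross entropy of the kind described above can be sketched as follows (our illustrative reconstruction; the threshold `tau`, the `major_classes` argument, and the exact masking rule are assumptions, not the paper's definition):

```python
import math

def threshold_cross_entropy(probs, labels, tau=0.95, major_classes=(0,)):
    """Cross entropy that skips major-class samples the model already
    predicts confidently (p_true > tau), so the gradient is dominated
    by minority-class and hard samples instead of easy majority ones."""
    losses = []
    for p, y in zip(probs, labels):
        if y in major_classes and p[y] > tau:
            continue  # neglect the easy major-class sample entirely
        losses.append(-math.log(p[y] + 1e-12))
    return sum(losses) / max(len(losses), 1)
```

Setting `tau` high keeps early training unchanged (few samples are confident) and progressively removes the majority class from the loss as the model masters it.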
Online medical forums have become a predominant platform for answering consumers' health-related information needs. However, with a significant rise in the number of queries and the limited availability of experts, it is necessary to automatically classify medical queries based on the consumer's intention, so that questions can be directed to the right set of medical experts. Here, we develop a novel medical knowledge-aware BERT-based model (MedBERT) that explicitly gives more weight to medical concept-bearing words and utilizes domain-specific side information obtained from a popular medical knowledge base. We also contribute a multi-label dataset for the Medical Forum Question Classification (MFQC) task. MedBERT achieves state-of-the-art performance on two benchmark datasets and performs very well in low-resource settings.
High-quality evidence from the biomedical literature is crucial for the decision making of oncologists who treat cancer patients. Searching for evidence on a specific treatment for a patient is the challenge set by the TREC Precision Medicine track in 2020. To address this challenge, we propose a two-step method that incorporates the treatment into query formulation and ranking. The ranking function is trained in a zero-shot setup to accommodate the novel focus on treatments, which did not exist in any previous TREC track. Our treatment-aware neural reranking approach, FAT, achieves state-of-the-art effectiveness on TREC Precision Medicine 2020. Our analysis indicates that the BERT-based rerankers automatically learn to score documents by identifying concepts relevant to precision medicine, similar to the hand-crafted heuristics successful in earlier studies.
We focus on making recommendations for a new group of users whose preferences are unknown but for whom the decisions of other groups are given. Formulating this problem as group recommendation from group implicit feedback, we focus on two of its practical instances: group decision prediction and reverse social choice. Given a set of groups and their observed decisions, group decision prediction aims to predict the decision of a new group of users, whereas reverse social choice aims to infer the preferences of the users involved in observed group decisions. These two problems are of interest not only to group recommendation but also to personal privacy, when users intend to conceal their personal preferences yet have participated in group decisions. To tackle these two problems, we propose and study DeepGroup---a deep learning approach for group recommendation with group implicit data. We empirically assess the predictive power of DeepGroup on various real-world datasets and group decision rules. Our extensive experiments not only demonstrate the efficacy of DeepGroup but also shed light on the privacy-leakage concerns of some decision-making processes.
Nowadays, users are encouraged to be active across multiple online social networks simultaneously. User identity linkage, which aims to reveal the correspondence among different accounts across networks, has been regarded as a fundamental problem for user profiling, marketing, cybersecurity, and recommendation. Existing methods mainly address the prediction problem by utilizing profile, content, or structural features of users in symmetric ways. However, encouraged by online services, information from different social platforms may also be asymmetric, such as geo-locations and texts. This leads to an emerging challenge of aligning users with asymmetric information across networks. Instead of the similarity evaluation applied in previous works, we formalize the correlation between geo-locations and texts and propose a novel user identity linkage framework for matching users across networks. Moreover, our model can alleviate the label scarcity problem by introducing external text-location pairs. Experimental results on real-world datasets show that our approach outperforms existing methods and achieves state-of-the-art results.
The predominance of biased articles and their consumption by readers is becoming a considerable issue. Researchers across domains have made efforts to mitigate biases in language. However, due to the subjective nature of the problem, it is not trivial to detect bias embedded in a text. In this paper, we propose a deep linguistically informed multi-task transformer-based model to automatically detect bias in written text. The model is fine-tuned on a domain-specific corpus and further trained on the learning objectives. We evaluate the performance of the proposed model against baseline systems across multiple datasets. We observe that augmenting linguistic features along with contextual embeddings improves the performance of the neural network model in automatically detecting bias in text.
Persuasive conversation leverages conversational strategies by the persuader to change the attitude or behavior of a persuadee towards achieving a specific goal. It involves understanding the linguistic and cognitive principles underlying the organization of strategic disclosures and appeals employed in human persuasion. One of the main challenges of such conversations is the inability of a persuader to detect the outcome of the conversation during the interaction. Such prior knowledge can help a persuader change their conversation strategy and pre-empt possible conversation failures. In this paper, we propose a technique that analyzes conversations to predict whether the persuader will successfully persuade the persuadee: a joint model with latent utterance categorization that predicts the success or failure of a persuasive conversation. This latent categorization allows the model to identify high-level conversational contexts that influence patterns of language in a persuasive conversation. We evaluate the performance of our model on an openly available dataset. Our preliminary results demonstrate that the proposed model outperforms competitive baselines.
Learning domain information for a downstream task is important for improving the performance of sentiment analysis. However, labeling a sufficient amount of training data in an application domain tends to be highly time-consuming and tedious. To solve this problem, we propose a novel method to effectively learn domain information and improve sentiment analysis performance with a small amount of training data. We use the masked language model (MLM), a self-supervised learning model, to calculate word weights and improve the downstream fine-tuning task for sentiment analysis. In particular, the MLM with the calculated word weights is executed simultaneously with the fine-tuning task. The results show that the proposed model achieves better performance than previous models on four different datasets for sentiment analysis.
Named Entity Disambiguation (NED) and linking have traditionally been evaluated on natural language content that is both well-written and contextually rich. However, many NED approaches display poor performance on text sources that are short and noisy. In this paper, we study the problem of entity disambiguation for short text and propose a location-aware NED framework that resolves ambiguities in text with few other contextual cues. We show that the spatial dimension is crucial in disambiguating named entities and that location inference is underutilized in many NED systems. Our proposed framework integrates, in an unsupervised manner, spatial signals that are readily available for many sources that emit short text (e.g., micro-blogs, search queries, and news streams). Our evaluation on news headlines and tweets reveals that a simple spatial embedding improves the accuracy of competitive baseline NED approaches from the literature by 8% on news headlines and by 4% on tweets.
Long sequence time-series forecasting (LSTF) plays an important role in a variety of real-world application scenarios, such as electricity forecasting, weather forecasting, and traffic flow forecasting. It has previously been observed that transformer-based models achieve outstanding results on LSTF tasks, as they can reduce model complexity while maintaining stable prediction accuracy. Nevertheless, some issues still limit the performance of transformer-based models on LSTF tasks: (i) the potential correlation between sequences is not considered; (ii) the inherent encoder-decoder structure is difficult to expand after being optimized for complexity. To solve these two problems, we propose a transformer-based model, named AGCNT, which is efficient and can capture the correlation between sequences in the multivariate LSTF task without causing a memory bottleneck. Specifically, AGCNT has several characteristics: (i) a probsparse adaptive graph self-attention, which maps long sequences into a low-dimensional dense graph structure with adaptive graph generation and captures the relationships between sequences with an adaptive graph convolution; (ii) a stacked encoder with distilling probsparse graph self-attention, which integrates the graph attention mechanism and retains the dominant attention of the cascade layer, preserving the correlation between sparse queries from long sequences; (iii) a stacked decoder with generative inference, which generates all prediction values in one forward operation and thereby improves the inference speed of long-term predictions. Experimental results on 4 large-scale datasets demonstrate that AGCNT outperforms state-of-the-art baselines.
Audio-driven talking face generation is an active research direction in the field of virtual reality. The main challenge is that the generated lip shape of the speaker is out of sync with the input audio. To address this challenge, we propose a novel solution to synthesize lip-synchronized, high-quality, realistic video given input audio. We first decompose the target person's video frames into 3D face model parameters, and an information bottleneck is inserted into the audio-to-expression network to learn the mapping between audio features and expression parameters. Then, we replace the expression parameters in the target video frame with the expression parameters extracted from audio and re-render the face. Finally, we add high-level audio embeddings extracted from the raw audio and lip landmark embeddings to the neural rendering network. The 3D face shapes, 2D landmarks, and audio embeddings provide complementary information for the neural rendering network, which guarantees the generation of lip-synchronized, high-quality video portraits from the synthesized rendered faces. Experimental results show that, compared with other talking face generation methods, our method achieves the best lip synchronization while maintaining high video definition.
The task of keyphrase generation, unlike extraction, aims to generate phrases that succinctly capture the key information of the source text, including phrases absent from the document (i.e., that do not match any contiguous sub-sequence of the source text). Despite the significant progress achieved by sequence-to-sequence (seq2seq) models in modelling such a high-entropy task, they are limited by their deterministic modelling capability, which restricts the generation of a diverse set of keyphrases. To address this limitation, in this paper we propose to incorporate a Conditional Variational Autoencoder (CoVA) into seq2seq models for its ability to represent a set of keyphrases as a probabilistic distribution, which improves the diversity of the generated keyphrases. We model the probabilistic distribution using a hierarchical latent structure, where a global latent variable models the diversity among the keyphrases and local latent variables control the generation of each keyphrase to keep them coherent. Experimental results on four benchmark datasets of research papers demonstrate the effectiveness of our proposed approach in achieving a large improvement in diversity along with modest gains in quality with respect to previous models.
Recent advances in dense retrieval techniques have offered the promise of being able not just to re-rank documents using contextualised language models such as BERT, but also to use such models to identify documents from the collection in the first place. However, when using dense retrieval approaches that use multiple embedded representations for each query, a large number of documents can be retrieved for each query, hindering the efficiency of the method. Hence, this work is the first to consider efficiency improvements in the context of a dense retrieval approach (namely ColBERT), by pruning query term embeddings that are estimated not to be useful for retrieving relevant documents. Our proposed query embedding pruning reduces the cost of the dense retrieval operation, as well as the number of documents that are retrieved and hence need to be fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, when reducing the number of query embeddings used from 32 to 3 based on the collection frequency of the corresponding tokens, query embedding pruning results in no statistically significant differences in effectiveness, while reducing the number of documents retrieved by 70%. In terms of mean response time for the end-to-end system, this results in a 2.65x speedup.
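The pruning rule itself is simple enough to sketch. Below is a minimal, hypothetical illustration (function name, tokens, and frequencies are invented; ColBERT's real pipeline operates on BERT sub-word embeddings) of keeping only the query embeddings whose tokens are rarest in the collection:

```python
import numpy as np

def prune_query_embeddings(token_ids, embeddings, collection_freq, keep=3):
    """Keep only the `keep` query token embeddings whose tokens are rarest
    in the collection (rare tokens are assumed most discriminative).
    Unseen tokens default to frequency 0, i.e. they are always kept."""
    order = sorted(range(len(token_ids)),
                   key=lambda i: collection_freq.get(token_ids[i], 0))
    kept = sorted(order[:keep])  # preserve the original query order
    return [token_ids[i] for i in kept], embeddings[kept]

# Toy example: 5 query tokens with made-up per-token collection frequencies.
freq = {"the": 10_000, "of": 8_000, "neural": 120, "colbert": 3, "pruning": 45}
tokens = ["the", "neural", "pruning", "of", "colbert"]
emb = np.random.rand(5, 4)
kept_tokens, kept_emb = prune_query_embeddings(tokens, emb, freq, keep=3)
print(kept_tokens)  # the rare tokens survive: ['neural', 'pruning', 'colbert']
```

In the paper's setting the same idea reduces 32 query embeddings to 3, shrinking both the scoring cost and the candidate set.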
In theory, the variational auto-encoder (VAE) is not well suited to recommendation tasks, although it has been successfully utilized in collaborative filtering (CF) models. In this paper, we propose a Gaussian Copula-Vector Quantized Autoencoder (GC-VQAE) model that differs from prior art in two key ways: (1) a Gaussian Copula helps to model the dependencies among latent variables, which are used to construct a more complex distribution than under the mean-field assumption; and (2) by incorporating a vector quantization method into the encoder, our model can learn discrete representations that are consistent with the observed data rather than sampling directly from simple Gaussian distributions. Our approach is able to circumvent the "posterior collapse" issue and break the prior constraint, improving the flexibility of latent vector encoding and the model's learning ability. Empirically, GC-VQAE significantly improves recommendation performance compared to existing state-of-the-art methods.
Sports game summarization aims to generate news articles from live text commentaries. A recent state-of-the-art work, SportsSum, not only constructs a large benchmark dataset but also proposes a two-step framework. Despite its great contributions, the work has three main drawbacks: 1) the noise present in the SportsSum dataset degrades summarization performance; 2) neglecting the lexical overlap between news and commentaries results in a low-quality pseudo-labeling algorithm; 3) directly concatenating rewritten sentences to form news limits practicability. In this paper, we publish a new benchmark dataset, SportsSum2.0, together with a modified summarization framework. In particular, to obtain a clean dataset, we employ crowd workers to manually clean the original dataset. Moreover, the degree of lexical overlap is incorporated into the generation of pseudo labels. Further, we introduce a reranker-enhanced summarizer to take into account the fluency and expressiveness of the summarized news. Extensive experiments show that our model outperforms the state-of-the-art baseline.
Clarification has attracted much attention because of its many potential applications, especially in Web search. Since search queries are very short, the underlying user intents are often ambiguous. This makes it challenging for search engines to return the appropriate results that pertain to the users' actual information needs. To address this issue, asking clarifying questions has been recognized as a critical technique. Although previous studies have analyzed the importance of asking clarifying questions, generating them for Web search remains under-explored. In this paper, we tackle this problem in a template-guided manner. Our objective is to jointly learn to select question templates and fill question slots, using Transformer-based networks. We conduct experiments on MIMICS, a collection of datasets containing real Web search queries sampled from Bing's search logs. Our method is demonstrated to achieve significant improvements over various competitive baselines.
Nodes in networks may have one or more functions that determine their role in the system. As opposed to local proximity, which captures the local context of nodes, role identity captures the functional "role" that nodes play in a network, such as being the center of a group or the bridge between two groups. This means that nodes far apart in a network can have similar structural role identities. Several recent works have explored methods for embedding the roles of nodes in networks. However, these methods all rely on either approximation or indirect modeling of structural equivalence. In this paper, we present a novel and flexible framework that uses stress majorization to transform the high-dimensional role identities in networks directly (without approximation or indirect modeling) into a low-dimensional embedding space. Our method is also flexible in that it does not rely on specific structural similarity definitions. We evaluated our method on the tasks of node classification, clustering, and visualization, using three real-world and five synthetic networks. Our experiments show that our framework outperforms existing methods in learning node role representations.
Recent years have seen a rise in the development of representation learning methods for graph data. Most of these methods, however, focus on node-level representation learning at various scales (e.g., microscopic, mesoscopic, and macroscopic node embedding). In comparison, methods for representation learning on whole graphs are currently relatively scarce. In this paper, we propose a novel unsupervised whole-graph embedding method. Our method uses spectral graph wavelets to capture topological similarities between the k-hop sub-graphs of nodes and uses them to learn embeddings for the whole graph. We evaluate our method against 12 well-known baselines on 4 real-world datasets and show that our method achieves the best performance across all experiments, outperforming the current state-of-the-art by a considerable margin.
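The paper's exact wavelet construction is not reproduced here. As a rough, illustrative stand-in, a NetLSD-style heat-trace descriptor shows how multi-scale spectral signatures of the graph Laplacian yield a fixed-length whole-graph embedding (all names and scales below are assumptions, not the paper's method):

```python
import numpy as np

def heat_trace_embedding(adj, times=(0.1, 1.0, 10.0)):
    """Whole-graph descriptor: traces of the heat kernel exp(-t*L) of the
    normalized Laplacian, evaluated at several diffusion scales t."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    # Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(lap)  # the spectrum determines the signature
    return np.array([np.exp(-t * eigvals).sum() for t in times])

# A triangle and a 3-node path graph get different descriptors.
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
emb_tri = heat_trace_embedding(triangle)
emb_path = heat_trace_embedding(path)
print(np.allclose(emb_tri, emb_path))  # False: the two spectra differ
```

Because the descriptor depends only on the Laplacian spectrum, graphs of different sizes map into the same low-dimensional space, which is what makes whole-graph comparison possible.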
Recently, Graph Convolution Network (GCN) based methods have achieved outstanding performance for recommendation. These methods embed users and items in Euclidean space and perform graph convolution on user-item interaction graphs. However, real-world datasets usually exhibit tree-like hierarchical structures, which make Euclidean space less effective at capturing user-item relationships. In contrast, hyperbolic space, as a continuous analogue of a tree, provides a promising alternative. In this paper, we propose a fully hyperbolic GCN model for recommendation, where all operations are performed in hyperbolic space. Utilizing the advantages of hyperbolic space, our method is able to embed users/items with less distortion and capture user-item interaction relationships more accurately. Extensive experiments on public benchmark datasets show that our method outperforms both Euclidean and hyperbolic counterparts and requires far lower embedding dimensionality to achieve comparable performance.
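To see why hyperbolic space suits tree-like data, consider the geodesic distance in the Poincaré ball, which grows rapidly toward the boundary, so exponentially many points fit at roughly equal pairwise distances, much like the leaves of a tree. This is a small illustrative sketch of the distance function only, not the paper's model:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the Poincare ball
    (both points must have Euclidean norm < 1)."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u * u)) * (1.0 - np.sum(v * v))
    return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

origin = np.zeros(2)
near = np.array([0.1, 0.0])
far = np.array([0.95, 0.0])  # close to the boundary of the ball
# Distances blow up near the boundary, mimicking tree-like volume growth.
print(poincare_distance(origin, near) < poincare_distance(origin, far))  # True
```

A fully hyperbolic GCN performs aggregation and transformation with operations consistent with this geometry (e.g., Möbius operations) rather than mapping back and forth to Euclidean tangent spaces.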
Graph Neural Networks (GNNs) have achieved great success across various domains. Nevertheless, most GNN methods are sensitive to the quality of graph structures. To tackle this problem, some studies exploit different graph structure learning strategies to refine the original graph structure. However, these methods consider only feature information while ignoring available label information. In this paper, we propose a novel label-informed graph structure learning framework that incorporates label information explicitly through a class transition matrix. We conduct extensive experiments on seven node classification benchmark datasets, and the results show that our method outperforms or matches the state-of-the-art baselines.
Multivariate time series (MTS), such as multiple medical measures in intensive care units (ICUs), are irregularly acquired and contain missing values. Conducting learning tasks on such irregular MTS with missing values, e.g., predicting the mortality of ICU patients, poses significant challenges to existing MTS forecasting models and recurrent neural networks (RNNs), which capture the temporal dependencies within a time series. This work proposes a bidirectional coupled MTS learning (BiCMTS) method to represent both forward and backward value couplings within a time series by RNNs and between MTS by self-attention networks; the learned bidirectional intra- and inter-time-series coupling representations are fused to estimate missing values. We test BiCMTS on both data imputation and mortality prediction for ICU patients, showing the great potential of leveraging the deep and hidden relations captured in RNNs via the BiCMTS-learned intra- and inter-time-series value couplings in MTS.
Transformer-based language models (e.g., BERT, RoBERTa, GPT) have shown remarkable performance in many natural language processing tasks, and their multilingual variants make it easier to handle cross-lingual tasks without using a machine translation system. In this paper, we apply multilingual BERT to the cross-lingual information retrieval (CLIR) task with a triplet loss to learn the relevance between queries and documents written in different languages. Moreover, we align the token embeddings from different languages via adversarial networks to help the language model learn cross-lingual sentence representations. We achieve state-of-the-art results on the newly published CLIR dataset CLIRMatrix. Furthermore, we show that the adversarial multilingual BERT can also achieve competitive results in the zero-shot setting for some specific languages when CLIR training data in that language is lacking.
To inhibit the spread of rumorous information, fact checking aims at retrieving evidence to verify the truthfulness of a given statement. Fact checking methods typically use knowledge graphs (KGs) as external repositories and develop reasoning methods to retrieve evidence from KGs. As real-world statements are often complex and contain multiple claims, multi-claim fact verification is not only necessary but also more important for practical applications. However, existing methods only focus on verifying a single claim (i.e., a single-claim statement). Multiple claims imply rich context information, and modeling the interrelations between claims can facilitate better verification of a multi-claim statement as a whole. In this paper, we propose a computational method to model inter-claim interactions for multi-claim fact checking. To focus on relevant claims within a statement, our method first extracts topics from the statement and connects the claim triples in the statement to form a claim graph. It then learns a policy-based agent to sequentially select topic-related triples from the claim graph. To fully exploit information from the statement, our method further employs multiple agents and develops a hierarchical attention mechanism to verify multiple claims as a whole. Experimental results on two real-world datasets show the effectiveness of our method for multi-claim fact verification.
The cold start problem is one of the most challenging and long-standing problems in recommender systems, and cross-domain recommendation (CDR) methods are effective for tackling it. Most cold-start-related CDR methods require training a mapping function between high-dimensional embedding spaces using overlapping user data. However, overlapping data is scarce in many recommendation tasks, which makes it difficult to train the mapping function. In this paper, we propose a new approach for CDR that aims to alleviate this training difficulty. The proposed method can be viewed as a special parameterization of the mapping function without hurting expressiveness; it makes use of non-overlapping user data and leads to effective optimization. Extensive experiments on two real-world CDR tasks are performed to evaluate the proposed method. When there are few overlapping data, the proposed method outperforms the existing state-of-the-art method by 14% (relative improvement).
In the information explosion era, recommender systems (RSs) are widely studied and applied to discover user-preferred information. An RS performs poorly when suffering from the cold-start issue, which can be alleviated by incorporating Knowledge Graphs (KGs) as side information. However, most existing works neglect the facts that node degrees in KGs are skewed and that massive amounts of interactions in KGs are recommendation-irrelevant. To address these problems, in this paper we propose Differentiable Sampling on Knowledge Graph for Recommendation with Relational GNN (DSKReG), which learns the relevance distribution of connected items from KGs and samples suitable items for recommendation following this distribution. We devise a differentiable sampling strategy, which enables the selection of relevant items to be jointly optimized with the model training procedure. The experimental results demonstrate that our model outperforms state-of-the-art KG-based recommender systems. The code is available online at https://github.com/YuWang-1024/DSKReG.
Personalized recommender systems are playing an increasingly important role in online services. Graph Neural Network (GNN)-based recommender models have demonstrated a superior capability to model users' interests thanks to the rich relational information encoded in graphs. However, with the ever-growing volume of online information and the high computational complexity of training GNNs, it is difficult to perform frequent updates to provide the most up-to-date recommendations. There have been several attempts to train GNN models in an incremental fashion to enable faster training times and permit more frequent model updates using the latest training data. The main technique is knowledge distillation, which aims to allow model updates while preserving key aspects of the model that were learned from the historical data. In this work, we develop a novel Graph Structure Aware Contrastive Knowledge Distillation for Incremental Learning in recommender systems, which is tailored to the rich relational information in the recommendation context. We combine the contrastive distillation formulation with intermediate layer distillation to inject layer-level supervision. We demonstrate the effectiveness of our proposed distillation framework for GNN-based recommendation systems on four commonly used datasets, showing consistent improvement over state-of-the-art alternatives.
Irregularly, asynchronously, and sparsely sampled multivariate time series (IASS-MTS) are characterized by sparse, non-uniform time intervals between successive observations and different sampling rates amongst series. These properties pose substantial challenges to mainstream machine learning models for learning complicated relations within and across IASS-MTS, because most models assume that the time series in question are evenly sampled, complete (with fixed-dimensional features), and synchronous. To address these challenges, we present a novel time-aware Dual-Attention and Memory-Augmented Network (DAMA-Net). The proposed model can leverage the time irregularity, multiple sampling rates, and global temporal pattern information inherent in IASS-MTS to learn more effective representations and improve prediction performance. Comprehensive experiments on real datasets show that DAMA-Net outperforms state-of-the-art methods on the multivariate time series classification task.
Post-click conversion rate (CVR) estimation is a crucial task in online advertising and recommendation systems. To address the sample selection bias problem of traditional CVR models trained in the click space, recent studies perform entire-space multi-task learning based on the probability of events in user behavior funnels like "impression-click-conversion". However, those models learn the feature representation of each task independently and omit potential inter-task correlations that could help improve CVR estimation performance. In this paper, we propose AutoHERI, an entire-space CVR model with automated hierarchical representation integration, which leverages the interplay across the representation learning of multiple tasks. It performs neural architecture search to learn optimal connections between layer-wise representations of different tasks. Moreover, AutoHERI achieves better search efficiency with a one-shot search algorithm, and thus can easily be extended to new scenarios with more complex user behaviors. Both offline and online experimental results on large-scale real-world datasets verify that AutoHERI significantly outperforms previous entire-space models.
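AutoHERI's architecture search is not reproduced here, but the entire-space formulation it builds on can be sketched: following the common ESMM-style factorization, pCTCVR = pCTR × pCVR, both observable events (click, and click-and-convert) are supervised over the full impression space, so the CVR head never trains on a click-biased sample. The function below is an illustrative toy on made-up probabilities, not the paper's implementation:

```python
import numpy as np

def entire_space_loss(p_ctr, p_cvr, click, conversion, eps=1e-9):
    """Binary cross-entropy over the impression space for the two
    observable events: click, and click&conversion (pCTCVR = pCTR * pCVR).
    pCVR itself is never directly supervised, avoiding selection bias."""
    p_ctcvr = p_ctr * p_cvr
    bce = lambda p, y: -np.mean(y * np.log(p + eps)
                                + (1 - y) * np.log(1 - p + eps))
    return bce(p_ctr, click) + bce(p_ctcvr, click * conversion)

# Toy batch of 4 impressions with invented model outputs and labels.
p_ctr = np.array([0.9, 0.2, 0.8, 0.1])
p_cvr = np.array([0.5, 0.5, 0.1, 0.9])
click = np.array([1.0, 0.0, 1.0, 0.0])
conv = np.array([1.0, 0.0, 0.0, 0.0])
loss = entire_space_loss(p_ctr, p_cvr, click, conv)
print(loss)
```

AutoHERI keeps this entire-space supervision and additionally searches over connections between the tasks' layer-wise representations instead of learning each tower independently.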
Despite the vast amount of information encoded in knowledge graphs, they often remain incomplete. Neural networks, in particular Graph Convolutional Neural Networks, have been shown to be effective predictors for completing information about the class affiliation of entities in knowledge graphs. However, these models remain ignorant of the confidence of their predictions because they rely on a point estimate of a softmax output. In this paper, we combine Graph Convolutional Neural Networks with recent developments in the field of Evidential Learning by placing a Dirichlet distribution on the class probabilities to overcome this problem. We use the continuous output of a Graph Convolutional Neural Network as the parameters of a Dirichlet distribution. In this way, the predictions of the model are represented as a distribution over possible softmax outputs rather than a point estimate of a softmax output. The experiments show that better performance in predicting class affiliations can be achieved compared to recent models. In addition, the experiments show that this approach overcomes the well-known problem of overconfident predictions in deterministic neural networks.
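The Dirichlet parameterization can be sketched concisely. Assuming the common evidential-learning convention (evidence = ReLU of the network output, α = evidence + 1; the paper's exact mapping may differ), the expected class probabilities and an uncertainty mass fall out directly:

```python
import numpy as np

def dirichlet_from_logits(logits):
    """Map raw network outputs to Dirichlet parameters: evidence = ReLU(logits),
    alpha = evidence + 1. Returns the expected class probabilities and an
    uncertainty mass u = K / sum(alpha), where u = 1 means total ignorance."""
    evidence = np.maximum(logits, 0.0)
    alpha = evidence + 1.0
    s = alpha.sum()
    probs = alpha / s          # mean of the Dirichlet distribution
    uncertainty = len(alpha) / s
    return probs, uncertainty

confident, u_low = dirichlet_from_logits(np.array([9.0, 0.0, 0.0]))
clueless, u_high = dirichlet_from_logits(np.array([0.0, 0.0, 0.0]))
print(u_low < u_high)  # True: strong evidence implies low uncertainty
print(clueless)        # uniform [1/3, 1/3, 1/3] under zero evidence
```

Unlike a softmax point estimate, zero evidence yields a uniform prediction with maximal uncertainty instead of an arbitrarily confident one.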
In recent years, incomplete multi-view clustering has drawn increasing attention due to the existence of large amounts of unlabeled incomplete data whose views are not fully observed in practical applications. Although many traditional methods have been extended to address the incomplete learning problem, most of them exploit shallow models and ignore geometric structure. To address these issues, we propose a structural deep incomplete multi-view clustering network. Specifically, the proposed method simultaneously explores the high-level features and high-order geometric structure information of the data with several view-specific graph convolutional encoder networks, and directly obtains the optimal clustering indicator matrix in one stage. Experimental results on several datasets, in comparison with state-of-the-art methods, validate the superiority of the proposed method.
Ad retrieval in sponsored search aims to understand user search intentions (user queries) and retrieve a set of ads inferred to be relevant to the queries. Due to the huge amount of search traffic and the multiple views of relevance (such as co-clicking, co-bidding, or textual similarity), it is highly desirable but remains challenging to achieve large-scale, multi-view matching between queries and ads, particularly in industrial settings. In this paper, we propose SMAD, a scalable multi-view ad retrieval engine that we developed and deployed at Taobao, the largest e-commerce platform in China. We construct a multi-relation query-item-ad graph capturing different views of query-ad relevance, which is of large scale and has a complex structure. Since the queries and products in an e-commerce platform are organized into a category tree, to deal with the large scale of the graph we propose a category-constrained graph sampling and partition method to enable distributed parallel offline training. To tackle the complex multi-view structure, we propose a multi-view parallel deep neural network (DNN) model to combine the information from different views in a principled way. According to offline experiments and online A/B tests, our framework significantly outperforms baselines in terms of relevance, coverage, and revenue.
Feature selection is a prevalent data preprocessing paradigm for various learning tasks. Due to the expensive cost of acquiring supervision information, unsupervised feature selection has sparked great interest recently. However, existing unsupervised feature selection algorithms do not have fairness considerations and run a high risk of amplifying discrimination by selecting features that are overly associated with protected attributes such as gender, race, and ethnicity. In this paper, we make an initial investigation of the fairness-aware unsupervised feature selection problem and develop a principled framework that leverages kernel alignment to find a subset of high-quality features that best preserve the information in the original feature space while being minimally correlated with protected attributes. Specifically, unlike mainstream in-processing debiasing methods, our proposed framework can be regarded as a model-agnostic debiasing strategy that eliminates biases and discrimination before downstream learning algorithms are involved. Experimental results on real-world datasets demonstrate that our framework achieves a good trade-off between feature utility and feature fairness.
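The kernel-alignment criterion can be illustrated with linear kernels. The toy below (invented names and synthetic data; the paper's actual objective and optimization are not reproduced) scores each candidate feature by its alignment with the full feature space minus its alignment with a protected attribute, so a feature that is essentially a proxy for the protected attribute scores poorly:

```python
import numpy as np

def centered_kernel_alignment(X, Y):
    """Alignment between the centered linear kernels of X and Y
    (rows are samples); 1 means identical similarity structure."""
    def centered_gram(A):
        K = A @ A.T
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
        return H @ K @ H
    Kx, Ky = centered_gram(X), centered_gram(Y)
    return np.sum(Kx * Ky) / (np.linalg.norm(Kx) * np.linalg.norm(Ky) + 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # synthetic full feature space
protected = X[:, [0]]         # pretend feature 0 is the protected attribute
scores = []
for j in range(X.shape[1]):
    sel = X[:, [j]]
    # utility alignment minus alignment with the protected attribute
    scores.append(centered_kernel_alignment(sel, X)
                  - centered_kernel_alignment(sel, protected))
print(int(np.argmin(scores)))  # the protected feature itself scores worst
```

A real selector would trade the two alignment terms off with a tunable weight and optimize over feature subsets rather than scoring features one at a time.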
Click-Through Rate (CTR) prediction, which aims to predict the probability that a user will click on an item, is an essential task for many online applications. Due to the data sparsity and high dimensionality inherent in CTR prediction, a key to making effective predictions is to model high-order feature interactions. An efficient way to do this is to perform inner products of feature embeddings with self-attentive neural networks. To better model complex feature interactions, in this paper we propose a novel DisentanglEd Self-atTentIve NEtwork (DESTINE) framework for CTR prediction that explicitly decouples the computation of unary feature importance from pairwise interaction. Specifically, the unary term models the general importance of one feature on all other features, whereas the pairwise interaction term contributes to learning the pure impact of each feature pair. We conduct extensive experiments using two real-world benchmark datasets. The results show that DESTINE not only maintains computational efficiency but also achieves consistent improvements over state-of-the-art baselines.
Dense subgraph detection is a fundamental building block for a variety of applications. Most existing methods aim to discover dense subgraphs within either a single network or a multi-view network, ignoring the informative node dependencies across the multiple layers of networks in a complex system. To date, it largely remains a daunting task to detect dense subgraphs on multi-layered networks. In this paper, we formulate the problem of dense subgraph detection on multi-layered networks based on the cross-layer consistency principle. We further propose a novel algorithm, DESTINE, based on projected gradient descent, with the following advantages. First, armed with cross-layer dependencies, DESTINE detects significantly more accurate and meaningful dense subgraphs at each layer. Second, it scales linearly w.r.t. the number of links in the multi-layered network. Extensive experiments demonstrate the efficacy of the proposed DESTINE algorithm in various cases.
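A minimal sketch of the projected-gradient idea on a single layer (the actual algorithm couples multiple layers through cross-layer dependencies): maximize a relaxed subgraph-density objective over the box [0, 1]^n, with clipping serving as the projection step. The specific objective and the `lam` penalty are illustrative assumptions.

```python
import numpy as np

def pgd_dense_subgraph(A, lam=1.5, lr=0.05, steps=300):
    # Relaxed objective: maximize x^T A x - lam * ||x||^2 over x in [0, 1]^n.
    # Nodes with x near 1 form the detected dense subgraph.
    x = np.full(A.shape[0], 0.5)
    for _ in range(steps):
        grad = 2.0 * (A @ x) - 2.0 * lam * x   # ascent direction
        x = np.clip(x + lr * grad, 0.0, 1.0)   # projection onto the box
    return x
```

On a toy graph, the indicator-like solution concentrates on the densest group of nodes while sparsely connected nodes are driven to zero.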
Nowadays, deep learning models are widely adopted in web-scale applications such as recommender systems and online advertising. In these applications, embedding learning for categorical features is crucial to the success of deep learning models, and the standard method is to assign each categorical feature value a unique embedding vector that is learned and optimized. Although this method captures the characteristics of categorical features well and promises good performance, it can incur a huge memory cost to store the embedding table, especially in web-scale applications. Such a huge memory cost significantly holds back the effectiveness and usability of these embedding-based deep models. In this paper, we propose a binary code based hash embedding method that allows the embedding table to be shrunk to an arbitrary scale without compromising much performance. Experimental evaluation results show that, with our proposed method, one can still achieve 99% of the original performance even when the embedding table is reduced to 1/1000 of its original size.
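The core trick can be sketched as follows: a feature id's binary code indexes one of two small codebook vectors per bit, so memory grows with the code length rather than the vocabulary size. The class name, the summation-based combination, and the codebook shapes are our own assumptions, not the paper's exact design.

```python
import numpy as np

class BinaryHashEmbedding:
    # Map an arbitrary feature id to its binary code; each bit position
    # selects one of two codebook vectors, and the per-bit vectors are
    # summed into the final embedding. Memory: n_bits * 2 * dim floats,
    # versus vocab_size * dim for a full embedding table.
    def __init__(self, n_bits=16, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n_bits = n_bits
        self.codebook = rng.normal(size=(n_bits, 2, dim))

    def lookup(self, feature_id):
        bits = [(feature_id >> b) & 1 for b in range(self.n_bits)]
        return sum(self.codebook[b, bit] for b, bit in enumerate(bits))
```

With 16 bits this covers 65,536 distinct ids using only 32 trainable vectors; ids sharing bits share parameters, which is where the (small) accuracy trade-off comes from.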
Embedding learning for categorical features is crucial for deep learning-based recommendation models (DLRMs): each feature value is mapped to an embedding vector via an embedding learning process. Conventional methods assign a fixed, uniform embedding size to all feature values from the same feature field. However, such a configuration is not only sub-optimal for embedding learning but also memory-costly. Existing methods that attempt to resolve these problems, whether rule-based or neural architecture search (NAS)-based, require extensive human design effort or network training. They are also inflexible in embedding size selection and in warm-start-based applications. In this paper, we propose a novel and effective embedding size selection scheme. Specifically, we design an Adaptively-Masked Twins-based Layer (AMTL) placed behind the standard embedding layer, which generates a mask vector to mask out the undesired dimensions of each embedding vector. The mask vector brings flexibility in selecting dimensions, and the proposed layer can easily be added to either untrained or trained DLRMs. Extensive experimental evaluations show that the proposed scheme outperforms competitive baselines on all benchmark tasks while also being memory-efficient, saving 60% of memory usage without compromising any performance metrics.
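A hypothetical relaxation of the adaptive-masking idea (not the paper's actual AMTL architecture): a small learned layer scores a distribution over dimension budgets, and a soft mask keeps dimension j with the probability that the chosen budget exceeds j, so trailing dimensions are progressively suppressed. The function name, the single weight matrix `W`, and the cumulative-sum relaxation are all our own assumptions.

```python
import numpy as np

def adaptive_dim_mask(e, W, temperature=1.0):
    # e: (d,) embedding vector; W: (d, d) hypothetical budget-scoring weights.
    logits = W @ e
    p = np.exp((logits - logits.max()) / temperature)
    p /= p.sum()                       # p[k] = probability of keeping k+1 dims
    mask = np.cumsum(p[::-1])[::-1]    # mask[j] = P(budget > j), nonincreasing
    return e * mask, mask
```

At inference time such a soft mask could be hardened (thresholded) so that the suppressed dimensions need not be stored at all, which is where the memory saving would come from.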
User profiling, which investigates user interests, has long been an important problem in many real applications. Some recent works regard users and the objects they interact with as entities of a graph and cast the problem as a node classification task. However, they neglect the differences between distinct interaction types, e.g., a user clicking an item vs. purchasing an item, and thus cannot incorporate such information well. To address this issue, we propose a relation-aware heterogeneous graph method for user profiling, which also captures significant meta relations. We adopt the query, key, and value mechanism in a transformer fashion for heterogeneous message passing, so that entities can effectively interact with each other. Via such interactions over different relation types, our model generates representations rich in information for user profile prediction. We conduct experiments on two real-world e-commerce datasets and observe a significant performance boost from our approach.
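The relation-aware query/key/value message passing can be sketched roughly as below, with one projection triple per relation type (so a "click" edge and a "buy" edge transform messages differently); the function name, single-head attention, and residual update are simplifying assumptions.

```python
import numpy as np

def hetero_attention(H, edges, Wq, Wk, Wv):
    # H: (n, d) entity features; edges: (src, dst, relation) triples;
    # Wq/Wk/Wv: dicts mapping relation name -> (d, d) projection matrices.
    n, d = H.shape
    inbox = {t: [] for t in range(n)}
    for s, t, r in edges:
        # Relation-specific attention score between target query and source key.
        score = (H[t] @ Wq[r]) @ (H[s] @ Wk[r]) / np.sqrt(d)
        inbox[t].append((score, H[s] @ Wv[r]))
    out = H.copy()
    for t, items in inbox.items():
        if not items:
            continue                      # no incoming edges: keep features
        scores = np.array([sc for sc, _ in items])
        a = np.exp(scores - scores.max())
        a /= a.sum()                      # softmax over incoming messages
        out[t] = H[t] + sum(w * v for w, v in zip(a, [v for _, v in items]))
    return out
```

Stacking several such rounds would let a user node absorb information from items several hops away, weighted differently per interaction type.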
Multi-sentence argument linking aims to detect implicit event arguments across sentences, which is indispensable when textual events span multiple sentences in a document. Previous studies suffer from the inherent limitation of error propagation and lack explicit modeling of the local and non-local interactions in a textual event. In this paper, we propose an event-aware hierarchical encoder for multi-sentence argument linking. Specifically, we introduce a hierarchical encoder to explicitly capture the local and global interactions in a textual event. Furthermore, we introduce an auxiliary task that predicts the event-relevant context in a multi-task learning manner, which implicitly helps the argument linking model stay aware of the event-relevant context. Empirical results on the widely used argument linking dataset show that our model significantly outperforms the baselines, demonstrating the effectiveness of our proposed method.
Inferring causal effects from observational data has attracted much attention across various domains. Under the potential outcome framework, the estimation of counterfactuals is crucial for investigating causal effects at the individual level. Existing representation learning approaches focus on learning one balanced feature space, which ignores certain information predictive of the outcomes. To fully utilize the predictive information, we propose a Subspace learning based Counterfactual Inference (SCI) method to estimate causal effects at the individual level. Different from existing work, SCI learns both a common subspace, which preserves the information shared across all treatment groups, and treatment-specific subspaces, which retain the information associated with each specific treatment. Learning from both kinds of subspaces helps SCI obtain better causal effect estimates than state-of-the-art methods, as demonstrated by a series of experiments on synthetic and real-world datasets.
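The two-subspace idea can be illustrated with a small sketch: project covariates into a common subspace plus the subspace of a given treatment, then use treatment-specific outcome heads to form counterfactual predictions. The linear projections, the binary-treatment setup, and all names here are illustrative assumptions rather than SCI's actual architecture.

```python
import numpy as np

def sci_embed(x, W_common, W_specific, t):
    # Concatenate the shared-subspace projection (kept for all treatments)
    # with the projection into the subspace specific to treatment t.
    return np.concatenate([x @ W_common, x @ W_specific[t]])

def estimate_ite(x, W_common, W_specific, heads):
    # Predict the outcome under each treatment with its own linear head,
    # then take the difference as the individual treatment effect (ITE).
    y = {t: sci_embed(x, W_common, W_specific, t) @ heads[t]
         for t in W_specific}
    return y[1] - y[0]
```

The common subspace carries information useful under every treatment, while each specific subspace is free to keep signals predictive only under its own treatment, which a single balanced space would discard.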
Machine learning-based Android malware detection models suffer from degradation over time due to ecosystem evolution: models trained on historical data perform poorly on newly arrived data. Existing solutions to this problem focus on sophisticated feature engineering to find stable features, which is labor-intensive. In this paper, we mitigate model degradation by substituting the representation paradigm from Euclidean (vector) to non-Euclidean (graph) without changing the features, and propose a graph-based Android malware detection model called GraphEvolveDroid. Specifically, we first construct a directed evolutionary network with a KNN model, where each node represents an app and the starting node of each edge is the ancestor of the ending node. We then use stacked GCN layers to transmit information from ancestor nodes to child nodes, so that the distribution shift of the child nodes is suppressed. Experimental results on a large real dataset spanning three years demonstrate that GraphEvolveDroid significantly mitigates model degradation by slowing down the shift of the data distribution.
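The evolutionary-network construction can be sketched as a time-respecting KNN graph, where each app links to its nearest neighbors among strictly earlier apps as ancestors; the Euclidean distance metric and the function name are our own assumptions.

```python
import numpy as np

def build_evolutionary_edges(X, timestamps, k=2):
    # X: (n, d) app feature vectors; timestamps: (n,) arrival times.
    # For each app, link its k nearest neighbors among strictly earlier
    # apps as ancestors, yielding directed (ancestor -> child) edges.
    edges = []
    order = np.argsort(timestamps)
    for idx, i in enumerate(order):
        earlier = order[:idx]
        if len(earlier) == 0:
            continue                     # the first app has no ancestors
        dists = np.linalg.norm(X[earlier] - X[i], axis=1)
        for j in earlier[np.argsort(dists)[:k]]:
            edges.append((int(j), int(i)))
    return edges
```

GCN layers over these directed edges would then propagate ancestor representations forward in time, which is the mechanism the abstract credits with suppressing distribution shift.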
Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Their effectiveness depends on the encoded embeddings capturing the semantics of queries and documents, a challenging task given the shortness and ambiguity of search queries. This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes the query together with the top documents retrieved by a dense retrieval model, ANCE, and learns to produce better query embeddings directly from relevance labels. It also keeps the document index unchanged to reduce overhead. ANCE-PRF significantly outperforms ANCE and other recent dense retrieval systems on several datasets. Analysis shows that the PRF encoder effectively captures relevant and complementary information from the PRF documents while ignoring their noise via its learned attention mechanism.
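On the input side, the PRF encoder consumes the query concatenated with its first-round top documents. A toy builder for such a BERT-style input string might look as follows; the exact token conventions are assumptions, not ANCE-PRF's actual preprocessing.

```python
def build_prf_input(query, retrieved_docs, k=3, cls="[CLS]", sep="[SEP]"):
    # Concatenate the query with its top-k first-round documents,
    # BERT-style, so the encoder can refine the query embedding
    # using pseudo relevance feedback.
    parts = [cls, query]
    for doc in retrieved_docs[:k]:
        parts += [sep, doc]
    return " ".join(parts + [sep])
```

The refined query embedding is then matched against the unchanged document index, so only the query side of the system needs retraining.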
For a good advertising effect, the images in an ad should be highly relevant to the ad title. The images are normally selected from a gallery based on their relevance scores with the ad's title, so a reliable text-image matching model is necessary to ensure the selected images are relevant to the title. The state-of-the-art text-image matching model, cross-modal BERT, only understands the visual content of the image, which is sub-optimal when an image description is available. In this work, we present MixBERT, an ad-image relevance scoring model. It models ad-image relevance by matching the ad title against both the image description and the visual content. MixBERT adopts a two-stream architecture: it adaptively selects useful information from the noisy image description and suppresses the noise that impedes effective matching. To effectively describe the details of the image's visual content, a set of local convolutional features is used as the initial image representation. Moreover, to enhance our model's perception of key entities that are important to advertising, we upgrade the masked language modeling of vanilla BERT to masked key entity modeling. Offline and online experiments demonstrate its effectiveness.
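Masked key entity modeling masks whole entity spans rather than random individual tokens, forcing the model to reconstruct advertising-critical entities from context. A toy sketch of the masking step (the span format and function name are our own assumptions) is:

```python
def mask_key_entities(tokens, entities, mask_token="[MASK]"):
    # Replace every token inside a key-entity span with the mask token,
    # returning the masked sequence and the (position, token) targets
    # the model must predict. Spans are [start, end) token indices.
    masked, targets = list(tokens), []
    for start, end in entities:
        for i in range(start, end):
            targets.append((i, tokens[i]))
            masked[i] = mask_token
    return masked, targets
```

Compared with random token masking, span-level entity masking prevents the model from trivially copying the unmasked half of a multi-token entity name.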