Knowledge graphs have been used to support a wide range of applications and to enhance search and QA for Google, Bing, Amazon Alexa, etc. However, such knowledge graphs often miss long-tail knowledge, including unpopular entities, unpopular relations, and unpopular verticals. In this talk we describe our efforts in harvesting knowledge from semi-structured websites, which are often populated according to templates using vast volumes of data stored in underlying databases.
We describe our Ceres system, which extracts knowledge from the semi-structured web. AutoCeres is a ClosedIE system that extracts knowledge according to an existing ontology. It improves the accuracy of fully automatic knowledge extraction on semi-structured data from the 60%+ of the state of the art to 90%+. OpenCeres is the first-ever OpenIE system for semi-structured data, able to identify new relations not readily included in existing ontologies. ZeroShotCeres goes further and enables extracting knowledge for completely new domains, where there is no seed knowledge to bootstrap the extraction. Finally, we describe our other efforts in ontology alignment, entity linkage, graph mining, and QA, which allow us to best leverage the knowledge we extract for search and QA.
Biomedicine has always been a fertile and challenging domain for computational discovery science. Indeed, the existence of millions of scientific articles, thousands of databases, and hundreds of ontologies offer exciting opportunities to mine our collective knowledge, were we not stymied by incompatible formats, incomplete and overlapping vocabularies, confusing licensing policies, and heterogeneous data access points.
In this talk, I will discuss our work to create computational standards, platforms, and methods to wrangle knowledge into simple but effective representations based on semantic web technologies that are maximally FAIR - Findable, Accessible, Interoperable, and Reusable - and to further use these representations for biomedical knowledge discovery. However, only with crucial additional developments will this emerging Internet of FAIR data and services enable automated scientific discovery on a global scale.
In this paper, we propose a unified framework for Ensemble Block Co-clustering (EBCO), which aims to fuse multiple basic co-clusterings into a consensus structured affinity matrix. Each co-clustering to be fused is obtained by applying a co-clustering method to the same document-term dataset. This fusion process reinforces the individual quality of the multiple basic co-clusterings within a single consensus matrix. Moreover, the proposed framework enables completely unsupervised co-clustering, where the number of co-clusters is automatically inferred based on a non-trivial generalized modularity. We first define an explicit objective function that allows the joint learning of the basic co-clusterings' aggregation and the consensus block co-clustering. Then, we show that EBCO generalizes one-sided ensemble clustering to the ensemble block co-clustering context. We also establish theoretical equivalences to spectral co-clustering and to weighted double spherical k-means clustering for textual data. Experimental results on various real-world document-term datasets demonstrate that EBCO is an effective competitor to state-of-the-art ensemble and co-clustering methods.
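As a hedged illustration of the fusion idea (a simplified co-association construction, not the EBCO objective itself), the sketch below fuses several basic clusterings of the same objects into one consensus affinity matrix: entry (i, j) is the fraction of basic clusterings that place objects i and j in the same cluster. All names and data are illustrative.

```python
import numpy as np

def consensus_matrix(labelings):
    """Fuse several clusterings of the same n objects into a consensus
    affinity matrix: entry (i, j) is the fraction of basic clusterings
    that group objects i and j together."""
    n = len(labelings[0])
    consensus = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        # Pairwise "same cluster" indicator via broadcasting.
        consensus += (labels[:, None] == labels[None, :]).astype(float)
    return consensus / len(labelings)

# Three basic clusterings of 4 documents; docs 0 and 1 always co-cluster.
basic = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
C = consensus_matrix(basic)
```

A block co-clustering method would then operate on this consensus matrix rather than on any single basic partition.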
The task of session search focuses on using interaction data to improve relevance for the user's next query at the session level. In this paper, we formulate session search as a personalization task under the framework of learning to rank. Personalization approaches re-rank results to match a user model. Such user models are usually accumulated over time based on the user's browsing behaviour. We use a pre-computed and transparent set of user models based on concepts from the social science literature. Interaction data are used to map each session to these user models. Novel features are then estimated based on such models as well as sessions' interaction data. Extensive experiments on test collections from the TREC session track show statistically significant improvements over current session search algorithms.
Consistent Query Answering (CQA) with respect to primary keys is the following problem. Given a database instance that is possibly inconsistent with respect to its primary key constraints, define a repair as an inclusion-maximal consistent subinstance. Given a Boolean query q, the problem CERTAINTY(q) takes a database instance as input, and asks whether q is true in every repair. For every Boolean conjunctive query q, the complement of CERTAINTY(q) can be straightforwardly implemented in Answer Set Programming (ASP) by means of a generate-and-test approach: first generate a repair, and then test whether it falsifies the query. Theoretical research has recently revealed that for every self-join-free Boolean conjunctive query q, the complexity class of CERTAINTY(q) is one of FO, L-complete, or coNP-complete. Faced with this complexity trichotomy, one can hypothesize that in practice, the full power of generate-and-test is a computational overkill when CERTAINTY(q) is in the low complexity classes FO or L. We investigate part of this hypothesis within the context of ASP, by asking the following question: whenever CERTAINTY(q) is in FO, does a dedicated first-order algorithm exhibit significant performance gains compared to a generic generate-and-test implementation? We first elaborate on the construction of such dedicated first-order algorithms in ASP, and then empirically address this question.
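Outside ASP, the generate-and-test approach can be sketched as a brute force: since repairs with respect to a primary key keep exactly one tuple from every block of key-equal tuples, one can enumerate all repairs and test the query on each. The following Python sketch is illustrative only (it is exponential in general, which is exactly the "overkill" the study questions for queries whose certainty problem is in FO); all names are assumptions.

```python
from itertools import product
from collections import defaultdict

def repairs(relation, key):
    """Enumerate all repairs of a relation w.r.t. a primary key:
    each repair keeps exactly one tuple from every block of
    key-equal tuples (maximal consistent subinstances)."""
    blocks = defaultdict(list)
    for t in relation:
        blocks[tuple(t[k] for k in key)].append(t)
    for choice in product(*blocks.values()):
        yield set(choice)

def certain(query, relation, key):
    """Generate-and-test: q is certainly true iff it holds in every repair."""
    return all(query(r) for r in repairs(relation, key))

# R(x, y) with primary key x; the two "a"-tuples violate the key.
R = [("a", 1), ("a", 2), ("b", 1)]
# Boolean query q: does some tuple have y = 1?
q = lambda rep: any(y == 1 for (_, y) in rep)
```

Here `certain(q, R, [0])` is true because every repair retains ("b", 1), while a query asking for y = 2 is not certain.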
Reducing hidden bias in data and ensuring fairness in algorithmic data analysis have recently received significant attention. In this paper, we address the problem of identifying a densest subgraph while ensuring that neither of the two groups defined by a binary protected attribute is disparately impacted.
Unfortunately, the underlying algorithmic problem is NP-hard, even in its approximation version: approximating the densest fair subgraph with a polynomial-time algorithm is at least as hard as the problem of finding a densest subgraph on at most k vertices, for which no constant-factor approximation algorithm is known.
Despite such negative premises, we are able to provide approximation results in two important cases. In particular, we are able to prove that a suitable spectral embedding allows recovery of an almost optimal, fair, dense subgraph hidden in the input data, whenever one is present, a result that is further supported by experimental evidence. We also show a polynomial-time, 2-approximation algorithm, whenever the underlying graph is itself fair. We finally prove that, under the small set expansion hypothesis, this result is tight for fair graphs.
The above theoretical findings drive the design of heuristics, which we experimentally evaluate in a scenario based on real data, where the aim is to strike a good balance between diversity and the selection of highly correlated items in Amazon co-purchasing graphs.
Even though knowledge graphs have proven very useful for several tasks, they are marked by incompleteness. Completion algorithms aim to extend knowledge graphs by predicting missing (subject, predicate, object) triples, usually by training a model to discern between correct (positive) and incorrect (negative) triples. However, under the open-world assumption, in which a missing triple is not negative but unknown, negative triple generation is challenging. Although negative triples are known to drive the accuracy of completion models, their impact has not been thoroughly examined yet. To evaluate accuracy, test triples are considered positive and negative triples are derived from them. The evaluation protocol is thus impacted by the generation of negative triples, which remains to be analyzed. Another issue is that the knowledge graphs available for evaluation contain anomalies, such as severe redundancy, and it is unclear how these anomalies affect the accuracy of completion models. In this paper, we analyze the impact of negative triple generation during both training and testing on translation-based completion models. We examine four negative triple generation strategies, which are also used to evaluate the models when anomalies in the test split are included and when they are discarded. In addition to previously studied anomalies such as near-same predicates, we include another: knowledge present in the test split that is missing from the training split. Our main conclusion is that the most common strategy for negative triple generation (the local-closed world assumption) can be mimicked by combining a naive strategy with an immediate-neighborhood strategy. This result suggests that completion models can be learned independently for certain subgraphs, which would render them useful in the context of knowledge graph evolution.
Although anomalies are considered harmful since they artificially increase the accuracy of completion models, our results show otherwise for certain knowledge graphs, which calls for further research efforts.
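To make the local-closed world assumption concrete, here is a hedged sketch of the common negative-sampling recipe it names: corrupt the subject or the object of each positive triple with a random entity, and reject any corruption that already appears as a positive in the graph. This is a generic illustration, not the paper's exact experimental setup; all names are assumptions.

```python
import random

def lcwa_negatives(triples, entities, n_per_positive=1, seed=0):
    """Local-closed-world negative sampling: corrupt the subject or
    the object of each positive triple with a random entity, rejecting
    corruptions that appear as positives in the graph."""
    rng = random.Random(seed)
    known = set(triples)
    negatives = []
    for (s, p, o) in triples:
        for _ in range(n_per_positive):
            while True:
                e = rng.choice(entities)
                # Flip a coin: corrupt the subject or the object.
                cand = (e, p, o) if rng.random() < 0.5 else (s, p, e)
                if cand not in known:
                    negatives.append(cand)
                    break
    return negatives

kg = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
ents = ["alice", "bob", "carol", "dave"]
negs = lcwa_negatives(kg, ents)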
Temporal dynamic graphs are graphs whose topology evolves over time, with nodes and edges added and removed between different time snapshots. Embedding such graphs in a low-dimensional space is important for a variety of tasks, including graph similarity, time-series trend analysis and anomaly detection, graph visualization, graph classification, and clustering. Despite the importance of the temporal element in these tasks, existing graph embedding methods focus on capturing the graph's nodes in a static mode and/or do not model the graph in its entirety in a temporal dynamic mode. In this study, we present tdGraphEmbed, a novel temporal graph-level embedding approach that extends random-walk-based node embedding methods to globally embed both the nodes of the graph and its representation at each time step, thus creating a representation of the entire graph at each step. Our approach was applied to graph similarity ranking, temporal anomaly detection, trend analysis, and graph visualization tasks, where we leverage our temporal embeddings in a fast and scalable way for each of the tasks. An evaluation of tdGraphEmbed on five real-world datasets shows that our approach can outperform state-of-the-art approaches used for graph embedding and node embedding in temporal graphs.
With the ever-increasing growth of online recruitment data, job-resume matching has become an important task for automatically matching jobs with suitable resumes. This task is typically cast as a supervised text matching problem. Supervised learning is powerful when labeled data is sufficient. However, on online recruitment platforms, job-resume interaction data is sparse and noisy, which affects the performance of job-resume matching algorithms.
To alleviate these problems, in this paper we propose a novel multi-view co-teaching network for job-resume matching that learns from sparse interaction data. Our network consists of two major components: a text-based matching model and a relation-based matching model. The two parts capture semantic compatibility from two different views and complement each other. To address the challenges posed by sparse and noisy data, we design two specific strategies to combine the two components. First, the two components share learned parameters or representations, so that the original representations of each component can be enhanced. More importantly, we adopt a co-teaching mechanism to reduce the influence of noise in the training data. The core idea is to let the two components help each other by selecting more reliable training instances. The two strategies focus on representation enhancement and data enhancement, respectively. Compared with pure text-based matching models, the proposed approach is able to learn better data representations from limited or even sparse interaction data and is more resistant to noise in the training data. Experimental results demonstrate that our model outperforms state-of-the-art methods for job-resume matching.
Graph summarization is the task of finding condensed representations of graphs such that a chosen set of (structural) subgraph features in the graph summary is equivalent to the input graph. Existing graph summarization algorithms are tailored to specific graph summary models, only support one-time batch computation, are designed and implemented for a specific task, or are evaluated using static graphs. Our novel, incremental, parallel algorithm addresses all these shortcomings. We support various structural graph summary models defined in our formal language FLUID. All graph summaries defined with FLUID can be updated in time O(Δ · d^k), where Δ is the number of additions, deletions, and modifications to the input graph, d is its maximum degree, and k is the maximum distance in the subgraphs considered. We empirically evaluate the performance of our algorithm on benchmark and real-world datasets. Our experiments show that, for commonly used summary models and datasets, the incremental summarization algorithm almost always outperforms its batch counterpart, even when about 50% of the graph database changes. The source code and the experimental results are openly available for reproducibility and extensibility.
We investigated how users evaluate passage-length answers for non-factoid questions. We conducted a study in which answers were presented to users, sometimes with automatic word highlighting. Users were tasked with evaluating answer quality, correctness, completeness, and conciseness. Words in the answer were also annotated, both explicitly through user markup and implicitly through gaze data obtained from eye tracking. Our results show that the correctness of an answer strongly depends on its completeness, while conciseness is less important.
Analysis of the annotated words showed that correct and incorrect answers were assessed differently. Automatic highlighting helped users evaluate answers more quickly while maintaining accuracy, particularly when the highlighting was similar to the users' annotations. We fine-tuned a BERT model on a non-factoid QA task to examine whether the model attends to words similar to those annotated. We found such similarity and, consequently, propose a method that exploits the BERT attention map to generate suggestions simulating eye gaze during user evaluation.
Outlier detection is the task of finding novel or rare phenomena that provide valuable insights in many areas of industry. Neighborhood-based algorithms are widely used to tackle this problem due to their intuitive interpretation and wide applicability across domains. Their major drawback is the intensive neighborhood search, which takes hours or even days on large data, making them impractical in many real-world scenarios. This paper proposes HySortOD, a novel algorithm that uses an efficient hypercube-ordering-and-searching strategy for fast outlier detection. Its main focus is the analysis of data with many instances and a low-to-moderate number of dimensions. We performed comprehensive experiments using real data with up to ~500k instances and ~120 dimensions, where our new algorithm outperformed 7 state-of-the-art competitors in runtime, being up to 4 orders of magnitude faster on large data. Specifically, 12 well-known benchmark datasets were investigated in depth, and a case study on the crucial task of breast cancer detection was also performed to demonstrate that our approach can be successfully used as an out-of-the-box solution for real-world, non-benchmark problems. Based on our experiments, we also identified default parameter values that make the algorithm effectively parameter-free while still reporting high-quality results.
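For intuition about the hypercube idea (a deliberately simplified sketch, not the published HySortOD algorithm, which additionally orders and searches the hypercubes efficiently), the following discretizes each dimension into grid cells, counts points per hypercube, and scores points in sparse cubes as more outlying. All parameter choices here are illustrative.

```python
from collections import Counter

def grid_outlier_scores(points, bins=4):
    """Simplified hypercube-based outlier scoring: discretize every
    dimension into `bins` cells, count points per hypercube, and score
    each point by 1 - count/max_count (sparser cube = higher score)."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]

    def cube(p):
        # Map a point to its hypercube (tuple of per-dimension cell indices).
        return tuple(
            min(int((p[d] - lo[d]) / (hi[d] - lo[d] + 1e-12) * bins), bins - 1)
            for d in range(dims)
        )

    counts = Counter(cube(p) for p in points)
    top = max(counts.values())
    return [1.0 - counts[cube(p)] / top for p in points]

# A dense cluster near the origin plus one isolated point.
data = [(0.0, 0.0), (0.1, 0.1), (0.05, 0.0), (0.1, 0.0), (10.0, 10.0)]
scores = grid_outlier_scores(data)
```

Counting per cube replaces the expensive per-point neighborhood search, which is where the speedup over neighborhood-based methods comes from.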
Deep learning has unlocked new paths towards the emulation of the peculiarly human capability of learning from examples. While this kind of bottom-up learning works well for tasks such as image classification or object detection, it is not as effective when it comes to natural language processing. Communication is much more than learning a sequence of letters and words: it requires a basic understanding of the world and social norms, cultural awareness, commonsense knowledge, etc.; all things that we mostly learn in a top-down manner. In this work, we integrate top-down and bottom-up learning via an ensemble of symbolic and subsymbolic AI tools, which we apply to the problem of polarity detection from text. In particular, we integrate logical reasoning within deep learning architectures to build a new version of SenticNet, a commonsense knowledge base for sentiment analysis.
We propose laconic classification as a novel way to understand and compare the performance of diverse image classifiers. The goal in this setting is to minimise the amount of information (i.e., entropy) required in individual test images to maintain correct classification. Given a classifier and a test image, we compute an approximate minimal-entropy positive image for which the classifier provides a correct classification, becoming incorrect upon any further reduction. The notion of entropy offers a unifying metric that allows us to combine and compare the effects of various types of reductions (e.g., cropping, colour reduction, resolution reduction) on classification performance, in turn generalising similar methods explored in previous works. Proposing two complementary frameworks for computing the minimal-entropy positive images of both human and machine classifiers, in experiments over the ILSVRC test set, we find that machine classifiers are more sensitive, entropy-wise, to reduced resolution than to cropping or reduced colour, and more so than humans are to reduced resolution, supporting recent results suggesting a texture bias in the ILSVRC-trained models used. We also find, in the evaluated setting, that humans classify the minimal-entropy positive images of machine models with higher precision than machines classify those of humans.
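The unifying role of entropy can be illustrated with a hedged sketch: the Shannon entropy (bits per pixel) of an image's value histogram drops under colour reduction, giving one number that is comparable across reduction types. This is a generic illustration of the metric, not the paper's exact estimation procedure.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy in bits per pixel of a pixel-value histogram."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def reduce_colours(pixels, levels):
    """Colour reduction: quantize 8-bit values to `levels` evenly spaced bins."""
    step = 256 // levels
    return [(v // step) * step for v in pixels]

# A maximally varied 8-bit "image": one pixel of every value.
original = list(range(256))
h_full = shannon_entropy(original)                          # 8 bits/pixel
h_reduced = shannon_entropy(reduce_colours(original, 16))   # 4 bits/pixel
```

Cropping and resolution reduction shrink the histogram in analogous ways, which is what lets their effects be compared on a single entropy scale.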
To mitigate the problem of over-dependence of a pseudo-relevance feedback algorithm on the top-M document set, we make use of a set of equivalence classes of queries rather than one single query. These query equivalents are automatically constructed either (a) from a knowledge base of prior distributions of terms with respect to the given query terms, or (b) iteratively from a relevance model of term distributions in the absence of such priors. These query variants are then used to estimate the retrievability of each document, with the hypothesis that documents that are more likely to be retrieved at top ranks for a larger number of these query variants are more likely to be effective for relevance feedback. Results of our experiments show that our proposed method achieves substantially better precision at top ranks (e.g., higher nDCG@5 and P@5 values) for ad-hoc IR and points-of-interest (POI) recommendation tasks.
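The retrievability estimate described above can be sketched as a simple count: how many query variants rank a document inside the top k. The toy scoring function and data below are assumptions for illustration, not the paper's retrieval model.

```python
def retrievability(doc_ids, query_variants, score, k=3):
    """Retrievability of each document: the number of query variants
    for which it is ranked within the top k under `score`."""
    r = {d: 0 for d in doc_ids}
    for q in query_variants:
        ranked = sorted(doc_ids, key=lambda d: score(q, d), reverse=True)
        for d in ranked[:k]:
            r[d] += 1
    return r

# Toy setting: score = number of terms shared by the query and the doc.
docs = {
    "d1": {"neural", "ranking", "model"},
    "d2": {"feedback", "relevance", "model"},
    "d3": {"cooking", "recipes"},
}
variants = [{"relevance", "feedback"}, {"relevance", "model"}, {"ranking", "model"}]
score = lambda q, d: len(q & docs[d])
r = retrievability(list(docs), variants, score, k=1)
```

Documents with high counts (here "d2") would be preferred as feedback documents under the stated hypothesis.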
Several geographical latent representation models that capture geographical influences among points-of-interest (POIs) have been proposed. Although the models improve POI recommendation performance, they depend on shallow methods that cannot effectively capture highly non-linear geographical influences from complex user-POI networks. In this paper, we propose a new graph-based geographical latent representation model (GGLR) which can capture highly non-linear geographical influences from complex user-POI networks. Our proposed GGLR considers two types of geographical influences: ingoing influences and outgoing influences. Based on a graph auto-encoder, geographical latent representations of ingoing and outgoing influences are trained to increase geographical influences between two consecutive POIs that frequently appear in check-in histories. Furthermore, we propose a graph neural network-based POI recommendation model (GPR) that uses the trained geographical latent representations of ingoing and outgoing influences for the estimation of user preferences. In the experimental evaluation on real-world datasets, we show that GGLR effectively captures highly non-linear geographical influences and GPR achieves state-of-the-art performance.
Dynamic graphs, such as user-item interaction graphs and financial transaction networks, are ubiquitous nowadays. While numerous representation learning methods for static graphs have been proposed, the study of dynamic graphs is still in its infancy. A main challenge in modeling dynamic graphs is how to effectively encode temporal and structural information into nonlinear and compact dynamic embeddings. To achieve this, we propose a principled graph-neural-based approach to learn continuous-time dynamic embeddings. We first define a temporal dependency interaction graph (TDIG) that is induced from sequences of interaction data. Based on the topology of this TDIG, we develop a dynamic message passing neural network named TDIG-MPNN, which can capture fine-grained global and local information on the TDIG. In addition, to enhance the quality of continuous-time dynamic embeddings, a novel selection mechanism comprising two successive steps, i.e., co-attention and gating, is applied before the TDIG-MPNN layer to adjust the importance of nodes by considering high-order correlations between interactive nodes' k-depth neighbors on the TDIG. Finally, we cast our learning problem in the framework of temporal point processes (TPPs), where we use TDIG-MPNN to design a neural intensity function for the dynamic interaction processes. Our model achieves superior performance over alternatives on temporal interaction prediction (including transductive and inductive tasks) on multiple datasets.
Tag-aware recommender systems (TRS) utilize rich tagging records to better depict user portraits and item features. Recently, many efforts have been made to improve TRS with neural networks. However, these solutions rely solely on tag-based features for recommendation, which is insufficient to ease the sparsity, ambiguity, and redundancy issues introduced by tags, thus hindering recommendation performance. In this paper, we propose a novel tag-aware recommendation model named Tag Graph Convolutional Network (TGCN), which leverages the contextual semantics of multi-hop neighbors in the user-tag-item graph to alleviate the above issues. Specifically, TGCN first employs type-aware neighbor sampling and aggregation operations to learn type-specific neighborhood representations. Then we leverage an attention mechanism to discriminate the importance of different node types and creatively employ a Convolutional Neural Network (CNN) as a type-level aggregator to perform vertical and horizontal convolutions for modeling multi-granular feature interactions. Besides, a TransTag regularization function is proposed to accurately identify users' substantive preferences. Extensive experiments on three public datasets and a real industrial dataset show that TGCN significantly outperforms state-of-the-art baselines for tag-aware top-N recommendation.
Heterogeneous information networks (HINs) are ubiquitous in the real world. HIN embeddings, which encode various information of the networks into low-dimensional vectors, can facilitate a wide range of applications on graph-structured data. Existing HIN embedding methods include random-walk-based methods, which may not fully utilize edge semantics, and knowledge-graph embedding methods, which restrict the expressive ability of topological information. In this paper, we propose a novel adaptive embedding framework that integrates these two kinds of methods to preserve both topological and relational information. By incorporating an assistant knowledge graph embedding model, the proposed framework performs efficient biased random walks under the guidance of edge semantics.
Sequential recommenders that capture users' dynamic intents by modeling sequential behavior are able to accurately recommend items to users. Previous studies on sequential recommendation (SR) mostly focus on optimizing recommendation accuracy, thus ignoring the diversity of recommended items. Many existing methods for improving the diversity of recommended items are not applicable to SR because they assume that user intents are static and rely on post-processing the list of recommended items to promote diversity. We consider both accuracy and diversity by reformulating SR as a list generation task, and propose an integrated approach with an end-to-end neural model, called intent-aware diversified sequential recommendation (IDSR). Specifically, we introduce an implicit intent mining (IIM) module to capture the multiple user intents reflected in sequences of user behavior. We design an intent-aware diversity promoting (IDP) loss function to supervise the learning of the IIM module and guide the model to take diversity into account during training. Extensive experiments on four datasets show that IDSR significantly outperforms state-of-the-art methods in terms of recommendation diversity while yielding comparable or superior recommendation accuracy.
Social media is a vital means for information sharing due to its easy access, low cost, and fast dissemination characteristics. However, increases in social media usage have corresponded with a rise in the prevalence of cyberbullying. Most existing cyberbullying detection methods are supervised and, thus, have two key drawbacks: (1) the data labeling process is often time-consuming and labor-intensive; (2) current labeling guidelines may not generalize to future instances because of different language usage and evolving social networks. To address these limitations, this work introduces a principled approach for unsupervised cyberbullying detection. The proposed model consists of two main components: (1) a representation learning network that encodes the social media session by exploiting multi-modal features, e.g., text, network, and time; (2) a multi-task learning network that simultaneously fits the comment inter-arrival times and estimates the bullying likelihood based on a Gaussian Mixture Model. The proposed model jointly optimizes the parameters of both components to overcome the shortcomings of decoupled training. Our core contribution is an unsupervised cyberbullying detection model that not only experimentally outperforms the state-of-the-art unsupervised models, but also achieves competitive performance compared to supervised models.
Mining data collected from industrial manufacturing processes plays an important role in intelligent manufacturing for Industry 4.0. In this paper, we propose a deep convolutional model for predicting wafer fabrication quality in an intelligent integrated-circuit manufacturing application. Wafer fabrication quality prediction is motivated by the need to improve product-line efficiency and reduce manufacturing cost by detecting potentially defective work-in-process (WIP) wafers. This work considers the following two crucial data characteristics of wafer fabrication. First, our model is designed to learn the spatial correlation between quality measurements on WIP wafers and fabrication results through an encoder-decoder neural network. Second, we leverage the fact that different products share the same raw manufacturing process to enable knowledge transfer between the prediction models of different products. Performance evaluation on real datasets is conducted to validate the strengths of our model in quality prediction, model interpretability, and the feasibility of knowledge transfer.
A dramatic growth in the availability of observational data is being witnessed across various domains of science and technology, which facilitates the study of causal inference. However, estimating treatment effects from observational data faces two major challenges: missing counterfactual outcomes and treatment selection bias. Matching methods are among the most widely used and fundamental approaches to estimating treatment effects, but existing matching methods perform poorly on data with high-dimensional and complex variables. We propose a feature selection representation matching (FSRM) method based on deep representation learning and matching, which maps the original covariate space into a selective, nonlinear, and balanced representation space, and then conducts matching in the learned representation space. FSRM adopts deep feature selection to minimize the influence of irrelevant variables on the estimation of treatment effects and incorporates a regularizer based on the Wasserstein distance to learn balanced representations. We evaluate the performance of our FSRM method on three datasets, and the results demonstrate superiority over state-of-the-art methods.
Textual data is common and informative auxiliary information for recommender systems. Most prior art utilizes text for rating prediction, but rare work connects it to top-N recommendation. Moreover, although advanced recommendation models capable of incorporating auxiliary information have been developed, none of these are specifically designed to model textual information, yielding a limited usage scenario for typical user-to-item recommendation. In this work, we present a framework of text-aware preference ranking (TPR) for top-N recommendation, in which we comprehensively model the joint association of user-item interactions and the relations between items and their associated text. Using the TPR framework, we construct a joint likelihood function that explicitly describes two ranking structures: 1) item preference ranking (IPR) and 2) word relatedness ranking (WRR), where the former captures the item preference of each user and the latter captures the word relatedness of each item. As these two explicit structures are by nature mutually dependent, we propose TPR-OPT, a simple yet effective learning criterion that additionally includes implicit structures, such as relatedness between items and relatedness between words for each user, for model optimization. Such a design not only comprehensively describes the joint association among users, words, and text but also naturally yields powerful representations suitable for a range of recommendation tasks, including user-to-item, item-to-item, and user-to-word recommendation, as well as item-to-word reconstruction. Extensive experiments conducted on eight recommendation datasets demonstrate that, by including textual information from item descriptions, the proposed TPR model consistently outperforms state-of-the-art baselines on various recommendation tasks.
NDCG and similar measures remain standard for the offline evaluation of search, recommendation, question answering and similar systems. These measures require definitions for two or more relevance levels, which human assessors then apply to judge individual documents. Due to this dependence on a definition of relevance, it can be difficult to extend these measures to account for factors beyond relevance. Rather than propose extensions to these measures, we instead propose a radical simplification to replace them. For each query, we define a set of ideal rankings and compute the maximum rank similarity between members of this set and an actual ranking generated by a system. This maximum similarity to an ideal ranking becomes our effectiveness measure, replacing NDCG and similar measures. We propose rank biased overlap (RBO) to compute this rank similarity, since it was specifically created to address the requirements of rank similarity between search results. As examples, we explore ideal rankings that account for document length, diversity, and correctness.
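Rank-biased overlap has a simple closed form; the hedged sketch below computes a truncated RBO (the prefix sum evaluated to the length of the shorter list, a lower bound on the full extrapolated measure rather than the exact published definition).

```python
def rbo(run, ideal, p=0.9):
    """Truncated rank-biased overlap between two rankings:
    (1 - p) * sum over depths d of p^(d-1) * |overlap at depth d| / d,
    evaluated to the length of the shorter list."""
    depth = min(len(run), len(ideal))
    score, seen_run, seen_ideal = 0.0, set(), set()
    for d in range(1, depth + 1):
        seen_run.add(run[d - 1])
        seen_ideal.add(ideal[d - 1])
        overlap = len(seen_run & seen_ideal)
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score

identical = rbo(["a", "b", "c"], ["a", "b", "c"])
disjoint = rbo(["a", "b", "c"], ["x", "y", "z"])
```

The proposed effectiveness measure would then be the maximum of such a similarity over the set of ideal rankings, e.g. `max(rbo(run, ideal, p) for ideal in ideals)`.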
Dynamic inference is an emerging technique for reducing the computational cost of deep neural networks in resource-constrained scenarios, such as inference on mobile devices. One way to achieve dynamic inference is to leverage multi-branch neural networks that apply different computation to input data by following different branches. Conventional research on multi-branch neural networks has mainly targeted improving the accuracy of each branch and uses manually designed rules to decide which input follows which branch of the network. Furthermore, these networks often provide a small number of exits, limiting their ability to adapt to external changes. In this paper, we investigate the problem of designing a flexible multi-branch network and early-exiting policies that can adapt resource consumption to individual inference requests without impacting inference accuracy. We propose a lightweight branch structure that provides fine-grained flexibility for early exiting and leverage a Markov decision process (MDP) to automatically learn the early-exiting policies. Our proposed model, EPNet, is effective in reducing inference cost without impacting accuracy by choosing the most suitable branch exit. We also observe that EPNet achieves 3% higher accuracy within a given inference budget, compared to state-of-the-art approaches.
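EPNet learns its exit policies with an MDP; as a far simpler, hedged illustration of early exiting in general (a common confidence-threshold baseline, not EPNet's learned policy), the sketch below exits at the first branch whose softmax confidence clears a fixed threshold. All logits and thresholds are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit(branch_logits, threshold=0.9):
    """Confidence-based early exiting: return the predicted class of the
    first branch whose max softmax probability clears the threshold,
    plus the index of the exit taken (later exits cost more compute)."""
    for i, logits in enumerate(branch_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold or i == len(branch_logits) - 1:
            return probs.index(conf), i

# Branch 0 is uncertain; branch 1 is confident, so inference exits there.
branches = [[1.0, 0.8, 0.9], [5.0, 0.1, 0.2]]
pred, exit_idx = early_exit(branches)
```

Easy inputs exit early and cheaply; hard inputs fall through to the final (most expensive) branch, which always exits.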
Forecasting influenza-like illness (ILI) is of prime importance to epidemiologists and health-care providers. Early prediction of epidemic outbreaks plays a pivotal role in disease intervention and control. Most existing work either has limited long-term prediction performance or fails to capture spatio-temporal dependencies in data. In this paper, we design a cross-location attention based graph neural network (Cola-GNN) for learning time-series embeddings for long-term ILI prediction. We propose a graph message-passing framework that combines graph structures (e.g., geolocations) and time-series features (e.g., temporal sequences) in a dynamic propagation process. We compare the proposed method with state-of-the-art statistical approaches and deep learning models through extensive experiments on real-world epidemic-related datasets from the United States and Japan. The proposed method demonstrates strong predictive performance and leads to interpretable results for long-term epidemic prediction.
Product-related question answering (QA) is an important but challenging task in E-Commerce. It creates a great demand for automatic review-driven QA, which aims to provide instant responses to user-posted questions based on diverse product reviews. Nevertheless, the rich information about personal opinions in product reviews, which is essential for answering product-specific questions, is underutilized in current generation-based review-driven QA studies. There are two main challenges in exploiting the opinion information in reviews to facilitate opinion-aware answer generation: (i) jointly modeling the opinionated and interrelated information between the question and reviews to capture the information important for answer generation, and (ii) aggregating diverse opinion information to uncover the common opinion towards the given question. In this paper, we tackle opinion-aware answer generation by jointly learning answer generation and opinion mining with a unified model. Two opinion fusion strategies, static and dynamic fusion, are proposed to distill and aggregate the important opinion information learned from the opinion mining task into the answer generation process. A multi-view pointer-generator network is then employed to generate opinion-aware answers for a given product-related question. Experimental results show that our method achieves superior performance on real-world E-Commerce QA datasets and effectively generates opinionated and informative answers.
User profiling has important applications in many downstream tasks, such as recommender systems, behavior prediction, and marketing strategy. Most existing methods focus on modeling user profiles in a single social network with plenty of data. However, user profiles are difficult to acquire, especially when data is scarce, and modeling user profiles under such conditions often leads to poor performance. Fortunately, we observe that not only user attributes but also user relationships are useful for user profiling and benefit the results. Meanwhile, similar users exhibit similar behavior across different social networks, so finding user dependencies between social networks can help infer user profiles. Motivated by these observations, in this paper we are the first to study the user profiling problem from a transfer learning perspective. We design an efficient User Profile transferring acrOss Networks (UPON) framework, which transfers knowledge of user relationships from a social network with plenty of data to facilitate user profiling on another social network with scarce data. In UPON, we first design a novel graph convolutional network based characteristic-aware domain attention model (GCN-CDAM) to find user dependencies within and between domains (i.e., social networks). We then design a dual-domain weighted adversarial learning method to solve the domain-shift problem in the transfer procedure. Experimental results on a Twitter-Foursquare dataset demonstrate that UPON outperforms state-of-the-art models.
We introduce the concept of expected exposure as the average attention ranked items receive from users over repeated samples of the same query. Furthermore, we advocate for the adoption of the principle of equal expected exposure: given a fixed information need, no item should receive more or less expected exposure than any other item of the same relevance grade. We argue that this principle is desirable for many retrieval objectives and scenarios, including topical diversity and fair ranking. Leveraging user models from existing retrieval metrics, we propose a general evaluation methodology based on expected exposure and draw connections to related metrics in information retrieval evaluation. Importantly, this methodology relaxes classic information retrieval assumptions, allowing a system, in response to a query, to produce a distribution over rankings instead of a single fixed ranking. We study the behavior of the expected exposure metric and stochastic rankers across a variety of information access conditions, including ad hoc retrieval and recommendation. We believe that measuring and optimizing expected exposure metrics using randomization opens a new area for retrieval algorithm development and progress.
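The core computation can be illustrated with a small sketch. The geometric browsing model below is one common user model; the function names and the `gamma` parameter are illustrative choices, not the paper's exact formulation:

```python
def user_exposure(rank, gamma=0.5):
    # Simple position-based browsing model: the probability that a user
    # examines the item at 1-based position `rank` decays geometrically.
    return gamma ** (rank - 1)

def expected_exposure(ranking_distribution, gamma=0.5):
    """Average exposure each item receives under a distribution over rankings.

    ranking_distribution: list of (ranking, probability) pairs. A stochastic
    ranker that randomizes among equally relevant items can equalize their
    expected exposure, which a single fixed ranking cannot.
    """
    exposure = {}
    for ranking, prob in ranking_distribution:
        for pos, item in enumerate(ranking, start=1):
            exposure[item] = exposure.get(item, 0.0) + prob * user_exposure(pos, gamma)
    return exposure
```

For two equally relevant items, a deterministic ranking gives the first item twice the exposure of the second, while randomizing uniformly over the two orders gives each item the same expected exposure.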
The Alternating Direction Method of Multipliers (ADMM) and its distributed version have been widely used in machine learning. In the iterations of ADMM, model updates using local private data and model exchanges among agents impose critical privacy concerns. Despite some pioneering works to relieve such concerns, differentially private ADMM still confronts many research challenges. For example, the guarantee of differential privacy (DP) relies on the premise that the optimality of each local problem can be perfectly attained in each ADMM iteration, which may never happen in practice, and the model trained by DP ADMM may have low prediction accuracy. In this paper, we address these concerns by proposing a novel Plausible differentially Private ADMM algorithm, PP-ADMM, together with an improved version, IPP-ADMM. In PP-ADMM, each agent approximately solves a perturbed optimization problem formulated from its local private data in each iteration, and then perturbs the approximate solution with Gaussian noise to provide the DP guarantee. To further improve model accuracy and convergence, the improved version IPP-ADMM adopts the sparse vector technique (SVT) to determine whether an agent should update its neighbors with the current perturbed solution. The agent computes the difference between the current solution and that of the last iteration; if the difference is larger than a threshold, it passes the solution to its neighbors, and otherwise the solution is discarded. Moreover, we track the total privacy loss under zero-concentrated DP (zCDP) and provide a generalization performance analysis. Experiments on real-world datasets demonstrate that, under the same privacy guarantee, the proposed algorithms are superior to the state of the art in terms of model accuracy and convergence rate.
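The SVT-style gate in IPP-ADMM can be sketched schematically. This is only an illustration of the control flow: the noise scales below are placeholders, not the paper's calibrated privacy analysis, and the Laplace sampler is a standard inverse-CDF construction:

```python
import math
import random

def _laplace(scale, rng):
    # Inverse-CDF sample from Laplace(0, scale).
    u = rng.random() - 0.5
    if u == 0.0:
        return 0.0
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def should_broadcast(current, previous, threshold, epsilon, rng=random):
    """SVT-style gate (schematic): share the new solution with neighbors only
    if its noisy change since the last iteration exceeds a noisy threshold.

    Both the change and the threshold are perturbed, so the comparison itself
    leaks only a bounded amount of privacy; noise scales here are illustrative.
    """
    diff = math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))
    noisy_threshold = threshold + _laplace(2.0 / epsilon, rng)
    noisy_diff = diff + _laplace(4.0 / epsilon, rng)
    return noisy_diff > noisy_threshold
```

A solution that barely changed is discarded, saving communication and privacy budget; a solution that moved substantially is passed on.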
Attributed networks are nowadays ubiquitous in a myriad of high-impact applications, such as social network analysis, financial fraud detection, and drug discovery. As a central analytical task on attributed networks, node classification has received much attention in the research community. In real-world attributed networks, a large portion of node classes contain only a limited number of labeled instances, resulting in a long-tailed node class distribution. Existing node classification algorithms are unequipped to handle these few-shot node classes. As a remedy, few-shot learning has attracted a surge of attention in the research community. Yet few-shot node classification remains a challenging problem, as we need to address the following questions: (i) How do we extract meta-knowledge from an attributed network for few-shot node classification? (ii) How do we identify the informativeness of each labeled instance for building a robust and effective model? To answer these questions, in this paper we propose a graph meta-learning framework -- Graph Prototypical Networks (GPN). By constructing a pool of semi-supervised node classification tasks to mimic the real test environment, GPN is able to perform meta-learning on an attributed network and derive a highly generalizable model for handling the target classification task. Extensive experiments demonstrate the superior capability of GPN in few-shot node classification.
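The prototypical-network step underlying GPN can be sketched on plain embeddings. GPN additionally weighs the informativeness of each labeled instance and obtains the embeddings from a graph encoder; this sketch shows only the vanilla prototype computation and nearest-prototype classification:

```python
def prototypes(support):
    """Class prototype = mean of the embeddings of that class's support nodes.

    support: dict mapping class label -> list of embedding vectors.
    """
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in support.items()
    }

def classify(query, protos):
    """Assign a query embedding to the nearest prototype (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sqdist(query, protos[label]))
```

With only a handful of labeled nodes per class, the prototype acts as a low-variance class representative, which is what makes the approach attractive in the few-shot regime.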
Spreadsheets are widely used for data presentation and management, where users create tables in various structures to organize and present data. Table formatting helps exhibit table structures and data relationships, but without the aid of intelligent tools, manual formatting is a tedious and time-consuming task. In this paper, we propose CellGAN, a neural formatting model for learning and recommending formats of spreadsheet tables. Based on a novel conditional generative adversarial network (cGAN) architecture, CellGAN learns table formatting from real-world spreadsheet tables in a self-supervised fashion, without requiring human labeling. In CellGAN we devise two mechanisms, row/column-wise pooling and a local refinement network, to address challenges specific to the spreadsheet domain. We evaluate the effectiveness of CellGAN on real-world datasets using both quantitative metrics and human perception studies. The results indicate remarkable performance gains over rule-based methods, graphical models, and direct application of state-of-the-art cGANs used in visual synthesis tasks. Neural formatting is a first step towards auto-formatting of spreadsheet tables, with promising results.
Graph Neural Networks (GNNs) have been widely applied to fraud detection problems in recent years, revealing the suspiciousness of nodes by aggregating their neighborhood information via different relations. However, few prior works have noticed the camouflage behavior of fraudsters, which can hamper the performance of GNN-based fraud detectors during the aggregation process. In this paper, we introduce two types of camouflage based on recent empirical studies, i.e., feature camouflage and relation camouflage. Existing GNNs do not address these two camouflages, which results in their poor performance on fraud detection problems. To remedy this, we propose a new model named CAmouflage-REsistant GNN (CARE-GNN), which enhances the GNN aggregation process with three unique modules against camouflage. Concretely, we first devise a label-aware similarity measure to find informative neighboring nodes. Then, we leverage reinforcement learning (RL) to find the optimal number of neighbors to be selected. Finally, the selected neighbors across different relations are aggregated together. Comprehensive experiments on two real-world fraud datasets demonstrate the effectiveness of the RL algorithm. The proposed CARE-GNN also outperforms state-of-the-art GNNs and GNN-based fraud detectors. We integrate all GNN-based fraud detectors into an open-source toolbox: https://github.com/safe-graph/DGFraud. The CARE-GNN code and datasets are available at https://github.com/YingtongDou/CARE-GNN.
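The neighbor-filtering idea can be sketched simply. Here the "label-aware similarity" is approximated by closeness of predicted fraud scores, and the keep ratio is a fixed input; in CARE-GNN itself the similarity measure is learned and the keep ratio per relation is tuned by the RL module:

```python
def select_neighbors(center_score, neighbor_scores, keep_ratio):
    """Label-aware neighbor filtering (sketch).

    Rank neighbors by how close their predicted fraud scores are to the
    center node's score, and keep only the top fraction. Camouflaged
    connections to dissimilar nodes are thus excluded from aggregation.
    """
    ranked = sorted(neighbor_scores.items(), key=lambda kv: abs(kv[1] - center_score))
    k = max(1, int(len(ranked) * keep_ratio))
    return [node for node, _ in ranked[:k]]
```

The surviving neighbors (per relation) would then be averaged in the usual GNN aggregation step.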
With advancements in deep learning techniques, it is now possible to generate super-realistic images and videos, i.e., deepfakes. These deepfakes can reach a mass audience and have adverse impacts on our society. Although much effort has been devoted to detecting deepfakes, detector performance drops significantly on previously unseen but related manipulations, and generalization remains a problem. Motivated by the fine-grained nature and spatial-locality characteristics of deepfakes, we propose the Locality-Aware AutoEncoder (LAE) to bridge this generalization gap. During training, we use a pixel-wise mask to regularize the local interpretation of LAE, forcing the model to learn intrinsic representations from the forgery region instead of capturing artifacts in the training set and learning superficial correlations. We further propose an active learning framework to select challenging candidates for labeling, which requires human masks for less than 3% of the training data, dramatically reducing the annotation effort needed to regularize interpretations. Experimental results on three deepfake detection tasks indicate that LAE focuses on the forgery regions when making decisions. The analysis further shows that LAE outperforms the state of the art by 6.52%, 12.03%, and 3.08% respectively on the three tasks in terms of generalization accuracy on previously unseen manipulations.
Argument search engines identify, extract, and rank the most important arguments for and against a given controversial topic. A number of such systems have recently been developed, usually focusing on classic information retrieval ranking methods based on frequency information. An important aspect that search engines have so far ignored is the quality of arguments. We present a quality-aware ranking framework for arguments already extracted from texts and represented as argument graphs, considering multiple established quality measures. An extensive evaluation with a standard benchmark collection demonstrates that taking quality into account significantly helps to improve retrieval quality for argument search. We also publish a dataset in which arguments with respect to topics were painstakingly annotated by humans along three widely accepted argument quality dimensions.
Recent advances in Internet of Things (IoT) technology have led to a surge in the popularity of sensing applications. As a result, people increasingly rely on information obtained from sensors to make decisions in their daily lives. Unfortunately, in most sensing applications, sensors are error-prone and their measurements can become misleading at any unexpected time. Therefore, to enhance the reliability of sensing applications, we believe it is highly important to monitor not only the physical phenomena/processes of interest but also the reliability of the sensors themselves, and to clean the sensor data before any analysis is conducted on it. Existing studies often regard sensor reliability monitoring and sensor data cleaning as separate problems. In this work, we propose RelSen, a novel optimization-based framework that addresses the two problems simultaneously by exploiting the mutual dependence between them. Furthermore, RelSen is not application-specific, as its implementation assumes minimal prior knowledge of the process dynamics under monitoring, which significantly improves its generality and applicability in practice. In our experiments, we apply RelSen to an outdoor air pollution monitoring system and a condition monitoring system for a cement rotary kiln. Experimental results show that our framework can promptly identify unreliable sensors and remove sensor measurement errors caused by the three most commonly observed types of sensor faults.
In recent years, algorithm research in the area of recommender systems has shifted from matrix factorization techniques and their latent factor models to neural approaches. However, given the proven power of latent factor models, some newer neural approaches incorporate them within more complex network architectures. One specific idea, recently put forward by several researchers, is to consider potential correlations between the latent factors, i.e., embeddings, by applying convolutions over the user-item interaction map. However, contrary to what is claimed in these articles, such interaction maps do not share the properties of images where Convolutional Neural Networks (CNNs) are particularly useful. In this work, we show through analytical considerations and empirical evaluations that the claimed gains reported in the literature cannot be attributed to the ability of CNNs to model embedding correlations, as argued in the original papers. Moreover, additional performance evaluations show that all of the examined recent CNN-based models are outperformed by existing non-neural machine learning techniques or traditional nearest-neighbor approaches. On a more general level, our work points to major methodological issues in recommender systems research.
Given a user query, traditional multi-turn retrieval-based dialogue systems first retrieve a set of candidate responses from historical dialogue sessions, and response selection models then select the most appropriate response for the given query. However, previous work only considers the matching between the query and the response, ignoring the informative dialogue session in which the response is located. This session, composed of the response, the response's history, and the response's future, often contains valuable contextual information that can help the response selection task. More specifically, if the current query and a response's history both refer to the same question, that response is quite likely to answer the query. As for the response's future, it can provide contextual hints and supplementary information that might be omitted from the response itself. Motivated by this, we propose a query-to-session matching (QSM) framework that makes full use of session information: matching the query with the candidate session instead of the response only. Different from previous work, which ranks responses directly, the response in the session with the highest query-to-session matching score is selected as the desired response. In our framework, the query, history, and future are all sequences of utterances, making it necessary to model the relationships among utterances. We therefore propose a novel dialogue flow aware query-to-session matching (DF-QSM) model, in which dialogue flows model the relationships among utterances through a memory network. To the best of our knowledge, ours is the first work to utilize both the response's history and its future in the response selection task. Experimental results on three multi-turn response selection benchmarks show that our model outperforms existing state-of-the-art methods by a large margin.
Relational database systems hold massive amounts of text, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, pre-trained word embeddings are increasingly utilized to convert text values into meaningful numbers. However, a naïve one-to-one mapping of each word in a database to a word embedding vector misses the rich context information given by the database schema. Thus, we propose Retro, a novel relational retrofitting framework that learns numerical representations of text values in databases, capturing both the rich information encoded by pre-trained word embedding models and the context information provided by tabular and foreign-key relations in the database. We define relational retrofitting as an optimization problem, present an efficient algorithm for solving it, and investigate the influence of various hyperparameters. Further, we develop simple feed-forward and more complex graph convolutional neural network architectures to operate on those representations. Our evaluation shows that the proposed embeddings and models are ready to use for many ML tasks, such as text classification, imputation, and link prediction, and even outperform state-of-the-art techniques.
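The flavor of retrofitting can be shown with one coordinate update in the style of classic word-vector retrofitting, which Retro extends with relational and tabular context; the update rule, weights, and signature below are illustrative of that classic scheme, not Retro's actual objective:

```python
def retrofit_step(original, neighbor_vectors, alpha=1.0, beta=1.0):
    """One coordinate update of classic retrofitting (sketch).

    Pulls a text value's vector toward a weighted average of its
    pre-trained embedding (`original`, weight alpha) and the vectors of
    related values (`neighbor_vectors`, weight beta each), so schema
    relations reshape the embedding without discarding pre-trained info.
    """
    total_weight = alpha + beta * len(neighbor_vectors)
    return [
        (alpha * original[d] + beta * sum(n[d] for n in neighbor_vectors)) / total_weight
        for d in range(len(original))
    ]
```

Iterating this update over all values until convergence solves the underlying least-squares objective; with equal weights, a value with two neighbors simply moves to the mean of itself and its neighbors.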
Knowledge graph embedding, which aims to learn low-dimensional embeddings of entities and relations, plays a vital role in a wide range of applications. It is crucial for knowledge graph embedding models to model and infer various relation patterns, such as symmetry/antisymmetry, inversion, and composition. However, most existing methods fail to model the non-commutative composition pattern, which is essential, especially for multi-hop reasoning. To address this issue, we propose a new model called Rotate3D, which maps entities into three-dimensional space and defines relations as rotations from head entities to tail entities. By exploiting the non-commutative composition property of rotations in three-dimensional space, Rotate3D naturally preserves the order of the composition of relations. Experiments show that Rotate3D outperforms existing state-of-the-art models for link prediction and path query answering. Further case studies demonstrate that Rotate3D effectively captures various relation patterns, with a marked improvement in modeling the composition pattern.
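The key property can be verified directly: unlike 2D rotations (complex-number models), 3D rotations do not commute, so the order in which relations compose matters. This rotation-matrix sketch illustrates the geometric fact the model relies on; it is not Rotate3D's actual parameterization:

```python
import math

def rot_x(theta):
    # Rotation about the x-axis in 3D.
    c, s = math.cos(theta), math.sin(theta)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_z(theta):
    # Rotation about the z-axis in 3D.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def apply(R, v):
    # Matrix-vector product: rotate a 3D "entity" vector.
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]
```

Applying a z-rotation then an x-rotation to a head vector lands in a different place than the reverse order, so a composed relation r1∘r2 is distinguishable from r2∘r1.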
Existing review-based recommendation models mainly learn long-term user and item representations from a set of reviews. Because they ignore the rich side information of reviews, these models suffer from two drawbacks: 1) they fail to capture short-term changes of user preferences and item features reflected in reviews, and 2) they cannot accurately model high-order user-item collaborative signals from reviews. To overcome these limitations, we propose a multi-view approach named Set-Sequence-Graph (SSG) that augments existing single-view (i.e., view of set) methods by introducing two additional views of exploiting reviews: sequence and graph. In particular, with reviews organized in the form of a set, a sequence, and a graph respectively, we design a three-way encoder architecture that jointly captures the long-term (set), short-term (sequence), and collaborative (graph) features of users and items for recommendation. For the sequence encoder, we propose a short-term priority attention network that explicitly takes the order and personalized time intervals of reviews into consideration. For the graph encoder, we design a novel review-aware graph attention network to model high-order multi-aspect relations in the user-item graph. To combat the potential redundancy in captured features, our fusion module employs a cross-view decorrelation mechanism to encourage diverse representations from multiple views for integration. Experiments on public datasets demonstrate that SSG significantly outperforms state-of-the-art methods.
Knowledge graphs (KGs), which have become the backbone of many critical knowledge-centric applications, are mostly automatically constructed by an ensemble of extraction techniques applied over diverse data sources. It is therefore important to establish the provenance of query results to determine how they were computed. Provenance has been shown to be useful for assigning confidence scores to results, for debugging the KG generation itself, and for providing answer explanations. In many such applications, certain queries are registered as standing queries, since their answers are needed often. However, KGs change continuously, due to changes in the source data, improvements to the extraction techniques, refinement/enrichment of information, and so on. This raises the issue of efficiently maintaining the provenance polynomials of complex graph pattern queries over dynamic and large KGs, instead of recomputing them from scratch each time the KG is updated. Addressing this issue, we present HUKA, a framework that uses provenance polynomials to track the derivation of query results over knowledge graphs by encoding the edges involved in generating each answer. More importantly, HUKA also maintains these provenance polynomials in the face of updates---insertions as well as deletions of facts---in the underlying KG. Experimental results over large real-world KGs such as YAGO and DBpedia with various benchmark SPARQL query workloads reveal that HUKA can be almost 50 times faster than existing systems for provenance computation on dynamic KGs.
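Why polynomials make deletions cheap can be seen in miniature. Representing a provenance polynomial as a set of monomials (each monomial a set of edge identifiers, i.e., one derivation of the answer) is a standard encoding; the representation and function below are our simplification, not HUKA's data structures:

```python
def prune(polynomial, deleted_edge):
    """Maintain an answer's provenance after an edge deletion (sketch).

    polynomial: set of monomials; each monomial is a frozenset of edge ids
    forming one derivation of the answer. Deleting an edge simply drops
    every derivation that used it -- no recomputation from scratch. If the
    result is empty, the answer itself no longer holds.
    """
    return {m for m in polynomial if deleted_edge not in m}
```

For example, an answer derivable via edges {e1, e2} or via {e3} alone survives the deletion of e1 through its second derivation.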
Few-shot relation classification seeks to classify incoming query instances after seeing only a few support instances. This ability is usually gained by training with a large amount of in-domain annotated data. In this paper, we tackle an even harder problem by further limiting the amount of data available at training time. We propose a few-shot learning framework for relation classification that is particularly powerful when the training data is very small. In this framework, models not only strive to classify query instances, but also seek underlying knowledge about the support instances to obtain better instance representations. The framework also includes a method for aggregating cross-domain knowledge into models via open-source task enrichment. Additionally, we construct a brand-new dataset: the TinyRel-CM dataset, a few-shot relation classification dataset in the health domain with purposely small training data and challenging relation classes. Experimental results demonstrate that our framework brings performance gains for most underlying classification models, outperforms state-of-the-art results given small training data, and achieves competitive results with sufficiently large training data.
Embedding knowledge graphs (KGs) into continuous vector spaces is currently an active research area. Soft rules, despite their uncertainty, are highly beneficial to KG embedding, but they have not been sufficiently studied by recent methods. A major challenge is how to devise a principled framework that efficiently and effectively integrates such soft logical information into embedding models. This paper proposes a highly scalable and effective method for preserving soft logical regularities by imposing soft rule constraints on relation representations. Specifically, we first represent relations as bilinear forms and map entity representations into a non-negative and bounded space. Then we derive a rule-based regularization that simply enforces relation representations to satisfy the constraints introduced by soft rules. The proposed method has the following advantages: 1) it regularizes relations directly, with the complexity of rule learning independent of the entity set size, improving scalability; 2) it imposes prior logical information upon the structure of the embedding space, which is beneficial to knowledge reasoning. Evaluations of link prediction on Freebase and DBpedia show the effectiveness of our approach over many competitive baselines. Code and datasets are available at https://github.com/StudyGroup-lab/SLRE.
With the recent advancement of deep learning, molecular representation learning -- automating the discovery of feature representations of molecular structure -- has attracted significant attention from both chemists and machine learning researchers. It can facilitate a variety of downstream applications, including bio-property prediction, chemical reaction prediction, etc. Although current SMILES-string-based and molecular-graph-based representation learning algorithms (via sequence modeling and graph neural networks, respectively) have achieved promising results, no existing work integrates the capabilities of both approaches to preserve molecular characteristics (e.g., atomic clusters, chemical bonds) for further improvement. In this paper, we propose GraSeq, a joint graph and sequence representation learning model for molecular property prediction. Specifically, GraSeq makes a complementary combination of graph neural networks and recurrent neural networks for modeling the two types of molecular input, respectively. In addition, it is trained with a multitask loss of unsupervised reconstruction and various downstream tasks, using labeled datasets of limited size. In a variety of chemical property prediction tests, we demonstrate that our GraSeq model achieves better performance than state-of-the-art approaches.
Understanding user interaction behaviors remains a challenging problem. Quantifying behavior dynamics over time as users complete tasks has only been done in specific domains. In this paper, we present a user behavior model built on behavior embeddings to compare behaviors and their change over time. To this end, we first define the formal model and train it using both action (e.g., copy/paste) embeddings and user interaction feature (e.g., length of the copied text) embeddings. Having obtained vector representations of user behaviors, we then define three measurements to model behavior dynamics over time, namely: behavior position, displacement, and velocity. To evaluate the proposed methodology, we use three real-world datasets: (i) tens of users completing complex data curation tasks in a lab setting, (ii) hundreds of crowd workers completing structured tasks in a crowdsourcing setting, and (iii) thousands of editors completing unstructured editing tasks on Wikidata. Through these datasets, we show that the proposed methodology can: (i) surface behavioral differences among users; (ii) recognize relative behavioral changes; and (iii) discover directional deviations of user behaviors. Our approach can be used (i) to capture behavioral semantics from data in a consistent way, (ii) to quantify behavioral diversity for a task and among different users, and (iii) to explore the temporal evolution of behavior with respect to various task properties (e.g., structure and difficulty).
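The three measurements have a natural kinematic reading once behaviors live in a vector space. The definitions below are one plausible instantiation (position as the centroid of a window's behavior embeddings, displacement and velocity as vector difference and rate of change); the paper's exact formulations may differ:

```python
def position(embeddings):
    """Behavior position: centroid of the behavior embeddings in a time window."""
    return [sum(dim) / len(embeddings) for dim in zip(*embeddings)]

def displacement(pos_a, pos_b):
    """Behavior displacement: vector from an earlier position to a later one."""
    return [b - a for a, b in zip(pos_a, pos_b)]

def velocity(pos_a, pos_b, dt):
    """Behavior velocity: displacement per unit time, capturing how fast
    (and in which direction) a user's behavior is drifting."""
    return [d / dt for d in displacement(pos_a, pos_b)]
```

Comparing two users' displacement directions, for instance, would surface whether their behaviors are diverging or converging over the same task.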
In the past decade, the heterogeneous information network (HIN) has become an important methodology for modern recommender systems. To fully leverage its power, manually designed network templates, i.e., meta-structures, are introduced to filter out semantic-aware information. Hand-crafted meta-structures rely on intensive expert knowledge, which is both laborious and data-dependent. On the other hand, the number of meta-structures grows exponentially with their size and the number of node types, which prohibits brute-force search. To address these challenges, we propose Genetic Meta-Structure Search (GEMS) to automatically optimize meta-structure designs for recommendation on HINs. Specifically, GEMS adopts a parallel genetic algorithm to search for meaningful meta-structures for recommendation, and designs dedicated rules and a meta-structure predictor to efficiently explore the search space. Finally, we propose an attention-based multi-view graph convolutional network module to dynamically fuse information from different meta-structures. Extensive experiments on three real-world datasets demonstrate the effectiveness of GEMS, which consistently outperforms all baseline methods for HIN recommendation. Compared with a simplified GEMS that utilizes hand-crafted meta-paths, GEMS achieves over 6% performance gain on most evaluation metrics. More importantly, we conduct an in-depth analysis of the identified meta-structures, which sheds light on HIN-based recommender system design.
How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation. Previous works usually employ various neural network architectures (e.g., recurrent neural networks, attention mechanisms, and hierarchical structures) to model the history. However, a recent empirical study by Sankar et al. has shown that these architectures lack the ability to understand and model the dynamics of the dialogue history. For example, the widely used architectures are insensitive to perturbations of the dialogue history, such as shuffled words, missing utterances, and reordered utterances. To tackle this problem, we propose a Ranking Enhanced Dialogue generation framework. In addition to the traditional representation encoder and response generation modules, an additional ranking module is introduced to model the ranking relation between the former utterance and its consecutive utterances. Specifically, the former utterance and the consecutive utterances are treated as a query and the corresponding documents, and both local and global ranking losses are designed in the learning process. In this way, the dynamics in the dialogue history can be explicitly captured. To evaluate our proposed models, we conduct extensive experiments on three public datasets, i.e., bAbI, PersonaChat, and JDC. Experimental results show that our models produce better responses in terms of both quantitative measures and human judgments, compared with state-of-the-art dialogue generation models. Furthermore, we provide detailed experimental analysis showing where and how the improvements come from.
Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers, phones, or Internet-of-things devices. Many companies collect such data with the goal of using it for training machine learning models allowing them to improve their services. User-held data is, however, often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm, but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious. We thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information. We evaluate our model in two ways. First, in an offline evaluation, we train SecVM to predict user gender from tweets, showing that we can preserve user privacy without sacrificing classification performance. Second, we implement SecVM's distributed framework for the Cliqz web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients, outperforming baselines by a large margin and thus showcasing that SecVM is suitable for production environments.
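The general training loop can be sketched as follows. This is a simplified illustration of distributed linear-SVM training, not SecVM's actual protocol: here clients send plain aggregate subgradients, whereas SecVM additionally designs its messages to reveal no personally identifiable information.

```python
import numpy as np

def client_update(w, X, y, C=1.0):
    """Each client computes the subgradient of the regularized hinge
    loss on its local data; only this aggregate vector (never the
    raw features or labels) is sent to the server."""
    margins = y * (X @ w)
    active = margins < 1  # points violating the margin
    return w - C * (y[active, None] * X[active]).sum(axis=0)

def server_train(clients, dim, lr=0.1, epochs=50):
    """Untrusted server: it only aggregates client messages and
    updates the weight vector w."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = sum(client_update(w, X, y) for X, y in clients)
        w -= lr * grad / len(clients)
    return w
```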
Recurrent Neural Networks (RNNs) are the state-of-the-art approach to sequential learning. However, standard RNNs use the same amount of computation to generate their hidden states at each timestep, regardless of the input data. Recent works have begun to tackle this rigid assumption by imposing a priori-determined patterns for updating the states at each step. These approaches could lend insights into the dynamics of RNNs and possibly speed up inference. However, the pre-determined nature of the current update strategies limits their application. To overcome this, we instead design the first fully-learned approach, SA-RNN, that augments any RNN by predicting discrete update patterns at the fine granularity of individual hidden state neurons. This is achieved through the parameterization of a distribution of update-likelihoods driven by the input data. Unlike related methods, our approach imposes no assumptions on the structure of the update patterns. Better yet, our method adapts its update patterns online, allowing different dimensions to be updated conditionally based on the input. To learn which dimensions to update, the model solves a multi-objective optimization problem, maximizing task performance while minimizing the number of updates based on a unified control. Using five publicly-available datasets spanning three sequential learning settings, we demonstrate that our method consistently achieves higher accuracy with fewer updates compared to state-of-the-art alternatives. We also show the benefits of learning to sparsely-update a large hidden state as opposed to densely-update a small hidden state. As an added benefit, our method can be directly applied to a wide variety of models containing RNN architectures.
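The per-neuron conditional update can be sketched as follows (an inference-time illustration under assumed names; the gate parameterization, and the relaxation used during training, are simplifications of the paper's design):

```python
import numpy as np

def sparse_update_step(h, x, Wh, Wx, Wu, bu):
    """One step of a conditionally updated RNN: an input-driven gate
    predicts, per hidden dimension, whether to recompute that
    dimension or carry it over unchanged from the previous step."""
    update_prob = 1.0 / (1.0 + np.exp(-(Wu @ x + bu)))  # per-neuron update likelihood
    mask = (update_prob > 0.5).astype(float)            # hard decision at inference
    h_candidate = np.tanh(Wh @ h + Wx @ x)
    return mask * h_candidate + (1.0 - mask) * h        # skipped dims are copied
```

Skipped dimensions cost no recomputation, which is the source of the inference savings discussed above.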
We propose a flexible framework for clustering hypergraph-structured data based on recently proposed random walks utilizing edge-dependent vertex weights. When incorporating edge-dependent vertex weights (EDVW), a weight is associated with each vertex-hyperedge pair, yielding a weighted incidence matrix of the hypergraph. Such weightings have been utilized in term-document representations of text data sets. We explain how random walks with EDVW serve to construct different hypergraph Laplacian matrices, and then develop a suite of clustering methods that use these incidence matrices and Laplacians for hypergraph clustering. Using several data sets from real-life applications, we compare the performance of these clustering algorithms experimentally against a variety of existing hypergraph clustering methods. We show that the proposed methods produce high-quality clusters and conclude by highlighting avenues for future work.
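The EDVW random walk can be sketched directly from the weighted incidence matrix (variable names `R` and `W` are illustrative; the step rule below, pick an incident hyperedge, then land on a vertex in it with probability proportional to its edge-dependent weight, follows the construction described above):

```python
import numpy as np

def edvw_transition(R, W):
    """Random-walk transition matrix for a hypergraph with
    edge-dependent vertex weights (EDVW).
    R[v, e] = gamma_e(v), the weight of vertex v within hyperedge e
              (zero if v is not in e); W[e] = weight of hyperedge e.
    Assumes every vertex belongs to at least one hyperedge.
    Step: from v, pick an incident hyperedge e with prob ~ W[e],
    then land on w in e with prob ~ R[w, e]."""
    incident = (R > 0).astype(float)                 # vertex-hyperedge incidence
    edge_pick = incident * W                         # P(e | v), unnormalized
    edge_pick /= edge_pick.sum(axis=1, keepdims=True)
    vertex_pick = R / R.sum(axis=0, keepdims=True)   # P(w | e)
    return edge_pick @ vertex_pick.T                 # P[v, w]
```

Hypergraph Laplacians can then be built from this transition matrix and its stationary distribution.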
Embedding entities and relations into continuous vector spaces has attracted a surge of interest in recent years. Most embedding methods assume that all test entities are available during training, which makes it time-consuming to retrain embeddings for newly emerging entities. To address this issue, recent works apply graph neural networks to the existing neighbors of unseen entities. In this paper, we propose a novel framework, namely the Virtual Neighbor (VN) network, to address three key challenges. Firstly, to reduce the neighbor sparsity problem, we introduce the concept of virtual neighbors inferred by rules, and assign soft labels to these neighbors by solving a rule-constrained optimization problem rather than simply regarding them as unquestionably true. Secondly, many existing methods only use one-hop or two-hop neighbors for aggregation and ignore distant information that may be helpful. Instead, we identify both logic rules and symmetric path rules to capture complex patterns. Finally, instead of a one-time injection of rules, we employ an iterative learning scheme between the embedding method and virtual neighbor prediction to capture the interactions between them. Experimental results on two knowledge graph completion tasks demonstrate that our VN network significantly outperforms state-of-the-art baselines. Furthermore, results on Subject/Object-R show that our proposed VN network is highly robust to the neighbor sparsity problem.
The data of egocentric networks (ego networks) are very important for evaluating and validating algorithms and machine learning approaches in Online Social Networks (OSNs). Nevertheless, obtaining ego network data from OSNs is not a trivial task. Conventional manual approaches are time-consuming, and only a small number of users would agree to contribute their data. This is because there are two important factors that should be considered simultaneously for this data acquisition task: i) users' willingness to contribute their data, and ii) the structure of the ego network. However, addressing the above two factors to obtain more complete ego network data has not received much research attention. Therefore, in this paper, we make a first attempt to address this issue by proposing a new research problem, named Willingness Maximization for Ego Network Extraction in Online Social Networks (WMEgo), to identify a set of ego networks from the OSN such that the willingness of the users to contribute their data is maximized. We prove that WMEgo is NP-hard and propose a 1/2·(1-1/e)-approximation algorithm, named Ego Network Identification with Maximum Willingness (EIMW). We conduct an evaluation study with 672 volunteers to validate the proposed WMEgo and EIMW, and perform extensive experiments on multiple real datasets to demonstrate the effectiveness and efficiency of our approach.
Recently, knowledge-grounded conversations in the open domain have gained great attention from researchers. Existing works on retrieval-based dialogue systems have devoted tremendous effort to utilizing neural networks to build a matching model, where all of the context and knowledge contents are used to match the response candidate with various representation methods.
In fact, different parts of the context and knowledge are differentially important for recognizing the proper response candidate, as many utterances are useless due to topic shift. Such excessive useless information in the context and knowledge can interfere with the matching process and lead to inferior performance. To address this problem, we propose a multi-turn Response Selection Model that can Detect the relevant parts of the Context and Knowledge collection (RSM-DCK). Our model first uses the recent context as a query to pre-select relevant parts of the context and knowledge collection at the word-level and utterance-level semantics. Further, the response candidate interacts with the selected context and knowledge collection respectively. In the end, the fused representation of the context and response candidate is utilized to post-select the relevant parts of the knowledge collection for matching with more confidence. We test our proposed model on two benchmark datasets. Evaluation results indicate that our model achieves better performance than the existing methods, and can effectively detect the relevant context and knowledge for response selection.
Automatically generating a human-like description for a given image is an active research topic in artificial intelligence, which has attracted a great deal of attention recently. Most of the existing attention methods explore the mapping relationships between words in the sentence and regions in the image, but such an unconstrained matching manner sometimes causes inharmonious alignments that may reduce the quality of the generated captions. In this paper, we make an effort to reason about more accurate and meaningful captions. We first propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word. The word attention emphasizes word importance when focusing on different regions of the input image, and makes full use of the internal annotation knowledge to assist the calculation of visual attention. Then, in order to reveal those incomprehensible intentions that cannot be expressed straightforwardly by machines, we inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning. We validate our model on two freely available captioning benchmarks: the Microsoft COCO dataset and the Flickr30k dataset. The results demonstrate that our approach achieves state-of-the-art performance and outperforms many of the existing approaches.
Many achievements have been made by studying how to visualize large-scale and high-dimensional data in typically 2D or 3D space. Normally, such a process is performed through a non-parametric (unsupervised) approach, which is limited in handling unseen data. In this work, we study a parametric (supervised) model which is capable of learning a mapping between a high-dimensional data space R^d and a low-dimensional latent space R^s, with the similarity structure in R^d preserved, where s ≪ d. We propose GNNVis, a framework that applies the idea of Graph Neural Networks (GNNs) to the parametric learning process; the learned mapping serves as a Visualizer (Vis) to compute the low-dimensional embeddings of unseen data online. In our framework, the features of data nodes, as well as the (hidden) information of their neighbors, are fused to conduct Dimension Reduction. To the best of our knowledge, none of the existing visualization works have studied how to combine such information into the learned representation. Moreover, the learning process of GNNVis is designed in an end-to-end manner and can easily be extended to arbitrary Dimension Reduction methods if the corresponding objective function is given. Based on GNNVis, several typical dimension reduction methods, i.e., t-SNE, LargeVis, and UMAP, are investigated. As a parametric framework, GNNVis is an inherently efficient Visualizer capable of computing the embeddings of large-scale unseen data. To guarantee its scalability in the Training Stage, a novel training strategy with Subgraph Negative Sampling (SNS) is adopted to reduce the corresponding cost. Experimental results on real datasets demonstrate the advantages of GNNVis. The visualization quality of GNNVis outperforms state-of-the-art parametric models, and is comparable to that of non-parametric models.
With the rapid progress of global urbanization and function division among different geographical regions, there is an urgent need to develop methods that can find regions with desired future function distributions in applications. For example, a company tends to open a new branch in a region where the growth trend of industrial sectors fits its strategic goals, or is similar to that of an existing company location; while a job hunter tends to search for regions where his/her expertise aligns with an industrial growth trend that provides sufficient job opportunities to sustain future employment and job-hopping.
Our solution is to learn a distribution (aka. embedding) of the growth of various industrial sectors for each region, so that the embeddings of different regions can be searched, or compared for similarity querying. We consider the fine granularity of ZIP code areas as they are usually representative of the regional functions. By effectively utilizing open data on the Internet such as government data (e.g., from US Census Bureau) and third-party data for supervised learning, we propose to first construct a multigraph that captures the various relationships between regions such as direct flight connections and shared school districts, and then learn region embeddings using a novel graph convolutional network architecture. Our multigraph convnet (MGCN) differentiates various feature types such as demographic, social, economic and housing features, and learns different weights on different features and spatial relationships for effective data-driven feature aggregation.
While deep learning is known to require large amounts of data to train, our weighted MGCN (WMGCN) is designed to minimize the number of parameters so that it does not underfit on the limited amount of open data. Extensive experiments are conducted to compare our WMGCN model with several competitive baselines to demonstrate the superiority of our WMGCN design.
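The idea of learning a small number of weights over feature types and spatial relationships can be sketched as follows (names and the scalar-weight parameterization are illustrative assumptions, not WMGCN's exact design):

```python
import numpy as np

def multigraph_aggregate(X, adjs, rel_weights, feat_weights):
    """Sketch of weighted multigraph aggregation: node features X
    are first reweighted per feature, then aggregated over each
    relation's adjacency matrix with a learned scalar weight,
    keeping the parameter count small (one scalar per relation
    plus one weight per feature)."""
    H = X * feat_weights                     # learned per-feature weights
    out = np.zeros_like(H)
    for A, w in zip(adjs, rel_weights):
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0                  # isolated nodes contribute zero
        out += w * (A @ H) / deg             # row-normalized neighbor aggregation
    return out
```

With only a scalar per relation and a weight per feature, the layer stays far smaller than a full GCN weight matrix per relation, matching the motivation of avoiding underfitting on limited open data.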
Recommendation systems aim to help users discover their most preferred content from an ever-growing corpus of items. Although recommenders have been greatly improved by deep learning, they still face several challenges: (1) Behaviors are much more complex than words in sentences, so traditional attentive and recurrent models have limitations in capturing the temporal dynamics of user preferences. (2) The preferences of users are multiple and evolving, so it is difficult to integrate long-term memory and short-term intent.
In this paper, we propose a temporal gating methodology to improve attention mechanism and recurrent units, so that temporal information can be considered in both information filtering and state transition. Additionally, we propose a hybrid sequential recommender, named Multi-hop Time-aware Attentive Memory network (MTAM), to integrate long-term and short-term preferences. We use the proposed time-aware GRU network to learn the short-term intent and maintain prior records in user memory. We treat the short-term intent as a query and design a multi-hop memory reading operation via the proposed time-aware attention to generate user representation based on the current intent and long-term memory. Our approach is scalable for candidate retrieval tasks and can be viewed as a non-linear generalization of latent factorization for dot-product based Top-K recommendation. Finally, we conduct extensive experiments on six benchmark datasets and the experimental results demonstrate the effectiveness of our MTAM and temporal gating methodology.
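The temporal gating idea can be sketched as follows (a deliberately simplified illustration: the decay form, parameter names, and single-head dot-product attention are assumptions, not MTAM's exact formulation). Attention logits over memory slots are discounted by each behavior's elapsed time, so stale behaviors contribute less to the user representation.

```python
import numpy as np

def time_aware_attention(query, keys, values, deltas, decay=0.1):
    """Time-aware attention sketch: relevance logits are gated by
    the elapsed time (deltas) of each memory slot before the
    softmax, down-weighting old behaviors."""
    logits = keys @ query - decay * deltas   # temporal gating of relevance
    weights = np.exp(logits - logits.max())  # stable softmax
    weights /= weights.sum()
    return weights @ values
```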
Information networks, such as social and citation networks, are ubiquitous in the real world, so network analysis plays an important role in data mining and knowledge discovery. To alleviate the sparsity problem of network analysis, it is common to capture the network semantics by projecting nodes onto a vector space as network embeddings. Moreover, random walks are usually exploited to efficiently learn node embeddings and preserve network proximity. In addition to proximity structure, heterogeneous networks have more knowledge about the types of nodes. However, to profit from heterogeneous knowledge, most of the existing approaches guide the random walks through predefined meta-paths or specific strategies, which can distort the understanding of network structures. Furthermore, traditional random walk-based approaches strongly favor nodes with higher degrees, while other nodes are equally important for downstream applications. In this paper, we propose Meta-context Aware Random Walks (MARU) to overcome these challenges, thereby learning richer and more unbiased representations for heterogeneous networks. To reduce the bias in classical random walks, an algorithm of bidirectional extended random walks is introduced to improve the fairness of representation learning. Based on the enhanced random walks, the meta-context aware skip-gram model is then presented to learn robust network embeddings with dynamic meta-contexts. Therefore, MARU can not only fairly understand the overall network structures but also leverage the sophisticated heterogeneous knowledge in the networks. Extensive experiments have been conducted on three real-world large-scale publicly available datasets. The experimental results demonstrate that MARU significantly outperforms state-of-the-art heterogeneous network embedding methods across three general machine learning tasks, including multi-label node classification, node clustering, and link prediction.
Social recommendation tasks exploit social connections to enhance recommendation performance. To fully utilize each user's first-order and high-order neighborhood preferences, recent approaches incorporate an influence diffusion process for better user preference modeling. Despite the superior performance of these models, they either neglect the latent individual interests hidden in the user-item interactions or rely on computationally expensive graph attention models to uncover the item-induced sub-relations, which essentially determine the influence propagation passages. Since these sparse substructures are derived from the original social network, we name them partial relationships between users. We argue that such relationships can be directly modeled so that both personal interests and shared interests can propagate along a few channels (or dimensions) of the latent user embeddings. To this end, we propose a partial-relationship-aware influence diffusion structure via a computationally efficient multi-channel encoding scheme. Specifically, the encoding scheme first simplifies the graph attention operation based on a channel-wise sparsity assumption, and then adds an InfluenceNorm function to maintain such sparsity. Moreover, ChannelNorm is designed to alleviate the oversmoothing problem in graph neural network models. Extensive experiments on two benchmark datasets show that our method is comparable to state-of-the-art graph attention-based social recommendation models while capturing user interests according to partial relationships more efficiently.
How do users on social platforms consume content shared by their friends? Is this consumption socially motivated, and can we predict it? Considerable prior work has focused on inferring and learning user preferences with respect to broadcasted, or open-network content in public spheres like webpages or public videos. However, user engagement with narrowcasted, closed-network content shared by their friends is considerably under-explored, despite being a commonplace activity. Here we bridge this gap by focusing on consumption of visual media content in closed-network settings, using data from Snapchat, a large multimedia-driven social sharing service with over 200M daily active users. Broadly, we answer questions around content consumption patterns, social factors that are associated with such consumption habits, and predictability of consumption time. We propose models for patterns in users' time-spending behaviors across friends, and observe that viewers preferentially and consistently spend more time on content from certain friends, even without considering any explicit notion of intrinsic content value. We also find that consumption time is highly correlated with several engagement-based social factors, suggesting a large social role in closed-network content consumption. Finally, we propose a novel approach of modeling future consumption time as a learning-to-rank task over users' friends. Our results demonstrate significant predictive value (0.815 P@1, 0.650 nDCG@10) using only social factors. We expect our work to motivate additional research in modeling consumption and ranking of online closed-network content.
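The reported numbers use standard ranking metrics, which can be computed as below (the gain definition shown is the usual one for graded relevance in predicted-rank order; the paper's exact choice of gains is not specified here):

```python
import numpy as np

def precision_at_1(ranked_rel):
    """P@1: does the top-ranked friend carry positive relevance?"""
    return float(ranked_rel[0] > 0)

def ndcg_at_k(ranked_rel, k=10):
    """nDCG@k over one user's ranked list of friends, with relevance
    gains given in predicted-rank order."""
    rel = np.asarray(ranked_rel, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    top = rel[:k]
    dcg = (top * discounts[:top.size]).sum()
    ideal = np.sort(rel)[::-1][:k]               # best possible ordering
    idcg = (ideal * discounts[:ideal.size]).sum()
    return dcg / idcg if idcg > 0 else 0.0
```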
Recent recommender systems have started to employ knowledge distillation, a model compression technique that distills knowledge from a cumbersome model (teacher) to a compact model (student), to reduce inference latency while maintaining performance. The state-of-the-art methods have only focused on making the student model accurately imitate the predictions of the teacher model. They have a limitation in that the prediction results incompletely reveal the teacher's knowledge. In this paper, we propose a novel knowledge distillation framework for recommender systems, called DE-RRD, which enables the student model to learn from the latent knowledge encoded in the teacher model as well as from the teacher's predictions. Concretely, DE-RRD consists of two methods: 1) Distillation Experts (DE), which directly transfers the latent knowledge from the teacher model. DE exploits "experts" and a novel expert selection strategy for effectively distilling the teacher's vast knowledge to the student with limited capacity. 2) Relaxed Ranking Distillation (RRD), which transfers the knowledge revealed by the teacher's predictions with consideration of the relaxed ranking orders among items. Our extensive experiments show that DE-RRD outperforms the state-of-the-art competitors and achieves comparable or even better performance than the teacher model with faster inference time.
In the last decade, there has been great progress in the field of machine learning and deep learning. These models have been instrumental in addressing a great number of problems. However, they have struggled when it comes to dealing with high-dimensional data. In recent years, representation learning models have proven to be quite efficient in addressing this problem, as they are capable of capturing effective lower-dimensional representations of the data. However, most of the existing models are quite ineffective when it comes to dealing with high-dimensional spatiotemporal data, as such data encapsulate complex spatial and temporal relationships that exist among real-world objects. High-dimensional spatiotemporal data of cities represent urban communities. By learning their social structure, we can better quantitatively depict them and understand the factors influencing rapid growth, expansion, and change.
In this paper, we propose a collective embedding framework that leverages the use of auto-encoders and Laplacian score to learn effective embeddings of spatiotemporal networks of urban communities. In addition, we also develop a weighted degree centrality measure for constructing spatiotemporal heterogeneous networks. To evaluate the performance of our proposed model, we implement it on real-world urban community data. Experimental results demonstrate the effectiveness of our model over state-of-the-art alternatives.
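The Laplacian score used in the framework above is a standard unsupervised feature-selection criterion (He et al.): features that better respect the local structure of a similarity graph get lower scores. A minimal sketch of the standard formula (how the framework combines it with the auto-encoder is not shown here):

```python
import numpy as np

def laplacian_score(X, W):
    """Laplacian score of each column (feature) of X with respect to
    a symmetric similarity graph W. Lower = better preserves the
    local manifold structure."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    d = np.diag(D)
    scores = []
    for f in X.T:
        f_tilde = f - (f @ d) / d.sum()         # remove the trivial (constant) component
        num = f_tilde @ L @ f_tilde             # local variation along graph edges
        den = f_tilde @ D @ f_tilde             # global variance
        scores.append(num / den if den > 0 else np.inf)
    return np.array(scores)
```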
Click-through rate prediction is an important task in commercial recommender systems; it aims to predict the probability of a user clicking on an item. The event of a user clicking on an item is accompanied by several user and item features. As modelling the feature interactions effectively can lead to better predictions, it has been the focus of many recent approaches, including deep learning-based models. However, the existing approaches either (i) model all possible feature interactions for a given order, or (ii) manually select which feature interactions to model. Besides, they use the same network structure or function to model all the feature interactions, ignoring the difference in complexity among them. To address these issues, we propose a neural architecture search based approach called AutoFeature that automatically finds essential feature interactions and selects an appropriate structure to model each of these interactions. Specifically, we first define a flexible architecture search space for the CTR prediction task which covers many popular designs such as PIN, PNN, and DeepFM, and enables higher-order interactions. Then we propose an efficient neural architecture search algorithm that recursively refines the search space by partitioning it into several subspaces and samples from the higher-quality ones. Extensive experiments on multiple CTR prediction benchmarks show the superiority of AutoFeature over state-of-the-art baselines. In addition, our experiments show that the learned architectures use fewer FLOPs/parameters and hence can efficiently incorporate higher-order feature interactions, which further boosts performance. Finally, we show that AutoFeature can find meaningful feature interactions.
Consider a network in which items propagate in a manner determined by their inherent characteristics or features. How should we select such inherent content features of a message emanating from a given set of nodes, so as to engender high influence spread over the network? This influential feature set selection problem has received scarce attention, contrary to its dual, the influential node set selection counterpart, which calls for selecting the initial adopter nodes from which a fixed message emanates, so as to reach high influence. However, the influential feature set selection problem arises in many practical settings, where initial adopters are given, while propagation depends on the perception of certain malleable message features. We study this problem for a diffusion governed by a content-aware linear threshold (CALT) model, by which, once the aggregate weight of influence on a node exceeds a randomly chosen threshold, the item goes through. We show that the influence spread function is not submodular, hence a greedy algorithm with approximation guarantees is inadmissible. We propose a method that learns the parameters of the CALT model and adapts the SimPath diffusion estimation method to build a heuristic for the influential feature selection problem. Our experimental study demonstrates the efficacy and efficiency of our technique over synthetic and real data.
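A minimal sketch of the underlying linear threshold diffusion (the content-aware part is elided: in CALT the edge weights would themselves depend on the message's selected features, whereas here they are given directly):

```python
import numpy as np

def linear_threshold_spread(adj, seeds, thresholds):
    """Simulate linear threshold diffusion: a node activates once
    the total incoming influence weight from active in-neighbors
    reaches its threshold. adj[u, v] is the influence weight of
    edge u -> v."""
    n = adj.shape[0]
    active = np.zeros(n, dtype=bool)
    active[list(seeds)] = True
    changed = True
    while changed:
        influence = adj.T @ active.astype(float)      # weight of active in-neighbors
        newly = (~active) & (influence >= thresholds)
        changed = newly.any()
        active |= newly
    return active
```

The influence spread of a feature set would then be estimated by averaging the number of active nodes over random threshold draws.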
It is well-known that online behavior is long-tailed, with most cascaded actions being short and a few being very long. A prominent drawback of generative models for online events is their inability to describe unpopular items well. This work addresses these shortcomings by proposing dual mixture self-exciting processes to jointly learn from groups of cascades. We start from the observation that maximum likelihood estimates for content virality and influence decay are separable in a Hawkes process. Next, our proposed model, which leverages a Borel mixture model and a kernel mixture model, jointly models the unfolding of a heterogeneous set of cascades. When applied to cascades of the same online items, the model directly characterizes their spread dynamics and supplies interpretable quantities, such as content virality and content influence decay, as well as methods for predicting the final content popularities. On two retweet cascade datasets --- one relating to YouTube videos and the second relating to controversial news articles --- we show that our models capture the differences between online items at the granularity of items, publishers and categories. In particular, we are able to distinguish between far-right, conspiracy, controversial and reputable online news articles based on how they diffuse through social media, achieving an F1 score of 0.945. On holdout datasets, we show that the dual mixture model provides better generalization performance for reshare diffusion cascades, especially unpopular ones, and accurate popularity predictions for online items.
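For reference, the Borel distribution that the Borel mixture above builds on is the standard final-size distribution of a (sub)critical Hawkes/branching cascade with mean offspring mu < 1; its probability mass function can be evaluated in log-space to avoid overflow:

```python
import math

def borel_pmf(n, mu):
    """Borel distribution: P(final cascade size = n) for a branching
    process with mean offspring mu < 1.
    P(n) = exp(-mu*n) * (mu*n)^(n-1) / n!  (computed in log-space)."""
    log_p = -mu * n + (n - 1) * math.log(mu * n) - math.lgamma(n + 1)
    return math.exp(log_p)
```

Its mean, 1/(1-mu), is the expected final popularity of a cascade, which is what makes the mixture directly interpretable in terms of content virality.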
Tables in Wikipedia articles contain a wealth of knowledge that would be useful for many applications if it were structured in a more coherent, queryable form. An important problem is that many such tables contain the same type of knowledge but have different layouts and/or schemata. Moreover, some tables refer to entities that we can link to Knowledge Bases (KBs), while others do not. Finally, some tables express entity-attribute relations, while others contain more complex n-ary relations. We propose a novel knowledge extraction technique that tackles these problems. Our method first transforms and clusters similar tables into fewer unified ones to overcome the problem of table diversity. Then, the unified tables are linked to the KB so that knowledge about popular entities propagates to the unpopular ones. Finally, our method applies a technique that relies on functional dependencies to judiciously interpret the table and extract n-ary relations. Our experiments over 1.5M Wikipedia tables show that our clustering can group many semantically similar tables. This leads to the extraction of many novel n-ary relations.
In contrast to traditional online videos, live multi-streaming supports real-time social interactions between multiple streamers and viewers, such as donations. However, donation and multi-streaming channel recommendations are challenging due to complicated streamer and viewer relations, asymmetric communications, and the tradeoff between personal interests and group interactions. In this paper, we introduce Multi-Stream Party (MSP) and formulate a new multi-streaming recommendation problem, called Donation and MSP Recommendation (DAMRec). We propose Multi-stream Party Recommender System (MARS) to extract latent features via socio-temporal coupled donation-response tensor factorization for donation and MSP recommendations. Experimental results on Twitch and Douyu manifest that MARS significantly outperforms existing recommenders by at least 38.8% in terms of hit ratio and mean average precision.
Network embedding models aim to learn low-dimensional representations for nodes and/or edges in graphs. For social networks, learning edge representations is especially beneficial as we need to describe or explain the relationships, activities, and interactions between users. Existing approaches that learn stand-alone node embeddings and represent edges as pairs of node embeddings are limited in their applicability, because nodes participate in multiple relationships, which should be considered. In addition, social networks often contain multiple types of edges, which yields multi-view contexts that need to be considered in the representation. In this paper, we propose a new methodology, MERL, that (1) captures asymmetry in multiple views by learning well-defined edge representations that are responsive to the difference between the source and destination node roles, and (2) incorporates textual communications to identify multiple sources of social signals (e.g., strength and affinity) that moderate the impact of different views between users. Our experiments show that MERL outperforms alternative state-of-the-art embedding models on link prediction and multilabel classification tasks across multiple views in social network datasets. We further analyze the learned embeddings of MERL and demonstrate they are more correlated with the existence of view-based edges compared to previous methods.
Acknowledging the dynamic nature of knowledge graphs, the problem of learning temporal knowledge graph embeddings has recently gained attention. Essentially, the goal is to learn vector representation for the nodes and edges of a knowledge graph taking time into account. These representations must preserve certain properties of the original graph, so as to allow not only classification or clustering tasks, as for classical graph embeddings, but also approximate time-dependent query answering or link predictions over knowledge graphs. For instance, "who was the leader of Germany in 1994?'' or "when was Bonn the capital of Germany?''
Several existing works in the area adapt existing knowledge graph embedding models by adding a time dimension, usually restricting themselves to one time granularity, like years or days, or treating time as fixed labels. However, this is not adequate for many facts of life, for instance, historical and sensory data. In this work, we introduce and evaluate an approach that gracefully adjusts to time validity of virtually any granularity. Our model is robust to non-contiguous validity periods. It is generic enough to adapt to many existing non-temporal models, and its size (number of parameters) does not depend on the size of the graph (number of entities and relations).
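As a concrete illustration of the general recipe such temporal extensions follow, here is a minimal TransE-style scoring sketch in which a learned time vector is added to the relation translation. The function name, the dimensionality, and the random embeddings are illustrative assumptions, not the model proposed in this work:

```python
import numpy as np

def temporal_transe_score(h, r, t, tau):
    """Score a (head, relation, tail) fact at time embedding tau.

    TransE-style pattern: the time vector is added to the relation
    translation; lower scores indicate more plausible facts.
    All names here are illustrative, not the paper's actual model.
    """
    return np.linalg.norm(h + r + tau - t)

# Toy 4-dimensional embeddings (random, for illustration only).
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))
tau_1994 = rng.normal(size=4)  # hypothetical embedding for the year 1994
score = temporal_transe_score(h, r, t, tau_1994)
```

Answering a query like "who was the leader of Germany in 1994?" then reduces to ranking candidate tail entities by this score at the queried time.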
News recommendation systems' purpose is to tackle the immense amount of news and offer personalized recommendations to users. A major issue in news recommendation is capturing precise news representations so that the recommended items are effective. Commonly, news contents are filled with well-known entities of different types. However, existing recommendation systems overlook exploiting external knowledge about entities and topical relatedness among the news. To cope with the above problem, in this paper, we propose the Topic-Enriched Knowledge Graph Recommendation System (TEKGR). Three encoders in TEKGR handle news titles from two perspectives to obtain news representation embeddings: (1) to extract the meaning of news words without considering latent knowledge features in the news and (2) to extract semantic knowledge of news through topic information and contextual information from a knowledge graph. After obtaining news representation vectors, an attention network compares clicked news to the candidate news in order to get the user's final embedding. Our TEKGR model improves on existing news recommendation methods by exploiting topical relations among entities and contextual features of entities. Experimental results on two public datasets show that our approach outperforms state-of-the-art deep recommendation approaches.
This paper presents an efficient method of extracting n-ary relations from multiple sentences, called Entity-path and Discourse relation-centric Relation Extractor (EDCRE). Unlike previous approaches, the proposed method focuses on an entity link, which consists of dependency edges between entities, and on discourse relations between sentences. Specifically, the proposed model consists of two main sub-models. The first one encodes sentences with a higher weight on the entity link while considering the other edges with an attention mechanism. To consider various latent discourse relations between sentences, the second sub-model encodes discourse relations between adjacent sentences considering the contents of each sentence. Experimental results on the cross-sentence relation extraction dataset PubMed and the document-level relation extraction dataset DocRED show that the proposed model outperforms state-of-the-art methods of extracting relations across sentences. Furthermore, an ablation study shows that both main sub-models have a noticeable effect on the relation extraction task.
Accurate demand forecasting of different public transport modes (e.g., buses and light rails) is essential for public service operation. However, the development level of various modes often varies significantly, which makes it hard to predict the demand of modes with insufficient knowledge and sparse station distribution (i.e., station-sparse modes). Intuitively, different public transit modes may exhibit shared demand patterns temporally and spatially in a city. As such, we propose to enhance the demand prediction of station-sparse modes with the data from a station-intensive mode and design a Memory-Augmented Multi-task Recurrent Network (MATURE) to derive the transferable demand patterns from each mode and boost the prediction of station-sparse modes by adapting the relevant patterns from the station-intensive mode. Specifically, MATURE comprises three components: 1) a memory-augmented recurrent network for strengthening the ability to capture long- and short-term information and storing the temporal knowledge of each transit mode; 2) a knowledge adaptation module to adapt the relevant knowledge from a station-intensive source to station-sparse sources; 3) a multi-task learning framework to incorporate all the information and forecast the demand of multiple modes jointly. The experimental results on a real-world dataset covering four public transport modes demonstrate that our model can improve the demand forecasting performance for the station-sparse modes.
In real-world scenarios, data annotated with a set of candidate labels, of which only one is the ground-truth label per instance, are widespread. The learning paradigm with such data, formally referred to as Partial Label (PL) learning, has recently drawn much attention. Traditional PL methods estimate the confidence of each candidate label being the ground-truth label with various regularizations and constraints; however, they only consider local information, resulting in potentially less accurate estimations as well as worse classification performance. To alleviate this problem, we propose a novel PL method, namely PArtial label learNing by simultaneously leveraging GlObal and Local consIsteNcies (Pangolin). Specifically, we design a global consistency regularization term to pull instances associated with similar labeling confidences together by minimizing the distances between instances and label prototypes, and a local consistency term to push instances that share no candidate labels away from each other by maximizing their distances. We further propose a nonlinear kernel extension of Pangolin, and employ the Taylor approximation trick for efficient optimization. Empirical results demonstrate that Pangolin significantly outperforms existing PL baseline methods.
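The two regularizers can be rendered in a toy form as follows. This is a simplified sketch of the pull/push intuition only, with illustrative array shapes, not Pangolin's actual objective or optimization:

```python
import numpy as np

def consistency_terms(X, prototypes, confidences, candidate_masks):
    """Toy versions of the two regularizers.

    global_term pulls each instance toward the prototypes of labels it is
    confident about; local_term is the negated pairwise distance between
    instances sharing no candidate label, so minimizing the sum pushes
    such pairs apart. Purely illustrative.
    X: (n, d) instances; prototypes: (k, d) label prototypes;
    confidences: (n, k) labeling confidences; candidate_masks: (n, k) 0/1.
    """
    # Global: confidence-weighted squared distances to label prototypes.
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (n, k)
    global_term = (confidences * d2).sum()
    # Local: pairs with disjoint candidate sets are rewarded for separation.
    share = candidate_masks @ candidate_masks.T > 0                # (n, n)
    pair_d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)       # (n, n)
    local_term = -pair_d2[~share].sum()
    return global_term, local_term
```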
Personalized review generation (PRG) aims to automatically produce review text reflecting user preference, which is a challenging natural language generation task. Most previous studies do not explicitly model factual descriptions of products, tending to generate uninformative content. Moreover, they mainly focus on word-level generation and cannot accurately reflect more abstract user preferences over multiple aspects. To address the above issues, we propose a novel knowledge-enhanced PRG model based on a capsule graph neural network (CapsGNN). We first construct a heterogeneous knowledge graph (HKG) to utilize rich item attributes. We adopt CapsGNN to learn graph capsules for encoding underlying characteristics from the HKG. Our generation process contains two major steps, namely aspect sequence generation and sentence generation. First, based on graph capsules, we adaptively learn aspect capsules for inferring the aspect sequence. Then, conditioned on the inferred aspect label, we design a graph-based copy mechanism to generate sentences by incorporating related entities or words from the HKG. To our knowledge, we are the first to utilize a knowledge graph for the PRG task. The incorporated KG information is able to enhance user preference modeling at both the aspect and word levels. Extensive experiments on three real-world datasets have demonstrated the effectiveness of our model on the PRG task.
Huge amounts of graph data are published and shared for research and business purposes, which brings great benefit to our society. However, user privacy can be badly undermined even when user identities are anonymized. Graph de-anonymization, which identifies nodes from an anonymized graph, is widely adopted to evaluate users' privacy risks. Most existing de-anonymization methods are heavily reliant on side information (e.g., seeds, user profiles, community labels) and are unrealistic due to the difficulty of collecting this side information. A few graph de-anonymization methods using only structural information, called seed-free methods, have been proposed recently; they mainly take advantage of local and handcrafted features of nodes while overlooking the global structural information of the graph for de-anonymization.
In this paper, a seed-free graph de-anonymization method is proposed, where a deep neural network is adopted to learn features and an adversarial framework is employed for node matching. To be specific, the latent representation of each node is obtained by a graph autoencoder. Furthermore, an adversarial learning model is proposed to transform the embedding of the anonymized graph to the latent space of the auxiliary graph embedding, such that a linear mapping can be derived from a global perspective. Finally, the most similar node pairs in the latent space are utilized as anchor nodes to launch propagation and de-anonymize all the remaining nodes. Extensive experiments on real-world datasets demonstrate that our method is comparable with seed-based approaches and significantly outperforms the state-of-the-art seed-free method.
Personalized recommender systems are important to assist user decision-making in the era of information overload. Meanwhile, explanations of the recommendations further help users to better understand the recommended items so as to make informed choices, which gives rise to the importance of explainable recommendation research. Textual sentence-based explanation has been an important form of explanation for recommender systems due to its advantage in communicating rich information to users. However, current approaches to generating sentence explanations are either limited to predefined sentence templates, which restricts sentence expressiveness, or opt for free-style sentence generation, which makes sentence quality control difficult. In an attempt to benefit both sentence expressiveness and quality, we propose a Neural Template (NETE) explanation generation framework, which brings the best of both worlds by learning sentence templates from data and generating template-controlled sentences that comment on specific features. Experimental results on real-world datasets show that NETE consistently outperforms state-of-the-art explanation generation approaches in terms of sentence quality and expressiveness. Further case-study analysis also shows the advantages of NETE in generating diverse and controllable explanations.
Online health communities (OHCs) provide a popular channel for users to seek information, suggestions, and support during their medical treatment and recovery processes. To help users find relevant information easily, we present CLIR, an effective system for recommending relevant discussion threads to users in OHCs. We identify that thread content and user interests can be categorized along two dimensions: topics and concepts. CLIR leverages a Latent Dirichlet Allocation model to summarize the topic dimension and uses a Convolutional Neural Network to encode the concept dimension. It then builds a thread neural network to capture thread characteristics and a user neural network to capture user interests by integrating these two dimensions and their interactions. Finally, it matches the target thread's characteristics with candidate users' interests to make recommendations. Experimental evaluation with multiple OHC datasets demonstrates the performance advantage of CLIR over state-of-the-art recommender systems on various evaluation metrics.
In this paper, we study the problem of trapping malicious web crawlers in social networks to minimize attacks from crawlers with malicious intent to steal personal/private information. The problem is to find where to place a given set of traps over a graph so as to minimize the expected number of users who possibly fall prey to a (possibly random) set of malicious crawlers, each of which traverses the graph in a random-walk fashion for a random finite time. We first show that this problem is NP-hard and also a monotone submodular maximization problem. We then present a greedy algorithm that achieves a $(1-1/e)$-approximation. We also develop an $(ε,δ)$-approximation Monte Carlo estimator to ease the computation of the greedy algorithm and thus make the algorithm scalable for large graphs. We finally present extensive simulation results to show that our algorithm significantly outperforms other baseline algorithms based on various centrality measures.
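The greedy step for monotone submodular maximization is standard and can be sketched generically. Here `expected_catch` stands in for an objective such as the paper's Monte Carlo estimate of expected users caught; the interface and names are assumptions for illustration, not the authors' code:

```python
def greedy_trap_placement(nodes, expected_catch, k):
    """Generic (1 - 1/e) greedy for monotone submodular maximization.

    expected_catch(S) is any monotone submodular set function supplied
    by the caller (e.g., an estimate of the expected number of users
    protected by placing traps at the node set S).
    """
    chosen = set()
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in nodes:
            if v in chosen:
                continue
            # Marginal gain of adding v to the current trap set.
            gain = expected_catch(chosen | {v}) - expected_catch(chosen)
            if gain > best_gain:
                best, best_gain = v, gain
        chosen.add(best)
    return chosen
```

For example, with a simple coverage objective (each node "catches" a set of crawler paths), the greedy run picks the nodes with the largest marginal coverage first.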
For better user satisfaction and business effectiveness, Click-Through Rate (CTR) prediction is one of the most important tasks in E-commerce. It is often the case that user interests that differ from past routines emerge, or that impressions such as promotional items burst, in a very short period. In essence, such changes relate to the item evolution problem, which has not been investigated by previous studies. The state-of-the-art methods in sequential recommendation, which use simple user behaviors, are incapable of modeling these changes sufficiently. This is because outdated interests may exist in the user behaviors, and the popularity of an item over time is not well represented. To address these limitations, we introduce time-aware item behaviors to address the recommendation of emerging preferences. The time-aware item behavior for an item is the set of users who interact with this item, together with timestamps. The rich interaction information of users for an item may help to model its evolution. In this work, we propose a CTR prediction model, TIEN, based on time-aware item behaviors. In TIEN, by leveraging the interaction time intervals, information about similar users within a short time interval helps identify the emerging interest of the target user. By using the sequential time intervals, the item's popularity over time can be captured in evolutionary item dynamics. Noisy users who interact with items accidentally are further eliminated, allowing robust personalized item dynamics to be learned. To the best of our knowledge, this is the first study of the item evolution problem for E-commerce CTR prediction. We conduct extensive experiments on five real-world CTR prediction datasets. The results show that the TIEN model consistently achieves remarkable improvements over state-of-the-art methods.
Neural ranking models have recently gained much attention in the Information Retrieval community and obtain good ranking performance. However, most of these retrieval models focus on capturing the textual matching signals between query and document and do not consider user behavior information that may be helpful for the retrieval task. Specifically, users' click and query reformulation behaviors can be represented by a click-through bipartite graph and a session-flow graph, respectively. Such graph representations contain rich user behavior information and may help us better understand users' search intent beyond the textual information. In this study, we aim to incorporate the rich information encoded in these two graphs into existing neural ranking models.
We present two graph-based neural ranking models (EmbRanker and AggRanker) to enrich learned text representations with graph information that captures users' rich interaction behavior. Experimental results on a large-scale publicly available benchmark dataset show that the two models outperform most existing neural ranking models that only consider textual information, which illustrates the effectiveness of integrating graph information with textual information. Further analyses show how graph information complements text matching signals and examine whether these two models can be adopted in practical applications.
Express systems are widely deployed in many major cities. One important task in such systems is to pick up packages from customers in time. As pick-up requests come in real time and there are many couriers picking up packages, how to dispatch couriers to ensure cooperation among them and to complete more pick-up tasks over a long time horizon is very important but challenging. In this paper, we propose a reinforcement learning based framework to learn courier dispatching policies. First, we divide the city into independent regions, within each of which a constant number of couriers pick up packages at the same time. Besides reducing problem complexity, city division has practical operational benefits. Afterwards, we focus on each region separately. For each region, we propose a Cooperative Multi-Agent Reinforcement Learning model, i.e., CMARL, to learn the optimal courier dispatching policy in it. CMARL tries to maximize the total number of pick-up tasks completed by all couriers over a long time horizon. Our model achieves this target by combining two Markov Decision Processes, one to guarantee the cooperation among couriers, and the other to ensure long-term optimization. After obtaining the value functions of these two MDPs, a new value function is designed to trade off between them, based on which we can infer the courier dispatching policy. Experiments based on real-world road network data and historical express data from Beijing are conducted to confirm the superiority of our model compared with nine baselines.
Distant supervision provides a means to create a large amount of weakly labeled data at low cost for relation classification. However, the resulting labeled instances are very noisy, containing data with wrong labels. Many approaches have been proposed to select a subset of reliable instances for neural model training, but they still suffer from the noisy labeling problem or underutilization of the weakly labeled data. To better select more reliable training instances, we introduce a small amount of manually labeled data as a reference to guide the selection process. In this paper, we propose a meta-learning based approach, which learns to reweight noisy training data under the guidance of the reference data. As the clean reference data is usually very small, we propose to augment it by dynamically distilling the most reliable elite instances from the noisy data. Experiments on several datasets demonstrate that the reference data can effectively guide the selection of training data, and our augmented approach consistently improves the performance of relation classification compared to existing state-of-the-art methods.
In most previous studies, the aspect-related text is considered an important clue for the Aspect-based Sentiment Analysis (ABSA) task, and thus various attention mechanisms have been proposed to leverage the interactions between aspects and context. However, it is observed that some sentiment expressions carry the same polarity regardless of the aspects they are associated with. In such cases, it is not necessary to incorporate aspect information for ABSA. Further observations on experimental results show that blindly leveraging interactions between aspects and context as features may introduce noise when analyzing those aspect-invariant sentiment expressions, especially when the aspect-related annotated data is insufficient. Hence, in this paper, we propose an Adversarial Multi-task Learning framework to identify aspect-invariant/dependent sentiment expressions without extra annotations. In addition, we adopt a gating mechanism to control the contribution of representations derived from aspect-invariant and aspect-dependent hidden states when generating the final contextual sentiment representations for the given aspect. This essentially allows the exploitation of aspect-invariant sentiment features for better ABSA results. Experimental results on two benchmark datasets show that extending existing neural models with our proposed framework achieves superior performance. In addition, the aspect-invariant data extracted by the proposed framework can be considered as pivot features for better transfer learning of ABSA models on unseen aspects.
Attributed network embedding (ANE) attempts to represent a network with short codes while retaining information about node topological structures and node attributes. A node's features and topological structure can be divided into different local aspects, and in many cases not all the information, but only the part contained in several local aspects, determines the relations among different nodes. To our knowledge, most existing works neither consider nor identify this aspect-level influence in network embedding. We attempt to use local embeddings to represent local aspect information and propose InfomaxANE, which encodes both global and local embeddings from the perspective of mutual information. The local aspect embeddings are forced to learn and extract different aspect information from nodes' features and topological structures by using an orthogonality constraint. A theoretical analysis is also provided to further confirm its correctness and rationality. Besides, to provide complete and refined information for the local encoders, we also optimize the feature aggregation in SAGE with different structures: feature similarities are considered, and the aggregator is separated from the encoder. InfomaxANE is evaluated on both node clustering and node classification tasks (including both transductive and inductive settings) on several benchmark datasets; the results show that InfomaxANE outperforms competitive baselines. We also verify the significance of each module of InfomaxANE in additional experiments.
Next Point-of-Interest (POI) recommendation is a longstanding problem across the domains of Location-Based Social Networks (LBSN) and transportation. Recent Recurrent Neural Network (RNN) based approaches learn POI-POI relationships in a local view based on independent user visit sequences. This limits the model's ability to directly connect and learn across users in a global view to recommend semantically trained POIs. In this work, we propose a Spatial-Temporal-Preference User Dimensional Graph Attention Network (STP-UDGAT), a novel explore-exploit model that concurrently exploits personalized user preferences and explores new POIs in global spatial-temporal-preference (STP) neighbourhoods, while allowing users to selectively learn from other users. In addition, we propose random walks as a masked self-attention option to leverage the STP graphs' structures and find new higher-order POI neighbours during exploration. Experimental results on six real-world datasets show that our model significantly outperforms baseline and state-of-the-art methods.
Recommendation Systems (RS) have become an essential part of many online services. Due to their pivotal role in guiding customers towards purchasing, there is a natural motivation for unscrupulous parties to spoof RS for profit. In this paper, we study the shilling attack: a persistent and profitable attack in which an adversarial party injects a number of user profiles to promote or demote a target item. Conventional shilling attack models are based on simple heuristics that can be easily detected, or directly adopt adversarial attack methods without a special design for RS. Moreover, a study of the attack impact on deep learning based RS is missing in the literature, making the effects of shilling attacks against real RS doubtful. We present a novel Augmented Shilling Attack framework (AUSH) and implement it with the idea of a Generative Adversarial Network. AUSH is capable of tailoring attacks against RS according to a budget and complex attack goals, such as targeting a specific user group. We experimentally show that the attack impact of AUSH is noticeable on a wide range of RS, including both classic and modern deep learning based RS, while it is virtually undetectable by the state-of-the-art attack detection model.
Prediction tasks about students, such as predicting students' academic performance, have practical real-world significance at both the student level and the college level. With the rapid construction of smart campuses, colleges not only offer residence and academic programs but also record students' daily life. These digital footprints provide an opportunity to offer better solutions for prediction tasks. In this paper, we propose a general deep neural network which can jointly model students' heterogeneous daily behaviors generated from digital footprints and social influence to deal with prediction tasks. To this end, we design a variant of LSTM and a novel attention mechanism to model the daily behavior sequence. The proposed LSTM is able to consider context information (e.g., weather conditions) while modeling the daily behavior sequence. The proposed attention mechanism can dynamically learn the different importance degrees of different days for every student. Based on behavior information, we propose an unsupervised way to construct a social network to model social influence. Moreover, we design a residual network based decoder to model the complex interactions between the features and get the predicted values, such as future academic performance. Qualitative and quantitative experiments on two real-world datasets collected from a college have demonstrated the effectiveness of our model.
Topic detection in social media is a challenging task due to the large scale and the short, noisy, and informal nature of messages. Most existing methods only consider textual content or simultaneously model the posts and the first-order structural characteristics of social networks. They ignore the impact of larger neighborhoods in microblog conversations on topics. Moreover, the simple combination of separate content and structure representations fails to capture their nonlinear correlation and their different importance in topic inference. To this end, we propose a novel random walk based Parallel Social Contexts Fusion Topic Model (PCFTM) for Weibo conversations. Firstly, a user-level conversation network with content information is built from the reposting and commenting relationships among users. Through random walks of different lengths on the network, we obtain user sequences containing the parallel content and structure contexts, which are used to acquire the flexible-order proximity of users. Then we propose a self-fusion network embedding to capture the nonlinear correlation between parallel social contexts. It is achieved by taking the content embedding sequence processed by a CNN as the initial value of the structure embedding sequence fed to a Bi-LSTM. Meanwhile, a user-level self-attention is further used to mine the different importance of users to topics. Lastly, the user sequence embedding is incorporated into neural variational inference for detecting topics, which adaptively balances the intrinsic complementarity between content and structure, and fully uses both local and global social contexts in topic inference. Extensive experiments on three real-world Weibo datasets demonstrate the effectiveness of our proposed model.
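The walk-sampling step that produces the user sequences can be sketched generically. The adjacency-list representation and parameters below are assumptions for illustration; PCFTM's actual walks additionally carry per-user content information:

```python
import random

def random_walk_contexts(adj, walk_length, walks_per_node, seed=0):
    """Sample fixed-length random walks from every node of a user-level
    conversation network given as an adjacency-list dict.

    Each returned walk is a user sequence that can serve as a parallel
    context for downstream content/structure encoders (generic sketch,
    not PCFTM itself).
    """
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Varying `walk_length` across runs yields the "walks of different lengths" that capture flexible-order proximity between users.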
Data sparsity is a challenging problem that most modern recommender systems are confronted with. By leveraging knowledge from relevant domains, cross-domain recommendation techniques can be an effective way of alleviating the data sparsity problem. In this paper, we propose a novel Bi-directional Transfer learning method for cross-domain recommendation using a Graph Collaborative Filtering network as the base model (BiTGCF). BiTGCF not only exploits the high-order connectivity in the user-item graph of a single domain through a novel feature propagation layer, but also realizes the two-way transfer of knowledge across two domains by using common users as the bridge. Moreover, distinct from previous cross-domain collaborative filtering methods, BiTGCF fuses users' common features and domain-specific features during transfer. Experimental results on four pairs of benchmark datasets verify the effectiveness of BiTGCF over state-of-the-art models on bi-directional cross-domain recommendation.
Recommender systems play a fundamental role in web applications by filtering massive information and matching user interests. While many efforts have been devoted to developing more effective models in various scenarios, exploration of the explainability of recommender systems lags behind. Explanations could help improve user experience and discover system defects. In this paper, after formally introducing the elements that are related to model explainability, we propose a novel explainable recommendation model by improving the transparency of the representation learning process. Specifically, to overcome the representation entangling problem in traditional models, we revise traditional graph convolution to discriminate information from different layers. Also, each representation vector is factorized into several segments, where each segment relates to one semantic aspect in the data. Different from previous work, in our model, factor discovery and representation learning are conducted simultaneously, and we are able to handle extra attribute information and knowledge. In this way, the proposed model can learn interpretable and meaningful representations for users and items. Unlike traditional methods that need to make a trade-off between explainability and effectiveness, the performance of our proposed explainable model is not negatively affected after considering explainability. Finally, comprehensive experiments are conducted to validate the performance of our model as well as its explanation faithfulness.
Multiple-domain fusion has been widely used for the urban anomaly forecasting problem, as urban anomalies such as traffic accidents or illegal assemblies are usually caused by many complex factors and affect many fields. Although many efforts have been devoted to fusing multiple datasets for anomaly detection, most of the work extracts spatio-temporal features one by one from multiple datasets and then fuses them to get the result or anomaly score. However, the correlation between data from multiple domains at each moment is ignored, which is especially important when detecting anomalies by analyzing the impacts from multiple datasets. In this paper, we propose a novel end-to-end deep learning based framework, namely a deep spatio-temporal multiple-domain fusion network, to collect the impacts of urban anomalies on multiple datasets and detect anomalies in each region of the city at the next time interval. We formulate the problem on a weighted graph and obtain spatio-temporal features with adaptive graph convolution and temporal convolution. In addition, a cross-domain convolution network is applied to fully capture the connections between multiple domains. We evaluate our method on a real-world dataset collected in New York City, and experiments show that our model outperforms state-of-the-art urban anomaly detection methods by nearly 10%.
To provide more accurate personalized product search (PPS) results, it is necessary to go beyond modeling user-query-item interactions. Graph embedding techniques open the potential to integrate node information and topological structure information. Existing graph embedding enhanced PPS methods are mostly based on entity-relation-entity graph learning. In this work, we propose to consider the structural relationships in users' product search scenarios with graph embedding by latent representation learning. We argue that explicitly modeling structural relationships in graph embedding is essential for more accurate PPS results. We propose a novel method, Graph embedding based Structural Relationship Representation Learning (GraphSRRL), which explicitly models the structural relationships in users-queries-products interactions. It combines three key conjunctive graph patterns to learn graph embeddings for better PPS. In addition, GraphSRRL facilitates the learning of affinities between users (resp. queries or products) through the designed geometric operations in a low-dimensional latent space. We conduct extensive experiments on four datasets to evaluate GraphSRRL for PPS. Experimental results show that GraphSRRL outperforms the state-of-the-art algorithm on real-world search datasets by at least 50.7% in terms of Hit@10 and 48.7% in terms of NDCG@10.
Re-ranking is a critical task for large-scale commercial recommender systems. Given the initial ranked lists, top candidates are re-ranked to improve the accuracy of the ranking results. However, existing re-ranking strategies are sub-optimal because (i) most prior works do not consider explicit item relationships, like being substitutable or complementary, which may mutually influence the user satisfaction on other items in the lists, and (ii) they usually apply an identical re-ranking strategy for all users, ignoring personalized user preferences and intents. To resolve these problems, we construct a heterogeneous graph to fuse the initial scoring information and item relationship information. We develop a graph neural network based framework, IRGPR, to explicitly model transitive item relationships by recursively aggregating relational information from multi-hop neighborhoods. We also incorporate a novel intent embedding network to embed personalized user intents into the propagation. We conduct extensive experiments on real-world datasets, demonstrating the effectiveness of IRGPR in re-ranking. Further analysis reveals that modeling the item relationships and personalized intents is particularly useful for improving the performance of re-ranking.
Commercial search engines generally maintain hundreds of thousands of machines equipped with large DRAM in order to process huge volumes of user queries with fast response times, which incurs high hardware cost since DRAM is very expensive. Recently, the NVM Optane SSD has been considered a promising underlying storage device due to its price advantage over DRAM and speed advantage over traditional slow block devices. However, to achieve efficiency comparable to an in-memory index, applying NVM to applications that are both latency- and I/O-bandwidth-critical, such as search engines, still faces non-trivial challenges, because NVM has much lower I/O speed and bandwidth than DRAM.
In this paper, we propose an NVM SSD-optimized query processing framework, aiming to address both the latency and bandwidth issues of using NVM in search engines. Our framework has three distinctive properties. First, we propose a pipelined query processing methodology that significantly reduces I/O waiting time through fine-grained overlapping of computation and I/O operations. Second, we propose a cache-aware query reordering algorithm that schedules queries sharing more data to be processed adjacently, so that I/O traffic is minimized. Third, we propose a data prefetching mechanism that reduces the extra thread waiting time due to data sharing and improves bandwidth utilization. Extensive experimental studies show that our framework significantly outperforms state-of-the-art baselines, obtaining processing latency and throughput comparable to DRAM while using much less space.
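As a rough illustration of the first property, the overlap of computation and I/O in a pipelined design can be sketched as follows. This is a generic prefetch pipeline, not the paper's actual implementation; `load` and `compute` are placeholder callables standing in for posting-list I/O and query evaluation:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_process(blocks, load, compute):
    """Overlap I/O with computation: while block i is being processed,
    block i+1 is already being loaded by a background I/O thread."""
    if not blocks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(load, blocks[0])        # start loading the first block
        for i in range(len(blocks)):
            data = future.result()                 # wait for the pending I/O
            if i + 1 < len(blocks):
                future = io.submit(load, blocks[i + 1])  # prefetch the next block
            results.append(compute(data))          # compute overlaps the prefetch
    return results
```

With slow `load` calls, total runtime approaches max(total I/O, total compute) rather than their sum, which is the essence of hiding I/O waiting time behind computation.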
Probabilistic graphical models, such as Markov random fields (MRF), exploit dependencies among random variables to model a rich family of joint probability distributions. Inference algorithms, such as belief propagation (BP), can effectively compute the marginal posteriors for decision making. Nonetheless, inferences involve sophisticated probability calculations and are difficult for humans to interpret. Among all existing explanation methods for MRFs, no method is designed for fair attributions of an inference outcome to elements on the MRF where the inference takes place. Shapley values provide rigorous attributions but so far have not been studied on MRFs. We thus define Shapley values for MRFs to capture both probabilistic and topological contributions of the variables on MRFs. We theoretically characterize the new definition regarding independence, equal contribution, additivity, and submodularity. As brute-force computation of the Shapley values is challenging, we propose GraphShapley, an approximation algorithm that exploits the decomposability of Shapley values, the structure of MRFs, and the iterative nature of BP inference to speed up the computation. In practice, we propose meta-explanations to explain the Shapley values and make them more accessible and trustworthy to human users. On four synthetic and nine real-world MRFs, we demonstrate that GraphShapley generates sensible and practical explanations.
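For orientation, classical Shapley values can be estimated by sampling random permutations and averaging marginal contributions. The sketch below is this generic Monte Carlo estimator, not GraphShapley itself, which additionally exploits the decomposability of Shapley values, MRF structure, and the iterative nature of BP:

```python
import random

def shapley_values(players, value_fn, n_samples=2000, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution over randomly ordered coalitions."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        perm = list(players)
        rng.shuffle(perm)
        coalition = frozenset()
        prev = value_fn(coalition)
        for p in perm:
            coalition = coalition | {p}
            cur = value_fn(coalition)
            phi[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: total / n_samples for p, total in phi.items()}

# In an additive game, each player's Shapley value equals its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
phi = shapley_values(list(weights), lambda S: sum(weights[p] for p in S))
```

The brute-force exact computation is exponential in the number of players, which is why approximation algorithms such as GraphShapley are needed in practice.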
Spam activities on multifarious online platforms, such as opinion spam and fake following relationships, have been extensively studied for years. Existing works separately employ hand-crafted features, mainly extracted from user behavior, text information, and relational networks, to detect a specific spamming phenomenon on a certain kind of online platform. Although these attempts have made some headway, rapidly emerging spamming categories and frequently changing cheating strategies leave such detection models with limited usability and fragile effectiveness.
This paper is the first attempt to develop a general and feature-free fraud detection model, tackling longstanding and thorny challenges in the spam detection area. To achieve this, we first transform the diverse relational networks that are contaminated by fraudsters into a unified matrix form. We then deal with the spam problem from a fresh perspective inspired by the pairwise learning approach from the area of recommender systems. By comparing pairwise ranking relations of all the entities in the unified matrix, a new pairwise loss objective function is formulated to identify instances that occupy higher rankings as inferior (spamming) results. To further boost detection performance, we incorporate the pairwise ranking detection method and the widely used structure-based algorithm into an integrated framework. Experiments on real-world datasets of different Web applications show significant improvements of our proposed framework over competitive methods.
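The pairwise-ranking idea borrowed from recommender systems can be illustrated with a BPR-style logistic loss; the paper's actual objective over the unified matrix differs, so treat this as a minimal sketch of the general principle:

```python
import math

def pairwise_ranking_loss(scores, pairs):
    """BPR-style pairwise logistic loss. For each (hi, lo) pair we want
    scores[hi] > scores[lo]; the per-pair loss is -log(sigmoid(s_hi - s_lo))."""
    total = 0.0
    for hi, lo in pairs:
        diff = scores[hi] - scores[lo]
        # -log(sigmoid(diff)) = log(1 + exp(-diff)), computed in a stable way
        total += math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)
    return total / len(pairs)
```

Minimizing this loss pushes the score of the preferred entity in each pair above the other, which is the mechanism used to surface spamming instances as those that rank anomalously high.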
Network embedding is an active research area due to the prevalence of network-structured data. While the state of the art often learns high-quality embedding vectors for high-degree nodes with abundant structural connectivity, the quality of the embedding vectors for low-degree or tail nodes is often suboptimal due to their limited structural connectivity. While many real-world networks are long-tailed, to date little effort has been devoted to tail node embedding. In this paper, we formulate the goal of learning tail node embeddings as a few-shot regression problem, given the few links on each tail node. In particular, since each node resides in its own local context, we personalize the regression model for each tail node. To reduce overfitting in the personalization, we propose a locality-aware meta-learning framework, called meta-tail2vec, which learns to learn the regression model for the tail nodes at different localities. Finally, we conduct extensive experiments and demonstrate the promising results of meta-tail2vec. (Supplemental materials including code and data are available at https://github.com/smufang/meta-tail2vec.)
Link prediction, which centers on whether or not a pair of nodes is likely to be connected, is a fundamental problem in complex network analysis. Network-embedding-based link prediction has shown strong performance and robustness in previous studies on complex networks, recommendation systems, and knowledge graphs. This approach has certain drawbacks, however; namely, the hierarchical structure of a subgraph is ignored and the importance of different nodes is not distinguished. In this study, we established the Subgraph Hierarchy Feature Fusion (SHFF) model for link prediction. To probe the existence of links between node pairs, the SHFF first extracts a subgraph around the two nodes and learns a function to map the subgraph to a vector for subsequent classification. This reveals any link between the two target nodes. The SHFF learns a function to obtain a representation of the extracted subgraph by hierarchically aggregating the features of nodes in that subgraph, which is accomplished by grouping nodes with similar structures and assigning different importance to the nodes during the feature fusion process. We compared the proposed model against other state-of-the-art link-prediction methods on a wide range of data sets to find that it consistently outperforms them.
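The first step described above, extracting a subgraph around a candidate node pair, is a standard h-hop enclosing-subgraph extraction; a minimal sketch of that step alone (SHFF's hierarchical feature fusion then operates on such subgraphs):

```python
def enclosing_subgraph(adj, u, v, h=1):
    """Extract the h-hop enclosing subgraph around a candidate link (u, v):
    all nodes within h hops of either endpoint, plus the induced edges.
    adj: dict mapping node -> set of neighbors (undirected graph)."""
    def khop(src):
        seen, frontier = {src}, {src}
        for _ in range(h):
            frontier = {w for n in frontier for w in adj[n]} - seen
            seen |= frontier
        return seen
    nodes = khop(u) | khop(v)
    # Induced edges, stored once per undirected pair (a < b).
    edges = {(a, b) for a in nodes for b in adj[a] if b in nodes and a < b}
    return nodes, edges

# A small path graph 1-2-3-4; the 1-hop subgraph around (1, 4) covers it all.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
nodes, edges = enclosing_subgraph(adj, 1, 4, h=1)
```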
In recent years, heterogeneous network representation learning has attracted considerable attention with the consideration of multiple node types. However, most methods ignore the rich set of network attributes (attributed networks) and the different types of relations (multiplex networks), and can hardly recognize the multi-modal contextual signals across different interactions. While a handful of network embedding techniques have been developed for attributed multiplex heterogeneous networks, they are significantly limited by scalability issues on large-scale network data, due to their heavy cost in both computation and memory. In this work, we propose a Fast Attributed Multiplex heterogeneous network Embedding framework (FAME) for large-scale network data, mapping the units from different modalities (i.e., network topological structures, various node features, and relations) into the same latent space in a very efficient way. FAME is an integrative architecture with scalable spectral transformation and sparse random projection, automatically preserving both attribute semantics and multi-type interactions in the learned embeddings. Extensive experiments on four real-world datasets with various network analytical tasks demonstrate that FAME achieves both effectiveness and significant efficiency gains over state-of-the-art baselines. The source code is available at: https://github.com/ZhijunLiu95/FAME.
Network embedding, which aims at learning low-dimensional representations of nodes in a network, has drawn much attention for various network mining tasks, ranging from link prediction to node classification. In addition to network topological information, there also exist rich attributes associated with the network structure, which exert large effects on network formation. Hence, many efforts have been devoted to tackling attributed network embedding tasks. However, they are limited by their assumption of static network data, as they account for neither evolving network structure nor changes in the associated attributes. Furthermore, scalability is a key factor when performing representation learning on large-scale networks with a huge number of nodes and edges. In this work, we address these challenges by developing DRLAN, a Dynamic Representation Learning framework for large-scale Attributed Networks. The DRLAN model generalizes dynamic attributed network embedding from two perspectives: First, we develop an integrative learning framework with an offline batch embedding module that preserves both the node and attribute proximities, and an online network embedding model that recursively updates the learned representation vectors. Second, we design a recursive pre-projection mechanism to efficiently model the attribute correlations based on the associative property of matrices. Finally, we perform extensive experiments on three real-world network datasets to show the superiority of DRLAN over state-of-the-art network embedding techniques in terms of both effectiveness and efficiency. The source code is available at: https://github.com/ZhijunLiu95/DRLAN.
Multiple-choice Machine Comprehension (MC) is an important and challenging natural language processing (NLP) task in which the machine is required to select the best answer from a candidate answer set, given a particular passage and question. Existing approaches either only utilize powerful pre-trained language models or rely on an overly complicated matching network designed to capture the relationships among the triplet of passage, question, and candidate answers. In this paper, we present a novel architecture, the Dual Head-wise Coattention network (DHC), a simple and efficient attention neural network designed for the multiple-choice MC task. Our proposed DHC not only supports a powerful pre-trained language model as the encoder, but also models the MC relationships directly as an attention mechanism, via a head-wise matching and aggregating method over multiple layers, which better models the relationships between question and passage and cooperates with large pre-trained language models more efficiently. To evaluate performance, we test our proposed model on five challenging and well-known multiple-choice MC datasets: RACE, DREAM, SemEval-2018 Task 11, OpenBookQA, and TOEFL. Extensive experimental results demonstrate that our proposal achieves a significant increase in accuracy over existing models on all five datasets, consistently outperforming all tested baselines, including state-of-the-art techniques. More remarkably, our proposal is a pluggable and flexible model that can be plugged into any BERT-based pre-trained language model. Ablation studies demonstrate its state-of-the-art performance and generalization.
Urban traffic flow forecasting is a critical issue in intelligent transportation systems. It is quite challenging due to the complicated spatiotemporal dependency and essential uncertainty brought about by dynamic urban traffic conditions. In most existing methods, the spatial correlation is captured by applying graph neural networks (GNNs) over a fixed graph based on local spatial proximity. However, urban road conditions are complex and changeable, meaning that the interactions between roads should also be dynamic over time. In addition, the global contextual information of roads is also crucial for accurate forecasting. In this paper, we exploit the spatiotemporal correlation of urban traffic flow and construct a dynamic weighted graph by seeking both spatial neighbors and semantic neighbors of road nodes. A multi-head self-attention temporal convolution network is utilized to capture local and long-range temporal dependencies across historical observations. Besides, we propose an adaptive graph gating mechanism to extract selective spatial dependencies within multi-layer stacking and correct information deviations caused by artificially defined spatial correlation. Extensive experiments on a real-world urban traffic dataset from the Didi Chuxing GAIA Initiative verify the effectiveness of our approach, and the multi-step forecasting performance of our proposed models outperforms state-of-the-art baselines. The source code of our model is publicly available at https://github.com/RobinLu1209/STAG-GCN.
Nonnegative matrix factorization (NMF) is a popular approach to modeling data; however, most models are unable to flexibly take into account multiple matrices across sources and time, or apply only to integer-valued data. We introduce a probabilistic, Gaussian-process-based, more inclusive NMF-based model which jointly analyzes nonnegative data, such as word-content data from multiple text sources, in a temporally dynamic manner. The model collectively models observed matrix data, source-wise latent variables, and their dependencies and temporal evolution with a full-fledged hierarchical approach, including flexible nonparametric temporal dynamics. Experiments on simulated and real data show the model outperforms comparable models. A case study on social media and news demonstrates that the model discovers semantically meaningful topical factors and their evolution.
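For readers new to NMF: the classical, non-probabilistic version that the paper generalizes factorizes a nonnegative matrix V into nonnegative factors W and H, e.g. with Lee-Seung multiplicative updates. This is a minimal sketch of that baseline, not the paper's GP-based hierarchical model:

```python
import numpy as np

def nmf(V, k, n_iter=1000, eps=1e-9, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates: V ~= W @ H,
    with all factors kept nonnegative by construction."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

V = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # a rank-1 nonnegative matrix
W, H = nmf(V, k=1)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout, which is what gives NMF its parts-based interpretability.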
Learning of classification models from real-world data often requires substantial human effort devoted to instance annotation. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To address this problem we explore a new type of human feedback: region-based feedback. Briefly, a region is defined as a hypercubic subspace of the input data space and represents a subpopulation of data instances; the region's label is a human assessment of the class proportion of that subpopulation. By using learning-from-label-proportions algorithms, one can learn instance-based classifiers from such labeled regions. In general, the key challenge is that there are infinitely many regions one can define and query in a given data space. To minimize the number and complexity of region-based queries, we propose and develop a hierarchical active learning solution that aims at incrementally building a concise hierarchy of regions. Furthermore, to avoid building a possibly class-irrelevant region hierarchy, we propose to grow multiple different hierarchies in parallel and expand the more informative ones. Through experiments on numerous data sets, we demonstrate that methods using region-based feedback can learn very good classifiers from very few and simple queries, and hence are highly effective in reducing the human annotation effort needed for building classification models.
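A region-based query as described above can be simulated directly: the "label" of a hypercubic region is the class proportion of the instances it contains. A minimal sketch with hypothetical inputs (in the paper, this assessment comes from a human annotator rather than from ground-truth labels):

```python
def region_label_proportion(X, y, low, high):
    """Simulate a region-based query. The region is the hypercube with
    per-dimension bounds [low[d], high[d]]; the returned 'label' is the
    proportion of class-1 instances falling inside it (None if empty)."""
    inside = [yi for xi, yi in zip(X, y)
              if all(l <= v <= h for v, l, h in zip(xi, low, high))]
    return sum(inside) / len(inside) if inside else None
```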
Graph classification aims to extract accurate information from graph-structured data for classification and is becoming increasingly important in the graph learning community. Although Graph Neural Networks (GNNs) have been successfully applied to graph classification tasks, most of them overlook the scarcity of labeled graph data in many applications. For example, in bioinformatics, obtaining protein graph labels usually requires laborious experiments. Recently, few-shot learning has been explored to alleviate this problem with only a few labeled graph samples of test classes. The sub-structures shared between training classes and test classes are essential in few-shot graph classification. Existing methods assume that the test classes belong to the same set of super-classes clustered from the training classes. However, according to our observations, the label spaces of training classes and test classes usually do not overlap in real-world scenarios. As a result, existing methods do not capture the local structures of unseen test classes well. To overcome this limitation, in this paper, we propose a direct method to capture the sub-structures with a well-initialized meta-learner within a few adaptation steps. More specifically, (1) we propose a novel framework consisting of a graph meta-learner, which uses GNN-based modules for fast adaptation on graph data, and a step controller for the robustness and generalization of the meta-learner; (2) we provide quantitative analysis for the framework and give a graph-dependent upper bound on the generalization error based on our framework; (3) extensive experiments on real-world datasets demonstrate that our framework achieves state-of-the-art results on several few-shot graph classification tasks compared to baselines.
Modern data arrive continuously in a rapid and time-varying stream, which tends to generate unstable associations in the data structure. However, most existing methods focus on static data and cannot fully incorporate streaming data into the structure construction. To address this issue, we propose an online unsupervised Feature Selection method via Multi-Cluster structure Preservation (FSMCP for short). FSMCP weighs all features by minimizing the differences between the multi-cluster structures in the original and the selected feature space. The structure integrates three levels of associations: individual-level, aggregation-level, and streaming-level. To provide informative features in time, FSMCP checks and updates the associations as soon as new instances arrive. In comparison with the baseline methods, FSMCP is more efficient than offline methods while still providing similar or even better feature subsets. It outperforms existing online methods with an average NMI improvement of 10.33%.
Personalized search aims to improve search quality by re-ranking the candidate document list based on the user's historical behavior. Existing approaches focus on modeling the order information of the user's search history with sequential methods such as Recurrent Neural Networks (RNNs). However, these methods usually ignore the fine-grained time information associated with user actions. In fact, the time intervals between queries can help to capture the evolution of users' query intent and document interest. Besides, the time intervals between past actions and the current query can reflect the re-finding tendency more accurately than discrete steps in an RNN. In this paper, we propose PSTIE, a fine-grained Time Information Enhanced model that constructs more accurate user interest representations for Personalized Search. To capture the short-term interest of users, we design time-aware LSTM architectures for modeling the subtle interest evolution of users in continuous time. We further leverage time in calculating the re-finding possibility of users to capture long-term user interest. We propose two methods to incorporate the time-enhanced user interest into personalized ranking. Experiments on two datasets show that PSTIE can effectively improve ranking quality over state-of-the-art models.
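One simple way to let continuous time intervals, rather than discrete RNN steps, shape a user-interest representation is exponential decay of past actions. The sketch below illustrates that general idea only; it is not PSTIE's time-aware LSTM:

```python
import math

def time_decayed_interest(actions, now, half_life):
    """Aggregate per-item interest where each past action is weighted by
    exp(-lam * (now - t)); lam is chosen so the weight halves every
    `half_life` time units. actions: list of (item, timestamp) pairs."""
    lam = math.log(2) / half_life
    interest = {}
    for item, t in actions:
        interest[item] = interest.get(item, 0.0) + math.exp(-lam * (now - t))
    return interest
```

Under this weighting, a query issued long ago contributes less to the current interest profile than a recent one, even if both occupy adjacent "steps" in the raw action sequence.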
Research activity spanning more than five decades has led to index organizations, compression schemes, and traversal algorithms that allow extremely rapid response to ranked queries against very large text collections. However, little attention has been paid to the interactions between these many components, and the additivity of algorithmic improvements has not been explored. Here we examine the extent to which efficiency improvements add up. We employ four query processing algorithms, four compression codecs, and all possible combinations of four distinct further optimizations, and compare the performance of the 256 resulting systems to determine when and how different optimizations interact. Our results over two test collections show that efficiency enhancements are, for the most part, additive, and that there is little risk of negative interactions. In addition, our detailed profiling across this large pool of systems leads to key insights as to why the various individual enhancements work well, and indicates that optimizing "simpler" implementations can result in higher query throughput than is available from non-optimized versions of the more "complex" techniques, with clear implications for the choices needing to be made by practitioners.
Entity alignment aims to identify equivalent entity pairs from different Knowledge Graphs (KGs), which is essential in integrating multi-source KGs. Recently, with the introduction of GNNs into entity alignment, the architectures of recent models have become increasingly complicated. We even find two counter-intuitive phenomena within these methods: (1) the standard linear transformation in GNNs does not work well; (2) many advanced KG embedding models designed for the link prediction task perform poorly in entity alignment. In this paper, we abstract existing entity alignment methods into a unified framework, Shape-Builder & Alignment, which not only successfully explains the above phenomena but also derives two key criteria for an ideal transformation operation. Furthermore, we propose a novel GNN-based method, Relational Reflection Entity Alignment (RREA). RREA leverages Relational Reflection Transformation to obtain relation-specific embeddings for each entity in a more efficient way. The experimental results on real-world datasets show that our model significantly outperforms the state-of-the-art methods, exceeding them by 5.8%-10.9% on Hits@1.
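A reflection transformation of the standard Householder form M = I - 2rr^T (for a unit vector r) is orthogonal and therefore norm-preserving, which are the kinds of properties an ideal transformation operation for entity alignment needs. A minimal NumPy sketch under that standard form, with hypothetical vectors:

```python
import numpy as np

def reflection_matrix(r):
    """Householder reflection M = I - 2 r r^T for (normalized) r.
    M is orthogonal, so applying it preserves embedding norms, and
    applying it twice returns the original vector."""
    r = r / np.linalg.norm(r)
    return np.eye(len(r)) - 2.0 * np.outer(r, r)

r = np.array([1.0, 2.0, 2.0])   # relation vector (hypothetical values)
M = reflection_matrix(r)
x = np.array([0.5, -1.0, 3.0])  # entity embedding (hypothetical values)
```

Because M is built from the relation vector alone, each relation gets its own transformation without learning a full dense matrix per relation.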
Signed networks are mathematical structures that encode positive and negative relations between entities, such as friend/foe or trust/distrust. Recently, several papers have studied the construction of useful low-dimensional representations (embeddings) of these networks for the prediction of missing relations or signs. Existing network embedding methods for sign prediction, however, generally enforce different notions of status or balance theory in their optimization function. These theories are often inaccurate or incomplete, which negatively impacts method performance.
In this context, we introduce conditional signed network embedding (CSNE). Our novel probabilistic approach models structural information about the signs in the network separately from fine-grained detail. Structural information is represented in the form of a prior, while the embedding itself is used for capturing fine-grained information. These components are then integrated in a rigorous manner. CSNE's accuracy depends on the existence of sufficiently powerful structural priors for modelling signed networks, currently unavailable in the literature. Thus, as a second main contribution, which we find to be highly valuable in its own right, we also introduce a novel approach to construct priors based on the Maximum Entropy (MaxEnt) principle. These priors can model the polarity of nodes (the degree to which their links are positive) as well as signed triangle counts (a measure of the degree to which structural balance holds in a network).
Experiments on a variety of real-world networks confirm that CSNE outperforms the state-of-the-art on the task of sign prediction. Moreover, the MaxEnt priors on their own, while less accurate than full CSNE, achieve accuracies competitive with the state-of-the-art at very limited computational cost, thus providing an excellent runtime-accuracy trade-off in resource-constrained situations.
The task of generating incorrect options for multiple-choice questions is termed the distractor generation problem. The task requires high cognitive skill and is extremely challenging to automate. Existing neural approaches for the task leverage encoder-decoder architectures to generate long distractors. However, in this process two critical points are ignored. Firstly, many methods use Jaccard similarity over a pool of candidate distractors to sample the distractors, which often makes the generated distractors too obvious or irrelevant to the question context. Secondly, some approaches do not consider the answer in the model, which causes the generated distractors to be either answer-revealing or semantically equivalent to the answer.
In this paper, we propose a novel Hierarchical Multi-Decoder Network (HMD-Net) consisting of one encoder and three decoders, where each decoder generates a single distractor. To overcome the first problem mentioned above, we include multiple decoders with a dissimilarity loss in the loss function. To address the second problem, we exploit richer interaction between the article, question, and answer with a SoftSel operation and a Gated Mechanism. This enables the generation of distractors that are in context with the question but not semantically equivalent to the answer. The proposed model significantly outperforms all previous approaches in both automatic and manual evaluations. In addition, we also incorporate linguistic features and BERT contextual embeddings into our base model, which further improves performance.
Recent advances in text-related tasks on the Web, such as text (topic) classification and sentiment analysis, have been made possible by exploiting mostly the "rule of more": more data (massive amounts), more computing power, more complex solutions. We propose a shift in the paradigm to do "more with less" by focusing, to the maximum extent, just on the task at hand (e.g., classifying a single test instance). Accordingly, we propose MetaLazy, a new supervised lazy text classification meta-strategy that greatly extends the scope of lazy solutions. Lazy classifiers postpone the creation of a classification model until a given test instance for decision making is provided. MetaLazy exploits new ideas and solutions, which have in common their lazy nature, producing altogether a solution for text classification that is simpler, more efficient, and less data-demanding than recent alternatives. It extends and evolves the lazy creation of the model for the test instance by allowing: (i) dynamic choice of the best classifier for the task; (ii) the exploration of distances in the neighborhood of the test document when learning a classification model, thus diminishing the importance of irrelevant training instances; and (iii) a better representational space for training and test documents by augmenting them, in a lazy fashion, with new co-occurrence-based features considering just those observed in the specific test instance. In a sizeable experimental evaluation, considering topic and sentiment analysis datasets and nine baselines, we show that our MetaLazy instantiations are among the top performers in most situations, even when compared to state-of-the-art deep learning classifiers such as deep Transformer network architectures.
Whenever countries are threatened by a pandemic, as is the case with the COVID-19 virus, governments need help to take the right actions to safeguard public health as well as to mitigate the negative effects on the economy. A restrictive approach can seriously damage the economy. Conversely, a relaxed one may put a high percentage of the population at risk. Other investigations in this area focus on modelling the spread of the virus or estimating the impact of the different measures on its propagation. In this paper, by contrast, we propose a new methodology for helping governments plan the phases to combat the pandemic based on their priorities. To this end, we implement the SEIR epidemiological model to represent the evolution of the COVID-19 virus in the population.
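The standard SEIR dynamics referenced above can be integrated numerically; below is a minimal forward-Euler sketch with hypothetical parameter values (the paper's calibration for COVID-19 will differ):

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of the SEIR compartmental model.
    beta: transmission rate, sigma: incubation rate (1/latent period),
    gamma: recovery rate; s, e, i, r are population fractions summing to 1."""
    s, e, i, r = s0, e0, i0, r0
    history = [(s, e, i, r)]
    for _ in range(int(days / dt)):
        ds = -beta * s * i
        de = beta * s * i - sigma * e
        di = sigma * e - gamma * i
        dr = gamma * i
        s, e, i, r = s + ds * dt, e + de * dt, i + di * dt, r + dr * dt
        history.append((s, e, i, r))
    return history

hist = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1,
                     s0=0.99, e0=0.01, i0=0.0, r0=0.0, days=100)
```

A government action such as confinement can then be modeled as a change to `beta` during the affected phase, which is what makes the model usable as an environment for sequential decision-making.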
To find the best sequences of actions governments can take, we propose a methodology with two approaches: one based on Deep Q-Learning and another based on Genetic Algorithms. The sequences of actions (confinement, self-isolation, two-meter distancing, or no restrictions) are evaluated according to a reward system focused on meeting two objectives: firstly, keeping the number of infected people low so that hospitals are not overwhelmed, and secondly, avoiding drastic measures that could cause serious damage to the economy. The conducted experiments evaluate our methodology based on the rewards accumulated during the established period. The experiments also show that it is a valid tool for governments to reduce the negative effects of a pandemic by optimizing the planning of the phases. According to our results, the approach based on Deep Q-Learning outperforms the one based on Genetic Algorithms.
Hate speech detection on online social networks has become one of the emerging hot topics in recent years. With its broad spread and fast propagation across online social networks, hate speech significantly impacts society by increasing prejudice and hurting people. It has therefore aroused attention and concern from both industry and academia. In this paper, we address the hate speech problem and propose a novel detection framework called SWE2, which relies only on the content of messages and automatically identifies hate speech. In particular, our framework exploits both word-level semantic information and sub-word knowledge. It is intuitively persuasive and also performs well both with and without character-level adversarial attacks. Experimental results show that our proposed model achieves 0.975 accuracy and 0.953 macro F1, outperforming 7 state-of-the-art baselines under no adversarial attack. Our model also performs robustly under extreme adversarial attack (manipulation of 50% of messages), achieving 0.967 accuracy and 0.934 macro F1.
Learning in the positive-unlabeled (PU) setting is prevalent in real-world applications. Many previous works depend upon the Selected Completely At Random (SCAR) assumption to utilize unlabeled data, but the SCAR assumption often does not hold in the real world due to selection bias in label observations. This paper presents the first generative PU learning model without the SCAR assumption. Specifically, we derive the PU risk function without the SCAR assumption, and we generate a set of virtual PU examples to train the classifier. Although our PU risk function is more generalizable, it requires PU instances that do not exist in the observations. Therefore, we introduce VAE-PU, a variant of variational autoencoders that separates two latent variables generating either features or observation indicators. The separated latent information enables the model to generate virtual PU instances. We test VAE-PU on benchmark datasets with and without the SCAR assumption. The results indicate that VAE-PU is superior when selection bias exists, and it is also competent under the SCAR assumption. The results also emphasize that VAE-PU is effective when there are few positive-labeled instances, owing to its modeling of selection bias.
We propose Factual News Graph (FANG), a novel graphical social context representation and learning framework for fake news detection. Unlike previous contextual models that have targeted performance, our focus is on representation learning. Compared to transductive models, FANG is scalable in training as it does not have to maintain all nodes, and it is efficient at inference time, without the need to re-process the entire graph. Our experimental results show that FANG is better at capturing the social context into a high fidelity representation, compared to recent graphical and non-graphical models. In particular, FANG yields significant improvements for the task of fake news detection, and it is robust in the case of limited training data. We further demonstrate that the representations learned by FANG generalize to related tasks, such as predicting the factuality of reporting of a news medium.
While neural network models have shown impressive performance in many NLP tasks, their lack of interpretability is often seen as a disadvantage. Individual relevance scores assigned by post-hoc explanation methods are not sufficient to reveal deeper systematic preferences and potential biases of the model that apply consistently across examples. In this paper we apply rule mining using knowledge graphs in combination with neural network explanation methods to uncover such systematic preferences of trained neural models and capture them in the form of conjunctive rules. We test our approach on text classification tasks and show that such rules are able to explain a substantial part of the model behaviour, as well as indicate potential causes of misclassifications when the model is applied outside of its initial training context.
Recent years have witnessed a drastic increase in the number of urban metro passengers, which inevitably causes overcrowding in the metro systems of many cities. Clearly, an accurate prediction of passenger flows at metro stations is critical for a variety of metro system management operations, such as line scheduling and staff preallocation, that help alleviate such overcrowding. Thus, in this paper, we aim to address the problem of accurately predicting metro station passenger (MSP) flows. Like other traffic data, such as road traffic volume and highway speed, MSP flows are spatial-temporal in nature. However, existing methods for other traffic prediction tasks are usually suboptimal for predicting MSP flows due to their unique spatial-temporal characteristics. We therefore propose a novel deep learning framework, STP-TrellisNets, which for the first time augments the newly-emerged temporal convolutional framework TrellisNet for spatial-temporal prediction. The temporal module of STP-TrellisNets (named CP-TrellisNets) employs two TrellisNets in series to jointly capture the short- and long-term temporal correlation of MSP flows. In parallel to CP-TrellisNets, its spatial module (named GC-TrellisNet) adopts a novel transfer-flow-based metric to characterize the spatial correlation among MSP flows, and implements multiple diffusion graph convolutional networks (DGCNs) in time-series order with their outputs connected to a TrellisNet to capture the dynamics of such spatial correlation. GC-TrellisNet thus essentially integrates TrellisNet with graph convolution, and empowers TrellisNet with the ability to capture dynamic graph-structured correlation. We conduct extensive experiments with two large-scale real-world automated fare collection datasets, containing about 1.5 billion records from Shenzhen, China and 70 million records from Hangzhou, China, respectively. The experimental results demonstrate that STP-TrellisNets outperforms the state-of-the-art baselines.
Session-based recommendation is a challenging task. Without access to a user's historical user-item interactions, the information available in an ongoing session may be very limited. Previous work on session-based recommendation has considered sequences of items that users have interacted with sequentially. Such item sequences may not fully capture the complex transition relationships between items that go beyond inspection order. Thus, graph neural network (GNN) based models have been proposed to capture these transition relationships. However, GNNs typically propagate information from adjacent items only, thus neglecting items without direct connections. Importantly, GNN-based approaches often face serious overfitting problems. We propose Star Graph Neural Networks with Highway Networks (SGNN-HN) for session-based recommendation. The proposed SGNN-HN applies a star graph neural network (SGNN) to model the complex transition relationships between items in an ongoing session. To avoid overfitting, we employ highway networks (HN) to adaptively select embeddings from item representations. Finally, we aggregate the item embeddings generated by the SGNN in an ongoing session to represent a user's final preference for item prediction. Experiments on two public benchmark datasets show that SGNN-HN can outperform state-of-the-art models in terms of P@20 and MRR@20 for session-based recommendation.
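The session-to-graph step can be sketched as follows: unique items become nodes, consecutive clicks become directed edges, and a virtual star node links to every item so that non-adjacent items can exchange information. This is a minimal illustration of the idea, with assumed names and without the neural propagation itself:

```python
from collections import defaultdict

def build_session_graph(session):
    """Build a directed item-transition graph from a session sequence,
    plus a virtual star node linked to every item (SGNN-style sketch)."""
    nodes = list(dict.fromkeys(session))          # unique items, order kept
    edges = defaultdict(int)
    for a, b in zip(session, session[1:]):        # consecutive transitions
        edges[(a, b)] += 1
    star = "*"                                    # virtual star node
    star_edges = [(star, v) for v in nodes] + [(v, star) for v in nodes]
    return nodes, dict(edges), star_edges

nodes, edges, star_edges = build_session_graph(["i1", "i2", "i3", "i2", "i4"])
```

Items `i1` and `i4` have no direct edge, but both connect to the star node, which is how information can flow between them in one propagation step.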
The world has transitioned into a new phase of online learning in response to the recent COVID-19 pandemic. Now more than ever, it has become paramount to push the limits of online learning in every manner to keep the education system flourishing. One crucial component of online learning is Knowledge Tracing (KT). The aim of KT is to model a student's knowledge level based on their answers to a sequence of exercises, referred to as interactions. Students acquire their skills while solving exercises, and each such interaction has a distinct impact on the student's ability to solve a future exercise. This impact is characterized by 1) the relations between the exercises involved in the interactions and 2) the student's forget behavior. Traditional studies on knowledge tracing do not explicitly model both components jointly to estimate the impact of these interactions. In this paper, we propose a novel Relation-aware self-attention model for Knowledge Tracing (RKT). We introduce a relation-aware self-attention layer that incorporates contextual information. This contextual information integrates the exercise relation information, through exercises' textual content as well as student performance data, and the forget behavior information, through an exponentially decaying kernel function. Extensive experiments on three real-world datasets, among which two new collections are released to the public, show that our model outperforms state-of-the-art knowledge tracing methods. Furthermore, the interpretable attention weights help visualize the relations between interactions and temporal patterns in the human learning process.
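The forget-behavior component can be sketched with an exponentially decaying kernel over elapsed time, combined element-wise with relation coefficients before renormalizing attention weights. This is a minimal illustration of the idea only; the decay rate and the exact combination used in RKT are not given in the abstract, so all names and the multiplicative combination here are assumptions:

```python
import numpy as np

def forget_weights(timestamps, now, decay=0.1):
    """Exponentially decaying kernel: older interactions matter less."""
    dt = now - np.asarray(timestamps, dtype=float)
    return np.exp(-decay * dt)

def relation_aware_attention(scores, rel, timestamps, now, decay=0.1):
    """Modulate raw attention scores with exercise-relation coefficients
    and the forget kernel, then renormalize (illustrative combination)."""
    w = np.asarray(scores) * np.asarray(rel) * forget_weights(timestamps, now, decay)
    return w / w.sum()

# Three past interactions at times 0, 5, 10; predicting at time 10.
att = relation_aware_attention([1.0, 1.0, 1.0], [0.5, 1.0, 1.0], [0, 5, 10], now=10)
```

With equal raw scores, the most recent and most related interaction ends up with the largest weight, matching the intuition that recency and relatedness jointly determine impact.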
Predicting road traffic speed is a challenging task due to different types of roads, abrupt speed changes, and spatial dependencies between roads; it requires modeling the dynamically changing spatial dependencies among roads and the temporal patterns over long input sequences. This paper proposes a novel spatio-temporal graph attention network (ST-GRAT) that effectively captures the spatio-temporal dynamics in road networks. The novel aspects of our approach are spatial attention, temporal attention, and spatial sentinel vectors. The spatial attention takes graph structure information (e.g., distance between roads) into account and dynamically adjusts spatial correlation based on road states. The temporal attention is responsible for capturing traffic speed changes, while the sentinel vectors allow the model to retrieve new features from spatially correlated nodes or preserve existing features. The experimental results show that ST-GRAT outperforms existing models, especially in difficult conditions where traffic speeds change rapidly (e.g., rush hours). We additionally provide a qualitative study to analyze when and where ST-GRAT tends to make accurate predictions during rush hours.
Hierarchically structured data are commonly represented as trees and have given rise to popular data formats like XML or JSON. An interesting query computes the difference between two versions of a tree, expressed as the minimum set of node edits (deletion, insertion, label rename) that transform one tree into another, commonly known as the tree edit distance. Unfortunately, the fastest tree edit distance algorithms run in cubic time and quadratic space and are therefore not feasible for large inputs. In this paper, we leverage the fact that the difference between two versions of a tree is typically much smaller than the overall tree size. We propose a new tree edit distance algorithm that is linear in the tree size for similar trees. Our algorithm is based on the new concept of top node pairs and avoids redundant distance computations, the main issue with previous solutions for tree diffs. We empirically evaluate the runtime of our algorithm on large synthetic and real-world trees; our algorithm clearly outperforms the state of the art, often by orders of magnitude.
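The node edits and the cubic-time baseline mentioned above can be illustrated with the classic forest recurrence on which such algorithms are built. The sketch below is a memoized version with unit costs on tiny tuple-encoded trees; it is not the paper's linear-time algorithm, only the textbook recurrence it improves upon:

```python
from functools import lru_cache

def size(forest):
    """Number of nodes in a forest of (label, children) tuples."""
    return sum(1 + size(children) for _label, children in forest)

@lru_cache(maxsize=None)
def ted(F, G):
    """Ordered tree edit distance over forests, with unit costs for
    deletion, insertion, and label rename (classic forest recurrence)."""
    if not F and not G:
        return 0
    if not F:
        return size(G)                 # insert everything remaining in G
    if not G:
        return size(F)                 # delete everything remaining in F
    (v, vc), (w, wc) = F[-1], G[-1]    # rightmost trees and their roots
    return min(
        ted(F[:-1] + vc, G) + 1,                       # delete v
        ted(F, G[:-1] + wc) + 1,                       # insert w
        ted(F[:-1], G[:-1]) + ted(vc, wc) + (v != w),  # match/rename v -> w
    )

t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()), ("d", ())))
dist = ted((t1,), (t2,))   # one rename: c -> d
```

When the two trees are similar, almost all edges match, which is precisely the redundancy the paper's top-node-pair approach exploits to reach linear time.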
Research on data dependencies has experienced a revival as dependency violations can reveal errors in data. Several data cleaning systems use a DBMS to detect such violations. While DBMSs are efficient for some kinds of data dependencies (e.g., unique constraints), they are likely to fall short of satisfactory performance for more complex ones, such as order dependencies.
We present a novel system to efficiently detect violations of denial constraints (DCs), a well-known formalism that generalizes many kinds of data dependencies. We describe its execution model, which operates on a compressed block of tuples at a time, and we present various algorithms that take advantage of the predicate form of DCs to provide effective code patterns. Our experimental evaluation includes comparisons with DBMS-based and DC-specific approaches, real-world and synthetic data, and various kinds of DCs. It shows that our system is up to three orders of magnitude faster than the other solutions, especially for datasets with a large number of tuples and DCs that identify a large number of violations.
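A denial constraint forbids any pair of tuples that jointly satisfies all of its predicates. The naive quadratic baseline that such a system must beat can be sketched as follows (the predicate encoding and example constraint are illustrative, not the system's actual representation, which works block-at-a-time on compressed tuples):

```python
from operator import gt, lt

def dc_violations(tuples, predicates):
    """Return pairs (i, j) violating a denial constraint of the form
    'no two tuples t, s may satisfy ALL predicates'.
    Each predicate is (attr_of_t, comparison_op, attr_of_s).
    Naive O(n^2) sketch."""
    out = []
    for i, t in enumerate(tuples):
        for j, s in enumerate(tuples):
            if i != j and all(op(t[a], s[b]) for a, op, b in predicates):
                out.append((i, j))
    return out

# Example DC: no employee may earn a higher salary yet pay lower tax
# than another employee.
rows = [{"salary": 90, "tax": 10},
        {"salary": 50, "tax": 20},
        {"salary": 70, "tax": 15}]
viol = dc_violations(rows, [("salary", gt, "salary"), ("tax", lt, "tax")])
```

Every pair reported by `viol` is evidence of an error under the constraint; the quadratic pair enumeration is exactly where predicate-aware code patterns and compression pay off on large datasets.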
Identifying tweets related to infrastructure damage during a crisis event is an important problem. However, the unavailability of labeled data during the early stages of a crisis event poses a major challenge in training suitable models. Several domain adaptation strategies have been proposed for text classification that can be used to train models on available source data from previous crisis events and apply them to target data related to a current event. However, these approaches are insufficient to handle the distribution drift between the source and target data, along with the class imbalance in the target data. In this paper we introduce an Ensemble learning approach with a Decoupled Adversarial (EnDeA) model to classify infrastructure damage tweets in a target tweet dataset. EnDeA is an ensemble of three different models, two of which separately learn the event-invariant and event-specific features of target data from a set of source and target data. The third, an adversarial model, helps to improve the prediction accuracy of the other two. Unlike existing approaches that also identify the domain-invariant and domain-specific properties of target data for sentiment classification, our method works for short texts and can better handle the distribution drift and class imbalance problems. We rigorously investigate the performance of the proposed approach using multiple public datasets and compare it with several state-of-the-art baselines. We find that EnDeA outperforms these baselines with around 20% improvement in F1 scores.
Network alignment is useful for multiple applications that require increasingly large graphs to be processed. Existing research approaches this as an optimization problem or computes the similarity based on node representations. However, the process of aligning every pair of nodes between relatively large networks is time-consuming and resource-intensive. In this paper, we propose a framework, called G-CREWE (Graph CompREssion With Embedding), to solve the network alignment problem. G-CREWE uses node embeddings to align the networks on two levels of resolution, a fine resolution given by the original network and a coarse resolution given by a compressed version, to achieve an efficient and effective network alignment. The framework first extracts node features and learns the node embeddings via a Graph Convolutional Network (GCN). These node embeddings then guide the process of graph compression and ultimately improve the alignment performance. As part of G-CREWE, we also propose a new compression mechanism called MERGE (Minimum DEgRee NeiGhbors ComprEssion) to reduce the size of the input networks while preserving the consistency of their topological structure. Experiments on real networks show that our method is more than twice as fast as the most competitive existing methods while maintaining high accuracy.
Search results returned by search engines need to be diversified in order to satisfy the different information needs of different users. Several supervised learning models have been proposed for diversifying search results in recent years. Most existing supervised methods greedily compare each candidate document with the selected document sequence and select the next locally optimal document. However, the information utilities of the candidate documents are not independent of each other; research has shown that selecting a candidate document affects the utilities of the other candidates. As a result, locally optimal document rankings do not lead to globally optimal rankings. In this paper, we propose a new supervised diversification framework to address this issue. Based on a self-attention encoder-decoder structure, the model can take the whole candidate document sequence as input and simultaneously leverage both the novelty and the subtopic coverage of the candidate documents. We call this framework Diversity Encoder with Self-Attention (DESA). Compared with existing supervised methods, this framework can model the interactions between all candidate documents and return their diversification scores based on the whole candidate document sequence. Experimental results show that our proposed framework outperforms existing methods. These results confirm the effectiveness of globally modeling all candidate documents for overall novelty and subtopic coverage, instead of comparing each single candidate document with the already-selected sequence.
To promote cost-effective task assignment in Spatial Crowdsourcing (SC), workers are required to report their locations to servers, which raises serious privacy concerns. As a solution, geo-obfuscation has been widely used to protect the location privacy of SC workers, who are allowed to report a perturbed location instead of their true location. Yet, most existing geo-obfuscation methods consider workers' mobility on a two-dimensional (2D) plane, wherein workers can move in arbitrary directions. Unfortunately, 2D-based geo-obfuscation is likely to generate high traveling costs for task assignment over roads, as it cannot accurately estimate the traveling cost distortion caused by location obfuscation. In this paper, we tackle the SC worker location privacy problem over road networks. Considering the network-constrained mobility of workers, we describe workers' mobility by a weighted directed graph, which accounts for the dynamic traffic conditions and the road network topology. Based on this graph model, we design a geo-obfuscation (GO) function for workers that maximizes the workers' overall location privacy without compromising the task assignment efficiency. We formulate the problem of deriving the optimal GO function as a linear programming (LP) problem. By exploiting the angular block structure of the LP's constraint matrix, we apply Dantzig-Wolfe decomposition to improve the time-efficiency of the GO function generation. Our experimental results in a real-trace driven simulation and a real-world experiment demonstrate the effectiveness of our approach in terms of both privacy and task assignment efficiency.
Knowledge Graph Question Answering aims to automatically answer natural language questions via the well-structured relation information between entities stored in knowledge graphs. When faced with a complex question with compositional semantics, query graph generation is a practical semantic parsing-based method. However, existing works rely on heuristic rules with limited coverage, making them impractical for more complex questions. This paper proposes a Director-Actor-Critic framework to overcome these challenges. Through options over a Markov Decision Process, query graph generation is formulated as a hierarchical decision problem. The Director determines which types of triples the query graph needs, the Actor generates the corresponding triples by choosing nodes and edges, and the Critic calculates the semantic similarity between the generated triples and the given question. Moreover, to train from weak supervision, we base the framework on hierarchical Reinforcement Learning with intrinsic motivation. To accelerate the training process, we pre-train the Critic with high-reward trajectories generated by hand-crafted rules, and leverage curriculum learning to gradually increase the complexity of questions during query graph generation. Extensive experiments conducted over widely-used benchmark datasets demonstrate the effectiveness of the proposed framework.
Electronic health records (EHR) are often generated and collected across a large number of patients featuring distinctive medical conditions and clinical progress over a long period of time, which results in records that are unaligned along the time dimension. EHR data are also prone to missing and erroneous values due to various practical reasons. Recently, PARAFAC2 has been re-popularized for successfully extracting meaningful medical concepts (phenotypes) from such temporal EHR by irregular tensor factorization. Despite recent advances, existing PARAFAC2 methods are unable to robustly handle the erroneous and missing data that are prevalent in clinical practice. We propose REPAIR, a Robust tEmporal PARAFAC2 method for IRregular tensor factorization and completion, to complete an irregular tensor and extract phenotypes in the presence of missing and erroneous values. To achieve this, REPAIR designs a new, effective low-rank regularization function for PARAFAC2 to handle missing and erroneous entries, which has not been explored for irregular tensors before. In addition, the optimization of REPAIR allows it to enjoy the same computational scalability and to incorporate the same variety of constraints as the state-of-the-art PARAFAC2 method for efficient and meaningful phenotype extraction. We evaluate REPAIR on two real temporal EHR datasets to verify the robustness of its tensor factorization under various missing-data and outlier conditions. Furthermore, we conduct two case studies to demonstrate that REPAIR is able to extract meaningful and useful phenotypes from such corrupted temporal EHR. Our implementation is publicly available at https://github.com/Emory-AIMS/Repair.
Misinformation is an ever-increasing problem that is difficult for the research community to solve and that has a negative impact on society at large. Very recently, the problem has been addressed with a crowdsourcing-based approach to scale up labeling efforts: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of (non-expert) judges is exploited. We follow the same approach to study whether crowdsourcing is an effective and reliable method to assess statement truthfulness during a pandemic. We specifically target statements related to the COVID-19 health emergency, which is still ongoing at the time of the study and has arguably caused an increase in the amount of misinformation spreading online (a phenomenon for which the term "infodemic" has been used). By doing so, we are able to address (mis)information that is both related to a sensitive and personal issue like health and very recent relative to when the judgment is made: two issues that have not been analyzed in related work.
In our experiment, crowd workers are asked to assess the truthfulness of statements, as well as to provide evidence for their assessments in the form of a URL and a text justification. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we also report results on many different aspects, including: agreement among workers, and the effects of different aggregation functions, of scale transformations, and of workers' background and bias. We also analyze workers' behavior in terms of queries submitted, URLs found and selected, text justifications, and other behavioral data such as clicks and mouse actions collected by means of an ad hoc logger.
Most existing algorithms for cross-modal Information Retrieval are based on a supervised train-test setup, where a model learns to align the mode of the query (e.g., text) to the mode of the documents (e.g., images) from a given training set. Such a setup assumes that the training set contains an exhaustive representation of all possible classes of queries. In reality, a retrieval model may need to be deployed on previously unseen classes, which implies a zero-shot IR setup. In this paper, we propose a novel GAN-based model for zero-shot text to image retrieval. When given a textual description as the query, our model can retrieve relevant images in a zero-shot setup. The proposed model is trained using an Expectation-Maximization framework. Experiments on multiple benchmark datasets show that our proposed model comfortably outperforms several state-of-the-art zero-shot text to image retrieval models, as well as zero-shot classification and hashing models suitably used for retrieval.
In this paper, we propose a flexible notion of characteristic functions defined on graph vertices to describe the distribution of vertex features at multiple scales. We introduce FEATHER, a computationally efficient algorithm to calculate a specific variant of these characteristic functions where the probability weights of the characteristic function are defined as the transition probabilities of random walks. We argue that features extracted by this procedure are useful for node level machine learning tasks. We discuss the pooling of these node representations, resulting in compact descriptors of graphs that can serve as features for graph classification algorithms. We analytically prove that FEATHER describes isomorphic graphs with the same representation and exhibits robustness to data corruption. Using the node feature characteristic functions we define parametric models where evaluation points of the functions are learned parameters of supervised classifiers. Experiments on real world large datasets show that our proposed algorithm creates high quality representations, performs transfer learning efficiently, exhibits robustness to hyperparameter changes and scales linearly with the input size.
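The specific variant described above can be sketched directly: for each node, the feature at scale r and evaluation point theta is the expectation of cos(theta * x) and sin(theta * x) under the node's r-step random-walk transition distribution. The sketch below assumes a row-normalized adjacency matrix as the transition matrix and a single scalar node feature; function and parameter names are illustrative, not FEATHER's actual API:

```python
import numpy as np

def feather_features(A, x, thetas, r_max=2):
    """Characteristic-function node features with random-walk weights:
    for each node u, scale r, and evaluation point theta, compute
    sum_v P^r[u, v] * (cos(theta * x_v), sin(theta * x_v))."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)      # row-normalized transition matrix
    feats, Pr = [], np.eye(len(A))
    for _ in range(r_max):
        Pr = Pr @ P                           # r-step transition probabilities
        for theta in thetas:
            feats.append(Pr @ np.cos(theta * x))
            feats.append(Pr @ np.sin(theta * x))
    return np.column_stack(feats)             # one feature row per node

A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]         # small toy graph
x = np.array([0.5, 1.0, 1.5])                 # one scalar feature per node
F = feather_features(A, x, thetas=[0.5, 1.0])
```

Because each feature is a convex combination of cosines and sines, every entry stays within [-1, 1], and mean-pooling the rows gives a compact graph-level descriptor as described above.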
Temporal point process (TPP) models have hitherto been moderately good at nowcasting hashtag popularity, but very poor at forecasting it, due to insufficient modeling of Twitter microdynamics. Recent studies have shown that the highly fluctuating nature of hashtag popularity dynamics is due to the influence of two external factors: (i) hashtag-tweet reinforcement and (ii) inter-hashtag competition. In this paper, we propose a marked TPP based on Generative Adversarial Networks (GANs) which can seamlessly incorporate the assistive information necessary to capture the above effects and successfully forecast distant popularity trends. To achieve this, we employ a unique linear semi-autoregressive model for mark generation and couple the time and mark generative aspects. On seven diverse datasets crawled from Twitter covering several real-world events, our model yields remarkably stable performance in predicting hashtag popularity in diverse situations and offers a substantial improvement over the existing state-of-the-art generative models.
This paper studies privacy-aware inverted index design and document retrieval for multi-keyword document search in a trusted hardware execution environment such as Intel SGX. Previous work uses time-consuming oblivious computing techniques to avoid the leakage of memory access patterns for privacy preservation in such an environment. This paper proposes an efficiency-enhanced design that obfuscates the inverted index structure with posting bucketing and document ID masking, which aims to hide document-term associations and avoid access pattern leakage. The paper then describes privacy-aware oblivious document retrieval during online query processing based on such an index. Both privacy and efficiency analyses are provided, followed by evaluation results comparing the proposed designs with multiple baselines.
In mobile crowdsourcing (MCS), the platform selects participants to complete location-aware tasks from the recruiters aiming to achieve multiple goals (e.g., profit maximization, energy efficiency, and fairness). However, different MCS systems have different goals and there are possibly conflicting goals even in one MCS system. Therefore, it is crucial to design a participant selection algorithm that applies to different MCS systems to achieve multiple goals. To deal with this issue, we formulate the participant selection problem as a reinforcement learning problem and propose to solve it with a novel method, which we call auxiliary-task based deep reinforcement learning (ADRL). We use transformers to extract representations from the context of the MCS system and a pointer network to deal with the combinatorial optimization problem. To improve the sample efficiency, we adopt an auxiliary-task training process that trains the network to predict the imminent tasks from the recruiters, which facilitates the embedding learning of the deep learning model. Additionally, we release a simulated environment on a specific MCS task, the ride-sharing task, and conduct extensive performance evaluations in this environment. The experimental results demonstrate that ADRL outperforms and improves sample efficiency over other well-recognized baselines in various settings.
Recent years have witnessed the success of deep neural networks in many research areas. The fundamental idea behind the design of most neural networks is to learn similarity patterns from data for prediction and inference, which lacks the ability of cognitive reasoning. However, a concrete ability to reason is critical for many theoretical and practical problems. On the other hand, traditional symbolic reasoning methods do well at making logical inferences, but they are mostly hard rule-based, which limits their ability to generalize across tasks, since different tasks may require different rules. Both reasoning and generalization ability are important for prediction tasks such as recommender systems, where reasoning provides a strong connection between user history and target items for accurate prediction, and generalization helps the model draw a robust user portrait from noisy inputs.
In this paper, we propose the Logic-Integrated Neural Network (LINN) to integrate the power of deep learning and logical reasoning. LINN is a dynamic neural architecture that builds its computational graph according to the input logical expression. It learns basic logical operations such as AND, OR, and NOT as neural modules, and conducts propositional logical reasoning through the network for inference. Experiments on a theoretical task show that LINN performs well at solving logical equations for unknown variables. Furthermore, we test our approach on the practical task of recommendation by formulating it as a logical inference problem. Experiments show that LINN significantly outperforms state-of-the-art recommendation models in Top-K recommendation, which verifies the potential of LINN in practice.
Many learning tasks involve multi-modal data streams, where continuous data from different modes convey a comprehensive description about objects. A major challenge in this context is how to efficiently interpret multi-modal information in complex environments. This has motivated numerous studies on learning unsupervised representations from multi-modal data streams. These studies aim to understand higher-level contextual information (e.g., a Twitter message) by jointly learning embeddings for the lower-level semantic units in different modalities (e.g., text, user, and location of a Twitter message). However, these methods directly associate each low-level semantic unit with a continuous embedding vector, which results in high memory requirements. Hence, deploying and continuously learning such models in low-memory devices (e.g., mobile devices) becomes a problem. To address this problem, we present METEOR, a novel MEmory and Time Efficient Online Representation learning technique, which: (1) learns compact representations for multi-modal data by sharing parameters within semantically meaningful groups and preserves the domain-agnostic semantics; (2) can be accelerated using parallel processes to accommodate different stream rates while capturing the temporal changes of the units; and (3) can be easily extended to capture implicit/explicit external knowledge related to multi-modal data streams. We evaluate METEOR using two types of multi-modal data streams (i.e., social media streams and shopping transaction streams) to demonstrate its ability to adapt to different domains. Our results show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to the conventional memory-intensive embeddings.
The accuracy of deep neural networks is significantly affected by how well mini-batches are constructed during the training step. In this paper, we propose a novel adaptive batch selection algorithm called Recency Bias that exploits uncertain samples predicted inconsistently in recent iterations. The historical label predictions of each training sample are used to evaluate its predictive uncertainty within a sliding window. Then, the sampling probability for the next mini-batch is assigned to each training sample in proportion to its predictive uncertainty. By taking advantage of this design, Recency Bias not only accelerates the training step but also yields a more accurate network. We demonstrate the superiority of Recency Bias through extensive evaluation on two independent tasks. Compared with existing batch selection methods, Recency Bias reduced the test error by up to 20.97% in a fixed wall-clock training time. At the same time, it reduced the training time needed to reach the same test error by up to 59.32%.
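The selection rule can be sketched as follows: a sample whose predicted label keeps flipping within the recent window is uncertain, so it receives a higher sampling probability for the next mini-batch. Here uncertainty is measured as the entropy of the prediction histogram, which is one natural choice; the paper's exact quantization and weighting may differ, and all names below are illustrative:

```python
import numpy as np
from collections import Counter

def predictive_uncertainty(pred_history, window=5):
    """Entropy of the label predictions within the recent sliding window:
    samples whose predicted label keeps flipping get high uncertainty."""
    recent = pred_history[-window:]
    counts = np.array(list(Counter(recent).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def batch_sampling_probs(histories, window=5, eps=1e-3):
    """Sampling probability proportional to each sample's uncertainty
    (eps keeps consistently-predicted samples selectable)."""
    u = np.array([predictive_uncertainty(h, window) for h in histories]) + eps
    return u / u.sum()

histories = [[0, 0, 0, 0, 0],      # consistently predicted -> low probability
             [0, 1, 0, 1, 0],      # inconsistent -> high probability
             [1, 1, 1, 0, 1]]
probs = batch_sampling_probs(histories)
```

Restricting the histogram to a recent window is what makes the bias "recency"-based: a sample that was hard early in training but is now predicted consistently stops being oversampled.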
Machine reading comprehension (MRC) has become a core component in a variety of natural language processing (NLP) applications such as question answering and dialogue systems. In practice, an MRC model often needs to learn in non-stationary environments, in which the underlying data distribution changes over time. A typical scenario is domain drift, i.e., different domains of data arrive one after another, and the MRC model is required to adapt to each new domain while maintaining previously learned abilities. To tackle this challenge, in this work we introduce the Continual Domain Adaptation (CDA) task for MRC. As far as we know, this is the first study of MRC from a continual learning perspective. We build two benchmark datasets for the CDA task by re-organizing existing MRC collections into different domains with respect to context type and question type, respectively. We then analyze and observe the catastrophic forgetting (CF) phenomenon of MRC under the CDA setting. To tackle the CDA task, we propose several BERT-based continual learning MRC models using either a regularization-based methodology or a dynamic-architecture paradigm. We analyze the performance of the different continual learning MRC models on the CDA task and show that the proposed dynamic-architecture based model achieves the best performance.
Recommender systems have shown great potential to solve the information explosion problem and enhance user experience in various online applications. To tackle the data sparsity and cold start problems in recommender systems, researchers have proposed knowledge graph (KG)-based recommendation, which leverages valuable external knowledge as auxiliary information. However, most of these works ignore the variety of data types (e.g., texts and images) in multi-modal knowledge graphs (MMKGs). In this paper, we propose the Multi-modal Knowledge Graph Attention Network (MKGAT) to better enhance recommender systems by leveraging multi-modal knowledge. Specifically, we propose a multi-modal graph attention technique to conduct information propagation over MMKGs, and then use the resulting aggregated embedding representation for recommendation. To the best of our knowledge, this is the first work that incorporates multi-modal knowledge graphs into recommender systems. We conduct extensive experiments on two real datasets from different domains, the results of which demonstrate that our model MKGAT can successfully employ MMKGs to improve the quality of recommendation.
Anomaly detection in multilayer graphs is becoming increasingly critical in many application scenarios, e.g., identifying crime hotspots in urban areas by discovering suspicious and illicit behaviors in social networks. However, identifying anomalies in a single layer of such a graph is a big challenge when anomaly features are insufficient. Most existing anomaly detection methods determine whether a node is abnormal by examining observable anomalous feature values. However, these methods are not suitable for scenarios in which abnormal features are scarce, e.g., geometric graphs or non-public data in social network services. In this paper, to detect anomalies in a graph with insufficient anomalous features, we propose a pioneering approach, ASD-FT (Anomaly Subgraph Detection with Feature Transfer), based on a strategy of transferring anomalous features between different layers of a multilayer graph. The proposed ASD-FT detects anomaly subgraphs in the graph of the target layer by analyzing the anomalous features in the graph of another layer. We demonstrate the effectiveness and robustness of ASD-FT with extensive experiments on five real-world datasets.
Data aggregation is a key problem in wireless sensor networks (WSNs). To secure the aggregation results, researchers have proposed adopting homomorphic encryption. Since aggregation is conducted in the ciphertext space without decryption, both confidentiality and integrity can be protected against untrusted or compromised aggregators. However, such techniques cannot protect against untrusted or compromised sources, i.e., wireless sensors, as homomorphic encryption requires all sources to share a common encryption key. Since wireless sensor networks are often vulnerable to physical or network attacks, new secure aggregation schemes that can protect against compromised sources are needed. This paper proposes Onion Homomorphic Encryption-based Aggregation (OHEA), in which sources form groups with their own dedicated encryption keys, a.k.a. group keys. OHEA has the useful property that group keys themselves can be aggregated, so it works recursively with any number of levels in the aggregation hierarchy. Our security analysis shows that even if multiple aggregators or sources are compromised, an adversary is still unable to compromise the data of other nodes at the same or higher levels of the hierarchy. Furthermore, experimental results show that OHEA incurs low computation and communication costs, and is thus scalable to large WSNs.
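As a toy illustration of the key property that group keys themselves aggregate, consider a simple additive masking scheme (a deliberately simplified stand-in for the homomorphic encryption OHEA actually uses; the modulus and key handling here are illustrative only, not the paper's construction):

```python
M = 2**32  # illustrative modulus

def encrypt(value, key):
    return (value + key) % M

def aggregate(ciphertexts):
    # aggregation happens entirely in the ciphertext space
    return sum(ciphertexts) % M

def aggregate_keys(keys):
    # the group key is itself an aggregate, so the scheme nests recursively
    return sum(keys) % M

def decrypt(agg_cipher, group_key):
    return (agg_cipher - group_key) % M
```

Because `aggregate_keys` has the same form as `aggregate`, groups of groups can be formed at every level of the hierarchy, which mirrors the recursive property claimed above.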
Graph Convolutional Networks (GCNs) show promising results for semi-supervised learning tasks on graphs and have thus become favorable compared with other approaches. Despite their remarkable success, GCNs are difficult to train with insufficient supervision. When labeled data are limited, the performance of GCNs becomes unsatisfactory for low-degree nodes. While some prior work analyzes the successes and failures of GCNs at the level of the entire model, profiling GCNs at the level of individual nodes remains underexplored.
In this paper, we analyze GCNs with respect to the node degree distribution. From empirical observation to theoretical proof, we confirm that GCNs are biased towards nodes with larger degrees, achieving higher accuracy on them, even though high-degree nodes are underrepresented in most graphs. We further develop a novel Self-Supervised-Learning Degree-Specific GCN (SL-DSGCN) that mitigates the degree-related biases of GCNs from both the model and the data aspects. First, we propose a degree-specific GCN layer that captures both discrepancies and similarities of nodes with different degrees, which reduces the model-aspect bias of GCNs caused by sharing the same parameters across all nodes. Second, we design a self-supervised learning algorithm that creates pseudo labels with uncertainty scores on unlabeled nodes using a Bayesian neural network. Pseudo labels increase the chance of connecting to labeled neighbors for low-degree nodes, thus reducing the bias of GCNs from the data perspective. The uncertainty scores are further exploited to weight pseudo labels dynamically in stochastic gradient descent for SL-DSGCN. Experiments on three benchmark datasets show that SL-DSGCN not only outperforms state-of-the-art self-training/self-supervised-learning GCN methods, but also improves GCN accuracy dramatically for low-degree nodes.
False information detection on social media is challenging because it commonly requires tedious evidence collection, while comparative information is often unavailable. Clues mined from user comments, as the wisdom of crowds, could be of considerable benefit to this task. However, it is non-trivial to capture the complex semantics of the contents and comments while taking their implicit correlations into account. Although deep neural networks have good expressive power, one major drawback is their lack of explainability. In this paper, we focus on how to learn from post contents and related comments in social media to understand and detect false information more effectively, with explainability. We thus propose a Quantum-probability based Signed Attention Network (QSAN) that integrates quantum-driven text encoding and a novel signed attention mechanism in a unified framework. QSAN is not only able to distinguish important comments from the others, but can also exploit conflicting social viewpoints in the comments to facilitate detection. Moreover, QSAN is advantageous in terms of explainability and transparency, owing to the quantum-physical meaning of its components and the attention weights. Extensive experiments on real-world datasets show that our approach outperforms state-of-the-art baselines and can provide different kinds of user comments to explain why a piece of information is detected as false.
Quaternion space brings several benefits over the traditional Euclidean space: quaternions (i) consist of one real and three imaginary components, encouraging richer representations; (ii) utilize the Hamilton product, which better encodes the inter-latent interactions across quaternion components; and (iii) result in models with fewer degrees of freedom that are less prone to overfitting. Unfortunately, most current recommender systems rely on real-valued representations in Euclidean space to model either users' long-term or short-term interests. In this paper, we fully utilize quaternion space to model both users' long-term and short-term preferences. We first propose a QUaternion-based self-Attentive Long term user Encoding (QUALE) to study users' long-term intents. Then, we propose a QUaternion-based self-Attentive Short term user Encoding (QUASE) to learn users' short-term interests. To enhance our models' capability, we fuse QUALE and QUASE into one model, namely QUALSE, using a quaternion-based gating mechanism. We further develop quaternion-based adversarial learning along with Bayesian Personalized Ranking (QABPR) to improve our model's robustness. Extensive experiments on six real-world datasets show that our fused QUALSE model outperformed 11 state-of-the-art baselines, improving HIT@1 by 8.43% and NDCG@1 by 10.27% on average compared with the best baseline.
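For reference, the Hamilton product underlying benefit (ii) multiplies two quaternions a + bi + cj + dk as follows (a plain-Python sketch, independent of the paper's models):

```python
def hamilton_product(p, q):
    a1, b1, c1, d1 = p  # real part and i, j, k components
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,   # real
            a1*b2 + b1*a2 + c1*d2 - d1*c2,   # i
            a1*c2 - b1*d2 + c1*a2 + d1*b2,   # j
            a1*d2 + b1*c2 - c1*b2 + d1*a2)   # k
```

Because every output component mixes all four input components, a quaternion weight couples the latent dimensions more richly than an elementwise real-valued product, which is the intuition behind benefit (ii). Note also that the product is non-commutative (ij = k but ji = -k).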
E-Commerce marketplaces support millions of daily transactions, and some disagreements between buyers and sellers are unavoidable. Resolving disputes in an accurate, fast, and fair manner is of great importance for maintaining a trustworthy platform. Simple cases can be automated, but intricate cases are not sufficiently addressed by hard-coded rules, and therefore most disputes are currently resolved by people. In this work we take a first step towards automatically assisting human agents in dispute resolution at scale. We construct a large dataset of disputes from the eBay online marketplace, and identify several interesting behavioral and linguistic patterns. We then train classifiers to predict dispute outcomes with high accuracy. We explore the model and the dataset, reporting interesting correlations, important features, and insights.
Besides position bias, which has been well studied, trust bias is another type of bias prevalent in user interactions with rankings: users are more likely to click incorrectly w.r.t. their preferences on highly ranked items because they trust the ranking system. While previous work has observed this behavior in users, we prove that existing Counterfactual Learning to Rank (CLTR) methods do not remove this bias, including methods specifically designed to mitigate it. Moreover, we prove that Inverse Propensity Scoring (IPS) is in principle unable to correct for trust bias under non-trivial circumstances. Our main contribution is a new estimator based on affine corrections: it both reweights clicks and penalizes items displayed at ranks with high trust bias. Our estimator is the first that is proven to remove the effect of both trust bias and position bias. Furthermore, we show that our estimator is a generalization of the existing CLTR framework: if no trust bias is present, it reduces to the original IPS estimator. Our semi-synthetic experiments indicate that by removing the effect of trust bias in addition to position bias, CLTR can approximate the optimal ranking system even more closely than previously possible.
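The affine idea can be sketched as follows, writing eps_pos[k] for the probability of a click on a relevant item at rank k and eps_neg[k] for the probability of an (incorrect, trust-driven) click on a non-relevant item at rank k; these parameters are assumed known, and the notation and function names are ours rather than the paper's:

```python
def affine_correct(click, rank, eps_pos, eps_neg):
    """Debias a click signal: subtract the trust-bias offset for the rank,
    then rescale; in expectation this recovers the relevance probability."""
    alpha = eps_pos[rank] - eps_neg[rank]  # reweighting factor
    beta = eps_neg[rank]                   # penalty for highly trusted ranks
    return (click - beta) / alpha
```

In expectation the corrected signal equals the true relevance probability r: the expected click rate is eps_pos*r + eps_neg*(1-r), and substituting it into the formula yields exactly r.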
To facilitate advanced analytics, data science projects increasingly require records about individuals to be linked across databases. Generally, no unique entity identifiers are available in the databases to be linked, and therefore quasi-identifiers such as names, addresses, and dates of birth are used to link records. The process of linking records without revealing any sensitive or confidential information about the entities represented by these records is known as privacy-preserving record linkage (PPRL). Various encoding- and encryption-based PPRL methods have been developed over the past two decades. Most existing PPRL methods calculate approximate similarities between records because errors and variations can occur in quasi-identifying attribute values. Despite being used in real-world linkage applications, certain PPRL methods, such as the popular Bloom filter encoding, have been shown to be vulnerable to cryptanalysis attacks. In this paper we present a novel attack on PPRL methods that exploits the approximate similarities calculated between encoded records. Our attack matches nodes in a similarity graph generated from an encoded database with a corresponding similarity graph generated from a plain-text database to re-identify sensitive values. Our attack is not limited to any specific PPRL method, and in an experimental evaluation we apply it to three PPRL encoding methods using three different databases. This evaluation shows that our attack can successfully re-identify sensitive values from these encodings with high accuracy where no previous attack on PPRL would have been successful.
We study max-sum clustering in a semi-supervised setting. Our objective function maximizes the pairwise within-cluster similarity with respect to some null hypothesis regarding the similarity. This is a natural objective that does not require any additional parameters, and is a generalization of the well-known modularity objective function. We show that for such an objective function in a semi-supervised setting we can compute an additive approximation of the optimal solution in the general case, and a constant-factor approximation when the optimal objective value is large. The supervision that we consider is in the form of cluster assignment queries and same-cluster queries; we also study the setting where the query responses are noisy. Our algorithm also generalizes to the min-sum objective function, for which we can achieve similar performance guarantees. We present computational experiments to show that our framework is effective for clustering text data - we are able to find clusterings that are close to the queried clustering and have a good objective value.
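A minimal sketch of the objective may help (our own rendering, not the paper's exact formulation): `sim` is the pairwise similarity matrix and `null` the expected similarity under the null hypothesis, e.g. the degree-based expectation used by modularity, and the objective sums the excess similarity over within-cluster pairs.

```python
def max_sum_objective(sim, null, clusters):
    """Sum of (similarity minus null expectation) over within-cluster pairs."""
    total = 0.0
    for cluster in clusters:
        members = sorted(cluster)
        for idx, i in enumerate(members):
            for j in members[idx + 1:]:
                total += sim[i][j] - null[i][j]
    return total
```

With the degree-based null model this reduces to (unnormalized) modularity, which is the sense in which the objective generalizes it.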
A great variety of complex systems ranging from user interactions in communication networks to transactions in financial markets can be modeled as temporal graphs, which consist of a set of vertices and a series of timestamped and directed edges. Temporal motifs in temporal graphs are generalized from subgraph patterns in static graphs which take into account edge orderings and durations in addition to structures. Counting the number of occurrences of temporal motifs is a fundamental problem for temporal network analysis. However, existing methods either cannot support temporal motifs or suffer from performance issues. In this paper, we focus on approximate temporal motif counting via random sampling. We first propose a generic edge sampling (ES) algorithm for estimating the number of instances of any temporal motif. Furthermore, we devise an improved EWS algorithm that hybridizes edge sampling with wedge sampling for counting temporal motifs with 3 vertices and 3 edges. We provide comprehensive analyses of the theoretical bounds and complexities of our proposed algorithms. Finally, we conduct extensive experiments on several real-world datasets, and the results show that our ES and EWS algorithms have higher efficiency, better accuracy, and greater scalability than the state-of-the-art sampling method for temporal motif counting.
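The flavor of the generic ES estimator can be conveyed with a toy version for a 2-edge motif, a "temporal wedge" u→v, v→w whose second edge follows the first within a window δ (the actual ES algorithm handles arbitrary motifs and provides variance bounds; everything here, including the motif choice, is an illustrative simplification):

```python
import random

def wedges_starting_at(edges, idx, delta):
    """Exact count of wedges whose first edge is edges[idx]; edge = (u, v, t)."""
    _, v, t = edges[idx]
    return sum(1 for (a, _, s) in edges if a == v and t < s <= t + delta)

def estimate_wedges(edges, delta, p, seed=0):
    """Sample each edge with probability p, count the motifs it starts,
    and scale by 1/p (a Horvitz-Thompson estimate, unbiased in expectation)."""
    rng = random.Random(seed)
    total = 0
    for idx in range(len(edges)):
        if rng.random() < p:
            total += wedges_starting_at(edges, idx, delta)
    return total / p
```

Setting p = 1 recovers exact counting; smaller p trades accuracy for speed, which is the efficiency/accuracy trade-off the experiments above evaluate.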
Graph neural networks (GNNs) have achieved strong performance in various applications. In the real world, network data is usually formed in a streaming fashion. The distributions of patterns that describe the neighborhood information of nodes may shift over time, and the GNN model needs to learn new patterns that it cannot yet capture. But incremental learning leads to the catastrophic forgetting problem, in which historical knowledge is overwritten by newly learned knowledge. Therefore, it is important to train a GNN model to learn new patterns and maintain existing patterns simultaneously, a problem few works have focused on. In this paper, we propose a streaming GNN model based on continual learning, so that the model is trained incrementally and up-to-date node representations can be obtained at each time step. First, we design an approximation algorithm to detect newly arriving patterns efficiently based on information propagation. Second, we combine the two perspectives of data replaying and model regularization to consolidate existing patterns. Specifically, a hierarchy-importance sampling strategy for nodes is designed and a weighted regularization term for GNN parameters is derived, achieving greater stability and generalization of knowledge consolidation. Our model is evaluated on real and synthetic data sets and compared with multiple baselines. The node classification results show that our model can efficiently update model parameters and achieve performance comparable to model retraining. In addition, we conduct a case study on the synthetic data and carry out specific analyses of each part of our model, illustrating its ability to learn new knowledge and maintain existing knowledge from different perspectives.
Multi-document summarization (MDS) aims at producing a brief summary for a cluster of related documents. In this paper, we cast the MDS task as an optimization problem with a novel measure, named soaking capacity, as the objective function. Our method originates from the classic hypothesis that summary components are the sinks of information diffusion. We point out that this hypothesis only specifies the role of a summary but does not capture how well a summary fulfills that role. To fill the gap, soaking capacity is formally defined to quantify the ability of a summary to soak up information. We explicitly demonstrate its fitness as an indicator for both the saliency and the diversity goals of MDS. To solve the optimization problem, we propose a greedy algorithm named Soap that adopts a surrogate of soaking capacity to accelerate the computation. Experiments on MDS datasets across various domains show the great potential of Soap compared with state-of-the-art MDS systems.
Phrase mining is a fundamental task for text analysis with various downstream applications such as named entity recognition, topic modeling, and relation extraction. In this paper, we focus on mining high-quality phrases from domain-specific corpora, with special consideration of infrequent ones. Previous methods may miss infrequent high-quality phrases in the candidate selection stage. Moreover, these methods rely on explicit features to mine phrases while rarely considering implicit features. In addition, completeness is rarely explicitly considered in the evaluation of a high-quality phrase. We propose a novel approach that exploits a sequence labeling model to capture infrequent phrases, and we employ implicit semantic features and contextual POS tag statistics to measure meaningfulness and completeness, respectively. Experiments on four real-world corpora demonstrate that our method achieves significant improvements over previous state-of-the-art methods across different domains and languages.
Due to the expensive cost of data annotation, few-shot learning has attracted increasing research interest in recent years. Various meta-learning approaches have been proposed to tackle this problem and have become the de facto practice. However, most of the existing approaches along this line mainly focus on image and text data in the Euclidean domain. In many real-world scenarios, a vast amount of data can instead be represented as attributed networks defined in the non-Euclidean domain, and few-shot learning studies on such structured data have largely remained nascent. Although some recent studies have tried to combine meta-learning with graph neural networks to enable few-shot learning on attributed networks, they fail to account for the unique properties of attributed networks when creating diverse tasks in the meta-training phase---the feature distributions of different tasks can be quite different, as instances (i.e., nodes) on attributed networks do not follow the i.i.d. assumption. Hence, this may inevitably result in suboptimal performance in the meta-testing phase. To tackle this problem, we propose a novel graph meta-learning framework--Attribute Matching Meta-learning Graph Neural Networks (AMM-GNN). Specifically, the proposed AMM-GNN leverages an attribute-level attention mechanism to capture the distinct information of each task and thus learns more effective transferable knowledge for meta-learning. We conduct extensive experiments on real-world datasets under a wide range of settings, and the experimental results demonstrate the effectiveness of the proposed AMM-GNN framework.
Crowd flow prediction, which aims to predict the in-out flows (e.g., the traffic of crowds, taxis, and bikes) of different areas of a city, is critically important to many real applications including public safety and intelligent transportation systems. The challenges of this problem lie in both the dynamic mobility patterns of crowds and the complex spatial-temporal correlations. Meanwhile, crowd flow is highly correlated with and affected by the Origin-Destination (OD) locations of the flow trajectories, which is largely ignored by existing works. In this paper, we study the novel problem of predicting the crowd flow and flow OD simultaneously, and propose a multi-task adversarial spatial-temporal network model, entitled MT-ASTN, to effectively address it. As a multi-task learning model, MT-ASTN adopts a shared-private framework which contains private spatial-temporal encoders, a shared spatial-temporal encoder, and decoders to learn the task-specific features and shared features. To effectively extract high-quality shared features, a discriminative loss on task classification and an adversarial loss on shared feature extraction are incorporated to reduce information redundancy. We also design an attentive temporal queue to automatically capture the complex temporal dependency without the help of domain knowledge. Extensive evaluations are conducted over the bike and taxicab trip datasets in New York. The results demonstrate that our approach significantly outperforms state-of-the-art methods by a large margin on both tasks.
The incompleteness of positive labels and the presence of many unlabelled instances are common problems in binary classification applications such as review helpfulness classification. Various studies from the classification literature treat all unlabelled instances as negative examples. However, a model that learns to classify binary instances with incomplete positive labels while assuming all unlabelled data to be negative will often become a biased classifier. In this work, we propose a novel Negative Confidence-aware Weakly Supervised approach (NCWS), which customises a binary classification loss function by discriminating between unlabelled examples with different negative confidences during the classifier's training. NCWS allows positive and negative instances to be identified and separated effectively and without bias after its integration into various binary classifiers from the literature, including SVM-, CNN- and BERT-based classifiers. We use review helpfulness classification as a test case for examining the effectiveness of our NCWS approach. We thoroughly evaluate NCWS on three different datasets, namely one from Yelp (venue reviews) and two from Amazon (Kindle and Electronics reviews). Our results show that NCWS outperforms strong baselines from the literature, including an existing SVM-based approach (i.e. SVM-P), a positive and unlabelled learning-based approach (i.e. C-PU) and a positive confidence-based approach (i.e. P-conf), in addressing the classifier's bias problem. Moreover, we further examine the effectiveness of NCWS by using its classified helpful reviews in a state-of-the-art review-based venue recommendation model (i.e. DeepCoNN) and demonstrate the benefits of using NCWS in enhancing venue recommendation effectiveness in comparison to the baselines.
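The core idea of weighting unlabelled examples by a negative confidence can be sketched as a modified binary cross-entropy (a simplification in our own notation; the paper's actual loss and the way confidences are estimated differ in detail):

```python
import math

def ncws_loss(score, is_positive, neg_conf):
    """Observed positives get the full BCE loss; unlabelled examples are
    treated as negatives but down-weighted by neg_conf, the confidence
    that the example is truly negative."""
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the classifier score
    if is_positive:
        return -math.log(p + 1e-12)
    return -neg_conf * math.log(1.0 - p + 1e-12)
```

An unlabelled example the model is confident is negative thus pushes the decision boundary harder than one that may simply be an unobserved positive, which is how the bias of the "all unlabelled = negative" assumption is reduced.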
In multi-label image recognition, it has become popular to predict the labels that co-occur in an image by modeling the label dependencies. Previous works focus on capturing the correlations between labels but neglect to effectively fuse the image features and label embeddings, which severely affects the convergence efficiency of the model and inhibits further precision improvements in multi-label image recognition. To overcome this shortcoming, in this paper we introduce Multi-modal Factorized Bilinear pooling (MFB), which serves as an efficient component for fusing cross-modal embeddings, and propose F-GCN, a fast graph convolution network (GCN)-based multi-label image recognition model. F-GCN consists of three key modules: (1) an image representation learning module, which adopts a convolutional neural network (CNN) to learn and generate image representations; (2) a label co-occurrence embedding module, which first obtains label vectors via the word embedding technique and then adopts a GCN to capture label co-occurrence embeddings; and (3) an MFB fusion module, which efficiently fuses these cross-modal vectors to enable an end-to-end model with a multi-label loss function. We conduct extensive experiments on two multi-label datasets, MS-COCO and VOC2007. Experimental results demonstrate that the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, recognition performance is also improved compared with state-of-the-art methods.
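The MFB fusion step can be sketched in a few lines of NumPy (the shapes and normalization choices follow the common MFB formulation and are our assumptions, not necessarily F-GCN's exact configuration): project both modalities, take the elementwise product, then sum-pool over a factor dimension k.

```python
import numpy as np

def mfb_fuse(x, y, U, V, k):
    """x: image feature (dx,); y: label embedding (dy,);
    U: (dx, d_out*k); V: (dy, d_out*k) projection matrices."""
    joint = (x @ U) * (y @ V)             # elementwise product of projections
    z = joint.reshape(-1, k).sum(axis=1)  # sum-pool over the factor dim k
    z = np.sign(z) * np.sqrt(np.abs(z))   # power normalization
    return z / (np.linalg.norm(z) + 1e-12)  # L2 normalization
```

The factorization keeps the fused vector low-dimensional (d_out rather than dx*dy), which is what makes bilinear fusion cheap enough to use as a drop-in module.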
Network embedding aims to automatically learn node representations in networks. The basic idea of network embedding is to first construct a network describing the neighborhood context of each node, and then learn the node representations by designing an objective function that preserves certain properties of the constructed context network. The vast majority of existing methods, explicitly or implicitly, follow a pointwise design principle: the objective can be decomposed into the summation of a certain goodness function over each individual edge of the context network. In this paper, we propose to go beyond such pointwise approaches and introduce a ranking-oriented design principle for network embedding. The key idea is to decompose the overall objective function into the summation of a goodness function over sets of edges, so as to collectively preserve their relative rankings in the context network. We instantiate the ranking-oriented design principle with two new network embedding algorithms: a pairwise network embedding method, PaWine, which optimizes the relative weights of edge pairs, and a listwise method, LiWine, which optimizes the relative weights of edge lists. Both proposed algorithms have linear time complexity, making them scalable to large networks. We conduct extensive experimental evaluations on five real datasets with a variety of downstream learning tasks, which demonstrate that the proposed approaches consistently outperform existing methods.
Recent advances in information extraction have motivated the automatic construction of huge Knowledge Graphs (KGs) by mining large-scale text corpora. However, noisy facts caused by automatic extraction are unavoidably introduced into such KGs.
To validate the correctness of facts (i.e., triplets) inside a KG, one possible approach is to map the triplets into vector representations by capturing the semantic meanings of facts. Although many representation learning approaches have been developed for knowledge graphs, these methods are not effective for validation: they usually assume that facts are correct, and thus may overfit noisy facts and fail to detect them.
Towards effective KG validation, we propose to leverage an external human-curated KG as an auxiliary information source to help detect errors in a target KG. The external KG is built upon human-curated knowledge repositories and tends to have high precision. On the other hand, although the target KG built by information extraction from texts has low precision, it can cover new or domain-specific facts that are not in any human-curated repositories. To leverage an external KG for validation, one intuitive approach is to find matched triplets between the target KG and the external KG. However, this approach only applies to the small portion of triplets covered by both KGs and is not useful for validating the majority of the triplets. To tackle this challenging task, we propose a cross-graph representation learning framework, CrossVal, which can leverage an external KG to validate the facts in the target KG efficiently. This is achieved by embedding triplets based on their semantic meanings, drawing cross-KG negative samples, and estimating a confidence score for each triplet based on its degree of correctness. We evaluate the proposed framework on datasets across different domains. Experimental results show that the proposed framework achieves the best performance compared with state-of-the-art methods on large-scale KGs.
Heterogeneous information networks have been widely used to alleviate the sparsity and cold start problems in recommender systems, since they can model rich context information in user-item interactions. Graph neural networks can encode this rich context information through propagation on the graph. However, existing heterogeneous graph neural networks neglect the entanglement of latent factors stemming from different aspects. Moreover, meta paths in existing approaches are simplified as connecting paths or side information between node pairs, overlooking the rich semantic information in the paths. In this paper, we propose DisenHAN, a novel disentangled heterogeneous graph attention network for top-N recommendation, which learns disentangled user/item representations from different aspects in a heterogeneous information network. In particular, we use meta relations to decompose high-order connectivity between node pairs and propose a disentangled embedding propagation layer that can iteratively identify the major aspect of each meta relation. Our model aggregates the corresponding aspect features from each meta relation for the target user/item. With multiple layers of embedding propagation, DisenHAN is able to explicitly capture the collaborative filtering effect semantically. Extensive experiments on three real-world datasets show that DisenHAN consistently outperforms state-of-the-art approaches. We further demonstrate the effectiveness and interpretability of the learned disentangled representations via insightful case studies and visualization.
Capturing the relatedness of different domains is a key challenge in transferring knowledge across domains. In this paper, we propose an effective and efficient Gaussian process (GP) modelling framework, mTGPmk, that can explicitly model domain relatedness and adaptively control the space as well as the strength of knowledge transfer. mTGPmk takes both the discrepancy of input feature space and the discrepancy of predictive function into account in the transfer procedure. Specifically, mTGPmk adaptively selects a good latent manifold shared by different domains, and utilizes a parametric similarity coefficient to measure the predictive function covariance of different domains in this manifold. The latent shared manifold and the similarity coefficient are jointly learned in a coupled manner. By doing so, mTGPmk maximizes the strength of the shared knowledge transfer by choosing the transfer space with the best transfer capacity. More importantly, mTGPmk exploits a succinct and computationally efficient manifold learning approach so that it can be well trained with scarce target training data. Extensive experimental studies using 36 synthetic transfer tasks and 10 real-world transfer tasks show the effectiveness of mTGPmk on capturing the relatedness and the transfer adaptiveness.
The analysis of wearable-sensory time series data (e.g., heart rate records) benefits many applications (e.g., activity recognition, disease diagnosis). However, sensor measurements usually contain missing values due to various factors (e.g., user behavior, lack of charging), which may degrade the performance of downstream analytical tasks (e.g., regression, prediction). Thus, time series imputation, which makes sensory time series complete, is desired. Existing time series imputation methods generally employ deep neural network models (e.g., GRUs and GANs) to fill missing values by leveraging temporal patterns extracted from the contextual observations. Despite their effectiveness, we argue that most existing models can only achieve sub-optimal imputation performance because they are inherently limited to sharing one single set of model parameters for imputation across all individuals. Relying on one set of parameters limits the expressiveness of the imputation model, as such models are bound to fail in capturing complex and varied personal characteristics. Therefore, most existing models tend to achieve inferior imputation performance, especially when a long duration of missing values, i.e., a large gap, is observed in the time series data. To address this limitation, this work develops a new imputation framework--the Personalized Wearable-Sensory Time Series Imputation framework (PTSI)--to provide a fully personalized treatment for time series imputation via effective knowledge transfer. In particular, PTSI first leverages a meta-learning paradigm to learn a well-generalized initialization to facilitate the adaptation process for each user. To make time series imputation reflective of an individual's unique characteristics, we further endow PTSI with the capability of learning personalized model parameters, which is achieved by designing a parameter initialization modulating component.
Extensive experiments on real-world human heart rate datasets demonstrate that our PTSI framework consistently outperforms various state-of-the-art methods by a large margin.
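The meta-learned initialization idea can be sketched with a toy Reptile-style loop (a stand-in, not PTSI's actual algorithm; the linear per-user tasks, step sizes, and all names are assumptions): the shared initialization is repeatedly nudged toward each user's adapted parameters, so a handful of personal samples then suffice to adapt.

```python
import numpy as np

rng = np.random.RandomState(0)
w_base = np.array([1.0, 2.0, -1.0])     # structure shared across users

def user_task(rng):
    # each hypothetical "user" deviates slightly from the shared signal
    w_u = w_base + 0.1 * rng.randn(3)
    X = rng.randn(20, 3)
    return X, X @ w_u

def sgd_adapt(w0, X, y, lr=0.05, steps=10):
    # per-user adaptation starting from the shared initialization
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

# Reptile-style outer loop: nudge the shared init toward each
# user's adapted weights, yielding an init that adapts quickly
w_meta = np.zeros(3)
for _ in range(300):
    X, y = user_task(rng)
    w_meta += 0.1 * (sgd_adapt(w_meta, X, y) - w_meta)
```

Here `w_meta` lands near the structure shared across users, while each user's quirks are handled by the inner adaptation steps; PTSI additionally modulates the initialization per user.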
To study the impact of providing direct answers in search results on user behavior, we conducted a controlled user study to analyze factors including reading time, eye-tracked attention, and the influence of the quality of answer module content. We also studied a more advanced answer interface in which multiple answers are shown on the search engine results page (SERP). Our results show that when answers are provided, users focus more extensively than normal on the top items in the result list. The presence of the answer module helps to improve user engagement on SERPs, reduces user effort, and promotes user satisfaction during the search process. Furthermore, we investigate how the question type -- factoid or non-factoid -- affects user interaction patterns. This work provides insight into the design of SERPs that include direct answers to queries, including when answers should be shown.
Recent research explores incorporating knowledge graphs (KG) into e-commerce recommender systems, not only to achieve better recommendation performance, but more importantly to generate explanations of why particular decisions are made. This can be achieved by explicit KG reasoning, where a model starts from a user node, sequentially determines the next step, and walks towards an item node of potential interest to the user. However, this is challenging due to the huge search space, unknown destination, and sparse signals over the KG, so informative and effective guidance is needed to achieve a satisfactory recommendation quality. To this end, we propose a CoArse-to-FinE neural symbolic reasoning approach (CAFE). It first generates user profiles as coarse sketches of user behaviors, which subsequently guide a path-finding process to derive reasoning paths for recommendations as fine-grained predictions. User profiles can capture prominent user behaviors from the history, and provide valuable signals about which kinds of path patterns are more likely to lead to potential items of interest for the user. To better exploit the user profiles, an improved path-finding algorithm called Profile-guided Path Reasoning (PPR) is also developed, which leverages an inventory of neural symbolic reasoning modules to effectively and efficiently find a batch of paths over a large-scale KG. We extensively experiment on four real-world benchmarks and observe substantial gains in the recommendation performance compared with state-of-the-art methods.
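The profile-guided path-finding step can be sketched on a toy KG (an illustrative stand-in for CAFE's neural modules; the triples, the relation-pattern profile, and exhaustive enumeration in place of learned path reasoning are all assumptions): paths from the user node are scored by a prior over relation patterns drawn from the user profile.

```python
# toy KG as (head, relation, tail) triples
triples = [("u1", "bought", "i1"), ("i1", "brand", "b1"),
           ("b1", "brand_of", "i2"), ("u1", "viewed", "i3"),
           ("i3", "also_viewed", "i4")]
adj = {}
for h, r, t in triples:
    adj.setdefault(h, []).append((r, t))

# user profile = coarse prior over relation-path patterns
profile = {("bought", "brand", "brand_of"): 0.8,
           ("viewed", "also_viewed"): 0.2}

def find_paths(user, max_len=3):
    # enumerate relation paths from the user, score them by the profile
    out = []
    stack = [(user, (), ())]
    while stack:
        node, rels, nodes = stack.pop()
        if rels in profile:
            out.append((profile[rels], nodes[-1]))   # (score, end item)
        if len(rels) < max_len:
            for r, t in adj.get(node, []):
                stack.append((t, rels + (r,), nodes + (t,)))
    return sorted(out, reverse=True)

recs = find_paths("u1")   # items ranked by profile-weighted paths
```

The paths themselves double as explanations, e.g. "u1 bought i1, whose brand b1 also makes i2".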
Anomaly detection is one of the most important data mining tasks in many real-life applications such as network intrusion detection for cybersecurity and medical diagnosis for healthcare. In the big data era, these applications demand fast and versatile anomaly detection capability to handle various types of increasingly huge-volume data. However, existing detection methods are either slow due to high computational complexity, or unable to deal with complicated anomalies like local anomalies. In this paper, we propose a novel anomaly detection method named OPHiForest, built on an order-preserving-hashing-based isolation forest. The core idea is to learn from the data to construct a better isolation forest structure than state-of-the-art methods like iForest and LSHiForest, which achieves robust detection of various anomaly types. We design a fast two-step learning process for the order preserving hashing scheme. This leads to stronger order preservation for better hashing, and therefore enhances anomaly detection robustness and accuracy. Extensive experiments on both synthetic and real-world data sets demonstrate that our method is highly robust and scalable.
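The isolation-forest principle that OPHiForest builds on can be sketched as follows (plain random splitting as in the iForest baseline, not the learned order-preserving hashing; all names are illustrative): anomalies are isolated by random axis-aligned splits at shallower average depth than inliers.

```python
import numpy as np

rng = np.random.RandomState(42)

def isolation_depth(X, x, depth=0, max_depth=10):
    # depth at which point x is isolated by random axis-aligned splits
    if len(X) <= 1 or depth >= max_depth:
        return depth
    j = rng.randint(X.shape[1])
    lo, hi = X[:, j].min(), X[:, j].max()
    if lo == hi:
        return depth
    s = rng.uniform(lo, hi)
    side = X[:, j] < s
    X = X[side] if x[j] < s else X[~side]   # follow x into its half
    return isolation_depth(X, x, depth + 1, max_depth)

X = rng.randn(256, 2)
outlier = np.array([8.0, 8.0])
inlier = X[0]
# average over many random trees: anomalies isolate at shallow depth
d_out = np.mean([isolation_depth(X, outlier) for _ in range(100)])
d_in = np.mean([isolation_depth(X, inlier) for _ in range(100)])
```

OPHiForest replaces the purely random splits with splits derived from learned order-preserving hash functions, which is what lets it handle harder cases such as local anomalies.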
Accurate traffic speed prediction is an important and challenging topic for transportation planning. Previous studies on traffic speed prediction predominantly used spatio-temporal and context features for prediction. However, they have not made good use of the impact of traffic incidents. In this work, we aim to make use of incident information to achieve a better prediction of traffic speed. Our incident-driven prediction framework consists of three processes. First, we propose a critical incident discovery method to discover traffic incidents with high impact on traffic speed. Second, we design a binary classifier, which uses deep learning methods to extract the latent incident impact features. Combining the above methods, we propose a Deep Incident-Aware Graph Convolutional Network (DIGC-Net) to effectively incorporate traffic incident, spatio-temporal, periodic and context features for traffic speed prediction. We conduct experiments using two real-world traffic datasets of San Francisco and New York City. The results demonstrate the superior performance of our model compared with the competing benchmarks.
Story generation, which aims to generate a long and coherent story automatically based on a title or an input sentence, is an important research area in the field of natural language generation. There is relatively little work on story generation with appointed emotions. Most existing works focus on using only one specific emotion to control the generation of a whole story and ignore the emotional changes in the characters over the course of the story. In our work, we aim to design an emotional line for each character that considers multiple emotions common in psychological theories, with the goal of generating stories with richer emotional changes in the characters. To the best of our knowledge, this work is the first to focus on characters' emotional lines in story generation. We present a novel attention-based model, SoCP (Storytelling of multi-Character Psychology). We show that the proposed model can generate stories that track the changes in the psychological states of different characters. To account for the particularity of the model, in addition to commonly used evaluation indicators (BLEU, ROUGE, etc.), we introduce the accuracy of psychological state control as a novel evaluation metric, which reflects how well the model controls the psychological states of story characters. Experiments show that, according to both automatic and human evaluations, stories generated with SoCP follow the designated psychological state of each character.
Building a question-answering agent currently requires large annotated datasets, which are prohibitively expensive. This paper proposes Schema2QA, an open-source toolkit that can generate a Q&A system from a database schema augmented with a few annotations for each field. The key concept is to cover the space of possible compound queries on the database with a large number of in-domain questions synthesized with the help of a corpus of generic query templates. The synthesized data and a small paraphrase set are used to train a novel neural network based on the BERT pretrained model. We use Schema2QA to generate Q&A systems for five Schema.org domains: restaurants, people, movies, books, and music, and obtain an overall accuracy between 64% and 75% on crowdsourced questions for these domains. Once annotations and paraphrases are obtained for a Schema.org schema, no additional manual effort is needed to create a Q&A agent for any website that uses the same schema. Furthermore, we demonstrate that learning can be transferred from the restaurant to the hotel domain, obtaining 64% accuracy on crowdsourced questions with no manual effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions that can be answered using Schema.org. Its performance is comparable to Google Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all these assistants by at least 18% on more complex, long-tail questions.
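The synthesis step can be sketched as follows (the field annotations, phrase templates, and logical-form syntax are invented for illustration and are not Schema2QA's actual format): each schema field carries a few annotation phrases, and crossing them with sample values yields (question, query) training pairs.

```python
import itertools

# hypothetical field annotations for a Schema.org "Restaurant" schema
annotations = {"servesCuisine": ["serving {v} food", "that are {v}"],
               "ratingValue": ["rated at least {v} stars"]}
values = {"servesCuisine": ["italian", "thai"], "ratingValue": ["4"]}

def synthesize():
    # cross-product of phrase templates and field values gives the
    # synthetic (question, logical form) training pairs
    pairs = []
    for field, phrases in annotations.items():
        for phrase, v in itertools.product(phrases, values[field]):
            pairs.append(("show me restaurants " + phrase.format(v=v),
                          f"filter {field} == {v!r}"))
    return pairs

pairs = synthesize()
```

Compound queries would be produced by composing several such clauses per question, which is how a small annotation effort covers a large query space.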
Different from shopping at retail stores, consumers on e-commerce platforms usually cannot touch or try products before purchasing, which means that they have to make decisions when they are uncertain about the outcome (e.g., satisfaction level) of purchasing a product. To study people's preferences with regard to choices that have uncertain outcomes, economics researchers have proposed the Expected Utility (EU) hypothesis, which models the subjective value associated with an individual's choice as the statistical expectation of that individual's valuations of the outcomes of this choice. Despite its success in studies of game theory and decision theory, the effectiveness of EU is mostly unknown in e-commerce recommendation systems. Previous research on e-commerce recommendation interprets the utility of purchase decisions either as a function of the consumed quantity of the product or as the monetary gain of sellers/buyers. As most consumers purchase just one unit of a product at a time and most alternatives have similar prices, such modeling of purchase utility is likely to be inaccurate in practice. In this paper, we interpret purchase utility as the satisfaction level a consumer gets from a product and propose a recommendation framework using EU to model consumers' behavioral patterns. We assume that each consumer estimates the expected utilities of all the alternatives and chooses the products with maximum expected utility for each purchase. To deal with the potential psychological biases of each consumer, we introduce the Probability Weight Function (PWF) and design our algorithm based on Weighted Expected Utility (WEU). Empirical studies on real-world e-commerce datasets show that our proposed ranking-based recommendation framework achieves statistically significant improvements over both classical Collaborative Filtering/Latent Factor Models and state-of-the-art deep models in top-K recommendation.
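The core scoring idea can be sketched numerically (a minimal illustration assuming a Prelec weighting function and made-up satisfaction outcomes; the paper's learned PWF may differ): outcome probabilities are distorted before taking the expectation, so a consumer who overweights the small chance of disappointment values the product differently than plain EU suggests.

```python
import numpy as np

def prelec_weight(p, alpha=0.65):
    # Prelec probability weighting: distorts probabilities,
    # overweighting rare outcomes when alpha < 1
    return np.exp(-(-np.log(p)) ** alpha)

def weighted_expected_utility(outcomes, probs, alpha=0.65):
    w = prelec_weight(np.asarray(probs, dtype=float), alpha)
    return float(np.sum(w * np.asarray(outcomes)))

# hypothetical product: utility if satisfied vs. if disappointed
outcomes = [1.0, 0.2]
probs = [0.9, 0.1]
weu = weighted_expected_utility(outcomes, probs)   # biased consumer's view
eu = float(np.dot(probs, outcomes))                # plain expected utility
```

Ranking candidate products by `weu` rather than `eu` is the sketch-level analogue of the WEU recommendation criterion.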
Link prediction in simple graphs is a fundamental problem in which new links between vertices are predicted based on the observed structure of the graph. However, in many real-world applications, there is a need to model relationships among vertices that go beyond pairwise associations. For example, in a chemical reaction, the relationship among the reactants and products is inherently higher-order. Additionally, there is a need to represent the direction from reactants to products. Hypergraphs provide a natural way to represent such complex higher-order relationships. The Graph Convolutional Network (GCN) has recently emerged as a powerful deep learning-based approach for link prediction over simple graphs. However, its suitability for link prediction in hypergraphs is underexplored -- we fill this gap in this paper and propose Neural Hyperlink Predictor (NHP). NHP adapts GCNs for link prediction in hypergraphs. We propose two variants of NHP -- NHP-U and NHP-D -- for link prediction over undirected and directed hypergraphs, respectively. To the best of our knowledge, NHP-D is the first-ever method for link prediction over directed hypergraphs. An important feature of NHP is that it can also be used for hyperlinks in which dissimilar vertices interact (e.g. acids reacting with bases). Another attractive feature of NHP is that it can be used to predict unseen hyperlinks at test time (inductive hyperlink prediction). Through extensive experiments on multiple real-world datasets, we show NHP's effectiveness.
Machine learning models are at the foundation of modern society. Accounts of unfair models penalizing subgroups of a population have been reported in domains including law enforcement and job screening. Unfairness can stem from biases in the training data, as well as from class imbalance, i.e., when a sensitive group's data is not sufficiently represented. Under such settings, balancing techniques are commonly used to achieve better prediction performance, but their effects on model fairness are largely unknown. In this paper, we first illustrate the extent to which common balancing techniques exacerbate unfairness in real-world data. Then, we propose a new method, called fair class balancing, that allows us to enhance model fairness without using any information about sensitive attributes. We show that our method can achieve accurate prediction performance while concurrently improving fairness.
Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related-article recommendation, and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer-based hierarchical encoder to capture the document structure information. In order to better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark data sets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention, the multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open-source a Wikipedia-based benchmark data set, code, and a pre-trained model to accelerate future research on long-form document matching.
We demonstrate the existence of a group algebraic structure hidden in relational knowledge embedding problems, which suggests that a group-based embedding framework is essential for designing embedding models. Our theoretical analysis explores merely the intrinsic property of the embedding problem itself and is hence model-independent. Motivated by the theoretical analysis, we propose a group-theory-based knowledge graph embedding framework, in which relations are embedded as group elements and entities are represented by vectors in group action spaces. We provide a generic recipe to construct embedding models, along with two instantiated examples: SO3E and SU2E, both of which apply a continuous non-Abelian group as the relation embedding. Empirical experiments using these two example models show state-of-the-art results on benchmark datasets.
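The group-element idea can be sketched with a rotation acting on entity vectors (a minimal illustration in the spirit of SO3E, restricted to z-axis rotations; the distance-based scoring function and all names are assumptions): a relation is a group element, a triple holds when the relation's action carries the head embedding onto the tail embedding.

```python
import numpy as np

def rotation_z(theta):
    # an SO(3) group element: rotation by theta about the z-axis
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# entities as 3-vectors; the relation acts on them as a rotation
h = np.array([1.0, 0.0, 0.5])
R = rotation_z(np.pi / 2)
t = R @ h                       # a triple (h, r, t) the relation explains

def score(h, R, t):
    # closer to 0 means the triple is more plausible
    return -np.linalg.norm(R @ h - t)

good = score(h, R, t)
bad = score(h, R, np.array([0.0, -1.0, 0.5]))
```

Composing two relation embeddings is just the group product of their rotations, which is the structural property the paper's analysis isolates.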
Randomness exists either due to the inherent noise of the problem or due to the lack of important input features, which could lead to multimodality of the data distribution. Therefore, in more and more scenarios, it is necessary to predict not only a single point value, but also the distribution of the prediction. However, well-studied prediction models usually focus on point prediction that minimizes the mean squared error or the mean absolute error. These approaches can miss important knowledge when their outputs are fed into a downstream decision process. In this paper, we combine the advantages of both GANs (Generative Adversarial Nets) and VAEs (Variational Auto-Encoders), and introduce a latent-based conditional generative model (LB-CGM) to handle distribution regression problems. The VAE framework is adopted, and the adversarial network is applied to estimate the validity of the generated sample. Besides, a latent-based reconstruction loss is introduced to mitigate mode collapse, in which the direct pairwise comparison between the original and generated samples ensures the correctness and completeness of the generated mode pattern. In this work, we explore a path for generative models to be used in probabilistic prediction problems. The method produces conditional prediction distributions close to the actual distributions and is verified on both synthetic and benchmark datasets.
Risk prediction using electronic health records (EHR) is a challenging data mining task due to the two-level hierarchical structure of EHR data. EHR data consist of a set of time-ordered visits, and within each visit there is a set of unordered diagnosis codes. Existing approaches focus on modeling temporal visits with deep neural network (DNN) techniques. However, they ignore the importance of modeling diagnosis codes within visits, and the large amount of task-unrelated information within visits usually leads to unsatisfactory performance. To minimize the effect of such noise in EHR data, in this paper we propose a novel DNN for risk prediction termed LSAN, which consists of a Hierarchical Attention Module (HAM) and a Temporal Aggregation Module (TAM). In particular, LSAN applies HAM to model the hierarchical structure of EHR data. Using the attention mechanism at the diagnosis-code level, HAM retains diagnosis details and assigns flexible attention weights to different diagnosis codes according to their relevance to the corresponding diseases. Moreover, the attention mechanism at the visit level learns a comprehensive feature across the visit history by paying greater attention to visits with higher relevance. Based on the foundation laid by HAM, TAM uses a two-pathway structure to learn a robust temporal aggregation mechanism over all visits. It extracts long-term dependencies with a Transformer encoder and short-term correlations among different visits with a parallel convolutional layer. With HAM and TAM, LSAN achieves state-of-the-art performance on three real-world datasets, with larger AUC, recall, and F1 scores. Furthermore, the model analysis demonstrates the effectiveness of the network construction, along with good interpretability and robustness in the decisions made by LSAN.
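The two-level attention in HAM can be sketched as follows (a bare-bones dot-product attention with random stand-in "learned" queries; LSAN's actual module is more elaborate): codes within a visit are attention-pooled into a visit vector, and visit vectors are attention-pooled into a patient representation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(items, query):
    # attention-weighted sum: weight_i proportional to exp(items_i . query)
    w = softmax(items @ query)
    return w @ items, w

rng = np.random.RandomState(0)
d = 8
# 3 visits, each an unordered bag of diagnosis-code embeddings
visits = [rng.randn(n, d) for n in (4, 2, 5)]
q_code = rng.randn(d)    # query over codes (hypothetical, would be learned)
q_visit = rng.randn(d)   # query over visits (hypothetical, would be learned)

# level 1: pool codes into visit vectors; level 2: pool visits
visit_vecs = np.stack([attend(codes, q_code)[0] for codes in visits])
patient_vec, visit_w = attend(visit_vecs, q_visit)
```

The attention weights (`visit_w` and the per-code weights) are exactly what gives this kind of model its interpretability: high-weight codes and visits explain the risk score.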
We study the commonsense inference task that aims to reason and generate the causes and effects of a given event. Existing neural methods focus more on understanding and representing the event itself, but pay little attention to the relations between different commonsense dimensions (e.g. causes or effects) of the event, making the generated results logically inconsistent and unreasonable. To alleviate this issue, we propose Chain Transformer, a logic enhanced commonsense inference model that combines both direct and indirect inferences to construct a logical chain so as to reason in a more logically consistent way. First, we apply a self-attention based encoder to represent and encode the given event. Then a chain of decoders is implemented to reason and generate for different dimensions following the logical chain, where an attention module is designed to link different decoders and to make each decoder attend to the previous reasoned inferences. Experiments on two real-world datasets show that Chain Transformer outperforms previous methods on both automatic and human evaluation, and demonstrate that Chain Transformer can generate more reasonable and logically consistent inference results.
Adversarial examples can be detrimental to a recommender, leading to a surge of enthusiasm for applying adversarial learning to improve recommendation performance, e.g., raising model robustness, alleviating data sparsity, and generating initial profiles for cold-start users or items. Most existing adversarial example generation methods fall within three categories: attacking the user-item interactions or auxiliary contents, adding perturbations in latent space, or sampling the latent space according to a certain distribution. In this work, we focus on the semantic-rich user-item interactions in a recommender system and propose a novel generative adversarial network (GAN) named Convolutional Generative Collaborative Filtering (Conv-GCF). We develop an effective perturbation mechanism (adversarial noise layer) for convolutional neural networks (CNN), based on which we design a generator with residual blocks to synthesize user-item interactions. We empirically demonstrate that in Conv-GCF, the adversarial noise layer is superior to the conventional noise-adding approach. Moreover, we propose two types of discriminators: one using Bayesian Personalized Ranking (BPR) and the other using binary classification. On four public datasets, we show that our approach achieves state-of-the-art top-n recommendation performance among competitive baselines.
The heavy traffic congestion problem has always been a concern for modern cities. To alleviate traffic congestion, researchers have in recent years used reinforcement learning (RL) to develop better traffic signal control (TSC) algorithms. However, most RL models are trained and tested in the same traffic flow environment, which results in a serious overfitting problem. Since the traffic flow environment in the real world keeps varying, these models can hardly be applied due to their lack of generalization ability. Besides, the limited number of accessible traffic flow datasets brings extra difficulty in testing the generalization ability of the models. In this paper, we design a novel traffic flow generator based on the Wasserstein generative adversarial network to generate sufficiently diverse and high-quality traffic flows and use them to build proper training and testing environments. Then we propose a meta-RL TSC framework, GeneraLight, to improve the generalization ability of TSC models. GeneraLight boosts generalization performance by combining the idea of flow clustering with model-agnostic meta-learning. We conduct extensive experiments on multiple real-world datasets to show the superior performance of GeneraLight in generalizing to different traffic flows.
Multi-Task Learning (MTL) leverages the inter-relationship across tasks and is useful for applications with limited data. Existing works articulate different task relationship assumptions, whose validity is vital to successful multi-task training. We observe that, in many scenarios, the inter-relationship across tasks varies across different groups of data (i.e., topics), which we call the within-topic task relationship hypothesis. In this case, current MTL models with a homogeneous task relationship assumption cannot fully exploit the different task relationships among different groups of data. Based on this observation, we propose a generalized topic-wise multi-task architecture to capture the within-topic task relationship, which can be combined with any existing MTL design. Further, we propose a new specialized MTL design, topic-task-sparsity, along with two different types of sparsity constraints. The architecture, combined with the topic-task-sparsity design, constitutes our proposed TOMATO model. Experiments on both synthetic and 4 real-world datasets show that our proposed models consistently outperform 6 state-of-the-art models and 2 baselines, with improvements from 5% to 46% in terms of task-wise comparison, demonstrating the validity of the proposed within-topic task relationship hypothesis. We release the source code and datasets of TOMATO at: https://github.com/JasonLC506/MTSEM.
Zero-shot learning (ZSL) aims to recognize unseen categories whose data is unavailable during the training stage. Most existing ZSL algorithms focus on learning an embedding space and determine the classes of test samples according to sample-prototype similarities in this space. However, we observe that, in contrast to the single sample-prototype relationship, an ensemble criterion usually benefits the final classification. Inspired by this, we introduce a novel cluster-prototype matching (CPM) strategy and propose a ZSL framework based on CPM. First, we learn a mapping between the visual space and the semantic space utilizing a well-established ZSL algorithm. Via the learned mapping, all test samples are projected into the embedding space and clustered in this space. Second, two CPM methods, soft-CPM and hard-CPM, are proposed to match clusters and class prototypes and calculate cluster-prototype similarities. Finally, the label of each sample is determined by combining the sample-prototype similarity and the cluster-prototype similarity. We apply our framework to five basic ZSL methods and compare them with several advanced ZSL baselines. The experimental results demonstrate that the proposed framework can significantly improve the performance of the basic ZSL models and help them match or surpass the state of the art.
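The cluster-prototype matching intuition can be sketched as follows (a toy soft-CPM-style blend; the prototypes, embeddings, and mixing weight `alpha` are invented): each sample's own prototype similarity is blended with its cluster centroid's similarity, so a borderline sample is rescued by the cluster it belongs to.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical unseen-class prototypes in the embedding space
prototypes = {"zebra": np.array([1.0, 0.1]), "horse": np.array([0.1, 1.0])}

def classify(cluster, alpha=0.5):
    # blend per-sample similarity with the cluster centroid's similarity,
    # so the cluster votes as an ensemble for all of its members
    centroid = cluster.mean(axis=0)
    preds = []
    for x in cluster:
        scores = {c: alpha * cosine(x, p) + (1 - alpha) * cosine(centroid, p)
                  for c, p in prototypes.items()}
        preds.append(max(scores, key=scores.get))
    return preds

# the third sample alone looks more "horse", but its cluster is zebra-like
cluster = np.array([[1.0, 0.2], [0.9, 0.3], [0.45, 0.5]])
```

With `alpha=1.0` (sample-prototype only) the third sample is misclassified; the cluster term flips it to the correct class, which is the "more than one" ensemble effect the framework exploits.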
Neighborhood aggregation is a key step in Graph Convolutional Networks (GCNs) for graph representation learning. Two commonly used aggregators, sum and mean, are designed with the homophily assumption that connected nodes are likely to share the same label. However, real-world graphs are noisy, and adjacent nodes do not necessarily imply similarity. Learnable aggregators were proposed in the Graph Attention Network (GAT) and the Learnable Graph Convolutional Layer (LGCL). However, GAT considers node importance but not the importance of different features. The convolution aggregator in LGCL considers feature importance, but it cannot directly operate on graphs due to their irregular connectivity and lack of orderliness. In this paper, we first unify the current learnable aggregators in one framework, Learnable Aggregator for GCN (LA-GCN), by introducing a shared auxiliary model that provides a customized schema for neighborhood aggregation. Under this framework, we propose a new model called LA-GCNMask, consisting of a new aggregator function, the mask aggregator. The auxiliary model learns a specific mask for each neighbor of a given node, allowing both node-level and feature-level attention. This mechanism learns to assign different importance to both nodes and features for prediction, which provides interpretable explanations for prediction and increases model robustness. Experiments on seven graphs for node classification and graph classification tasks show that LA-GCNMask outperforms the state-of-the-art methods. Moreover, our aggregator can identify the important nodes and node features simultaneously, which provides a quantified understanding of the relationship between input nodes and the prediction. We further conduct experiments on noisy graphs to evaluate the robustness of our model. These experiments show that LA-GCNMask consistently outperforms the state-of-the-art methods, with up to 15% improvement in accuracy over the second best.
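The mask aggregator can be sketched as follows (a minimal numpy version; the sigmoid gating, the single-layer auxiliary model, and all names are illustrative assumptions, not the paper's exact architecture): a shared auxiliary model looks at the (center, neighbor) pair and emits a feature-sized mask, giving node-level and feature-level attention at once.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mask_aggregate(h_self, h_neighbors, W_aux, b_aux):
    # aggregate neighbors, each gated by a learned per-feature mask
    agg = np.zeros_like(h_self)
    for h_n in h_neighbors:
        z = np.concatenate([h_self, h_n]) @ W_aux + b_aux
        mask = sigmoid(z)            # in (0,1)^d, one gate per feature
        agg += mask * h_n
    return agg / max(len(h_neighbors), 1)

rng = np.random.RandomState(0)
d = 4
h_self = rng.randn(d)
neighbors = [rng.randn(d) for _ in range(3)]
W_aux = rng.randn(2 * d, d) * 0.1    # auxiliary model weights (toy)
b_aux = np.zeros(d)
out = mask_aggregate(h_self, neighbors, W_aux, b_aux)
```

A noisy neighbor can be down-weighted feature by feature rather than all-or-nothing, which is the source of the robustness and interpretability claims.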
Query understanding is a fundamental problem in information retrieval (IR) that has attracted continuous attention through the past decades. Many different tasks have been proposed for understanding users' search queries, e.g., query classification or query clustering. However, understanding a search query only at the intent class/cluster level is imprecise, since much detailed information is lost. As we find in many benchmark datasets, e.g., TREC and SemEval, queries are often associated with a detailed description provided by human annotators that clearly describes the query's intent to help evaluate the relevance of the documents. If a system could automatically generate a detailed and precise intent description for a search query, as human annotators do, that would indicate much better query understanding has been achieved. In this paper, we therefore propose a novel Query-to-Intent-Description (Q2ID) task for query understanding. Unlike existing ranking tasks, which leverage the query and its description to compute the relevance of documents, Q2ID is the reverse task: it aims to generate a natural language intent description based on both relevant and irrelevant documents of a given query. To address this new task, we propose a novel Contrastive Generation model, CtrsGen for short, which generates the intent description by contrasting the relevant documents with the irrelevant documents of a given query. We demonstrate the effectiveness of our model by comparing it with several state-of-the-art generation models on the Q2ID task. We discuss the potential usage of the Q2ID technique through an example application.
Category systems are central components of knowledge bases, as they provide a hierarchical grouping of semantically related concepts and entities. They are a unique and valuable resource that is utilized in a broad range of information access tasks. To aid knowledge editors in the manual process of expanding a category system, this paper presents a method of generating categories for sets of entities. First, we employ neural abstractive summarization models to generate candidate categories. Next, the location within the hierarchy is identified for each candidate. Finally, structure-, content-, and hierarchy-based features are used to rank candidates and identify the most promising ones (measured in terms of specificity, hierarchy, and importance). We develop a test collection based on Wikipedia categories and demonstrate the effectiveness of the proposed approach.
Graph Neural Networks (GNNs), like GCN and GAT, have achieved great success in a number of supervised or semi-supervised tasks, including node classification and link prediction. These existing graph neural networks can effectively encode neighborhood information of graph nodes through their message aggregating mechanisms. However, for unsupervised, structure-related tasks like community detection, a fundamental problem in network analysis that finds densely connected groups of nodes and separates them from others in graphs, it is still difficult for these general-purpose GNNs to learn the needed structural information. To overcome the shortcomings of general-purpose graph representation learning methods, we propose Community Deep Graph Infomax (CommDGI), a graph neural network designed to handle community detection problems. Inspired by the success of deep graph infomax in self-supervised graph learning, we design a novel mutual information mechanism to capture neighborhood as well as community information in graphs. A trainable clustering layer is employed to learn the community partition in an end-to-end manner. Disentangled representation learning is applied in our graph neural network so that the model can improve interpretability and generalization. Throughout the whole learning process, joint optimization is applied to learn the community-related node representations. The experimental results show that our algorithm outperforms state-of-the-art community detection methods.
Traffic flow prediction plays an important role in many spatial-temporal data applications, e.g., traffic management and urban planning. Various deep learning techniques have been developed to model traffic dynamic patterns with different neural network architectures, such as attention mechanisms and recurrent neural networks. However, two important challenges have yet to be well addressed: (i) most of these methods solely focus on local spatial dependencies and ignore the global inter-region dependencies in terms of traffic distributions; (ii) it is important to capture channel-aware semantics when performing spatial-temporal information aggregation. To address these challenges, we propose a new traffic prediction framework, the Spatial-Temporal Convolutional Graph Attention Network (ST-CGA), to enable traffic prediction that models region dependencies comprehensively, from local to global. In our ST-CGA framework, we first develop a hierarchical attention network with a graph-based neural architecture to capture both the multi-level temporal relations and cross-region traffic dependencies. Furthermore, a region-wise spatial relation encoder with a channel-aware recalibration residual network is proposed to let ST-CGA map spatial and temporal signals into different representation subspaces. Extensive experiments on four real-world datasets demonstrate that ST-CGA achieves substantial gains over many state-of-the-art baselines. Source code is available at: https://github.com/shurexiyue/ST-CGA.
Graph translation is a very promising research direction with a wide range of potential real-world applications. A graph is a natural structure for representing relationships and interactions, and its translation can encode the intrinsic semantic changes of relationships in different scenarios. However, despite its seemingly wide possibilities, the usage of graph translation so far remains quite limited. One important reason is the lack of high-quality paired datasets. For example, we can easily build graphs representing people's shared music tastes and graphs representing co-purchase behavior, but a well-paired dataset is much more expensive to obtain. Therefore, in this work, we seek to provide a graph translation model for the semi-supervised scenario. This task is non-trivial, because graph translation involves changing the semantics in the form of link topology and node attributes, which is difficult to capture due to the combinatorial nature and inter-dependencies. Furthermore, due to the high degree of freedom in a graph's composition, it is difficult to ensure the generalization ability of trained models. These difficulties impose a tighter requirement on the exploitation of unpaired samples. To address them, we propose to construct a dual representation space, where transformation is performed explicitly to model the semantic transitions. Special encoder/decoder structures are designed, and an auxiliary mutual information loss is adopted to enforce the alignment of unpaired/paired examples. We evaluate the proposed method on three different datasets.
Graph neural networks (GNNs) have been widely used to learn node representations from graph data in an unsupervised way for downstream tasks. However, when applied to detect anomalies (e.g., outliers, unexpected density), they deliver unsatisfactory performance because existing loss functions fail. For example, any loss based on random walk (RW) algorithms no longer works, because the assumption that anomalous nodes are close to each other does not hold. Moreover, the inherent class imbalance of anomaly detection tasks poses great challenges for reducing the prediction error. In this work, we propose a novel loss function to train GNNs for anomaly-detectable node representations. It evaluates node similarity using global grouping patterns discovered by graph mining algorithms, and it can automatically adjust margins for minority classes based on the data distribution. Theoretically, we prove that the prediction error is bounded given the proposed loss function. We empirically investigate the effectiveness of GNNs trained with different loss variants based on different algorithms. Experiments on two real-world datasets show that they perform significantly better than RW-based losses for graph anomaly detection.
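The margin-adjustment idea above can be illustrated with a frequency-dependent per-class margin combined with a simple hinge loss. This is only a hedged sketch in the spirit of LDAM-style heuristics; the exponent 0.25 and the helper names `class_margins` and `margin_hinge_loss` are assumptions, not the paper's actual formulation.

```python
def class_margins(class_counts, scale=1.0):
    """Assign larger margins to rarer classes.

    A common heuristic (LDAM-style) sets margin_c proportional to
    n_c ** -0.25, so minority classes get a wider safety margin.
    """
    return {c: scale / (n ** 0.25) for c, n in class_counts.items()}

def margin_hinge_loss(score_pos, score_neg, margin):
    """Hinge loss that is zero only once the positive score beats the
    negative score by at least the class-dependent margin."""
    return max(0.0, margin - (score_pos - score_neg))
```

With counts {majority: 10000, minority: 10}, the minority class receives the larger margin, so its examples must be separated more strongly before their loss vanishes.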
With the recent prevalence of Reinforcement Learning (RL), there has been tremendous interest in developing RL-based recommender systems. In practical recommendation sessions, users sequentially access multiple scenarios, such as entrance pages and item detail pages, and each scenario has its specific characteristics. However, the majority of existing RL-based recommender systems focus on optimizing one strategy for all scenarios or separately optimizing each strategy, which can lead to sub-optimal overall performance. In this paper, we study the recommendation problem with multiple (consecutive) scenarios, i.e., whole-chain recommendations. We propose a multi-agent RL-based approach (DeepChain), which can capture the sequential correlation among different scenarios and jointly optimize multiple recommendation strategies. To be specific, all recommender agents (RAs) share the same memory of users' historical behaviors, and they work collaboratively to maximize the overall reward of a session. Note that jointly optimizing multiple recommendation strategies faces two challenges in the existing model-free RL setting: (i) it requires huge amounts of user behavior data, and (ii) the distribution of rewards (users' feedback) is extremely unbalanced. In this paper, we introduce model-based RL techniques to reduce the training data requirement and execute more accurate strategy updates. Experimental results based on a real e-commerce platform demonstrate the effectiveness of the proposed framework.
Recently, significant progress has been made in sequential recommendation with deep learning. Existing neural sequential recommendation models usually rely on the item prediction loss to learn model parameters or data representations. However, a model trained with this loss is prone to suffer from the data sparsity problem. Since this loss overemphasizes final performance, the association or fusion between context data and sequence data has not been well captured and utilized for sequential recommendation.
To tackle this problem, we propose the model S3-Rec, which stands for Self-Supervised learning for Sequential Recommendation, based on the self-attentive neural architecture. The main idea of our approach is to utilize the intrinsic data correlation to derive self-supervision signals and enhance the data representations via pre-training methods to improve sequential recommendation. For our task, we devise four auxiliary self-supervised objectives to learn the correlations among attributes, items, subsequences, and sequences by utilizing the mutual information maximization (MIM) principle. MIM provides a unified way to characterize the correlation between different types of data, which is particularly suitable in our scenario. Extensive experiments conducted on six real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art methods, especially when only limited training data is available. In addition, we extend our self-supervised learning method to other recommendation models and show that it also improves their performance.
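The mutual information maximization principle is commonly instantiated with an InfoNCE-style contrastive lower bound. The following plain-Python sketch shows one such instantiation; the function name, temperature value, and scalar similarity inputs are illustrative assumptions, not S3-Rec's actual objective.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE loss for one anchor: maximize the similarity of the
    positive pair relative to a set of negative similarities.

    Minimizing this loss maximizes a lower bound on the mutual
    information between the two views forming the positive pair.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # numerically stable log-sum-exp over all candidates
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / temperature - log_sum)
```

As expected of a contrastive loss, the value shrinks as the positive pair becomes more similar than the negatives.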
A directed acyclic graph (DAG) is an essential model for representing terminologies and their hierarchical relationships, such as the Disease Ontology. Due to the massive number of terminologies and the complex structure of a large DAG, summarizing the whole hierarchical DAG is challenging.
In this paper, we study a new problem of finding k representative vertices to summarize a hierarchical DAG. To capture both diverse summarization and important vertices, we design a summary score function reflecting vertices' diversity coverage and structure correlation. The studied problem is theoretically proven to be NP-hard. To tackle it efficiently, we propose a greedy algorithm with an approximation guarantee, which iteratively adds the vertices with the largest summary contributions to the answer. To further improve answer quality, we propose a subtree-extraction-based method, which is proven to achieve higher-quality answers. In addition, we develop a scalable algorithm, k-PCGS, based on candidate pruning and DAG compression for large-scale hierarchical DAGs. Extensive experiments on large real-world datasets demonstrate both the effectiveness and the efficiency of the proposed algorithms.
As an online, query-dependent variant of the well-known community detection problem, community search has been studied for years to find communities containing the query vertices. Along with the generation of graphs with rich attribute information, attributed community search has attracted increasing interest recently, aiming to select communities where vertices are cohesively connected and share homogeneous attributes. However, existing community models may include cut-edges/vertices and thus cannot well guarantee the strong connectivity required by a cohesive community. In this paper, we propose a new cohesive attributed community (CAC) model that can ensure both structure cohesiveness and attribute cohesiveness of communities. Specifically, for a query with vertex vq and keyword set S, we aim to find the cohesively connected communities containing vq with the most shared keywords in S. It is nontrivial as we need to explore all possible subsets of S to verify the existence of structure cohesive communities until we find the communities with the most common keywords. To tackle this problem, we make efforts in two aspects. The first is to reduce the candidate keyword subsets. We achieve this by exploring the anti-monotonicity and neighborhood-constraint properties of our CAC model so that we can filter out the unpromising keyword subsets. The second is to speed up the verification process for each candidate keyword subset. We propose two indexes TIndex and MTIndex to reduce the size of the candidate subgraph before the verification. Moreover, we derive two new properties based on these indexes to reduce the candidate keyword subsets further. We conducted extensive experimental studies on four real-world graphs and validated the effectiveness and efficiency of our approaches.
Heterogeneous information networks (HINs) have been successfully applied in several fields to accomplish complex data analytics, such as bibliography, bioinformatics, and NLP. Meanwhile, network embedding has emerged as a convenient tool to mine and learn from networked data, so it is of interest to develop HIN embedding methods. Despite recent breakthroughs in HIN embedding, little research attention has been paid to exploiting the relation semantics in HINs and integrating them to improve embedding quality. Considering the sophisticated correlations in HINs, we propose a novel HIN embedding method, LRHNE, to yield latent-relation-enhanced embeddings for nodes. Our work makes three main contributions: i) using a real-world dataset, we verify that latent relations can indeed improve embedding quality, and we propose a novel graph inception network to extract latent relational features under the guidance of partial prior knowledge; ii) taking into account both existing structure information and inferred latent relation knowledge, we propose a cross-aligned variational graph autoencoder to extract and fuse both structural and latent relational features into the embeddings; and iii) we perform extensive experiments to validate LRHNE, and the results show that it significantly outperforms state-of-the-art methods. Multi-faceted inspection also shows that our method is robust and insensitive to hyper-parameters; it can therefore serve as a practical tool for tackling relation-sophisticated HINs.
Bootstrapping is an established tool for examining the behaviour of offline information retrieval (IR) experiments, where it has primarily been used to assess statistical significance and the robustness of significance tests. In this work we consider how bootstrapping can be used to assess the reliability of effectiveness measures for experimental IR. We use bootstrapping of the corpus of documents rather than, as in most prior work, the set of queries. We demonstrate that bootstrapping can provide new insights into the behaviour of effectiveness measures: the precision of the measurement of a system for a query can be quantified; some measures are more consistent than others; rankings of systems on a test corpus likewise have a precision (or uncertainty) that can be quantified; and, in experiments with limited volumes of relevance judgements, measures can be wildly different in terms of reliability and precision. Our results show that the uncertainty in measurement and ranking of system performance can be substantial and thus our approach to corpus bootstrapping provides a key tool for helping experimenters to choose measures and understand reported outcomes.
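The corpus-bootstrapping idea above can be sketched simply: resample the document collection with replacement, recompute an effectiveness measure on each resample, and use the spread of the resulting scores to quantify measurement precision. This is only a minimal illustration of the general technique; the helper name and the toy metric are assumptions, not the paper's experimental setup.

```python
import random
import statistics

def bootstrap_corpus_metric(docs, metric, n_boot=1000, seed=0):
    """Bootstrap over the *corpus*: draw len(docs) documents with
    replacement, recompute the metric, and return the mean and standard
    deviation of the score distribution (its width is the uncertainty)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(docs) for _ in docs]
        scores.append(metric(sample))
    return statistics.mean(scores), statistics.stdev(scores)
```

For example, with a corpus of half-relevant documents and "fraction relevant" as the metric, the bootstrap mean hovers near 0.5 while the standard deviation quantifies how much the score moves when the corpus itself is perturbed.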
The high cost of constructing test collections led many researchers to develop intelligent document selection methods to find relevant documents with fewer judgments than the standard pooling method requires. In this paper, we conduct a comprehensive set of experiments to evaluate six bandit-based document selection methods, in terms of evaluation reliability, fairness, and reusability of the resultant test collections. In our experiments, the best performing method varies across test collections, showing the importance of using diverse test collections for an accurate performance analysis. Our experiments with six test collections also show that Move-To-Front is the most robust method among the ones we investigate.
Concept-based IR is expected to improve the quality of medical ranking since it captures more semantics than BOW representations. However, bringing concepts and BOW together into a transparent IR framework is challenging. We propose a new aggregation parameter to combine conceptual and term-based Dirichlet Compound Model scores effectively. The determination of this linear parameter is the result of exploring to what degree the difference of the conceptual and term-based sum of IDFs is influential to the integration. Instead of employing heuristics to find combined models, this paper aims to build the grounds for establishing reasonable aggregation standards based on semantic query performance predictors.
Performance anomalies are a core problem in modern information systems that affect the execution of hosted applications. Detecting these anomalies often relies on analyzing application execution logs. The currently most effective approach is to detect samples that differ from a learnt nominal model. However, current methods often focus on detecting sequential anomalies in logs, neglecting the time elapsed between log entries, which is a core component of performance anomaly detection. In this paper, we develop a new model for performance anomaly detection that captures temporal deviations from the nominal model by means of a sliding-window data representation. The nominal model is trained with a Long Short-Term Memory neural network, which is well suited to representing complex sequential dependencies. We assess the effectiveness of our model on both simulated and real datasets, and show that it is more robust to temporal variations than current state-of-the-art approaches while remaining as effective.
Gaussian Process Models (GPMs) are widely regarded as a prominent tool for capturing the inherent characteristics of data. These Bayesian machine learning models support data analysis tasks such as regression and classification. A process of automatic GPM retrieval is usually needed to find an optimal model for a given dataset, despite prevailing default instantiations and, in some scenarios, existing prior knowledge, both of which shortcut the way to an optimal GPM. Since non-approximative Gaussian Processes can only process small datasets and offer low statistical versatility, we propose a new approach that efficiently and automatically retrieves GPMs for large-scale data. The resulting model is composed of independent statistical representations for non-overlapping segments of the given data. Our performance evaluation demonstrates the quality of the resulting models, which clearly outperform default GPM instantiations while maintaining reasonable model training time.
Most successful search queries do not result in a click if the user can satisfy their information needs directly on the SERP. Modeling query abandonment in the absence of click-through data is challenging because search engines must rely on other behavioral signals to understand the underlying search intent. We show that mouse cursor movements make a valuable, low-cost behavioral signal that can discriminate good and bad abandonment. We model mouse movements on SERPs using recurrent neural nets and explore several data representations that do not rely on expensive hand-crafted features and do not depend on a particular SERP structure. We also experiment with data resampling and augmentation techniques that we adopt for sequential data. Our results can help search providers to gauge user satisfaction for queries without clicks and ultimately contribute to a better understanding of search engine performance.
The goal of claim detection in argument mining is to sort out the key points of a long narrative. In this paper, we design a novel task for argument mining in the financial domain and provide an expert-annotated dataset, NumClaim, for the proposed task. Based on the dataset statistics, we discuss the differences between the claims in other datasets and the investors' claims in NumClaim. Through an ablation analysis, we show that encoding numerals and co-training with the auxiliary task of numeral understanding, i.e., the category classification task, can improve performance on the proposed task under different neural network architectures. The annotations in NumClaim are published for academic use under the CC BY-NC-SA 4.0 license.
Recent advances in Graph Convolutional Networks (GCNs) have led to state-of-the-art performance on various graph-related tasks. However, most existing GCN models do not explicitly identify whether all the aggregated neighbors are valuable to the learning tasks, which may harm the learning performance. In this paper, we consider the problem of node classification and propose the Label-Aware Graph Convolutional Network (LAGCN) framework which can directly identify valuable neighbors to enhance the performance of existing GCN models. Our contribution is three-fold. First, we propose a label-aware edge classifier that can filter distracting neighbors and add valuable neighbors for each node to refine the original graph into a label-aware (LA) graph. Existing GCN models can directly learn from the LA graph to improve the performance without changing their model architectures. Second, we introduce the concept of positive ratio to evaluate the density of valuable neighbors in the LA graph. Theoretical analysis reveals that using the edge classifier to increase the positive ratio can improve the learning performance of existing GCN models. Third, we conduct extensive node classification experiments on benchmark datasets. The results verify that LAGCN can improve the performance of existing GCN models considerably, in terms of node classification.
The technique of recursive neighborhood aggregation has dominated the implementation of existing successful Graph Neural Networks (GNNs). However, the recursive information propagation across layers inevitably brings extra computation, potentially large variance, and difficulty for parallel computation. In this paper, we propose Graph Unfolding Networks (GUNets) as an alternative to recursive neighborhood aggregation for graph representation learning. Compared to generic GNNs, our proposed GUNets are efficient, robust, and practically effective. At their core, GUNets unfold the local structure of every node, i.e., the rooted tree, into a set of trajectories, and then adopt a set function to capture the topology of the rooted subtree, which is more amenable to parallel computation than the recursive neighborhood aggregation process. More importantly, through a specific design of the set function, our architecture enables efficient and robust learning on large-scale graphs without resorting to any pruning of the rooted subtree, which is usually necessary in generic GNNs. Extensive experiments on five large datasets (with node counts ranging from 10^4 to 10^6) show that GUNets achieve comparable or even better results than current successful GNNs while being significantly more efficient and exhibiting lower accuracy variance. Code can be found at github.com/GUNets/GUNets.
Network alignment, the process of finding correspondences between nodes in different graphs, has many scientific and industrial applications. Existing unsupervised network alignment methods find suboptimal alignments that break up node neighborhoods, i.e. do not preserve matched neighborhood consistency. To improve this, we propose CONE-Align, which models intra-network proximity with node embeddings and uses them to match nodes across networks after aligning the embedding subspaces. Experiments on diverse, challenging datasets show that CONE-Align is robust and obtains 19.25% greater accuracy on average than the best-performing state-of-the-art graph alignment algorithm in highly noisy settings.
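CONE-Align's general recipe, embed each network, align the embedding subspaces, then match nodes across networks, can be illustrated with an orthogonal Procrustes alignment followed by nearest-neighbour matching. The sketch below shows that generic two-step technique, not CONE-Align's actual implementation, and assumes the embeddings are already given as matrices.

```python
import numpy as np

def procrustes_align(X, Y):
    """Align embedding matrix X to Y and match nodes across networks.

    Step 1: find the orthogonal map Q minimizing ||XQ - Y||_F, which has
    the closed-form solution Q = U V^T from the SVD of X^T Y.
    Step 2: match each row of the aligned X to its nearest row in Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    Q = U @ Vt
    X_aligned = X @ Q
    # pairwise Euclidean distances between aligned X rows and Y rows
    dists = np.linalg.norm(X_aligned[:, None, :] - Y[None, :, :], axis=2)
    return X_aligned, dists.argmin(axis=1)
```

When Y is an exact rotation of X, the recovered map undoes the rotation and every node is matched to itself.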
Anomaly detection is a useful technique in many applications such as network security and fraud detection. Due to the scarcity of anomaly samples as training data, it is usually formulated as an unsupervised learning problem. In recent years there has been a surge in adopting graph data structures in numerous applications. Detecting anomalies in an attributed network is more challenging than the sample-based task because the sample information is represented in the form of graph nodes and edges. In this paper, we propose a generative adversarial attributed network (GAAN) anomaly detection framework. Fake graph nodes are generated by a generator module with Gaussian noise as input. An encoder module maps both real and fake graph nodes into a latent space. To encode graph structure information into the node latent representations, we compute the sample covariance matrix for real nodes and fake nodes respectively. A discriminator is trained to recognize whether two connected nodes come from the real or the fake graph. With the learned encoder module output, an anomaly evaluation measure that considers the sample reconstruction error and the real-sample identification confidence is employed to make predictions. We conduct extensive experiments on benchmark datasets and compare with state-of-the-art attributed graph anomaly detection methods. The superior AUC scores demonstrate the effectiveness of the proposed method.
Fast propagation, ease-of-access, and low cost have made social media an increasingly popular means for news consumption. However, this has also led to an increase in the preponderance of fake news. Widespread propagation of fake news can be detrimental to society, and this has created enormous interest in fake news detection on social media. Many approaches to fake news detection use the news content, social context, or both. In this work, we look at fake news detection as a problem of estimating the credibility of both the news publishers and users that propagate news articles. We introduce a new approach called the credibility score-based model that can jointly infer fake news and credibility scores for publishers and users. We use a state-of-the-art statistical relational learning framework called probabilistic soft logic to perform this joint inference effectively. We show that our approach is accurate at both fake news detection and inferring credibility scores. Further, our model can easily integrate any auxiliary information that can aid in fake news detection. Using the FakeNewsNet dataset, we show that our approach significantly outperforms previous approaches at fake news detection by up to 10% in recall and 4% in accuracy. Furthermore, the credibility scores learned for both publishers and users are representative of their true behavior.
Temporal data are continuously collected in a wide range of domains, and the increasing availability of such data has led to significant developments in time series analysis. Time series classification, an essential task in time series analysis, aims to assign temporal sequences to different categories. Among the various approaches to time series classification, distance metric learning based ones, such as virtual sequence metric learning (VSML), have attracted increased attention due to their remarkable performance. In VSML, virtual sequences attract samples from different classes to facilitate time series classification. However, existing VSML methods simply employ fixed virtual sequences, which might not be optimal for the subsequent classification tasks. To address this issue, in this paper, we propose a novel time series classification method named Discriminative Virtual Sequence Learning (DVSL). Following the unified framework of sequence metric learning, our DVSL method jointly learns a set of discriminative virtual sequences that help separate time series samples in a feature space, and optimizes the temporal alignment by dynamic time warping. Extensive experiments on 15 UCR time series datasets demonstrate the effectiveness of DVSL compared with several representative baselines.
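The dynamic time warping alignment mentioned above is a standard dynamic-programming recurrence over a cost matrix. The following is a minimal textbook implementation for univariate sequences, not the paper's code.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    D[i][j] holds the cheapest cumulative cost of aligning a[:i] with
    b[:j]; each cell extends a match, an insertion, or a deletion.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # deletion
                                 D[i][j - 1],      # insertion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Unlike the Euclidean distance, DTW tolerates local stretching: repeating a value in one sequence does not add any cost as long as the warped alignment still matches equal values.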
Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven to be effective for arbitrary-shaped clusters. Despite their encouraging results, they struggle with low-density clusters, nearby clusters of similar densities, and high-dimensional data. We propose a new characterization of clusters and a new clustering algorithm based on spatial density and a probabilistic approach. First, sub-clusters are built using spatial density, represented as the probability density function (p.d.f.) of pairwise distances between points. A method is then proposed to agglomerate similar sub-clusters using both their density (p.d.f.) and their spatial distance. The key idea is to use the Wasserstein metric, a powerful tool for measuring the distance between the p.d.f.s of sub-clusters. We show that our approach outperforms other state-of-the-art density-based clustering methods on a wide variety of datasets.
News background linking is the problem of finding online resources that provide valuable context and background information to help the reader comprehend a given news article. While the problem has recently attracted several researchers, the notion of background relevance is not well studied, and it requires better understanding to ensure effective system performance. In this paper, we conduct a qualitative analysis on a sample of 25 query news articles and 152 of their corresponding background articles, in addition to 50 non-background ones. The goal of the study is to shed light on the relationship between query articles and background articles, and to provide informative insights for developing more effective background retrieval models. For instance, our analysis shows that event-driven query articles are, on average, harder to process than others and hence should be handled differently. It also shows that discussing subtopics in detail and adding new informative topics are both essential factors for highly relevant background articles. Moreover, it shows that high lexical similarity between a query article and a background one is neither sufficient nor necessary.
The performance of query processing has always been a concern in the field of information retrieval. Dynamic pruning algorithms have been proposed to improve query processing performance in terms of efficiency and effectiveness. However, a single pruning algorithm generally does not have both advantages. In this work, we investigate the performance of the main dynamic pruning algorithms in terms of average and tail latency as well as the accuracy of query results, and find that they are complementary. Inspired by these findings, we propose two types of hybrid dynamic pruning algorithms that choose different combinations of strategies according to the characteristics of each query. Experimental results demonstrate that our proposed methods yield a good balance between both efficiency and effectiveness.
Sample optimization, which involves sample augmentation and sample refinement, is an essential but often neglected component of modern display advertising platforms. Due to the massive number of ad candidates, industrial ad services usually adopt a multi-layer funnel-shaped structure with at least two stages: candidate generation and re-ranking. In the candidate generation step, an offline neural network matching model is often trained on past click/conversion data to obtain user feature vectors and ad feature vectors. However, there is a covariate shift between the ads a user observes and all possible ones. As a result, the candidate generation model trained from the click/conversion history cannot fully capture users' potential intentions or generalize well to unseen ads. In this paper, we utilize several sample optimization strategies to alleviate the covariate shift problem when training candidate generation models. We have launched these strategies on the Baidu display ad platform and achieved considerable improvements in offline metrics, including offline click-recall and cost-recall, as well as in the online metric cost per mille (CPM).
It is crucial to recommend helpful product reviews to consumers in e-commerce services, as helpful reviews can promote consumption. Existing methods for identifying helpful reviews are based on the supervised learning paradigm. The capacity of supervised methods, however, is limited by the lack of annotated reviews. In addition, there is a serious distributional bias between labeled and unlabeled reviews. Therefore, this paper proposes a reinforced semi-supervised neural learning method (abbreviated RSSNL) for helpful review identification, which can automatically select highly related unlabeled reviews to aid training. Concretely, RSSNL consists of a reinforced unlabeled-review selection policy and a semi-supervised pseudo-labeling review classifier. These two parts are trained jointly and integrated based on the policy gradient framework. Extensive experiments on Amazon product reviews verify the effectiveness of RSSNL in using unlabeled reviews.
Network embedding has demonstrated effective empirical performance for various network mining tasks such as node classification, link prediction, clustering, and anomaly detection. However, most of these algorithms focus on the single-view network scenario. From a real-world perspective, one individual node can have different connectivity patterns in different networks. For example, one user can have different relationships on Twitter, Facebook, and LinkedIn due to varying user behaviors on different platforms. In this case, jointly considering the structural information from multiple platforms (i.e., multiple views) can potentially lead to more comprehensive node representations and eliminate the noise and bias of a single view. In this paper, we propose a view-adversarial framework, named VANE, to generate comprehensive and robust multi-view network representations based on two adversarial games. The first adversarial game enhances the comprehensiveness of the node representation by discriminating the view information obtained from the subgraph induced by the node's neighbors. The second adversarial game improves the robustness of the node representation by challenging it with fake node representations from a generative adversarial net. We conduct extensive experiments on downstream tasks with real-world multi-view networks, which show that our proposed VANE framework significantly outperforms other baseline methods.
Adversarial machine learning has exposed several security hazards of neural models. Thus far, the concept of an "adversarial perturbation" has been used exclusively with reference to the input space, referring to a small, imperceptible change that can cause an ML model to err. In this work we extend the idea of adversarial perturbations to the space of model weights, specifically to inject backdoors into trained DNNs, which exposes a security risk of publicly available trained models. Here, injecting a backdoor means obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original predictions on non-triggered inputs. From the perspective of an adversary, we constrain these adversarial perturbations to lie within an ℓ∞ norm ball around the original model weights. We introduce adversarial perturbations in model weights using a composite loss on the predictions of the original model and the desired trigger, optimized via projected gradient descent. Our results show that backdoors can be successfully injected with a very small average relative change in model weight values for several CV and NLP applications.
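The ℓ∞-constrained projected gradient descent described above alternates a gradient step on the composite loss with a projection back into the ℓ∞ ball around the original weights. The following is a minimal sketch of that projection logic on flat weight lists; the function names and the toy gradient input are illustrative assumptions, not the paper's training code.

```python
def project_linf(w, w0, eps):
    """Clamp each perturbed weight into [w0_i - eps, w0_i + eps], i.e.
    project w onto the l-infinity ball of radius eps around w0."""
    return [min(max(wi, oi - eps), oi + eps) for wi, oi in zip(w, w0)]

def pgd_step(w, w0, grad, lr, eps):
    """One PGD iteration: gradient descent on the composite loss,
    followed by projection back into the constraint set."""
    stepped = [wi - lr * g for wi, g in zip(w, grad)]
    return project_linf(stepped, w0, eps)
```

After any number of such steps, every weight is guaranteed to stay within eps of its original value, which is what keeps the backdoored model's weights close to the published ones.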
Information retrieval evaluation has to consider the varying "difficulty" of topics. Topic difficulty is often defined in terms of the aggregated effectiveness of a set of retrieval systems at satisfying a respective information need. Current approaches to estimating topic difficulty come with drawbacks, such as being incomparable across different experimental settings. We introduce a new approach to estimating topic difficulty, based on the ratio of systems that achieve an NDCG score better than a baseline formed by randomly ranking the pool of judged documents. We modify the NDCG measure to explicitly reflect a system's divergence from this hypothetical random ranker. In this way we achieve relative comparability of topic difficulty scores across experimental settings as well as stability against outlier systems, two features lacking in previous difficulty estimations. We reevaluate the TREC 2012 Web Track's ad hoc task to demonstrate the feasibility of our approach in practice.
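A minimal sketch of this ratio-based difficulty notion, with a Monte Carlo stand-in for the random-ranker baseline and plain exponential-gain DCG (the paper's exact normalization and measure modification differ):

```python
import numpy as np

def dcg(labels):
    """Exponential-gain DCG of a ranking given its relevance labels in rank order."""
    labels = np.asarray(labels, dtype=float)
    ranks = np.arange(1, len(labels) + 1)
    return np.sum((2.0 ** labels - 1.0) / np.log2(ranks + 1))

def expected_random_dcg(pool_labels, trials=2000, seed=0):
    """Baseline: expected DCG of a random ranking of the judged pool (Monte Carlo)."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(pool_labels, dtype=float)
    return np.mean([dcg(rng.permutation(pool)) for _ in range(trials)])

def topic_difficulty(system_rankings, pool_labels):
    """One illustrative difficulty score: 1 minus the fraction of systems
    whose DCG beats the random-ranker baseline (higher = harder topic)."""
    baseline = expected_random_dcg(pool_labels)
    better = sum(dcg(r) > baseline for r in system_rankings)
    return 1.0 - better / len(system_rankings)
```

For a pool with labels [3, 2, 1, 0, 0], a system ranking in descending label order beats the random baseline while one ranking in ascending order does not, giving a difficulty of 0.5 for those two systems.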
NDCG is one of the most commonly used measures to quantify system performance in retrieval experiments. Though not originally envisioned, graded relevance judgments nowadays frequently include negative labels. Negative relevance labels cause NDCG to be unbounded. This is probably why widely used implementations of NDCG map negative relevance labels to zero, thus ensuring that the resulting scores lie in the [0,1] range. But zeroing negative labels discards valuable relevance information, e.g., by treating spam documents the same as unjudged ones, which are assigned a relevance label of zero by default. We show that, instead of zeroing negative labels, a min-max normalization of NDCG retains its statistical power while improving its reliability and stability.
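The min-max idea can be sketched as rescaling a ranking's DCG between its worst possible ordering (ascending labels) and its ideal ordering (descending labels), so that negative labels subtract rather than being discarded. This uses linear gains for simplicity; the paper's exact formulation may differ.

```python
import numpy as np

def dcg(labels):
    """Linear-gain DCG in rank order; negative labels reduce the score."""
    labels = np.asarray(labels, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(labels) + 2))
    return np.sum(labels * discounts)

def minmax_ndcg(ranked_labels):
    """Min-max-normalized NDCG: 0 for the worst ordering of the judged
    labels, 1 for the ideal ordering, in between otherwise."""
    labels = np.asarray(ranked_labels, dtype=float)
    best = dcg(np.sort(labels)[::-1])   # ideal: descending labels
    worst = dcg(np.sort(labels))        # worst: ascending labels
    if best == worst:                   # all labels identical
        return 1.0
    return (dcg(labels) - worst) / (best - worst)
```

Note how a spam document (label -2) ranked first drives the score toward 0, whereas zeroing its label would treat it like an unjudged document.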
In this paper, we study the problem of employing pre-trained language models for multi-turn response selection in retrieval-based chatbots. A new model, named Speaker-Aware BERT (SA-BERT), is proposed to make the model aware of speaker-change information, which is an important and intrinsic property of multi-turn dialogues. Furthermore, a speaker-aware disentanglement strategy is proposed to tackle entangled dialogues. This strategy selects a small number of the most important utterances as the filtered context according to the speaker information they contain. Finally, domain adaptation is performed to incorporate in-domain knowledge into the pre-trained language models. Experiments on five public datasets show that our proposed model outperforms existing models on all metrics by large margins, achieving new state-of-the-art performance for multi-turn response selection.
A well-known problem in data science and machine learning is linear regression, which has recently been extended to dynamic graphs. Existing exact algorithms for updating solutions of dynamic graph regression require at least linear time in n, the number of nodes of the graph. However, this time complexity might be intractable in practice. In this paper, we utilize the subsampled randomized Hadamard transform to propose a randomized algorithm for dynamic graphs. Suppose that we are given an n×m matrix embedding M of the graph, where m ≪ n. Let r be the number of samples required for a guaranteed approximation error, which is a sublinear function of n. After an edge insertion or an edge deletion in the graph, our algorithm updates the approximate solution in O(rm) time.
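A toy sketch of the SRHT ingredient (random sign flips, a fast Walsh–Hadamard transform, and uniform row sampling), applied here to a static least-squares instance; the incremental O(rm) update machinery of the paper is not shown, and n is assumed to be a power of two (one would pad in practice).

```python
import numpy as np

def fwht(a):
    """Iterative fast Walsh-Hadamard transform (unnormalized), out of place."""
    a = a.copy()
    n = a.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def srht_sketch(M, b, r, seed=0):
    """Subsampled randomized Hadamard transform of the system (M, b):
    sign-flip the rows, apply the (orthonormal) Hadamard transform,
    and keep r uniformly sampled, rescaled rows. Illustrative only."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)
    A = fwht(signs[:, None] * M) / np.sqrt(n)
    y = fwht(signs * b) / np.sqrt(n)
    rows = rng.choice(n, size=r, replace=False)
    scale = np.sqrt(n / r)
    return scale * A[rows], scale * y[rows]

# usage: solve least squares on the r x m sketched system instead of n x m
rng = np.random.default_rng(1)
M = rng.standard_normal((256, 4))
x_true = np.array([1.0, -2.0, 0.5, 3.0])
b = M @ x_true                       # noiseless, so the sketch is exact
Ms, bs = srht_sketch(M, b, r=64)
x_hat, *_ = np.linalg.lstsq(Ms, bs, rcond=None)
```

Because b lies exactly in the column space of M here, the sketched solve recovers x_true; with noise, r controls the approximation guarantee.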
We focus on the composition of teams of experts that collectively cover a set of required skills based on their historical collaboration network and expertise. Prior works are primarily based on the shortest path between experts on the expert collaboration network, and suffer from three major shortcomings: (1) they are computationally expensive due to the complexity of finding paths on large network structures; (2) they use a small portion of the entire historical collaboration network to reduce the search space; hence, may form sub-optimal teams; and, (3) they fall short in sparse networks where the majority of the experts have only participated in a few teams in the past. Instead of forming a large network of experts, we propose to learn relationships among experts and skills through a variational Bayes neural architecture wherein: i) we consider all past team compositions as training instances to predict future teams; ii) we bring scalability for large networks of experts due to the neural architecture; and, iii) we address sparsity by incorporating uncertainty on the neural network's parameters which yields a richer representation and more accurate team composition. We empirically demonstrate how our proposed model outperforms the state-of-the-art approaches in terms of effectiveness and efficiency based on a large DBLP dataset.
Knowledge graph embedding (KGE) encodes the components of a KG, including entities and relations, into a continuous low-dimensional vector space. Most existing methods treat the entities and relations in a triple independently and thus fail to capture the complex and hidden information inherently implicit in the local neighborhood surrounding the triple. In this paper, we present a new approach for knowledge graph completion called GAEAT (Graph Auto-encoder Attention Network Embedding), which can encapsulate both entity and relation features. Specifically, we construct a triple-level auto-encoder by extending graph attention mechanisms to obtain latent representations of entities and relations simultaneously. To justify our proposed model, we evaluate GAEAT on two real-world datasets. The experimental results demonstrate that GAEAT outperforms state-of-the-art KGE models on the knowledge graph completion task, which validates its effectiveness. The source code of this paper can be obtained from https://github.com/TomersHan/GAEAT.
The use of stopwords has been thoroughly studied in traditional Information Retrieval systems, but remains unexplored in the context of neural models. Neural re-ranking models take the full text of both the query and the document into account. Naturally, removing tokens that do not carry relevance information provides an opportunity to improve effectiveness by reducing noise, and to lower the storage requirements for cached document representations. In this work we propose a novel contextualized stopword detection mechanism for neural re-ranking models. The mechanism trains a sparse vector to filter out document tokens from the ranking decision. This vector is learned end-to-end from the contextualized document representations, allowing the model to filter terms on a per-occurrence basis. This leads to a more explainable model, as it reduces noise. We integrate our component into the state-of-the-art interaction-based TK neural re-ranking model. Our experiments on the MS MARCO passage collection and queries from the TREC 2019 Deep Learning Track show that filtering out traditional stopwords prior to the neural model reduces its effectiveness, while learning to filter out contextualized representations improves it.
Due to high temporal uncertainty and low signal-to-noise ratios, transfer learning for univariate time series forecasting remains a challenging task. In addition, data scarcity, which is commonly encountered in business forecasting, further limits the application of conventional transfer learning protocols. In this work, we develop DATSING, a transfer learning-based framework that effectively leverages cross-domain time series latent representations to augment target-domain forecasting. In particular, we aim to transfer domain-invariant feature representations from a pre-trained stacked deep residual network to the target domains, so as to assist the prediction of each target time series. To effectively avoid noisy feature representations, we propose a two-phased framework which first clusters similar mixed-domain time series data and then performs fine-tuning with domain-adversarial regularization to achieve better out-of-sample generalization. Extensive experiments with real-world datasets demonstrate that our method significantly improves the forecasting performance of the pre-trained model. DATSING has the unique potential to empower forecasting practitioners to unleash the power of cross-domain time series data.
Heterogeneous information networks (HINs), and especially their embedding, have drawn much attention recently, as their rich latent information brings great benefits to complex classification and clustering tasks. Many prior embedding works focus on designing a specific approach for the HIN, while others implicitly homogenize the HIN at the cost of losing some semantic information. In this paper, a novel explicit homogenization method is proposed to preserve more semantic information: the latent information of intermediate nodes within each meta-path instance, and across multiple meta-path instances, is incorporated into the conventional adjacency matrix (or weight matrix). Then, the transfer of the weight matrix and the fusion of node-level embeddings are considered to obtain a graph-level embedding that solves the HIN problem. In this way, much more of the meta-paths' latent information is preserved, and the proposed method exhibits superiority over state-of-the-art works in classification and clustering tasks.
A popular choice for extractive summarization is to conceptualize it as sentence-level classification, supervised by binary labels. However, the common ROUGE metric measures text similarity rather than classifier performance. For example, BERTSUMEXT, the best extractive classifier so far, achieves a precision of only 32.9% at the top 3 extracted sentences (P@3) on the CNN/DM dataset. Clearly, current approaches cannot exactly model the complex relationships among sentences with 0/1 targets. In this paper, we introduce DistilSum, which consists of a teacher mechanism and a student model. The teacher mechanism produces high-entropy soft targets at a high temperature. Our student model is trained at the same temperature to match these informative soft targets, and tested at a temperature of 1 to distill the ground-truth labels. Compared with the large version of BERTSUMEXT, our experimental results on CNN/DM show a substantial improvement of 0.99 in ROUGE-L (text similarity) and 3.95 in P@3 (classifier performance). Our source code will be available on GitHub.
Document-level relation extraction (RE) has recently received a lot of attention. However, existing models for document-level RE have similar structures to the models for sentence-level RE. Thus, they still do not consider some unique characteristics of the new problem setting. For example, in Wikipedia, there is a title for each page and it usually represents the topic entity that is mainly described on the page. In many cases, the topic entity is omitted in the text. Thus, existing RE models often fail to find the relations with the omitted topic entity. To tackle the problem, we propose a Topic-aware Relation EXtraction (T-REX) model. To extract the relations with the (possibly omitted) topic entity, the proposed model first encodes the topic entity by aggregating the information of all its mentions in the document. Then it finds the relations between the topic entity and each mention of other entities. Finally, the output layer combines the mention-wise results and outputs all relations expressed in the document. Our performance study with a large-scale dataset confirms the effectiveness of the T-REX model.
In this paper, we present CR-Graph (community reinforcement on graphs), a novel method that helps existing algorithms perform more accurate community detection (CD). Toward this end, CR-Graph strengthens the community structure of a given original graph by adding predicted (non-existent) intra-community edges and deleting predicted (existing) inter-community edges. To design CR-Graph, we propose two strategies: (1) predicting intra-community and inter-community edges (i.e., the type of edges) and (2) determining the number of edges to be added or deleted. To show the effectiveness of CR-Graph, we conduct extensive experiments with various CD algorithms on 7 synthetic and 4 real-world graphs. The results demonstrate that CR-Graph improves the accuracy of all underlying CD algorithms universally and consistently.
This paper presents findings from an empirical study of multileaved comparisons, an efficient online evaluation methodology, in a commercial Web service. The most important difference from previous studies is the number of rankers involved in the online evaluation: we compared 30 rankers for around 90 days by multileaved comparisons. This relatively large number of rankers let us answer several questions that could not be addressed in previous work: How much ranking difference is required for rankers to be statistically distinguished? How many impressions are necessary for finding statistically significant differences between correlated rankers? How large a difference in offline evaluation predicts a significant difference in a multileaved comparison? We answer these questions with the results of the multileaved comparisons, and generalize some of the findings through simulation-based experiments.
We detail the novel architecture of a Synopses Data Engine (SDE) which combines the virtues of parallel processing and stream summarization towards interactive analytics at scale. Our SDE, built on top of Apache Flink, has a unique design that supports a very wide variety of synopses, allows for dynamically adding new functionality to it at runtime, and introduces a synopsis-as-a-service paradigm to enable various types of scalability.
Link prediction is the task of predicting missing connections between entities in a knowledge graph (KG). While various forms of models have been proposed for the link prediction task, most of them are designed around a few known relation patterns in several well-known datasets. Due to the diverse and complex nature of real-world KGs, it is inherently difficult to design a model that fits all datasets well. To address this issue, previous work has tried to use Automated Machine Learning (AutoML) to search for the best model for a given dataset. However, their search space is limited to bilinear model families. In this paper, we propose a novel Neural Architecture Search (NAS) framework for the link prediction task. First, the embeddings of the input triplet are refined by the Representation Search Module. Then, the prediction score is searched within the Score Function Search Module. This framework entails a more general search space, which enables us to take advantage of several mainstream model families, and thus it can potentially achieve better performance. We relax the search space to be continuous so that the architecture can be optimized efficiently using gradient-based search strategies. Experimental results on several benchmark datasets demonstrate the effectiveness of our method compared with several state-of-the-art approaches.
Given a natural language query, teaching machines to ask clarifying questions is of immense utility in practical natural language processing systems. Such interactions could help in filling information gaps for better machine comprehension of the query. For the task of ranking clarification questions, we hypothesize that determining whether a clarification question pertains to a missing entry in a given post (on QA forums such as StackExchange) can be considered a special case of Natural Language Inference (NLI), where both the post and the most relevant clarification question point to a shared latent piece of information or context. We validate this hypothesis by incorporating representations from a Siamese BERT model fine-tuned on the NLI and Multi-NLI datasets into our models, and demonstrate that our best-performing model obtains relative performance improvements of 40 percent and 60 percent, respectively (on the key metric of Precision@1), over the state-of-the-art baseline(s) on the two evaluation sets of the StackExchange dataset, thereby significantly surpassing the state of the art.
Automatic machine learning (AutoML) aims to automate the different aspects of the data science process and, by extension, allow non-experts to utilize "off the shelf" machine learning solutions. One of the more popular AutoML methods is the Tree-based Pipeline Optimization Tool (TPOT), which uses genetic programming (GP) to efficiently explore the vast space of ML pipelines and produce a working ML solution. However, TPOT's GP process comes with substantial time and computational costs. In this study, we explore TPOT's GP process and propose MetaTPOT, an enhanced variant that uses a meta learning-based approach to predict the performance of TPOT's pipeline candidates. MetaTPOT leverages domain knowledge in the form of pipeline pre-ranking to improve TPOT's speed and performance. Evaluation on 65 classification datasets shows that our approach often improves the outcome of the genetic process while substantially reducing its running time and computational cost.
Maximum Sustainable Throughput (MST) refers to the amount of data that a Data Stream Processing (DSP) system can ingest while maintaining stable performance. It has been acknowledged as an accurate metric for evaluating the performance of stream data processing. Yet, existing operator placement approaches continue to focus on latency and throughput, not MST, as the main performance objectives when deploying stream data applications in the Edge. In this paper, we argue that MST should be used as an optimization objective when placing operators. This is especially important in the Edge, where network bandwidth and data streams are highly dynamic. We demonstrate this through the design and evaluation of an MST-driven operator placement (based on constraint programming) for stream data applications. Through simulations, we show how existing placement strategies that target overall communication reduction often fail to keep up with the rate of data streams. Importantly, the constraint programming-based operator placement is able to sustain up to 5x higher data ingestion compared to baseline strategies.
We study the problem of index selection to maximize workload performance, which is critical to database systems. In contrast to existing methods, we seamlessly integrate index recommendation rules and deep reinforcement learning, such that we can recommend single-attribute and multi-attribute indexes together for complex queries while supporting multiple-index access to a table. Specifically, we first propose five heuristic rules to generate the index candidates. Then, we formulate the index selection problem as a reinforcement learning task and employ a Deep Q-Network (DQN) to solve it. Using the heuristic rules significantly reduces the dimensionality of the action and state spaces in reinforcement learning. With the neural network used in the DQN, we can model the interactions between indexes better than previous methods. We conduct experiments on various workloads to show the superiority of our approach.
Recently, unbiased learning-to-rank models have been widely studied to learn a better ranker by eliminating the biases in click data. Toward this goal, existing work has mainly focused on estimating propensity weights tailored to a specific bias type in click data. From a different perspective, we propose a simple yet effective ranking model, named wLambdaMART, which estimates the confidence of click data with a small amount of labeled data, instead of learning propensity weights to reduce the bias in click data. We first train a confidence estimator to bridge the gap between biased click data and unbiased relevance. Then, we infer confidence weights for all click data and apply them to LambdaMART to learn a debiased ranker. Practically, since learning the confidence estimator requires only a small amount of labeled data, it does not incur high labeling costs. Our experimental results show that wLambdaMART outperforms state-of-the-art click models and unbiased learning-to-rank models on real-world click datasets collected from a commercial search engine.
In this paper, we start by pointing out limitations in the validation of existing signed network embedding (NE) methods. To address these limitations, we design two research questions: (1) are signed NE methods consistently more effective across various types of tasks than unsigned NE methods? (2) in signed NE methods, does the utilization of negative links provide higher accuracy across various tasks? To answer these questions, we present an evaluation framework consisting of three components: (1) five signed network datasets; (2) six signed and two unsigned NE methods; (3) five types of tasks. Through extensive experiments on our evaluation framework, we demonstrate that the additional utilization of negative links helps only in some tasks related to negative links, but not in tasks related to positive links.
A major challenge in genetic functionality prediction is that genetic datasets comprise few samples but massive, unclearly structured features, i.e., the 'large p, small N' problem. To tackle this problem, we propose the Non-local Self-attentive Autoencoder (NSAE), which applies attention-driven genetic variant modelling. The backbone attention layer captures long-range dependency relationships among cells (i.e., features) and thus allocates weights to construct attention maps based on cell significance. Utilizing attention maps, NSAE can effectively capture and leverage significant features in a non-local way from numerous cells. Our proposed NSAE outperforms state-of-the-art algorithms on two genomics datasets from Roadmap projects. The visualization of the attention layer also validates NSAE's ability to highlight important features.
Balanced rule-constrained resource allocation aims to evenly distribute tasks to different processors under allocation-rule constraints. Conventional heuristic approaches fail to achieve optimal solutions, while simple brute-force methods suffer from high computational complexity. To address these limitations, we propose recursive balanced k-subset sum partition (RBkSP), which employs an iterative 'cut-one-out' policy: in each round, only one subset, whose task weights sum to 1/k of the total weight of all tasks, is taken out of the set. In a single partition, we first create a dynamic programming table whose elements are computed recursively, then use a 'zig-zag search' to explore the table, find the elements of the optimal subset partition, and assign the different partitions to their proper places. Next, to resolve conflicts during allocation, we use a simple but effective heuristic to adjust allocations of tasks that contradict the allocation rules. Testing results show that RBkSP achieves more balanced results with lower computational complexity than classical benchmarks.
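Under the simplifying assumptions that task weights are integers and an exact 1/k split exists, the 'cut-one-out' rounds can be sketched with a plain subset-sum dynamic program; the paper's zig-zag table search and rule-conflict handling are omitted here.

```python
def balanced_k_partition(weights, k):
    """Recursive 'cut-one-out' sketch: in each round a subset-sum DP finds
    one subset whose weight is exactly 1/k' of the remaining total (k' is
    the number of parts still to cut), removes it, and recurses.
    Assumes an exact partition exists; returns k lists of task indices."""
    items = list(enumerate(weights))          # (original index, weight)
    assert sum(weights) % k == 0, "weights must split evenly for this sketch"
    parts = []
    for part_no in range(k, 1, -1):
        target = sum(w for _, w in items) // part_no
        # dp[s] = one witness (list of positions in `items`) reaching sum s
        dp = {0: []}
        for pos, (_, w) in enumerate(items):
            for s in sorted(dp.keys(), reverse=True):   # snapshot of sums
                if s + w <= target and s + w not in dp:
                    dp[s + w] = dp[s] + [pos]
        chosen = set(dp[target])              # cut this subset out
        parts.append([items[p][0] for p in chosen])
        items = [it for p, it in enumerate(items) if p not in chosen]
    parts.append([idx for idx, _ in items])   # whatever remains is the last part
    return parts
```

For weights [4, 3, 3, 2, 2, 1, 3] and k = 3, each returned part sums to 6, and the three parts together cover every task exactly once.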
Financial credit risk assessment serves as the impetus to evaluate the credit admission or potential business failure of customers, in order to take early action prior to an actual financial crisis. It aims to predict the probability that a customer belongs to a high-risk group, which is usually formulated as a binary classification problem. However, due to the lack of high-risk samples, prevailing models suffer from a severe class-imbalance problem. Oversampling the high-risk users could alleviate this problem, but it also amplifies the effect of noisy examples. In this paper, we propose a novel adversarial data augmentation method to solve the class-imbalance problem in financial credit risk assessment. We train a generator for synthetic sample generation with a discriminator to identify real or fake instances. Besides, an auxiliary risk discriminator is trained cooperatively with the generator to assess the credit risk. Experimental results on three real-world datasets demonstrate the effectiveness of the proposed method.
Learning a fair prediction model is an important research problem with profound societal impact. Most approaches assume free access to sensitive demographic data, whereas the use of such data is increasingly restricted by privacy regulations. Existing solutions are broadly based on multi-party computation or demographic proxies, but each direction has its own limits in certain scenarios.
In this paper, we propose a new direction called active demographic query. We assume sensitive demographic data can be queried at a cost, e.g., a company may pay to get a customer's consent to use his private data. Building on Dwork's decoupled fair model, we propose two active query strategies: QmCo queries the most controversial data, maximally disagreed on by the decoupled models, and QmRe queries the most resistant data, maximally deteriorating the fairness of the current model. In experiments, we show that both strategies efficiently improve model fairness on three datasets.
Knowledge Graph Augmentation is the task of adding missing facts to an incomplete knowledge graph to improve its effectiveness in applications such as web search and question answering. State-of-the-art methods rely on information extraction from running text, leaving rich sources of facts such as tables behind. We help close this gap with a neural method that uses the contextual information surrounding a table in a Wikipedia article to extract relations between entities appearing in the same row of the table, or between the entity of said article and entities appearing in the table. We trained and tested our method on a much larger dataset than previous work, which we have made public, and observed experimentally that our method is very promising for the task.
Machine learning models are being used extensively to make decisions that have a significant impact on human life. These models are trained over historical data that may contain information about sensitive attributes such as race, sex, and religion. The presence of such sensitive attributes can impact certain population subgroups unfairly. It is straightforward to remove sensitive features from the data; however, a model could still pick up prejudice from latent sensitive attributes that may exist in the training data. This has led to growing apprehension about the fairness of the employed models. In this paper, we propose a novel algorithm that can effectively identify and treat latent discriminating features. The approach is agnostic of the learning algorithm and generalizes well to classification as well as regression tasks. It can also be used as a key aid in proving that the model is free of discrimination for regulatory compliance, if the need arises. The approach helps to collect discrimination-free features that improve model performance while ensuring the fairness of the model. The experimental results from our evaluations on publicly available real-world datasets show a near-ideal fairness measurement in comparison to other methods.
In the top-k threshold estimation problem, given a query q, the goal is to estimate the score of the result at rank k. A good estimate of this score can result in significant performance improvements for several query processing scenarios, including selective search, index tiering, and widely used disjunctive query processing algorithms such as MaxScore, WAND, and BMW. Several approaches have been proposed, including parametric approaches, methods using random sampling, and a recent approach based on machine learning. However, previous work fails to perform any experimental comparison between these approaches. In this paper, we address this issue by reimplementing four major approaches and comparing them in terms of estimation error, running time, likelihood of an overestimate, and end-to-end performance when applied to common classes of disjunctive top-k query processing algorithms.
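One of the estimator families compared, random sampling, can be sketched as a quantile estimate: score only a p-fraction sample of the documents and read off the score at rank round(k·p), which estimates the rank-k score over the full collection. The function name and interface below are hypothetical simplifications, not a reimplementation from the paper.

```python
import numpy as np

def sampled_topk_threshold(scores, k, p, seed=0):
    """Sampling-based top-k threshold estimate.

    scores: retrieval scores of all candidate documents for a query
    k:      target rank whose score we want to estimate
    p:      sampling fraction (0 < p <= 1)
    Returns the score at rank round(k*p) within a p-fraction sample,
    an estimate of the true rank-k score over all documents."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    sample = rng.choice(scores, size=max(1, int(n * p)), replace=False)
    rank = max(1, round(k * p))
    return float(np.sort(sample)[::-1][rank - 1])
```

Such an estimate is cheap (it scores only n·p documents) but randomized, which is why over/underestimation rates matter: an overestimate can make a safe-up-to algorithm like WAND or BMW skip documents that actually belong in the top k.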
Recommendation algorithms are known to suffer from popularity bias: a few popular items are recommended frequently while the majority of other items are ignored. These recommendations are then consumed by the users, whose reactions are logged and added back to the system: what is generally known as a feedback loop. In this paper, we propose a method for simulating users' interaction with recommenders in an offline setting and study the impact of the feedback loop on the popularity-bias amplification of several recommendation algorithms. We then show how this bias amplification leads to several other problems, such as declining aggregate diversity, shifting representation of users' tastes over time, and homogenization of the users. In particular, we show that the impact of the feedback loop is generally stronger for users who belong to the minority group.
By "checking into" various points-of-interest (POIs), users create a rich source of location-based social network data that can be used in expressive spatio-social queries. This paper studies the use of popularity as a means to diversify the results of top-k nearby POI queries. In contrast to previous work, we evaluate social diversity as a group-based metric rather than one over individual POIs. Algorithmically, evaluating this set-based notion of diversity is challenging, yet we present several effective algorithms based on (integer) linear programming, a greedy framework, and R-tree distance browsing. Experiments show scalability and interactive response times for up to 100 million unique check-ins across 25,000 POIs.
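A minimal sketch of the greedy framework for set-based diversity: repeatedly add the POI with the best marginal trade-off between closeness and the check-in users it newly covers. The scoring function, the alpha weight, and the data layout are hypothetical stand-ins for illustration, not the paper's formulation.

```python
def greedy_diverse_topk(pois, k, alpha=0.5):
    """Greedy selection of k POIs balancing proximity and group diversity.

    pois: dict mapping POI id -> (distance, set of check-in user ids)
    alpha: weight on the set-based diversity gain vs. distance penalty
    Returns the selected POI ids in pick order."""
    selected, covered = [], set()
    while len(selected) < k and len(selected) < len(pois):
        def gain(pid):
            dist, users = pois[pid]
            new_users = len(users - covered)        # set-based: only NEW users count
            return alpha * new_users - (1 - alpha) * dist
        best = max((p for p in pois if p not in selected), key=gain)
        selected.append(best)
        covered |= pois[best][1]                    # update the covered user set
    return selected

# usage: 'b' duplicates 'a's crowd, so a diversity-heavy alpha prefers the
# farther but novel 'c' as the second pick
pois = {'a': (1.0, {1, 2}), 'b': (1.0, {1, 2}), 'c': (6.0, {3, 4, 5})}
picks = greedy_diverse_topk(pois, 2, alpha=0.8)
```

The key set-based aspect is that a candidate's diversity contribution is evaluated against the users already covered by the whole selected group, not per POI in isolation.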
People Also Ask (PAA) is an exciting feature in most of the leading search engines that recommends related questions for a given user query, thereby attempting to close the gap to the user's full information need. It helps users dive deeper into the topic of interest and reduces task completion time. However, showing unrelated or irrelevant questions is highly detrimental to the user experience. While there has been significant work on query reformulation and related searches, there is hardly any published work on recommending related questions for a query. Question suggestion is challenging because the question needs to be interesting, structurally correct, not a duplicate of other visible information, and reasonably related to the original query. In this paper, we present our system, which is based on a Transformer-based neural representation, BERT (Bidirectional Encoder Representations from Transformers), of the query, question, and corresponding search result snippets. Our best model achieves an accuracy of ~81%.
Pretrained Transformer models have emerged as state-of-the-art approaches that learn contextual information from text to improve the performance of several NLP tasks. These models, albeit powerful, still require specialized knowledge in specific scenarios. In this paper, we argue that context derived from a knowledge graph (in our case: Wikidata) provides enough signal to inform pretrained transformer models and improve their performance for named entity disambiguation (NED) on the Wikidata KG. We further hypothesize that our proposed KG context can be standardized for Wikipedia, and we evaluate the impact of KG context on the state-of-the-art NED model for the Wikipedia knowledge base. Our empirical results validate that the proposed KG context can be generalized (for Wikipedia), and that providing KG context in transformer architectures considerably outperforms the existing baselines, including the vanilla transformer models.
Deep metric learning has shown significant value in a wide range of domains, such as image retrieval, face recognition, and zero-shot learning. When evaluating methods for deep metric learning, top-k precision is commonly used as a key metric, since few users bother to scroll down to lower-ranked items. Despite being widely studied, how to directly optimize top-k precision is still an open problem. In this paper, we propose a new method for optimizing top-k precision in a rank-sensitive manner. Given the cutoff value k, our key idea is to impose different weights to further differentiate misplaced images sampled according to the top-k precision. To validate the effectiveness of the proposed method, we conduct a series of experiments on three widely used benchmark datasets. The experimental results demonstrate that: (1) our proposed method outperforms the baseline methods on two datasets, which shows the potential value of rank-sensitive optimization of top-k precision for deep metric learning; (2) factors such as the batch size and the cutoff value k significantly affect the performance of approaches that rely on optimizing top-k precision for deep metric learning. Careful examination of these factors is highly recommended.
In e-commerce search, vectorized matching is the most important approach besides lexical matching, and learning vector representations for entities (e.g., query, item, shop) plays a crucial role in it. In this work, we focus on a vectorized search matching model for shop search in Taobao. Unlike item search, shop search faces serious behavior sparsity and long-tail problems. To tackle this, we take the first step of transferring knowledge from item search, i.e., leveraging items purchased under a query and the shops they belong to. Moreover, we propose a novel gated heterogeneous graph learning model (named GHL) to derive vector representations for entities. Both first-order and second-order proximity of queries and shops are exploited to fully mine the heterogeneous relationships. To relieve the long-tail phenomenon, we devise an innovative gated neighbor aggregation scheme in which each type of entity (i.e., hot and long-tail) can benefit from the heterogeneous graph in an automatic way. Finally, the whole framework is jointly trained in an end-to-end fashion. Offline evaluation results on real-world data from the Taobao shop search platform demonstrate that the proposed model outperforms existing graph-based methods, and online A/B tests show that it is highly effective and achieves significant CTR improvements.
In this paper, we compare several deep and shallow state-of-the-art machine learning methods for risk prediction in problems that can be modelled as a trajectory of events separated by irregular time intervals. Trajectories are an abstract representation of many kinds of real-life data, such as patient records, student e-tivities, online financial transactions, and many others. Given the continuously increasing number of machine learning methods for predicting future high-risk events in these contexts, we aim to provide more insight into the reproducibility and applicability of these methods when changing datasets, parameters, and evaluation measures. As an additional contribution, we release to the community the implementations of all compared methods.
Contextualized neural language models have gained much attention in Information Retrieval (IR) for their ability to achieve better text understanding by capturing contextual structure. However, to achieve better document understanding, it is necessary to involve the global structure of a document. In this paper, we take advantage of Graph Convolutional Networks (GCNs) to model the global word-relation structure of a document to improve context-aware document ranking. We propose to build a graph for a document to model the global structure, where the nodes and edges of the graph are constructed from contextual embeddings. We then apply graph convolution on the graph to learn a new representation that covers both contextual and global structure information. The experimental results show that our method outperforms state-of-the-art contextual language models, which demonstrates that incorporating global structure is useful for improving document ranking and that GCNs are an effective way to achieve it.
Online ad targeting can be formulated as a problem of learning the relevance ranking among possible audiences for a given ad. It has to deal with the massive number of negative, i.e., non-interacted, instances in impression data due to the nature of this service, and thus suffers from a data imbalance problem. In this work, we tackle this problem by improving the quality of the negative instances used in training the targeting model. We propose to enhance generalization capability by introducing unobserved data as possible negative instances, and to extract more reliable negative instances from the observed negatives in impression data. However, this idea is non-trivial to implement because of the limited learning signal and the existing noise signal. To this end, we design a novel method, RNIG (short for Representative Negative Instance Generator), that leverages a feature matching technique. It aims to generate reliable negative instances that are similar to the observed negatives, and further improves the representativeness of generated negatives by matching the most important feature. Extensive experiments on a real-world ad targeting dataset show that our RNIG model achieves a relative improvement of more than 5%.
Graph neural networks (GNNs) are a popular tool to learn lower-dimensional representations of a graph. They facilitate the applicability of machine learning tasks on graphs by incorporating domain-specific features. There are various options for the underlying procedures (such as optimization functions, activation functions, etc.) that can be considered in the implementation of a GNN. However, most of the existing tools are confined to one approach without any analysis. Thus, this emerging field lacks robust implementations that account for the highly irregular structure of real-world graphs. In this paper, we attempt to fill this gap by studying various alternative functions for each respective module using a diverse set of benchmark datasets. Our empirical results suggest that the commonly used underlying techniques do not always perform well in capturing the overall structure of a set of graphs.
Privacy-preserving record linkage (PPRL) facilitates the matching of records that correspond to the same real-world entities across different databases while preserving the privacy of the individuals in these databases. A Bloom filter (BF) is a space-efficient probabilistic data structure that is becoming popular in PPRL as an efficient privacy technique to encode sensitive information in records while still enabling approximate similarity computations between attribute values. However, BF encoding is susceptible to privacy attacks which can re-identify the values that are being encoded. In this paper, we propose two novel techniques that can be applied on BF encoding to improve privacy against attacks. Our techniques use neighbouring bits in a BF to generate new bit values. An empirical study on large real databases shows that our techniques provide high security against privacy attacks, and achieve better similarity computation accuracy and linkage quality compared to other privacy improvements that can be applied on BF encoding.
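The Bloom-filter encoding that this line of PPRL work builds on can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hash scheme, filter length `m`, hash count `k`, and q-gram size are illustrative assumptions.

```python
import hashlib

def qgrams(value, q=2):
    # Split an attribute value into character q-grams.
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def bloom_encode(value, m=64, k=3, q=2):
    # Map each q-gram to k bit positions via seeded hashes,
    # so similar strings produce overlapping bit patterns.
    bits = [0] * m
    for gram in qgrams(value, q):
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            bits[int(digest, 16) % m] = 1
    return bits

def dice_similarity(a, b):
    # Approximate string similarity computed directly on the encodings:
    # Dice coefficient of the two bit vectors.
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))
```

Because all comparison happens on the bit vectors, the parties never exchange plaintext values; this is also why re-identification attacks on the bit patterns are a concern.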
Web search engines are frequently used to access information about products. This has increased in recent times with the rising popularity of e-commerce. However, there is limited understanding of what users search for and their intents when it comes to product search on the web. In this work, we study search logs from the Bing web search engine to characterize user intents and study user behavior for product search. We propose a taxonomy of product intents by analyzing product search queries. This itself is a challenging task given that only 15%-17% of queries on the web refer to products. We train machine learning classifiers with query log features to classify queries based on intent with an overall F1-score of 78%. We further analyze various characteristics of product search queries in terms of search metrics like dwell time, success, popularity and session-specific information.
Streaming analytics deploy Kleene pattern queries to detect and aggregate event trends on high-rate data streams. Despite increasing workloads, most state-of-the-art systems process each query independently, thus missing cost-saving sharing opportunities. Sharing event trend aggregation poses several technical challenges. First, Kleene patterns are in general difficult to share due to complex nesting and arbitrarily long matches. Second, not all sharing opportunities are beneficial because sharing Kleene patterns incurs non-trivial overhead to ensure the correctness of final aggregation results. We propose MUSE (Multi-query Shared Event trend aggregation), the first framework that shares aggregation queries with Kleene patterns while avoiding expensive trend construction. To find the beneficial sharing plan, the MUSE optimizer effectively selects robust sharing candidates from the exponentially large search space. Our experiments demonstrate that MUSE increases throughput by 4 orders of magnitude compared to state-of-the-art approaches.
Recently introduced pre-trained contextualized language models like BERT have shown improvements in document retrieval tasks. One of the major limitations of current approaches can be attributed to the manner in which they deal with variable document lengths using a fixed-input BERT model. Common approaches either truncate or split longer documents into small sentences/passages and subsequently label them, using either the original document label or another externally trained model. The other problem is the scarcity of labelled query-document pairs, which directly hampers the performance of modern data-hungry neural models. This process gets even more complicated with the partially labelled large dataset of queries derived from query logs (TREC-DL). In this paper, we handle both issues simultaneously and introduce passage-level weak supervision in contrast to standard document-level supervision. We conduct a preliminary study on document-to-passage label transfer and the influence of unlabelled documents on the performance of ad hoc document retrieval. We observe that direct transfer of relevance labels from documents to passages introduces label noise that strongly affects retrieval effectiveness. We propose a weak-supervision based transfer passage labelling scheme that helps in improving performance and in gathering relevant passages from unlabelled documents.
The location-based social network Foursquare reflects the human activities of a city. The mobility dynamics inferred from Foursquare help us understand urban social events like crime. In this paper, we build a directed graph from the aggregated movement between regions using Foursquare data. We derive a region risk factor from the movement direction, quantity, and crime history in different periods of the day. We then propose a new set of features, DIrected graph Flow FEatuRes (DIFFER), which are associated with the region risk factor. Reliable correlations between DIFFER and crime count are observed. We verify the effectiveness of DIFFER in predicting monthly crime counts using Linear, XGBoost, and Random Forest regression in two cities, Chicago and New York City.
Relation extraction is a way of obtaining the semantic relationship between entities in text. The state-of-the-art methods use linguistic tools to build a graph for the text in which the entities appear, and a Graph Convolutional Network (GCN) is then employed to encode the pre-built graphs. Although their performance is promising, the reliance on linguistic tools results in a non-end-to-end process. In this work, we propose a novel model, the Self-determined Graph Convolutional Network (SGCN), which determines a weighted graph using a self-attention mechanism rather than using any linguistic tool. The self-determined graph is then encoded using a GCN. We test our model on the TACRED dataset and achieve the state-of-the-art result. Our experiments show that SGCN outperforms the traditional GCN, which uses dependency parsing tools to build the graph.
We propose a stochastic framework to evaluate the impact of missing data on the performance of predictive models. The framework allows full control of important aspects of the data set structure. These include the number and type of input variables, the correlation between the input variables and their general predictive power, and the sample size. The missingness process is generated from a multivariate Bernoulli distribution, which allows us to simulate missing patterns corresponding to the MCAR, MAR and MNAR mechanisms. Although the framework may be applied to virtually all types of predictive models, in this article we focus on the logistic regression model and choose accuracy as the predictive measure. The simulation results show that the effects of missing data disappear for large sample sizes, as expected. On the other hand, as the number of input variables increases, the accuracy decreases, mainly for binary inputs.
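The distinction between missingness mechanisms can be sketched with Bernoulli draws (an illustrative toy, not the paper's simulator; the `base`/`slope` link is an assumption): under MCAR, every entry is missing with a fixed probability, while under MAR the missingness probability depends on an always-observed covariate.

```python
import random

def mcar_mask(n, p, rng=random):
    # MCAR: each entry is missing independently with probability p,
    # regardless of any observed or unobserved values.
    return [rng.random() < p for _ in range(n)]

def mar_mask(x_obs, base=0.1, slope=0.4, rng=random):
    # MAR: the probability that the target variable is missing depends
    # only on the observed covariate x_obs (simple linear link, capped at 1).
    return [rng.random() < min(1.0, base + slope * x) for x in x_obs]
```

MNAR would additionally let the missingness probability depend on the (unobserved) value itself, which is why it is the hardest mechanism to correct for.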
Convolutional Neural Network (CNN) based multi-task learning methods have been widely used in a variety of computer vision applications. Towards effective multi-task CNN architectures, recent studies automatically learn the optimal combinations of task-specific features at single network layers. However, they generally apply a fixed feature-aggregation operation after training, regardless of the characteristics of the input features. In this paper, we propose a novel Adaptive Feature Aggregation (AFA) layer for multi-task CNNs, in which a dynamic aggregation mechanism is designed to allow each task to adaptively determine the degree to which the feature aggregation of different tasks is needed, according to the feature dependencies. On both pixel-level and image-level tasks, we demonstrate that our approach significantly outperforms the previous state-of-the-art multi-task CNN methods.
We propose Graph Generating Dependencies (GGDs), a new class of dependencies for property graphs. Extending the expressivity of state of the art constraint languages, GGDs can express both tuple- and equality-generating dependencies on property graphs, both of which find broad application in graph data management. We provide the formal definition of GGDs, analyze the validation problem for GGDs, and demonstrate the practical utility of GGDs.
As social network services (SNS) expand from friend-based to interest-based, users form a new type of relationship, namely interest-based relationships, with friends and others through social activities (e.g., likes, comments). Although such relationships are highlighted in common-identity theory and have important theoretical and practical value, little evidence exists in the literature that explains social activities as a central component of social network analysis or their association with friendship. In this paper, we build like networks on Instagram and analyze them through the lens of two salient aspects that constitute social networks: friendship and interest. Our results (1) show ambiguous interpretations of like activities between users who are friends, based on a comparative analysis between friend- and non-friend-based like networks, and (2) demonstrate strong signals of the hashtag characterizing interest-based relationships in users and content. Our research substantiates and gives insights into common-identity theory applied to online social networks through data-driven, empirical analysis.
Social connections play a vital role in improving the performance of recommendation systems (RS). However, incorporating social information into RS is challenging. Most existing models consider social influences only in a given session, ignoring that both users' preferences and their friends' influences are evolving. Moreover, in the real world, social relations are sparse. Modeling dynamic influences and alleviating data sparsity are therefore of great importance.
In this paper, we propose a unified framework named Dynamic RElation-Aware Model (DREAM) for social recommendation, which models both users' dynamic interests and their friends' temporal influences. Specifically, we design temporal information encoding modules, through which user representations are updated in each session. The updated user representations are transferred to relational-GAT modules and subsequently influence the operations on social networks. In each session, to address social relation sparsity, we utilize a GloVe-based method to complete the social network with virtual friends. We then employ the relational-GAT module over the completed social networks to update users' representations. In extensive experiments on public datasets, DREAM significantly outperforms the state-of-the-art solutions.
Log parsers first convert large-scale, unstructured system logs into structured data and then cluster them into groups for anomaly detection and monitoring. However, the security vulnerabilities of log parsers have not yet been unveiled. In this paper, to the best of our knowledge, we take the first step in proposing a novel real-time black-box attack framework, LogBug, in which attackers slightly modify the logs to skew the analysis result (i.e., evading anomaly detection) without knowing the learning model and parameters of the log parser. We have empirically evaluated LogBug on five emerging log parsers using system logs collected from five different systems. The results demonstrate that LogBug can greatly reduce the accuracy of log parsers with minor perturbations in real time.
Document retrieval (DR) is a crucial task in NLP. Recently, pre-trained BERT-like language models have achieved remarkable success, obtaining state-of-the-art results in DR. In this paper, we propose a new BERT-based ranking model for the DR task, named TABLE. In the pre-training stage of TABLE, we present a domain-adaptive strategy. More importantly, in the fine-tuning stage, we develop a two-phase task-adaptive process, i.e., type-adaptive pointwise fine-tuning and listwise fine-tuning. In the type-adaptive pointwise fine-tuning phase, the model can learn different matching patterns for different query types. In the listwise fine-tuning phase, the model matches documents with regard to a given query in a listwise fashion. This task-adaptive process makes the model more robust. In addition, a simple but effective exact matching feature is introduced in fine-tuning, which can effectively compute the matching of out-of-vocabulary (OOV) words between a query and a document. To the best of our knowledge, we are the first to propose a listwise ranking model with BERT. This work explores rich matching features between queries and documents and therefore substantially improves model performance in DR. Notably, our TABLE model shows excellent performance on the MS MARCO leaderboard.
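The exact-matching idea can be illustrated with a toy feature (a hypothetical helper, not TABLE's actual implementation): the fraction of query terms that occur verbatim in the document. Such a signal stays meaningful even for OOV tokens that a subword vocabulary would fragment.

```python
def exact_match_feature(query, document):
    # Fraction of query terms that appear verbatim in the document;
    # robust for OOV words (e.g. product codes) that a subword
    # tokenizer would split into uninformative pieces.
    q_terms = query.lower().split()
    doc_terms = set(document.lower().split())
    return sum(t in doc_terms for t in q_terms) / max(1, len(q_terms))
```

In a neural ranker, such a scalar would simply be concatenated to the learned representation before the final scoring layer.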
Recently, convolutional networks have shown significant promise for modeling sequential user interactions for recommendations. Critically, such networks rely on fixed convolutional kernels to capture sequential behavior. In this paper, we argue that not all the dynamics of item-to-item transitions in session-based settings may be observable at training time. Hence we propose DynamicRec, which uses dynamic convolutions to compute the convolutional kernels on the fly based on the current input. We show through experiments that this approach significantly outperforms existing convolutional models on real datasets in session-based settings.
Entity matching (EM) is the process of linking records from different data sources. While extensive research has been done on various aspects of EM, many of these studies generally assume EM tasks are schema-specific, attempting to match record pairs at the attribute level. Unfortunately, in the real world, tables that undergo EM may not have an aligned schema, and often the schema or metadata of the tables and attributes are not known beforehand. In view of this challenge, this paper presents an effective approach for schema-agnostic EM, where having schema-aligned tables is not compulsory. The proposed method stems from the idea of treating tuples in tables for EM like the sentence pair classification problem in natural language processing (NLP). A pre-trained language model, BERT, is adopted by fine-tuning it on a labeled dataset. The proposed method was evaluated on benchmark datasets and compared against two state-of-the-art approaches, namely DeepMatcher and Magellan. The experimental results show that our proposed solution outperforms them by an average of 9% in F1 score. The performance is in fact consistent across different types of datasets, showing a significant improvement of 29.6% on one of the dirty datasets. This shows that our proposed solution is versatile for EM.
Low-rank representation of binary matrices is powerful in disentangling sparse individual-attribute associations and has received wide application. Existing binary matrix factorization (BMF) or co-clustering (CC) methods often assume i.i.d. background noise. However, this assumption can easily be violated in real data, where heterogeneous row- or column-wise probabilities of binary entries result in disparate element-wise background distributions and paralyze the rationality of existing methods. We propose a binary data denoising framework, namely BIND, which optimizes the detection of true patterns by estimating the row- or column-wise mixture distribution of patterns and disparate background, and eliminating the binary attributes that are more likely from the background. BIND is supported by thoroughly derived mathematical properties of the row- and column-wise mixture distributions. Our experiments on synthetic and real-world data demonstrate that BIND effectively removes background noise and drastically increases the fairness and accuracy of state-of-the-art BMF and CC methods.
Cold-start is a long-standing and challenging problem in recommendation systems. To tackle this issue, many cross-domain recommendation approaches have been proposed. However, most of them follow a two-stage embedding-and-mapping paradigm, which is hard to optimize. Besides, they ignore the structural information of the user-item interaction graph, so the embeddings are insufficient to capture the latent collaborative filtering effect. In this paper, we propose a Dual Autoencoder Network (DAN), which implements cross-domain recommendation for cold-start users in an end-to-end manner. The graph convolutional network (GCN) based encoder in DAN explicitly captures high-order collaborative information in user-item interaction graphs. A two-branched decoder is proposed to fully exploit the data across domains, so that elaborate reconstruction constraints are obtained under a domain swapping strategy. Experiments on two pairs of real-world cross-domain datasets demonstrate that DAN outperforms existing state-of-the-art methods.
Recently, there has been interest in embedding networks in hyperbolic space, since hyperbolic space has been shown to work well in capturing graph/network structure, as it can naturally reflect some properties of complex networks. However, work on network embedding in hyperbolic space has focused on microscopic node embedding. In this work, we are the first to present a framework to embed the structural roles of nodes into hyperbolic space. Our framework extends struc2vec, a well-known structural role preserving embedding method, by moving it to a hyperboloid model. We evaluated our method on four real-world networks and one synthetic network. Our results show that hyperbolic space is more effective than Euclidean space in learning latent representations for the structural roles of nodes.
Google Trends is a tool that allows researchers to analyze the popularity of Google search queries across time and space. In a single request, users can obtain time series for up to 5 queries on a common scale, normalized to the range from 0 to 100 and rounded to integer precision. Despite the overall value of Google Trends, rounding causes major problems, to the extent that entirely uninformative, all-zero time series may be returned for unpopular queries when requested together with more popular queries. We address this issue by proposing Google Trends Anchor Bank (G-TAB), an efficient solution for the calibration of Google Trends data. Our method expresses the popularity of an arbitrary number of queries on a common scale without being compromised by rounding errors. The method proceeds in two phases. In the offline preprocessing phase, an "anchor bank" is constructed, a set of queries spanning the full spectrum of popularity, all calibrated against a common reference query by carefully chaining together multiple Google Trends requests. In the online deployment phase, any given search query is calibrated by performing an efficient binary search in the anchor bank. Each search step requires one Google Trends request, but few steps suffice, as we demonstrate in an empirical evaluation. We make our code publicly available as an easy-to-use library at https://github.com/epfl-dlab/GoogleTrendsAnchorBank.
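The online phase described above can be sketched as a binary search over the precomputed anchor bank. This is an illustrative sketch, not G-TAB's actual code: `trends_ratio` is a stand-in for one pairwise Google Trends request, and the resolution thresholds are assumptions.

```python
def calibrate(query, anchor_bank, trends_ratio):
    # anchor_bank: (name, popularity) pairs sorted by ascending popularity
    # on the common scale. trends_ratio(a, b) mimics one Trends request:
    # it returns the value of `a` when `a` and `b` are requested together
    # (the more popular of the two is normalized to 100, values rounded).
    lo, hi = 0, len(anchor_bank) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name, pop = anchor_bank[mid]
        r = trends_ratio(query, name)
        if r >= 100:      # query at the cap: anchor too small, search upward
            lo = mid + 1
        elif r < 5:       # query rounded toward zero: anchor too popular
            hi = mid - 1
        else:             # ratio well resolved: express on the common scale
            return pop * r / 100.0
    return None           # no comparable anchor found
```

Each loop iteration costs one Trends request, so the number of requests per query is logarithmic in the size of the anchor bank.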
Due to its interpretability and robustness, the Markov boundary (MB) has received much attention and has been widely applied to causal feature selection. However, extensive empirical studies show that existing algorithms achieve outstanding performance only on standard Bayesian network data. On real-world data, they fail to identify some of the relevant features, since the large conditioning sets and the ignored multivariate dependence lead to performance degradation. In this paper, we propose a tolerant MB discovery algorithm (TLMB), which maps the feature space and target space to a reproducing kernel Hilbert space through the conditional covariance operator, to measure the causal information carried by a feature. Specifically, TLMB first uses a score function to filter redundant features and then minimizes the trace of the conditional covariance operator, where both the score function and the optimization problem work in the reproducing kernel Hilbert space, so that TLMB can select features with not only pairwise dependence but also multivariate dependence. Moreover, as an MB-based method, TLMB can automatically determine the number of selected features due to the properties of the MB.
Click-through rate prediction plays an important role in many fields, such as recommender and advertising systems, and is one of the crucial parts of improving user experience and increasing industry revenue. Recently, several deep learning-based models have been successfully applied to this area. Some existing studies further model user representations based on the user's historical behavior sequence, in order to capture dynamic and evolving interests. We observe that users usually have multiple interests at a time, the latent dominant interest is expressed by the behavior, and a switch of the latent dominant interest results in behavior changes. Thus, modeling and tracking latent multiple interests would be beneficial. In this paper, we propose a novel method named Deep Multi-Interest Network (DMIN) which models a user's latent multiple interests for the click-through rate prediction task. Specifically, we design a Behavior Refiner Layer using multi-head self-attention to capture better user historical item representations. The Multi-Interest Extractor Layer is then applied to extract multiple user interests. We evaluate our method on three real-world datasets. Experimental results show that the proposed DMIN outperforms various state-of-the-art baselines on the click-through rate prediction task.
Conversational query understanding (CQU) is indispensable to multi-turn QA. However, existing methods are data-driven and expensive to extend to new conversation domains, or are tied to specific frameworks and hard to apply to other underlying QA technologies. We propose a novel contextual query reformulation (CQR) module based on reformulation actions for general CQU. The actions are domain-independent and scalable, since they capture syntactic regularities of conversations. For action generation, we propose a multi-task learning framework enhanced by coreference resolution, and introduce grammar constraints into the decoding process. CQR then synthesizes standalone queries based on the actions, which naturally adapts to original downstream technologies. Experiments on different CQU datasets suggest that action-based methods substantially outperform direct reformulation, and that the proposed model performs the best among the methods.
Recently, few-shot relation classification has drawn much attention. It aims to address the long-tail relation problem by recognizing relations from few instances. Existing metric learning methods learn the prototypes of classes and make predictions according to the distances between the query and the prototypes. However, they are likely to make unreliable predictions due to text diversity. It is intuitive that the text descriptions of relations and entities can provide auxiliary supporting evidence for relation classification. In this paper, we propose TD-Proto, which enhances the prototypical network with relation and entity descriptions. We design a collaborative attention module to extract beneficial and instructive information from the sentence and entities respectively. A gating mechanism is proposed to fuse both kinds of information dynamically so as to obtain a knowledge-aware instance. Experimental results demonstrate that our method achieves excellent performance.
Leveraging biased click data for optimizing learning to rank systems has been a popular approach in information retrieval. Because click data is often noisy and biased, a variety of methods have been proposed to construct unbiased learning to rank (ULTR) algorithms for the learning of unbiased ranking models. Among them, automatic unbiased learning to rank (AutoULTR) algorithms that jointly learn user bias models (i.e., propensity models) with unbiased rankers have received a lot of attention due to their superior performance and low deployment cost in practice. Despite their differences in theories and algorithm design, existing studies on ULTR usually use uni-variate ranking functions to score each document or result independently. On the other hand, recent advances in context-aware learning-to-rank models have shown that multivariate scoring functions, which read multiple documents together and predict their ranking scores jointly, are more powerful than uni-variate ranking functions in ranking tasks with human-annotated relevance labels. Whether such superior performance would hold in ULTR with noisy data, however, is mostly unknown. In this paper, we investigate existing multivariate scoring functions and AutoULTR algorithms in theory and prove that permutation invariance is a crucial factor that determines whether a context-aware learning-to-rank model could be applied to the existing AutoULTR framework. Our experiments with synthetic clicks on two large-scale benchmark datasets show that AutoULTR models with permutation-invariant multivariate scoring functions significantly outperform those with uni-variate scoring functions and permutation-variant multivariate scoring functions.
Recommending the stock with the highest return ratio is always a challenging problem in the field of financial technology. In this paper, we propose a time-aware graph relational attention network (TRAN) for stock recommendation based on return ratio ranking. In TRAN, the time-aware relational attention mechanism is the key unit for capturing the time-varying correlation strength between stocks through the interaction of historical sequences and stock description documents. With this dynamic strength, the nodes of the stock relation graph aggregate the features of neighboring stock nodes via a graph convolution operation. For a given group of stocks, our model outputs the ranking of the stocks according to their return ratios. Experimental results on several real-world datasets demonstrate the effectiveness of TRAN for stock recommendation.
Click-Through Rate (CTR) prediction is a crucial task for various online applications, such as recommendation and online advertising. The task of CTR prediction is to predict the probability of users' clicking behaviors given high-dimensional input features. To avoid heavy handcrafted feature engineering, the core topic of CTR prediction is the automatic interaction of the input features. The Factorization Machine (FM) is an effective approach for modeling second-order feature interactions. Recently, FM has been extended to model higher-order feature interactions, as in xDeepFM and the Higher-Order Factorization Machine (HOFM). However, these approaches either have high complexity or require iterative computation that consumes much time and space. To overcome these problems, we express arbitrary-order FM in the form of power sums according to Newton's identities. Accordingly, we propose a novel Interaction Machine (IM) model. IM is an efficient and exact implementation of high-order FM, whose time complexity grows linearly with the order of interactions and the number of feature fields. Via IM, we can conduct arbitrary-order feature interactions in a very simple way. Moreover, we combine IM with deep neural networks, and the resulting DeepIM model is more efficient than xDeepFM with comparable or even better performance. We conduct experiments on two real-world datasets, in which the effectiveness and efficiency of both IM and DeepIM are strongly verified.
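The power-sum idea can be illustrated on scalar feature values (an illustrative sketch; the actual IM operates on embedding vectors per feature field): Newton's identities express the order-k elementary symmetric polynomial, i.e., the sum over all k-way interactions, through power sums that each cost only linear time, avoiding the combinatorial enumeration.

```python
from itertools import combinations
from math import prod

def elementary_sym(xs, order):
    # Newton's identities: k * e_k = sum_{i=1..k} (-1)^(i-1) * e_{k-i} * p_i,
    # where p_i = sum_j xs[j]**i is the i-th power sum. Each p_i is linear
    # in len(xs), so e_order costs O(order * len(xs)) instead of enumerating
    # all order-way combinations.
    p = [sum(x ** i for x in xs) for i in range(order + 1)]  # p[0] unused
    e = [1.0]
    for k in range(1, order + 1):
        e_k = sum((-1) ** (i - 1) * e[k - i] * p[i] for i in range(1, k + 1)) / k
        e.append(e_k)
    return e[order]

def brute_force(xs, order):
    # Reference: explicit sum over all order-way interaction products.
    return sum(prod(c) for c in combinations(xs, order))
```

For order 2 this recovers the classic FM trick, e2 = (p1**2 - p2) / 2, i.e., the squared-sum-minus-sum-of-squares identity.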
Insiders cause significant cyber-security threats to organizations. Due to a very limited number of insiders, most of the current studies adopt unsupervised learning approaches to detect insiders by analyzing the audit data that record information about employees' activities. However, in practice, we do observe a small number of insiders. How to make full use of these few observed insiders to improve a classifier for insider threat detection is a key challenge. In this work, we propose a novel framework combining the idea of self-supervised pre-training and metric-based few-shot learning to detect insiders. Experimental results on insider threat datasets demonstrate that our model outperforms the existing anomaly detection approaches by only using a few insiders.
Online advertising systems often provide means for users to close ads and also leave feedback. Although closing ads requires additional user engagement and usually indicates a poor user experience, ad closes are not as scarce as one might expect. Recently it was shown that penalizing ads with high closing likelihood during auctions may substantially reduce the number of ad closes while maintaining a small predefined revenue loss. In this work, we focus on email since this is the property in which most ad closes occur. Using data collected from a major email provider, we present interesting insights about the interplay between ad closes in email and email-related user actions. In particular, we explore the merits of integrating information derived from user actions in email for ad-close prediction. Thorough performance evaluation reveals that incorporating such signals significantly improves ad-close prediction quality over previously reported results.
Despite extensive research on cross-modal retrieval, existing methods focus on the matching between image objects and text words. However, for the large amount of social media content, such as news reports and online posts with images, previous methods are insufficient to model the associations between long text and images. Since long text contains multiple entities, the relationships between them, and complex events sharing a common scenario, it poses unique research challenges for cross-modal retrieval. To tackle these challenges, in this paper we focus on the retrieval task over long text and images, and propose an event-driven network for cross-modal retrieval. Our approach consists of two modules, namely the contextual neural tensor network (CNTN) and the cross-modal matching network (CMMN). The CNTN module captures both event-level and text-level semantics of the sequential events extracted from a long text. The CMMN module learns a common representation space in which to compute the similarity of the image and text modalities. We construct a multimodal dataset based on news reports in People's Daily. The experimental results demonstrate that our model outperforms existing state-of-the-art methods and can provide semantically richer text representations to enhance the effectiveness of cross-modal retrieval.
Bladder cancer is a malignant disease with substantial morbidity and mortality. Bladder cancer staging is crucial for determining effective treatments of bladder tumors in clinical practice. Owing to their superior feature learning, Deep Convolutional Neural Networks (DCNN) are widely used to predict cancer stage from medical images. However, most existing DCNN-based cancer staging methods are purely data-driven and neglect the domain knowledge and experience of clinicians. Besides, deep neural networks lack model interpretability and may lead to risky diagnoses. To tackle these problems, we construct diagnosis rules for bladder cancer staging based on clinical experience of tumor penetration into the bladder wall. The diagnosis rules are extracted from Magnetic Resonance (MR) images and further integrated into a DCNN for joint identification of tumor stage. The experiments validate that the integrated rules improve model interpretability and guide the DCNN to focus on the regions of tumor penetration, thereby producing precise predictions of cancer staging.
Link prediction has long been a focus in the analysis of network-structured data. Though straightforward and efficient, heuristic approaches like Common Neighbors perform link prediction with pre-defined assumptions and only use superficial structural features. While it is widely acknowledged that a vertex can be characterized by its neighboring vertices, network embedding algorithms and the newly emerged graph neural networks still exploit structural features over the whole network, which may inevitably bring in noise and limit the scalability of those methods. In this paper, we propose an end-to-end deep learning framework, namely the hyper-substructure enhanced link predictor (HELP), for link prediction. HELP utilizes local topological structures from the neighborhood of the given vertex pairs, avoiding useless features. To further exploit higher-order structural information, HELP also learns features from a hyper-substructure network (HSN). Extensive experiments on six benchmark datasets have shown the state-of-the-art performance of HELP on link prediction.
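For contrast with learned approaches like HELP, the Common Neighbors heuristic mentioned above is a one-liner over an adjacency map (a sketch for illustration, not the paper's code):

```python
def common_neighbors(adj, u, v):
    """Common Neighbors heuristic: score(u, v) = |N(u) ∩ N(v)|.
    adj maps each vertex to its set of neighbors. A purely local,
    pre-defined score: exactly the kind of superficial structural
    feature that learned link predictors aim to go beyond."""
    return len(adj[u] & adj[v])
```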
Seasonal periodicity is a frequent phenomenon for social interactions in temporal networks. A key property of this behavior is that it exhibits periodicity during multiple particular periods in temporal networks. Mining such seasonal-periodic patterns is significant since it can indicate interesting relationships between the individuals involved in the interactions. Unfortunately, most previous studies of periodic pattern mining ignore the seasonal feature. This motivates us to explore mining seasonal-periodic subgraphs, and we present a novel model, called the maximal σ-periodic ω-seasonal k-subgraph. It represents a subgraph with size larger than k that appears at least σ times periodically in at least ω particular periods of the temporal graph. Since seasonal-periodic patterns do not satisfy the anti-monotonic property, we propose a weak version of the support measure with an anti-monotonic property to reduce the search space efficiently. Then, we present an effective mining algorithm to find all maximal σ-periodic ω-seasonal k-subgraphs. Experimental results on real-life datasets show the effectiveness and efficiency of our approach.
This paper focuses on the multi-behavior recommendation problem, i.e., generating personalized recommendation based on multiple types of user behaviors. Methods proposed recently usually leverage the ordinal assumption, which means that users' different types of behaviors should take place in a fixed order. However, this assumption may be too strong in some scenarios. In this paper, a more general model named Multiplex Graph Neural Network (MGNN) is proposed as a remedy. MGNN tackles the multi-behavior recommendation problem from a novel perspective, i.e., the perspective of link prediction in multiplex networks. By taking advantage of both the multiplex network structure and graph representation learning techniques, MGNN learns shared embeddings and behavior-specific embeddings for users and items to model the collective effect of multiple types of behaviors. Experiments conducted on both ordinal-behavior datasets and generic-behavior datasets demonstrate the effectiveness of the proposed MGNN model.
Unsupervised domain adaptation (UDA) attempts to transfer specific knowledge from one domain with labeled data to another domain without labels. Recently, the maximum squares loss was proposed to tackle the UDA problem, but it does not consider prediction diversity, which has proven beneficial to UDA. In this paper, we propose a novel normalized squares maximization (NSM) loss in which the maximum squares term is normalized by the sum of squares of the class sizes. The normalization term enforces balanced class sizes in the predictions to explicitly increase diversity. Theoretical analysis shows that the optimal solution to NSM consists of one-hot vectors with balanced class sizes, i.e., NSM encourages both discriminative and diverse predictions. We further propose a robust variant of NSM, RNSM, which replaces the square loss with the L2,1-norm to reduce the influence of outliers and noise. Experiments on cross-domain image classification over two benchmark datasets illustrate the effectiveness of both NSM and RNSM. RNSM achieves promising performance compared to state-of-the-art methods. The code is available at https://github.com/wj-zhang/NSM.
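Reading off the description above, the NSM objective divides the maximum-squares term by the sum of squared (soft) class sizes. The following numpy sketch is one plausible formulation under that reading; the exact form in the paper may differ:

```python
import numpy as np

def nsm_loss(P, eps=1e-8):
    """Normalized squares maximization, sketched from the abstract.
    P: (n, C) softmax predictions. The maximum-squares term sum(P^2)
    is divided by the sum of squared soft class sizes, so balanced
    one-hot predictions minimize the (negated) objective."""
    squares = np.sum(P ** 2)
    class_sizes = P.sum(axis=0)                 # soft class sizes
    return -squares / (np.sum(class_sizes ** 2) + eps)
```

With four one-hot predictions, splitting them evenly across two classes gives a strictly lower loss than assigning all four to one class, matching the claimed preference for discriminative and diverse predictions.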
Community detection is a fundamental problem in social network analysis, and most existing studies focus on unsigned graphs, i.e., treating all relationships as positive. However, friend and foe relationships naturally exist in many real-world applications. Ignoring the signed information may lead to unstable communities. To better describe the communities, we propose a novel model, named signed k-truss, which leverages the properties of k-truss and balanced triangle. We prove that the problem of identifying the maximum signed k-truss is NP-hard. To deal with large graphs, novel pruning strategies and algorithms are developed. Finally, we conduct comprehensive experiments on real-world signed networks to evaluate the performance of proposed techniques.
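The balanced-triangle notion that the signed k-truss model builds on has a compact test: a triangle is balanced iff the product of its edge signs is positive. A sketch with +1/-1 signs:

```python
def is_balanced(s_uv, s_vw, s_wu):
    """A signed triangle is balanced iff the product of its edge signs
    is positive: 'the friend of my friend is my friend, the enemy of
    my enemy is my friend'. Signs are +1 (friend) or -1 (foe)."""
    return s_uv * s_vw * s_wu > 0
```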
Event-oriented news retrieval (ENR) is the task of retrieving news articles related to a specific event in response to an event-oriented query. Previous approaches usually focus on optimizing traditional retrieval models with hand-crafted features derived from news articles. However, these approaches often fail to work well in practice, as they do not consider the essential natures of events, i.e., dynamics and coupling. In this paper, we propose a novel and effective event-oriented neural ranking model for news retrieval (ENRMNR). Our model exploits a deep attention mechanism to tackle the dynamics and coupling that arise from event evolution. Specifically, word-level bidirectional attention allows the model to identify which query words about a subevent are related to which news article words, and vice versa, in order to tackle the dynamics. Moreover, hierarchical attention at the passage and document levels allows it to capture fine-grained event representations to handle the coupling between different events within a news article. Experimental results on real-world datasets demonstrate that the ENRMNR model significantly outperforms competitive models.
Top-N item recommendation from implicit feedback is a widely studied task. Although much progress has been made with neural methods, there is increasing concern about the appropriate evaluation of recommendation algorithms. In this paper, we revisit alternative experimental settings for evaluating top-N recommendation algorithms, considering three important factors, namely dataset splitting, sampled metrics, and domain selection. We select eight representative recommendation algorithms (covering both traditional and neural methods) and conduct extensive experiments on a very large dataset. By carefully examining the different options, we make several important findings on the three factors, which directly provide useful suggestions on how to appropriately set up experiments for top-N item recommendation.
Embedding mechanism plays an important role in Click-Through-Rate (CTR) prediction. Essentially, it tries to learn a new feature space with some learned latent properties as the basis, and maps the high dimensional and categorical raw data to dense, rich and expressive representations, i.e., the embedding features. Current researches usually focus on learning the interactions through operations on the whole embedding features without considering the relations among the learned latent properties. In this paper, we find it has clear positive effects on CTR prediction to model such relations and propose a novel Dimension Relation Module (DRM) to capture them through dimension recalibration. We show that DRM can improve the performance of existing models consistently and the improvements are more obvious when the embedding dimension is higher. We further boost Field-wise and Element-wise embedding methods with our DRM and name this new model FED network. Extensive experiments demonstrate that FED is very powerful in CTR prediction task and achieves new state-of-the-art results on Criteo, Avazu and JD.com datasets.
Identifying influencers on social media, such as Twitter, has played a central role in many applications, including online marketing and political campaigns. Compared with social media celebrities, domain-specific influencers are less expensive to hire and more engaged in spreading messages such as new treatment or timely prevention for HIV. However, most of the existing topic modeling based approaches fail to identify influencers who are dedicated to the rare yet important topics such as HIV and suicide. To alleviate this limitation, we investigate an on-Demand Influencer Discovery (DID) framework that is able to identify influencers on any subject depicted by a few user-specified keywords, regardless of its popularity on social media. The DID model employs an iterative learning process that integrates the language attention network as a subject filter and the influence convolution network built on user interactions. Comprehensive evaluations on Twitter datasets show that the DID model can reliably identify influencers even on rare subjects such as HIV and suicide, outperforming existing topic-specific influencer detection models.
Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis, etc. However, the limited scale of benchmark datasets makes it easy for graph classification models to fall into over-fitting and under-generalization. To address this, we introduce data augmentation on graphs and present two heuristic algorithms, random mapping and motif-similarity mapping, which generate more weakly labeled data for small-scale benchmark datasets via heuristic modification of graph structures. Furthermore, we propose a generic model evolution framework, named M-Evolve, which combines graph augmentation, data filtration, and model retraining to optimize pre-trained graph classifiers. Experiments conducted on six benchmark datasets demonstrate that M-Evolve helps existing graph classification models alleviate over-fitting when training on small-scale benchmark datasets, yielding an average improvement of 3-12% in accuracy on graph classification tasks.
In search and recommendation, diversifying multi-aspect search results helps reduce redundancy and promotes results that might not be shown otherwise. Many methods have been proposed for this task. However, previous methods do not explicitly consider the uniformity of the distribution of items across classes, or evenness, which can degrade search and recommendation quality. To address this problem, we introduce a novel method that adapts Simpson's Diversity Index from biology, enabling a more effective and efficient quadratic search-result diversification algorithm. We also extend the method to balance diversity across multiple aspects through weighted factors, and further improve its computational complexity by developing a fast approximation algorithm. We demonstrate the feasibility of the proposed method using the openly available Kaggle shoes competition dataset. Our experimental results show that our approach outperforms previous state-of-the-art diversification methods while reducing computational complexity.
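Simpson's Diversity Index, the quantity the method adapts, is simple to compute over the class labels of a result list. A sketch of the index itself (the diversification algorithm built on it is considerably more involved):

```python
from collections import Counter

def simpson_diversity(labels):
    """Simpson's Diversity Index: D = 1 - sum n_i(n_i - 1) / (N(N - 1)).
    D is the probability that two items drawn without replacement belong
    to different classes: 0 when all items share a class, approaching 1
    when every item is distinct (maximal evenness)."""
    counts = Counter(labels)
    N = sum(counts.values())
    if N < 2:
        return 0.0
    return 1.0 - sum(n * (n - 1) for n in counts.values()) / (N * (N - 1))
```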
Recently, the conversational recommender system (CRS) has become an emerging and practical research topic. Most existing CRS methods focus on learning effective preference representations for users from conversation data alone. In contrast, we take a new perspective and leverage historical interaction data to improve CRS. For this purpose, we propose a novel pre-training approach that integrates both the item-based preference sequence (from historical interaction data) and the attribute-based preference sequence (from conversation data). We carefully design two pre-training tasks to enhance information fusion between item- and attribute-based preferences. To improve learning performance, we further develop an effective negative-sample generator that produces high-quality negative samples. Experimental results on two real-world datasets demonstrate the effectiveness of our approach for improving CRS.
Student performance prediction aims to leverage student-related information to predict their future academic outcomes, which may be beneficial to numerous educational applications, such as personalized teaching and academic early warning. In this paper, we seek to address the problem by analyzing students' daily studying and living behavior, which is comprehensively recorded via campus smart cards. Different from previous studies, we propose an end-to-end student performance prediction model, namely Tri-branch CNN, which is equipped with three types of convolutional filters, i.e., the row-wise convolution, column-wise convolution, and group-wise convolution, to effectively capture the duration, periodicity, and location-aware characteristic of student behavior, respectively. We also introduce the attention mechanism and cost-sensitive learning strategy to further improve the accuracy of our approach. Extensive experiments on a large-scale real-world dataset demonstrate the potential of our approach for student performance prediction.
Deep multimodal clustering methods have shown their competitiveness among multimodal clustering algorithms. Existing algorithms usually boost multimodal clustering by exploring the common knowledge among multiple modalities, which underutilizes what is unique to each modality. In this paper, we enhance the mining of modality-common knowledge by simultaneously extracting the modality-unique knowledge of each modality. Specifically, we first utilize autoencoders to extract the modality-common and modality-unique features of each modality. Meanwhile, cross reconstruction is used to build latent connections among different modalities, i.e., to maintain the consistency of the modality-common features across modalities while heightening the diversity of the modality-unique features of each modality. After that, the modality-common features are fused to cluster the multimodal data. Experimental results on several benchmark datasets demonstrate that the proposed method clearly outperforms state-of-the-art works.
Search and recommender systems that take the initiative to ask clarifying questions to better understand users' information needs are receiving increasing attention from the research community. However, to the best of our knowledge, there is no empirical study to quantify whether and to what extent users are willing or able to answer these questions. In this work, we conduct an online experiment by deploying an experimental system, which interacts with users by asking clarifying questions against a product repository. We collect both implicit interaction behavior data and explicit feedback from users showing that: (a) users are willing to answer a good number of clarifying questions (11 on average), but not many more than that; (b) most users answer questions until they reach the target product, but also a fraction of them stops due to fatigue or due to receiving irrelevant questions; (c) part of the users' answers (17%) are actually opposite to the description of the target product; while (d) most of the users (84%) find the question-based system helpful towards completing their tasks. Some of the findings of the study contradict current assumptions on simulated evaluations in the field, while they point towards improvements in the evaluation framework and can inspire future interactive search/recommender system designs.
Large-scale pre-trained models have attracted extensive attention in the research community and shown promising results on various tasks of natural language processing. However, these pre-trained models are memory- and computation-intensive, hindering their deployment into industrial online systems like Ad Relevance. Meanwhile, how to design an effective yet efficient model architecture is another challenging problem in online Ad Relevance. Recently, AutoML has shed new light on architecture design, but how to integrate it with pre-trained language models remains unsettled. In this paper, we propose AutoADR (Automatic model design for AD Relevance) --- a novel end-to-end framework to address this challenge, and share our experience shipping these cutting-edge techniques into the online Ad Relevance system at Microsoft Bing. Specifically, AutoADR leverages a one-shot neural architecture search algorithm to find a tailored network architecture for Ad Relevance. The search process is simultaneously guided by knowledge distillation from a large pre-trained teacher model (e.g. BERT), while taking the online serving constraints (e.g. memory and latency) into consideration. We add the model designed by AutoADR as a sub-model into the production Ad Relevance model. This additional sub-model improves the Precision-Recall AUC (PR AUC) on top of the original Ad Relevance model by 2.65X of the normalized shipping bar. More importantly, adding this automatically designed sub-model leads to a statistically significant 4.6% Bad-Ad ratio reduction in online A/B testing. This model has been shipped into the Microsoft Bing Ad Relevance production model.
Learning to rank with implicit feedback is one of the most important tasks in many real-world information systems, where the objective is some specific utility, e.g., clicks or revenue. However, we point out that existing methods based on the probabilistic ranking principle do not necessarily achieve the highest utility. To this end, we propose a novel ranking framework called U-rank that directly optimizes the expected utility of the ranking list. With a position-aware deep click-through rate prediction model, we address attention bias by considering both query-level and item-level features. Due to the item-specific modeling of attention bias, the optimization of expected utility corresponds to a maximum-weight matching on the item-position bipartite graph. We base the optimization of this objective on an efficient LambdaLoss framework, supported by both theoretical and empirical analysis. We conduct extensive experiments on both web search and recommender systems over three benchmark datasets and two proprietary datasets, demonstrating the performance gain of U-rank over state-of-the-art methods. Moreover, our proposed U-rank has been deployed on a large-scale commercial recommender system, and a large improvement over the production baseline was observed in online A/B testing.
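On small candidate sets, the item-position matching objective can be checked by brute force. U-rank itself optimizes it efficiently via the LambdaLoss framework, so the enumeration below is purely illustrative:

```python
from itertools import permutations

def best_ranking(utility):
    """Brute-force maximum-weight matching on a small item-position
    bipartite graph. utility[i][p] is the expected utility of placing
    item i at position p (e.g., position-aware CTR times item value).
    Returns the item order maximizing total expected utility."""
    n = len(utility)
    best, best_val = None, float("-inf")
    for perm in permutations(range(n)):        # perm[p] = item at position p
        val = sum(utility[perm[p]][p] for p in range(n))
        if val > best_val:
            best, best_val = list(perm), val
    return best, best_val
```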
In business domains, bundling is one of the most important marketing strategies for conducting product promotions, commonly used both in online e-commerce and by offline retailers. Existing recommender systems mostly focus on recommending individual items that users may be interested in. In this paper, we target a practical but less-explored recommendation problem named bundle recommendation, which aims to offer a combination of items to users. To tackle this recommendation problem in the context of the virtual mall in online games, we formalize it as a link prediction problem on a user-item-bundle tripartite graph constructed from historical interactions, and solve it with a neural network model that can learn directly on the graph-structured data. Extensive experiments on three public datasets and one industrial game dataset demonstrate the effectiveness of the proposed method. Further, the bundle recommendation model has been deployed in production for more than one year in a popular online game developed by Netease Games, where its launch yielded an improvement of more than 60% in bundle conversion rate and a relative improvement of more than 15% in gross merchandise volume (GMV).
Table formatting is a typical task for spreadsheet users to better exhibit table structures and data relationships. But quickly and effectively formatting tables is a challenge for users. Lots of manual operations are needed, especially for complex tables. In this paper, we propose techniques for table formatting style transfer, i.e., to automatically format a target table according to the style of a reference table. Considering the latent many-to-many mappings between table structures and formats, we propose CellNet, which is a novel end-to-end, multi-task model leveraging conditional Generative Adversarial Networks (cGANs) with three key components to (1) model and recognize table structures; (2) encode formatting styles; (3) learn and apply the latent mapping based on recognized table structure and encoded style, respectively. Moreover, we build up a spreadsheet table corpus containing 5,226 tables with high-quality formats and 784 tables with human-labeled structures. Our evaluation shows that CellNet is highly effective according to both quantitative metrics and human perception studies by comparing with heuristic-based and other learning-based methods.
When reviewing documents for legal tasks such as Mergers and Acquisitions, granular information (such as start dates and exit clauses) needs to be identified and extracted. Inspired by previous work in Named Entity Recognition (NER), we investigate how NER techniques can be leveraged to aid lawyers in this review process. Due to the extremely low prevalence of target information in legal documents, we find that the traditional approach of tagging all sentences in a document is inferior, in both effectiveness and the data required to train and predict, to using a first-pass layer that identifies sentences likely to contain the relevant information and then running the more traditional sentence-level sequence tagging. Moreover, we find that such entity-level models can be improved by training on a balanced sample of relevant and non-relevant sentences. We additionally describe the use of our system in production, and how its usage by clients means that deep learning architectures tend to be cost-inefficient, especially with respect to the time necessary to train models.
Personalization is a crucial aspect of many online experiences. In particular, content ranking is often a key component in delivering sophisticated personalization results. Commonly, supervised learning-to-rank methods are applied, which suffer from bias introduced during data collection by production systems in charge of producing the ranking. To compensate for this problem, we leverage contextual multi-armed bandits. We propose novel extensions of two well-known algorithms viz. LinUCB and Linear Thompson Sampling to the ranking use-case. To account for the biases in a production environment, we employ the position-based click model. Finally, we show the validity of the proposed algorithms by conducting extensive offline experiments on synthetic datasets as well as customer facing online A/B experiments.
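A minimal sketch of the LinUCB scoring step that the paper extends, with a single shared linear model and without the ranking extension or position-based click model; all names and the simplification are illustrative:

```python
import numpy as np

class LinUCB:
    """Minimal linear UCB bandit. Maintains ridge-regression sufficient
    statistics and scores a context x by the point estimate plus an
    exploration bonus proportional to the estimate's uncertainty."""

    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)       # X^T X + I (ridge design matrix)
        self.b = np.zeros(d)     # X^T y
        self.alpha = alpha       # exploration strength

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # ridge estimate
        return x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

With no data, the score of a unit context is pure exploration bonus (equal to alpha); observing a positive reward on that context raises the point estimate. Correcting rewards for position bias, as in the position-based click model the paper employs, would happen in `update`.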
Many institutions provide investment advising services to stock investors to help them make sound investment decisions. Industry analysts at these institutions need to analyze huge volumes of financial news documents and produce investment advising reports for service subscribers. Automatic document classification is required to organize the collected financial news documents into pre-defined fine-grained categories before the document analysis tasks. Accurate fine-grained classification over massive financial documents is challenging because documents from close fine-grained categories are highly semantically similar, and existing classification methods may fail to differentiate their subtle differences. In this paper, we implement a document classification framework, named GraphSEAT, to classify financial documents for a leading financial information service provider in China. Specifically, we build a heterogeneous graph to model the global structure of the target financial documents, where documents and financial named entities are treated as nodes and a document is connected by an edge to each named entity it contains; we then train a graph convolutional network (GCN) with attention mechanisms to learn an embedding representation containing domain information for each document. We also extract semantic information from a document's word sequence with a neural sequence encoder, and finally form an overall embedding representation for the document and make the prediction by fusing the two learned representations with attention mechanisms. We perform extensive experiments on our real-world financial news dataset and three public datasets to evaluate the performance of the framework, and the experimental results demonstrate that GraphSEAT outperforms all eight baseline models compared, especially on our dataset.
Click-through rate (CTR) prediction is a critical task for many industrial systems, such as display advertising and recommender systems. Recently, modeling user behavior sequences has attracted much attention and shown great improvements in the CTR field. Existing works mainly exploit attention mechanisms based on embedding products when considering the relations between user behaviors and the target item. However, this methodology lacks concrete semantics and overlooks the underlying reasons driving a user to click on a target item. In this paper, we propose a new framework named Multiplex Target-Behavior Relation enhanced Network (MTBRN), which leverages multiplex relations between user behaviors and the target item to enhance CTR prediction. Multiplex relations carry meaningful semantics and can bring a better understanding of users' interests from different perspectives. To explore and model multiplex relations, we propose to incorporate various graphs (e.g., a knowledge graph and an item-item similarity graph) to construct multiple relational paths between user behaviors and the target item. A Bi-LSTM is then applied to encode each path in the path extractor layer. A path fusion network and a path activation network are devised to adaptively aggregate and finally learn the representation of all paths for CTR prediction. Extensive offline and online experiments clearly verify the effectiveness of our framework.
In the maritime domain, vessels typically maintain straight, predictable routes at open sea, except in the rare cases of adverse weather conditions, accidents, and traffic restrictions. Consequently, large amounts of streaming positional updates from vessels contribute hardly any additional knowledge about their actual motion patterns. We have been developing a system for vessel trajectory compression that discards a significant portion of the original positional updates with minimal trajectory reconstruction error. In this work, we present an extension of this system that allows the user to fine-tune trajectory compression according to the requirements of a given application. The extended system avoids the issues of hyper-parameter tuning, supports incremental optimization, and facilitates composite maritime event recognition. Finally, we report results from a comprehensive empirical evaluation against two real-world datasets of vessel positions.
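A simple baseline for this kind of threshold-based compression, radial-distance filtering with a tunable error bound, can be sketched as follows. This is not the system's actual algorithm, only an illustration of why straight-line vessel motion compresses so well:

```python
import math

def radial_compress(points, eps):
    """Radial-distance compression: drop a positional update unless it
    lies at least eps away from the last retained point. For vessels
    moving on straight, predictable routes, most updates fall inside
    the threshold and are discarded with small reconstruction error."""
    if not points:
        return []
    kept = [points[0]]
    for p in points[1:]:
        if math.dist(p, kept[-1]) >= eps:
            kept.append(p)
    return kept
```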
Discovering similarities between online listings is a common backend task being used across different downstream experiences in eBay. Our baseline unstructured listing similarity method relies on measuring the semantic textual similarity between the embedding vectors of listing titles. However, we discovered that even with the latest contextualized embedding methods, our similarity fails to give the proper weight to the key tokens in the title that matter. This often results in identifying listing similarities that are not sufficient, which later hurts the downstream experiences. In this paper we present a method we call "Listing2Query", or "L2Q", which uses a Sequence Labeling approach to learn token importance from our users' search queries and on-site behaviour. We used pairs of listing titles and their matching search queries, and leveraged a contextualized character language model, to train L2Q as a bidirectional recurrent neural network to produce token importance weights. We demonstrate that plugging these weights into relatively straightforward listing similarity methods is a simple way to significantly improve the similarity results, even to the extent that it consistently outperforms those created by popular representations such as BERT. Notably, this approach is not reserved to only large online marketplaces but can be generalized to other cases that include a search-driven experience and a recall set of short documents.
The goal of Jobs Marketplace at LinkedIn is to match members to promoted job postings such that job posters' ROI (the amount of money spent per job click and application) is optimized and members are presented with relevant jobs that they are interested in and qualified for. This is achieved via a first-price auction mechanism where each job provides a bid for the member that comes to the job recommendations page. This bid depends on the match of the member to the job, as well as the daily budget that remains for the job and its capability to spend it via clicks (e.g. some jobs might have more demand and find it easier to spend their budgets via clicks than others). In such a scheme, budget pacing, i.e. the capability of a job to spend its daily budget evenly or according to a preset plan, is extremely important: it enables efficient utilization of the budget by reaching a higher number of candidates, and allows a job to obey a variety of spending plans optimizing for different events such as clicks and applications.
In this paper, we propose an impression-based spend computation system, and hence an impression-based pacing scheme. This approach works by assigning a projected/expected charge amount each time a job is shown to the user, taking into account both the likelihood that the user will click the job and recommender-system-specific considerations such as the position within a page at which the job is recommended. The results of our alternate-day test show that such a scheme leads to smoother spending and improved adherence to the planned spend, and increases secondary metrics such as job clicks and applications.
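The core of an impression-based scheme is booking expected rather than realized spend at serving time. The following is a hedged sketch of that idea only; `position_discount` and the multiplicative form are hypothetical assumptions, not the paper's actual charge model:

```python
def projected_charge(bid, p_click, position, position_discount=0.9):
    """Expected spend booked at impression time: the bid times the
    predicted click probability, discounted by the slot at which the
    job is shown (a hypothetical per-position decay)."""
    return bid * p_click * (position_discount ** position)

def remaining_budget(daily_budget, impressions):
    """Pace against projected spend instead of waiting for clicks;
    impressions are (bid, p_click, position) tuples."""
    return daily_budget - sum(projected_charge(*imp) for imp in impressions)
```

Because every impression immediately moves the pacing signal, spend tracks the plan smoothly instead of jumping at each (comparatively rare) click event.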
Online auctions play a central role in online advertising, and are one of the main reasons for the industry's scalability and growth. With great changes in how auctions are organized, such as the shift from second- to first-price auctions, advertisers and demand platforms are compelled to adapt to a new volatile environment. Bid shading is a known technique for preventing overpaying in auction systems; it can help maintain the strategy equilibrium in first-price auctions, tackling one of their greatest drawbacks. In this study, we propose a machine learning approach to modeling optimal bid shading for non-censored online first-price ad auctions. We clearly motivate the approach and extensively evaluate it in both offline and online settings on a major demand side platform. The results demonstrate the superiority and robustness of the new approach compared to existing approaches across a range of performance metrics.
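The trade-off bid shading resolves can be made concrete with a toy surplus-maximization sketch. This is a generic illustration of the concept, not the paper's model; the logistic `win_prob` curve and the grid search in `shade` are hypothetical stand-ins for a learned win-rate model:

```python
import math

def win_prob(bid, k=2.0, midpoint=1.0):
    """Hypothetical logistic win-rate curve, as might be fit from
    non-censored first-price auction logs (winning prices observed)."""
    return 1.0 / (1.0 + math.exp(-k * (bid - midpoint)))

def shade(value, grid_steps=200):
    """Pick the bid below the impression's value that maximizes
    expected surplus (value - bid) * P(win | bid). Bidding the full
    value yields zero surplus in a first-price auction, so the
    optimum is always a shaded bid."""
    best_bid, best_surplus = 0.0, 0.0
    for i in range(1, grid_steps + 1):
        b = value * i / grid_steps
        s = (value - b) * win_prob(b)
        if s > best_surplus:
            best_bid, best_surplus = b, s
    return best_bid
```

With this curve, an impression valued at 2.0 is bid well below 2.0, because marginal win probability no longer justifies the extra payment.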
Prospective display advertising poses a particular challenge for large advertising platforms. Existing machine learning algorithms are easily biased towards the highly predictable retargeting events that are often non-eligible for prospective campaigns, thus exhibiting a decline in advertising performance. To that end, efforts are made to design powerful models that can learn from signals of various strength and temporal impact collected about each user from different data sources, and provide good-quality, early estimates of users' conversion rates. In this study, we propose a novel deep time-aware approach designed to model sequences of users' activities and capture implicit temporal signals of users' conversion intents. On several real-world datasets, we show that the proposed approach consistently outperforms previously proposed approaches by a significant margin while providing interpretability of each signal's impact on conversion probability.
Meta-learning approaches have shown great success in solving challenging knowledge transfer and fast adaptation problems with few samples in vision and language domains. However, few studies discuss the practice of meta-learning for large-scale industrial applications, e.g., representation learning for e-commerce platform users. Although e-commerce companies have invested considerable effort in learning accurate and expressive representations to provide a better user experience, we argue that such efforts should not stop there. Beyond learning a strong profile of user behaviors, a challenging question arises simultaneously: how to effectively transfer the learned representation and quickly adapt the learning process to subsequent learning tasks or applications.
This paper introduces the contributions that we made to address these challenges from three aspects. 1) Meta-learning model: In the context of representation learning with e-commerce user behavior data, we propose a meta-learning framework called the Meta-Profile Network, which extends the ideas of matching networks and relation networks for knowledge transfer and fast adaptation; 2) Encoding strategy: To keep high fidelity of large-scale long-term sequential behavior data, we propose a time-heatmap encoding strategy that allows the model to encode data effectively; 3) Deep network architecture: A multi-modal model combined with a multi-task learning architecture is utilized to address the cross-domain knowledge learning and insufficient-label problems. Moreover, we argue that an industrial model should not only perform well in terms of accuracy, but also exhibit robustness and well-behaved uncertainty under extreme conditions. We evaluate the performance of our model with extensive control experiments in various extreme scenarios, i.e. out-of-distribution detection, data insufficiency and class imbalance. The Meta-Profile Network shows significant improvement in model performance when compared to baseline models.
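One plausible reading of the time-heatmap encoding is bucketing a long behavior sequence into a periodic count grid, which preserves temporal structure while bounding the input size. This is a hedged illustration of that reading only; `time_heatmap` and the 7x24 weekday-by-hour layout are assumptions, not the paper's exact strategy:

```python
from datetime import datetime
import numpy as np

def time_heatmap(timestamps):
    """Bucket behavior events into a 7x24 (weekday x hour) count grid,
    so arbitrarily long histories map to a fixed-size, periodicity-aware
    encoding."""
    grid = np.zeros((7, 24))
    for t in timestamps:
        grid[t.weekday(), t.hour] += 1.0
    return grid

# two Monday-9am clicks and one Sunday-11pm click
clicks = [datetime(2020, 1, 6, 9), datetime(2020, 1, 6, 9, 30),
          datetime(2020, 1, 12, 23)]
heat = time_heatmap(clicks)
```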
Recommender system (RS) has become a crucial module in most web-scale applications. Recently, most RSs take the waterfall form based on the cloud-to-edge framework, where recommended results are computed in advance in the cloud server and transmitted to the edge (e.g., the user's mobile device). Despite its effectiveness, network bandwidth and latency between cloud server and edge may delay system feedback and user perception. Hence, real-time computing on the edge could help capture user preferences more precisely and thus make more satisfactory recommendations. Our work, to the best of our knowledge, is the first attempt to design and implement a novel Recommender System on Edge (EdgeRec), which achieves Real-time User Perception and Real-time System Feedback. Moreover, we propose Heterogeneous User Behavior Sequence Modeling and Context-aware Reranking with Behavior Attention Networks to capture users' diverse interests and adjust recommendation results accordingly. Experimental results from both offline evaluation and online performance in Taobao home-page feeds demonstrate the effectiveness of EdgeRec.
The availability of high-frequency trade data has made intraday forecasting of price patterns possible. With the help of technical indicators, recent studies have shown that LSTM-based deep learning models are able to predict price directions (a binary classification problem) with performance better than a random guess. However, only naive recurrent networks were adopted, and these works did not compare with the tools used by finance practitioners. Our experiments show that GARCH beats their LSTM models by a large margin.
We propose instead to adopt an autoregressive recurrent network, so that the loss of the prediction at every time step contributes to model training; we also treat a rich set of technical indicators at each time step as covariates to enhance the model input. Finally, we treat the problem of price pattern forecasting as a regression problem on the price itself; even for price direction prediction, we show that our performance is much better than when the problem is modeled as binary classification. Only when all these designs are adopted can an LSTM model beat GARCH (and by a large margin).
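The contrast with a single end-of-sequence label can be sketched as a per-step regression objective. This is an illustrative simplification, not the paper's training code; `step_fn` is a hypothetical placeholder for the trained LSTM cell:

```python
import numpy as np

def sequence_loss(prices, covariates, step_fn):
    """Autoregressive training objective: predict the next price from
    the previous price plus that step's technical indicators, so every
    step's squared error contributes to the loss (unlike a single
    end-of-sequence binary direction label)."""
    errs = []
    for t in range(1, len(prices)):
        pred = step_fn(prices[t - 1], covariates[t - 1])
        errs.append((pred - prices[t]) ** 2)
    return float(np.mean(errs))

# a perfect toy 'model' stands in for the LSTM, just to exercise the loss
demo_loss = sequence_loss([1.0, 2.0, 3.0], [None, None, None],
                          lambda prev_price, cov: prev_price + 1.0)
```

Because direction can be read off the sign of the predicted price change, a well-fit regressor subsumes the binary classification task rather than discarding magnitude information.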
This work corrects the poor use of LSTM networks in recent studies, and provides "the" baseline that is able to fully unleash the power of LSTMs for future work to compare with. Moreover, since our model is a price regressor with very good prediction performance, it can serve as a valuable tool for designing trading strategies (including day trading). Our model has been used by quantitative analysts at Freddie Mac for over a quarter, and is found to be more effective than traditional GARCH variants in market prediction.
Recommender Systems have been playing essential roles in e-commerce portals. Existing recommendation algorithms usually learn the ranking scores of items by optimizing a single task (e.g. Click-through rate prediction) based on users' historical click sequences, but they generally pay little attention to simultaneously modeling users' multiple types of behaviors or jointly optimizing multiple objectives (e.g. both Click-through rate and Conversion rate), both of which are vital for e-commerce sites. In this paper, we argue that it is crucial to formulate users' different interests based on multiple types of behaviors and perform multi-task learning for significant improvement in multiple objectives simultaneously. We propose Deep Multifaceted Transformers (DMT), a novel framework that can model users' multiple types of behavior sequences simultaneously with multiple Transformers. It utilizes Multi-gate Mixture-of-Experts to optimize multiple objectives. Besides, it exploits unbiased learning to reduce the selection bias in the training data. Experiments on a JD real production dataset demonstrate the effectiveness of DMT, which significantly outperforms state-of-the-art methods. DMT has been successfully deployed to serve the main traffic in the commercial Recommender System of JD.com. To facilitate future research, we release the codes and datasets at https://github.com/guyulongcs/CIKM2020_DMT.
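The Multi-gate Mixture-of-Experts idea used here for multi-objective optimization can be sketched in a few lines: a shared pool of experts, with a separate softmax gate per task. This toy forward pass with linear experts is a generic illustration of MMoE, not DMT's architecture; the shapes and the linear experts are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mmoe_forward(x, experts, gates):
    """Multi-gate Mixture-of-Experts: every task (e.g. CTR, CVR) mixes
    the same expert pool through its own input-dependent gate, so tasks
    share capacity without being forced to share one representation."""
    expert_out = np.stack([W @ x for W in experts])   # (n_experts, d_out)
    task_outputs = []
    for G in gates:                                   # one gate per task
        w = softmax(G @ x)                            # (n_experts,)
        task_outputs.append(w @ expert_out)           # weighted mixture
    return task_outputs

# toy shapes: 3 experts, 2 tasks; task towers would consume each output
x = rng.normal(size=4)
experts = [rng.normal(size=(5, 4)) for _ in range(3)]
gates = [rng.normal(size=(3, 4)) for _ in range(2)]
outs = mmoe_forward(x, experts, gates)
```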
For e-commerce platforms such as Taobao and Amazon, advertisers play an important role in the entire digital ecosystem: their behaviors explicitly influence users' browsing and shopping experience; more importantly, advertisers' expenditure on advertising constitutes a primary source of platform revenue. Therefore, providing better services for advertisers is essential for the long-term prosperity of e-commerce platforms. To achieve this goal, the ad platform needs an in-depth understanding of advertisers in terms of both their marketing intents and their satisfaction with advertising performance, based on which further optimization can be carried out to serve advertisers in the right direction. In this paper, we propose a novel Deep Satisfaction Prediction Network (DSPN), which models advertiser intent and satisfaction simultaneously. It employs a two-stage network structure where the advertiser intent vector and satisfaction are jointly learned by considering the features of advertisers' action information and advertising performance indicators. Experiments on an Alibaba advertisement dataset and online evaluations show that our proposed DSPN outperforms state-of-the-art baselines and has stable performance in terms of AUC in the online environment. Further analyses show that DSPN not only predicts advertisers' satisfaction accurately but also learns an explainable advertiser intent, revealing opportunities to further optimize advertising performance.
Ranking is the most important component in a search system. Most search systems deal with large amounts of natural language data; hence an effective ranking system requires a deep understanding of text semantics. Recently, deep learning based natural language processing (deep NLP) models have generated promising results on ranking systems. BERT is one of the most successful models for learning contextual embeddings, and it has been applied to capture complex query-document relations for search ranking. However, this is generally done by exhaustively interacting each query word with each document word, which is inefficient for online serving in production search systems. In this paper, we investigate how to build an efficient BERT-based ranking model for industry use cases. The solution is further extended to a general ranking framework, DeText, that is open sourced and can be applied to various ranking productions. Offline and online experiments of DeText on three real-world search systems present significant improvement over state-of-the-art approaches.
If one customer buys a tennis racket, what are the best 3 complementary products to purchase together: 3 tennis ball packs, 3 headbands, 3 overgrips, or 1 of each? Complementary product recommendation (CPR), which aims to provide product suggestions that are often bought together to serve a joint demand, forms a pivotal component of e-commerce service; however, existing methods are far from optimal. Given one product, how to recommend its complementary products of different types is the key problem we tackle in this work. We first conduct an extensive analysis to correct the inaccurate assumptions adopted by existing work, showing that co-purchased products are not always complementary, and we propose a new strategy to generate clean distant supervision labels for CPR modeling. Moreover, to bridge the gap left by existing work, namely that CPR requires not only relevance modeling but also diversity to fulfill the whole purchase demand, we develop a deep learning framework, P-Companion, that explicitly models both relevance and diversity. More specifically, given one product with its product type, P-Companion first uses an encoder-decoder network to predict multiple complementary product types; then a transfer metric learning network projects the embedding of the query product into each predicted complementary product type subspace and learns the complementary relationship from the distant supervision labels within each subspace. The whole framework can be trained end-to-end and is robust to cold-start products thanks to a novel pretrained product embedding module named Product2Vec, based on graph attention networks. Extensive offline experiments show that P-Companion outperforms state-of-the-art baselines by a 7.1% increase in the Hit@10 score with well-controlled diversity.
Production-wise, we deploy P-Companion to provide online recommendations for over 200M products at Amazon and observe significant gains on product sales and profit.
Aiming to effectively distinguish loan default in the Mobile Credit Payment Service, industrial efforts mainly attempt to employ conventional classifiers with complicated feature engineering for prediction. However, these solutions fail to exploit the multiplex relations that exist in financial scenarios and ignore the key intrinsic properties of loan default detection, i.e., communicability, complementation and induction. To address these issues, we develop a novel attributed multiplex graph based loan default detection approach for effectively integrating multiplex relations in financial scenarios. Considering the complexity of the financial scenario, an Attributed Multiplex Graph (AMG) is proposed to jointly model various relations and objects as well as the rich attributes on nodes and edges. We elaborately design relation-specific receptive layers equipped with an adaptive breadth function to incorporate important information derived from the local structure in each aspect of AMG, and stack multiple propagation layers to explore high-order connectivity information. Furthermore, a relation-specific attention mechanism is adopted to emphasize relevant information during end-to-end training. Extensive experiments conducted on a large-scale real-world dataset verify the effectiveness of the proposed model compared with the state of the art. Moreover, AMG-DP has achieved a performance improvement of 9.37% on the KS metric in recent months after successful deployment in the Alipay APP.
Identifying the faulty class of multivariate time series is crucial for today's flight data analysis. However, most existing time series classification methods suffer from imbalanced data and lack of model interpretability, especially on flight data, where faulty events are usually uncommon and come with a limited amount of data. Here, we present a neural network classification model for imbalanced multivariate time series that leverages the information learned from the normal class. The model can also learn the nonlinear Granger causality for each class, so that we can pinpoint how time series classes differ from each other. Experiments on simulated data and real flight data show that this model achieves high accuracy in identifying anomalous flights.
Flight itinerary ranking is critical for Online Travel Agencies (OTAs) since more and more customers book flights online. Currently, most OTAs still adopt rule-based strategies. However, rule-based methods are not able to model context-aware information and user preferences. To this end, a novel Personalized Flight itinerary Ranking Network (PFRN) is proposed in this paper. In PFRN, a Listwise Feature Encoding (LFE) structure is proposed to capture global context-aware information and mutual influences among inputs. Besides, we utilize behaviors of both the individual user and group users sharing the same intention to express user preferences. A User Attention Mechanism is then proposed to rank flight itineraries based on these user preferences effectively and efficiently. Offline experiments on real-world datasets from Amadeus and Fliggy show the superior performance of the proposed PFRN. Moreover, PFRN has been successfully deployed on the online itinerary search system at Fliggy and achieved significant improvements.
Person-job fit is to match candidates and job posts on online recruitment platforms using machine learning algorithms. The effectiveness of matching algorithms heavily depends on the learned representations for the candidates and job posts. In this paper, we propose to learn comprehensive and effective representations of the candidates and job posts via feature fusion. First, in addition to applying deep learning models to process the free text in resumes and job posts, as adopted by existing methods, we extract semantic entities from the whole resume (and job post) and then learn features for them. By fusing the features from the free text and the entities, we get a comprehensive representation of the information explicitly stated in the resume and job post. Second, some information about a candidate or a job may not be explicitly captured in the resume or job post. Nonetheless, historical applications, including accepted and rejected cases, can reveal some implicit intentions of the candidates or recruiters. Therefore, we propose to learn representations of implicit intentions by processing the historical applications using LSTM. Last, by fusing the representations of the explicit and implicit intentions, we get a more comprehensive and effective representation for person-job fit. Experiments over 10 months of real data show that our solution outperforms existing methods by a large margin. Ablation studies confirm the contribution of each component of the fused representation. The extracted semantic entities help interpret the matching results in the case study.
As the largest professional network, LinkedIn hosts millions of user profiles and job postings. Users effectively find what they need by entering search queries. However, finding what they are looking for can be a challenge, especially if they are unfamiliar with specific keywords from their industry. Query Suggestion is a popular feature where a search engine can suggest alternate, related queries. At LinkedIn, we have productionized a deep learning Seq2Seq model to transform an input query into several alternatives. This model is trained by examining search history directly typed by users. Once online, we can determine whether or not users clicked on suggested queries. This new feedback data indicates which suggestions caught the user's attention. In this work, we propose training a model with both the search history and user feedback datasets. We examine several ways to incorporate feedback without any architectural change, including adding a novel pairwise ranking loss term during training. The proposed new training technique produces the best combined score out of several alternatives in offline metrics. Deployed in the LinkedIn search engine, it significantly outperforms the control model with respect to key business metrics.
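A pairwise ranking term of the general kind described above can be sketched as a hinge on the score gap between a clicked and a skipped suggestion. This is a hypothetical form for illustration; the paper's exact loss may differ:

```python
def pairwise_hinge_loss(score_clicked, score_skipped, margin=1.0):
    """Zero once the clicked suggestion outscores the skipped one by at
    least `margin`; otherwise penalize the shortfall. Added to the usual
    Seq2Seq likelihood term, it injects the click-feedback signal
    without any architectural change."""
    return max(0.0, margin - (score_clicked - score_skipped))
```

During training such a term is summed over (clicked, skipped) suggestion pairs from the feedback logs and added to the base sequence loss.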
We present Magellan, a personalized travel recommendation system built entirely from card transaction data. The data logs contain extensive metadata for each transaction between a user and a merchant. We describe the procedure employed to extract travel itineraries from such transaction data. Unlike traditional approaches, we formulate the recommendation problem in two steps: (1) predict coarse-granularity information such as the location and category of the next merchant; and (2) provide fine-granularity individual merchant recommendations based on the predicted location and category. The breakdown helps us build a scalable recommendation system. We propose a quadtree-based algorithm that provides an adaptive spatial resolution for the location classes in our first step while also reducing the class imbalance across location labels. Finally, we propose a novel neural architecture, SoLEmNet, that implicitly learns the inherent class label hierarchy and achieves higher performance on our dataset compared to previous baselines.
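The adaptive-resolution idea can be sketched with a plain quadtree: cells split only where transactions are dense, so every leaf (location class) holds a comparable number of points. This is a minimal illustration under stated assumptions, not the paper's algorithm; `build_quadtree`, its capacity rule and depth cap are simplifications:

```python
def build_quadtree(points, bounds, max_points, depth=0, max_depth=12):
    """Recursively split a cell until it holds at most `max_points`
    transactions; dense areas get finer cells, which both adapts the
    spatial resolution and evens out class sizes."""
    x0, y0, x1, y1 = bounds
    inside = [(x, y) for x, y in points if x0 <= x < x1 and y0 <= y < y1]
    if len(inside) <= max_points or depth >= max_depth:
        return [bounds]                  # this cell is one location class
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    leaves = []
    for sub in ((x0, y0, mx, my), (mx, y0, x1, my),
                (x0, my, mx, y1), (mx, my, x1, y1)):
        leaves += build_quadtree(inside, sub, max_points, depth + 1, max_depth)
    return leaves
```

Three points clustered in one corner plus an isolated point produce deep subdivision only around the cluster, leaving the sparse quadrants as single coarse classes.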
This study examines the impact of the 'diversity' of product recommendations on the 'preference' of a customer, using online/offline data from a leading fashion company. First, through interviews with fashion professionals, we categorized the characteristics of customers into four types: gift, coordinator, carry-over, and trendsetter. Then, using a hybrid filtering method, we increased the accuracy and diversity of recommended products. We derived 13 salient features that reflect customer behavior based on the Purchase Funnel model and built a classification model that predicts a customer's preference rates. Second, we conducted two large-scale user tests with 20,000 real customers to verify the effectiveness of our recommendation system. The results empirically demonstrated the importance of diversity in recommended products: the more diverse the product recommendations, the higher the observed purchase rate, average purchase amount, and cross-purchase rate. In addition, we tracked the customers' purchases for two months after the user tests and found that diverse product exposure positively influenced customer retention (e.g., repurchase rate, amount).
Pre-sales customer service is of importance to E-commerce platforms as it contributes to optimizing customers' buying process. To better serve users, we propose AliMe KG, a domain knowledge graph in E-commerce that captures user problems, points of interest (POI), item information and relations thereof. It helps to understand user needs, answer pre-sales questions and generate explanation texts. We applied AliMe KG to several online business scenarios such as shopping guide, question answering over properties and selling point generation, and gained positive and beneficial business results. In the paper, we systematically introduce how we construct a domain knowledge graph from free text, and demonstrate its business value with several applications. Our experience shows that mining structured knowledge from free text in a vertical domain is practicable, and can be of substantial value in industrial settings.
Student performance prediction is critical to online education. It can benefit many downstream tasks on online learning platforms, such as estimating dropout rates, facilitating strategic intervention, and enabling adaptive online learning. Interactive online question pools provide students with interesting interactive questions to practice their knowledge in online education. However, little research has been done on student performance prediction in interactive online question pools. Existing work on student performance prediction targets online learning platforms with a predefined course curriculum and accurate knowledge labels, like MOOC platforms, but it is not able to fully model the knowledge evolution of students in interactive online question pools. In this paper, we propose a novel approach using Graph Neural Networks (GNNs) to achieve better student performance prediction in interactive online question pools. Specifically, we model the relationship between students and questions using student interactions to construct the student-interaction-question network, and further present a new GNN model, called R2GCN, which intrinsically works for heterogeneous networks, to achieve generalizable student performance prediction in interactive online question pools. We evaluate the effectiveness of our approach on a real-world dataset consisting of 104,113 mouse trajectories generated in the problem-solving process of over 4,000 students on 1,631 questions. The experiment results show that our approach can achieve much higher accuracy of student performance prediction than both traditional machine learning approaches and GNN models.
Online electronic coupons (e-coupons) are becoming a primary tool for e-commerce platforms to attract users to place orders. E-coupons are the digital equivalent of traditional paper coupons, providing customers with discounts or gifts. A fundamental related problem is how to deliver e-coupons at minimal cost while maximizing users' willingness to place an order; we call this the coupon allocation problem. It is non-trivial: the number of regular users on a mature e-platform often reaches hundreds of millions, multiple types of e-coupons must be allocated, the policy space is extremely large, and the online allocation has to satisfy a budget constraint. Besides, one can never observe the responses of one user under different policies, which increases the uncertainty of the policy-making process. Previous work fails to deal with these challenges. In this paper, we decompose the coupon allocation task into two subtasks: user intent detection and allocation. Accordingly, we propose a two-stage solution: at the first stage (detection stage), we put forward a novel Instantaneous Intent Detection Network (IIDN) which takes the user-coupon features as input and predicts user real-time intents; at the second stage (allocation stage), we model the allocation problem as a Multiple-Choice Knapsack Problem (MCKP) and provide a computationally efficient allocation method using the intents predicted at the detection stage. Long Short-Term Memory (LSTM) and a special attention mechanism are applied in IIDN to better describe the temporal dependencies of sequential features. We also solve the imbalanced-label problem for user intent detection from a new perspective, using the logical relationship between multiple user intents.
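The MCKP view of the allocation stage can be illustrated with a greedy heuristic: each user is a "class" from which at most one coupon option may be chosen, subject to a global budget. This is a simplification for intuition, not the paper's allocation method; ranking by raw lift-per-cost (rather than incremental ratios) and the toy inputs are assumptions:

```python
def allocate_coupons(users, budget):
    """Greedy heuristic for the Multiple-Choice Knapsack framing of
    coupon allocation: rank every (user, coupon) option by predicted
    intent lift per unit cost, then give each user at most one coupon
    while the budget lasts. `users` is a list of per-user option lists
    of (lift, cost) pairs, with lifts from the detection stage."""
    chosen, spent = {}, 0.0
    candidates = sorted(
        ((lift / cost, lift, cost, u)
         for u, options in enumerate(users)
         for lift, cost in options if cost > 0),
        reverse=True)
    for _, lift, cost, u in candidates:
        if u not in chosen and spent + cost <= budget:
            chosen[u] = (lift, cost)
            spent += cost
    return chosen, spent
```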
We conduct extensive online and offline experiments and the results show the superiority of our proposed framework, which has brought great profits to the platform and continues to function online.
Traditional Learning to Rank (LTR) models in E-commerce are usually trained on logged data from a single domain. However, data may come from multiple domains, such as hundreds of countries in international E-commerce platforms. Learning a single ranking function obscures domain differences, while learning multiple functions, one per domain, may also be inferior due to ignoring the correlations between domains. The problem can be formulated as multi-task learning where multiple tasks share the same feature and label space. To solve this problem, which we name Multi-Scenario Learning to Rank, we propose the Hybrid of implicit and explicit Mixture-of-Experts (HMoE) approach. Our proposed solution takes advantage of Multi-task Mixture-of-Experts to implicitly identify distinctions and commonalities between tasks in the feature space, and improves performance with a stacked model that learns task relationships in the label space explicitly. Furthermore, to enhance flexibility, we propose an end-to-end optimization method with a task-constrained back-propagation strategy. We empirically verify that this optimization method is more effective than the two-stage optimization required by the stacked approach. Experiments on real-world industrial datasets demonstrate that HMoE significantly outperforms popular multi-task learning methods. HMoE is in use in the search system of AliExpress and achieved a 1.92% revenue gain in a one-week online A/B test. We also release a sampled version of our dataset to facilitate future research.
In tag-enhanced video recommendation systems, videos are attached with tags that highlight the contents of videos from different aspects. Tag ranking in such recommendation systems provides personalized tag lists for videos from their tag candidates. A better tag ranking model could attract users to click more tags, enter the corresponding tag channels, and watch more tag-specific videos, which improves both tag click rate and video watching time. However, most conventional tag ranking models merely concentrate on tag-video relevance or tag-related behaviors, ignoring the rich information in video-related behaviors; we should consider user preferences on both tags and videos. In this paper, we propose a novel Graph neural network based tag ranking (GraphTR) framework over a huge heterogeneous network with video, tag, user and media nodes. We design a novel graph neural network that combines multi-field transformer, GraphSAGE and neural FM layers in node aggregation. We also propose a neighbor-similarity based loss to encode various user preferences into heterogeneous node representations. In experiments, we conduct both offline and online evaluations on a real-world video recommendation system in WeChat Top Stories. The significant improvements in both video and tag related metrics confirm its effectiveness and robustness in real-world tag-enhanced video recommendation. GraphTR has been deployed on WeChat Top Stories for more than six months. The source code is available at https://github.com/lqfarmer/GraphTR.
Inferring substitutable and complementary items is an important and fundamental concern for recommendation on e-commerce websites. However, item relationships in the real world are usually heterogeneous, posing great challenges to conventional methods that can only deal with homogeneous relationships. More specifically, for this problem, there is a lack of in-depth investigation into 1) decoupling item semantics for modeling heterogeneous item relationships, and at the same time, 2) incorporating the mutual influence between different relationships. To fill this gap, we propose a novel solution, namely Decoupled Graph Convolutional Network (DecGCN), to solve the problem of inferring substitutable and complementary items. DecGCN is designed to model item substitutability and complementarity in separate embedding spaces, and is equipped with a two-step integration scheme, where inherent influences between 1) different graph structures and 2) different item semantics are captured. Our experiments on three real-world datasets demonstrate that DecGCN is more effective than the state-of-the-art baselines for the problem at hand. We also conduct offline and online A/B tests on large-scale industrial data, where the results show that DecGCN can be effectively deployed in real-world applications. We release the codes at https://github.com/liuyiding1993/CIKM2020_DecGCN.
With the revolution of mobile internet, online finance has grown explosively. In this new area, one challenge of significant importance is how to effectively deliver the financial products or services to a set of target users by marketing. Given a product or service to be promoted and a set of users as seeds, audience expansion is such a targeting technique, which aims to find potential audience among a large number of users. However, in the context of finance, financial products and services are dynamic in nature as they co-vary with the socio-economic environment. Moreover, marketing campaigns for promoting products or services always consist of different rules of play, even for the same type of products or services. As a result, there is a strong demand for the timeliness of seeds in financial targeting. Conventional one-stage audience expansion methods, which generate expanded users by expanding over seeds, would encounter two problems under this setting: (1) the seeds would inevitably involve a number of users that are not representative for expansion, and direct expansion over these noisy seeds would dramatically deteriorate the performance; (2) one-stage expansion over fixed seeds cannot timely and accurately capture users' preferences over the currently running campaign due to the lack of timeliness of seeds.
To address the above challenges, in this paper we present Hubble, a novel two-stage audience expansion system. In the first, cold-start stage, a reweighting mechanism is devised to suppress the noise within seeds, motivated by the observed relationship between golden seeds and their density in the embedding space. As feedback is incrementally collected from users, we include it to guide subsequent audience expansion in the second stage. However, the distribution of this feedback is usually biased and cannot fully characterize the distribution of all target audiences. Therefore, we propose a method that incorporates the biased feedback with the seeds in a meta-learning manner to pan for golden seeds in the noisy seed set. Finally, we conduct extensive experiments on three real datasets and online A/B testing, which demonstrate the effectiveness of the proposed method. In addition, we release two datasets to boost the study of this new research topic.
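The density-based seed reweighting in the first stage can be illustrated with a toy sketch: seeds lying in denser regions of the embedding space (closer to their nearest neighbors) receive higher weight, so isolated, likely-noisy seeds are down-weighted. The weighting function and the function name `seed_weights` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def seed_weights(embeddings, k=3):
    """Toy density-based reweighting of seeds: a seed's weight is the
    inverse of its average distance to its k nearest fellow seeds, so
    seeds in dense regions score high and outliers score low.
    (Hypothetical variant; Hubble's exact weighting is not given in the
    abstract.)"""
    n = len(embeddings)
    weights = []
    for i in range(n):
        dists = sorted(
            math.dist(embeddings[i], embeddings[j])
            for j in range(n) if j != i
        )
        avg_knn = sum(dists[:k]) / k            # distance to k nearest seeds
        weights.append(1.0 / (1e-8 + avg_knn))  # inverse distance ~ density
    total = sum(weights)
    return [w / total for w in weights]         # normalize to sum to 1
```

With three clustered seeds and one far-away outlier, the outlier receives the smallest weight, which is the behavior the noise-suppression stage relies on.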
Generalized additive models (GAMs) are one of the most popular methods for building intelligible models for classification and regression problems. The most accurate GAMs are usually fitted via gradient boosting with bagged shallow trees. However, this method can be expensive for large industrial applications. In this work, we aim to improve the training efficiency of GAMs. To this end, we propose to use subsample aggregating (subagging) in place of bootstrap aggregating (bagging). Our key observation is that subsamples of reasonable size (e.g., 60% of the training set) usually overlap. This property allows us to exploit the computation ordering inside a subagged ensemble, and we present a novel algorithm that speeds up the computation of a subagged ensemble with no loss of accuracy. Our experimental results on public datasets demonstrate that the proposed method achieves up to 3.7x speedup over bagged ensembles with comparable accuracy. Finally, we demonstrate our methodology of finding global explanations on a real application at Alipay. We developed several strategies based on the findings of those explanations and found that they achieved significant lifts on key metrics in online experiments.
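For readers unfamiliar with subagging, here is a minimal sketch of the bagging-to-subagging swap the abstract describes: each base model is fit on a random subsample drawn *without* replacement (here 60% of the data, matching the example in the abstract), and predictions are averaged. The `fit`/`predict` callables and the function name are illustrative; the paper's shared-computation speedup, which exploits the overlap between subsamples, is not shown.

```python
import random

def subagging_fit_predict(X, y, fit, predict, n_models=5, frac=0.6, seed=0):
    """Subsample aggregating (subagging): unlike bagging's bootstrap
    samples (with replacement), each base model sees a random `frac`
    subsample drawn without replacement; the ensemble averages their
    predictions."""
    rng = random.Random(seed)
    n = len(X)
    k = int(frac * n)
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(n), k)          # subsample, no replacement
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))

    def ensemble_predict(x):
        preds = [predict(m, x) for m in models]
        return sum(preds) / len(preds)         # aggregate by averaging
    return ensemble_predict
```

Because 60% subsamples of the same training set necessarily share many rows, partial results can be reused across base models, which is the ordering property the paper's speedup algorithm builds on.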
Pre-trained language models have achieved great success in a wide variety of natural language processing (NLP) tasks, but their superior performance comes with a high demand for computational resources, which hinders their application in low-latency information retrieval (IR) systems. To address this problem, we present the TwinBERT model, which makes two improvements: 1) it represents the query and document separately using twin-structured encoders, and 2) each encoder is a highly compressed BERT-like model with fewer than one third of the parameters. The former allows document embeddings to be pre-computed offline and cached in memory, in contrast to BERT, where the two input sentences are concatenated and encoded together. This change saves a large amount of computation time; however, it is still not sufficient for real-time retrieval given the complexity of the BERT model itself. To further reduce the computational cost, a compressed multi-layer transformer encoder is proposed, with special training strategies, as a substitute for the original complex BERT encoder. Lastly, two versions of TwinBERT are developed that combine the query and keyword embeddings for the retrieval and relevance tasks, respectively. Both meet the real-time latency requirement and achieve performance close to or on par with the BERT-Base model.
The models were trained following the teacher-student framework and evaluated with data from one of the major search engines. Experimental results showed that inference time was significantly reduced and, for the first time, controlled within 20ms on CPUs, while the performance gain from the fine-tuned BERT-Base model was mostly retained. Integration of the models into production systems also demonstrated remarkable improvements on relevance metrics with negligible influence on latency. The models were released in 2019 with significant production impact.
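The latency saving from the twin structure comes from where encoding happens: documents are encoded once, offline, and cached, so only the short query is encoded at serving time. The sketch below illustrates that serving path with a trivial bag-of-words stand-in for the encoder; the encoder, vocabulary, and all names here are illustrative assumptions, since TwinBERT's actual encoders are learned compressed transformers.

```python
import math

# Tiny fixed vocabulary for the stand-in encoder (illustrative only).
VOCAB = {"python": 0, "tutorial": 1, "learn": 2, "cheap": 3,
         "flights": 4, "to": 5, "rome": 6}

def encode(text):
    """Stand-in for a compressed BERT-like encoder: bag-of-words counts.
    TwinBERT learns a multi-layer transformer for each tower instead."""
    v = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in VOCAB:
            v[VOCAB[tok]] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Twin structure: the document tower runs offline and its outputs are cached...
doc_cache = {d: encode(d) for d in ["python tutorial", "cheap flights to rome"]}

def score(query, doc):
    # ...so at serving time only the query is encoded, then matched by cosine.
    return cosine(encode(query), doc_cache[doc])
```

This is exactly why concatenating query and document (as in vanilla BERT cross-encoding) forbids caching: the document representation would depend on the query.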
In the online advertising industry, the process of designing an ad creative (i.e., ad text and image) requires manual labor. Typically, each advertiser launches multiple creatives via online A/B tests to infer effective creatives for the target audience, which are then refined further in an iterative fashion. Due to the manual nature of this process, it is time-consuming to learn, refine, and deploy the modified creatives. Since major ad platforms typically run A/B tests for multiple advertisers in parallel, we explore the possibility of collaboratively learning ad creative refinement via the A/B tests of multiple advertisers. In particular, given an input ad creative, we study approaches to refine the given ad text and image by: (i) generating new ad text, (ii) recommending keyphrases for new ad text, and (iii) recommending image tags (objects in the image) for selecting a new ad image. Based on A/B tests conducted by multiple advertisers, we form pairwise examples of inferior and superior ad creatives and use such pairs to train models for the above tasks. For generating new ad text, we demonstrate the efficacy of an encoder-decoder architecture with a copy mechanism, which allows some words from the (inferior) input text to be copied to the output while incorporating new words associated with a higher click-through rate. For the keyphrase and image tag recommendation tasks, we demonstrate the efficacy of a deep relevance matching model, as well as the relative robustness of ranking approaches compared to ad text generation in cold-start scenarios with unseen advertisers. We also share broadly applicable insights from our experiments using data from the Yahoo Gemini ad platform.
Natural Language Understanding (NLU) models on voice-controlled speakers face several challenges. In particular, music streaming services have large catalogs, often containing millions of songs, artists, and albums, and several thousand custom playlists and stations. In many cases there is ambiguity and little structural difference between carrier phrases and entity names. In this work, we describe how we leveraged multi-armed bandits in combination with implicit customer feedback to improve the accuracy and personalization of responses to voice requests in the music domain. Our models are tested in a large-scale industrial system containing several other components. In particular, we focused on using this technology to correct errors made by upstream NLU models and to personalize responses based on customer preferences and music provider functionality. The models resulted in a significant improvement in playback rate for Amazon Music and are deployed in systems serving several countries and languages. We further used customers' implicit feedback to generate weakly labeled training data for the NLU models. This improved the experience for customers using other music providers on all Alexa devices.
Click-through rate (CTR) prediction is a critical task in online advertising systems. Existing works mainly address the single-domain CTR prediction problem and model aspects such as feature interaction, user behavior history and contextual information. Nevertheless, ads are usually displayed with natural content, which offers an opportunity for cross-domain CTR prediction. In this paper, we address this problem and leverage auxiliary data from a source domain to improve the CTR prediction performance of a target domain. Our study is based on UC Toutiao (a news feed service integrated with the UC Browser App, serving hundreds of millions of users daily), where the source domain is the news and the target domain is the ad. In order to effectively leverage news data for predicting CTRs of ads, we propose the Mixed Interest Network (MiNet) which jointly models three types of user interest: 1) long-term interest across domains, 2) short-term interest from the source domain and 3) short-term interest in the target domain. MiNet contains two levels of attentions, where the item-level attention can adaptively distill useful information from clicked news / ads and the interest-level attention can adaptively fuse different interest representations. Offline experiments show that MiNet outperforms several state-of-the-art methods for CTR prediction. We have deployed MiNet in UC Toutiao and the A/B test results show that the online CTR is also improved substantially. MiNet now serves the main ad traffic in UC Toutiao.
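The item-level attention in the abstract above can be sketched in a parameter-free form: each clicked item is scored against the candidate ad, the scores are softmax-normalized, and the clicked items are pooled into one interest vector. This is only an illustrative variant under simplifying assumptions (a plain dot-product scorer, hypothetical function name); MiNet's actual attention uses learned parameters.

```python
import math

def item_level_attention(clicked, candidate):
    """Toy item-level attention: weight each clicked news/ad embedding by
    its dot-product relevance to the candidate ad, then pool by the
    softmax weights to distill a single interest vector."""
    scores = [sum(c * q for c, q in zip(item, candidate)) for item in clicked]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(candidate)
    pooled = [sum(w * item[d] for w, item in zip(weights, clicked))
              for d in range(dim)]
    return pooled, weights
```

A clicked item aligned with the candidate receives most of the weight, which is how attention "adaptively distills useful information" from the click history for a given ad.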
To drive purchases in online advertising, it is of great interest to advertisers to optimize the sequential advertising strategy, whose performance and interpretability are both important. The lack of interpretability in existing deep reinforcement learning methods makes the strategy hard to understand, diagnose, and further optimize. In this paper, we propose our Deep Intents Sequential Advertising (DISA) method to address these issues. The key to interpretability is understanding a consumer's purchase intent, which is, however, unobservable (a hidden state). In this paper, we model this intent as a latent variable and formulate the problem as a Partially Observable Markov Decision Process (POMDP), where the underlying intents are inferred from observable behaviors. Large-scale industrial offline and online experiments demonstrate our method's superior performance over several baselines. The inferred hidden states are analyzed, and the results demonstrate the soundness of our inference.
Rich user behavior data has been proven to be of great value for click-through rate prediction tasks, especially in industrial applications such as recommender systems and online advertising. Both industry and academia have paid much attention to this topic and proposed different approaches to modeling long sequential user behavior data. Among them, the memory-network-based model MIMN, proposed by Alibaba, achieves state-of-the-art performance with a co-design of the learning algorithm and serving system. MIMN is the first industrial solution that can model sequential user behavior data with length scaling up to 1000. However, MIMN fails to precisely capture user interest in a specific candidate item when the length of the user behavior sequence increases further, say, by 10 times or more. This challenge exists widely in previously proposed approaches.
In this paper, we tackle this problem by designing a new modeling paradigm, which we name the Search-based Interest Model (SIM). SIM extracts user interests with two cascaded search units: (i) the General Search Unit (GSU) performs a general search over the raw, arbitrarily long sequential behavior data, using query information from the candidate item, and obtains a Sub user Behavior Sequence (SBS) that is relevant to the candidate item; (ii) the Exact Search Unit (ESU) models the precise relationship between the candidate item and the SBS. This cascaded search paradigm gives SIM a better ability to model lifelong sequential behavior data in terms of both scalability and accuracy. Apart from the learning algorithm, we also share our hands-on experience of implementing SIM in large-scale industrial systems. Since 2019, SIM has been deployed in the display advertising system at Alibaba, bringing a 7.1% CTR and 4.4% RPM lift, which is significant to the business. Now serving the main traffic in our real system, SIM models sequential user behavior data with maximum length reaching 54000, extending the prior state of the art by 54x.
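The GSU step can be sketched as a simple top-k retrieval over the lifelong behavior sequence: score every past behavior against the candidate item and keep only the most relevant ones as the SBS, which the (more expensive) ESU then models precisely. Assumptions to note: the sketch shows only a soft, inner-product scorer with a hypothetical function name; SIM's real GSU has both a hard (category-based) and a soft (embedding-based) variant.

```python
def general_search_unit(behaviors, candidate, k=2):
    """Toy General Search Unit: inner-product score each behavior embedding
    against the candidate item's embedding, keep the top-k as the Sub user
    Behavior Sequence (SBS), preserving original temporal order."""
    scored = [
        (sum(b * c for b, c in zip(beh, candidate)), i)
        for i, beh in enumerate(behaviors)
    ]
    scored.sort(reverse=True)                  # highest relevance first
    keep = sorted(i for _, i in scored[:k])    # restore temporal order
    return [behaviors[i] for i in keep]
```

Because the GSU is cheap, it can scan sequences tens of thousands of events long, while the ESU only ever sees the short SBS; that division of labor is what makes the cascade scale.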
Helpful reviews in e-commerce sites help customers acquire detailed information about an item and thus affect customers' buying decisions. Automatically predicting review helpfulness in Taobao is an essential but challenging task, for two reasons: (1) whether a review is helpful relies not only on its text but also on the corresponding item and the user who posts the review; (2) the criteria for classifying review helpfulness differ across items. To handle these two challenges, we propose CA-GNN (Category-Aware Graph Neural Networks), which uses graph neural networks (GNNs) to identify helpful reviews in a multi-task manner: we employ GNNs with one shared and many item-specific graph convolutions to simultaneously learn the common features and each item's specific criterion for classifying reviews. To reduce the number of parameters in CA-GNN and further boost its performance, we partition the items into several clusters according to their category information, such that items in one cluster share a common graph convolution. We conduct extensive experiments on two public datasets and demonstrate that CA-GNN outperforms existing methods by up to 10.9% in AUC. We also deployed our system in Taobao with an online A/B test and verified that CA-GNN still outperforms the baseline system in most cases.
The use of AI in knowledge-dense domains (e.g., chemistry, medicine, biology) is extremely promising, but often suffers from slow deployment and adaptation to different tasks. We propose a methodology to quickly capture the intent and expertise of a domain expert in order to train personalized AI models for specific tasks. Specifically, we focus on the domain of polymer materials design and discovery: it often takes 10 years or more to design, synthesize, test, and introduce a new polymer material to the market. One way to accelerate the design of polymer materials is through computational design methods such as combinatorial screening, generative models, and inverse design. The drawback of these methods is that they generate a large number of candidate molecules, which then need to be manually reviewed by subject matter experts (SMEs) who select only a dozen for further investigation. Our solution is a human-in-the-loop methodology in which we rank the candidates according to a utility function that is learned through continued interaction with the subject matter experts, but which is also constrained by specific chemical knowledge. We demonstrate the viability of the proposed methodology in a polymer production lab, where we (i) evaluate against datasets of polymers previously produced in the lab, (ii) produce several novel materials that are undergoing experimental development, and (iii) quantitatively show that standard synthetic accessibility scores are not informative about patterns in SME decisions.
The identification of Obstructive Sleep Apnea (OSA) relies on laborious and expensive polysomnography (PSG) exams. However, it is known that other factors, easier to measure, can be good indicators of OSA and its severity. In this work, we extensively investigate the use of machine learning techniques for determining which factors are most revealing with respect to OSA, along with a discussion of the challenges in performing this task. We ran extensive experiments on 1,042 patients from the Centre Hospitalier Universitaire of Grenoble, France. The data included ordinary clinical information, with PSG results as the baseline. We employed data preparation techniques including outlier cleaning, imputation of missing values, and synthetic data generation. We then performed an exhaustive attribute selection scheme to find the most representative features. We found that the prediction of OSA depends largely on variables related to age, body mass, and sleep habits, more than on those related to alcohol use, smoking, and depression. Next, we tested 60 regression/classification algorithms to predict the Apnea-Hypopnea Index (AHI) and the AHI-based severity of OSA. We achieved performance significantly superior to the state of the art for both AHI regression and classification. Our results can benefit the development of tools for automatically screening patients who should undergo polysomnography and further OSA treatment; our work is currently under consideration for production use by the Centre Hospitalier Universitaire of Grenoble. Our thorough methodology enables experimental reproducibility on similar OSA-detection problems and, more generally, on other problems with similar data models.
Differential diagnostic systems provide a ranked list of highly probable diseases given a patient's profile and symptoms. Evaluation of diagnostic algorithms in the literature has been limited to a small set of hand-crafted patient vignettes. Testing with high coverage and gaining insights for improvements are challenging because of the size and complexity of the knowledge base. Furthermore, scalable practical methodologies for the evaluation and deployment of such systems are missing in the literature. Here, we address this challenge using a novel patient vignette simulation algorithm within an iterative clinician-in-the-loop methodology for semi-automatically evaluating and deploying medical diagnostic systems in production. We evaluate our algorithms and methodology through a case study of a real product and a knowledge base curated by medical experts. We conduct multiple iterations of the methodology, report novel accuracy measures, and discuss insights from our experience in applying this method in production.
Malicious actors create inauthentic social media accounts controlled in part by algorithms, known as social bots, to disseminate misinformation and agitate online discussion. While researchers have developed sophisticated methods to detect abuse, novel bots with diverse behaviors evade detection. We show that different types of bots are characterized by different behavioral features. As a result, supervised learning techniques suffer severe performance deterioration when attempting to detect behaviors not observed in the training data. Moreover, tuning these models to recognize novel bots requires retraining with a significant amount of new annotations, which are expensive to obtain. To address these issues, we propose a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule. The ensemble of specialized classifiers (ESC) can better generalize, leading to an average improvement of 56% in F1 score for unseen accounts across datasets. Furthermore, novel bot behaviors are learned with fewer labeled examples during retraining. We deployed ESC in the newest version of Botometer, a popular tool to detect social bots in the wild, with a cross-validation AUC of 0.99.
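The combination step of the ensemble of specialized classifiers (ESC) is simple enough to sketch directly: each specialist emits a bot probability for its own class, and the maximum rule lets the most confident specialist decide. The threshold and function name here are illustrative assumptions; the abstract does not specify a decision threshold.

```python
def esc_predict(specialist_scores, threshold=0.5):
    """ESC combination via the maximum rule: `specialist_scores` maps each
    bot class to the probability emitted by its specialized classifier;
    the account is labeled a bot if the most confident specialist exceeds
    the (illustrative) threshold."""
    bot_class, score = max(specialist_scores.items(), key=lambda kv: kv[1])
    label = "bot" if score >= threshold else "human"
    return label, bot_class, score
```

Because each specialist only needs to recognize its own class, a novel bot behavior can be covered by retraining (or adding) one specialist rather than retraining a monolithic classifier, which is the generalization advantage the abstract reports.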
Reducing false positives while detecting anomalies is of growing importance for various industrial applications and mission-critical infrastructures, including satellite systems. Undesired false positives can be costly for such systems, bringing operations to a halt while human experts determine whether the anomalies are true anomalies that need to be mitigated. Although rule-based and machine-learning-based anomaly detection approaches have been studied, tensor-based decomposition methods have not been extensively explored. In this work, we introduce an Integrative Tensor-based Anomaly Detection (ITAD) framework to detect anomalies in a satellite system with the goal of minimizing false positives. We construct 3rd-order tensors from telemetry data collected from the Korea Multi-Purpose Satellite-2 (KOMPSAT-2) and calculate an anomaly score using one of the component matrices obtained by applying CANDECOMP/PARAFAC decomposition. Our results show that the tensor-based approach outperforms existing methods, achieving higher accuracy and lower false positive rates. We have successfully deployed our anomaly detection system in real KOMPSAT-2 mission operations.
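To make the scoring step concrete, here is one simple way a component matrix from a CP (CANDECOMP/PARAFAC) decomposition can yield anomaly scores: treat each row of the temporal-mode factor matrix as a summary of one time step and score it by its distance from the centroid of all rows. This is an illustrative variant under the assumption that the decomposition has already been computed; ITAD's actual scoring procedure is more involved.

```python
import math

def anomaly_scores(time_factors):
    """Score each time step by the Euclidean distance of its temporal
    factor row from the centroid of all rows; far-from-centroid time
    steps are candidate anomalies. (Illustrative scoring only; the CP
    decomposition producing `time_factors` is assumed done elsewhere.)"""
    n = len(time_factors)
    dim = len(time_factors[0])
    centroid = [sum(row[d] for row in time_factors) / n for d in range(dim)]
    return [math.dist(row, centroid) for row in time_factors]
```

The appeal of scoring in factor space rather than raw telemetry is that the decomposition integrates evidence across all telemetry channels at once, which helps suppress false positives triggered by a single noisy channel.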
Botnets have used domain generation algorithms (DGAs) for over a decade to covertly and robustly identify the domain names of their command and control (C&C) servers. Recent advancements in DGA detection have motivated botnet owners to rapidly alter C&C domains and use adversarial techniques to evade detection. As a result, it has become increasingly difficult to track botnets in DNS traffic. In this paper, we present Helix, a method for tracking and exploring botnets. Helix uses a spatio-temporal deep neural network autoencoder to convert domains into numerical vectors (embeddings) that capture the DGA and seed used to create each domain. This is made possible by leveraging both convolutional (spatial) and recurrent (temporal) layers, and by using techniques such as attention mechanisms and highway connections. Furthermore, the autoencoder architecture allows the network to be trained in an unsupervised manner (no labeling of data), which makes the system practical for real-world deployments.
In our evaluation, we found that Helix can track botnet campaigns, distinguish between DGA families and seeds, and can identify domains generated using the latest adversarial machine learning techniques. Helix is currently being used to track botnets in one of the world's largest Internet Service Providers (ISP), and we include some of the ISP's analysis work using our method.
The identification of criminals' behavioral patterns can be helpful for solving crimes. Currently, in order to perform this task, police investigators manually extract criminals' behavioral patterns (also referred to as their modus operandi) from a large corpus of police reports. These patterns are compared to the patterns observed in an ongoing criminal investigation to identify similarities that may link the suspect to other documented crimes. Due to the large number of historical cases, this manual process is time-consuming, very costly in terms of police resources, and limits the investigators' ability to solve open cases. In this study, we propose an automatic, language-independent method for extracting behavioral patterns from police reports. Relying on the extracted behavioral patterns as input, we utilize a Siamese neural network to identify burglaries committed by the same criminals. Experiments performed using a large dataset of police reports written in Hebrew, provided by the Israel Police, demonstrate the proposed method's high performance, achieving an AUC above 0.9. Using our method, we are also able to identify potential suspects for 22.41% of the open burglary cases in Israel.
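The Siamese comparison at the heart of this approach can be sketched as follows: the *same* encoder maps both reports' behavioral-pattern features into an embedding space, and similarity is a function of the distance between the two embeddings. The encoder here is caller-supplied and the distance-to-similarity mapping is an illustrative assumption; the paper learns the shared encoder end-to-end.

```python
import math

def siamese_score(report_a, report_b, encode):
    """Siamese-style comparison: one shared `encode` function embeds both
    police-report feature vectors; the closer the embeddings, the more
    likely the burglaries share a perpetrator. Returns a similarity in
    (0, 1]."""
    ea, eb = encode(report_a), encode(report_b)
    dist = math.dist(ea, eb)
    return 1.0 / (1.0 + dist)   # map distance to a bounded similarity
```

Sharing one encoder across both inputs is what makes the architecture "Siamese": identical weights guarantee that the comparison is symmetric and that similar behavioral patterns land near each other regardless of which side they enter on.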
Medical research is risky and expensive. Drug discovery requires researchers to efficiently winnow thousands of potential targets to a small candidate set. However, scientists spend significant time and money long before seeing the intermediate results that ultimately determine this smaller set. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that learns a data-driven ranking criterion to recommend new biomedical connections. We massively validate our system with a temporal holdout, wherein we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical sub-domains and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. Furthermore, we perform an ablation study to examine the aspects of our semantic network that contribute most to recommendation quality. Overall, AGATHA achieves best-in-class recommendation quality compared to other hypothesis generation systems built to predict across all available biomedical literature. Reproducibility: all code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha.
Platform ecosystems have witnessed explosive growth by facilitating interactions between consumers and suppliers. Search systems powering such platforms play an important role in surfacing content in front of users. To maintain a healthy, sustainable platform, system designers often need to explicitly consider exposing under-served content to users, content which might otherwise remain undiscovered. In this work, we consider the question of when we might surface under-served content in search results, and investigate ways to provide exposure to certain content groups. We propose a framework for developing query understanding techniques to identify potentially non-focused search queries on a music streaming platform, where users' information needs are non-specific enough to expose under-served content without severely impacting user satisfaction. We present insights from a search ranker deployed at scale and results from a live A/B test targeting a random sample of 72 million users and 593 million sessions, comparing the performance of the different methods considered for identifying non-focused queries for surfacing under-served content.
Many internet applications are powered by machine learned models, which are usually trained on labeled datasets obtained through user feedback signals or human judgments. Since societal biases may be present in the generation of such datasets, it is possible for the trained models to be biased, thereby resulting in potential discrimination and harms for disadvantaged groups. Motivated by the need to understand and address algorithmic bias in web-scale ML systems and the limitations of existing fairness toolkits, we present the LinkedIn Fairness Toolkit (LiFT), a framework for scalable computation of fairness metrics as part of large ML systems. We highlight the key requirements in deployed settings, and present the design of our fairness measurement system. We discuss the challenges encountered in incorporating fairness tools in practice and the lessons learned during deployment at LinkedIn. Finally, we provide open problems based on practical experience.
Win prediction and performance evaluation are two core subjects in sports analytics. Traditionally, they are treated separately and studied by two independent communities. However, this is not how humans intuitively interpret matches: we predict the match result as the competition unfolds, and simultaneously evaluate each action based on the game context and its downstream impact. Predicting match outcomes and evaluating actions are coupled tasks: the more accurately we predict, the better the evaluation.
To this end, we develop a unified Match Tracing (MT) framework for tackling win prediction and performance evaluation jointly. The main idea of MT is to learn a real-time, look-ahead win rate curve rather than a single scalar (win or lose); the value of an action can then be objectively measured by the increase or decrease it causes in the curve. To meet the low-latency restrictions of online deployment, an efficient model equipped with a recurrent attention mechanism and matrix perturbation (MT-Net) is built for learning and yielding the win rate curve. MT-Net encodes the players' behavior sequences through an attention mechanism and captures player-interaction effects through a graph embedding method. With action values derived from the win rate curve, performance can be quantified at different granularities (action/player/match level) through integrated analysis.
Experiments on an e-sport game demonstrate the prediction effectiveness and feasibility of the MT framework. Furthermore, we present detailed application cases of MT, including key action recognition and close-match detection.
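The core idea of valuing actions against the win rate curve reduces to a one-liner: if a win rate is estimated at every time step, the value of the action taken at step t is the change it induces, win_rate[t+1] - win_rate[t]. The function name is illustrative; MT derives richer, multi-granularity statistics from these deltas.

```python
def action_values(win_rate_curve):
    """Value each action as the change in the look-ahead win rate curve it
    induces: positive values mark actions that raised the team's chance
    of winning, negative values mark harmful ones."""
    return [b - a for a, b in zip(win_rate_curve, win_rate_curve[1:])]
```

Aggregating these per-action deltas by player or over a whole match is what lets the same curve serve both tasks: the curve endpoint is the win prediction, and its increments are the performance evaluation.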
For many applications, predicting users' intents helps the system provide solutions or recommendations to users, improving the user experience and bringing economic benefits. The main challenge of user intent prediction is that labeled training data is scarce, and some intents (labels) are sparse in the training set. This is a general problem for many real-world prediction tasks. To overcome data sparsity, we propose a masked-field pre-training framework. In pre-training, we exploit massive unlabeled data to learn useful feature-interaction patterns: we mask partial field features and learn to predict them from the other, unmasked features. We then fine-tune the pre-trained model for the target intent prediction task. This framework can be used to train various deep models. In the intent prediction task, each intent is relevant to only some of the fields. To tackle this, we propose a Field-Independent Transformer network, which generates a separate representation for each field and aggregates the relevant field representations with an attention mechanism for each intent. We test our method on intent prediction datasets from customer service scenarios as well as several public datasets. The results show that the masked-field pre-training framework significantly improves prediction precision for deep models, and that the Field-Independent Transformer network trained with it outperforms state-of-the-art methods in user intent prediction.
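The data side of masked-field pre-training can be sketched as follows: randomly hide a fraction of a record's field values and keep them as reconstruction targets, so any model can be trained on unlabeled records to predict the hidden fields from the visible ones. The field names, masking fraction, and mask token below are illustrative assumptions.

```python
import random

MASK = "<mask>"

def mask_fields(example, mask_frac=0.3, rng=None):
    """Masked-field pre-training example builder: returns (inputs, targets)
    where `inputs` is the record with a random subset of fields replaced
    by MASK and `targets` holds the hidden originals to be predicted."""
    rng = rng or random.Random(0)      # fixed seed for reproducibility
    fields = list(example)
    n_mask = max(1, int(mask_frac * len(fields)))
    masked_keys = rng.sample(fields, n_mask)
    inputs = {k: (MASK if k in masked_keys else v) for k, v in example.items()}
    targets = {k: example[k] for k in masked_keys}
    return inputs, targets
```

This mirrors masked-token pre-training in NLP, but the unit being hidden is a whole tabular field rather than a subword, which is what lets the model learn interactions between fields without any intent labels.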
Query Auto Completion (QAC), as the starting point of information retrieval tasks, is critical to user experience. Generally it has two steps: generating completed query candidates according to query prefixes, and ranking them based on extracted features. Three major challenges are observed for a query auto completion system: (1) QAC has a strict online latency requirement. For each keystroke, results must be returned within tens of milliseconds, which poses a significant challenge in designing sophisticated language models for it. (2) For unseen queries, generated candidates are of poor quality as contextual information is not fully utilized. (3) Traditional QAC systems heavily rely on handcrafted features such as the query candidate frequency in search logs, lacking sufficient semantic understanding of the candidate.
In this paper, we propose an efficient neural QAC system with effective context modeling to overcome these challenges. On the candidate generation side, the system uses as much information as possible from unseen prefixes to generate relevant candidates, increasing recall by a large margin. On the candidate ranking side, an unnormalized language model is proposed, which effectively captures the deep semantics of queries. This approach presents better ranking performance than state-of-the-art neural ranking methods and reduces latency by ~95% compared to neural language modeling methods. Empirical results on public datasets show that our model achieves a good balance between accuracy and efficiency. The system is serving LinkedIn job search with significant product impact observed.
Predicting users' behavior is crucially important for many domains, including major e-commerce companies, ride-hailing platforms, social networking, and education. The success of such prediction strongly depends on representation learning that can effectively model the dynamic evolution of users' behavior. This paper develops a joint framework combining inverse reinforcement learning (IRL) with a deep learning (DL) regression model, called IRL-DL, to predict drivers' future behavior on ride-hailing platforms. Specifically, we formulate the dynamic evolution of each driver as a sequential decision-making problem and employ IRL as representation learning to learn each driver's preference vector. We then integrate drivers' preference vectors with their static features (e.g., age, gender) and other attributes to build a regression model (e.g., an LSTM neural network) to predict drivers' future behavior. We use an extensive driver dataset obtained from a ride-sharing platform to verify the effectiveness and efficiency of the IRL-DL framework, and results show that it achieves consistent and remarkable improvements over models without drivers' preference vectors.
Behavior tracing, or prediction, is a key component in various application scenarios such as online user modeling and ubiquitous computing; it significantly benefits system design (e.g., resource pre-caching) and improves the user experience (e.g., personalized recommendation). Traditional behavior tracing methods such as Markovian and sequential models take recent behaviors as input and infer the next move from the most up-to-date information. However, these existing methods rarely comprehensively model both the low-level temporal irregularity in the recent behavior sequence, i.e., the unevenly distributed time intervals between consecutive behaviors, and the high-level periodicity in the long-term activity cycle, i.e., each user's periodic behavior patterns.
In this paper, we propose an intuitive and effective embedding method called Multi-level Aligned Temporal Embedding (MATE), which tackles the temporal irregularity of the recent behavior sequence and then aligns it with the long-term periodicity of the activity cycle. Specifically, we combine time encoding and a decoupled attention mechanism to build a temporal self-attentive sequential decoder that addresses behavior-level temporal irregularity. To embed the activity cycle from the raw behavior sequence, we employ a novel temporal dense interpolation followed by a self-attentive sequential encoder. We then propose a periodic activity alignment to capture the long-term, activity-level periodicity, and construct an activity-behavior alignment to combine the activity-level and behavior-level representations for the final prediction. We experimentally demonstrate the effectiveness of the proposed model on a game-player behavior sequence dataset and a real-world app usage trace dataset. Further, we deploy the proposed behavior tracing model in a game scene preloading service, which effectively reduces the waiting time of scene transfer by preloading the predicted game scene for each user.
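To make the behavior-level time encoding concrete, the following minimal sketch (an illustration under our own naming and dimensions, not the MATE implementation) tags each behavior with a sinusoidal encoding of the irregular interval since the previous behavior, so a downstream self-attentive model sees the gaps explicitly:

```python
import math

def time_encoding(delta_t, dim=8):
    """Sinusoidal encoding of the time gap preceding a behavior."""
    enc = []
    for i in range(dim):
        freq = 10000 ** (2 * (i // 2) / dim)
        enc.append(math.sin(delta_t / freq) if i % 2 == 0
                   else math.cos(delta_t / freq))
    return enc

def encode_sequence(timestamps, dim=8):
    """Encode each behavior by the interval since the previous one,
    making unevenly spaced gaps explicit to the sequence model."""
    deltas = [0.0] + [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    return [time_encoding(d, dim) for d in deltas]
```

In practice such an encoding would be added to (or concatenated with) the learned behavior embeddings before self-attention.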
Many recent advances in neural information retrieval models, which predict the top-K items given a query, learn directly from a large training set of (query, item) pairs. However, they are often insufficient when there are many previously unseen (query, item) combinations, a situation referred to as the cold-start problem. Furthermore, the search system can be biased towards items that were frequently shown for a query in the past, also known as the 'rich get richer' (a.k.a. feedback loop) problem. In light of these problems, we observed that most online content platforms have both a search and a recommender system that, while having heterogeneous input spaces, can be connected through their common output item space and a shared semantic representation. In this paper, we propose a new Zero-Shot Heterogeneous Transfer Learning framework that transfers learned knowledge from the recommender system component to improve the search component of a content platform. First, it learns representations of items and their natural-language features by predicting (item, item) correlation graphs derived from the recommender system as an auxiliary task. Then, the learned representations are transferred to solve the target search retrieval task, performing query-to-item prediction without having seen any (query, item) pairs in training. We conduct online and offline experiments on one of the world's largest search and recommender systems from Google, and present the results and lessons learned. We demonstrate that the proposed approach achieves high performance on offline search retrieval tasks and, more importantly, achieved significant improvements in relevance and user interactions over the highly optimized production system in online experiments.
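The core idea of scoring unseen (query, item) pairs through a shared semantic representation can be sketched in a few lines; the toy word vectors and function names below are illustrative assumptions, not the production system:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def text_embedding(text, word_vecs):
    """Average word vectors; queries and item descriptions share this
    semantic space, so unseen (query, item) pairs can still be scored."""
    dim = len(next(iter(word_vecs.values())))
    words = [w for w in text.lower().split() if w in word_vecs]
    if not words:
        return [0.0] * dim
    return [sum(word_vecs[w][i] for w in words) / len(words)
            for i in range(dim)]

def retrieve(query, items, word_vecs, k=1):
    """Rank items for a query by similarity in the shared space,
    without ever training on (query, item) pairs."""
    q = text_embedding(query, word_vecs)
    return sorted(items,
                  key=lambda it: -dot(q, text_embedding(it, word_vecs)))[:k]
```

In the paper's setting, the item representations would instead come from the auxiliary (item, item) graph prediction task, rather than from fixed word vectors.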
Relevance ranking is a key component of many search engines, including the Tweet search engine at Twitter. Users often use Tweet search to discover live discussions and different voices on trending topics or recent events. Tweet search is thus unique due to its focus on real-time content, where both the retrieved content and queries change drastically on an hourly basis. Another important property of Tweet search is that its relevance ranking takes the social endorsements from other users into account, e.g., "likes" and "retweets", which is different from mainly relying on clicks as implicit feedback. The relevance ranking of Tweet search is also subject to strict latency constraints, because every second, a large amount of Tweets are posted and indexed, while tens of thousands of queries are issued to search posted Tweets. Considering the above properties and constraints, we present a relevance ranking system for Tweet search addressing all these challenges at Twitter. We first discuss the formation of the relevance ranking pipeline, which consists of a series of ranking models. We then present the methodology for training the models and the various groups of features we use, including real-time and personalized features. We also investigate approaches of achieving unbiased model training and building up automatic online tuning of system parameters. Experiments using online A/B testing demonstrate the effectiveness of the proposed approaches and we have deployed the proposed relevance ranking system in production for more than three years.
Vehicular path flows (trajectories) are an important data source for smart mobility, from which many road traffic parameters can be inferred. However, a long-standing challenge is that any single source of trajectory data is biased in its spatiotemporal coverage. In this paper, we leverage two types of large traffic datasets - point flows and sample trajectories - to generate full city-scale vehicular paths. Our method consists of a low-granularity data fusion (LGDF) module, which uses point flow data to estimate the sparse paths that pass through specific links (where sensors are mounted), and a high-granularity model training (HGMT) component, which uses sample trajectory data to pre-train a bi-gram sequence generation model. The results from LGDF and HGMT are then combined to produce detailed on-road spatiotemporal paths. In this way, the safety of individual trajectory data is protected while full-scale city traffic can be reproduced for transportation analytics. The proposed method is verified via real-data case studies. Starting from August 2019, this method has been implemented in Alibaba's city brain project and successively deployed in many cities in China for traffic analysis and optimization.
Large-scale online promotions, such as Double 11 and Black Friday, are of great value to e-commerce platforms nowadays. Traditional methods are not successful when we aim to maximize global Gross Merchandise Volume (GMV) in promotion scenarios, due to three limitations. First, sellers' GMV varies significantly from daily scenarios to promotions. Second, these methods do not consider the explosive demand during promotions, so a consumer may fail to purchase some popular items due to sellers' limited capacities. Third, the traffic distribution over sellers diverges across channels, rendering traditional single-channel methods far from optimal in creating commercial value. To address these problems, we design a Multi-Channel Sellers Traffic Allocation (MCSTA) optimization model to obtain the optimal page view (PV) distribution with respect to global GMV. We then propose a general constrained non-smooth convex optimization solution with a Multi-Objective Shortest Distance (MOSD) hyperparameter tuning method to solve MCSTA. This is the first work to systematically address this issue in the scenario of large-scale online promotions. The empirical results show that MCSTA achieves a significant GMV improvement of 1.1% based on an A/B test during Alibaba's "Global Shopping Festival", one of the world's largest online sales events. Furthermore, we deploy MCSTA in other popular scenarios, including everyday promotions and video live stream services, to show that MCSTA can be widely applied in e-commerce and online entertainment services.
As one of the core components of a customer service bot, User Intent Prediction (UIP) aims at predicting users' intents (usually represented as predefined user questions) before they ask, and has been widely applied in real applications. However, when developing a machine learning system for this problem, two critical issues, i.e., feature drift and class imbalance, may emerge and seriously degrade system performance. Moreover, various scenarios may arise due to business demands, making the aforementioned problems much more severe. To address these two problems, we propose an attention-based Deep Multi-instance Sequential Cross Network (aDMSCN) for the UIP task. On the one hand, the UIP task can be subtly formalized as a multi-instance learning (MIL) task, with an attention-based method proposed to alleviate the influence of feature drift. To the best of our knowledge, this is the first attempt to model the problem from a MIL perspective. On the other hand, a ratio-sensitive loss is also developed in our model, which mitigates the negative impact of class imbalance. Extensive experiments on both offline real-world datasets and online A/B testing show that our proposed framework significantly outperforms other state-of-the-art methods for the UIP task.
Given the convenience of collecting information through online services, recommender systems now consume large-scale data and play an increasingly important role in improving user experience. With the recent emergence of Graph Neural Networks (GNNs), GNN-based recommender models have shown the advantage of modeling the recommender system as a user-item bipartite graph to learn representations of users and items. However, such models are expensive to train and difficult to update frequently enough to provide the most up-to-date recommendations. In this work, we propose to update GNN-based recommender models incrementally so that computation time can be greatly reduced and models can be updated more frequently. We develop a Graph Structure Aware Incremental Learning framework, GraphSAIL, to address the catastrophic forgetting problem commonly experienced when training a model incrementally. Our approach preserves a user's long-term preference (or an item's long-term property) during incremental model updating. GraphSAIL implements a graph structure preservation strategy that explicitly preserves each node's local structure, global structure, and self-information. We argue that our incremental training framework is the first attempt tailored for GNN-based recommender systems, and we demonstrate its improvement over other incremental learning techniques on two public datasets. We further verify the effectiveness of our framework on a large-scale industrial dataset.
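The self-information part of such a structure preservation strategy amounts to a distillation-style penalty that keeps each node's new embedding close to its embedding from the previous model snapshot. The toy sketch below uses our own names and a plain squared-distance form, not GraphSAIL's exact loss:

```python
def self_information_loss(old_emb, new_emb):
    """Penalize drift of each node's embedding between the model snapshot
    before an incremental update (old_emb) and after it (new_emb).
    Both arguments map node ids to embedding vectors."""
    total = 0.0
    for node, old_vec in old_emb.items():
        if node in new_emb:
            total += sum((o - n) ** 2
                         for o, n in zip(old_vec, new_emb[node]))
    return total / max(1, len(old_emb))
```

During incremental training this term would be added (with a weight) to the usual recommendation loss, alongside analogous local- and global-structure terms.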
Many recommendation systems use users' attributes to retrieve documents before ranking. Instead of using all attributes, this work explores algorithms that choose a subset, in order to achieve higher precision. We propose a model that forecasts the relevance of documents matched by each individual attribute. By restricting to top-K attributes based on the forecast, we observed 50% reduction in latency at 99th percentile on LinkedIn's job recommendation system, as well as increased users' engagements.
User profiling is one of the most important components of recommendation systems, where a user is profiled using demographic information (e.g., gender, age, and location) and behavioral information (e.g., browsing and search history). Among the different dimensions of user profiling, tagging is an explainable and widely used representation of user interest. In this paper, we propose a user tag profiling model (UTPM) that studies user-tag profiling as a multi-label classification task using deep neural networks. Different from conventional models, our UTPM model uses a multi-head attention mechanism with shared query vectors to learn sparse features across different fields. Besides, we introduce an improved FM-based cross-feature layer, which outperforms many state-of-the-art cross-feature methods and further enhances model performance. Meanwhile, we design a novel joint method to learn the preferences for different tags from a single clicked news article in recommendation systems. Furthermore, our UTPM model is deployed in the WeChat "Top Stories" recommender system, where both online and offline experiments demonstrate the superiority of the proposed model over baseline models.
Gas theft by restaurants is a major concern in the gas industry; it causes revenue losses for gas companies and seriously endangers public safety. Traditional methods of gas theft detection rely heavily on active human effort and are extremely ineffective. Thanks to the gas consumption data collected by smart meters, we can devise a data-driven method to tackle this issue. In this paper, we propose a gas-theft detection method, msRank, to discover suspicious restaurant users when only scarce labels are available. Our method contains three main components: 1) data pre-processing, which filters reading noise and excludes users with missing or zero usage; 2) normal user modeling, which quantifies the self-stable seasonality of normal users and distinguishes them from unstable ones; and 3) gas-theft suspect detection, which discovers gas-theft suspects among unstable users via RankNet-based suspicion scoring on extracted deviation features. By using detected normal users as negative samples to train RankNet, the normal user modeling and gas-theft suspect detection components are seamlessly connected, overcoming the problem of label scarcity. We conduct extensive experiments on three real-world datasets, and the results demonstrate the advantages of our approach. We have deployed a system, GasShield, which provides a weekly gas-theft suspect list for a gas group in northern China.
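The RankNet-based suspicion scoring rests on a standard pairwise loss that pushes suspects above detected normal users. As a reference point, the loss for one (suspect, normal) score pair takes the usual RankNet form (the surrounding feature extraction and scoring network are not shown):

```python
import math

def ranknet_loss(score_pos, score_neg):
    """Standard pairwise RankNet loss: small when the positive example
    (a gas-theft suspect) is scored well above the negative example
    (a detected normal user), large when the order is reversed."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))
```

Training sums this loss over sampled (suspect, normal) pairs, which is how the detected normal users serve as negative samples.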
As a concise form of user reviews, tips have unique advantages to explain the search results, assist users' decision making, and further improve user experience in vertical search scenarios. Existing work on tip generation does not take query into consideration, which limits the impact of tips in search scenarios. To address this issue, this paper proposes a query-aware tip generation framework, integrating query information into encoding and subsequent decoding processes. Two specific adaptations of Transformer and Recurrent Neural Network (RNN) are proposed. For Transformer, the query impact is incorporated into the self-attention computation of both the encoder and the decoder. As for RNN, the query-aware encoder adopts a selective network to distill query-relevant information from the review, while the query-aware decoder integrates the query information into the attention computation during decoding. The framework consistently outperforms the competing methods on both public and real-world industrial datasets. Last but not least, online deployment experiments on Dianping demonstrate the advantage of the proposed framework for tip generation as well as its online business values.
Mobile advertising has become inarguably one of the fastest-growing industries in the world. The influx of capital attracts increasing numbers of fraudsters seeking to defraud advertisers. There are many tricks a fraudster can leverage, among which bot install fraud is undoubtedly the most insidious, due to its ability to implement sophisticated behavioral patterns and emulate normal users so as to evade detection rules defined by human experts. In this work, we propose an anti-fraud method based on a heterogeneous graph that incorporates both local and global context via graph neural networks (GNN) and a gradient boosting classifier to detect fraudulent bot installs at Mobvista, a leading global mobile advertising company. Offline evaluations on two datasets show the proposed method outperforms all competitive baseline methods by at least 2.2% on the first dataset and 5.75% on the second, given the evaluation metric Recall@90% Precision. Furthermore, we deploy our method to handle million-scale data daily at Mobvista. The online performance also shows that the proposed method consistently detects more bots than other baseline methods.
The fast-evolving and deadly outbreak of coronavirus disease (COVID-19) has posed grand challenges to human society. To slow the spread of virus infections and respond better with actionable strategies for community mitigation, we leverage the large-scale, real-time pandemic-related data generated from heterogeneous sources (e.g., disease-related data, demographic data, mobility data, and social media data) and propose a data-driven system (named α-satellite), as an initial offering, to provide real-time COVID-19 risk assessment in a hierarchical manner in the United States. More specifically, given a location (either user input or automatic positioning), the system automatically provides risk indices associated with that specific location, the county the location is in, and the state as a whole, enabling people to select appropriate actions for protection while minimizing disruptions to daily life to the extent possible. In α-satellite, we first construct an attributed heterogeneous information network (AHIN) to model the collected multi-source data in a comprehensive way; we then utilize meta-path based schemes to model both vertical and horizontal information associated with a given location (i.e., point of interest, POI); finally, we devise a novel heterogeneous graph neural network to aggregate its neighborhood information and estimate the risk of the given POI in a hierarchical manner. To comprehensively evaluate the performance of α-satellite in real-time COVID-19 risk assessment, we first perform a set of studies to validate its utility; on a real-world dataset consisting of 6,538 annotated POIs, the experimental results show that α-satellite achieves an area under the curve (AUC) of 0.9378, which outperforms the state-of-the-art baselines. After we launched the system for public tests, it had attracted 51,190 users as of May 30.
Based on an analysis of its large-scale user base, we have a key finding: people from more severely affected regions (i.e., with larger numbers of COVID-19 cases) show stronger interest in using the system for actionable information. Our system and the generated benchmark datasets have been made publicly accessible through our website.
The recent paramount success of the gig economy has introduced new business opportunities in areas such as food delivery. However, some food delivery riders abuse the service by driving unauthorized vehicles that are not stated in their contracts. These abusers are particularly problematic because they break transportation regulations and unfairly take more orders. Detecting them is challenging because labeled datasets are lacking and such anomalous abusers occur infrequently compared to normal riders. Furthermore, the sequential patterns of abusive behavior are not easy to model.
In this work, we aim to detect food delivery abusers using unauthorized vehicles by formulating the problem as novelty detection over sequential data. We propose the Variational Reward Inference based Novelty Detector (VRIND), which performs sequential novelty detection via inverse reinforcement learning with variational inference, in which the reward function learns the behavioral intention of decision-making experts. The reward function is represented by a neural network capable of approximating reward distributions through variational reward inference. Using a commercial food delivery trajectory dataset from our company, we demonstrate that our model significantly outperforms the other baseline methods in identifying novelty (abusers) in sequential data, which helps ensure regulatory compliance and provides fair opportunity to the more than 100 thousand delivery riders serving more than 1.5 million daily transactions in our Baemin food delivery system.
Mobile navigation is a critical component of mobile maps. Yawing detection (determining whether a vehicle has yawed, i.e., deviated from its planned route) is an important task in mobile navigation. In regions containing parallel, closely spaced elevated and surface roads, it is hard to detect yawing events using traditional methods, which rely mainly on low-accuracy positions and moving directions. Recognizing whether a vehicle is moving on an elevated road can significantly improve the performance of yawing detection.
We propose Elevated Road Network (ERNet), a lightweight neural network model used in real industrial mobile navigation, to fundamentally solve elevated road recognition. For an elevated road fragment and a surface road fragment in the same group (i.e., parallel and close to each other), ERNet takes four types of high-level features as input and learns two 10-dimensional descriptors (A and B). In the inference stage, ERNet predicts a 10-dimensional embedding (C) for a vehicle's position. By comparing the squared L2 distances ||A-C||² and ||B-C||², and applying a technique called the confidence constraint, we recognize the road type corresponding to the position. Extensive experiments show significant improvements in elevated road recognition and yawing detection compared with several baseline methods. ERNet is deployed as part of AMap, a widely used mobile map in China; it serves drivers in three large cities (Beijing, Shanghai, and Guangzhou) and will cover the whole country soon.
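The inference rule described above reduces to comparing two squared L2 distances, with an abstention band standing in for the confidence constraint. The sketch below is our simplification (a margin threshold), not ERNet's actual constraint:

```python
def classify_position(c, desc_elevated, desc_surface, margin=0.0):
    """Assign a position embedding c to the nearer of the two learned
    road-fragment descriptors; abstain when the distance gap is within
    the margin (a stand-in for the paper's confidence constraint)."""
    d_e = sum((a - b) ** 2 for a, b in zip(desc_elevated, c))
    d_s = sum((a - b) ** 2 for a, b in zip(desc_surface, c))
    if abs(d_e - d_s) <= margin:
        return "uncertain"
    return "elevated" if d_e < d_s else "surface"
```

In the deployed setting the descriptors and the position embedding would be 10-dimensional outputs of the trained network.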
Manufacturing of car bodies relies heavily on demanding welding processes that join body parts together, introducing thousands of welding spots in each car. Quality monitoring for these spots impacts production efficiency and cost. In this paper we develop an ML pipeline to predict spot quality before the actual welding happens. The pipeline is based on a Feature Engineering (FE) approach that manually designs features using domain knowledge. We evaluated the pipeline with two datasets from industrial plants, achieving very promising results with prediction errors around 2%. We then develop an approach to semantically enhance FE pipelines in order to automate the ML process without compromising prediction accuracy, and to facilitate generalisation and transfer of FE-based models to other datasets and processes. Our ML pipeline has been deployed offline on various Bosch manufacturing datasets in a controlled environment since early 2019 and evaluated.
Recently, deep learning-based models have been widely studied for click-through rate (CTR) prediction, leading to improved prediction accuracy in many industrial applications. However, current research focuses primarily on building complex network architectures to better capture sophisticated feature interactions and dynamic user behaviors. The increased model complexity may slow down online inference and hinder adoption in real-time applications. Instead, our work targets a new model training strategy based on knowledge distillation (KD). KD is a teacher-student learning framework that transfers knowledge learned by a teacher model to a student model. The KD strategy not only allows us to simplify the student model to a vanilla DNN but also achieves significant accuracy improvements over the state-of-the-art teacher models. These benefits motivate us to further explore the use of a powerful ensemble of teachers for more accurate student model training. We also propose novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss. We conduct comprehensive experiments against 12 existing models across three industrial datasets. Both offline and online A/B testing results show the effectiveness of our KD-based training strategy.
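For binary CTR prediction, such a KD training strategy can be sketched as a blended loss between the hard click label and the averaged soft prediction of an ensemble of teachers. The weighting and the simple teacher averaging below are illustrative; techniques such as teacher gating are omitted:

```python
import math

def distillation_loss(student_logit, teacher_logits, label, alpha=0.5):
    """Blend the hard-label cross-entropy with a soft cross-entropy
    toward the averaged prediction of an ensemble of teachers."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    p_s = sigmoid(student_logit)
    p_t = sum(sigmoid(z) for z in teacher_logits) / len(teacher_logits)
    eps = 1e-12
    hard = -(label * math.log(p_s + eps)
             + (1 - label) * math.log(1 - p_s + eps))
    soft = -(p_t * math.log(p_s + eps)
             + (1 - p_t) * math.log(1 - p_s + eps))
    return alpha * hard + (1 - alpha) * soft
```

Because the soft term only needs the teachers' predictions, the student can remain a lightweight vanilla DNN at inference time.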
For over a decade, coreference resolution systems have been developed to find simple 1-to-1 equivalence mappings (sameAs relations) between instances of different linked datasets and knowledge graphs. Comparative evaluations of instance matching systems can inform us about their performance on artificial benchmarks or real-world data challenges. However, the lack of real data for evaluating these systems is currently a bottleneck. In this paper, we propose the use of the Cruise entities in the GeoLink data repository as a real-world instance matching benchmark for linked data and knowledge graphs. The GeoLink project has brought together seven datasets related to geoscience research. Both the ontology (T-box) and the instance data (A-box) of GeoLink are significantly larger than current benchmarks, and they pose particularly interesting challenges, such as geospatial and temporal data. The benchmark we propose consists of two real-world datasets in GeoLink, called R2R and BCO-DMO, which include manually curated owl:sameAs links between more than 900 Cruise entities in the two datasets. The reference alignment was discussed and generated by domain experts from different institutions and is expressed in the Alignment API format.
In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case, and we specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.
Knowledge Graphs (KGs) have been integrated into several recommendation models to augment the informational value of an item by means of its related entities in the graph. Yet, existing datasets only provide explicit ratings on items, and no information is provided about users' opinions of other (non-recommendable) entities. To overcome this limitation, we introduce a new dataset, called the MindReader dataset, providing explicit user ratings both for items and for KG entities. In this first version, the MindReader dataset provides more than 102 thousand explicit ratings collected from 1,174 real users on both items and entities from a KG in the movie domain. The dataset has been collected through an online interview application that we also release as open source. As a demonstration of the importance of this new dataset, we present a comparative study of the effect of including ratings on non-item KG entities in a variety of state-of-the-art recommendation models. In particular, we show that most models, whether designed specifically for graph data or not, see improvements in recommendation quality when trained on explicit non-item ratings. Moreover, for some models, we show that non-item ratings can effectively replace item ratings without loss of recommendation quality. This finding, in addition to an observed greater familiarity of users with certain descriptive entities than with movies, motivates the use of KG entities for both warm- and cold-start recommendations.
Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find that 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data, which it exceeds by comparison with 28x more queries, 49x more connections, and 4.4x more URLs in the corpus. We present a description of the dataset's generation process, its characteristics, its use in ranking, and other potential uses.
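The aggregation-and-filtering step with a k-anonymity requirement can be illustrated by keeping only (query, URL) pairs observed from at least k distinct users. This toy function is our simplification, not the exact release pipeline:

```python
def k_anonymous_clicks(click_log, k=5):
    """Given (user, query, url) click records, keep only (query, url)
    pairs clicked by at least k distinct users, and report how many
    distinct users clicked each surviving pair."""
    seen = {}
    for user, query, url in click_log:
        seen.setdefault((query, url), set()).add(user)
    return {pair: len(users) for pair, users in seen.items()
            if len(users) >= k}
```

Counting distinct users (rather than raw clicks) is what prevents a single heavy user from making a rare, potentially identifying query appear popular.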
Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning and natural language processing methods. With respect to the recent outbreak of the Coronavirus disease 2019 (COVID-19), online discourse on Twitter reflects public opinion and perception related to the pandemic itself as well as mitigating measures and their societal impact. Understanding such discourse, its evolution, and interdependencies with real-world events or (mis)information can foster valuable insights. On the other hand, such corpora are crucial facilitators for computational methods addressing tasks such as sentiment analysis, event detection, or entity recognition. However, obtaining, archiving, and semantically annotating large amounts of tweets is costly. In this paper, we describe TweetsCOV19, a publicly available knowledge base of currently more than 8 million tweets, spanning October 2019 - April 2020. Metadata about the tweets as well as extracted entities, hashtags, user mentions, sentiments, and URLs are exposed using established RDF/S vocabularies, providing an unprecedented knowledge base for a range of knowledge discovery tasks. Next to a description of the dataset and its extraction and annotation process, we present an initial analysis and use cases of the corpus.
LensKit is an open-source toolkit for building, researching, and learning about recommender systems. First released in 2010 as a Java framework, it has supported diverse published research, small-scale production deployments, and education in both MOOC and traditional classroom settings. In this paper, I present the next generation of the LensKit project, re-envisioning the original tool's objectives as a flexible Python package for supporting recommender systems research and development. LensKit for Python (LKPY) enables researchers and students to build robust, flexible, and reproducible experiments that make use of the large and growing PyData and Scientific Python ecosystem, including scikit-learn and TensorFlow. To that end, it provides classical collaborative filtering implementations, recommender system evaluation metrics, data preparation routines, and tools for efficiently batch-running recommendation algorithms, all usable in any combination with each other or with other Python software.
This paper describes the design goals, use cases, and capabilities of LKPY, contextualized in a reflection on the successes and failures of the original LensKit for Java software.
The automatic detection of bias in news articles can have a high impact on society because undiscovered news bias may influence the political opinions, social views, and emotional feelings of readers. While various analyses and approaches to news bias detection have been proposed, large data sets with rich bias annotations on a fine-grained level are still missing. In this paper, we firstly aggregate the aspects of news bias in related works by proposing a new annotation schema for labeling news bias. This schema covers the overall bias, as well as the bias dimensions (1) hidden assumptions, (2) subjectivity, and (3) representation tendencies. Secondly, we propose a methodology based on crowdsourcing for obtaining a large data set for news bias analysis and identification. We then use our methodology to create a dataset consisting of more than 2,000 sentences annotated with 43,000 bias and bias dimension labels. Thirdly, we perform an in-depth analysis of the collected data. We show that the annotation task is difficult with respect to bias and specific bias dimensions. While crowdworkers' labels of representation tendencies correlate with experts' bias labels for articles, subjectivity and hidden assumptions do not correlate with experts' bias labels and, thus, seem to be less relevant when creating data sets with crowdworkers. The experts' article labels better match the inferred crowdworkers' article labels than the crowdworkers' sentence labels. The crowdworkers' countries of origin seem to affect their judgements. In our study, non-Western crowdworkers tend to annotate more bias either directly or in the form of bias dimensions (e.g., subjectivity) than Western crowdworkers do.
Feature engineering is a fundamental but poorly documented component in Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open-source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the extracted features. Both of these are crucial to collection creation and to deploying a search engine into production. As a result, such experiments are rarely reproducible with new features or collections, and are of little help to companies wishing to deploy LTR systems.
In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.
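The kind of query-document feature an LTR extractor like Fxt computes can be sketched in a few lines; the feature names and the simplified formulas below are hypothetical, not Fxt's actual implementation.

```python
import math

def extract_features(query_terms, doc_terms, doc_freq, n_docs):
    """Toy LTR features for a query-document pair: summed term
    frequency, document length, and a simple TF-IDF score."""
    tf = {t: doc_terms.count(t) for t in query_terms}
    tfidf = sum(tf[t] * math.log(n_docs / doc_freq.get(t, 1))
                for t in query_terms if tf[t] > 0)
    return {"sum_tf": sum(tf.values()), "doc_len": len(doc_terms), "tfidf": tfidf}

feats = extract_features(["web", "search"], ["web", "search", "web", "engine"],
                         {"web": 50, "search": 20}, 1000)
print(feats["sum_tf"], feats["doc_len"])  # 3 4
```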
Causal knowledge is seen as one of the key ingredients to advance artificial intelligence. Yet, few knowledge bases comprise causal knowledge to date, possibly due to significant efforts required for validation. Notwithstanding this challenge, we compile CauseNet, a large-scale knowledge base of claimed causal relations between causal concepts. By extraction from different semi- and unstructured web sources, we collect more than 11 million causal relations with an estimated extraction precision of 83% and construct the first large-scale and open-domain causality graph. We analyze the graph to gain insights about causal beliefs expressed on the web and we demonstrate its benefits in basic causal question answering. Future work may use the graph for causal reasoning, computational argumentation, multi-hop question answering, and more.
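Basic causal question answering over such a causality graph reduces to reachability between concepts; the toy graph and function below are invented for illustration and do not reflect CauseNet's data model.

```python
from collections import deque

# Toy causality graph: each cause maps to its claimed effects.
CAUSES = {"smoking": {"cancer"}, "cancer": {"death"}, "rain": {"flood"}}

def causes_transitively(cause, effect, graph=CAUSES):
    """Answer 'does X cause Y?' by breadth-first search over claimed
    causal relations, allowing multi-hop chains."""
    queue, seen = deque([cause]), {cause}
    while queue:
        for nxt in graph.get(queue.popleft(), set()):
            if nxt == effect:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(causes_transitively("smoking", "death"))  # True, via cancer
```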
There are many existing retrieval and question answering datasets. However, most of them either focus on ranked list evaluation or single-candidate question answering. This divide makes it challenging to properly evaluate approaches concerned with ranking documents and providing snippets or answers for a given query. In this work, we present FiRA: a novel dataset of Fine-Grained Relevance Annotations. We extend the ranked retrieval annotations of the Deep Learning track of TREC 2019 with passage and word level graded relevance annotations for all relevant documents. We use our newly created data to study the distribution of relevance in long documents, as well as the attention of annotators to specific positions of the text. As an example, we evaluate the recently introduced TKL document ranking model. We find that although TKL exhibits state-of-the-art retrieval results for long documents, it misses many relevant passages.
In recent years, the amount of data has increased exponentially, and knowledge graphs have gained attention as data structures to integrate data and knowledge harvested from myriad data sources. However, these data sources usually exhibit data complexity issues like large volume, high duplicate rates, and heterogeneity, requiring data management tools able to address the negative impact of these issues on the knowledge graph creation process. In this paper, we propose the SDM-RDFizer, an interpreter of the RDF Mapping Language (RML), to transform raw data in various formats into an RDF knowledge graph. SDM-RDFizer implements novel algorithms to execute the logical operators between mappings in RML, allowing it to scale up to complex scenarios where data is not only large but also has a high duplicate rate. We empirically evaluate the SDM-RDFizer performance against diverse testbeds with diverse configurations of data volume, duplicates, and heterogeneity. The observed results indicate that SDM-RDFizer is two orders of magnitude faster than the state of the art, making it an interoperable and scalable solution for knowledge graph creation. SDM-RDFizer is publicly available as a resource through a GitHub repository and a DOI.
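The duplicate-elimination idea behind such mapping execution can be sketched as follows; the mapping format here is a deliberately simplified stand-in, not actual RML or SDM-RDFizer's algorithms.

```python
def rows_to_triples(rows, subject_template, predicate, value_column):
    """Apply a single RML-like mapping rule to tabular rows; a set
    suppresses the duplicate triples that high-duplication sources
    would otherwise produce."""
    triples = set()
    for row in rows:
        triples.add((subject_template.format(**row), predicate, row[value_column]))
    return triples

rows = [{"id": "1", "name": "Ada"},
        {"id": "1", "name": "Ada"},   # duplicate source row
        {"id": "2", "name": "Alan"}]
triples = rows_to_triples(rows, "http://example.org/person/{id}", "ex:name", "name")
print(len(triples))  # 2 distinct triples from 3 input rows
```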
Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which has led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, exceeding existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things, we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic American newspapers. Over 16 million pages have been digitized to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations collected as part of the Library of Congress's Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings. We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus and describe the resulting Newspaper Navigator dataset, the largest dataset of extracted visual content from historic newspapers ever produced. The Newspaper Navigator dataset, finetuned visual content recognition model, and all source code are placed in the public domain for unrestricted re-use.
In the area of natural language processing, various financial datasets have informed recent research and analysis including financial news, financial reports, social media, and audio data from earnings calls. We introduce a new, large-scale multi-modal, text-audio paired, earnings-call dataset named MAEC, based on S&P 1500 companies. We describe the main features of MAEC, how it was collected and assembled, paying particular attention to the text-audio alignment process used. We present the approach used in this work as providing a suitable framework for processing similar forms of data in the future. The resulting dataset is more than six times larger than those currently available to the research community and we discuss its potential in terms of current and future research challenges and opportunities. All resources of this work are available at https://github.com/Earnings-Call-Dataset/
Graph data have become increasingly common. Visualizing them helps people better understand relations among entities. Unfortunately, existing graph visualization tools are primarily designed for single-person desktop use, offering limited support for interactive web-based exploration and online collaborative analysis. To address these issues, we have developed Argo Lite, a new in-browser interactive graph exploration and visualization tool. Argo Lite enables users to publish and share interactive graph visualizations as URLs and embedded web widgets. Users can explore graphs incrementally by adding more related nodes, such as highly cited papers cited by or citing a paper of interest in a citation network. Argo Lite works across devices and platforms, leveraging WebGL for high-performance rendering. Argo Lite has been used by over 1,000 students at Georgia Tech's Data and Visual Analytics class. Argo Lite may serve as a valuable open-source tool for advancing multiple CIKM research areas, from data presentation, to interfaces for information systems and more.
We describe a static, open-access news corpus using data from the Common Crawl Foundation, which provides free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sampling of relevant news topics over the 583 day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and used to elicit query variations from crowdworkers, with a total of 10,437 queries collected against the 173 topics. Of these, 10,089 include key-stroke level instrumentation that captures the timings of character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
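Keystroke-level instrumentation of this kind can be replayed to reconstruct a worker's final query; the event format below is hypothetical and does not reflect the corpus's actual schema.

```python
def replay_keystrokes(events):
    """Rebuild the final query string from a sequence of
    ('ins', position, char) and ('del', position) events."""
    buffer = []
    for event in events:
        if event[0] == "ins":
            buffer.insert(event[1], event[2])
        else:  # 'del'
            buffer.pop(event[1])
    return "".join(buffer)

# Type "newz", delete the "z", then insert "s".
events = [("ins", 0, "n"), ("ins", 1, "e"), ("ins", 2, "w"),
          ("ins", 3, "z"), ("del", 3), ("ins", 3, "s")]
print(replay_keystrokes(events))  # 'news'
```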
Federated learning is a technique that enables distributed clients to collaboratively learn a shared machine learning model without sharing their training data. This reduces data privacy risks, however, privacy concerns still exist since it is possible to leak information about the training dataset from the trained model's weights or parameters. Therefore, it is important to develop federated learning algorithms that train highly accurate models in a privacy-preserving manner. Setting up a federated learning environment, especially with security and privacy guarantees, is a time-consuming process with numerous configurations and parameters that can be manipulated. In order to help clients ensure that collaboration is feasible and to check that it improves their model accuracy, a real-world simulator for privacy-preserving and secure federated learning is required.
In this paper, we introduce PrivacyFL, which is an extensible, easily configurable, and scalable simulator for federated learning environments. Its key features include latency simulation, robustness to client departure/failure, support for both centralized (with one or more servers) and decentralized (serverless) learning, and configurable privacy and security mechanisms based on differential privacy and secure multiparty computation (MPC).
In this paper, we motivate our research, describe the architecture of the simulator and associated protocols, and discuss its evaluation in numerous scenarios that highlight its wide range of functionality and its advantages. Our paper addresses a significant real-world problem: checking the feasibility of participating in a federated learning environment under a variety of circumstances. It also has a strong practical impact because organizations such as hospitals, banks, and research institutes, which have large amounts of sensitive data and would like to collaborate, would greatly benefit from having a system that enables them to do so in a privacy-preserving and secure manner.
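One configurable privacy mechanism of the kind described, differentially private aggregation of client updates, can be sketched like this; the function and its noise model are an illustrative stand-in for PrivacyFL's actual mechanisms.

```python
import random

def private_federated_average(client_weights, noise_scale, rng):
    """Average client model weights, then perturb each coordinate
    with Gaussian noise, in the style of differentially private
    federated aggregation."""
    n, dim = len(client_weights), len(client_weights[0])
    avg = [sum(w[i] for w in client_weights) / n for i in range(dim)]
    return [v + rng.gauss(0.0, noise_scale) for v in avg]

# With noise_scale=0 the result is the exact federated average.
print(private_federated_average([[1.0, 3.0], [3.0, 5.0]], 0.0, random.Random(0)))
```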
In this article, we introduce the \dataset dataset, a collection of implicit interactions and impressions of movies and TV series from an Over-The-Top media service, which delivers its media contents over the Internet. The dataset is distinguished from other already available multimedia recommendation datasets by the availability of impressions, i.e., the recommendations shown to the user, by its size, and by being open-source. We describe the data collection process, the preprocessing applied, its characteristics, and statistics when compared to other commonly used datasets. We also highlight several possible use cases and research questions that can benefit from the availability of user impressions in an open-source dataset. Furthermore, we release software tools to load and split the data, as well as examples of how to use both user interactions and impressions in several common recommendation algorithms.
Entity matching is a central task in data integration which has been researched for decades. Over this time, a wide range of benchmark tasks for evaluating entity matching methods has been developed. This resource paper systematically complements, profiles, and compares 21 entity matching benchmark tasks. In order to better understand the specific challenges associated with different tasks, we define a set of profiling dimensions which capture central aspects of the matching tasks. Using these dimensions, we create groups of benchmark tasks having similar characteristics. Afterwards, we assess the difficulty of the tasks in each group by computing baseline evaluation results using standard feature engineering together with two common classification methods. In order to enable the exact reproducibility of evaluation results, matching tasks need to contain exactly defined sets of matching and non-matching record pairs, as well as a fixed development and test split. As this is not the case for some widely-used benchmark tasks, we complement these tasks with fixed sets of non-matching pairs, as well as fixed splits, and provide the resulting development and test sets for public download. By profiling and complementing the benchmark tasks, we support researchers to select challenging as well as diverse tasks and to compare matching systems on clearly defined grounds.
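A baseline of the kind used to assess task difficulty, a string-similarity feature combined with a simple classifier, might look like the threshold rule below; this is a deliberate simplification of the paper's feature engineering, not its actual setup.

```python
def jaccard(a, b):
    """Token-overlap similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def is_match(record_a, record_b, threshold=0.5):
    """Declare a match when token overlap exceeds the threshold."""
    return jaccard(record_a, record_b) >= threshold

print(is_match("Apple iPhone 11 64GB", "apple iphone 11 black"))  # True (0.6)
```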
Given a text with entity links, the task of entity aspect linking is to identify which aspect of an entity is referred to in the context. For example, if a text passage mentions the entity "USA'', is USA mentioned in the context of the 2008 financial crisis, American cuisine, or else? Complementing efforts of Nanni et al (2018), we provide a large-scale test collection which is derived from Wikipedia hyperlinks in a dump from 01/01/2020. Furthermore, we offer strong baselines with results and broken-out feature sets to stimulate more research in this area.
Data, code, feature sets, runfiles and results are released under a CC-SA license and offered on our aspect linking resource web page http://www.cs.unh.edu/~dietz/eal-dataset-2020/
The comment sections of online news platforms are an important space to indulge in political conversations and to discuss opinions. Although primarily meant as forums where readers discuss amongst each other, they can also spark a dialog with the journalists who authored the article. A small but important fraction of comments address the journalists directly, e.g., with questions, recommendations for future topics, thanks and appreciation, or article corrections. However, the sheer number of comments makes it infeasible for journalists to follow discussions around their articles in extenso. A better understanding of this data could support journalists in gaining insights into their audience and fostering engaging and respectful discussions. To this end, we present a dataset of dialogs in which journalists of The Guardian replied to reader comments and identify the reasons why. Based on this data, we formulate the novel task of recommending reader comments to journalists that are worth reading or replying to, i.e., ranking comments in such a way that the top comments are most likely to require the journalists' reaction. As a baseline, we trained a neural network model with the help of a pair-wise comment ranking task. Our experiment reveals the challenges of this task and we outline promising paths for future work. The data and our code are available for research purposes from: https://hpi.de/naumann/projects/repeatability/text-mining.html
Graphs encode important structural properties of complex systems. Machine learning on graphs has therefore emerged as an important technique in research and applications. We present Karate Club - a Python framework combining more than 30 state-of-the-art graph mining algorithms. These unsupervised techniques make it easy to identify and represent common graph features. The primary goal of the package is to make community detection, node and whole graph embedding available to a wide audience of machine learning researchers and practitioners. Karate Club is designed with an emphasis on a consistent application interface, scalability, ease of use, sensible out-of-the-box model behaviour, standardized dataset ingestion, and output generation. This paper discusses the design principles behind the framework with practical examples. We show Karate Club's efficiency in learning performance on a wide range of real world clustering problems and classification tasks along with supporting evidence of its competitive speed.
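The flavor of the unsupervised techniques bundled, here community detection, can be conveyed by a minimal label propagation sketch; this is an illustrative toy, not Karate Club's actual implementation or API.

```python
from collections import Counter

def label_propagation(adjacency, rounds=5):
    """Synchronous label propagation: every node repeatedly adopts
    the most frequent label among its neighbours (ties broken by the
    smallest label), yielding rough communities."""
    labels = {node: node for node in adjacency}
    for _ in range(rounds):
        new_labels = {}
        for node, neighbours in adjacency.items():
            counts = Counter(labels[n] for n in neighbours)
            top = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == top)
        labels = new_labels
    return labels

# Two disconnected triangles settle into two communities.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
communities = label_propagation(graph)
print(communities[0] == communities[1] == communities[2])  # True
```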
Sampling graphs is an important task in data mining. In this paper, we describe Little Ball of Fur, a Python library that includes more than twenty graph sampling algorithms. Our goal is to make node, edge, and exploration-based network sampling techniques accessible to a large number of professionals, researchers, and students in a single streamlined framework. We created this framework with a focus on a coherent public application interface with a convenient design, generic input data requirements, and reasonable default algorithm settings. Here we give a detailed overview of the design foundations of the framework with illustrative code snippets. We show the practical usability of the library by estimating various global statistics of social networks and web graphs. Experiments demonstrate that Little Ball of Fur can speed up node and whole-graph embedding techniques considerably while only mildly deteriorating the predictive value of the distilled features.
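An exploration-based sampler of the kind the library includes can be sketched in a few lines; this random-walk sampler is illustrative only and does not reflect the library's API.

```python
import random

def random_walk_sample(adjacency, start, steps, seed=42):
    """Collect the set of nodes visited by a fixed-length random
    walk starting at 'start' (exploration-based sampling)."""
    rng = random.Random(seed)
    node, visited = start, {start}
    for _ in range(steps):
        node = rng.choice(adjacency[node])
        visited.add(node)
    return visited

# Sample a six-node ring graph with a 10-step walk.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
sample = random_walk_sample(ring, 0, 10)
print(sample <= set(ring))  # the sample is a subset of the graph's nodes
```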
The Natural Language Processing (NLP) community has contributed significantly to solutions for recognizing entities and relations in natural language text and, possibly, linking them to proper matches in Knowledge Graphs (KGs). With Wikidata as the background KG, however, there are still few tools that link knowledge within text to Wikidata. In this paper, we present Falcon 2.0, the first joint entity and relation linking tool over Wikidata. It receives a short natural language text in English and outputs a ranked list of entities and relations annotated with the proper candidates in Wikidata. The candidates are represented by their Internationalized Resource Identifier (IRI) in Wikidata. Falcon 2.0 resorts to an English language model for the recognition task (e.g., N-Gram tiling and N-Gram splitting) and to an optimization approach for the linking task. We have empirically studied the performance of Falcon 2.0 on Wikidata and concluded that it outperforms all existing baselines. Falcon 2.0 is open source and can be reused by the community; all required instructions for Falcon 2.0 are well documented at our GitHub repository (https://github.com/SDM-TIB/falcon2.0). We also demonstrate an online API, which can be run without any technical expertise. Falcon 2.0 and its background knowledge bases are available as resources at https://labs.tib.eu/falcon/falcon2/.
Apache Flink is an open-source system for scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data. Besides Flink, other scalable spatial data processing platforms, including GeoSpark and Spatial Hadoop, do not support streaming workloads and can only handle static/batch workloads. To fill this gap, we present GeoFlink, which extends Apache Flink to support spatial data types, indexes, and continuous queries over spatial data streams. To enable efficient processing of spatial continuous queries and effective data distribution across Flink cluster nodes, a grid-based index is introduced. GeoFlink currently supports spatial range, spatial kNN, and spatial join queries on the point data type. An experimental study on real spatial data streams shows that GeoFlink achieves significantly higher query throughput than ordinary Flink processing.
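The grid-based index underlying such distribution and query pruning can be sketched as a single-machine toy; the class below is illustrative and not GeoFlink's implementation.

```python
class GridIndex:
    """Partition 2D points into fixed-size cells so that a range
    query only visits cells overlapping the query rectangle."""
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = {}

    def _key(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y):
        self.cells.setdefault(self._key(x, y), []).append((x, y))

    def range_query(self, x1, y1, x2, y2):
        results = []
        for cx in range(int(x1 // self.cell_size), int(x2 // self.cell_size) + 1):
            for cy in range(int(y1 // self.cell_size), int(y2 // self.cell_size) + 1):
                for (x, y) in self.cells.get((cx, cy), []):
                    if x1 <= x <= x2 and y1 <= y <= y2:
                        results.append((x, y))
        return results

index = GridIndex(cell_size=4)
for point in [(1, 1), (5, 5), (9, 9)]:
    index.insert(*point)
print(sorted(index.range_query(0, 0, 6, 6)))  # [(1, 1), (5, 5)]
```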
Semantic Question Answering (QA) is a crucial technology to facilitate intuitive user access to semantic information stored in knowledge graphs. Whereas most of the existing QA systems and datasets focus on entity-centric questions, very little is known about these systems' performance in the context of events. As new event-centric knowledge graphs emerge, datasets for such questions gain importance. In this paper, we present the Event-QA dataset for answering event-centric questions over knowledge graphs. Event-QA contains 1000 semantic queries and the corresponding English, German and Portuguese verbalizations for EventKG - an event-centric knowledge graph with more than 970 thousand events.
In this paper, we implement and publicly share a configurable software workflow and a collection of gold standard datasets for training and evaluating supervised query refinement methods. Existing datasets such as AOL and MS MARCO, which have been extensively used in the literature for this purpose, are based on the weak assumption that users' input queries improve gradually within a search session, i.e., that the last query, with which the user ends her information-seeking session, is the best reconstructed version of her initial query. In practice, such an assumption is not necessarily accurate, for a variety of reasons, e.g., topic drift. The objective of our work is to enable researchers to build gold standard query refinement datasets without having to rely on such weak assumptions. Our software workflow, which generates such gold standard query datasets, takes three inputs: (1) a dataset of queries along with their associated relevance judgements (e.g., TREC topics), (2) an information retrieval method (e.g., BM25), and (3) an evaluation metric (e.g., MAP), and outputs a gold standard dataset. The produced gold standard dataset includes a list of revised queries for each query in the input dataset, each of which effectively improves the performance of the specified retrieval method in terms of the desired evaluation metric. Since our workflow can be used to generate gold standard datasets for any input query set, we have generated and publicly shared gold standard datasets for TREC queries associated with Robust04, Gov2, ClueWeb09, and ClueWeb12. The source code of our software workflow, the generated gold datasets, and benchmark results for three state-of-the-art supervised query refinement methods over these datasets are made publicly available for reproducibility purposes.
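The workflow's core selection step can be sketched directly from the description above; the retrieval and metric functions here are toy placeholders for components like BM25 and MAP, not the workflow's actual code.

```python
def build_gold_refinements(queries, candidates, retrieve, metric):
    """For each query, keep only those candidate refinements whose
    retrieval results score strictly better than the original query
    under the chosen evaluation metric."""
    gold = {}
    for query in queries:
        baseline = metric(retrieve(query))
        gold[query] = [c for c in candidates.get(query, [])
                       if metric(retrieve(c)) > baseline]
    return gold

# Toy stand-ins: 'retrieve' returns the relevant documents found,
# and the 'metric' simply counts them.
results = {"jaguar": ["d1"], "jaguar car": ["d1", "d2"], "jaguar cat": []}
gold = build_gold_refinements(["jaguar"],
                              {"jaguar": ["jaguar car", "jaguar cat"]},
                              results.__getitem__, len)
print(gold)  # {'jaguar': ['jaguar car']}
```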
Knowledge graphs have become a popular means for modeling complex biological systems, where they capture the interactions between biological entities and their effects on the biological system. They also support relational learning models, which are known to provide highly scalable and accurate predictions of associations between biological entities. Despite the success of combining biological knowledge graphs and relational learning models in biological predictive tasks, there is a lack of unified biological knowledge graph resources. This has forced all current efforts and studies applying a relational learning model to biological data to compile and build biological knowledge graphs from open biological databases. This process is often performed inconsistently across such efforts, especially in terms of choosing the original resources, aligning identifiers of the different databases, and assessing the quality of included data. To make relational learning on biomedical data more standardised and reproducible, we propose a new biological knowledge graph which provides a compilation of curated relational data from open biological databases in a unified format with common, interlinked identifiers. We also provide a new module for mapping identifiers and labels from different databases which can be used to align our knowledge graph with biological data from other heterogeneous sources. Finally, to illustrate the practical relevance of our work, we provide a set of benchmarks based on the presented data that can be used to train and assess relational learning models in various tasks related to pathway and drug discovery.
While a number of recent open-source toolkits for training and using neural information retrieval models have greatly simplified experiments with neural reranking methods, they essentially hard code a "search-then-rerank" experimental pipeline. These pipelines consist of an efficient first-stage ranking method, like BM25, followed by a neural reranking method. Deviations from this setup often require hacks; some improvements, like adding a second reranking step that uses a more expensive neural method, are infeasible without major code changes. In order to improve the flexibility of such toolkits, we propose implementing experimental pipelines as dependency graphs of functional "IR primitives," which we call modules, that can be used and combined as needed.
For example, a neural IR pipeline may rerank results from a Searcher module that efficiently retrieves results from an Index module that it depends on. In turn, the Index depends on a Collection to index, which is provided by the pipeline. This Searcher module is self-contained: the pipeline does not need to know about or interact with the Index of the Searcher, which is transparently shared among Searcher modules when possible (e.g., a BM25 and a QL Searcher might share the same Index). Similarly, a Reranker module might depend on a Trainer (e.g., Tensorflow), feature Extractor, Tokenizer, etc. In both cases, the pipeline needs to interact only with the Reranker or Searcher directly; the complexity of their dependencies is hidden and intelligently managed. We rewrite the Capreolus toolkit to take this approach and demonstrate its use.
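The transparent Index sharing described can be sketched with a small dependency-caching pattern; the class names follow the text, but the code is illustrative, not Capreolus's actual implementation.

```python
class Index:
    """One Index per collection, transparently shared by every
    module that depends on it."""
    _cache = {}

    def __new__(cls, collection):
        if collection not in cls._cache:
            instance = super().__new__(cls)
            instance.collection = collection
            cls._cache[collection] = instance
        return cls._cache[collection]

class Searcher:
    """A self-contained module: the pipeline interacts only with
    the Searcher; its Index dependency is hidden and shared."""
    def __init__(self, name, collection):
        self.name = name
        self.index = Index(collection)

bm25 = Searcher("BM25", "robust04")
ql = Searcher("QL", "robust04")
print(bm25.index is ql.index)  # both Searchers share one Index
```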
Search clarification has recently attracted much attention due to its applications in search engines. It has also been recognized as a major component in conversational information seeking systems. Despite its importance, the research community still feels the lack of a large-scale dataset for studying different aspects of search clarification. In this paper, we introduce MIMICS, a collection of search clarification datasets for real web search queries sampled from the Bing query logs. Each clarification in MIMICS is generated by a Bing production algorithm and consists of a clarifying question and up to five candidate answers. MIMICS contains three datasets: (1) MIMICS-Click includes over 400k unique queries, their associated clarification panes, and the corresponding aggregated user interaction signals (i.e., clicks). (2) MIMICS-ClickExplore is an exploration dataset that includes aggregated user interaction signals for over 60k unique queries, each with multiple clarification panes. (3) MIMICS-Manual includes over 2k unique real search queries. Each query-clarification pair in this dataset has been manually labeled by at least three trained annotators. It contains graded quality labels for the clarifying question, the candidate answer set, and the landing result page for each candidate answer.
MIMICS is publicly available for research purposes, enabling researchers to study a number of tasks related to search clarification, including clarification generation and selection, user engagement prediction for clarification, click models for clarification, and analysis of user interactions with search clarification. We also release the results returned by Bing's web search API for all queries in MIMICS, allowing researchers to utilize search results for tasks related to search clarification.
Ontology alignment plays a critical role in helping heterogeneous resources interoperate. It has been studied for over a decade, and over that time many alignment systems and methods have been developed by researchers to find simple 1:1 equivalence matches between two ontologies. However, very few alignment systems focus on finding complex correspondences. Even where complex alignment systems have been developed, their performance in finding complex relations still leaves much room for improvement. One reason for this limitation may be that there are still few applicable alignment benchmarks containing such complex relationships that can raise researchers' interest. In this paper, we propose a real-world dataset from the Enslaved project as a potential complex alignment benchmark. The benchmark consists of two resources, the Enslaved Ontology along with a Wikibase repository holding a large amount of instance data from the Enslaved project, as well as a manually created reference alignment between them. The alignment was developed in consultation with domain experts in the digital humanities. It includes not only simple 1:1 equivalence correspondences but also more complex m:n equivalence and subsumption correspondences, and is provided in both the Expressive and Declarative Ontology Alignment Language (EDOAL) format and rule syntax. The Enslaved benchmark has been incorporated into the Ontology Alignment Evaluation Initiative (OAEI) 2020 and is completely free for public use to assist researchers in developing and evaluating their complex alignment algorithms.
First identified in Wuhan, China, in December 2019, the outbreak of COVID-19 has been declared as a global emergency in January, and a pandemic in March 2020 by the World Health Organization (WHO). Along with this pandemic, we are also experiencing an "infodemic" of information with low credibility such as fake news and conspiracies. In this work, we present ReCOVery, a repository designed and constructed to facilitate research on combating such information regarding COVID-19. We first broadly search and investigate ~2,000 news publishers, from which 60 are identified with extreme [high or low] levels of credibility. By inheriting the credibility of the media on which they were published, a total of 2,029 news articles on coronavirus, published from January to May 2020, are collected in the repository, along with 140,820 tweets that reveal how these news articles have spread on the Twitter social network. The repository provides multimodal information of news articles on coronavirus, including textual, visual, temporal, and network information. The way that news credibility is obtained allows a trade-off between dataset scalability and label accuracy. Extensive experiments are conducted to present data statistics and distributions, as well as to provide baseline performances for predicting news credibility so that future methods can be compared. Our repository is available at http://coronavirus-fakenews.com.
Social media platforms have been growing at a rapid pace, attracting user engagement through convenient and usable features. Such platforms provide users with interactive options such as likes and dislikes, as well as a way of expressing their opinions in the form of text (i.e., comments). The ability to post comments on these online platforms has allowed some users to post racist and obscene content and to spread hate. In some cases, this kind of toxic behavior can turn the comment section from a space where users share their views into a place where hate and profanity are spread. Such issues are observed across various social media platforms, and many users are exposed to these kinds of behaviors, requiring comment moderators to spend a lot of time filtering out inappropriate comments. Moreover, such textual "inappropriate content" can be targeted at users irrespective of age, can concern a variety of topics (not only controversial ones), and can be triggered by various events. My doctoral dissertation work, therefore, is primarily focused on studying, detecting, and analyzing users' exposure to this kind of toxicity on different social media platforms, utilizing state-of-the-art techniques in deep learning and natural language processing. This paper presents one example of my work: detecting and measuring children's exposure to inappropriate comments posted on YouTube videos targeting young users. In the meantime, the same pipeline is being used to measure users' interaction with mainstream news media and sentiment toward various topics in the public discourse in light of the coronavirus disease 2019 (COVID-19).
Entity matching has received significant attention from the research community over many years. Despite some limited success, most state-of-the-art methods see little widespread use in industry.
In this paper, we present the author's PhD research, which aims to identify the issues that keep techniques and methods developed by the research community from being used in industry, and to examine how they might be adapted to address those issues. In our proposed approach, we implement a modular framework, which will be used for real-world user testing and quantitative experiments on our adapted methods. Our research will make three main contributions: 1) we develop a modular framework for interactive entity matching combining intra- and inter-session iterations; 2) we show how active learning methods for entity matching can be adapted to jointly learn not only the classification of matches but also the classification of which records are of interest to the user, and how this compares to current methods; and 3) we show how deep learning can be used to synthesize interpretable rules for entity matching, and how this compares to traditional methods.
Knowledge Graphs (KGs) have recently gained attention for representing knowledge about particular domains. Since its advent, the Linked Open Data (LOD) cloud has constantly been growing, and it now contains many KGs about many different domains such as government, scholarly data, the biomedical domain, etc. Apart from facilitating the inter-connectivity of datasets in the LOD cloud, KGs have been used in a variety of machine learning and Natural Language Processing (NLP) applications. However, the information present in KGs is sparse and often incomplete, and predicting the missing links between entities is necessary to overcome this issue. Moreover, in the LOD cloud, information about the same entities is available in multiple KGs in different forms, but the information that these entities are the same across KGs is missing. The main focus of this thesis is Knowledge Graph Completion, tackling the link prediction task both within a KG and across different KGs. To do so, latent representations of KGs in a low-dimensional vector space are exploited to predict the missing information and thereby complete the KGs.
Drug development is a costly and time-consuming activity. The traditional process relies on extensive experimental efforts to map out the relevant part of the chemical space. Data about molecules, diseases, genes, and other entities are spread across many isolated databases, whether internal or external, and in heterogeneous formats; they require either costly and inflexible data integration or time-consuming workflows. Computational approaches, and more recently artificial-intelligence-based techniques, have emerged as a promising alternative for shortening the development cycle through drug repositioning, in which knowledge bases are used to predict new links between existing drugs and new targets. We present below the overall approach adopted for my PhD thesis: a more holistic knowledge-graph-based drug repositioning that aims to discover hidden or missing links between existing drugs and targets for which no known treatment is available. Currently, eight data and knowledge resources have already been integrated into the designed knowledge graph.
Access to medical data is highly regulated due to its sensitive nature, which can constrain communities' ability to utilise these data for research or clinical purposes. Common de-identification techniques to enable the sharing of data may not provide adequate protection for an individual's personal data in every circumstance. We investigate the ability of Generative Adversarial Networks (GANs) to generate realistic medical time series data to address these privacy and identification concerns. We generate synthetic, and more significantly, multichannel electrocardiogram (ECG) signals that are representative of waveforms observed in patients. Successful generation of high-quality synthetic time series data has the potential to act as an effective substitute for actual patient data. For the first time, we demonstrate a multivariate GAN architecture that can successfully generate dependent multichannel time series signals. We present the first application of multivariate dynamic time warping as a means of evaluating generated GAN samples. Quantitative evidence demonstrates our GAN can generate data that is structurally similar to the training set and diverse across generated samples, all whilst ensuring sufficient privacy guarantees for the underlying training data.
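The abstract's evaluation rests on dynamic time warping between generated and real multichannel signals. As a rough illustration only (not the authors' implementation), the following sketch computes a multivariate DTW distance where the local cost between two time steps is the Euclidean distance between their channel vectors:

```python
import numpy as np

def multivariate_dtw(x, y):
    """DTW distance between two multichannel time series.

    x: (n, c) array, y: (m, c) array -- n, m time steps, c channels.
    Local cost is the Euclidean distance between channel vectors.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            # Standard DTW recurrence: extend the cheapest warping path.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A GAN sample that warps closely onto a real ECG trace yields a small distance; structurally dissimilar samples yield large ones.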
Location-Based Services (LBS) are a continuous, local, and spatially restricted mobile database system (MDS) technology for mobile environments. The domain is associated with many fascinating issues and challenges, and has provided fertile ground for many researchers and experts. This paper discusses the issues and challenges identified in the available literature, covering the latest approaches and experimental results for location-dependent cache invalidation-replacement, prefetching, location privacy, and map matching (MM) policies. It also offers potential future paths for exploring the unanswered questions.
Pattern matching is an important task in the field of Complex Event Processing (CEP). However, exact event pattern matching methods can suffer from low hit rates and miss meaningful events due to the heterogeneous and dirty sources of the big data era. Since both events and patterns can be imprecise, the actual event trace may differ from the pre-defined pattern in both event names and structure, and low-quality data further intensifies the difficulty of matching. In this work, we propose to learn embedding representations for patterns and event traces separately and to use their similarity as the score for approximate matching.
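To make the scoring idea concrete, here is a minimal sketch under simplifying assumptions (mean-pooled event embeddings and cosine similarity; the paper's learned representations would replace both choices):

```python
import numpy as np

def trace_embedding(events, vectors):
    """Mean-pool the embeddings of the events in a trace or pattern.

    events: list of event names; vectors: dict name -> 1-D embedding.
    Unknown event names are skipped, a crude stand-in for dirty sources.
    """
    vecs = [vectors[e] for e in events if e in vectors]
    return np.mean(vecs, axis=0)

def match_score(pattern, trace, vectors):
    """Cosine similarity between pattern and trace embeddings,
    used as the approximate-matching score."""
    p = trace_embedding(pattern, vectors)
    t = trace_embedding(trace, vectors)
    return float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t)))
```

A trace whose events are synonyms of the pattern's events would score close to 1 even when the exact names differ, which is the point of approximate matching.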
The ultimate goal of my long-term project is "Augmented Inventing," and this work is a follow-up effort toward that goal. It leverages the structural metadata in patent documents and the text-to-text mappings between metadata fields. The structural metadata includes the patent title, abstract, independent claims, and dependent claims. Using the structural metadata, it is possible to control what kind of patent text to generate; using the text-to-text mappings, it is possible to have a generative model generate one type of patent text from another. Furthermore, by chaining multiple mappings, it is possible to build a text generation flow, for example, generating from a few words to a patent title, from the title to an abstract, from the abstract to an independent claim, and from the independent claim to multiple dependent claims. After training with bi-directional mappings, the text generation flow can also run backward. In addition, the contributions of this work include: (1) releasing four generative models trained from scratch on a patent corpus, (2) releasing sample code demonstrating how to generate patent text bi-directionally, and (3) measuring the performance of the models with ROUGE and the Universal Sentence Encoder as preliminary evaluations of text generation quality.
Deep learning requires volume, quality, and variety of training data. In neural question answering, a trade-off between quality and volume comes from the need to either manually curate or construct realistic question answering data, which is costly, or else augment, weakly label, or generate training data from smaller datasets, leading to low variety and sometimes low quality. What can be done to make the best of this necessary trade-off? What can be learned from the endeavor to seek such solutions?
Storytelling is an ancient art and science that has conveyed wisdom across generations for centuries. Data-driven storytelling over a natural language corpus has huge potential for quickly conveying valuable insights about the corpus for better decision making, but the high-dimensional, unstructured nature of natural language text makes automatic extraction of stories extremely difficult. This PhD research project holds that modern storytelling is a hand-in-hand combination of contextual topic visualization and contextual summarization: exploratory data visualization can provide valuable insights into the data, and these insights can in turn be used to understand and design models for abstractive summarization. In this project, the context of a story is defined from three perspectives: a single document, a collection of documents about a topic of interest, and the whole corpus. Exploratory data visualization has been used to better understand the context, and with the insights achieved, the research is now focusing on abstractive summarization for automatic contextual storytelling.
Asking a clarifying question can be a key element in improving the performance of information seeking systems, particularly conversational search systems with their limited-bandwidth interfaces. While generating and asking clarifying questions is important, getting an answer to the clarifying question is also essential, as a clarifying question without an answer is useless. Therefore, as the first step in the current research, we analysed human-generated clarifying questions on a Community Question Answering website as a sample of conversation. This gave us better insight into how users interact with clarification. We investigated whether the clarifying questions add any information to the question and the accepted answer, and we further discovered the patterns and types of such clarifying questions. The next phase of this research will generate clarifying questions for conversational search systems: we will employ neural network models to generate clarifying questions that maximise clarification. The proposed model will be trained using the MIMICS data collection in addition to our own collected dataset. We will also attempt to apply the patterns recognised in the first-step analysis to increase the chance that a user will interact with the clarifying questions. Finally, we will aim to minimise the interaction between the search system and the user, to reduce the risk of the user dropping the conversation because too many clarifying questions were asked.
As more and more vendors develop multi-model databases, people have reaped the benefits of using a single, unified platform to manage both well-structured and NoSQL data. However, this comes with the steep learning curve of mastering the multi-model query language of a specific multi-model database, not to mention the various languages of different databases. This research therefore discusses the motivations for performing keyword search over multi-model databases and then presents our current work. Methodologically, we attempt to use a quantum-inspired framework to query and explore multi-model databases. First, we apply non-classical probabilities to estimate the relevance between a keyword query and candidate answers, to guarantee good accuracy. Then we use Principal Component Analysis (PCA) to optimize the quantum language model for good scalability. Finally, experiments show that our approaches are effective and that our framework outperforms state-of-the-art approaches.
We focus on summarizing hierarchical data by adapting the well-known notion of end-biased histograms to trees. Over relational data, such histograms have been well studied, as they strike a good balance between accuracy and space requirements. Extending histograms to tree data is a non-trivial problem, due to the need to preserve and leverage structure in the output. We develop a fast greedy algorithm, and a polynomial-time algorithm that finds provably optimal hierarchical end-biased histograms. Preliminary experimentation demonstrates that our histograms work well in practice.
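For readers unfamiliar with the relational starting point, here is a minimal sketch of one common end-biased histogram variant over flat data (exact singleton buckets for the most frequent values, one averaged bucket for the rest); the paper's contribution, the extension to trees, is not shown:

```python
from collections import Counter

def end_biased_histogram(values, k):
    """The k most frequent values keep their exact counts; all
    remaining values share one bucket storing their average count."""
    freq = Counter(values)
    top = dict(freq.most_common(k))
    rest = [c for v, c in freq.items() if v not in top]
    avg = sum(rest) / len(rest) if rest else 0.0
    return top, avg

def estimate(value, top, avg):
    """Estimated frequency of `value` under the histogram."""
    return top.get(value, avg)
```

The space cost is k singletons plus one average, while the heavy hitters (which dominate selectivity estimates) stay exact: that is the accuracy/space balance mentioned above.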
Studying the effects of semantic analysis on retrieval effectiveness can be difficult using standard test collections because both queries and documents typically lack semantic markup. This paper describes extensions to two test collections, CLEF 2003/2004 Russian and TDT-3 Chinese, to support study of the utility of named entity annotation. A new set of topic aspects that were expected to benefit from named entity markup were defined for topics in those test collections, with two queries for each aspect. One of these queries uses named entities as bag-of-words query terms or as semantic constraints on a free-text query term; the other is a bag-of-words baseline query without named entity markup. Exhaustive judgment of the documents annotated by CLEF or TDT as relevant to each corresponding topic was performed, resulting in relevance judgments for 133 Russian and 33 Chinese topic aspects that each have at least one relevant document. Named entity tags were automatically generated for the documents in both collections. Use of the test collections is illustrated with some preliminary experiments.
Rule-based explanations are a popular method to understand the rationale behind the answers of complex machine learning (ML) classifiers. Recent approaches, such as Anchors, focus on local explanations based on if-then rules that are applicable in the vicinity of a target instance. This has proved effective at producing faithful explanations, yet anchor-based explanations are not free of limitations, including long, overly specific rules as well as explanations of low fidelity. This work presents two simple methods that can mitigate such issues on tabular and textual data. The first proposes a careful selection of the discretization method for numerical attributes in tabular datasets; the second applies the notion of pertinent negatives to explanations on textual data. Our experimental evaluation shows the positive impact of these methods on the quality of anchor-based explanations.
The field of query-by-example aims to infer queries from output examples given by non-expert users, by finding the underlying logic that binds the examples. However, from a very small set of examples, it is difficult to infer this logic correctly. To bridge this gap, previous work suggested attaching explanations to each output example, modeled as provenance, allowing users to explain the reason behind their choice of example. In this paper, we explore the problem of inferring queries from a few output examples and intuitive explanations. We propose a two-step framework: (1) convert the explanations into (partial) provenance and (2) infer a query that generates the output examples using a novel algorithm that employs a graph-based approach. This framework is suitable for non-experts, as it requires neither the specification of the provenance in its entirety nor an understanding of its structure. We show promising initial experimental results of our approach.
Predicting endings for narrative stories is a grand challenge for machine commonsense reasoning. The task requires accurate representation of the story semantics and structured logic knowledge. Pre-trained language models, such as BERT, made progress recently in this task by exploiting spurious statistical patterns in the test dataset, instead of 'understanding' the stories per se. In this paper, we propose to improve the representation of stories by first simplifying the sentences to some key concepts and second modeling the latent relationship between the key ideas within the story. Such enhanced sentence representation, when used with pre-trained language models, makes substantial gains in prediction accuracy on the popular Story Cloze Test without utilizing the biased validation data.
In most existing work, nDCG is computed with a fixed cutoff k, i.e., nDCG@k, and a fixed discounting coefficient. Such a conventional, query-independent way of computing nDCG does not accurately reflect the utility of search results as perceived by an individual user and is thus non-optimal. In this paper, we conduct a case study of the impact of using query-specific nDCG on the choice of the optimal Learning-to-Rank (LETOR) method, in particular to see whether using query-specific nDCG would lead to a different conclusion about the relative performance of multiple LETOR methods than the conventional query-independent nDCG would otherwise. Our initial results show that the relative ranking of LETOR methods under query-specific nDCG can differ dramatically from that under query-independent nDCG at the individual query level, suggesting that query-specific nDCG may be useful for obtaining more reliable conclusions in retrieval experiments.
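For reference, the conventional query-independent metric that the abstract argues against looks as follows (a standard sketch with the log2 positional discount; a query-specific variant would vary k or the discount per query, which this sketch merely enables via its parameters):

```python
import math

def dcg(gains, k):
    """DCG@k with the conventional log2 positional discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    """nDCG@k: DCG of the ranking, normalized by the DCG of the
    ideal (descending-gain) reordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

Calling `ndcg` with a per-query cutoff rather than one global k is the simplest form of the query-specific evaluation the study investigates.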
Building a robust predictive model requires an array of steps such as data imputation, feature transformation, estimator selection, hyper-parameter search, and ensemble construction, amongst others. Due to this vast, complex, and heterogeneous space of operations, off-the-shelf optimization methods offer infeasible solutions for realistic response-time requirements. In practice, much of the predictive modeling process is conducted by experienced data scientists, who selectively make use of available tools. Over time, they develop an understanding of the behavior of operators and perform serial decision making under uncertainty, colloquially referred to as educated guesswork. With an unprecedented demand for applications of supervised machine learning, there is a call for solutions that automatically search for a suitable combination of operators across these tasks while minimizing the modeling error. We introduce a novel system called APRL (Autonomous Predictive modeler via Reinforcement Learning), which uses past experience through reinforcement learning to optimize sequential decision making within a set of diverse actions under a budget constraint. Our experiments demonstrate the superiority of the proposed approach over known AutoML systems that utilize Bayesian optimization or genetic algorithms.
For the last few decades, malware has posed a major concern in cyber security. Recently, a number of "author groups" have been generating many new malware variants by sharing source code within a group and exploiting evasive schemes such as polymorphism and metamorphism. This motivates us to study the problem of identifying the author group of a given malware sample, which would help not only in blocking malware but also in legally pursuing suspected malware authors. In this paper, we propose a human-machine collaborative approach for accurately classifying the author groups of malware. We also propose a visualization method that helps human experts make decisions easily. We verify the superiority of our framework through extensive experiments using real-world malware data.
We study the problem of deriving geolocations for Wikipedia pages. To this end, we introduce a general four-step process to location derivation, and consider different instantiations of this process, leveraging both textual and categorical data. Extensive experimentation shows that our methods provide good precision-recall trade-offs and improvements over text-only methods. Hence, our system can be used to augment the geographic information of Wikipedia, and to enable more effective geographic information retrieval.
With the increasing popularity of data structures such as graphs, recursion is becoming a key ingredient of query languages in analytic systems. Recursive query evaluation involves iteratively applying a function or operation until some condition is satisfied, and is particularly useful for retrieving nodes reachable along deep paths in a graph. The optimization of recursive queries has remained a challenge for decades. Recently, extensions of Codd's classical relational algebra to support recursive terms, and their optimisation, have gained renewed interest. Query optimization crucially relies on the enumeration of query evaluation plans and on cost estimation techniques; cost estimation for recursive terms is far from trivial and has received less attention. In this paper, we propose a new cost estimation technique for recursive terms of the extended relational algebra. This technique allows selecting the estimated cheapest query plan in terms of computing resource usage (e.g., memory footprint, CPU, and I/O) and evaluation time. We evaluate the effectiveness of our cost estimation technique on a set of recursive graph queries over both generated and real datasets of significant size, including Yago, a graph with more than 62 million edges and 42 million nodes. Experiments show that our cost estimation technique improves the performance of recursive query evaluation on popular relational database engines such as PostgreSQL.
Pinterest, as a popular image search platform, has been widely adopted by users. Every day, people come to Pinterest to search for fashion- and home-decor-related content, domains in which users exhibit stable personal tastes. In this paper, we propose a novel search algorithm that infers user tastes from past engagement history and tailors the search results to fit user preferences. Online and offline experiments show that our method can efficiently improve the user experience and increase user engagement.
In this paper, we propose RotaryDS to provide a fast storage service for massive data streams. RotaryDS uses a rotation storage model, which employs distributed data buckets to ingest fast-arriving data streams. Each data bucket is in one of four states: idle waiting, data filling, write waiting, or data dumping, and its state changes according to the data operations. With the rotation storage model, we distribute massive data streams among multiple data buckets, thereby improving the write throughput of the storage system. We implement RotaryDS based on the rotation storage model and conduct preliminary experiments comparing it with MongoDB. The results suggest the efficiency of our proposal.
An active learning (AL) algorithm seeks to construct an effective classifier with a minimal number of labeled examples in a bootstrapping manner. While standard AL heuristics exist, such as selecting for annotation those points for which a classification model yields its least confident predictions, there has been no empirical investigation of whether these heuristics lead to models that are more interpretable to humans. In the era of data-driven learning, this is an important research direction to pursue. This paper describes our work in progress towards developing an AL selection function that, in addition to model effectiveness, also seeks to improve the interpretability of the model during the bootstrapping steps. Concretely speaking, our proposed selection function trains an 'explainer' model in addition to the classifier model, and favours those instances where a different part of the data is used, on average, to explain the predicted class. Initial experiments exhibited encouraging trends, showing that such a heuristic can lead to more effective and more explainable end-to-end data-driven classifiers.
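The standard least-confidence heuristic that this work extends can be sketched in a few lines (this is the baseline selection function, not the proposed interpretability-aware one):

```python
import numpy as np

def least_confident(probas, n):
    """Standard least-confidence AL selection: pick the n unlabeled
    instances whose top predicted class probability is lowest.

    probas: (num_instances, num_classes) predicted probabilities.
    Returns indices into the unlabeled pool.
    """
    confidence = probas.max(axis=1)       # confidence = top-class prob
    return np.argsort(confidence)[:n]     # least confident first
```

The proposed selection function would add a second term from the 'explainer' model to this score, favouring instances explained by different parts of the data.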
In this paper, we focus on learning low-dimensional embeddings for nodes in graph-structured data. To achieve this, we propose Caps2NE -- a new unsupervised embedding model leveraging a network of two capsule layers. Caps2NE induces a routing process to aggregate the feature vectors of the context neighbors of a given target node at the first capsule layer, then feeds these features into the second capsule layer to infer a plausible embedding for the target node. Experimental results show that our proposed Caps2NE obtains state-of-the-art performance on benchmark datasets for the node classification task. Our code is available at: https://github.com/daiquocnguyen/Caps2NE.
Structured world knowledge is at the foundation of knowledge-centric AI applications. Despite considerable research on knowledge base construction, little is known about the progress of KBs beyond mere statement counts, in particular concerning their coverage, and one may wonder whether progress is constant or returns are diminishing. In this paper we employ question answering and entity summarization as extrinsic use cases for a longitudinal study of the progress of KB coverage. Our analysis shows a near-continuous improvement of two popular KBs, DBpedia and Wikidata, over the last 19 years, with few signs of flattening out or leveling off.
Maximal clique enumeration is a well-studied problem due to its many applications. We present a new algorithm for this problem that enumerates maximal cliques in a diverse ordering. The main idea behind our approach is to adapt the classic Bron-Kerbosch (BK) algorithm by, conceptually, jumping between different nodes in the execution tree. Special care is taken to ensure that (1) each maximal clique is created precisely once, (2) the theoretical runtime remains the same as in the BK algorithm and (3) memory requirements remain reasonable. Experimental results show that we indeed achieve our goals, and moreover, that the cliques are enumerated in a diverse order.
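For context, the classic Bron-Kerbosch algorithm that the approach adapts can be sketched as follows (the basic recursion without pivoting; the paper's contribution, reordering the execution tree for diversity, is not shown):

```python
def bron_kerbosch(R, P, X, graph, out):
    """Classic Bron-Kerbosch maximal clique enumeration.

    graph: adjacency dict mapping node -> set of neighbors.
    R: current clique, P: candidate extensions, X: excluded nodes.
    """
    if not P and not X:
        out.append(set(R))  # R is maximal: nothing can extend it
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & graph[v], X & graph[v], graph, out)
        P.remove(v)  # v is done: move it from candidates ...
        X.add(v)     # ... to excluded, so no clique is emitted twice

def maximal_cliques(graph):
    out = []
    bron_kerbosch(set(), set(graph), set(), graph, out)
    return out
```

The X set is what guarantees each maximal clique is created precisely once, the first property the paper preserves while reordering.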
In this paper, we provide a large dataset for fake news detection using social media comments. The dataset consists of 12,597 claims (of which 63% are labelled as fake) from four different sources (Snopes, Politifact, Emergent, and Twitter). The novel part of the dataset is that it also includes over 662K social media discussion comments related to these claims from Reddit. We make this dataset public for the research community. In addition, for the task of fake news detection using social media comments, we provide a simple but strong baseline, a deep neural network model that beats several solutions in the literature.
Automatically generating or ranking distractors for multiple-choice questions (MCQs) is still a challenging problem. In this work, we focus on the automatic ranking of distractors for MCQs and propose a semantically aware CNN-BiLSTM model. We evaluate our model with different word-level embeddings as input over two openly available datasets. Experimental results demonstrate that our proposed model surpasses the performance of the existing baseline models. Furthermore, we observe that intelligently incorporating word-level semantic information along with context-specific word embeddings boosts the predictive performance for distractors, which is a promising direction for further research.
Link prediction aims to predict whether two nodes in a network are likely to become connected. Motivated by its applications, e.g., in friend or product recommendation, link prediction has been extensively studied over the years. Most link prediction methods are designed around specific assumptions that may or may not hold in different networks, making these methods non-generalizable. Here, we address this problem by proposing general link prediction methods that can capture network-specific patterns. Most link prediction methods rely on computing similarities between nodes; by learning a γ-decaying model, the proposed methods can measure pairwise node similarities more accurately, even when using only common-neighbor information, which is often used by current techniques.
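As background on the γ-decaying idea (not the paper's learned model), the classic Katz index is a fixed-γ instance: paths of length l between two nodes each contribute γ^l to their similarity, so longer paths count less. A minimal sketch:

```python
import numpy as np

def katz_scores(A, gamma=0.1, max_len=5):
    """Truncated Katz index, a classic gamma-decaying similarity.

    A: (n, n) symmetric 0/1 adjacency matrix. Each path of length l
    between nodes i and j contributes gamma**l to S[i, j].
    """
    S = np.zeros_like(A, dtype=float)
    walk = np.eye(len(A))
    for l in range(1, max_len + 1):
        walk = walk @ A              # walk[i, j] = #paths of length l
        S += (gamma ** l) * walk
    return S
```

A learned γ-decaying model generalizes this by fitting the decay rather than fixing γ in advance.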
Mining cohesive subgraphs is a fundamental problem in social network analysis. The k-truss model has been widely used to measure the cohesiveness of subgraphs. Most existing studies of k-truss focus on unsigned graphs. However, in real applications, the edges in a network can be either positive or negative, e.g., friend or foe relationships, which conveys more information than an unsigned network, so the traditional k-truss model is not applicable to signed networks. Motivated by this, we propose a novel model, named signed (k,r)-truss, which leverages the property of balanced triangles in signed network analysis. Specifically, a signed (k,r)-truss is a subgraph in which each edge has no less than k balanced support and no more than r unbalanced support. We prove that the problem of identifying the maximum signed (k,r)-truss is NP-hard. Due to the hardness of the problem, we turn to heuristic strategies: a trivial algorithm is first presented, and then two greedy algorithms are developed to enhance the processing. Finally, we conduct comprehensive experiments on real-world signed networks to verify the performance of the proposed techniques.
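The balanced/unbalanced support of an edge is easy to state concretely: a triangle is balanced iff the product of its three edge signs is positive (standard structural balance theory), and an edge's balanced support counts the balanced triangles containing it. A minimal sketch of the counting, assuming an edge-sign map as input:

```python
def edge_supports(edges):
    """Balanced and unbalanced support per edge in a signed graph.

    edges: dict mapping frozenset({u, v}) -> sign (+1 or -1).
    Returns dict edge -> [balanced_support, unbalanced_support].
    """
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    support = {e: [0, 0] for e in edges}
    for e in edges:
        u, v = tuple(e)
        for w in adj[u] & adj[v]:  # common neighbors form triangles
            sign = edges[e] * edges[frozenset({u, w})] * edges[frozenset({v, w})]
            support[e][0 if sign > 0 else 1] += 1
    return support
```

A signed (k,r)-truss then keeps only edges with balanced support ≥ k and unbalanced support ≤ r, iteratively peeling violating edges.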
Semi-supervised consensus clustering integrates supervised information into consensus clustering in order to improve clustering quality. In this paper, we study Semi-MultiCons, a novel semi-supervised consensus clustering method extending the previous MultiCons approach. Semi-MultiCons aims to improve the clustering result by integrating pairwise constraints in the consensus creation process and to infer the number of clusters K using frequent closed itemsets extracted from the ensemble members. Experimental results show that the proposed method outperforms other state-of-the-art semi-supervised consensus algorithms.
Recently, deep reinforcement learning (DRL) has been used for intelligent traffic light control. Unfortunately, we find that state-of-the-art DRL-based intelligent traffic light control essentially adopts discrete decision making and can suffer from unsafe driving. Moreover, existing feature representations of the environment may not capture the dynamics of traffic flow and thus cannot precisely predict future traffic. To overcome these issues, we propose a DDPG-based DRL framework that learns a continuous time duration for traffic signal phases, introducing 1) a transit phase before the change of the current phase, for better safety, and 2) vehicle moving speed in the feature representation, for more precise estimation of the traffic flow in the next phase. Our preliminary evaluation on the well-known simulator SUMO indicates that our work significantly outperforms a recent work, with a much smaller number of emergency stops, shorter queue lengths, and lower waiting times.
Aiming at better representing multivariate relationships, this paper investigates OFFER, a motif-dimensional framework for higher-order graph learning. The proposed framework aims to accelerate and improve higher-order graph learning by operating at the level of network motifs. Specifically, node and edge degrees are refined in two stages: (1) the motif degrees of nodes are used to refine the adjacency matrix of the network, and (2) the motif degrees of edges are used to refine the transition probability matrix used in the learning process. To assess the efficiency of the proposed framework, four popular network representation algorithms are modified and examined. Both link prediction and clustering results demonstrate that graph representation learning algorithms enhanced with OFFER consistently outperform the original algorithms while being more efficient.
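To illustrate the flavor of motif-degree refinement (a generic triangle-motif sketch of our own, not the actual OFFER procedure), the adjacency matrix can be reweighted by how many motifs each edge participates in, and a transition matrix derived from the reweighted graph:

```python
import numpy as np

def triangle_motif_refine(A):
    """Reweight edges by the number of triangle motifs through them,
    then derive a row-stochastic transition matrix (illustrative).

    A: binary adjacency matrix of an undirected graph (numpy array).
    """
    A = (A > 0).astype(float)
    M = (A @ A) * A                 # M[u, v] = # triangles through edge (u, v)
    A_ref = A + M                   # boost motif-rich edges
    P = A_ref / A_ref.sum(axis=1, keepdims=True)  # transition probabilities
    return A_ref, P
```

A random-walk embedding method run on `P` would then favor edges embedded in many higher-order structures over incidental ones.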
A practical large-scale product recognition system at Alibaba suffers from long-tailed, imbalanced training data in the e-commerce setting. Beyond product images, plenty of related side information (e.g., titles and tags) reveals rich semantic information about the images. Prior works mainly address the long-tail problem from the visual perspective alone and overlook this side information. In this paper, we present a novel side-information-based large-scale visual recognition co-training (SICoT) system that tackles the long-tail problem by leveraging image-related side information. In the proposed co-training system, we first introduce a bilinear word-attention module that constructs a semantic embedding from the noisy side information. A visual feature and semantic embedding co-training scheme is then designed to transfer knowledge from classes with abundant training data (head classes) to classes with few training data (tail classes) in an end-to-end fashion. Extensive experiments on four challenging large-scale datasets, with numbers of classes ranging from one thousand to one million, demonstrate the scalable effectiveness of the proposed SICoT system in alleviating the long-tail problem.
Text classification is one of the most frequent tasks in processing textual data, facilitating, among other things, research on large-scale datasets. Embeddings of different kinds have recently become the de facto standard features for text classification. These embeddings can capture meanings of words inferred from occurrences in large external collections. Because they are built from external collections, however, they are unaware of the distributional characteristics of words in the classification dataset at hand, most importantly the distribution of words across classes in the training data. To make the most of these embeddings as features and to boost the performance of classifiers using them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which assigns higher weights to high-frequency, category-exclusive words when computing word embeddings. Our experiments on eight datasets show the effectiveness of TF-CR, which improves performance over the well-known TF-IDF and KLD weighting schemes, as well as over the absence of a weighting scheme, in most cases.
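One plausible reading of such a scheme (a sketch under our own assumptions, not necessarily the paper's exact formula) multiplies a term-frequency component, the word's relative frequency within a category, by a category-ratio component, the share of the word's corpus occurrences that fall in that category, so that frequent, category-exclusive words score highest:

```python
from collections import Counter

def tf_cr(docs, labels):
    """Per-(word, category) weights: TF (relative frequency of the word
    within the category) times CR (share of the word's total occurrences
    that fall in the category). Illustrative sketch."""
    cat_counts, total = {}, Counter()
    for doc, lab in zip(docs, labels):
        toks = doc.lower().split()
        cat_counts.setdefault(lab, Counter()).update(toks)
        total.update(toks)
    weights = {}
    for lab, cnt in cat_counts.items():
        n_cat = sum(cnt.values())
        for w, f in cnt.items():
            weights[(w, lab)] = (f / n_cat) * (f / total[w])
    return weights
```

These weights would then scale each word's embedding before averaging into a document representation, emphasizing class-discriminative vocabulary.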
Given a large tensor, how can we analyze it efficiently? Multi-dimensional arrays, or tensors, have been widely used to model real-world data. Tensor decomposition plays an important role in analyzing trends and major factors in tensors. While several tensor analysis tools have been developed, they suffer from slow running times and limited scalability due to their heavy computational requirements.
In this paper, we propose Gtensor, a fast and accurate tensor mining library using GPUs. Gtensor is carefully designed to utilize GPUs in addition to CPUs for fast analysis of tensors. It provides versatile tensor algebra tools including tensor decomposition, tensor generation, tensor-tensor operations, and tensor manipulation. We invite our audience to interact with Gtensor and analyze real-world tensor data in the domains of movie recommendation, place recommendation, and academic citation.
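For readers unfamiliar with the core operation, here is a minimal CP decomposition of a 3-way tensor via alternating least squares in plain NumPy (an illustrative sketch of the textbook algorithm, not the Gtensor API; a GPU library would replace these dense matrix products with GPU kernels):

```python
import numpy as np

def cp_als(X, rank, iters=200, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor X by
    alternating least squares. Returns factor matrices A, B, C with
    X[i,j,k] ~= sum_r A[i,r] * B[j,r] * C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    X1 = X.reshape(I, -1)                     # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, -1)  # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, -1)  # mode-3 unfolding

    def khatri_rao(U, V):
        # column-wise Kronecker product, rows ordered to match unfoldings
        return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

    for _ in range(iters):  # solve for one factor at a time, others fixed
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C
```

On an exactly low-rank tensor, the reconstruction `np.einsum('ir,jr,kr->ijk', A, B, C)` recovers the input to high precision; the factor columns correspond to the "major factors" the abstract refers to.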