This paper proposes a new formula protocol for distributed concurrency control, and specifies a staged grid architecture for highly scalable database management systems. The paper also describes novel implementation techniques of Rubato DB based on the proposed protocol and architecture. We have conducted extensive experiments which clearly show that Rubato DB is highly scalable with efficient performance under both TPC-C and YCSB benchmarks. Our paper verifies that the formula protocol and the staged grid architecture provide a satisfactory solution to one of the important challenges in database systems: developing a highly scalable database management system that supports various consistency levels from ACID to BASE.
The popularity of crowdsourcing has recently brought about brand new opportunities for engaging human intelligence in the process of data analysis. Most existing works on crowdsourcing have developed sophisticated methods to utilize the crowd as a new kind of processor, a.k.a. Human Processor Units (HPU). In this paper, we propose a framework, called MaC, to combine the powers of both CPUs and HPUs. In order to build MaC, we need to tackle the following two challenges: (1) HIT Selection: Selecting the "right" HITs (Human Intelligence Tasks) can help reduce the uncertainty significantly so that the results converge quickly. We therefore propose an entropy-based model to evaluate the informativeness of HITs. Furthermore, we find that selecting HITs has factorial complexity and the optimization function is non-linear, so we propose an efficient approximation algorithm with a bounded error. (2) Uncertainty Management: Crowdsourced answers can be inaccurate. To address this issue, we provide effective solutions in three common scenarios of crowdsourcing: (a) the answer and the confidence of each worker are available; (b) the confidence of each worker and the voting score for each HIT are available; (c) only the answer of each worker is available. To verify the effectiveness of the MaC framework, we built a hybrid Machine-Crowd system and tested it on three real-world applications - data fusion, information extraction and pattern recognition. The experimental results verified the effectiveness and the applicability of our framework.
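As a hedged illustration of the entropy-based HIT selection described above (a minimal sketch, not the MaC implementation; the HIT ids and posterior values are hypothetical), one could rank candidate HITs by the Shannon entropy of the machine side's current posterior:

```python
import math

def entropy(p):
    """Shannon entropy of a binary distribution with P(true) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def rank_hits_by_informativeness(posteriors):
    """Return HIT ids sorted by current uncertainty (entropy), highest first.

    `posteriors` maps a HIT id to the machine's current P(label = true).
    Selecting the highest-entropy HITs is one plausible instance of an
    entropy-based selection criterion; the paper's exact objective may differ.
    """
    return sorted(posteriors, key=lambda h: entropy(posteriors[h]), reverse=True)

if __name__ == "__main__":
    posteriors = {"hit-1": 0.95, "hit-2": 0.55, "hit-3": 0.20}
    print(rank_hits_by_informativeness(posteriors))  # ['hit-2', 'hit-3', 'hit-1']
```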
Businesses and large organizations accumulate increasingly large amounts of customer interaction data. Analysis of such data holds great importance for tasks such as strategic planning and orchestration of sales/marketing campaigns. However, discovery and analysis over heterogeneous enterprise data can be challenging. Primary reasons for this are dispersed data repositories, requirements for schema knowledge, and difficulties in using complex user interfaces. As a solution to the above, we propose a TEmplated Search paradigm (TES) for exploring relational data that combines the advantages of keyword search interfaces with the expressive power of question-answering systems. The user starts typing a few keywords and TES proposes data exploration questions in real time. A key aspect of our approach is that the displayed questions are diverse from one another and optimally cover the space of possible questions for a given question-ranking framework. Efficient exact and provably approximate algorithms are presented. We show that the Templated Search paradigm renders the potentially complex underlying data sources intelligible and easily navigable. We support our claims with experimental results on real-world enterprise data.
Keyword search in relational databases has gained popularity due to its ease of use. However, the challenge to return query answers that satisfy users' information need remains. Traditional keyword queries have limited expressive capability and are ambiguous. In this work, we extend keyword queries to enhance their expressive power and describe a semantic approach to process these queries. Our approach considers keywords that match meta-data such as the names of relations and attributes, and utilizes them to provide the context of subsequent keywords in the query. Based on the ORM schema graph which captures the semantics of objects and relationships in the database, we determine the objects and relationships referred to by the keywords in order to infer the search target of the query. Then, we construct a set of minimal connected graphs called query patterns, to represent the user's possible search intentions. Finally, we translate the top-k ranked query patterns into SQL statements in order to retrieve information that the user is interested in. We develop a system prototype called ExpressQ to process the extended keyword queries. Experimental results show that our system is able to generate SQL statements that retrieve user intended information effectively.
We present LECQTER, a tool for generating a 'perfect example' database, called exemplar, for a given conjunctive query. Indeed, exemplars separate the given query from any non-equivalent query. Therefore, LECQTER reduces the query equivalence problem to an evaluation of the queries on the exemplar. LECQTER can thus be used for applications ranging from testing coded conjunctive SQL queries to learning how to write sound conjunctive SQL queries, as it provides immediate feedback about the semantic correctness of a query, and not just the correctness of the query answer on some database as, e.g., other SQL tutoring systems do. This key novelty of LECQTER relies on the bag semantics of SQL since exemplars do not always exist under set semantics. Detailed experiments show that our construction of exemplars is efficient in practice, and that they can separate a number of non-equivalent user queries that is exponential in the size of the exemplar for the target query. We identify natural parameters to control the time and size of the exemplars constructed. Finally, we offer a solution that overcomes the non-existence of exemplars under set semantics.
Information Retrieval systems are traditionally evaluated using the relevance of web pages to individual queries. Other work on IR evaluation has focused on exploring the use of preference judgments over two search result lists. Unlike traditional query-document evaluation, collecting preference judgments over two search result lists takes the context of documents, and hence the interaction between search results, into consideration. Moreover, preference judgments have been shown to produce more accurate results compared to absolute judgments. On the other hand, result list preference judgments have a very high annotation cost. In this work, we investigate how machine-learned models can assist human judges in order to collect reliable result list preference judgments at large scale with a lower judgment cost. We build novel models that can predict user preference automatically. We investigate the effect of different features on the prediction quality. We focus on predicting preferences with high confidence and show that these models can be effectively used to assist human judges, resulting in a significant reduction in annotation cost.
A researcher decides to build a test collection for comparing her new information retrieval (IR) systems with several state-of-the-art baselines. She wants to know the number of topics (n) she needs to create in advance, so that she can start looking for (say) a query log large enough for sampling n good topics, and estimating the relevance assessment cost. We provide practical solutions to researchers like her using power analysis and sample size design techniques, and demonstrate their usefulness for several IR tasks and evaluation measures. We consider not only the paired t-test but also one-way analysis of variance (ANOVA) for significance testing to accommodate comparison of m (≥ 2) systems under a given set of statistical requirements (α: the Type I error rate, β: the Type II error rate, and minD: the minimum detectable difference between the best and the worst systems). Using our simple Excel tools and some pooled variance estimates from past data, researchers can construct statistically well-designed test collections. We demonstrate that, as different evaluation measures have different variances across topics, they inevitably require different topic set sizes. This suggests that the evaluation measures should be chosen at the test collection design phase. Moreover, through a pool depth reduction experiment with past data, we show how the relevance assessment cost can be reduced dramatically while freezing the set of statistical requirements. Based on the cost analysis and the available budget, researchers can determine the right balance between n and the pool depth pd. Our techniques and tools are applicable to test collections for non-IR tasks as well.
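For readers who want a concrete sense of the sample-size calculation being automated here, the following is a hedged sketch using the textbook normal approximation for a two-sided paired comparison; it is not the authors' Excel tool, and the α, β, minD, and variance values in the example are hypothetical:

```python
import math
from scipy.stats import norm

def topic_set_size(alpha, beta, minD, var_hat):
    """Approximate topic count n for a two-sided paired comparison.

    alpha   : Type I error rate
    beta    : Type II error rate (power = 1 - beta)
    minD    : minimum detectable difference in the evaluation measure
    var_hat : pooled variance of per-topic score differences (from past data)
    """
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(1 - beta)
    return math.ceil((z_a + z_b) ** 2 * var_hat / minD ** 2)

# Hypothetical example: alpha = 0.05, beta = 0.20, minD = 0.05, variance = 0.04
print(topic_set_size(0.05, 0.20, 0.05, 0.04))  # -> 126 topics under these assumptions
```

Because the variance estimate differs per evaluation measure, plugging in different var_hat values immediately reproduces the point that different measures require different topic set sizes.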
Evaluation methods for information retrieval systems come in three types: offline evaluation, using static data sets annotated for relevance by human judges; user studies, usually conducted in a lab-based setting; and online evaluation, using implicit signals such as clicks from actual users. For the latter, preferences between rankers are typically inferred from implicit signals via interleaved comparison methods, which combine a pair of rankings and display the result to the user. We propose a new approach to online evaluation called multileaved comparisons that is useful in the prevalent case where designers are interested in the relative performance of more than two rankers. Rather than combining only a pair of rankings, multileaved comparisons combine an arbitrary number of rankings. The resulting user clicks then give feedback about how all these rankings compare to each other. We propose two specific multileaved comparison methods. The first, called team draft multileave, is an extension of team draft interleave. The second, called optimized multileave, is an extension of optimized interleave and is designed to handle cases where a large number of rankers must be multileaved. We present experimental results that demonstrate that both team draft multileave and optimized multileave can accurately determine all pairwise preferences among a set of rankers using far less data than the interleaving methods that they extend.
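A minimal sketch of the team draft multileave idea described above (extending team-draft interleaving to an arbitrary number of rankers) is given below; tie-breaking and credit assignment details are assumptions and may differ from the authors' method:

```python
import random

def team_draft_multileave(rankings, length):
    """Combine an arbitrary number of rankings into one multileaved list.

    rankings: list of ranked document-id lists, one per ranker.
    length:   desired length of the multileaved list.
    Returns (multileaved_list, assignment) where assignment[doc] is the index
    of the ranker whose "team" contributed doc (used to credit clicks).
    """
    multileaved, assignment, placed = [], {}, set()
    while len(multileaved) < length:
        order = list(range(len(rankings)))
        random.shuffle(order)                 # each round, teams pick in random order
        for r in order:
            if len(multileaved) >= length:
                break
            # the ranker's highest-ranked document not yet placed
            pick = next((d for d in rankings[r] if d not in placed), None)
            if pick is None:
                continue
            multileaved.append(pick)
            assignment[pick] = r
            placed.add(pick)
        if all(all(d in placed for d in rk) for rk in rankings):
            break                              # nothing left to place
    return multileaved, assignment

def credit_clicks(clicked_docs, assignment, n_rankers):
    """Credit each click to the ranker that contributed the clicked document."""
    credits = [0] * n_rankers
    for d in clicked_docs:
        if d in assignment:
            credits[assignment[d]] += 1
    return credits
```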
Retrievability provides an alternative way to assess an Information Retrieval (IR) system by measuring how easily documents can be retrieved. Retrievability can also be used to determine the level of retrieval bias a system exerts upon a collection of documents. It has been hypothesised that reducing the retrieval bias will lead to improved performance. To date, it has been shown that this hypothesis does not appear to hold on standard retrieval performance measures (MAP and P@10) when exploring the parameter space of a given retrieval model. However, the evidence is limited and confined to only a few models, collections and measures. In this paper, we perform a comprehensive empirical evaluation analysing the relationship between retrieval bias and retrieval performance using several well known retrieval models, five large TREC test collections and ten performance measures (including the recently proposed PRES, Time Biased Gain (TBG) and U-Measure). For traditional relevance based measures (MAP, P@10, MRR, Recall, etc.) the correlation between retrieval bias and performance is moderate. However, for TBG and U-Measure, we find that there are strong and significant negative correlations between retrieval bias and performance (i.e., as bias drops, performance increases). These findings suggest that for these more sophisticated, user oriented measures the retrievability bias hypothesis tends to hold. The implication is that for these measures, systems can then be tuned using retrieval bias, without recourse to relevance judgements.
In this paper, we study one important source of the mismatch between user data and relevance judgments: the high degree of effort required by users to identify and consume the information in a document. Information retrieval relevance judges are trained to search for evidence of relevance when assessing documents. For complex documents, this can lead to judges spending substantial time considering each document. However, in practice, search users are often much more impatient: if they do not see evidence of relevance quickly, they tend to give up.
Relevance judgments sit at the core of test collection construction, and are assumed to model the utility of documents to real users. However, comparisons of judgments with signals of relevance obtained from real users, such as click counts and dwell time, have demonstrated a systematic mismatch.
Our results demonstrate that the amount of effort required to find the relevant information in a document plays an important role in the utility of that document to a real user. This effort is ignored in the way relevance judgments are currently obtained, despite the expectation that judges inform us about real users. We propose that if the goal is to evaluate the likelihood of utility to the user, effort as well as relevance should be taken into consideration, and possibly characterized independently, when judgments are obtained.
In this paper, we propose a new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents. In order to capture the rich contextual structures in a query or a document, we start with each word within a temporal context window in a word sequence to directly capture contextual features at the word n-gram level. Next, the salient word n-gram features in the word sequence are discovered by the model and are then aggregated to form a sentence-level feature vector. Finally, a non-linear transformation is applied to extract high-level semantic information to generate a continuous vector representation for the full text string. The proposed convolutional latent semantic model (CLSM) is trained on clickthrough data and is evaluated on a Web document ranking task using a large-scale, real-world data set. Results show that the proposed model effectively captures salient semantic information in queries and documents for the task while significantly outperforming previous state-of-the-art semantic models.
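The following toy numpy sketch illustrates the forward pass described in this abstract (context windows, convolution, max pooling, non-linear semantic layer); the letter-trigram word hashing and exact dimensions of the actual CLSM are omitted, and the weight shapes are illustrative assumptions:

```python
import numpy as np

def clsm_forward(word_vectors, W_conv, W_sem, n=3):
    """word_vectors: (seq_len, d) matrix of per-word input vectors.
    W_conv: (n*d, k) convolution weights over each n-word context window.
    W_sem:  (k, L) weights of the final semantic layer.
    Returns an L-dimensional semantic vector for the whole text."""
    seq_len, d = word_vectors.shape
    windows = [word_vectors[i:i + n].reshape(-1) for i in range(seq_len - n + 1)]
    conv = np.tanh(np.stack(windows) @ W_conv)   # contextual features per word n-gram
    pooled = conv.max(axis=0)                    # max pooling keeps the salient features
    return np.tanh(pooled @ W_sem)               # high-level semantic representation
```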
A number of retrieval models incorporating term dependencies have recently been introduced. Most of these modify existing "bag-of-words" retrieval models by including features based on the proximity of pairs of terms (or bi-terms). Although these term dependency models have been shown to be significantly more effective than the bag-of-words models, there have been no previous systematic comparisons between the different approaches that have been proposed. In this paper, we compare the effectiveness of recent bi-term dependency models over a range of TREC collections, for both short (title) and long (description) queries. To ensure the reproducibility of our study, all experiments are performed on widely available TREC collections, and all tuned retrieval model parameters are made public. These comparisons show that the weighted sequential dependence model is at least as effective as, and often significantly better than, any other model across this range of collections and queries. We observe that dependency features are much more valuable in improving the performance of longer queries than for shorter queries. We then examine the effectiveness of dependence models that incorporate proximity features involving more than two terms. The results show that these features can improve effectiveness, but not consistently, over the available data sets.
The sheer volume of scholarly publications available online poses a significant challenge for scholars seeking to retrieve new information and locate candidate reference papers. While classical text retrieval and pseudo relevance feedback (PRF) algorithms can assist scholars in accessing needed publications, in this study, we propose an innovative publication ranking method with PRF by leveraging a number of meta-paths on the heterogeneous bibliographic graph. Different meta-paths on the graph address different ranking hypotheses, whereas the pseudo-relevant papers (from the retrieval results) are used as the seed nodes on the graph. Meanwhile, unlike prior studies, we propose a "restricted meta-path" facilitated by a new context-rich heterogeneous network extracted from full-text publication content along with citation context. By using learning-to-rank, we integrate 18 different meta-path-based ranking features to derive the final ranking scores for candidate cited papers. Experimental results with the ACM full-text corpus show that meta-path-based ranking with PRF on the new graph significantly (p < 0.0001) outperforms text retrieval algorithms with text-based or PageRank-based PRF.
The term weighting and document ranking functions used with informational queries are typically optimized for cases in which queries are short and documents are long. It is reasonable to assume that the presence of a term in a short query reflects some aspect of the topic that is important to the user, and thus rewarding documents that contain the greatest number of distinct query terms is a useful heuristic. Verbose informational queries, such as those that result from cut-and-paste of example text, or that might result from informal spoken interaction, pose a different challenge in which many extraneous (and thus potentially misleading) terms may be present in the query. Modest improvements have been reported from applying supervised methods to learn which terms in a verbose query deserve the greatest emphasis. This paper proposes a novel unsupervised method for weighting terms in verbose informational queries that relies instead on iteratively estimating which terms are most central to the query. The key idea is to use an initial set of retrieval results to define a recursion on the term weight vector that converges to a fixed point representing the vector that optimally describes the initial result set. Experiments with several TREC news and Web test collections indicate that the proposed method often statistically significantly outperforms state of the art supervised methods.
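One plausible instance of the fixed-point recursion sketched in this abstract is a power-iteration-style update on the query term weight vector driven by the initial result set; the sketch below is an illustration under that assumption, not the paper's exact recursion:

```python
import numpy as np

def fixed_point_term_weights(doc_term_matrix, n_iter=100, tol=1e-8):
    """doc_term_matrix: (num_initial_docs, num_query_terms) matrix, e.g. TF-IDF
    weights of each query term in each initially retrieved document.
    Returns a weight per query term describing how central it is to the result set."""
    A = doc_term_matrix.T @ doc_term_matrix        # term-term co-occurrence in the results
    w = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(n_iter):
        w_new = A @ w
        w_new /= np.linalg.norm(w_new) + 1e-12     # keep the vector normalized
        if np.linalg.norm(w_new - w) < tol:        # fixed point reached
            break
        w = w_new
    return w
```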
Question retrieval aims to increase the accessibility of the community Question Answer (cQA) archives and has attracted increasing research interest recently. In this paper, we present a novel method for improving the question retrieval performance by investigating question term selection and weighting as well as reranking of results. Different from previous work, we propose a hierarchical question classification method with a sparse regularization to mimic users' question labeling in cQAs. Based on the hierarchical classification, we explore the local context of the question for term selection and result reranking, and then integrate them into our proposed general question retrieval framework. The experimental results on a Yahoo! Answers dataset show the effectiveness of our method as compared to existing general question retrieval models and some state-of-the-art methods of utilizing category information for question retrieval.
Temporal analysis of dynamic networks has become a popular research topic as more and more data emerges over time. In this paper we investigate the problem of detecting and tracking the variational communities within a given time period. We first define a metric to measure the strength of a community, called the normalized temporal community strength, and then propose our analysis framework. A community may evolve over time, either splitting into multiple communities or merging with others. We address the problem of evolutionary clustering with a requirement on temporal smoothness and propose a revised soft clustering method based on non-negative matrix factorization. Then we use a clustering matching method to find the soft correspondence between different community distribution structures. This matching establishes the connection between consecutive snapshots. To estimate the variational rate and meanwhile address smoothness during continuous evolution, we propose an objective function that combines the conformity of the current variation and the historical variational trend. In addition, we integrate weights into the objective function to identify temporal outliers. An iterative coordinate descent method is proposed to solve the optimization framework. We extensively evaluate our method with a synthetic dataset and several real datasets. The experimental results demonstrate the effectiveness of our method, which greatly outperforms the baselines in detecting communities with significant variation over time.
Social networks have already emerged as inconceivably vast information repositories and have provided great opportunities for social connection and information diffusion. In light of these notable outcomes, social prediction is a critical research goal for analyzing and understanding social media and online social networks. We investigate underlying social theories that drive the characteristics and dynamics of social networks, including homophily, heterophily, and the structural hole theories. We propose a unified coherent framework, namely mutual latent random graphs (MLRGs), to exploit mutual interactions and benefits for predicting social actions (e.g., users' behaviors, opinions, preferences or interests) and discovering social ties (e.g., multiple labeled relationships between users) simultaneously in large-scale social networks. MLRGs introduce latent (hidden) factors and coupled models over users, users' actions and users' ties to flexibly encode evidence from both sources. We propose an approximate optimization algorithm to learn the model parameters efficiently. Furthermore, we speed up this algorithm based on the Hadoop MapReduce framework to handle large-scale social networks. We performed experiments on two real-world social networking datasets to demonstrate the validity and competitiveness of our approach.
Information diffusion in social networks is emerging as a promising solution to successful viral marketing, which relies on the effective and efficient identification of a set of nodes with the maximal social influence. While there are tremendous efforts on the development of social influence models and algorithms for social influence maximization, limited progress has been made in terms of designing both efficient and effective algorithms for finding a set of nodes with the maximal social influence. To this end, in this paper, we provide a bounded linear approach for influence computation and influence maximization. Specifically, we first adopt a linear and tractable approach to describe the influence propagation. Then, we develop a quantitative metric, named Group-PageRank, to quickly estimate the upper bound of the social influence based on this linear approach. More importantly, we provide two algorithms Linear and Bound, which exploit the linear approach and Group-PageRank for social influence maximization. Finally, extensive experimental results demonstrate that (a) the adopted linear approach has a close relationship with traditional models and Group-PageRank provides a good estimation of social influence; (b) Linear and Bound can quickly find a set of the most influential nodes and both of them are scalable for large-scale social networks.
Trust plays a crucial role in helping users collect reliable information in an online world, and has attracted more and more attention in research communities lately. As a conceptual counterpart of trust, distrust can be as important as trust. However, distrust is rarely studied in social media because distrust information is usually unavailable. The value of distrust has been widely recognized in social sciences and recent work shows that distrust can benefit various online applications in social media. In this work, we investigate whether we can obtain distrust information via learning when it is not directly available, and propose to study a novel problem - predicting distrust using pervasively available interaction data in an online world. In particular, we analyze interaction data and provide a principled way to mathematically incorporate interaction data in a novel framework, dTrust, to predict distrust information. Experimental results using real-world data show that distrust information is predictable with interaction data by the proposed framework dTrust. Further experiments are conducted to gain a deeper understanding of which factors contribute to the effectiveness of the proposed framework.
Multi-matrix factorization models provide a scalable and effective approach for multi-relational learning tasks such as link prediction, Linked Open Data (LOD) mining, recommender systems and social network analysis. Such models are learned by optimizing the sum of the losses on all relations in the data. Early models address the problem where there is only one target relation for which predictions should be made. More recent models address the multi-target variant of the problem and use the same set of parameters to make predictions for all target relations. In this paper, we argue that a model optimized for each target relation individually has better predictive performance than models optimized for a compromise on the performance on all target relations. We introduce specific parameters for each target but, instead of learning them independently from each other, we couple them through a set of shared auxiliary parameters, which has a regularizing effect on the target specific ones. Experiments on large Web datasets derived from DBpedia, Wikipedia and BlogCatalog show the performance improvement obtained by using target specific parameters and that our approach outperforms competitive state-of-the-art methods while being able to scale gracefully to big data.
Label propagation is a well-explored family of methods for training a semi-supervised classifier where input data points (both labeled and unlabeled) are connected in the form of a weighted graph. For binary classification, the performance of these methods starts degrading considerably whenever the input dataset exhibits the following characteristics: (i) one of the class labels is rare, or equivalently, class imbalance (CI) is very high, and (ii) the degree of supervision (DoS), defined as the fraction of labeled points, is very low. These characteristics are common in many real-world datasets relating to network fraud detection. Moreover, in such applications, the amount of class imbalance is not known a priori. In this paper, we have proposed and justified the use of an alternative formulation for graph label propagation under such extreme behavior of the datasets. In our formulation, the objective function is the difference of two convex quadratic functions and the constraints are box constraints. We solve this program using the Concave-Convex Procedure (CCCP). Whenever the problem size becomes too large, we suggest working with a k-NN subgraph of the given graph, which can be sampled by using the Locality Sensitive Hashing (LSH) technique. We have also discussed various issues that one typically faces while sampling such a k-NN subgraph in practice. Further, we have proposed a novel label flipping method on top of the CCCP solution, which improves the result of CCCP further whenever class imbalance information is made available a priori. Our method can be easily adopted for a MapReduce platform, such as Hadoop. We have conducted experiments on 11 datasets comprising graph sizes of up to 20K nodes, CI as high as 99.6%, and DoS as low as 0.5%. Our method has resulted in up to a 19.5-times improvement in F-measure and up to a 17.5-times improvement in AUC-PR against baseline methods.
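The optimization step described above (a difference of convex quadratics under box constraints, solved with CCCP) can be illustrated generically as follows; the specific matrices G and H would come from the label-propagation objective and are assumed here to be given:

```python
import numpy as np
from scipy.optimize import minimize

def cccp_box(G, g0, H, h0, lower, upper, x0, n_iter=50, tol=1e-6):
    """Minimize 0.5 x'Gx + g0'x - (0.5 x'Hx + h0'x) s.t. lower <= x <= upper,
    where G and H are positive semi-definite (so both quadratics are convex)."""
    x = x0.copy()
    bounds = list(zip(lower, upper))
    for _ in range(n_iter):
        grad_h = H @ x + h0                          # linearize the concave part at x

        def surrogate(z):
            return 0.5 * z @ G @ z + g0 @ z - grad_h @ z

        def surrogate_grad(z):
            return G @ z + g0 - grad_h

        # convex QP surrogate with box constraints
        res = minimize(surrogate, x, jac=surrogate_grad, bounds=bounds,
                       method="L-BFGS-B")
        if np.linalg.norm(res.x - x) < tol:
            x = res.x
            break
        x = res.x
    return x
```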
We propose a new probabilistic approach for multi-label classification that aims to represent the class posterior distribution P(Y|X). Our approach uses a mixture of tree-structured Bayesian networks, which can leverage the computational advantages of conditional tree-structured models and the abilities of mixtures to compensate for tree-structured restrictions. We develop algorithms for learning the model from data and for performing multi-label predictions using the learned model. Experiments on multiple datasets demonstrate that our approach outperforms several state-of-the-art multi-label classification methods.
We present a new methodology for solving linear Support Vector Machines (SVMs) that capitalizes on multiple 1D projections. We show that the approach approximates the optimal solution with high accuracy and comes with analytical guarantees. Our solution builds on methodologies from random projections, exponential search, and coordinate descent. In our experimental evaluation, we compare our approach with the popular liblinear SVM library. We demonstrate a significant speedup on various benchmarks. At the same time, the new methodology provides a comparable or better approximation factor of the optimal solution and exhibits smooth convergence properties. Our results are accompanied by bounds on the time complexity and accuracy.
Many classification algorithms have been successfully deployed in security-sensitive applications including spam filters and intrusion detection systems. Under such adversarial environments, adversaries can generate exploratory attacks against the defender such as evasion and reverse engineering. In this paper, we discuss why reverse engineering attacks can be carried out quite efficiently against fixed classifiers, and investigate the use of randomization as a suitable strategy for mitigating their risk. In particular, we derive a semidefinite programming (SDP) formulation for learning a distribution of classifiers subject to the constraint that any single classifier picked at random from such distribution provides reliable predictions with a high probability. We analyze the tradeoff between variance of the distribution and its predictive accuracy, and establish that one can almost always incorporate randomization with large variance without incurring a loss in accuracy. In other words, the conventional approach of using a fixed classifier in adversarial environments is generally Pareto suboptimal. Finally, we validate such conclusions on both synthetic and real-world classification problems.
Data with time-to-event outcomes can be modelled using survival regression methods, which can predict these outcomes in censored-data applications in diverse fields such as engineering, economics and healthcare. Predictive models are built by inferring from the censored variable in time-to-event data, which differentiates them from other regression methods. Censoring is represented as a binary indicator variable, and machine learning methods have been tuned to account for the censored attribute. Active learning from censored data using survival regression methods can make the model query a domain expert for the time-to-event label of the sampled instances. This offers significant advantages in the healthcare domain, where a domain expert can interactively refine the model with their feedback. With this motivation, we address this problem by providing an active learning based survival model which uses a novel model-discriminative gradient based sampling scheme. We evaluate this framework on electronic health records (EHR), publicly available survival and synthetic censored datasets of varying diversity. Experimental evaluation against state-of-the-art survival regression methods indicates the higher discriminative ability of the proposed approach. We also present the sampling results for the proposed approach in an active learning setting, which indicate better learning rates in comparison to other sampling strategies.
Most collaborative filtering (CF) algorithms only make use of the rating scores given by users for items. However, it is often the case that each rating score is associated with a piece of review text. Such review texts, which are capable of providing us with valuable information revealing why users give a certain rating, have not been exploited and are usually ignored by most CF algorithms. Moreover, the underlying relationship buried in users and items has not been fully exploited. Items we would recommend can often be characterized into hidden groups (e.g., comedy, horror, and action movies), and users can also be organized as hidden communities. We propose a new generative model to predict users' ratings on previously unrated items by considering review texts as well as the hidden user community and item group relationships. Regarding the rating scores, traditional algorithms would not perform well in uncovering the community and group information of each user and each item, since the user-item rating matrix is dyadic, involving the mutual interactions between users and items. Instead, co-clustering, which is capable of conducting simultaneous clustering of two variables, is able to take advantage of such user-item relationships to better predict the rating scores. Additionally, co-clustering would be more effective for modeling the generation of review texts since different user communities would discuss different topics and vary their own wordings or expression patterns when dealing with different item groups. Besides, by modeling a mixed membership over communities and groups respectively, each user or item can belong to multiple communities or groups with varying degrees. We have conducted extensive experiments to predict the missing rating scores on 22 real-world datasets. The experimental results demonstrate the superior performance of our proposed model compared with state-of-the-art methods.
Recommending products to users means estimating their preferences for certain items over others. This can be cast either as a problem of estimating the rating that each user will give to each item, or as a problem of estimating users' relative preferences in the form of a ranking. Although collaborative-filtering approaches can be used to identify users who rate and rank products similarly, another source of data that informs us about users' preferences is their set of social connections. Both rating- and ranking-based paradigms are important in real-world recommendation settings, though rankings are especially important in settings where explicit feedback in the form of a numerical rating may not be available. Although many existing works have studied how social connections can be used to build better models for rating prediction, few have used social connections as a means to derive more accurate ranking-based models. Using social connections to better estimate users' rankings of products is the task we consider in this paper. We develop a model, SBPR (Social Bayesian Personalized Ranking), based on the simple observation that users tend to assign higher ranks to items that their friends prefer. We perform experiments on four real-world recommendation data sets, and show that SBPR outperforms alternatives in ranking prediction both in warm- and cold-start settings.
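To make the ranking assumption concrete, the sketch below shows one stochastic gradient ascent step for an SBPR-style objective in which a user's positively rated item should outrank a socially endorsed item, which in turn should outrank an unobserved item; the exact form of the social coefficient s_uk and the hyperparameters are assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sbpr_sgd_step(U, V, u, i, k, j, s_uk, lr=0.05, reg=0.01):
    """One ascent step on user/item latent factors.
    U: (num_users, d) user factors; V: (num_items, d) item factors.
    u: user; i: positively rated item; k: item preferred by u's friends; j: unobserved item.
    """
    x_ui, x_uk, x_uj = U[u] @ V[i], U[u] @ V[k], U[u] @ V[j]
    # two pairwise terms: positive vs social (damped by s_uk), and social vs negative
    d1 = (x_ui - x_uk) / (1.0 + s_uk)
    d2 = x_uk - x_uj
    c1, c2 = 1.0 - sigmoid(d1), 1.0 - sigmoid(d2)
    grad_u = c1 * (V[i] - V[k]) / (1.0 + s_uk) + c2 * (V[k] - V[j]) - reg * U[u]
    grad_i = c1 * U[u] / (1.0 + s_uk) - reg * V[i]
    grad_k = (-c1 / (1.0 + s_uk) + c2) * U[u] - reg * V[k]
    grad_j = -c2 * U[u] - reg * V[j]
    U[u] += lr * grad_u
    V[i] += lr * grad_i
    V[k] += lr * grad_k
    V[j] += lr * grad_j
```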
Context-aware recommender systems (CARS) help improve the effectiveness of recommendations by adapting to users' preferences in different contextual situations. One approach to CARS that has been shown to be particularly effective is Context-Aware Matrix Factorization (CAMF). CAMF incorporates contextual dependencies into the standard matrix factorization (MF) process, where users and items are represented as collections of weights over various latent factors. In this paper, we introduce another CARS approach based on an extension of matrix factorization, namely, the Sparse Linear Method (SLIM). We develop a family of deviation-based contextual SLIM (CSLIM) recommendation algorithms by learning rating deviations in different contextual conditions. Our CSLIM approach is better at explaining the underlying reasons behind contextual recommendations, and our experimental evaluations over five context-aware data sets demonstrate that these CSLIM algorithms outperform the state-of-the-art CARS algorithms in the top-N recommendation task. We also discuss the criteria for selecting the appropriate CSLIM algorithm in advance based on the underlying characteristics of the data.
Recent years have witnessed an increasing interest in how to incorporate social network information into recommendation algorithms to enhance the user experience. In this paper, we observe that users in the contexts of a recommendation system and a social network do not share the same interest space. Based on this finding, we propose the social regulatory factor regression model (SRFRM), which connects the different interest spaces of different contexts in a unified latent factor model. Specifically, unlike traditional social-based latent factor models, which impose the strong limitation that all sides share the same feature space, the proposed method adjusts the number of regulatory factors on both sides to reflect the fact that users and items, or users in different contexts, may not share the same interest space. It works by incorporating two linear transformation matrices into a matrix co-factorization framework, in which the matrix factorization of user ratings is regularized by that of the social trust network. We study large subsets of data from epinions.com and douban.com. The experimental results indicate that users in different contexts have different interest spaces and that our model achieves higher performance compared with related state-of-the-art methods.
Rich contextual information is typically available in many recommendation domains, allowing recommender systems to model the subtle effects of context on preferences. Most contextual models assume that the context shares the same latent space with the users and items. In this work we propose CARS2, a novel approach for learning context-aware representations for context-aware recommendations. We show that the context-aware representations can be learned using an appropriate model that aims to represent the type of interactions between context variables, users and items. We adapt the CARS2 algorithms to explicit feedback data by using a quadratic loss function for rating prediction, and to implicit feedback data by using pairwise and listwise ranking loss functions for top-N recommendations. By using stochastic gradient descent for parameter estimation we ensure scalability. Experimental evaluation shows that our CARS2 models achieve competitive recommendation performance, compared to several state-of-the-art approaches.
The automatic summarization of long-running events from news streams is a challenging problem. A long-running event can contain hundreds of unique 'nuggets' of information to summarize, spread out over its lifetime. Meanwhile, information reported about it can rapidly become outdated and is often highly redundant. Incremental update summarization (IUS) aims to select sentences from news streams to issue as updates to the user, summarising that event over time. The updates issued should cover all of the key nuggets concisely and before the information contained in those nuggets becomes outdated. Prior summarization approaches can fail when applied to IUS, since they define a fixed summary length that cannot effectively account for the different magnitudes and varying rates of development of such events. In this paper, we propose a novel IUS approach that adaptively alters the volume of content issued as updates over time with respect to the prevalence and novelty of discussions about the event. It incorporates existing state-of-the-art summarization techniques to rank candidate sentences, followed by a supervised regression model that balances novelty, nugget coverage and timeliness when selecting sentences from the top ranks. We empirically evaluate our approach using the TREC 2013 Temporal Summarization dataset extended with additional assessments. Our results show that by adaptively adjusting the number of sentences to select over time, our approach can nearly double the performance of effective summarization baselines.
User generated contents (UGCs) from various social media sites give analysts the opportunity to obtain a comprehensive and dynamic view of any topic from multiple heterogeneous information sources. Summarization provides a promising means of distilling an overview of the targeted topic by aggregating and condensing the related UGCs. However, the massive volume, uneven quality, and dynamics of UGCs pose new challenges that are not addressed by existing multi-document summarization techniques. In this paper, we introduce the timely task of dynamic structural and textual summarization. We generate a topic hierarchy from the UGCs as a high-level overview and structural guide for exploring and organizing the content. To capture the evolution of events in the content, we propose a unified dynamic reconstruction approach to detect the update points and generate the time-sequence textual summary. To enhance the expressiveness of the reconstruction space, we further use the topic hierarchy to organize the UGCs and the hierarchical subtopics to augment the sentence representation. Experimental comparison with state-of-the-art summarization models on a multi-source UGC dataset shows the superiority of our proposed methods. Moreover, we conducted a user study on our usability enhancement measures. It suggests that by disclosing some meta information of the summary generation process in the proposed framework, the time-sequence textual summaries can pair with the structural overview of the topic hierarchy to achieve interpretable and verifiable summarization.
For many applications measuring the similarity between documents is essential. However, little is known about how users perceive similarity between documents. This paper presents the first large-scale empirical study that investigates perception of narrative similarity using crowdsourcing. As a dataset we use a large collection of Dutch folk narratives. We study the perception of narrative similarity by both experts and non-experts by analyzing their similarity ratings and motivations for these ratings. While experts focus mostly on the plot, characters and themes of narratives, non-experts also pay attention to dimensions such as genre and style. Our results show that a more nuanced view is needed of narrative similarity than captured by story types, a concept used by scholars to group similar folk narratives. We also evaluate to what extent unsupervised and supervised models correspond with how humans perceive narrative similarity.
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved an F1 score of 17%. In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams. We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results from the CoNLL-2013 test collection show that our approach to preposition correction achieves an F1 score of ∼30%, a 13% absolute improvement over the best-performing approach at that challenge.
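As a small, hedged example of the n-gram association feature mentioned above, the following computes pointwise mutual information between a governing context word and candidate prepositions from (assumed) corpus n-gram counts and ranks the candidates accordingly:

```python
import math

def pmi(count_word_prep, count_word, count_prep, total):
    """PMI(word, prep) = log P(word, prep) / (P(word) * P(prep))."""
    p_joint = count_word_prep / total
    p_word = count_word / total
    p_prep = count_prep / total
    return math.log(p_joint / (p_word * p_prep))

def rank_prepositions(context_word, candidates, counts, total):
    """Rank candidate prepositions for a slot governed by `context_word`.

    counts[(w, p)], counts[w], and counts[p] are assumed corpus frequencies."""
    scored = []
    for prep in candidates:
        joint = counts.get((context_word, prep), 0)
        if joint == 0:
            continue                      # unseen pair: no PMI evidence
        scored.append((prep, pmi(joint, counts[context_word], counts[prep], total)))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```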
Over the past few years, community QA websites (e.g. Yahoo! Answers) have become a useful platform for users to post questions and obtain answers. However, not all questions posted there receive informative answers or are answered in a timely manner. In this paper, we show that the answers to some of these questions are available in online domain-specific knowledge bases and propose an approach to automatically discover those answers. In the proposed approach, we first mine appropriate SQL query patterns by leveraging an existing collection of QA pairs, and then use the learned query patterns to answer previously unseen questions by returning relevant entities from the knowledge base. Evaluation on a collection of health domain questions from Yahoo! Answers shows that the proposed method is effective in discovering potential answers to user questions from an online medical knowledge base.
Query term weighting is a fundamental task in information retrieval and most popular term weighting schemes are primarily based on statistical analysis of term occurrences within the document collection. In this work we study how term weighting may benefit from syntactic analysis of the corpus. Focusing on community question answering (CQA) sites, we take into account the syntactic function of the terms within CQA texts as an important factor affecting their relative importance for retrieval. We analyze a large log of web queries that landed on the Yahoo Answers site, showing a strong deviation between the tendencies of different document words to appear in a landing (click-through) query given their syntactic function. To this end, we propose a novel term weighting method that makes use of the syntactic information available for each query term occurrence in the document, on top of term occurrence statistics. The relative importance of each feature is learned via a learning-to-rank algorithm that utilizes a click-through query log. We examine the new weighting scheme using manual evaluation based on editorial data and using automatic evaluation over the query log. Our experimental results show consistent improvement in retrieval when syntactic information is taken into account.
Semantically searching and navigating products (e.g., on Taobao.com or Amazon.com) with professional metadata and user-generated content from social media is a hot topic in information retrieval and recommender systems, yet most existing methods are designed as purely search systems. In this paper, taking Social Book Search as an example, we propose a general search-recommendation hybrid system for this topic. Firstly, we propose a Generalized Content-Based Filtering (GCF) model. In this model, a preference value, which flexibly ranges from 0 to 1, is defined to describe a user's preference for each item to be recommended, rather than the conventional use of a set of preferred items. We also design a weighting formulation for the measure of recommendation. Next, assuming that the query in a search system acts as a user in a recommendation system, a general reranking model is constructed with GCF to rerank the initial result list by utilizing a variety of rich social information. Afterwards, we propose a general search-recommendation hybrid framework for Social Book Search, where learning-to-rank is used to adaptively combine all reranking results. Finally, our proposed system is extensively evaluated on the INEX 2012 and 2013 Social Book Search datasets, and has the best performance (NDCG@10) on both datasets compared to other state-of-the-art systems. Moreover, our system recently won the INEX 2014 Social Book Search Evaluation.
This paper studies the problem of question retrieval in community question answering (CQA). To bridge lexical gaps in questions, which is regarded as the biggest challenge in retrieval, state-of-the-art methods learn translation models using answers under the assumption that they are parallel texts. In practice, however, questions and answers are far from "parallel". Indeed, they are heterogeneous both at the literal level and in terms of user behaviors. There is a particularly large number of low-quality answers, to which the performance of translation models is vulnerable. To address these problems, we propose a supervised question-answer topic modeling approach. The approach assumes that questions and answers share some common latent topics and are generated in a "question language" and an "answer language" respectively, following the topics. The topics also determine an answer quality signal. Compared with translation models, our approach not only comprehensively models user behaviors on CQA portals, but also highlights the intrinsic heterogeneity of questions and answers. More importantly, it takes answer quality into account and performs robustly against noise in answers. With the topic modeling approach, we propose a topic-based language model, which matches questions not only on a term level but also on a topic level. We conducted experiments on large-scale data from Yahoo! Answers and Baidu Knows. Experimental results show that the proposed model can significantly outperform state-of-the-art retrieval models in CQA.
People have multiple accounts on Online Social Networks (OSNs) for various purposes. It is of great interest for third parties to collect more information about users by linking their accounts on different OSNs. Unfortunately, most users are not aware of the potential risks of such account linkage. Therefore, the design of a control methodology that allows users to share their information without the risk of being linked has become an urgent need, yet still remains open.
In this paper, we first aim to raise users' awareness by presenting an effective User Accounts Linkage Inference (UALI) approach, which is shown to be more powerful than existing methods. To help users control the risks of UALI, we next propose the first Information Control Mechanism (ICM), in which users' information remains visible as intended while the risk of account linkage is controlled. Using real-world datasets, the performance of ICM is validated, and we also show that it works well for various linkage inference approaches. Both UALI and ICM are designed to take generic inputs, allowing them to be widely applied to many practical social services.
Personal social networks are considered one of the most influential sources in shaping a customer's attitudes and behaviors. However, the interactions with friends or colleagues in social networks of individual customers are barely observable in most e-commerce companies. In this paper, we study the problem of customer identification in social networks, i.e., connecting customer accounts at e-commerce sites to the corresponding user accounts in online social networks such as Twitter. Identifying customers in social networks is a crucial prerequisite for many potential marketing applications. These applications, for example, include personalized product recommendation based on social correlations, discovering communities of customers, and maximizing product adoption and profits over social networks.
We introduce a methodology CSI (Customer-Social Identification) for identifying customers in online social networks effectively by using the basic information of customers, such as username and purchase history. It consists of two key phases. The first phase constructs the features across networks that can be used to compare the similarity between pairs of accounts across networks with different schema (e.g. an e-commerce company and an online social network). The second phase identifies the top-K maximum similar and stable matched pairs of accounts across partially aligned networks. Extensive experiments on real-world datasets show that our CSI model consistently outperforms other commonly-used baselines on customer identification.
Many social networks are characterized by actors (nodes) holding quantitative opinions about movies, songs, sports, people, colleges, politicians, and so on. These opinions are influenced by network neighbors. Many models have been proposed for such opinion dynamics, but they have some limitations. Most consider the strength of edge influence as fixed. Some model a discrete decision or action on part of each actor, and an edge as causing an "infection" (that is often permanent or self-resolving). Others model edge influence as a stochastic matrix to reuse the mathematics of eigensystems. Actors' opinions are usually observed globally and synchronously. Analysis usually skirts transient effects and focuses on steady-state behavior. There is very little direct experimental validation of estimated influence models. Here we initiate an investigation into new models that seek to remove these limitations. Our main goal is to estimate, not assume, edge influence strengths from an observed series of opinion values at nodes. We adopt a linear (but not stochastic) influence model. We make no assumptions about system stability or convergence. Further, actors' opinions may be observed in an asynchronous and incomplete fashion, after missing several time steps when an actor changed its opinion based on neighbors' influence. We present novel algorithms to estimate edge influence strengths while tackling these aggressively realistic assumptions. Experiments with Reddit, Twitter, and three social games we conducted on volunteers establish the promise of our algorithms. Our opinion estimation errors are dramatically smaller than strong baselines like the DeGroot, flocking, voter, and biased voter models. Our experiments also lend qualitative insights into asynchronous opinion updates and aggregation.
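A toy version of the estimation idea (fit each node's incoming edge weights by regularized least squares from observed opinion updates, under a linear influence model) is sketched below; the paper's handling of asynchronous and incomplete observations is omitted, so this is only an illustration:

```python
import numpy as np

def estimate_incoming_weights(neighbor_opinions, node_opinions, ridge=1e-3):
    """neighbor_opinions: (T, k) opinions of the node's k in-neighbors at the times
    just before each observed update of the node.
    node_opinions: (T,) the node's observed opinions right after those updates.
    Returns the estimated influence strength of each incoming edge."""
    X, y = neighbor_opinions, node_opinions
    k = X.shape[1]
    # ridge-regularized least squares: (X'X + lambda*I)^-1 X'y
    return np.linalg.solve(X.T @ X + ridge * np.eye(k), X.T @ y)
```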
Online gaming is one of the largest industries on the Internet, generating tens of billions of dollars in revenues annually. One core problem in online games is to find and convert free users into paying customers, which is of great importance for the sustainable development of almost all online games. Although much research has been conducted, several challenges remain largely unsolved: What are the fundamental factors that trigger users to pay? How does users' paying behavior influence each other in the game social network? How can we design a prediction model to recognize those potential users who are likely to pay? In this paper, employing two large online games as the basis, we study how a user becomes a new paying user in the games. In particular, we examine how users' paying behavior influences each other in the game social network. We study this problem from various sociological perspectives including strong/weak ties, social structural diversity and social influence. Based on the discovered patterns, we propose a learning framework to predict potential new payers. The framework can learn a model using features associated with users and then use the social relationships between users to refine the learned model. We test the proposed framework using nearly 50 billion user activities from two real games. Our experiments show that the proposed framework significantly improves the prediction accuracy by 3-11% compared to several alternative methods. The study also unveils several intriguing social phenomena from the data. For example, influence indeed exists among users for the paying behavior. The likelihood of a user becoming a new paying user is 5 times higher than chance when he has 5 paying neighbors of strong tie. We have deployed the proposed algorithm in the game, and the Lift_Ratio has improved by up to 196% compared to the prior strategy.
Semi-supervised learning is an essential approach to classification when the available labeled data is insufficient and we need to also make use of unlabeled data in the learning process. Numerous research efforts have focused on designing algorithms to improve the F1 score, but they lack any mechanism to control precision or recall individually. However, many applications have precision/recall preferences. For instance, an email spam classifier requires a precision of 0.9 to mitigate the false dismissal of useful emails. In this paper, we propose a method that allows users to specify a precision/recall preference while maximising the F1 score. Our key idea is to divide the semi-supervised learning process into multiple rounds of supervised learning, where the classifier learned at each round is calibrated using a subset of the labeled dataset before we use it on the unlabeled dataset for enlarging the training dataset. Our idea is applicable to a number of learning models such as Support Vector Machines (SVMs), Bayesian networks and neural networks. We focus our research and the implementation of our idea on SVMs. We conduct extensive experiments to validate the effectiveness of our method. The experimental results show that our method can train classifiers with a precision/recall preference, while the popular semi-supervised SVM training algorithm (which we use as the baseline) cannot. When we specify the precision preference and the recall preference to be the same, which amounts to maximising the F1 score only, as the baseline does, our method achieves better or similar F1 scores to the baseline. An additional advantage of our method is that it converges much faster than the baseline.
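A minimal sketch of the round-based scheme described above, assuming a linear SVM and scikit-learn, is given below: each round calibrates a decision threshold on a held-out labeled subset so the precision preference is met, then moves confidently labeled unlabeled points into the training set. It illustrates the scheme, not the authors' exact algorithm, and the hyperparameters are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def calibrated_self_training(X_train, y_train, X_calib, y_calib, X_unlab,
                             min_precision=0.9, rounds=5, top_k=100):
    X_tr, y_tr, X_un = X_train.copy(), y_train.copy(), X_unlab.copy()
    clf, threshold = None, 0.0
    for _ in range(rounds):
        clf = LinearSVC().fit(X_tr, y_tr)
        # pick the smallest threshold whose precision on the calibration set
        # still satisfies the preference
        scores = clf.decision_function(X_calib)
        threshold = 0.0
        for t in sorted(scores):
            pred = scores >= t
            if pred.sum() == 0:
                break
            if (y_calib[pred] == 1).mean() >= min_precision:
                threshold = t
                break
        if len(X_un) == 0:
            break
        # move the most confident unlabeled points into the training set
        u_scores = clf.decision_function(X_un)
        conf_order = np.argsort(-np.abs(u_scores))[:top_k]
        pseudo = (u_scores[conf_order] >= threshold).astype(int)
        X_tr = np.vstack([X_tr, X_un[conf_order]])
        y_tr = np.concatenate([y_tr, pseudo])
        X_un = np.delete(X_un, conf_order, axis=0)
    return clf, threshold
```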
With the advance of the Internet, multi-modal data can be easily collected from many social websites such as Wikipedia, Flickr, YouTube, etc. Images shared on the web are usually associated with social tags or other textual information. Although existing multi-modal methods can make use of associated text to improve image annotation, their disadvantage is that associated text is also required for a new image to be annotated. In this paper, we propose the cross-modal multi-task learning (CMMTL) framework for image annotation. Labeled and unlabeled multi-modal data are both leveraged for training in CMMTL, which finally obtains visual classifiers that can predict concepts for a single image without any associated information. CMMTL integrates graph learning, multi-task learning and cross-modal learning into a joint framework, where a shared subspace is learned to preserve both cross-modal correlation and concept correlation. The optimal solution of the proposed framework can be obtained by solving a generalized eigenvalue problem. We conduct comprehensive experiments on two real-world image datasets, MIR Flickr and NUS-WIDE, to evaluate the performance of the proposed framework. Experimental results demonstrate that CMMTL obtains a significant improvement over several representative methods for cross-modal image annotation.
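As a concrete illustration of the kind of closed-form solver mentioned above, the sketch below sets up a generic generalized eigenvalue problem A v = λ B v with SciPy and keeps the leading eigenvectors as a shared subspace; the matrices are synthetic stand-ins, not the actual CMMTL objective.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T                       # symmetric "objective" matrix (synthetic stand-in)
N = rng.standard_normal((50, 50))
B = N @ N.T + 50 * np.eye(50)     # symmetric positive-definite "constraint" matrix

# eigh(A, B) solves the generalized problem A v = lambda B v, eigenvalues ascending;
# the leading generalized eigenvectors would span the shared subspace.
eigvals, eigvecs = eigh(A, B)
shared_subspace = eigvecs[:, -10:]   # keep the 10 largest-eigenvalue directions
```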
Multi-task multi-view learning deals with learning scenarios where multiple tasks are associated with each other through multiple shared feature views. All previous works for this problem assume that the tasks use the same set of class labels. However, in the real world there exist quite a few applications where the tasks with several views correspond to different sets of class labels. In this study, we call this new learning scenario Multi-task Multi-view Learning for Heterogeneous Tasks, and we propose a Multi-tAsk MUlti-view Discriminant Analysis (MAMUDA) method to solve this problem. Specifically, this method collaboratively learns the feature transformations for different views in different tasks by exploring the shared task-specific and problem-intrinsic structures. Additionally, MAMUDA can conveniently handle multi-class classification problems. Finally, experiments on two real-world problems demonstrate the effectiveness of MAMUDA for heterogeneous tasks.
Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously. While sometimes the underlying task relationship structure is known, often the structure needs to be estimated from the data at hand. In this paper, we present a novel family of models for MTL, applicable to regression and classification problems, capable of learning the structure of task relationships. In particular, we consider a joint estimation problem of the task relationship structure and the individual task parameters, which is solved using alternating minimization. The task relationship structure learning component builds on recent advances in structure learning of Gaussian graphical models based on sparse estimators of the precision (inverse covariance) matrix. We illustrate the effectiveness of the proposed model on a variety of synthetic and benchmark datasets for regression and classification. We also consider the problem of combining climate model outputs for better projections of future climate, with a focus on temperature in South America, and show that the proposed model outperforms several existing methods for the problem.
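A rough sketch of the two alternating steps described above, on synthetic data: per-task parameters are estimated first, and a sparse precision matrix over those parameters is then fit with the graphical lasso as a stand-in for the task-relationship structure. The coupling term that would feed the learned structure back into the task fits is omitted, so this is only an illustration of the building blocks, not the paper's model.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_tasks, n_samples, n_features = 8, 100, 30
Xs = [rng.standard_normal((n_samples, n_features)) for _ in range(n_tasks)]
shared_w = rng.standard_normal(n_features)
ys = [X @ (shared_w + 0.5 * rng.standard_normal(n_features)) for X in Xs]

# Step 1: estimate each task's parameter vector (plain ridge regression here).
W = np.array([Ridge(alpha=1.0).fit(X, y).coef_ for X, y in zip(Xs, ys)])

# Step 2: fit a sparse precision (inverse covariance) matrix treating tasks as variables;
# its non-zero pattern is read as the conditional task-relationship structure.
precision = GraphicalLasso(alpha=0.1).fit(W.T).precision_
related = np.abs(precision) > 1e-6   # related[i, j]: tasks i and j are directly related
```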
The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events when smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in quality control is to discover true events from diverse and noisy participants' reports. This truth discovery problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants' mobility and reliability. Decoupling these two types of uncertainties through location tracking will raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports will significantly degrade the accuracy of the discovered truth. In this paper, we propose a new method to tackle this truth discovery problem through principled probabilistic modeling. In particular, we integrate the modeling of location popularity, location visit indicators, truth of events and three-way participant reliability in a unified framework. The proposed model is thus capable of efficiently handling various types of uncertainties and automatically discovering truth without any supervision or the need of location tracking. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.
Detecting anomalous events from spatial data has important applications in the real world. Spatial scan statistic methods are popular in this area. By maximizing the spatial statistical discrepancy between observed data and a given baseline data distribution, significant spatial overdensities and underdensities can be detected. In reality, the spatial discrepancy is often irregularly shaped and has a structure of multiple spatial scales. However, a large-scale discrepancy pattern may not be significant when conducting fine-granularity analysis. Meanwhile, local irregular boundaries of a maximized discrepancy cannot be well approximated with a coarse-granularity analysis. Existing methods mostly work either on a fixed granularity, or with a regularly shaped scanning window. Thus, they have difficulties in characterizing such flexible spatial discrepancies. To solve this problem, in this paper we propose a novel discrepancy maximization algorithm, RefineScan. A grid hierarchy encoding multi-scale information is employed, making the algorithm capable of maximizing spatial discrepancies with multi-scale structures and irregular shapes. Experiments on a wide range of datasets demonstrate the advantages of RefineScan over state-of-the-art algorithms: it always finds the largest discrepancy scores and characterizes multi-scale discrepancy boundaries remarkably better. Theoretical and empirical analyses also show that RefineScan has moderate computational complexity and good scalability.
Location-based services allow users to perform check-in actions, which not only record their geo-spatial activities, but also provide a plentiful source for data scientists to analyze and to plan more accurate and useful geographical recommender systems. In this paper, we present a novel Time-aware Route Planning (TRP) problem using location check-in data. The central idea is that the pleasure of staying at the locations along a route is significantly affected by their visiting time. Each location has its own proper visiting time due to its category, objective, and population. To incorporate the visiting time of locations into route planning, we develop a three-stage time-aware route planning framework. First, since the visiting times of existing locations are usually noisy and newly constructed locations have no visiting information, we devise an inference method, LocTimeInf, to predict and recover the location visiting times on routes. Second, we aim to find representative and popular time-aware location-transition behaviors from user check-in data, and a Time-aware Transit Pattern Mining (TTPM) algorithm is proposed correspondingly. Third, based on the mined time-aware transit patterns, we develop a Proper Route Search (PR-Search) algorithm to construct the final time-aware routes for recommendation. Experiments on Gowalla check-in data exhibit the promising effectiveness and efficiency of the proposed methods compared to a series of competitors.
Predicting promising academic papers is useful for a variety of parties, including researchers, universities, scientific councils, and policymakers. Researchers may benefit from such data to narrow down their reading lists and focus on what will be important, and policymakers may use predictions to infer rising fields for a more strategic distribution of resources. This paper proposes a novel technique to predict a paper's future impact (i.e., number of citations) by using temporal and topological features derived from citation networks. We use a behavioral modeling approach in which the temporal change in the number of citations a paper receives is clustered, and new papers are evaluated accordingly. Then, within each cluster, we model impact prediction as a regression problem where the objective is to predict the number of citations that a paper will get in the near or far future, given the early citation performance of the paper. The results of empirical evaluations on data from several well-known citation databases show that the proposed framework performs significantly better than state-of-the-art approaches.
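A hedged sketch of the two-stage pipeline the abstract describes: early citation trajectories are clustered, and a separate regressor per cluster maps early counts to a later citation count. The synthetic trajectories, cluster count, and regressor choice below are placeholders, not the paper's actual features or models.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_papers, early_years = 500, 3
growth = rng.choice([0.5, 2.0, 5.0], size=n_papers)          # three latent citation regimes
early = np.cumsum(rng.poisson(growth[:, None], (n_papers, early_years)), axis=1)
future = early[:, -1] * growth + rng.normal(0, 1, n_papers)  # citations a few years later

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(early)
labels = kmeans.labels_
models = {c: LinearRegression().fit(early[labels == c], future[labels == c])
          for c in np.unique(labels)}

def predict_impact(early_counts):
    """Assign a new paper's early trajectory to a cluster, then regress within it."""
    c = int(kmeans.predict(early_counts.reshape(1, -1))[0])
    return float(models[c].predict(early_counts.reshape(1, -1))[0])

print(predict_impact(np.array([2.0, 5.0, 9.0])))
```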
Entity Linking is the task of assigning entities from a Knowledge Base to textual mentions of such entities in a document. State-of-the-art approaches rely on lexical and statistical features which are abundant for popular entities but sparse for unpopular ones, resulting in a clear bias towards popular entities and poor accuracy for less popular ones. In this work, we present a novel approach that is guided by a natural notion of semantic similarity which is less susceptible to such bias. We adopt a unified semantic representation for entities and documents - the probability distribution obtained from a random walk on a subgraph of the knowledge base - which can overcome the feature sparsity issue that affects previous work. Our algorithm continuously updates the semantic signature of the document as mentions are disambiguated, thus focusing the search based on context. Our experimental evaluation uses well-known benchmarks and different samples of a Wikipedia-based benchmark with varying entity popularity; the results illustrate well the bias of previous methods and the superiority of our approach, especially for the less popular entities.
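The semantic signature described above can be illustrated with a random walk with restart (personalized PageRank) over a toy knowledge-base subgraph, re-seeded as mentions get disambiguated; the graph, seed entities, and restart parameter below are made-up examples, not the paper's setup.

```python
import networkx as nx

# Tiny illustrative knowledge-base subgraph.
kb = nx.Graph()
kb.add_edges_from([
    ("Michael_Jordan_(basketball)", "Chicago_Bulls"),
    ("Michael_Jordan_(scientist)", "Machine_learning"),
    ("Chicago_Bulls", "NBA"),
    ("Machine_learning", "Statistics"),
])

def semantic_signature(seed_entities, alpha=0.85):
    """Personalized PageRank seeded by the entities disambiguated so far."""
    personalization = {n: (1.0 if n in seed_entities else 0.0) for n in kb}
    return nx.pagerank(kb, alpha=alpha, personalization=personalization)

# As mentions get disambiguated, the signature is re-seeded and the ranking of the
# remaining candidate entities shifts toward the document's context.
sig = semantic_signature({"Chicago_Bulls"})
print(sorted(sig.items(), key=lambda kv: -kv[1])[:3])
```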
The flexibility of the RDF data model has attracted an increasing number of organizations to store their data in an RDF format. With the rapid growth of RDF datasets, we envision that it is inevitable to deploy a cluster of computing nodes to process large-scale RDF data in order to deliver desirable query performance. In this paper, we address the challenging problems of data partitioning and query optimization in a scale-out RDF engine. We identify that existing approaches focus only on using fine-grained structural information for data partitioning, and hence fail to localize many types of complex queries. We then propose a radically different approach, where a coarse-grained structure, namely the Rooted Sub-Graph (RSG), is used as the partition unit. By doing so, we can capture structural information at a much greater scale and hence are able to localize many complex queries. We also propose a k-means partitioning algorithm for allocating the RSGs onto the computing nodes as well as a query optimization strategy to minimize inter-node communication during query processing. An extensive experimental study using benchmark datasets and a real dataset shows that our engine, SemStore, outperforms existing systems by orders of magnitude in terms of query response time.
Many studies have been conducted on seeking an efficient solution for pattern matching over graphs. This interest is largely due to the large number of applications in many fields that require efficient solutions for pattern matching, including protein complex prediction, social network analysis and structural pattern recognition. However, in many real applications, the graph data are often noisy, incomplete, and inaccurate; in other words, there exist many uncertain graphs. Therefore, in this paper, we study pattern matching in a large uncertain graph. Specifically, we want to retrieve all qualified matches of a query pattern in the uncertain graph. Though pattern matching over an uncertain graph is NP-hard, we employ a filtering-and-verification framework to speed up the search. In the filtering phase, we propose a probabilistic matching tree, PM-tree, based on match cuts obtained by a cut selection process. Based on the PM-tree, we devise a collective pruning strategy to prune a large number of unqualified matches. During the verification phase, we develop an efficient sampling algorithm to validate the remaining candidates. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.
Due to imprecise query intentions, Web database users often search for information using a limited number of keywords that are not directly related to their precise query. Semantic approximate keyword query is challenging but helpful for specifying such query intent and providing more relevant answers. By extracting the semantic relationships both between keywords and between keyword queries, this paper proposes a new keyword query approach which generates semantic approximate answers by identifying a set of keyword queries from the query history whose semantics are related to the given keyword query. To capture the semantic relationships between keywords, a semantic coupling relationship analysis model is introduced to model both the intra- and inter-keyword couplings. Building on the coupling relationships between keywords, the semantic similarity of different keyword queries is then measured by a semantic matrix. In an off-line preprocessing step, the representative queries in the query history are identified and, for each representative query, an a priori order of the remaining queries is created. These representative queries and their associated orders are then used to expeditiously generate the top-k ranked semantically related keyword queries. We demonstrate that our coupling relationship analysis model can accurately capture the semantic relationships both between keywords and between queries. The efficiency of the top-k keyword query selection algorithm is also demonstrated.
Aggregated search is the task of blending results from different search services, or verticals, into a set of web search results. Aggregated search coherence is the extent to which results from different sources focus on similar senses of an ambiguous or underspecified query. Prior work investigated the "spill-over" effect between a set of blended vertical results and the web results. These studies found that users are more likely to interact with the web results when the vertical results are more consistent with the user's intended query-sense. We extend this prior work by investigating three new research questions: (1) Does the spill-over effect generalize across different verticals? (2) Does the vertical rank moderate the level of spill-over? and (3) Does the presence of a border around the vertical results moderate the level of spill-over? We investigate four different verticals (images, news, shopping, and video) and measure spill-over using interaction measures associated with varying levels of engagement with the web results (bookmarks, clicks, scrolls, and mouseovers). Results from a large-scale crowdsourced study suggest that: (1) The spill-over effect generalizes across verticals, but is stronger for some verticals than others, (2) Vertical rank has a stronger moderating effect for verticals with a mid-level of spill-over, and (3) Including a border around the vertical results has a subtle moderating effect for those verticals with a low level of spill-over.
Query Auto Completion (QAC) suggests possible queries to web search users from the moment they start entering a query. This popular feature of web search engines is thought to reduce physical and cognitive effort when formulating a query.
Perhaps surprisingly, despite QAC being widely used, users' interactions with it are poorly understood. This paper begins to address this gap. We present the results of an in-depth user study of user interactions with QAC in web search. While study participants completed web search tasks, we recorded their interactions using eye-tracking and client-side logging. This allows us to provide a first look at how users interact with QAC. We specifically focus on the effects of QAC ranking, by controlling the quality of the ranking in a within-subject design.
We identify a strong position bias that is consistent across ranking conditions. Due to this strong position bias, ranking quality affects QAC usage. We also find an effect on task completion, in particular on the number of result pages visited. We show how these effects can be explained by a combination of searchers' behavior patterns, namely monitoring or ignoring QAC, and searching for spelling support or complete queries to express a search intent. We conclude the paper with a discussion of the important implications of our findings for QAC evaluation.
Tail queries, which occur with low frequency, make up a large fraction of unique queries and often affect a user's experience during Web search. Because of the data sparseness problem, the information that can be leveraged for tail queries is insufficient. Hence, it is important yet difficult to improve tail query performance. According to our observations, 26% of tail queries are not essentially scarce: they are expressed in an unusual way, but the information requirements are not rare. In this study, we improve tail query performance by fusing the results from the original query and the query reformulation candidates. In addition to re-ranking existing results, the fusion model can introduce new results. We emphasize that it is not only poorly performing queries that can be improved, and we propose to extract features that predict whether the performance can be improved. Then, we utilize a learning-to-rank method, which is trained to directly optimize a retrieval metric, to fuse the documents and obtain a final result list. We conducted experiments using data from two popular Chinese search engines. The results indicate that our fusion method significantly improves the performance of tail queries and outperforms state-of-the-art approaches on the same reformulations. Experiments show that our method is effective for non-tail queries as well.
Knowing, in real time, whether a current searcher in an information retrieval system finds the search task difficult can be valuable for tailoring the system's support for that searcher. This study investigated searchers' behaviors at different stages of the search process, namely: 1) the first-round point at the beginning of the search, right before searchers issued their second query; 2) the middle point, when searchers proceeded to the middle of the search process; and 3) the end point, when searchers finished the whole task. We compared how the behavioral features calculated at these three points differed between difficult and easy search tasks, and identified behavioral features during search sessions that can be used in real time to predict perceived task difficulty. In addition, we compared the prediction performance at different stages of the search process. Our results show that a number of user behavioral measures at all three points differed between easy and difficult tasks. Query interval time, dwell time on viewed documents, and number of viewed documents per query were important predictors of task difficulty. The results also indicate that it is possible to make relatively accurate predictions of task difficulty at the first query round of a search. Our findings can help search systems predict task difficulty, which is necessary for personalizing support for the individual searcher.
This study investigates recall and recognition in a news refinding task where participants were asked to read news articles and then to search for the same articles a fortnight later. Recall, which is a task to express what a person remembers, corresponds to query formulation, while recognition, which is a task to judge whether a presented item has been shown before, corresponds to a user's relevance judgments on search results in a refinding task. Our four main contributions can be summarized as follows: (i) we developed a method to investigate the effects of memory loss on episode refinding tasks on a large scale; (ii) our user study revealed a large drop in search performance in the refinding task after a fortnight, as well as several differences between search queries input immediately after news browsing and those input at a later time; (iii) we found that asking questions and expanding input queries on the basis of the answers significantly improved search performance in the news refinding task; and (iv) users' recognition abilities differed from their recall abilities, e.g., object names in a news story could be correctly recognized even though they were rarely recalled. Our findings support several findings in cognitive psychology from the viewpoint of information refinding and also have several implications for search algorithms that assist user refinding.
In web search, recency ranking refers to the task of ranking documents while taking into account freshness as one of the criteria of their relevance. There are two approaches to recency ranking. One focuses on extending existing learning-to-rank algorithms to optimize for both freshness and relevance. The other relies on an aggregated search strategy: a (dedicated) fresh vertical is used and fresh results from this vertical are subsequently integrated into the search engine result page. In this paper, we adopt the second strategy. In particular, we focus on the fresh vertical prediction task for repeating queries and identify the following novel algorithmic problem: how to quickly correct fresh intent detection mistakes made by a state-of-the-art fresh intent detector, which erroneously detected or missed a fresh intent shift upwards for a particular repeating query (i.e., a change in the degree to which the query has a fresh intent). We propose a method for solving this problem. We use online exploration at the early start of what we believe to be a detected intent shift. Based on this exploratory phase, we correct fresh intent detection mistakes made by a state-of-the-art fresh intent detector for queries whose fresh intent has shifted. Using query logs of Yandex, we demonstrate that our method allows us to significantly improve the speed and quality of the detection of fresh intent shifts.
Test collections play an important role in adhoc and diversity retrieval evaluation. Constructing a test collection for adhoc evaluation involves (1) selecting a set of queries to be judged, (2) selecting an intent (topic) description for each query, and (3) obtaining relevance judgments with respect to the specific intent description for that particular query. Recent work showed that the selection of intents plays an important role in the relative performance of retrieval systems for diversity evaluation. However, no previous work has analysed how the choice of a specific intent description may affect adhoc evaluation. We show that intent descriptions have a significant impact on adhoc evaluation and that special care should be given to how the intent descriptions are selected. We further show that it is better to have very general intent descriptions, or no intent descriptions at all, when constructing test collections for adhoc evaluation. We then focus on diversity evaluation and identify the effect intent descriptions have on diversity-based retrieval evaluation. We quantify this effect and discuss experimental design decisions for the optimal distribution of judgment effort across different intents for a query vs. different queries.
Result diversification is a topic of great value for enhancing user experience in many fields, such as web search and recommender systems. Many existing methods generate a diversified result list in a sequential manner, but they work well only if the preceding choices are optimal or close to the optimal solution. Moreover, a manually tuned parameter (say, λ) is often required to trade off relevance and diversity. This makes it difficult to know whether failures are caused by the optimization criterion or by the setting of λ. In the context of web search, we formulate the result diversification task as a 0-1 multiple subtopic knapsack problem (MSKP), where a subset of documents is optimally chosen, much like filling up multiple subtopic knapsacks. This formulation requires no trade-off parameters to be specified beforehand. Since solving the 0-1 MSKP is NP-hard, we treat the optimization of the 0-1 MSKP, using a graphical model over latent binary variables, as a maximum a posteriori inference problem, and tackle it with the max-sum belief propagation algorithm. To validate the effectiveness and efficiency of the proposed 0-1 MSKP model, we conduct a series of experiments on two TREC diversity collections. The experimental results show that the proposed model significantly outperforms several state-of-the-art methods, not only in terms of standard diversity metrics (α-nDCG, nERR-IA and subtopic recall), but also in terms of efficiency.
Search advertising shows a trend of vertical extension. Vertical ads, including product ads and local search ads, are proliferating at an ever increasing pace. They typically offer better ROI to advertisers as a result of better user engagement. However, campaigns and bids in vertical ads are not set at the keyword level. As a result, the matching between user queries and ads suffers from a low recall rate, and the match quality is heavily impacted by tail queries. In this paper, we propose an ad retrieval framework for retail vertical ads, based on query rewriting using personal history data, to improve the ad recall rate. To ensure ad quality, we also present a relevance model for matching rewritten queries with user search intent, with a particular focus on tail queries. In addition, we design and implement a GPU-based system to accelerate the training of the relevance model to meet production performance constraints. Finally, we carry out extensive experiments on large-scale logs collected from Bing, and show significant gains in ad retrieval rate without compromising ad quality.
Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces.
Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1-1/e) and actually produces a sequence of nodes, with each prefix having approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size.
We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.
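For reference, the classical Monte Carlo greedy baseline that SKIM is compared against can be sketched as follows; the graph, propagation probability, and simulation budget are illustrative, and this is the plain greedy algorithm, not the sketch-based SKIM method.

```python
import random
import networkx as nx

def simulate_ic(g, seeds, p=0.1, rng=random.Random(0)):
    """One Independent Cascade simulation; returns the number of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in g.successors(u):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(g, k, runs=100):
    """Pick k seeds, each round adding the node with the best estimated spread."""
    seeds = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for v in g.nodes():
            if v in seeds:
                continue
            spread = sum(simulate_ic(g, seeds + [v]) for _ in range(runs)) / runs
            if spread > best_spread:
                best, best_spread = v, spread
        seeds.append(best)
    return seeds

g = nx.gnp_random_graph(100, 0.05, seed=0, directed=True)
print(greedy_im(g, k=3))
```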
Many interesting domains in machine learning can be viewed as networks, with relationships (e.g., friendships) connecting items (e.g., individuals). The Active Exploration (AE) task is to identify all items in a network with a desired trait (i.e., positive labels) given only partial information about the network. The AE process iteratively queries for labels or network structure within a limited budget; thus, accurate predictions prior to making each query are critical to maximizing the number of positives gathered. However, the targeted AE query process produces partially observed networks that can create difficulties for predictive modeling. In particular, we demonstrate that these partial networks can exhibit extreme label correlation bias, which makes it difficult for conventional relational learning methods to accurately estimate relational parameters. To overcome this issue, we model the joint distribution of possible edges and labels to improve learning and inference. Our proposed method, Probabilistic Relational Expectation Maximization (PR-EM), is the first AE approach to accurately learn the complex dependencies between attributes, labels, and structure to improve predictions. PR-EM utilizes collective inference over the missing relationships in the partial network to jointly infer unknown item traits. Further, we develop a linear inference algorithm to facilitate efficient use of PR-EM in large networks. We test our approach on four real-world networks, showing that AE with PR-EM gathers significantly more positive items compared to state-of-the-art methods.
Information diffusion has been widely studied in networks, aiming to model the spread of information among objects when they are connected with each other. Most current research assumes the underlying network is homogeneous, i.e., objects are of the same type and they are connected by links with the same semantic meanings. However, in the real world, objects are connected via different types of relationships, forming multi-relational heterogeneous information networks.
In this paper, we propose to model information diffusion in such multi-relational networks, by distinguishing the power in passing information around for different types of relationships. We propose two variations of the linear threshold model for multi-relational networks, by considering the aggregation of information at either the model level or the relation level. In addition, we use real diffusion action logs to learn the parameters in these models, which will benefit diffusion prediction in real networks. We apply our diffusion models in two real bibliographic information networks, DBLP network and APS network, and experimentally demonstrate the effectiveness of our models compared with single-relational diffusion models. Moreover, our models can determine the diffusion power of each relation type, which helps us understand the diffusion process better in the multi-relational bibliographic network scenario.
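A toy illustration of a Linear Threshold-style spread in a multi-relational network, where each relation type carries its own influence weight; the relation types, weights, and thresholds below are assumptions for illustration, not parameters learned by the proposed models.

```python
import random
import networkx as nx

# Illustrative per-relation diffusion powers; a full implementation would normalize
# incoming weights so that they sum to at most one per node.
relation_weight = {"coauthor": 0.6, "citation": 0.3}

g = nx.DiGraph()
g.add_edge("a", "c", rel="coauthor")
g.add_edge("b", "c", rel="citation")
g.add_edge("c", "d", rel="coauthor")

rng = random.Random(0)
threshold = {v: rng.uniform(0, 1) for v in g}   # node activation thresholds
active = {"a", "b"}                             # initial seed set

changed = True
while changed:
    changed = False
    for v in g:
        if v in active:
            continue
        # Aggregate incoming influence, discounted by each edge's relation type.
        influence = sum(relation_weight[g.edges[u, v]["rel"]]
                        for u in g.predecessors(v) if u in active)
        if influence >= threshold[v]:
            active.add(v)
            changed = True
print(active)
```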
The availability of user check-in data in large volume from rapidly growing location-based social networks (LBSNs) enables a number of important location-aware services. Point-of-interest (POI) recommendation is one such service, which recommends POIs that users have not visited before. It has been observed that: (i) users tend to visit nearby places, and (ii) users tend to visit different places in different time slots, and within the same time slot, users tend to periodically visit the same places. For example, users usually visit a restaurant during lunch hours, and visit a pub at night. In this paper, we focus on the problem of time-aware POI recommendation, which aims at recommending a list of POIs for a user to visit at a given time. To exploit both geographical and temporal influences in time-aware POI recommendation, we propose the Geographical-Temporal influences Aware Graph (GTAG) to model check-in records, geographical influence and temporal influence. For effective and efficient recommendation based on GTAG, we develop a preference propagation algorithm named Breadth-first Preference Propagation (BPP). The algorithm follows a relaxed breadth-first search strategy, and returns recommendation results within at most 6 propagation steps. Our experimental results on two real-world datasets show that the proposed graph-based approach substantially outperforms state-of-the-art POI recommendation methods.
Decision trees have been used for several decades as simple and effective solutions to supervised learning problems. Their success extends to tasks across a variety of areas. Yet, data collected today through web-domains such as on-line advertising presents many new challenges: sheer size, the prevalence of high-arity categorical features, unknown feature-values, "cold starts", sparse training instances, and imbalance in the class labels. We argue that decision trees remain an ideal choice for applications of on-line advertising as they naturally construct higher-order conjunctive features; we then contribute two ideas to improve tree-building accordingly. First, to handle high-arity categorical features, we introduce a method to cluster feature-values based on their output responses. The result is more "data-dense" trees with relatively small branching factors. Second, we employ cross-validation as a principled approach to derive splitting and stopping criteria: thereby we identify splits that generalize well, and also curb overfitting. Evaluated on three distinct probability-estimation tasks in on-line advertising, our method, "CCDT", shows significant improvements in the accuracy of prediction.
Businesses store an ever increasing amount of historical customer sales data. Given the availability of such information, it is advantageous to analyze past sales, both for revealing dominant buying patterns, and for providing more targeted recommendations to clients. In this context, co-clustering has proved to be an important data-modeling primitive for revealing latent connections between two sets of entities, such as customers and products.
In this work, we introduce a new algorithm for co-clustering that is both scalable and highly resilient to noise. Our method is inspired by k-Means and agglomerative hierarchical clustering approaches: (i) first it searches for elementary co-clustering structures and (ii) then combines them into a better, more compact, solution. The algorithm is flexible as it does not require an explicit number of co-clusters as input, and is directly applicable on large data graphs. We apply our methodology on real sales data to analyze and visualize the connections between clients and products. We showcase a real deployment of the system, and how it has been used for driving a recommendation engine. Finally, we demonstrate that the new methodology can discover co-clusters of better quality and relevance than state-of-the-art co-clustering techniques.
The tensor decomposition operation is the basis for many data analysis tasks, from clustering, trend detection, and anomaly detection to correlation analysis. One key problem with tensor decomposition, however, is its computational complexity -- especially for dense data sets, the decomposition process takes time exponential in the number of tensor modes; the process is relatively faster for sparse tensors, but decomposition is still a major bottleneck in many applications. While it is possible to reduce the decomposition time by trading performance for decomposition accuracy, a drop in accuracy may not always be acceptable. In this paper, we first recognize that in many applications, the user may have a focus of interest -- i.e., part of the data for which the user needs high accuracy -- and beyond this focus area, accuracy may not be as critical. Relying on this observation, we propose a novel Personalized Tensor Decomposition (PTD) mechanism for accounting for the user's focus: PTD takes as input one or more areas of focus and performs the decomposition in such a way that, when reconstructed, the accuracy of the tensor is boosted for these areas of focus. We discuss alternative ways PTD can be implemented. Experiments show that PTD helps boost accuracy at the foci of interest, while reducing the overall tensor decomposition time.
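The core "personalization" idea can be illustrated with a pure-NumPy weighted rank-1 CP approximation in which cells inside a user-specified focus region receive a larger weight, biasing the fit (and hence reconstruction accuracy) toward that region. This illustrates only the weighting principle, not the paper's PTD mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 10, 10))      # synthetic 3-mode tensor
W = np.ones_like(X)
W[:3, :3, :3] = 10.0              # focus of interest: boost these cells' weight

a, b, c = (rng.random(10) for _ in range(3))
for _ in range(50):               # weighted alternating least squares, rank-1 model
    a = np.einsum('ijk,ijk,j,k->i', W, X, b, c) / np.einsum('ijk,j,k->i', W, b**2, c**2)
    b = np.einsum('ijk,ijk,i,k->j', W, X, a, c) / np.einsum('ijk,i,k->j', W, a**2, c**2)
    c = np.einsum('ijk,ijk,i,j->k', W, X, a, b) / np.einsum('ijk,i,j->k', W, a**2, b**2)

approx = np.einsum('i,j,k->ijk', a, b, c)
focus_err = np.abs(X - approx)[:3, :3, :3].mean()
rest_err = np.abs(X - approx)[3:, 3:, 3:].mean()
print(focus_err, rest_err)        # the focus region is typically reconstructed more accurately
```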
Recently, heterogeneous information network analysis, which models networked data as networks that include different types of objects and relations, has attracted increasing attention. Many data mining tasks have been explored in heterogeneous networks, among which clustering and ranking are two basic tasks. These two tasks are usually performed separately, whereas recent research shows that they can mutually enhance each other. Unfortunately, these works are limited to heterogeneous networks with special structures (e.g., bipartite or star-schema networks). However, real data are more complex and irregular, so it is desirable to design a general method to manage objects and relations in heterogeneous networks with arbitrary schemas. In this paper, we study the ranking-based clustering problem in a general heterogeneous information network and propose a novel solution, HeProjI. HeProjI projects a general heterogeneous network into a sequence of sub-networks, and an information transfer mechanism is designed to keep the consistency among sub-networks. For each sub-network, a path-based random walk model is built to estimate the reachability probability of objects, which can be used for clustering and ranking analysis. Iteratively analyzing each sub-network leads to effective ranking-based clustering. Extensive experiments on three real datasets illustrate that HeProjI can achieve better clustering and ranking performance than other well-established algorithms.
Question-and-answer (Q&A) websites, such as Yahoo! Answers, Stack Overflow and Quora, have become a popular and powerful platform for Web users to share knowledge on a wide range of subjects. This has led to a rapidly growing volume of information and the consequent challenge of readily identifying high quality objects (questions, answers and users) in Q&A sites. Exploring the interdependent relationships among different types of objects can help find high quality objects in Q&A sites more accurately. In this paper, we specifically focus on the ranking problem of co-ranking questions, answers and users in a Q&A website. By studying the tightly connected relationships between Q&A objects, we can gain useful insights toward solving the co-ranking problem. However, co-ranking multiple objects in Q&A sites is a challenging task: a) With the large volumes of data in Q&A sites, it is important to design a model that can scale well; b) The large-scale Q&A data makes extracting supervised information very expensive. In order to address these issues, we propose an unsupervised Network-based Co-Ranking framework (NCR) to rank multiple objects in Q&A sites. Empirical studies on real-world Yahoo! Answers datasets demonstrate the effectiveness and the efficiency of the proposed NCR method.
The rapid proliferation of hand-held devices has led to the development of rich, interactive and immersive applications, such as e-readers for electronic books. These applications motivate retrieval systems that can implicitly satisfy any information need of the reader by exploiting the context of the user's interactions. Such retrieval systems differ from traditional search engines in that the queries constructed using the context are typically complex objects (including the document and its structure).
In this paper, we develop an efficient retrieval system, only assuming an oracle access to a traditional search engine that admits 'succinct' keyword queries for retrieving objects of a desired media type. As part of query generation, we first map the complex query object to a concept graph and then use the concepts along with their relationships in the graph to compute a small set of keyword queries to the search engine. Next, as part of the result generation, we aggregate the results of these queries to identify relevant web content of the desired type, thereby eliminating the need for explicitly computing similarity between the query object and all web content. We present a theoretical analysis of our approach and carry out a detailed empirical evaluation to show the practicality of the approach for the task of augmenting electronic documents with high quality videos from the web.
Nowadays, the Web brings an overwhelming variety of choices to consumers. Recommendation systems facilitate this selection by issuing recommendations to them. Recommendations for users, or groups of users, are determined by considering users similar to the users in question. Scanning the whole database to locate similar users, though, is expensive. Existing approaches build cluster models by employing full-dimensional clustering to find sets of similar users. As the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore the fault-tolerant subspace clustering approach. We extend the concept of fault tolerance to density-based subspace clustering, and to speed up our algorithms, we introduce a significance threshold for considering only promising dimensions for subspace extension. Moreover, as subspace clustering potentially yields a multitude of similar users, we propose a weighted ranking approach to refine the set of like-minded users. Our experiments on real movie datasets show that the diversification of the similar users that the subspace clustering approaches offer results in better recommendations compared to traditional collaborative filtering and full-dimensional clustering approaches.
Geographical characteristics derived from the historical check-in data have been reported effective in improving location recommendation accuracy. However, previous studies mainly exploit geographical characteristics from a user's perspective, via modeling the geographical distribution of each individual user's check-ins. In this paper, we are interested in exploiting geographical characteristics from a location perspective, by modeling the geographical neighborhood of a location. The neighborhood is modeled at two levels: the instance-level neighborhood defined by a few nearest neighbors of the location, and the region-level neighborhood for the geographical region where the location exists. We propose a novel recommendation approach, namely Instance-Region Neighborhood Matrix Factorization (IRenMF), which exploits two levels of geographical neighborhood characteristics: a) instance-level characteristics, i.e., nearest neighboring locations tend to share more similar user preferences; and b) region-level characteristics, i.e., locations in the same geographical region may share similar user preferences. In IRenMF, the two levels of geographical characteristics are naturally incorporated into the learning of latent features of users and locations, so that IRenMF predicts users' preferences on locations more accurately. Extensive experiments on the real data collected from Gowalla, a popular LBSN, demonstrate the effectiveness and advantages of our approach.
We consider the expert recommendation problem for open collaborative projects in large-scale Open Source Software (OSS) communities. In a large-scale online community, recommending expert collaborators to a project coordinator or lead developer poses two prominent challenges: (i) the "cold shoulder" problem, which is the lack of interest from the experts to collaborate and share their skills, and (ii) the "cold start" problem, which is an issue with community members who have a scarce data history. In this paper, we consider the Degree of Knowledge (DoK), which imposes the knowledge-of-the-skills factor, and the Social Relative Importance (SRI), which imposes the social distance factor, to tackle the aforementioned challenges. We propose four DoK models and integrate them with three SRI methods under our proposed Expert Ranking (ER) framework to rank candidate expert collaborators based on their likelihood of collaborating in response to a query formulated from the social network of the query initiator and the skills required by a project/task. We evaluate our proposal using a dataset collected from Github.com, which is one of the fastest-growing, large-scale online OSS communities. In addition, we test the models under different data scarcity levels. The experiments show promising results in recommending expert collaborators who tend to make real collaborations on projects.
Collaborative filtering is a fundamental building block in many recommender systems. While most of the existing collaborative filtering methods focus on explicit, multi-class settings (e.g., 1-5 stars in movie recommendation), many real-world applications actually belong to the one-class setting where user feedback is implicitly expressed (e.g., views in news recommendation and video recommendation). The main challenges in such one-class setting include the ambiguity of the unobserved examples and the sparseness of existing positive examples.
In this paper, we propose a dual-regularized model for one-class collaborative filtering. In particular, we address the ambiguity challenge by integrating two state-of-the-art one-class collaborative filtering methods to enjoy the best of both worlds. We tackle the sparseness challenge by exploiting the side information from both users and items. Moreover, we propose efficient algorithms to solve the proposed model. Extensive experimental evaluations on two real data sets demonstrate that our method achieves significant improvement over the state-of-the-art methods. Overall, the proposed method leads to a 7.9% - 21.1% improvement over its best-known competitors in terms of prediction accuracy, while enjoying linear scalability.
Matrix factorization is one of the most powerful techniques in collaborative filtering; it models the (user, item) interactions behind historical explicit or implicit feedback. However, plain matrix factorization may not be able to uncover structural correlations among users and items, such as communities and taxonomies. As a response, we design a novel algorithm, hierarchical group matrix factorization (HGMF), in order to explore and model the structural correlations among users and items in a principled way. Specifically, we first define four types of correlations, including (user, item), (user, item group), (user group, item) and (user group, item group); we then extend plain matrix factorization with a hierarchical group structure; finally, we design a novel clustering algorithm to mine the hidden structural correlations. In the experiments, we study the effectiveness of our HGMF for both rating prediction and item recommendation, and find that it outperforms several state-of-the-art methods on several real-world data sets.
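For context, the plain matrix-factorization baseline that HGMF extends can be sketched as a small SGD loop on a synthetic rating matrix; the group hierarchy and clustering steps of HGMF are not shown, and all sizes and hyperparameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 30, 40, 5
ratings = [(u, i, float(rng.integers(1, 6)))               # sparse synthetic ratings
           for u in range(n_users) for i in range(n_items) if rng.random() < 0.2]

P = 0.1 * rng.standard_normal((n_users, k))                # user latent factors
Q = 0.1 * rng.standard_normal((n_items, k))                # item latent factors
lr, reg = 0.01, 0.05
for epoch in range(30):
    rng.shuffle(ratings)
    for u, i, r in ratings:
        pu = P[u].copy()
        err = r - pu @ Q[i]                                # prediction error on (u, i)
        P[u] += lr * (err * Q[i] - reg * pu)               # SGD steps with L2 regularization
        Q[i] += lr * (err * pu - reg * Q[i])

print("predicted rating for (user 0, item 1):", P[0] @ Q[1])
```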
An essential component of building a successful crowdsourcing market is effective task matching, which matches a given task to the right crowdworkers. In order to provide high-quality task matching, crowdsourcing systems rely on past task-solving activities of crowdworkers. However, the average number of past activities of crowdworkers in most crowdsourcing systems is very small. We call the workers who have only solved a small number of tasks cold-start crowdworkers. We observe that most of the workers in crowdsourcing systems are cold-start crowdworkers, and crowdsourcing systems actually enjoy great benefits from cold-start crowdworkers. However, the problem of task matching with the presence of many cold-start crowdworkers has not been well studied. We propose a new approach to address this issue. Our main idea, motivated by the prevalence of online social networks, is to transfer the knowledge about crowdworkers in their social networks to crowdsourcing systems for task matching. We propose a SocialTransfer model for cold-start crowdsourcing, which not only infers the expertise of warm-start crowdworkers from their past activities, but also transfers the expertise knowledge to cold-start crowdworkers via social connections. We evaluate the SocialTransfer model on the well-known crowdsourcing system Quora, using knowledge from the popular social network Twitter. Experimental results show that, by transferring social knowledge, our method achieves significant improvements over the state-of-the-art methods.
Online social streams such as Twitter timelines, forum discussions and email threads have emerged as important channels for information propagation. Mining transient stories and their correlations implicit in social streams is a challenging task, since these streams are noisy and surge quickly. In this paper, we propose CAST, which is a context-aware story-teller that discovers new stories from social streams and tracks their structural context on the fly to build a vein of stories. More precisely, we model the social stream as a capillary network, and define stories by a new cohesive subgraph type called (k,d)-Core in the capillary network. We propose deterministic and randomized context search to support the iceberg query, which builds the story vein as social streams flow. We perform detailed experimental study on real Twitter streams and the results demonstrate the creativity and value of our approach.
Graph has been a ubiquitous and essential data representation to model real world objects and their relationships. Today, large amounts of graph data have been generated by various applications. Graph summarization techniques are crucial in uncovering useful insights about the patterns hidden in the underlying data. However, all existing works in graph summarization are single-process solutions, and as a result cannot scale to large graphs. In this paper, we introduce three distributed graph summarization algorithms to address this problem. Experimental results show that the proposed algorithms can produce good quality summaries and scale well with increasing data sizes. To the best of our knowledge, this is the first work to study distributed graph summarization methods.
In recent years, with the emergence of a number of new real applications, such as protein-protein interaction (PPI) networks, visual pattern recognition, and intelligent traffic systems, managing huge volumes of uncertain graphs has attracted much attention in the database community. Currently, most existing fundamental queries over graphs only support deterministic (or certain) graphs, although real graph data are often noisy, inaccurate, and incomplete. In this paper, we study a new type of uncertain graph query, the probabilistic supergraph containment query over large uncertain graphs. Specifically, given an uncertain graph database UGD which contains a set of uncertain graphs, a deterministic query graph q, and a probabilistic threshold δ, a probabilistic supergraph containment query is to find the set of uncertain graphs from UGD, denoted as UGD_q, such that UGD_q = {ug_i ∈ UGD | Pr(ug_i ⊆ q) ≥ δ}, where Pr(ug_i ⊆ q) denotes the likelihood that ug_i is a subgraph of q. We prove that the computation of Pr(ug_i ⊆ q) is #P-hard and design an efficient filtering-and-verification framework to avoid the expensive computation. In particular, we propose an effective filtering strategy and a novel probabilistic inverted index, called PS-Index, to enhance the pruning power in the filtering phase. Furthermore, the candidate graphs which pass the filtering phase are tested in the verification phase via an efficient unequal-probability sampling-based approximation algorithm. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments.
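The semantics of Pr(ug_i ⊆ q) can be illustrated by brute-force Monte Carlo: sample possible worlds of the uncertain graph by flipping each edge with its probability and test containment in the query graph. The paper's PS-Index and unequal-probability sampling are far more efficient; the toy graphs below, and the use of an induced-subgraph test as a simplification, are assumptions of this sketch.

```python
import random
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

# Uncertain graph ug: each edge exists independently with the given probability.
uncertain_edges = [(1, 2, 0.9), (2, 3, 0.8), (1, 3, 0.4)]
# Deterministic query graph q (a simple path a-b-c-d).
q = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])

def estimate_containment(samples=2000, rng=random.Random(0)):
    hits = 0
    for _ in range(samples):
        world = nx.Graph([(u, v) for u, v, p in uncertain_edges if rng.random() < p])
        # GraphMatcher(q, world).subgraph_is_isomorphic() asks whether the sampled world
        # matches an induced subgraph of q -- a simplification of general containment.
        if world.number_of_edges() and GraphMatcher(q, world).subgraph_is_isomorphic():
            hits += 1
    return hits / samples

print(estimate_containment())   # compare the estimate against the threshold delta
```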
Supporting exploratory search is a very challenging problem, not least because of the dynamic nature of the exercise: both the knowledge and interests of the user are subject to constant change. Moreover, whether the results for a query are informative is strongly subjective. What is informative to one user, is too specific for the other; specificity differs between users depending on their intent and accumulated knowledge about the domain.
We propose a formal model - motivated by Information Foraging Theory - for predicting the subjective specificity of search results based on simple observables such as result clicks. Through two studies, including both controlled and free-form exploratory search, we show that our model allows us to differentiate between levels of subjective result specificity with regard to the current information need of the user.
We present methods to automatically identify and recommend sub-tasks to help people explore and accomplish complex search tasks. Although Web searchers often exhibit directed search behaviors such as navigating to a particular Website or locating a particular item of information, many search scenarios involve more complex tasks such as learning about a new topic or planning a vacation. These tasks often involve multiple search queries and can span multiple sessions. Current search systems do not provide adequate support for tackling these tasks. Instead, they place most of the burden on the searcher for discovering which aspects of the task they should explore. Particularly challenging is the case when a searcher lacks the task knowledge necessary to decide which step to tackle next. In this paper, we propose methods to automatically mine search logs for tasks and build an association graph connecting multiple tasks together. We then leverage the task graph to assist new searchers in exploring new search topics or tackling multi-step search tasks. We demonstrate through experiments with human participants that we can discover related and interesting tasks to assist with complex search scenarios.
Faceted search helps users by offering drill-down options as a complement to the keyword input box, and it has been used successfully for many vertical applications, including e-commerce and digital libraries. However, this idea is not well explored for general web search, even though it holds great potential for assisting multi-faceted queries and exploratory search. In this paper, we explore this potential by extending faceted search into the open-domain web setting, which we call Faceted Web Search. To tackle the heterogeneous nature of the web, we propose to use query-dependent automatic facet generation, which generates facets for a query instead of for the entire corpus. To incorporate user feedback on these query facets into document ranking, we investigate both Boolean filtering and soft ranking models. We evaluate Faceted Web Search systems by their utility in assisting users to clarify search intent and find subtopic information. We describe how to build reusable test collections for such tasks, and propose an evaluation method that considers both gain and cost for users. Our experiments testify to the potential of Faceted Web Search, and show that Boolean filtering feedback models, which are widely used in conventional faceted search, are less effective than soft ranking models.
Users' examination of search results is a key concept involved in all click models. However, most studies have assumed that eye fixation means examination, and no further study has been carried out to better understand users' examination behavior. In this study, we design an experimental search engine to collect both the users' feedback on their examinations and the eye-tracking/click-through data. To our surprise, a large proportion (45.8%) of the results fixated on by users are not recognized as being "read". Looking into the tracking data, we found that before the user actually "reads" a result, there is often a "skimming" step in which the user quickly looks at the result without reading it. We thus propose a two-stage examination model which consists of a first "from skimming to reading" stage (Stage 1) and a second "from reading to clicking" stage (Stage 2). We found that the biases (e.g., position bias, domain bias, attractiveness bias) considered in many studies act in different ways in Stage 1 and Stage 2, which suggests that users make judgments according to different signals in different stages. We also show that the two-stage examination behaviors can be predicted from mouse movement behavior, which can be collected at large scale. Relevance estimation with the two-stage examination model also outperforms that with a single-stage examination model. This study shows that users' examination of search results is a complex cognitive process that needs to be investigated in greater depth, and this may have a significant impact on Web search.
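A toy numeric illustration of how the two-stage view factorizes the observed click-through rate into skim, read-given-skim, and click-given-read components; the counts below are hypothetical, not figures from the study.

```python
# Hypothetical counts for one result position (not data from the study).
impressions = 10000
skimmed = 6200        # results the user at least fixated on
read = 3360           # subset of skimmed results the user actually "read"
clicked = 840         # subset of read results that received a click

p_skim = skimmed / impressions            # fixation/skimming rate
p_read_given_skim = read / skimmed        # Stage 1: from skimming to reading
p_click_given_read = clicked / read       # Stage 2: from reading to clicking

# Under the two-stage view the click-through rate factorizes across the stages, so a
# bias can act differently on the skim->read and read->click transitions.
ctr = p_skim * p_read_given_skim * p_click_given_read
print(p_skim, p_read_given_skim, p_click_given_read, ctr)   # ctr == clicked / impressions
```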
Users frequently rely on online reviews for decision making. In addition to allowing users to evaluate the quality of individual products, reviews also support comparison shopping. One key user activity is to compare two (or more) products based on a specific aspect. However, making a comparison across two different reviews, written by different authors, is not always equitable due to the different standards and preferences of individual authors. Therefore, we focus instead on comparative sentences, whereby two products are compared directly by a review author within a single sentence.
We study the problem of comparative relation mining. Given a set of comparative sentences, each relating a pair of entities, our objective is two-fold: to interpret the comparative direction in each sentence, and to determine the relative merits of each entity. This requires mining comparative relations at two levels of resolution: at the sentence level, as well as at the entity level. Our key observation is that there is significant synergy between the two levels. We therefore propose a generative model for comparative text, which jointly models comparative directions at the sentence level and ranking at the entity level. We test this model comprehensively on an Amazon reviews dataset, where it empirically outperforms state-of-the-art baselines.
Email classification is still a mostly manual task. Consequently, most Web mail users never define a single folder. Recently, however, automatic classification offering the same categories to all users has started to appear in some Web mail clients, such as AOL or Gmail. We adopt this approach, rather than previous (unsuccessful) personalized approaches, because of the change in the nature of consumer email traffic, which is now dominated by (non-spam) machine-generated email. We propose here a novel approach for (1) automatically distinguishing between personal and machine-generated email and (2) classifying messages into latent categories, without requiring users to have defined any folder. We report how we discovered that a set of six "latent" categories (one for human- and the others for machine-generated messages) can explain a significant portion of email traffic. We describe in detail the steps involved in building a Web-scale email categorization system, from the collection of ground-truth labels and the selection of features to the training of models. Experimental evaluation was performed on more than 500 billion messages received during a period of six months by users of the Yahoo mail service who elected to be part of such research studies. Our system achieved precision and recall rates close to 90%, and the latent categories we discovered were shown to cover 70% of both email traffic and email search queries. We believe that these results pave the way for a change of approach in the Web mail industry, and could support the invention of new large-scale email discovery paradigms that had not been possible before.
We investigate the latent aspect mining problem, which aims at automatically discovering aspect information from a collection of review texts in a domain in an unsupervised manner. One goal is to discover a set of aspects that are previously unknown for the domain, and to predict the user's rating on each aspect for each review. Another goal is to detect key terms for each aspect. Existing works on predicting aspect ratings fail to handle the aspect sparsity problem in review texts, leading to unreliable predictions. We propose a new generative model to tackle the latent aspect mining problem in an unsupervised manner. By considering the user and item side information of review texts, we introduce two latent variables, namely user intrinsic aspect interest and item intrinsic aspect quality, which facilitate better modeling of aspect generation and improve the accuracy and reliability of the predicted aspect ratings. Furthermore, we provide an analytical investigation of the Maximum A Posteriori (MAP) optimization problem used in our proposed model and develop a new block coordinate gradient descent algorithm to solve the optimization efficiently with closed-form updating formulas. We also analyze its convergence. Experimental results on two real-world product review corpora demonstrate that our proposed model outperforms existing state-of-the-art models.
In this paper, we present our work on humor recognition on Twitter, which will facilitate affect and sentiment analysis in social networks. The central question of what makes a tweet (Twitter post) humorous drives us to design humor-related features, which are derived from influential humor theories, linguistic norms, and affective dimensions. Using machine learning techniques, we are able to recognize humorous tweets with high accuracy and F-measure. More importantly, we single out features that contribute to distinguishing non-humorous tweets from humorous tweets, and humorous tweets from other short humorous texts (non-tweets). This proves that humorous tweets possess discernible characteristics that are found neither in plain tweets nor in humorous non-tweets. We believe our novel findings will inform and inspire the burgeoning field of computational humor research in social media.
Data captured in OWL ontologies is generally considered to be more prone to changes than the schema in many situations. Such changes often necessitate consistency checking over the resulting ontologies in order to maintain coherent knowledge, specifically in dynamic settings. In this paper, we present an approach to check consistency over an evolving ontology, expressed in some expressive underlying Description Logic dialect, that results from data insertions and deletions. The approach, assuming an initially consistent ontology, works by syntactically identifying "relevant" and representative parts of the data for the given updates, i.e., the parts that may contribute to subsequent consistency checking. Our approach has demonstrated its efficacy in checking consistency over large and real-world ontologies and outperforms existing approaches in several circumstances.
Resolving incoherent terminologies is an important task in the maintenance of evolving OWL 2 DL ontologies. Existing approaches to this task are either semi-automatic or based on the simple deletion of axioms. There is a need for fine-grained approaches to automate this task. Since a fine-grained approach should consider multiple choices for modifying an axiom other than deleting it, the primary challenges in developing such an approach lie in both the semantics of the repaired results and the efficiency of computing them. To tackle these challenges, we first introduce the notion of fine-grained repair based on modifying one axiom to zero or more axioms, then propose an efficient incremental method for computing all fine-grained repairs one by one. We also propose a modification function for axioms expressed in OWL 2 DL, which returns weaker axioms. Based on this modification function and the method for computing fine-grained repairs, we develop an automatic approach to resolving incoherent OWL 2 DL terminologies. Our extensive experimental results demonstrate that the proposed approach is efficient and practical.
In this work we propose an unsupervised framework to construct a shallow domain ontology from a corpus. It is essential for Information Retrieval systems, Question-Answering systems, Dialogue systems, etc. to identify important concepts in a domain and the relationships between them. We identify important domain terms, of which multi-words form an important component. We show that the incorporation of multi-words improves parser performance, resulting in better parser output, which improves the performance of an existing Question-Answering system by up to 7%. On a manually annotated smartphone dataset, the proposed system identifies 40.87% of the domain terms, compared to a recall of 22% obtained using WordNet, 43.77% by Yago and 53.74% by BabelNet, respectively. However, it does not use any manually annotated resource, unlike the compared systems. Thereafter, we propose a framework to construct a shallow ontology from the discovered domain terms by identifying four domain relations, namely Synonyms ('similar-to'), Type-Of ('is-a'), Action-On ('methods') and Feature-Of ('attributes'), where we achieve significant performance improvement over WordNet, BabelNet and Yago without using any mode of supervision or manual annotation.
An increasing number of applications rely on RDF, OWL 2, and SPARQL for storing and querying data. SPARQL, however, is not targeted towards end-users, and suitable query interfaces are needed. Faceted search is a prominent approach for end-user data access, and several RDF-based faceted search systems have been developed. There is, however, a lack of rigorous theoretical underpinning for faceted search in the context of RDF and OWL 2. In this paper, we provide such solid foundations. We formalise faceted interfaces for this context, identify a fragment of first-order logic capturing the underlying queries, and study the complexity of answering such queries for RDF and OWL 2 profiles. We then study interface generation and update, and devise efficiently implementable algorithms. Finally, we have implemented and tested our faceted search algorithms for scalability, with encouraging results.
The discovery of unknown functional dependencies in a dataset is of great importance for database redesign, anomaly detection and data cleansing applications. However, as the nature of the problem is exponential in the number of attributes, none of the existing approaches can be applied to large datasets. We present DFD, a new algorithm for discovering all functional dependencies in a dataset that follows a depth-first traversal strategy of the attribute lattice and combines aggressive pruning with efficient result verification. Our approach is able to scale far beyond existing algorithms, handling up to 7.5 million tuples, and is up to three orders of magnitude faster than existing approaches on smaller datasets.
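The lattice traversal in DFD rests on repeatedly verifying individual FD candidates; the snippet below shows only that basic verification step, written from the standard definition of a functional dependency rather than from the paper's pruning logic:

```python
def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in `rows`,
    i.e., equal LHS values always imply equal RHS values."""
    seen = {}
    for row in rows:                          # rows: list of dicts (attribute -> value)
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:  # same LHS, different RHS: violation
            return False
        seen[key] = val
    return True

rows = [{"zip": "10001", "city": "NYC", "state": "NY"},
        {"zip": "10002", "city": "NYC", "state": "NY"},
        {"zip": "90001", "city": "LA",  "state": "CA"}]
print(holds_fd(rows, ("zip",), ("state",)))   # True
print(holds_fd(rows, ("state",), ("zip",)))   # False
```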
Duplicates in a dataset are multiple representations of the same real-world entity and constitute a major data quality problem. This paper investigates the problem of estimating the number and sizes of duplicate record clusters in advance, and describes a sampling-based method for solving this problem. In extensive experiments on multiple datasets, we show that the proposed method reliably estimates the number of duplicate clusters while being highly efficient.
Our method can be used a) to measure the dirtiness of a dataset, b) to assess the quality of duplicate detection configurations, such as similarity measures, and c) to gather approximate statistics about the true number of entities represented in the dataset.
As the relevant data sets get large, existing in-memory schemes for tensor decomposition become increasingly ineffective and, instead, memory-independent solutions, such as in-database analytics, are necessitated. In this paper, we present techniques for efficient implementations of in-database tensor decompositions on chunk-based array data stores. The proposed static and incremental in-database tensor decomposition operators and their optimizations address the constraints imposed by main-memory limitations when handling large and high-order tensor data. Firstly, we discuss how to implement alternating least squares operations efficiently on a chunk-based data storage system. Secondly, we consider scenarios with frequent data updates and show that compressed matrix multiplication techniques can be effective in reducing incremental tensor decomposition maintenance costs. To the best of our knowledge, this paper presents the first attempt to develop efficient and optimized in-database tensor decomposition operations. We evaluate the proposed algorithms on tensor data sets that do not fit into the available memory, and the results show that the proposed techniques significantly improve the scalability of this core data analysis task.
The Earth Mover's Distance, proposed in computer vision as a distance-based similarity model reflecting human perceptual similarity, has been widely utilized in numerous domains for similarity search, applicable to both feature histograms and signatures. While efficiency improvements for the Earth Mover's Distance have frequently been investigated for feature histograms, little work has studied this similarity model on feature signatures, i.e., object-specific feature representations. Given a very large multimedia database of feature signatures, how can k-nearest-neighbor queries be processed efficiently using the Earth Mover's Distance? In this paper, we propose an efficient filter approximation technique to lower bound the Earth Mover's Distance on feature signatures by locally restricting the number of earth flows. Extensive experiments on real-world data indicate the high efficiency of the proposal, attaining an order-of-magnitude reduction in query processing time for high-dimensional feature signatures.
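A much simpler, classical filter in the same spirit is the centroid-distance lower bound of Rubner et al., which holds for signatures of equal total weight under a norm-induced ground distance; the sketch below shows that baseline only, not the paper's local flow-restriction filter:

```python
import numpy as np

def centroid_lower_bound(pos_a, w_a, pos_b, w_b):
    """Centroid-distance lower bound on the EMD between two normalized
    signatures (feature positions with weights) under Euclidean ground distance."""
    c_a = np.average(np.asarray(pos_a, float), axis=0, weights=np.asarray(w_a, float))
    c_b = np.average(np.asarray(pos_b, float), axis=0, weights=np.asarray(w_b, float))
    return float(np.linalg.norm(c_a - c_b))
```

In a k-nearest-neighbor search, any candidate whose lower bound already exceeds the current k-th best distance can be pruned without ever computing the exact, far more expensive EMD.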
We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.
The adoption of hashtags by major social networks including Twitter, Facebook, and Google+ is strong evidence of their importance in facilitating information diffusion and social chatting. To understand the factors (e.g., user interest, posting time and tweet content) that may affect hashtag annotation in Twitter and to capture the implicit relations between latent topics in tweets and their corresponding hashtags, we propose two PLSA-style topic models to model hashtag annotation behavior in Twitter. The Content-Pivoted Model (CPM) assumes that tweet content guides the generation of hashtags, while the Hashtag-Pivoted Model (HPM) assumes that hashtags guide the generation of tweet content. Both models jointly incorporate user, time, hashtag and tweet content in a probabilistic framework. The PLSA-style models also enable us to verify the impact of the social factor on hashtag annotation by introducing social network regularization into the two models. We evaluate the proposed models using perplexity and demonstrate their effectiveness in two applications: retrospective hashtag annotation and related hashtag discovery. Our results show that HPM outperforms CPM in terms of perplexity and that both user and time are important factors that affect model performance. In addition, incorporating social network regularization does not improve model performance. Our experimental results also demonstrate the effectiveness of our models in both applications compared with baseline methods.
Popular online social networks (OSNs) generate hundreds of terabytes of new data per day and connect millions of users. To help users cope with the immense scale and influx of new information, OSNs provide a search functionality. However, most of the search engines in OSNs today only support keyword queries and provide basic faceted search capabilities, overlooking serendipitous network exploration and search for relationships between OSN entities. This results in siloed information and a limited search space. In 2013 Facebook introduced its innovative Graph Search product with the goal of taking the OSN search experience to the next level and facilitating exploration of the Facebook graph beyond the first degree. In this paper we explore people search on Facebook by analyzing an anonymized social graph, anonymized user profiles, and large-scale anonymized query logs generated by users of Facebook Graph Search. We uncover numerous insights about people search across several demographics. We find that named entity and structured queries complement each other over one's time on Facebook, that females search for people proportionately more than males, and that users submit more queries as they gain more friends. We introduce the concept of a lift predicate and highlight how graph distance varies with the search goal. Based on these insights, we present a set of design implications to guide the research and development of OSN search in the future.
With the development of information technology, online social networks have grown dramatically. They now play a significant role in people's social lives, especially for the younger generation. While a huge amount of information is available in online social networks, privacy concerns arise. Among various privacy protection proposals, the notions of privacy as control and information boundary have been introduced. Commercial social networking sites have adopted these concepts to implement mechanisms such as Google circles and Facebook custom lists. However, these functions are not widely accepted by users, partly because it is tedious and labor-intensive to manually assign friends to circles.
In this paper, we introduce a social circle discovery approach using multi-view clustering. First, we present our observations on the key features of social circles: friendship links, content similarity and social interactions. We propose a one-side co-trained spectral clustering algorithm, which is tailored to the sparse nature of social network data. We also propose two evaluation measurements. One is based on quantitative similarity measures, while the other employs human evaluators to examine pairs of users selected by a max-risk evaluation approach. We evaluate our approach on ego networks of Twitter users, and compare the proposed technique with single-view clustering and the original co-trained spectral clustering technique. Results show that multi-view clustering is more accurate for social circle detection, and our proposed approach achieves a significantly higher similarity ratio than the original multi-view clustering approach.
Kernel-based learning has been widely applied to semantic textual inference tasks. In particular, Tree Kernels (TKs) are crucial for modeling syntactic similarity between linguistic instances in Question Answering or Information Extraction tasks. At the same time, lexical semantic information has been studied through the adoption of the so-called Distributional Semantics (DS) paradigm, where lexical vectors are acquired automatically from large corpora. Methods to account for compositional linguistic structures (e.g. grammatically typed bi-grams or complex verb or noun phrases) have recently been proposed by defining algebras on lexical vectors, resulting in an extended paradigm called Distributional Compositional Semantics (DCS). Although lexical extensions have already been proposed to generalize TKs towards semantic phenomena (e.g. predicate argument structures, as in role labeling), currently studied TKs do not, in general, account for compositionality. In this paper, a novel kernel called the Compositionally Smoothed Partial Tree Kernel is proposed to integrate DCS operators into the tree kernel evaluation, acting both over lexical leaves and over non-terminal, i.e. complex compositional, nodes. The empirical results obtained on Question Classification and Paraphrase Identification tasks show that state-of-the-art performance can be achieved without resorting to manual feature engineering, suggesting that a large set of Web and text mining tasks can be handled successfully by the kernel proposed here.
The Web is rapidly transforming from a pure document collection to the largest connected public data space. Semantic annotations of web pages make it notably easier to extract and reuse data, and are increasingly used by both search engines and social media sites to provide better search experiences through rich snippets, faceted search, task completion, etc. In our work, we study the novel problem of crawling structured data embedded inside HTML pages. We describe Anthelion, the first focused crawler addressing this task. We propose new methods of focused crawling specifically designed for collecting data-rich pages with greater efficiency. In particular, we propose a novel combination of online learning and bandit-based explore/exploit approaches to predict data-rich web pages based on the context of the page as well as using feedback from the extraction of metadata from previously seen pages. We show that these techniques significantly outperform state-of-the-art approaches for focused crawling, measured as the ratio of relevant to non-relevant pages collected within a given budget.
This paper addresses the problem of post-processing of ranking in search, referred to as post-ranking. Although important, no research seems to have been conducted on this problem, particularly in a principled way, and in practice ad-hoc ways of performing the task are adopted. This paper formalizes the problem as constrained optimization, in which the constraints represent the post-processing rules and the objective function represents the trade-off between adherence to the original ranking and satisfaction of the rules. The optimization amounts to refining the original ranking result based on the rules. We further propose a specific probabilistic implementation of the general formalization on the basis of the Bradley-Terry model, which is theoretically sound, effective, and efficient. Our experimental results, using benchmark datasets and an enterprise search dataset, show that the proposed method performs much better than several baseline methods of utilizing rules.
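For reference only (the notation here is assumed, not taken from the paper), the standard Bradley-Terry model assigns to two ranked items with latent scores $s_i$ and $s_j$ the preference probability $P(i \succ j) = \exp(s_i) / (\exp(s_i) + \exp(s_j))$; the trade-off described above between adherence to the original ranking and satisfaction of the rules would then be expressed over such pairwise preference probabilities.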
Graph-based ranking plays a key role in many applications, such as web search and social computing. Pioneering methods of ranking on graphs (e.g., PageRank and HITS) computed ranking scores relying only on the graph structure. Recently proposed methods, such as Semi-Supervised PageRank, take into account both the graph structure and the metadata associated with nodes and edges in a unified optimization framework. Such approaches are based on initializing the underlying random walk models with prior weights of nodes and edges that in turn depend on their individual properties. While in those models the prior weights of nodes and edges depend only on their own features, one can also assume that such weights may depend on or be related to the prior weights of their neighbors. This paper addresses the problem of weighting nodes and edges according to this intuition by realizing it in a general ranking model and an efficient algorithm for tuning the parameters of that model.
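For orientation, the sketch below shows how node and edge prior weights enter a plain random-walk ranking (prior-weighted teleportation and transition weights); it is only the underlying walk, not the paper's model or its parameter-tuning algorithm, and all names are illustrative:

```python
import numpy as np

def prior_weighted_pagerank(adj, node_prior, edge_prior, alpha=0.85, iters=100):
    """Power iteration for a random walk whose transition probabilities are
    biased by edge priors and whose teleportation is biased by node priors."""
    adj = np.asarray(adj, float)
    node_prior = np.asarray(node_prior, float)
    n = adj.shape[0]
    W = adj * np.asarray(edge_prior, float)       # weight each existing edge
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sums,
                  out=np.full((n, n), 1.0 / n),   # dangling rows jump uniformly
                  where=row_sums > 0)
    v = node_prior / node_prior.sum()             # prior-weighted teleportation
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = alpha * (P.T @ r) + (1 - alpha) * v
    return r
```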
Most existing approaches for text classification represent texts as vectors of words, namely ``Bag-of-Words.'' This text representation results in a very high dimensionality of feature space and frequently suffers from surface mismatching. Short texts make these issues even more serious, due to their shortness and sparsity. In this paper, we propose using ``Bag-of-Concepts'' in short text representation, aiming to avoid the surface mismatching and handle the synonym and polysemy problem. Based on ``Bag-of-Concepts,'' a novel framework is proposed for lightweight short text classification applications. By leveraging a large taxonomy knowledgebase, it learns a concept model for each category, and conceptualizes a short text to a set of relevant concepts. A concept-based similarity mechanism is presented to classify the given short text to the most similar category. One advantage of this mechanism is that it facilitates short text ranking after classification, which is needed in many applications, such as query or ad recommendation. We demonstrate the usage of our proposed framework through a real online application: Channel-based Query Recommendation. Experiments show that our framework can map queries to channels with a high degree of precision (avg. precision=90.3%), which is critical for recommendation applications.
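A minimal sketch of the conceptualize-then-compare idea follows; the tiny concept table is a toy stand-in for a large taxonomy knowledgebase, and the similarity is plain cosine over concept weights rather than the paper's exact mechanism:

```python
from collections import Counter
from math import sqrt

# Toy stand-in for a taxonomy-backed conceptualizer: term -> {concept: weight}.
CONCEPTS = {
    "python": {"programming language": 0.7, "snake": 0.3},
    "java":   {"programming language": 0.8, "island": 0.2},
    "cobra":  {"snake": 0.9, "car": 0.1},
}

def bag_of_concepts(text):
    """Map a short text to a weighted bag of concepts."""
    vec = Counter()
    for term in text.lower().split():
        for concept, weight in CONCEPTS.get(term, {}).items():
            vec[concept] += weight
    return vec

def cosine(u, v):
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# "python java" and "cobra" share the "snake" concept despite no common words.
print(cosine(bag_of_concepts("python java"), bag_of_concepts("cobra")))
```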
Classification of short text messages has become increasingly relevant in recent years, as billions of users use online social networks to communicate with other people. Understanding message content can have a huge impact on many data analysis processes, ranging from the study of online social behavior to targeted advertisement, and to security and privacy purposes. In this paper, we propose a new unsupervised knowledge-based classifier for short text messages, where each category is represented by an ego-network. A short text is classified into a category depending on how far its words are from the ego of that category. We show how this technique can be used both in single-label and in multi-label classification, and how it outperforms the state of the art for short text message classification.
Sentiment analysis in various languages has been a research hotspot with many applications. However, sentiment resources (e.g., labeled corpora, sentiment lexicons) of different languages are unbalanced in terms of quality and quantity, which has aroused interest in cross-lingual sentiment analysis, aiming at using the resources of a source language to improve sentiment analysis in a target language. Nevertheless, many existing cross-lingual works rely on a machine translation system to directly adapt labeled data from the source language to the target language, which usually suffers from inaccurate results generated by the machine translation system. On the other hand, most sentiment analysis studies focus on document-level sentiment classification, which cannot solve the aspect dependency problem of sentiment words. For instance, in reviews of a cell phone, long is positive for the lifespan of its battery, but negative for the response time of its operating system. To solve these problems, this paper develops a novel Cross-Lingual Joint Aspect/Sentiment (CLJAS) model to carry out aspect-specific sentiment analysis in a target language using the knowledge learned from a source language. Specifically, the CLJAS model jointly detects aspects and sentiments of the two languages simultaneously by incorporating sentiments into a cross-lingual topic model framework. Extensive experiments on different domains and different languages demonstrate that the proposed model can significantly improve the accuracy of sentiment classification in the target language.
A recent study on collective attention in Twitter shows that the epidemic spreading of hashtags is predominantly driven by external factors. We extend a time-series form of the susceptible-infectious-recovered (SIR) model to monitor emerging microblog outbreaks by considering both endogenous and exogenous drivers. In addition, we adopt the partially labeled Dirichlet allocation (PLDA) model to generate both background latent topics and hashtag topics. It overcomes the problem of small available samples in hashtag analysis by including related but unlabeled tweets through inference. We standardize the hashtag topic contagiousness measure as the estimated effective reproduction number R derived from epidemiology, obtained by Bayesian parameter estimation. Guided by R, one can profile and categorize emerging topics, and generate alerts on potential outbreaks. Experimental results confirm the effectiveness of this approach.
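For concreteness, a discrete-time SIR step with an exogenous infection term, together with the textbook effective reproduction number, look as follows; this is only the generic form, while the paper's estimates are topic-specific and obtained by Bayesian inference:

```python
def sir_step(S, I, R, beta, gamma, external=0.0):
    """One discrete-time SIR update with an optional exogenous infection term."""
    N = S + I + R
    new_infections = beta * S * I / N + external   # endogenous + exogenous drivers
    new_recoveries = gamma * I
    return S - new_infections, I + new_infections - new_recoveries, R + new_recoveries

def effective_reproduction_number(beta, gamma, S, N):
    """R_t = (beta / gamma) * (S / N); an outbreak grows while R_t > 1."""
    return (beta / gamma) * (S / N)
```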
How can we discover interesting patterns from time-evolving high-speed data streams? How can we analyze the data streams quickly and accurately, with little space overhead? High-speed data streams have been receiving increasing attention due to their wide applications such as sensors, network traffic, social networks, etc. One of the most fundamental tasks over a data stream is to find frequent items; in particular, finding recently frequent items has become important in real-world applications.
In this paper, we propose TwMinSwap, a fast, accurate, and space-efficient method for tracking recent frequent items. TwMinSwap is a deterministic version of our motivating algorithm TwSample, a sampling-based randomized algorithm with nice theoretical guarantees. TwMinSwap improves on TwSample in terms of speed, accuracy, and memory usage. Both require only O(k) memory space, and do not require any prior knowledge about the stream such as its length or the number of distinct items in the stream. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in terms of accuracy and memory usage, with fast running time. Thanks to TwMinSwap, we report interesting discoveries in real-world data streams, including the differing trends between the winning and losing U.S. presidential candidates, and doubly-active patterns of movies.
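TwMinSwap's own counter bookkeeping is not spelled out in the abstract, so the sketch below shows the classical Space-Saving algorithm instead: the same O(k)-memory style of frequent-item tracking, without TwMinSwap's time-weighting of recency:

```python
def space_saving(stream, k):
    """Classical Space-Saving: track at most k candidate frequent items in O(k) memory."""
    counters = {}                                      # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)   # evict the smallest counter
            counters[item] = counters.pop(victim) + 1  # new item inherits its count
    return counters

print(space_saving("abacabadabacabae", k=3))
```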
Non-negative matrix factorization (NMF) is a well-known method for obtaining low-rank approximations of data sets, which can then be used for efficient indexing, classification, and retrieval. The non-negativity constraints enable probabilistic interpretation of the results and the discovery of generative models. One key disadvantage of NMF, however, is that it is costly to obtain, and this makes it difficult to apply NMF in applications where data is dynamic. In this paper, we recognize that many applications involve redundancies and we argue that these redundancies can and should be leveraged for reducing the computational cost of the NMF process: firstly, online applications involving data streams often include temporal redundancies; secondly, and perhaps less obviously, many applications include the integration of multiple data streams (with potential overlaps) and/or involve tracking multiple similar (but different) queries, which leads to significant data and query redundancies that, if leveraged properly, can help alleviate the computational cost of NMF. Based on these observations, we introduce Group Incremental Non-Negative Matrix Factorization (GI-NMF), which leverages redundancies across multiple NMF tasks over data streams. The proposed algorithm relies on a novel group multiplicative update rules (G-MUR) method to significantly reduce the cost of NMF. G-MUR is further complemented to support incremental updates of the factors where data evolves continuously. Experiments show that GI-NMF significantly reduces processing time, with minimal error overhead.
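For reference, the standard single-task multiplicative update rules of Lee and Seung, which group and incremental variants such as G-MUR build upon, are sketched below; this is only the baseline NMF, not GI-NMF:

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates: V is approximated by W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```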
Mining high-speed data streams has become an important topic due to the rapid growth of online data. In this paper, we study the problem of active learning for streaming networked data. The goal is to train an accurate model for classifying networked data that arrives in a streaming manner by querying as few labels as possible. The problem is extremely challenging, as both the data distribution and the network structure may change over time. The query decision has to be made for each data instance sequentially, by considering the dynamic network structure.
We propose a novel streaming active query strategy based on structural variability. We prove that by querying labels we can monotonically decrease the structural variability and better adapt to concept drift. To speed up the learning process, we present a network sampling algorithm to sample instances from the data stream, which provides a way for us to handle large volumes of streaming data. We evaluate the proposed approach on four datasets of different genres: Weibo, Slashdot, IMDB, and ArnetMiner. Experimental results show that our model performs much better (+5-10% by F1-score on average) than several alternative methods for active learning over streaming networked data.
The location profiles of social media users are valuable for various applications, such as marketing and real-world analysis. As most users do not disclose their home locations, the problem of inferring home locations has been well studied in recent years. In fact, most existing methods perform batch inference using static (i.e., pre-stored) social media contents. However, social media contents are generated and delivered in real time as social streams. In this situation, it is important to continuously update current inference results based on newly arriving contents to improve the results over time. Moreover, it is effective for location inference to use the spatiotemporal correlation between contents and locations. The main idea of this paper is that we can infer the locations of users who simultaneously post about a local event (e.g., earthquakes). Hence, in this paper, we propose an online location inference method over social streams that exploits the spatiotemporal correlation, achieving 1) continuous updates with low computational and storage costs, and 2) better inference accuracy than that of existing methods. The experimental results using a Twitter dataset show that our method reduces the inference error to less than 68% of that of existing methods. The results also show that the proposed method can update inference results in constant time regardless of the amount of accumulated contents.
Recovering matrices from incomplete and corrupted observations is a fundamental problem with many applications in various areas of science and engineering. In theory, under certain conditions, this problem can be solved via a natural convex relaxation. However, all current provable algorithms suffer from superlinear per-iteration cost, which severely limits their applicability to large scale problems. In this paper, we propose a robust principal component analysis (RPCA) plus matrix completion framework to recover low-rank and sparse matrices from missing and grossly corrupted observations. Under the unified framework, we first present a convex robust matrix completion model to replace the linear projection operator constraint by a simple equality one. To further improve the efficiency of our convex model, we also develop a scalable structured factorization model, which can yield an orthogonal dictionary and a robust data representation simultaneously. Then, we develop two alternating direction augmented Lagrangian (ADAL) algorithms to efficiently solve the proposed problems. Finally, we discuss the convergence analysis of our algorithms. Experimental results verified both the efficiency and effectiveness of our methods compared with the state-of-the-art algorithms.
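A widely used baseline for the convex RPCA relaxation is the inexact augmented Lagrange multiplier (ADMM-style) iteration sketched below; it decomposes a fully observed matrix into low-rank plus sparse parts and omits the paper's handling of missing entries and its factored dictionary model:

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, lam=None, mu=None, iters=200, tol=1e-7):
    """Split M into low-rank L and sparse S by minimizing ||L||_* + lam * ||S||_1."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    Y = np.zeros_like(M, dtype=float)       # Lagrange multiplier
    S = np.zeros_like(M, dtype=float)
    for _ in range(iters):
        L = svt(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)  # sparse update
        residual = M - L - S
        Y += mu * residual                  # dual ascent step
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```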
Model selection in kernel methods is the problem of choosing an appropriate hypothesis space for kernel-based learning algorithms, so as to avoid either underfitting or overfitting of the resulting hypothesis. One of the main problems faced by model selection is how to control the sample complexity when designing the model selection criterion. In this paper, we take balls of reproducing kernel Hilbert spaces (RKHSs) as candidate hypothesis spaces and propose a novel model selection criterion via minimizing the empirical optimal error in the ball of the RKHS and the covering number of the ball. By introducing the covering number to measure the capacity of the ball of the RKHS, our criterion can directly control the sample complexity. Specifically, we first prove the relation between the expected optimal error and the empirical optimal error in the ball of the RKHS. Using this relation as the theoretical foundation, we give the definition of our criterion. Then, by estimating the expectation of the optimal empirical error and proving an upper bound on the covering number, we represent our criterion as a functional of the kernel matrix. An efficient algorithm is further developed for approximately calculating the functional, so that the fast Fourier transform (FFT) can be applied to achieve a quasi-linear computational complexity. We also prove the consistency between the approximate criterion and the accurate one for sufficiently large samples. Finally, we empirically evaluate the performance of our criterion and verify the consistency between the approximate and accurate criteria.
In many real-world settings the data to analyze is heterogeneous, consisting of (say) images, text and video. An elegant approach when dealing with such data is to project all the data to a common space so that standard learning methods can be used. However, typical projection methods make strong assumptions such as the multi-view assumption (each datum in one data set is always associated with a single datum in the other view) or that the multiple data sets have an overlapping feature space. Such strong assumptions limit the data such work can be applied to. We present a framework for projecting heterogeneous data from multiple data sets into a common lower-dimensional space using a rich range of guidance which does not assume any overlap between the instances or features in different data sets. Our work can specify inter-dataset guidance (between instances in different data sets) and intra-dataset guidance (between instances in the same data set), both of which can be positively or negatively weighted. We show that our work offers substantially more flexibility than related methods such as Canonical Correlation Analysis (CCA) and Locality Preserving Projections (LPP) and illustrate its superior performance on supervised and unsupervised learning problems.
A key characteristic of a successful online market is the large participation of agents (producers and consumers) on both sides of the market. While there has been a long line of impressive work on understanding such markets in terms of revenue-maximizing (also called max-sum) objectives, particularly in the context of allocating online impressions to interested advertisers, fairness considerations have surprisingly not received much attention in online allocation algorithms. Allocations that are inherently fair to participating entities, we believe, will contribute significantly to retaining current participants and attracting new ones in the long run, thereby enhancing the performance of online markets. We give two generic online allocation algorithms to address this problem. The first algorithm addresses the max-min fairness objective, defined as the minimum, over all advertisers, of the ratio of the actual revenue obtained by the allocation to a given target revenue. The second algorithm considers a hybrid objective of max-sum with a revenue penalty for each advertiser who misses her revenue target; we consider a penalty that is linear in the difference between the target and the actual revenue. For both objectives, we give online algorithms that achieve a competitive ratio of $(1-\epsilon)$ for any $\epsilon > 0$, assuming an IID input.
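Written out (with symbols chosen here for illustration, not taken from the paper), the max-min fairness objective described above is $\max \min_{a \in A} R_a / T_a$, where $R_a$ is the revenue the allocation obtains for advertiser $a$ and $T_a$ is her target revenue; one way to write the hybrid objective with a linear penalty is $\max \sum_{a} R_a - \lambda \sum_{a} \max(0, T_a - R_a)$ for some penalty weight $\lambda$.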
An important problem in matrix completion/approximation based on Matrix Factorization (MF) algorithms is the existence of multiple global optima; this problem is especially serious when the matrix is sparse, which is common in real-world applications such as personalized recommender systems. In this work, we tackle data sparsity by bounding the solution space of MF algorithms. We present the conditions that an MF algorithm should satisfy for reliable completion of the unobserved values, and we further propose to augment current MF algorithms with extra constraints constructed by compressive sampling on the unobserved values, which is well motivated by the theoretical analysis. Model learning and optimal solution searching are conducted in a properly reduced solution space to achieve more accurate and efficient rating prediction. We implemented the proposed algorithms in the Map-Reduce framework, and comprehensive experimental results on the Yelp and Dianping datasets verified the effectiveness and efficiency of the augmented matrix factorization algorithms.
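As context, a plain SGD matrix-factorization baseline over observed (user, item, rating) triples looks like the sketch below; the paper's contribution, the extra compressive-sampling constraints on unobserved entries, is not implemented here:

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, rank=16, lr=0.01, reg=0.05, epochs=30, seed=0):
    """Plain matrix factorization by SGD on observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, rank))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, rank))   # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```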
A key challenge in information and knowledge management is to automatically discover the underlying structures and patterns from large collections of extracted information. This paper presents a novel structure-learning method for a new, scalable probabilistic logic called ProPPR. Our approach builds on the recent success of meta-interpretive learning methods in Inductive Logic Programming (ILP), and we further extend it to a framework that enables robust and efficient structure learning of logic programs on graphs: using an abductive second-order probabilistic logic, we show how first-order theories can be automatically generated via parameter learning. To learn better theories, we then propose an iterated structural gradient approach that incrementally refines the hypothesized space of learned first-order structures. In experiments, we show that the proposed method further improves the results, outperforming competitive baselines such as Markov Logic Networks (MLNs) and FOIL on multiple datasets with various settings, and that the proposed approach can learn structures in a large knowledge base in a tractable fashion.
Belief propagation (BP) is a popular method for performing approximate inference on probabilistic graphical models. However, its message updates are time-consuming, and the schedule for updating messages is crucial to its running time and even convergence. In this paper, we propose a new scheduling scheme that selects a set of messages to update at a time and leverages a novel priority to determine which messages are selected. Additionally, an incremental update approach is introduced to accelerate the computation of the priority. As the size of the model grows, it is desirable to leverage the parallelism of a cluster of machines to reduce the inference time. Therefore, we design a distributed framework, Prom, to facilitate the implementation of BP algorithms. We evaluate the proposed scheduling scheme (supported by Prom) via extensive experiments on a local cluster as well as the Amazon EC2 cloud. The evaluation results show that our scheduling scheme outperforms the state-of-the-art counterpart.
Representing words as vectors in a continuous space can form a potentially powerful basis for generating high-quality textual features for many text mining and natural language processing tasks. Some recent efforts, such as the skip-gram model, have attempted to learn word representations that capture both syntactic and semantic information from text corpora. However, they still lack the capability to encode the properties of words and the complex relationships among words very well, since text itself often contains incomplete and ambiguous information. Fortunately, knowledge graphs provide a gold mine for enhancing the quality of learned word representations. In particular, a knowledge graph, usually composed of entities (words, phrases, etc.), relations between entities, and some corresponding meta information, can supply invaluable relational knowledge that encodes the relationships between entities, as well as categorical knowledge that encodes the attributes or properties of entities. Hence, in this paper, we introduce a novel framework called RC-NET to leverage both the relational and categorical knowledge to produce word representations of higher quality. Specifically, we build the relational knowledge and the categorical knowledge into two separate regularization functions and combine both of them with the original objective function of the skip-gram model. By solving this combined optimization problem using back-propagation neural networks, we obtain word representations enhanced by the knowledge graph. Experiments on popular text mining and natural language processing tasks, including analogical reasoning, word similarity, and topic prediction, all demonstrate that our model significantly improves the quality of word representations.
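In symbols (with $\alpha$ and $\beta$ as assumed weighting hyperparameters, not notation taken from the paper), the combined objective described above has the shape $J = J_{\text{skip-gram}} + \alpha\, R_{\text{relational}} + \beta\, R_{\text{categorical}}$, where the two regularizers score how well the learned vectors respect the knowledge graph's relations and categories, respectively.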
Uniqueness and independence are two fundamental properties of data. Their enforcement in knowledge systems can lead to higher-quality data, faster data service response time, better data-driven decision making and knowledge discovery from data. These applications can be effectively unlocked by providing efficient solutions to the underlying implication problems of keys and independence atoms. Indeed, for the sole class of keys and the sole class of independence atoms, the associated finite and general implication problems coincide and enjoy simple axiomatizations. However, the situation changes drastically when keys and independence atoms are combined. We show that the finite and the general implication problems are already different for keys and unary independence atoms. Furthermore, we establish a finite axiomatization for the general implication problem, and show that the finite implication problem does not enjoy a k-ary axiomatization for any k.
Anti-virus systems developed by different vendors often demonstrate strong discrepancies in how they name malware, which significantly hinders malware information sharing. While existing work has proposed a plethora of malware naming standards, most anti-virus vendors have been reluctant to change their own naming conventions. In this paper we explore a new, more pragmatic alternative. We propose to exploit the correlation between the malware naming of different anti-virus systems to create their consensus classification, through which these systems can share malware information without modifying their naming conventions. Specifically, we present Latin, a novel classification integration framework leveraging the correspondences between participating anti-virus systems as reflected in heterogeneous information sources at the instance-instance, instance-name, and name-name levels. We provide results from extensive experimental studies using real malware datasets and concrete use cases to verify the efficacy of Latin in supporting cross-system malware information sharing.
Databases contain information about which relationships do and do not hold among entities. To make this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra that facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning.
Matrix factorization (MF) has become the most popular technique for recommender systems due to its promising performance. Recently, distributed (parallel) MF models have received much attention from researchers in the big data community. In this paper, we propose a novel model, called distributed stochastic alternating direction method of multipliers (DS-ADMM), for large-scale MF problems. DS-ADMM is a distributed stochastic variant of ADMM. In particular, we first devise a new data split strategy to make the distributed MF problem fit the ADMM framework. Then, a stochastic ADMM scheme is designed for learning. Finally, we implement DS-ADMM based on the message passing interface (MPI), which can run on clusters with multiple machines (nodes). Experiments on several data sets from recommendation applications show that our DS-ADMM model can outperform other state-of-the-art distributed MF models in terms of both efficiency and accuracy.
How can we scale up logistic regression, or L1-regularized loss minimization in general, for terabyte-scale data that does not fit in memory? How can we design the distributed algorithm efficiently? Although there exist two major algorithms for logistic regression, namely Stochastic Gradient Descent (SGD) and Stochastic Coordinate Descent (SCD), they face limitations in distributed environments. Distributed SGD enables data parallelism (i.e., different machines access different parts of the input data), but it does not allow feature parallelism (i.e., different machines compute different subsets of the output), and thus the communication cost is high. On the other hand, distributed SCD allows feature parallelism, but it does not allow data parallelism and thus is not suitable for distributed environments.
In this paper we propose DF-DSCD (Data/Feature Distributed Stochastic Coordinate Descent), an efficient distributed algorithm for logistic regression, or L1-regularized loss minimization in general. DF-DSCD allows both data and feature parallelism. The benefits of DF-DSCD are (a) full utilization of the capabilities provided by modern distributed computing platforms like MapReduce to analyze web-scale data, and (b) independence of each machine in updating parameters with little communication cost. We prove the convergence of DF-DSCD theoretically, and also show empirical evidence that it is scalable, handles very high-dimensional data with up to 29 million features, and converges 2.2 times faster than competitors.
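The per-coordinate update that such coordinate-descent methods parallelize has a simple closed form for the lasso (squared loss plus L1) objective; the sketch below shows only that single-machine step, not DF-DSCD's distributed logistic-loss variant:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_scd(X, y, lam, epochs=50, seed=0):
    """Stochastic coordinate descent for 0.5 * ||y - X w||^2 + lam * ||w||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w                           # maintained incrementally
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(epochs):
        for j in rng.permutation(d):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ residual + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            residual += X[:, j] * (w[j] - w_new)   # keep residual consistent
            w[j] = w_new
    return w
```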
Most cross-domain sentiment classification techniques treat a domain as a whole set of opinionated instances for training. However, many online shopping websites organize their data in a taxonomy. With multiple domains (or nodes) organized in a tree-structured representation, we propose a general ensemble algorithm which takes into account: 1) the model application, 2) the model weight and 3) the strategies for selecting the models most related to a target node. The traditional sentiment classification technique SVM and the transfer learning algorithm Spectral Feature Alignment (SFA) were applied as our model applications. In addition, the model weight takes the tree information and the similarity between domains into account. Finally, two strategies, a cosine function and a taxonomy-based regression model (TBRM), are proposed to select the models most related to a target node. Experimental results showed that both proposed strategies (cosine function and TBRM) outperform two baselines on an Amazon dataset. Three tasks of the proposed methods surpass the gold standard generated by in-domain classifiers trained on labeled data from the target nodes. These results suggest that the proposed algorithm can help address a major difficulty in transfer learning research: the distribution gap.
Artifact-centric business process models have gained increasing momentum recently due to their ability to combine structural (i.e., data related) with dynamical (i.e., process related) aspects. In particular, two main lines of research have been pursued so far: one tailored to business artifact modeling languages and methodologies, the other focused on the foundations for their formal verification. In this paper, we merge these two lines of research, by showing how recent theoretical decidability results for verification can be fruitfully transferred to a concrete UML-based modeling methodology. In particular, we identify additional steps in the methodology that, in significant cases, guarantee the possibility of verifying the resulting models against rich first-order temporal properties. Notably, our results can be seamlessly transferred to different languages for the specification of the artifact lifecycles.
One of the biggest challenges of commercial search engines is how to handle tail queries, or queries that occur very infrequently. Frequent queries, also known as head queries, are easier to handle, largely because their intents are evidenced by abundant click-through data (query logs). Tail queries have little historical data to rely on, which makes them difficult for ranking algorithms to learn. In this paper, we leverage knowledge from two resources to fill the gap. The first is a general knowledgebase containing different granularities of concepts automatically harnessed from the Web. The second is the click-through data for head queries. From the click-through data, we obtain an understanding of the queries that trigger clicks. Then, we show that by extracting single- or multi-word expressions from both head and tail queries and mapping them to a common concept space defined by the knowledgebase, we are able to transfer the click information of head queries to tail queries. To validate our approach, we conduct large-scale experiments on two real data sets. One is a mixture of head and tail queries, and the other contains pure tail queries. We show that our approach effectively improves tail query search relevance.
While it has long been believed in psychology that weather somehow influences human mood, debates have gone on for decades about how the two are correlated. In this paper, we study this long-lasting topic by harnessing a source of data new to traditional psychological research: Twitter. We analyze two years of Twitter data collected via the Twitter API, which amounts to 10% of all postings, and try to reveal the correlations between the multi-dimensional structure of human mood and meteorological effects. Some of our findings confirm existing hypotheses, while others contradict them. We are hopeful that our approach, along with the new data source, can shed light on the long-running debate on the weather-mood correlation.
Aspect-based opinion mining is widely applied to review data to aggregate or summarize opinions of a product, and the current state of the art is achieved with Latent Dirichlet Allocation (LDA)-based models. Although social media data like tweets are laden with opinions, their "dirty" nature (as natural language) has discouraged researchers from applying LDA-based opinion models to product review mining. Tweets are often informal, unstructured and lack labeled data such as categories and ratings, making them challenging for product opinion mining. In this paper, we propose an LDA-based opinion model named the Twitter Opinion Topic Model (TOTM) for opinion mining and sentiment analysis. TOTM leverages hashtags, mentions, emoticons and strong sentiment words that are present in tweets in its discovery process. It improves opinion prediction by modeling the target-opinion interaction directly, thus discovering target-specific opinion words that are neglected in existing approaches. Moreover, we propose a new formulation for incorporating sentiment prior information into a topic model by utilizing an existing public sentiment lexicon; this is novel in that it learns and updates with the data. We conduct experiments on 9 million tweets on electronic products, and demonstrate the improved performance of TOTM in both quantitative evaluations and qualitative analysis. We show that aspect-based opinion analysis on a massive volume of tweets provides useful opinions on products.
Modeling physical activity propagation, such as activity level and intensity, is key to preventing cascades of obesity and to helping spread wellness and healthy behavior in a social network. However, there has been a lack of scientific and quantitative study to elucidate how social communication may deliver physical activity interventions. In this work we introduce a Community-level Physical Activity Propagation (CPP) model to analyze physical activity propagation and social influence at different granularities (i.e., the individual level and the community level). CPP is a novel model inspired by the well-known Independent Cascade and Community-level Social Influence models. Given a social network, we utilize a hierarchical approach to detect a set of communities and the reciprocal influence strength of their physical activities. CPP provides a powerful tool to discover, summarize, and investigate influence patterns of physical activities in a health social network. The detailed experimental evaluation shows not only the effectiveness of our approach but also the correlation of the detected communities with various health outcome measures (i.e., both existing ones and our novel measure, named the Wellness score, which is a combination of lifestyle parameters, biometrics, and biomarkers). Our promising results potentially pave the way for knowledge discovery in health social networks.
Recent years have witnessed the rapid prevalence of online serials, which play an important role in our daily entertainment. A critical demand along this line is to predict the popularity of online serials, which can enable a wide range of applications, such as online advertising and serial recommendation. However, compared with traditional online media such as user-generated content (UGC), online serials have the unique characteristics of sequence dependence, release-date dependence and unsynchronized update regularity. Therefore, popularity prediction for online serials is a nontrivial task and still under-addressed. To this end, in this paper we present a comprehensive study of predicting the popularity of online serials with autoregressive models. Specifically, we first introduce a straightforward yet effective Naive Autoregressive (NAR) model based on the correlations of serial episodes. Furthermore, we develop a more sophisticated model, namely the Transfer Autoregressive (TAR) model, to capture the dynamic behaviors of audiences, which achieves better prediction performance than the NAR model. Indeed, the two models reveal popularity generation from different perspectives. In addition, as a derivative of the TAR model, we also design a novel metric, namely favor, for evaluating the quality of online serials. Finally, extensive experiments on two real-world data sets clearly show that both models are effective and outperform baselines in terms of popularity prediction for online serials, and that the new metric performs better than other metrics for quality estimation.
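As a minimal illustration of the autoregressive flavour (an ordinary least-squares AR(p) fit on a hypothetical popularity series; the TAR model's transfer component and audience dynamics are not shown):

```python
import numpy as np

def fit_ar(series, p):
    """Fit y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p} by least squares."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])      # intercept column
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef                                    # [c, a_1, ..., a_p]

def forecast_next(series, coef):
    """One-step-ahead forecast from the fitted AR coefficients."""
    p = len(coef) - 1
    lags = np.asarray(series[-p:], dtype=float)[::-1]   # y_{t-1}, ..., y_{t-p}
    return coef[0] + coef[1:] @ lags

views = [120, 135, 150, 149, 170, 180, 178, 195, 210, 205]   # hypothetical episode popularity
print(forecast_next(views, fit_ar(views, p=2)))
```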
Within the last few years, the importance of collaborative ontology-engineering projects, especially in the biomedical domain, has drastically increased. This recent trend is a direct consequence of the growing complexity of these structured data representations, which no single individual can handle anymore. For example, the World Health Organization is currently developing the next revision of the International Classification of Diseases (ICD), using an OWL-based core for data representation and Web 2.0 technologies to augment collaboration. This new revision of ICD consists of roughly 50,000 diseases and causes of death and is used in many countries around the world to encode patient histories and to compile health-related statistics and spending. Hence, it is crucial for practitioners to better understand and steer the underlying processes of how users collaboratively edit an ontology. In particular, generating predictive models is a pressing issue, as these models may be leveraged to generate recommendations in collaborative ontology-engineering projects and to determine the implications of potential actions on the ontology and community.
In this paper we approach this task by (i) exploring whether regularities and common patterns exist in user action sequences derived from the change logs of five different collaborative ontology-engineering projects from the biomedical domain. Based on this information we (ii) model the data using Markov chains of varying order, which are then used to (iii) predict user actions in the sequences at hand.
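A minimal sketch of the kind of varying-order Markov chain prediction described above, assuming nothing about the actual change-log schema: transition counts are collected for orders 1..k, and prediction backs off to a lower order when a context has not been seen. The action names, maximum order, and back-off rule are illustrative assumptions.

    from collections import Counter, defaultdict

    def train_markov(sequences, max_order=2):
        """Count transitions for every order 1..max_order: (last-k actions) -> next action."""
        models = {k: defaultdict(Counter) for k in range(1, max_order + 1)}
        for seq in sequences:
            for k in range(1, max_order + 1):
                for i in range(k, len(seq)):
                    models[k][tuple(seq[i - k:i])][seq[i]] += 1
        return models

    def predict_next(models, history):
        """Predict the most likely next action, backing off to lower orders if needed."""
        for k in sorted(models, reverse=True):
            context = tuple(history[-k:])
            if len(history) >= k and context in models[k]:
                return models[k][context].most_common(1)[0][0]
        return None

    logs = [["create_class", "add_property", "edit_label", "add_property"],
            ["create_class", "edit_label", "add_property", "retire_class"]]
    models = train_markov(logs, max_order=2)
    print(predict_next(models, ["create_class", "edit_label"]))  # -> "add_property"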
A Care Pathway is a knowledge-centric process that guides clinicians in providing evidence-based care to patients with specific conditions. One existing problem with care pathways is that they often fail to reflect the best clinical practice because they are not adequately updated. A better understanding of the gaps between a care pathway and real practice requires aligning patient records with the pathway. Patient records are unlabeled in practice, making it difficult to align them with a care pathway, which is inherently complex due to its representation as a hierarchical and declarative process model (HDPM). This paper proposes to solve this problem by developing a Hierarchical Markov Random Field (HMRF) method so that a set of patient records can best fit a given care pathway. We validate the effectiveness of the method with experiments on both synthesized data and real clinical data.
The fast growth of technologies has driven the advancement of our society. It is often necessary to quickly grasp the linkage between different technologies in order to better understand the technical trend. The availability of huge volumes of granted patent documents provides a reasonable basis for analyzing the relationships between technologies. In this paper, we propose a unified framework, named PatentDom, to identify important patents related to key techniques from a large number of patent documents. The framework integrates different types of patent information, including patent content, citations of patents, and temporal relations, and provides a concise yet comprehensive technology summary. The identified key patents enable a variety of patent-related analytical applications, e.g., outlining the technology evolution of a particular domain, tracing a given technique to prior technologies, and mining the technical connection of two given patent documents. Empirical analysis and extensive case studies on a collection of US patent documents demonstrate the efficacy of our proposed framework.
Effective patent valuation is important for patent holders. Forward patent citations, widely used in assessing patent value, have been considered to reflect knowledge flows, just like paper citations. However, patent citations also carry legal implications, which are important for patent valuation. We argue that patent citations can either be technological citations that indicate knowledge transfer or legal citations that delimit the legal scope of the citing patents. In this paper, we first develop citation-network-based methods to infer patent quality measures along either the legal or the technological dimension. We then propose a probabilistic mixture approach to incorporate both the legal and technological dimensions of patent citations, and an iterative learning process that integrates a temporal decay function on legal citations, a probabilistic citation-network-based algorithm, and a prediction model for patent valuation. We learn all the parameters together and use them for patent valuation. We demonstrate the effectiveness of our approach by using patent maintenance status as an indicator of patent value and discuss the insights gained from this study.
Improvements in information technology have made it easier for industry to communicate with customers, raising hopes for a scheme that can estimate when customers will want to make purchases. Although a number of models have been developed to estimate the time-varying purchase probability, they are based on very restrictive assumptions such as preceding-purchase-event dependence and discrete-time effects of covariates. Our preliminary analysis of real-world data finds that these assumptions are invalid: self-exciting behavior, as well as marketing stimuli and preceding-purchase dependence, should be examined as possible factors influencing purchase probability. In this paper, by employing the novel idea of hierarchical time rescaling, we propose a tractable but highly flexible model that can meld various types of intrinsic history dependency and marketing stimuli in a continuous-time setting. By employing the proposed model, which incorporates these three factors, we analyze actual data and show that our model can precisely track the temporal dynamics of purchase probability at the level of individuals. It enables us to take effective marketing actions such as advertising and recommendations on a timely and individual basis, leading to the construction of a profitable relationship with each customer.
The performance of joins in parallel database management systems is critical for data-intensive operations such as querying. Since data skew is common in many applications, poorly engineered join operations result in load imbalance and performance bottlenecks. State-of-the-art methods designed to handle this problem offer significant improvements over naive implementations. However, performance could be further improved by removing the dependency on global skew knowledge and broadcasting. In this paper, we propose PRPQ (partial redistribution & partial query), an efficient and robust join algorithm for processing large-scale joins over distributed systems. We present the detailed implementation and a quantitative evaluation of our method. The experimental results demonstrate that the proposed PRPQ algorithm is indeed robust and scalable under a wide range of skew conditions. Specifically, compared to the state-of-the-art PRPD method, we achieve 16%-167% performance improvement and 24%-54% less network communication under different join workloads.
The last decade has witnessed the prevalence of sensor and GPS technologies that produce a high volume of trajectory data representing the motion histories of moving objects. However, some characteristics of trajectories, such as variable lengths and asynchronous sampling rates, make them difficult to fit into traditional database systems that are disk-based and tuple-oriented. Motivated by the success of column stores and the recent development of in-memory databases, we explore the potential of boosting the performance of trajectory data processing by designing a novel trajectory storage within main memory. In contrast to most existing trajectory indexing methods that keep consecutive samples of the same trajectory in the same disk page, we partition the database into frames in which the positions of all moving objects at the same time instant are stored together and aligned in main memory. We found this column-wise storage to be surprisingly well suited for in-memory computing, since most frames can be stored in highly compressed form, which is pivotal for increasing memory throughput and reducing CPU-cache misses. The independence between frames also makes them natural working units when parallelizing data processing in a multi-core environment. Lastly, we run a variety of common trajectory queries on both real and synthetic datasets in order to demonstrate the advantages and study the limitations of our proposed storage.
A distribution channel is a system in which partners move products from manufacturers to end users. To increase sales, it is quite common for manufacturers to adjust the product prices offered to partners according to the product volume per deal. However, the price adjustment is a double-edged sword. It also spurs some partners to form a cheating alliance, where a cheating seller applies for a falsified big deal at a low price and then re-sells the products to the cheating buyers. Since these cheating behaviors are harmful to a healthy distribution-channel ecosystem, automatic methods are needed to guide the tedious audit process.
Thus, in this study we propose a method to rank all partners by their degree of cheating, either as sellers or as buyers. It is mainly motivated by the observation that the sales volumes of a cheating seller and its corresponding cheating buyer are often negatively correlated with each other. Specifically, the proposed framework consists of three parts: 1) an asymmetric correlation measure, which is needed to distinguish cheating sellers from cheating buyers; 2) a systematic approach to remove false positive pairs, i.e., two partners whose sales correlation is purely coincidental; and 3) a probabilistic model to measure the degree of cheating behavior for each partner.
Based on four years of channel data from an IT company, we empirically show that the proposed method outperforms the baseline approaches. It is worth mentioning that, with the proposed unsupervised method, more than half of the partners in the resulting top-30 ranking list are true cheating partners.
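The asymmetric correlation measure itself is not specified in the abstract; purely as an illustration of the underlying observation, the sketch below flags partner pairs whose per-period sales volumes are strongly negatively correlated, using a plain (symmetric) Pearson screen. All names, thresholds, and data are hypothetical, and such a screen alone cannot tell sellers from buyers.

    import numpy as np

    def negatively_correlated_pairs(sales_by_partner, threshold=-0.6):
        """Flag partner pairs whose sales series are strongly negatively correlated.
        sales_by_partner: dict partner_id -> list of per-period sales volumes."""
        partners = list(sales_by_partner)
        flagged = []
        for i, a in enumerate(partners):
            for b in partners[i + 1:]:
                r = np.corrcoef(sales_by_partner[a], sales_by_partner[b])[0, 1]
                if r <= threshold:
                    flagged.append((a, b, round(float(r), 3)))
        return flagged

    sales = {"p1": [90, 10, 80, 15, 85], "p2": [10, 95, 20, 90, 15], "p3": [50, 55, 52, 49, 51]}
    print(negatively_correlated_pairs(sales))  # expect (p1, p2) to be flagged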
Load curve data in power systems refers to users' electrical energy consumption data periodically collected with meters. It has become one of the most important assets for modern power systems, and many operational decisions are made based on the information discovered in this data. Load curve data, however, usually suffers from corruptions caused by various factors, such as data transmission errors or malfunctioning meters. To solve this problem, tremendous research efforts have been devoted to load curve data cleansing. Most existing approaches apply outlier detection methods from the supply side (i.e., electricity service providers), which may only have aggregated load data. In this paper, we propose to seek aid from the demand side (i.e., electricity service users). With the help of readily available knowledge about consumers' appliances, we present an appliance-driven approach to load curve data cleansing. This approach utilizes data generation rules and a Sequential Local Optimization Algorithm (SLOA) to solve the Corrupted Data Identification Problem (CDIP). We evaluate the performance of SLOA with real-world trace data and synthetic data. The results indicate that, compared to existing load data cleansing methods such as B-spline smoothing, our approach has an overall better performance and can effectively identify consecutive corrupted data. Experimental results also show that our method is robust across various tests.
The availability of large volumes of interaction data and scalable data mining techniques has made it possible to study the online behaviour of millions of Web users. Part of these efforts has focused on understanding how users interact and engage with web content. However, the measurement of within-content engagement remains a difficult and unsolved task. This is because of the lack of standardised, well-validated methods for measuring engagement, especially in an online context. To address this gap, we perform a controlled user study in which we observe how users respond to online news in the presence or absence of interest. We collect mouse tracking data, which is known to correlate with visual attention, and examine how cursor behaviour can inform user engagement measures. The proposed method does not use any pre-determined concepts to characterise cursor patterns. Rather, we follow an unsupervised approach and use a large set of features engineered from our data to extract the cursor patterns. Our findings support the connection between gaze and cursor behaviour but also, and more importantly, reveal other dependencies, such as the correlation between cursor activity and experienced affect. Finally, we demonstrate the value of our method by predicting the outcome of online news reading experiences.
The informational needs behind queries that people issue to search engines are inherently sensitive to external factors such as breaking news, new models of devices, or seasonal events such as 'Black Friday'. These changes often happen suddenly, and it is natural to suppose that they may cause a shift in user satisfaction with previously presented search results and push users to reformulate their queries. For instance, users who issued the query 'CIKM conference' in 2013 were satisfied with results referring to the page cikm2013.org, and this page received the majority of clicks. However, the conference site has since changed, and the same query issued in 2014 should be linked to a different page, cikm2014.fudan.edu.cn. If the link to the fresh page is not among the retrieved results, users will reformulate the query to find the desired information.
In this paper, we examine how to detect changes in user satisfaction when events affect user information goals but search results remain the same. We formulate the problem using concept drift detection techniques. The proposed method works in an unsupervised manner; we do not rely on any labelling. We report results of a large-scale evaluation over real user interactions collected by a commercial search engine over six months; the final datasets consist of more than sixty million log entries. The results of our experiments demonstrate that our method can accurately detect changes in user behavior. The detected drifts can be used to enhance query auto-completion, user satisfaction metrics, and recency ranking.
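The abstract does not name the specific drift detector used; purely as an illustration of unsupervised concept drift detection on a behavioural signal, the sketch below applies the standard Page-Hinkley test to a hypothetical daily query-reformulation rate. The metric, parameter values, and data are assumptions, not the paper's method.

    def page_hinkley(values, delta=0.005, threshold=0.2):
        """Page-Hinkley test: return the index at which an upward drift in the
        running mean of a behavioural metric is detected, or None."""
        mean, cumulative, min_cumulative = 0.0, 0.0, 0.0
        for t, x in enumerate(values, start=1):
            mean += (x - mean) / t                       # incremental mean
            cumulative += x - mean - delta               # deviation accumulator
            min_cumulative = min(min_cumulative, cumulative)
            if cumulative - min_cumulative > threshold:  # significant upward drift
                return t - 1
        return None

    reformulation_rate = [0.10, 0.11, 0.09, 0.10, 0.12, 0.30, 0.33, 0.31, 0.32]
    print(page_hinkley(reformulation_rate))  # flags the shift after the jump to ~0.3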
Due to the advent of social media and Web 2.0, we are faced with a deluge of information; recent research efforts have focused on filtering out noisy, irrelevant information items from social media streams and, in particular, have attempted to automatically identify and summarise events. However, due to the heterogeneous nature of such social media streams, these efforts have not reached fruition. In this paper, we investigate how images can be used as a source for summarising events. Existing approaches have considered only textual summaries, which are often poorly written, in a different language, and slow to digest. Alternatively, images are "worth 1,000 words" and are able to quickly and easily convey an idea or scene. Since images in social media can also be noisy, irrelevant and repetitive, we propose new techniques for their automatic selection, ranking and presentation. We evaluate our approach on a recently created social media event data set containing 365k tweets and 50 events, which we extend by collecting 625k related images. By conducting two crowdsourced evaluations, we first show how our approach overcomes the problems of automatically collecting relevant and diverse images from noisy microblog data, before highlighting the advantages of multimedia summarisation over text-based approaches.
Crowd-based online work is leveraged in a variety of applications such as semantic annotation of images, translation of texts in foreign languages, and labeling of training data for machine learning models. However, annotating large amounts of data through crowdsourcing can be slow and costly. In order to improve both the cost and time efficiency of crowdsourcing, we examine alternative reward mechanisms compared to the "Pay-per-HIT" scheme commonly used in platforms such as Amazon Mechanical Turk. To this end, we explore a wide range of monetary reward schemes that are inspired by the success of competitions, lotteries, and games of luck. Our large-scale experimental evaluation, with an overall budget of more than 1,000 USD and with 2,700 hours of work spent by crowd workers, demonstrates that our alternative reward mechanisms are well accepted by online workers and lead to substantial performance boosts.
This paper addresses the problem of joint modeling of multimedia components in different media forms. We consider the information retrieval task across both text and image documents, which includes retrieving relevant images that closely match the description in a text query and retrieving text documents that best explain the content of an image query. A greedy dictionary construction approach is introduced for learning an isomorphic feature space, to which cross-modality data can be adapted while data smoothness is guaranteed. The proposed objective function consists of two reconstruction error terms for both modalities and a Maximum Mean Discrepancy (MMD) term that measures the cross-modality discrepancy. Optimization of the reconstruction terms and the MMD term yields a compact and modality-adaptive dictionary pair. We formulate the joint combinatorial optimization problem by maximizing variance reduction over a candidate signal set while constraining the dictionary size and coefficients' sparsity. By exploiting the submodularity and the monotonicity property of the proposed objective function, the optimization problem can be solved by a highly efficient greedy algorithm, and is guaranteed to be at least a (e-1)/e ≈ 0.632 approximation to the optimum. The proposed method achieves state-of-the-art performance on the Wikipedia dataset.
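For readers unfamiliar with why such greedy algorithms enjoy the (e-1)/e guarantee, the sketch below shows the generic greedy procedure for maximizing a monotone submodular set function under a cardinality constraint. The toy coverage-style objective is only a stand-in for the paper's variance-reduction criterion and is purely an assumption.

    def greedy_submodular(candidates, objective, budget):
        """Generic greedy maximisation of a monotone submodular set function under a
        cardinality constraint; this is the scheme behind the (1 - 1/e) guarantee."""
        selected = []
        for _ in range(budget):
            best, best_gain = None, 0.0
            for c in candidates:
                if c in selected:
                    continue
                gain = objective(selected + [c]) - objective(selected)
                if gain > best_gain:
                    best, best_gain = c, gain
            if best is None:        # no positive marginal gain left
                break
            selected.append(best)
        return selected

    # Toy coverage-style objective standing in for the variance-reduction criterion.
    signals = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
    coverage = lambda S: len(set().union(*(signals[s] for s in S))) if S else 0
    print(greedy_submodular(list(signals), coverage, budget=2))  # e.g. ['a', 'b']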
This paper describes a probabilistic latent variable model designed to detect human values, such as justice or freedom, that a writer has sought to reflect or appeal to when participating in a public debate. The proposed model treats the words in a sentence as having been chosen based on specific values; the values reflected by each sentence are then estimated by aggregating the values associated with each word. The model can determine the human values for a word in light of the influence of the previous word. This design choice was motivated by syntactic structures such as noun+noun, adjective+noun, and verb+adjective. The classifier based on the model was evaluated on a test collection containing 102 manually annotated documents focusing on one contentious political issue, Net neutrality, achieving the highest reported classification effectiveness for this task. We also compared our proposed classifier with a second human annotator and found that the classifier's effectiveness is statistically comparable with that of human annotators.
When consuming content, users typically encounter entities that they are not familiar with. A common scenario is when users want to find information about entities directly within the content they are consuming. For example, when reading the book "Adventures of Huckleberry Finn", a user may lose track of the character Mary Jane and want to find some paragraph in the book that gives relevant information about her. The way this is achieved today is by invoking the ubiquitous Find function ("Ctrl-F"). However, this only returns exact-matching results without any relevance ranking, leading to a suboptimal user experience.
How can we go beyond the Ctrl-F function? To tackle this problem, we present algorithms for semantic matching and relevance ranking that enable users to effectively search and understand entities that have been defined in the content they are consuming, which we call locally-defined entities. We first analyze the limitations of standard information retrieval models when applied to searching locally-defined entities, and then propose a novel semantic entity retrieval model that addresses these limitations. We also present a ranking model that leverages multiple novel signals to model the relevance of a passage. A thorough experimental evaluation of the approach in the real-world application of searching characters within e-books shows that it outperforms the baselines by more than 60% in terms of NDCG.
With the popularity of social media platforms such as Facebook and Twitter, the amount of useful data in these sources is rapidly increasing, making them promising places for information acquisition. This research aims at the customized organization of a social media corpus using a focused topic hierarchy. It organizes the contents into different structures to meet users' different information needs (e.g., "iPhone 5 problem" or "iPhone 5 camera"). To this end, we introduce a novel function to measure the likelihood of a topic hierarchy, by which users' information needs can be incorporated into the process of topic hierarchy construction. Using the structure information within the generated topic hierarchy, we then develop a probability-based model to identify the representative contents of topics to assist users in document retrieval on the hierarchy. Experimental results on real-world data illustrate the effectiveness of our method and its superiority over state-of-the-art methods for both information organization and retrieval tasks.
In large networks, connected triples are useful for solving various tasks including link prediction, community detection, and spam filtering. Existing works in this direction are mostly concerned with the exact or approximate counting of connected triples that are closed (a.k.a. triangles). Evidently, the task of triple sampling has not been explored in depth, although sampling is a more fundamental task than counting, and the former is useful for solving various other tasks, including counting. In recent years, some works on triple sampling based on direct sampling have been proposed, solely for the purpose of triangle count approximation. They sample only from a uniform distribution and are not effective for sampling triples from an arbitrary user-defined distribution. In this work we present two indirect triple sampling methods that are based on a Markov Chain Monte Carlo (MCMC) sampling strategy. Both methods are highly efficient compared to a direct sampling-based method, specifically for the task of sampling from a non-uniform probability distribution. Another significant advantage of the proposed methods is that they can sample triples from networks with restricted access, on which a direct sampling-based method is simply not applicable.
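The two proposed samplers are not detailed in the abstract; the sketch below is one plausible Metropolis-Hastings walk over connected triples (2-paths), with a Hastings correction for the non-uniform neighborhood sizes, targeting probabilities proportional to an arbitrary user-defined weight. The neighborhood definition, weight function, and toy graph are illustrative assumptions, not the authors' constructions.

    import random
    from collections import defaultdict

    def build_adj(edges):
        adj = defaultdict(set)
        for a, b in edges:
            adj[a].add(b); adj[b].add(a)
        return adj

    def canonical(u, v, w):
        """A connected triple is a 2-path (u, v, w) centred on v; order the endpoints."""
        return (min(u, w), v, max(u, w))

    def triple_neighbors(t, adj):
        """All triples sharing an edge with t (the MH proposal set)."""
        u, v, w = t
        out = set()
        for a, b in ((u, v), (v, w)):
            for x in adj[a] - {b}:
                out.add(canonical(x, a, b))
            for y in adj[b] - {a}:
                out.add(canonical(a, b, y))
        out.discard(t)
        return list(out)

    def mcmc_triple_sampler(edges, weight, steps=5000, burn_in=500):
        """Metropolis-Hastings walk over connected triples, targeting weight(triple)."""
        adj = build_adj(edges)
        v = next(n for n in adj if len(adj[n]) >= 2)
        u, w = random.sample(sorted(adj[v]), 2)
        current, samples = canonical(u, v, w), []
        for step in range(steps):
            nbrs = triple_neighbors(current, adj)
            proposal = random.choice(nbrs)
            back = triple_neighbors(proposal, adj)
            accept = (weight(proposal) * len(nbrs)) / (weight(current) * len(back))
            if random.random() < min(1.0, accept):
                current = proposal
            if step >= burn_in:
                samples.append(current)
        return samples

    edges = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 4)]
    adj = build_adj(edges)
    # Non-uniform target: favour triples centred on high-degree vertices.
    samples = mcmc_triple_sampler(edges, weight=lambda t: len(adj[t[1]]))
    print(samples[:5])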
Subgraph search is very useful in many real-world applications. However, users may be overwhelmed by the sheer number of matches. In this paper, we propose the subgraph skyline search problem, denoted as S3, to support more complicated analysis over graph data. Specifically, given a large graph G and a query graph q, we want to find all subgraphs g in G such that g is graph-isomorphic to q and not dominated by any other subgraph. In order to improve efficiency, we devise a hybrid feature encoding incorporating both structural and numeric features. Moreover, we present optimizations based on a partitioning strategy. We also propose a skylayer index to facilitate dynamic subgraph skyline computation. Extensive experiments over real datasets confirm the effectiveness and efficiency of our algorithm.
Within-Network Classification (WNC) techniques are designed for applications where objects to be classified and those with known labels are interlinked. For WNC tasks like web page classification, the homophily principle succeeds by assuming that linked objects, represented as adjacent vertices in a network, are likely to have the same labels. However, in other tasks like chemical structure completion, recent works suggest that the label of a vertex should be related to the local structure it resides in, rather than equated with those of its neighbors. These works also propose structure-aware vertex features or methods to deal with such an issue.
In this paper, we demonstrate that frequent neighborhood patterns, originally studied in the pattern mining literature, serve as a strong class of structure-aware features and provide satisfactory effectiveness in WNC. In addition, we identify the problem that the neighborhood pattern miner indiscriminately mines patterns of all radii, while heuristics and experiments both indicate that patterns with a large radius take much time and bring only negligible effectiveness gains. We develop a specially designed algorithm capable of working under radius threshold constraints, so that patterns with a large radius are not mined at all. Experiments suggest that our algorithm helps with the trade-off between efficiency and effectiveness in WNC tasks.
We improve the state-of-the-art method for the compression of web and other similar graphs by introducing an elegant technique which further exploits the clustering properties observed in these graphs. The analysis and experimental evaluation of our method shows that it outperforms the currently best method of Boldi et al. by achieving a better compression ratio and retrieval time. Our method exhibits vast improvements on certain families of graphs, such as social networks, by taking advantage of their compressibility characteristics, and ensures that the compression ratio will not worsen for any graph, since it easily falls back to the state-of-the-art method.
The data-cleaning-as-a-service (DCaS) paradigm enables users to outsource their data and data cleaning needs to computationally powerful third-party service providers. This paradigm raises several security issues, one of which is how the client can protect the private information in the outsourced data. In this paper, we focus on data deduplication as the main data cleaning task, and design two efficient privacy-preserving data-deduplication methods for the DCaS paradigm. We analyze the robustness of our two methods against attacks that exploit auxiliary frequency distributions and knowledge of the encoding algorithms. Our empirical study demonstrates the efficiency and effectiveness of our privacy-preserving approaches.
We propose a new local data perturbation method called Aroma. We first show that Aroma is sound in its privacy protection. For that, we devise a realistic privacy game, called the exposure test. We prove that the αβ algorithm, a previously proposed method that is most closely related to Aroma, performs poorly under the exposure test and fails to provide sufficient privacy in practice. Moreover, any data protection method that satisfies ε-differential privacy will succeed in the test. By proving that Aroma satisfies ε-differential privacy, we show that Aroma offers strong privacy protection. We then demonstrate the utility of Aroma by proving that its estimator has significantly smaller errors than the previous state-of-the-art algorithms such as αβ, AM, and FRAPP. We carry out a systematic empirical study using real-world data to evaluate Aroma, which shows its clear advantages over previous methods.
We study provisioning and job reconfiguration techniques for adapting to execution environment changes when processing data streams on cluster-based deployments. By monitoring the performance of an executing job, we identify computation and communication bottlenecks. In such cases we reconfigure the job by reallocating its tasks to minimize the communication cost. Our work targets data-intensive applications where the inter-node transfer latency is significant. We aim to minimize the transfer latency while keeping the nodes below some computational load threshold. We propose a scalable centralized scheme that employs fast allocation heuristics. Our techniques are based on a general group-based job representation that is commonly found in many distributed data stream processing frameworks. Using this representation we devise linear-time task allocation algorithms that improve existing quadratic-time solutions in practical cases. We have implemented and evaluated our proposals using both synthetic and real-world scenarios. Our results show that our algorithms: (a) exhibit significant allocation throughput while producing near-optimal allocations, and (b) significantly improve existing task-level approaches.
Truth discovery is a long-standing problem for assessing the validity of information from various data sources that may provide different and conflicting information. With the increasing prominence of data streams arising in a wide range of applications such as weather forecast and stock price prediction, effective techniques for truth discovery in data streams are demanded. However, existing work mainly focuses on truth discovery in the context of static databases, which is not applicable in applications involving streaming data. This motivates us to develop new techniques to tackle the problem of truth discovery in data streams.
In this paper, we propose a probabilistic model that transforms the problem of truth discovery over data streams into a probabilistic inference problem. We first design a streaming algorithm that infers the truth as well as source quality in real time. Then, we develop a one-pass algorithm, in which the inference of source quality is proved to be convergent and the accuracy is further improved. We conducted extensive experiments on real datasets which verify both the efficiency and accuracy of our methods for truth discovery in data streams.
Query auto-completion (QAC) is a prominent feature of modern search engines. It is aimed at saving users' time and enhancing the search experience. Current QAC models mostly rank matching QAC candidates according to their past popularity, i.e., frequency. However, query popularity changes over time and may vary drastically across users. Hence, rankings of QAC candidates should be adjusted accordingly. In previous work, time-sensitive QAC models and user-specific QAC models have been developed separately. Both types of QAC model lead to important improvements over models that are neither time-sensitive nor personalized. We propose a hybrid QAC model that considers both of these aspects: time-sensitivity and personalization.
Using search logs, we return the top N QAC candidates by predicted popularity based on their recent trend and cyclic behavior. We use auto-correlation to detect query periodicity through long-term time-series analysis, and anticipate the query popularity trend based on observations within an optimal time window returned by a regression model. We then re-rank the returned top N candidates by integrating their character-level similarities with a user's preceding queries (both in the current session and in previous sessions by the same user) to produce the final QAC list. Our experimental results on two real-world datasets show that our hybrid QAC model outperforms the state-of-the-art time-sensitive QAC baseline, achieving total improvements of between 3% and 7% in terms of MRR.
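As a small illustration of the auto-correlation step only (not the full hybrid model), the sketch below picks the lag with the strongest sample autocorrelation in a query's daily popularity series; the toy series, lag range, and lack of detrending are simplifying assumptions.

    import numpy as np

    def detect_period(daily_counts, max_lag=30):
        """Return (lag, autocorrelation) for the lag with the strongest sample
        autocorrelation: a crude periodicity check for a query's popularity series."""
        x = np.asarray(daily_counts, dtype=float)
        x = x - x.mean()
        denom = float(x @ x)
        if denom == 0.0:
            return None
        acf = [float(x[lag:] @ x[:-lag]) / denom
               for lag in range(1, min(max_lag, len(x) - 1))]
        best_lag = int(np.argmax(acf)) + 1
        return best_lag, acf[best_lag - 1]

    # A hypothetical query that spikes every 7 days (e.g. tied to a weekly TV show).
    series = [5, 6, 5, 4, 5, 6, 40] * 8
    print(detect_period(series))  # expected strongest lag: 7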
Query latency is an important performance measure of any search engine because it directly affects users' satisfaction. The key challenge is how to efficiently retrieve the top-K ranked results for a query. Current search engines process queries in either the conjunctive or the disjunctive mode. However, there is still a large performance gap between these two modes: the conjunctive mode is more efficient but has lower search accuracy, while the disjunctive mode is more effective but requires more time to process queries.
In this paper, we propose a novel query evaluation method that aims to achieve a better balance between the efficiency and effectiveness of top-K query processing. The basic idea is to prioritize candidate documents based on the number of matched query terms in the documents as well as the importance of the matched terms. We propose a simple priority function and then discuss how to implement the idea based on a decision tree. Experimental results over both Web and Twitter collections show that the proposed method narrows the performance gap between the conjunctive and disjunctive modes when K is larger or the query is longer. In particular, compared with one of the fastest existing query processing methods, the proposed method can achieve a speedup of 2 with marginal loss in retrieval effectiveness on the Web collection.
Top-K query processing is one of the most important problems in large-scale Information Retrieval systems. Since query processing time varies for different queries, an accurate run-time performance prediction is critical for online query scheduling and load balancing, which could eventually reduce the query waiting time and improve the throughput. Previous studies estimated the query processing time based on the combination of term-level features. Unfortunately, these features were often selected arbitrarily, and the linear combination of these features might not be able to accurately capture the complexity in the query processing.
In this paper, we propose a novel analytical performance modeling framework for top-K query processing. Our goal is to provide a systematic way of identifying important features for efficiency prediction and then to develop a general framework for estimating the query processing time. Specifically, we divide query processing into three stages, identify useful features, and discuss how to use them to model the query processing time for each stage. After that, we propose to fit the model using a step-by-step strategy and to compute the approximated feature values based on easily obtained statistics. Experimental results on TREC collections show that the developed performance model can predict the query processing time more accurately than the state-of-the-art efficiency predictor, in particular for dynamic pruning methods.
Compression is widely exploited in retrieval systems, such as search engines and text databases, to lower both retrieval costs and system latency. In particular, compression of repositories can reduce storage requirements and fetch times, while improving caching. One of the most effective techniques is relative Lempel-Ziv (RLZ), in which a RAM-resident dictionary encodes the collection. With RLZ, a specified document can be decoded independently and extremely fast, while maintaining a high compression ratio. For terabyte-scale collections, this dictionary need only be a fraction of a per cent of the original data size. However, as originally described, RLZ uses a static dictionary, against which encoding of new data may be inefficient. An obvious alternative is to generate a new dictionary solely from the new data. However, this approach may not be scalable, because the combined RAM-resident dictionary will grow in proportion to the collection.
In this paper, we describe effective techniques for extending the original dictionary to manage new data. With these techniques, a new auxiliary dictionary, relatively limited in size, is created by interrogating the original dictionary with the new data. Then, to compress this new data, we combine the auxiliary dictionary with some parts of the original dictionary (the latter in fact encoded as pointers into that original dictionary) to form a second dictionary. Our results show that excellent compression is available with only small auxiliary dictionaries, so that RLZ can feasibly transmit and store large, growing collections.
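To make the dictionary-relative encoding concrete, here is a deliberately naive sketch of a relative Lempel-Ziv parse: new text is greedily factored into (offset, length) references against a dictionary string, with literals for short or absent matches. Real RLZ implementations find longest matches with suffix-array structures, and the auxiliary-dictionary construction described above is more involved; everything below, including the toy strings and the min_match threshold, is illustrative.

    def rlz_encode(text, dictionary, min_match=4):
        """Greedy relative-LZ parse: emit (offset, length) factors pointing into the
        dictionary, falling back to literal characters for short/absent matches.
        The longest-match search here is naive and quadratic, for clarity only."""
        factors, i = [], 0
        while i < len(text):
            best_off, best_len = -1, 0
            for off in range(len(dictionary)):
                l = 0
                while (off + l < len(dictionary) and i + l < len(text)
                       and dictionary[off + l] == text[i + l]):
                    l += 1
                if l > best_len:
                    best_off, best_len = off, l
            if best_len >= min_match:
                factors.append(("ref", best_off, best_len))
                i += best_len
            else:
                factors.append(("lit", text[i]))
                i += 1
        return factors

    def rlz_decode(factors, dictionary):
        out = []
        for f in factors:
            out.append(dictionary[f[1]:f[1] + f[2]] if f[0] == "ref" else f[1])
        return "".join(out)

    dictionary = "the quick brown fox jumps over the lazy dog"
    new_doc = "the lazy fox jumps over the quick dog!"
    enc = rlz_encode(new_doc, dictionary)
    assert rlz_decode(enc, dictionary) == new_doc
    print(enc)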
In the medical domain, information retrieval systems can be used for identifying cohorts (i.e. patients) required for clinical studies. However, a challenge faced by such search systems is to retrieve the cohorts whose medical histories cover the inclusion criteria specified in a query, which are often complex and include multiple medical conditions. For example, a query may aim to find patients with both 'lupus nephritis' and 'thrombotic thrombocytopenic purpura'. In a typical best-match retrieval setting, any patient exhibiting all of the inclusion criteria should naturally be ranked higher than a patient that only exhibits a subset, or none, of the criteria. In this work, we extend the two main existing models for ranking patients to take into account the coverage of the inclusion criteria by adapting techniques from recent research into coverage-based diversification. We propose a novel approach for modelling the coverage of the query inclusion criteria within the records of a particular patient, and thereby rank highly those patients whose medical records are likely to cover all of the specified criteria. In particular, our proposed approach estimates the relevance of a patient, based on the mixture of the probability that the patient is retrieved by a patient ranking model for a given query, and the likelihood that the patient's records cover the query criteria. The latter is measured using the relevance towards each of the criteria stated in the query, represented in the form of sub-queries. We thoroughly evaluate our proposed approach using the test collection provided by the TREC 2011 and 2012 Medical Records track. Our results show significant improvements over existing strong baselines.
With the rapid development of Web 2.0 and the Internet of Things, predicting relationships in heterogeneous networks has become a heated research topic. Traditionally, people analyze existing relationships in heterogeneous networks that relate in a particular way to a target relationship of interest in order to predict the emergence of that target relationship. However, most existing methods are incapable of systematically identifying relevant relationships useful for the prediction task, especially those relationships involving multiple objects of heterogeneous types, which may not rest on a simple path in the heterogeneous network concerned. Another problem with current practice is that existing solutions often ignore the dynamic evolution of the network structure after the introduction of newly emerged relationships. To overcome the first limitation, we propose a new algorithm that can systematically and comprehensively detect relevant relationships useful for predicting an arbitrarily given target relationship through a disciplined graph-searching process. To address the second limitation, the new algorithm leverages a series of temporally sensitive features for relationship occurrence prediction via a supervised learning approach. To explore the effectiveness of the new algorithm, we apply a prototype implementation to the DBLP bibliographic network to predict author citation relationships and compare its performance with that of a state-of-the-art peer method and a series of baseline methods. The comparison shows consistently higher prediction accuracy under a range of prediction scenarios.
Prior art search, or recommending citations for a patent application, is a challenging task. Many approaches have been proposed and shown to be useful for prior art search. However, most of these methods do not consider the network structure for integrating and diffusing the different kinds of information present among related patents in the citation network. In this paper, we propose a method based on a time-aware random walk on a weighted network of patent citations, whose weights are characterized by contextual similarity relations between two nodes on the network. The goal of the random walker is to find influential documents in the citation network of a query patent, which can serve as candidates for drawing query terms and bigrams for query refinement. The experimental results on CLEF-IP datasets (CLEF-IP 2010 and CLEF-IP 2011) show the effectiveness of encoding contextual similarities (common classification codes, common inventor, and common applicant) between nodes in the citation network. Our proposed approach achieves significantly better results in terms of recall and Mean Average Precision compared to strong prior art search baselines.
Ownership and use of multiple devices such as desktop computers, smartphones, and tablets are increasing rapidly. Search is popular, and people often perform search tasks that span device boundaries. Understanding how these devices are used and how people transition between them during information seeking is essential to developing search support for a multi-device world. In this paper, we study search across devices and propose models to predict aspects of cross-device search transitions. We characterize multi-device search across four device types, including aspects of search behavior on each device (e.g., topics of interest) and characteristics of device transitions. Building on this characterization, we learn models to predict various aspects of cross-device search, including the next device used for search. This enables many applications. For example, accurately forecasting the device used for the next query lets search engines proactively retrieve device-appropriate content (e.g., short documents for smartphones), while knowledge of the current device combined with device-specific topical interest models may assist in better query-sense disambiguation.
Open information extraction approaches have led to the creation of large knowledge bases from the Web. The problem with such methods is that their entities and relations are not canonicalized, which leads to redundant and ambiguous facts. For example, they may store both {Barack Obama, was born, Honolulu} and {Obama, place of birth, Honolulu}. In this paper, we present an approach based on machine learning methods that can canonicalize such Open IE triples by clustering synonymous names and phrases.
We also provide a detailed discussion about the different signals, features and design choices that influence the quality of synonym resolution for noun phrases in Open IE KBs, thus shedding light on the middle ground between "open" and "closed" information extraction systems.
Knowledge bases capture millions of entities such as people, companies, or movies. However, their knowledge of named events like sports finals, political scandals, or natural disasters is fairly limited, as these are continuously emerging entities. This paper presents a method for extracting named events from news articles, reconciling them into a canonicalized representation, and organizing them into fine-grained semantic classes to populate a knowledge base. Our method captures similarity measures among news articles in a multi-view attributed graph, considering textual contents, entity occurrences, and temporal ordering. For distilling canonicalized events from this raw data, we present a novel graph coarsening algorithm based on the information-theoretic principle of minimum description length. The quality of our method is experimentally demonstrated by extracting, organizing, and evaluating 25,000 events from a corpus of 300,000 heterogeneous news articles.
In traditional multi-instance learning (MIL), instances are typically represented using a single feature view. As MIL becomes popular in domain-specific learning tasks, aggregating multiple feature views to represent multi-instance bags has recently shown promising results, mainly because multiple views provide extra information for MIL tasks. Nevertheless, multiple views also increase the risk of involving redundant views and irrelevant features for learning. In this paper, we formulate a new cross-view feature selection problem that aims to identify the most representative features across all feature views for MIL. To achieve this goal, we design a new optimization problem integrating both multi-view representation and multi-instance bag constraints. The solution to the objective function ensures that the identified top-m features are the most informative ones across all feature views. Experiments on two real-world applications demonstrate the performance of cross-view feature selection for content-based image retrieval and social media content recommendation.
This paper addresses the problem of automatically learning to classify texts by exploiting information derived from meta-level features (i.e., features derived from the original bag-of-words representation). We propose new meta-level features derived from the class distribution, the entropy and the within-class cohesion observed in the k nearest neighbors of a given test document x, as well as from the distribution of distances of x to these neighbors. The set of proposed features is capable of transforming the original feature space into a new one that is potentially smaller and more informed. Experiments performed with several standard datasets demonstrate that the proposed meta-level features are not only much superior to the traditional bag-of-words representation but also superior to other state-of-the-art meta-level features previously proposed in the literature. Moreover, the proposed meta-features can be computed about three times faster than the existing meta-level ones, making our proposal much more scalable. We also demonstrate that the combination of our meta-features and the original feature set produces significant improvements over each feature set used in isolation.
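A minimal sketch of a subset of such meta-level features, assuming dense document vectors and cosine distance: for a single test document it returns the neighbour class distribution, its entropy, and simple distance statistics. The within-class cohesion feature and the paper's exact feature set are not reproduced here; all data is synthetic.

    import numpy as np

    def knn_meta_features(x, train_X, train_y, k=5):
        """Meta-level features for one test document x from its k nearest neighbours."""
        # cosine distances to all training documents
        sims = (train_X @ x) / (np.linalg.norm(train_X, axis=1) * np.linalg.norm(x) + 1e-12)
        dists = 1.0 - sims
        idx = np.argsort(dists)[:k]
        labels, counts = np.unique(train_y[idx], return_counts=True)
        class_dist = counts / counts.sum()
        entropy = float(-(class_dist * np.log2(class_dist)).sum())
        return {
            "class_distribution": dict(zip(labels.tolist(), class_dist.tolist())),
            "neighbour_entropy": entropy,
            "mean_distance": float(dists[idx].mean()),
            "min_distance": float(dists[idx].min()),
        }

    rng = np.random.default_rng(0)
    train_X = rng.random((100, 20))          # toy document vectors
    train_y = rng.integers(0, 3, size=100)   # toy class labels
    print(knn_meta_features(rng.random(20), train_X, train_y, k=5))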
Given a noisy or sampled snapshot of a network, such as a contact network or the blogosphere, in which an infection (or meme or virus) has been spreading for some time, what are the best nodes to immunize (vaccinate)? Manipulating graphs via node removal is by itself an important problem in multiple domains such as epidemiology, public health, and social media. Moreover, it is important to account for uncertainty, as surveillance data on who is infected is typically limited or sampled. Efficient algorithms for this problem can help public-health experts make more informed decisions.
In this paper, we study the problem of designing vaccine-distribution algorithms under an uncertain environment, with known information consisting of confirmed cases as well as a probability distribution of unknown cases. We formulate the NP-Hard Uncertain Data-Aware Vaccination problem, and design multiple efficient algorithms for factorizable distributions (including a novel sub-quadratic algorithm) which naturally take into account the uncertainty, while providing robust solutions. Finally, we show the effectiveness and scalability of our methods via extensive experiments on real datasets, including large epidemiological and social networks.
Community detection has been one of the fundamental problems in network analysis. Results from community detection (for example, grouping of products by latent category) can also serve as information nuggets for other business applications, such as product recommendation or taxonomy building. Because many real networks are naturally directed, e.g., the World Wide Web, some recent studies have proposed algorithms for detecting various types of communities in a directed network. However, few of them consider that nodes play two different roles, source and terminal, in a directed network.
In this paper, we adopt a novel concept of communities, directional communities, and propose a new algorithm based on Markov Clustering to detect them. We first compare our algorithm, Dual R-MCL, on synthetic networks with two recent algorithms also designed for detecting directional communities. We show that Dual R-MCL detects directional communities with significantly higher accuracy and runs 3x to 25x faster than the two other algorithms. Second, we compare a set of directed-network community detection algorithms on a one-day Twitter interaction network and demonstrate that Dual R-MCL generates clusters that more correctly match hashtags. Finally, we exhibit our algorithm's capacity to identify directional communities in product description networks, where nodes are otherwise not directly connected.
The results indicate that directional communities exist in real networks and that Dual R-MCL can effectively detect them. We believe it will enable the discovery of interesting components in diverse types of networks where existing methods cannot, and that it has strong application value.
We describe an optimal randomized MapReduce algorithm for the problem of triangle enumeration that requires O(E^{3/2}/(M√m)) rounds, where m denotes the expected memory size of a reducer and M the total available space. This generalizes the well-known vertex partitioning approach proposed in (Suri and Vassilvitskii, 2011) to multiple rounds, significantly increasing the size of the graphs that can be handled on a given system. We also give new theoretical (high-probability) bounds on the work needed in each reducer, addressing the "curse of the last reducer". Indeed, our work is the first to give guarantees on the maximum load of each reducer for an arbitrary input graph. Our experimental evaluation shows the scalability of our approach, that it is competitive with existing methods, improving performance by a factor of up to 2x, and that it can significantly increase the size of datasets that can be processed.
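For context, the single-round vertex-partitioning idea that the above algorithm generalizes can be sketched as follows: vertices are hashed into p buckets, every unordered bucket triple becomes an independent sub-problem (one "reducer"), and triangles are enumerated locally and deduplicated. The hash, the choice of p, and the toy graph are assumptions; this is not the multi-round algorithm of the paper.

    import itertools
    from collections import defaultdict

    def partitioned_triangles(edges, p=3):
        """Vertex-partitioning triangle enumeration in the spirit of Suri and
        Vassilvitskii (2011): each bucket triple is an independent sub-problem."""
        bucket = lambda v: hash(v) % p
        subproblems = defaultdict(set)          # "map" phase: bucket-triple -> edge set
        for u, v in edges:
            bu, bv = bucket(u), bucket(v)
            for combo in itertools.combinations(range(p), 3):
                if bu in combo and bv in combo:
                    subproblems[combo].add((min(u, v), max(u, v)))
        triangles = set()                        # "reduce" phase: local enumeration
        for edge_set in subproblems.values():
            adj = defaultdict(set)
            for u, v in edge_set:
                adj[u].add(v); adj[v].add(u)
            for u, v in edge_set:
                for w in adj[u] & adj[v]:
                    triangles.add(tuple(sorted((u, v, w))))
        return triangles

    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 1), (2, 4)]
    print(sorted(partitioned_triangles(edges, p=3)))  # four triangles in this toy graph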
Large-scale websites are predominantly built as service-oriented architectures. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user's request. Reducing latency and the cost to serve is quite important, but optimizing this service call graph is particularly challenging due to the volume of data and the graph's non-uniform and dynamic nature.
In this paper, we present a framework to detect hotspots in a service-oriented architecture. The framework is general, in that it can handle arbitrary objective functions. We show that finding the optimal set of hotspots for a metric such as latency is NP-complete, and we propose a greedy algorithm by relaxing some constraints. We use a pattern mining algorithm to rank hotspots based on their impact and consistency. Experiments on real-world service call graphs from LinkedIn, the largest online professional social network, show that our algorithm consistently outperforms baseline methods.
The skyline operator returns records in a dataset that provide optimal trade-offs of multiple dimensions. It is an expensive operator whose query performance can greatly benefit from materialization. However, a skyline can be executed over any subspace of dimensions, and the materialization of all subspace skylines, called the skycube, dramatically multiplies data size. Existing methods for skycube compression sacrifice too much query performance; so, we present a novel hashing- and bitstring-based compressed data structure that supports orders of magnitude faster query performance.
We formulate a problem that arises in unstructured enterprise information management, and has high commercial impact: retrieve knowledge-rich documents in a large textual collection of technical documents. We call such documents principal documents.
We exploit the properties of large sparse text collections in order to address this problem. It is known that the centroids of document clusters on such collections form so-called "concept vectors" for the collection. However, typically these centroids do not correspond to documents in the collection. How then should they be used for retrieving documents? An immediate approach is to collect documents that are closest to the centroid, which we call CTC. We also propose an algorithm called PrinDocs. The key insight behind PrinDocs is the following: replace distance functions by coverage. In other words, instead of finding the "closest" documents to a concept vector, find those that "cover" the concept vector. PrinDocs employs greedy weighted set covering and uses the concept decomposition offered by centroids, but does not use the cosine distance on documents.
We compare CTC and PrinDocs for retrieving knowledge-rich documents in enterprise unstructured technical collections. We demonstrate that PrinDocs comprehensively outperforms CTC. Our work suggests that coverage based approaches might be preferable to distance based ones for similar retrieval tasks.
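As an illustration of the coverage idea behind PrinDocs (not the authors' exact formulation), the sketch below runs greedy weighted set covering: documents are repeatedly chosen to maximize newly covered concept-vector terms per unit cost. The term sets, document weights, and cost model are hypothetical.

    def cover_concept(concept_terms, documents, weights):
        """Greedy weighted set cover: pick documents that together cover the terms of a
        concept vector, preferring cheap documents that cover many uncovered terms."""
        uncovered = set(concept_terms)
        chosen = []
        while uncovered:
            best, best_ratio = None, 0.0
            for doc_id, terms in documents.items():
                gain = len(uncovered & terms)
                if gain == 0 or doc_id in chosen:
                    continue
                ratio = gain / weights.get(doc_id, 1.0)   # coverage per unit cost
                if ratio > best_ratio:
                    best, best_ratio = doc_id, ratio
            if best is None:                              # remaining terms cannot be covered
                break
            chosen.append(best)
            uncovered -= documents[best]
        return chosen

    concept = {"index", "compression", "posting", "dictionary"}
    docs = {"d1": {"index", "posting"}, "d2": {"compression", "dictionary", "index"}, "d3": {"posting"}}
    print(cover_concept(concept, docs, weights={"d1": 1.0, "d2": 1.5, "d3": 0.5}))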
Semantic technologies aim to facilitate machine-to-machine communication and are attracting more and more interest from both academia and industry, especially in the emerging Internet of Things (IoT). In this paper, we consider large-scale information sharing scenarios among mobile objects in the IoT by leveraging semantic techniques. We propose to broadcast Linked Data on air in RDF format to allow simultaneous access to the information and to achieve better scalability. We introduce a novel air indexing method to reduce information access latency and energy consumption. To build the air indexes, we first map the RDF triples in the Linked Data to points in a 3D space and build B+-trees based on 3D Hilbert curve mappings of all the 3D points. We then convert these trees into linear sequences so that they can be broadcast over a wireless channel. A novel search algorithm is also designed to efficiently evaluate queries against the air indexes. Experiments show that our indexing method outperforms an air indexing method based on traditional 3D R-trees.
The Internet of Things (IoT) envisions smart objects collecting and sharing data at a global scale via the Internet. One challenging issue is how to disseminate data to relevant data consumers efficiently. In this paper, we leverage semantic technologies which can facilitate machine-to-machine communications, such as Linked Data, to build an efficient information dissemination system for semantic IoT. The system integrates Linked Data streams generated from various data collectors and disseminates matched data to relevant data consumers based on Basic Graph Patterns (BGPs) registered in the system by those consumers. To efficiently match BGPs against Linked Data streams, we introduce two types of matching, namely semantic matching and pattern matching, by considering whether the matching process supports semantic relatedness computation. Two new data structures, namely MVR-tree and TP-automata, are introduced to suit these types of matching respectively. Experiments show that an MVR-tree designed for semantic matching can achieve a twofold increase in throughput compared with the naive R-tree based method. TP-automata, as the first approach designed for pattern matching over Linked Data streams, also provides two to three orders of magnitude improvements on throughput compared with semantic matching approaches.
Trajectory data not only shows the location of users over a period of time but also reveals a high level of detail regarding their lifestyle, preferences and habits. Hence, it is highly susceptible to privacy concerns, and trajectory privacy has become a key research topic when sharing or exchanging trajectory datasets. Most existing studies focus on protecting trajectory data through obfuscating, anonymising or perturbing the data with the aim of maximizing user privacy. Although such approaches appear plausible, our work suggests that precise trajectory information can be inferred even from other sources of data. We consider the case in which a location service provider only shares users' POI query results with third parties instead of exchanging users' raw trajectory data, in order to preserve privacy. We develop an inference algorithm and show that it can effectively approximate the original trajectories using solely the POI query results.
Real-time entity resolution (ER) is the process of matching a query record in sub-second time with records in a database that represent the same real-world entity. To facilitate real-time matching on large databases, appropriate indexing approaches are required to reduce the search space. Most available indexing techniques are based on batch algorithms that work only with static databases and are not suitable for real-time ER. In this paper, we propose a forest-based sorted neighborhood index that uses multiple index trees with different sorting keys to facilitate real-time ER for read-mostly databases. Our technique aims to reduce the effect of errors and variations in attribute values on matching quality by building several distinct index trees. We conduct an experimental evaluation on two large real-world data sets and on multiple synthetic data sets with various data corruption rates. The results show that our approach is scalable to large databases and that using multiple trees gives a noticeable improvement in matching quality with only a small increase in query time. Our approach also achieves over an order of magnitude faster indexing and querying times, as well as higher matching accuracy, compared to another recently proposed real-time ER technique.
Research in cognitive science indicates that humans often use different criteria for route selection. An alternative type of spatial proximity search on road networks has recently been proposed to find the easiest-to-reach neighboring object, i.e., the one with the smallest navigation complexity. This paper presents an evaluation that compares the effectiveness of the easiest-to-reach neighbor query against the classic nearest neighbor query in a real-world setting. Our user study demonstrates the usability of the new spatial query type and suggests that travel distance is not always what people care about most. To provide the flexibility to accommodate different requirements, we also show how to achieve a tradeoff between navigation complexity and travel distance for advanced navigational assistance.
Privacy-preserving record linkage (PPRL) is the process of identifying records that correspond to the same real-world entities across several databases without revealing any sensitive information about these entities. Various techniques have been developed to tackle the problem of PPRL, with the majority of them only considering linking two databases. However, in many real-world applications data from more than two sources need to be linked. In this paper we consider the problem of linking data from three or more sources in an efficient and secure way. We propose a protocol that combines the use of Bloom filters, secure summation, and Dice coefficient similarity calculation with the aim to identify all records held by the different data sources that have a similarity above a certain threshold. Our protocol is secure in that no party learns any sensitive information about the other parties' data, but all parties learn which of their records have a high similarity with records held by the other parties. We evaluate our protocol on a large dataset showing the scalability, linkage quality, and privacy of our protocol.
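The similarity computation at the heart of the protocol can be sketched as follows, assuming (as is common in PPRL) that Bloom filters encode character bigrams; the secure summation among three or more parties is omitted, and the hash construction and parameters below are illustrative.

```python
# Hedged sketch: encode a name's bigrams into a Bloom filter (represented as
# the set of set-bit positions) and compute the Dice coefficient between two
# filters, the similarity that the protocol thresholds.
import hashlib

def bloom_encode(value, num_bits=100, num_hashes=2):
    bits = set()
    grams = [value[i:i + 2] for i in range(len(value) - 1)]
    for g in grams:
        for seed in range(num_hashes):
            h = hashlib.sha1(f"{seed}:{g}".encode()).hexdigest()
            bits.add(int(h, 16) % num_bits)
    return bits

def dice(b1, b2):
    if not b1 and not b2:
        return 1.0
    return 2.0 * len(b1 & b2) / (len(b1) + len(b2))

print(dice(bloom_encode("christine"), bloom_encode("christina")))
```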
RFID-based localization and tracking has promising potential. By combining localization with its identification capability, existing applications can be enhanced and new applications can be developed. In this paper, we investigate a tag-free indoor localization and tracking problem (e.g., people tracking) that does not require subjects to carry any tags or devices in a purely passive environment. We formulate localization as a classification task. In particular, we model the received signal strength indicator (RSSI) of passive tags using a multivariate Gaussian Mixture Model (GMM), and use the Expectation Maximization (EM) algorithm to learn the maximum likelihood estimates of the model parameters. Several other learning-based probabilistic approaches are also explored for the localization problem. To track a moving subject, we propose GMM-based Hidden Markov Model (HMM) and k Nearest Neighbor (kNN) based HMM approaches. We conduct extensive experiments in a testbed formed by passive RFID tags, and the experimental results demonstrate the effectiveness and accuracy of our approach.
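A minimal sketch of the classification formulation above, assuming scikit-learn: one Gaussian mixture is fitted to the RSSI vectors observed at each location, and a new reading is assigned to the location whose mixture gives the highest likelihood; the HMM-based tracking step is omitted and the synthetic data is not from the paper.

```python
# Hedged sketch: one GMM per zone over RSSI vectors; classify a new reading
# by maximum likelihood. Assumes scikit-learn is available.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_zone_models(rssi_by_zone, n_components=2, seed=0):
    models = {}
    for zone, readings in rssi_by_zone.items():
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        gmm.fit(np.asarray(readings))
        models[zone] = gmm
    return models

def localize(models, rssi_vector):
    x = np.asarray(rssi_vector).reshape(1, -1)
    return max(models, key=lambda z: models[z].score_samples(x)[0])

rng = np.random.default_rng(0)
data = {"zone_A": rng.normal(-50, 3, size=(40, 4)),
        "zone_B": rng.normal(-70, 3, size=(40, 4))}
models = train_zone_models(data)
print(localize(models, [-52, -49, -51, -50]))   # expected: zone_A
```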
We propose a root stemmer for the Modern Standard Arabic (MSA) language in an attempt to enhance the performance of Arabic Information Retrieval (AIR). The new Simple Arabic Stemmer (SAS) is based on the Quran morphology, since the Quran was a key source for the derivation of Arabic morphological rules. The stemmer is developed by decomposing all of the Quran words and studying their internal morphological structure including the roots, the patterns, and the affixes employed in the generation process. We were able to construct a relatively small lexicon capable of finding the root for most of the MSA vocabulary. Using the TREC corpus and queries, we test our approach against two well-known root stemmers, Khoja and Sebawai. The results show that SAS gives an improvement in terms of precision.
Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered.
We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words. We study how arbitrary phrase queries can be processed efficiently on such an augmented inverted index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm to its solution. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods.
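For reference, the baseline primitive that the augmented index accelerates can be sketched as positional-postings intersection; the documents below are toy examples, and the algorithm for choosing which multi-word sequences to index is not reproduced.

```python
# Hedged sketch of baseline phrase-query evaluation on a positional inverted
# index: intersect postings and check that positions are consecutive.
from collections import defaultdict

def build_positional_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    terms = phrase.lower().split()
    if any(t not in index for t in terms):
        return set()
    common = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for doc_id in common:
        starts = set(index[terms[0]][doc_id])
        for offset, term in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[term][doc_id]}
        if starts:
            hits.add(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "not to be outdone"}
index = build_positional_index(docs)
print(phrase_query(index, "to be or"))   # {1}
```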
The field of Cross-Language Information Retrieval (CLIR) addresses the problem of finding documents in some language that are relevant to a question posed in a different language. Retrieving answers to questions written using formal vocabulary from collections of informal documents, as with many types of social media, is a largely unexplored subfield of CLIR. Because formal and informal content are often intermingled, CLIR systems that excel at finding formal content may tend to select formal over informal content. To measure this effect, a test collection annotated for both relevance and informality is needed. This paper describes the development of a small test collection for this task, with questions posed in formal English and documents consisting of intermixed formal and informal Arabic. Experiments with this collection show that dialect classification can help to recognize informal content, thus improving precision. At the same time, the results indicate that neither dialect-tuned morphological analysis nor a lightweight CLIR approach that minimizes propagation of translation errors yet yields a reliable improvement in recall for informal content when compared to a straightforward document translation architecture.
The information retrieval (IR) community strives to make evaluation more centered on real users and their needs. The living labs evaluation paradigm, i.e., observing users in their natural task environments, offers great promise in this regard. Yet, progress in an academic setting has been limited. This paper presents the first living labs for IR community benchmarking campaign initiative, using two use-cases as tests: local domain search on a university website and product search on an e-commerce site. There are many challenges associated with this setting, including incorporating results from experimental search systems into live production systems, and obtaining sufficiently many impressions from relatively low-traffic sites. We propose that head queries can be used to generate result lists offline, which are then interleaved with results of the production system for live evaluation. An API is developed to orchestrate the communication between commercial parties and benchmark participants. This campaign advances the living labs methodology for IR evaluation, and offers important insight into the role of living labs in this space.
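One standard way to realize the interleaving step proposed above is team-draft interleaving, sketched below under the assumption that clicks on each team's contributions are compared afterwards; this is an illustration of the general technique, not necessarily the campaign's exact implementation.

```python
# Hedged sketch of team-draft interleaving of an experimental ranking (A)
# with the production ranking (B) for a head query.
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    rng = random.Random(seed)
    union = set(ranking_a) | set(ranking_b)
    interleaved, team_a, team_b = [], set(), set()
    while len(interleaved) < len(union):
        a_turn = (len(team_a) < len(team_b) or
                  (len(team_a) == len(team_b) and rng.random() < 0.5))
        ranking, team = (ranking_a, team_a) if a_turn else (ranking_b, team_b)
        # take the highest-ranked document not yet shown
        doc = next((d for d in ranking if d not in interleaved), None)
        if doc is None:
            # this team has no fresh documents left; let the other team pick
            ranking, team = (ranking_b, team_b) if a_turn else (ranking_a, team_a)
            doc = next(d for d in ranking if d not in interleaved)
        interleaved.append(doc)
        team.add(doc)
    return interleaved, team_a, team_b

print(team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"], seed=1))
```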
Advances in neural network language models have demonstrated that these models can effectively learn representations of word meaning. In this paper, we explore a variation of neural language models that learns from concepts taken from structured ontologies and extracted from free text, rather than directly from terms in free text.
This model is employed for the task of measuring semantic similarity between medical concepts, a task that is central to a number of techniques in medical informatics and information retrieval. The model is built with two medical corpora (journal abstracts and patient records) and empirically validated on two ground-truth datasets of human-judged concept pairs assessed by medical professionals. Empirically, our approach correlates closely with expert human assessors (≈0.9) and outperforms a number of state-of-the-art benchmarks for medical semantic similarity.
The demonstrated superiority of this model for providing an effective semantic similarity measure is promising in that this may translate into effectiveness gains for techniques in medical information retrieval and medical informatics (e.g., query expansion and literature-based discovery).
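As a rough illustration of the overall recipe above, the sketch below trains a gensim word2vec model (assumed here as a stand-in for the paper's neural/log-bilinear variant; gensim 4.x API) on sequences of ontology concepts and measures concept similarity by cosine similarity; the concept identifiers and corpus are purely illustrative.

```python
# Hedged sketch: embed ontology concepts detected in documents, then measure
# semantic similarity between two medical concepts via cosine similarity.
from gensim.models import Word2Vec

# each "sentence" is the sequence of concepts detected in one document
concept_corpus = [
    ["C0011849", "C0020538", "C0027051"],   # illustrative concept IDs
    ["C0011849", "C0027051", "C0011860"],
    ["C0020538", "C0027051", "C0004238"],
] * 50   # repeat the toy corpus so training has enough examples

model = Word2Vec(concept_corpus, vector_size=50, window=5,
                 min_count=1, sg=1, epochs=20, seed=0)
print(model.wv.similarity("C0011849", "C0020538"))
```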
Can we effectively influence aggregate user behavior in a cluster based retrieval (CBR) system by tuning its parameters? This question combines parameter tuning with models of user behavior. To address this question, we propose an approach based on three components: user model, criterion metric, and sensitivity analysis. We then demonstrate this approach on one of the most frequently asked questions to designers and operators of CBR systems in enterprises: namely, "suggest a value for k." Both the users and the system desire a value that is likely to maximize user satisfaction, and sway them towards a cluster based examination of their retrieved result set (rather than prefer the original ranked retrieved list). Based on observed user behavior in CBR systems, we posit a two-stage user model. We isolate its core element, which is a "query coverage metric." We then perform an empirical sensitivity analysis of this metric. Our analysis reveals that this metric is, surprisingly, robust to changes in k (i.e., insensitive to k) in a wide range around its de-facto value. We conclude that in cases where our model approximates user behavior, the system cannot substantially increase the chances of the user resorting to CBR by tuning k. This has practical implications on the design and day-to-day operation of CBR systems. Similar analyses can be carried out for other parameters.
Suggesting venues to a user in a given geographic context is an emerging task that is currently attracting a lot of attention. Existing studies in the literature consist of approaches that rank candidate venues based on different features of the venues and the user, which either focus on modeling the preferences of the user or the quality of the venue. However, while providing insightful results and conclusions, none of these studies have explored the relative effectiveness of these different features. In this paper, we explore a variety of user-dependent and venue-dependent features and apply state-of-the-art learning to rank approaches to the problem of contextual suggestion in order to find what makes a venue relevant for a given context. Using the test collection of the TREC 2013 Contextual Suggestion track, we perform a number of experiments to evaluate our approach. Our results suggest that a learning to rank technique can significantly outperform a Language Modelling baseline that models the positive and negative preferences of the user. Moreover, despite the fact that the contextual suggestion task is a personalisation task (i.e. providing the user with personalised suggestions of venues), we surprisingly find that user-dependent features are less effective than venue-dependent features for estimating the relevance of a suggestion.
Modern relevance models consider a wide range of criteria in order to identify those documents that are expected to satisfy the user's information need. With growing dimensionality of the underlying relevance spaces the need for sophisticated score combination and estimation schemes arises. In this paper, we investigate the use of copulas, a model family from the domain of robust statistics, for the formal estimation of the probability of relevance in high-dimensional spaces. Our experiments are based on the MSLR-WEB10K and WEB30K datasets, two annotated, publicly available samples of hundreds of thousands of real Web search impressions, and suggest that copulas can significantly outperform linear combination models for high-dimensional problems. Our models achieved a performance on par with that of state-of-the-art machine learning approaches.
We investigate how time intervals of interest to a query can be identified automatically based on pseudo-relevant documents, taking into account both their publication dates and temporal expressions from their contents. Our approach is based on a generative model and is able to determine time intervals at different temporal granularities (e.g., day, month, or year). We evaluate our approach on twenty years' worth of newspaper articles from The New York Times using two novel testbeds consisting of temporally unambiguous and temporally ambiguous queries, respectively.
Over the past years, Twitter has earned a growing reputation as a hub for communication and for advertising and tracking events. However, several recent research studies have shown that Twitter users (and microblogging platforms' users in general) are increasingly posting microblogs containing questions seeking answers from their readers. To help those users answer or route their questions, the problem of question identification in tweets has been studied over English tweets; to the best of our knowledge, no study has attempted it over Arabic (not to mention dialectal Arabic) tweets.
In this paper, we tackle the problem of identifying answer-seeking questions in different dialects over a large collection of Arabic tweets. Our approach has two stages. We first use a rule-based filter to extract tweets with interrogative questions. We then apply a binary classifier (trained using a carefully-developed set of features) to detect tweets with answer-seeking questions. To evaluate the classifier, we used a set of randomly-sampled dialectal Arabic tweets that were labeled using crowdsourcing. Our approach achieves relatively good performance for a first study of this problem in the Arabic domain, exhibiting 64% recall with 80% precision in identifying tweets with answer-seeking questions.
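The two-stage shape of the pipeline can be sketched as below, assuming scikit-learn; the interrogative rules, features, and training examples are simplified English placeholders rather than the carefully developed Arabic dialect resources used in the paper.

```python
# Hedged sketch: rule-based interrogative filter followed by a binary
# classifier for answer-seeking questions. Toy data for illustration only.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

INTERROGATIVE = re.compile(r"\?|^(how|what|why|when|where|who)\b", re.I)

def stage1_filter(tweets):
    """Keep only tweets that look interrogative."""
    return [t for t in tweets if INTERROGATIVE.search(t)]

# stage 2: classify interrogative tweets as answer-seeking or not
train_texts = ["anyone know how to renew a passport?",
               "why do i even bother?",
               "what time does the match start?",
               "who cares?"]
train_labels = [1, 0, 1, 0]    # 1 = answer-seeking (e.g., crowdsourced labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

candidates = stage1_filter(["what's the best cafe near campus?",
                            "ugh why me?", "great game today"])
print(list(zip(candidates, clf.predict(candidates))))
```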
Past work showed that significant inconsistencies between retrieval results occurred on different test collections, even when one of the test collections contained only a subset of the documents in the other. However, the experimental methodologies in that paper made it hard to determine the cause of the inconsistencies. Using a novel methodology that eliminates the problems with uneven distribution of relevant documents, we confirm that observing a statistically significant improvement between two IR systems can be strongly influenced by the choice of documents in the test collection. We investigate two possible causes of this problem of test collections. Our results show that collection size and document source have a strong influence in the way that a test collection will rank one retrieval system relative to another. This is of particular interest when constructing test collections, as we show that using different subsets of a collection produces differing evaluation results.
Current movie title retrieval systems, such as IMDb, mainly focus on utilizing structured or semi-structured data. However, user queries for searching a movie title are often based on the movie plot, rather than its metadata. As a solution to this problem, our movie title retrieval model proposes a new way of elaborately utilizing associative relations between multiple key terms that exist in the movie plot, in order to improve search performance when users enter more than one keyword. More specifically, the proposed model exploits associative networks of key terms, called knowledge structures, derived from the movie plots. Using the search query terms entered by Amazon Mechanical Turk users as the gold standard, experiments were conducted to compare the proposed retrieval model with existing state-of-the-art retrieval models. The experimental results show that the proposed retrieval model consistently outperforms the baseline models. The findings have practical implications for semantic search of movie titles in particular, and of online entertainment content in general.
Due to its ability to preserve semantic similarity in Hamming space, supervised hashing has been extensively studied recently. Most existing approaches encourage two dissimilar samples to have maximum Hamming distance. This may lead to an unexpected consequence: two unnecessarily similar samples can receive the same code if they are both dissimilar to a third sample. Besides, in existing methods, all labeled pairs are treated with equal importance without considering the semantic gap, which prevents the supervised information from being fully leveraged. We present a general framework for supervised hashing that addresses the above two limitations. We do not strictly require a dissimilar pair to have maximum Hamming distance. Instead, a soft constraint, which can be viewed as a regularization to avoid over-fitting, is utilized. Moreover, we impose different weights on different training pairs, and these weights can be automatically adjusted in the learning process. Experiments on two benchmarks show that the proposed method easily outperforms other state-of-the-art methods.
Multi-label classification is a supervised learning task in which an instance may be assigned multiple categories (labels) simultaneously. Recently, a method called Probabilistic Classifier Chain (PCC) was proposed with numerous appealing properties, such as conceptual simplicity, flexibility, and theoretical justification. Nevertheless, PCC suffers from high inference complexity. To address this problem, we propose a novel inference method based on Gibbs sampling, together with an acceleration scheme that speeds it up further. Our method rests on the observation that PCC is a special case of a Bayesian network, which may inspire further inference algorithms for PCC. Experiments with real-world data sets show the effectiveness of our proposed method.
Ranking plays an important role in information retrieval systems. In recent years, a line of research known as 'learning to rank', which applies machine learning techniques to ranking problems, has become increasingly popular. Many learning-to-rank models have been proposed, such as Regression, RankNet, and ListNet. Inspired by this, we propose a novel learning to rank algorithm named GPQ, in which genetic programming is employed to directly optimize the Q-measure evaluation metric. Experimental results on the OHSUMED benchmark dataset indicate that GPQ is competitive with Ranking SVM, SVMMAP and ListNet, and improves ranking accuracy.
Pseudo-relevance feedback (PRF) has proven to be an effective strategy for improving retrieval accuracy. In this paper, we revisit a PRF method based on statistical language models, namely the divergence minimization model (DMM). DMM not only has an apparently sound theoretical foundation, but has also been shown to satisfy most of the retrieval constraints. However, it turns out to perform surprisingly poorly in many previous experiments. We investigate the cause, and reveal that DMM inappropriately tackles the entropy of the feedback model, which generates highly skewed feedback models. To address this problem, we propose a maximum-entropy divergence minimization model (MEDMM) by introducing an entropy term to regularize DMM. Our experiments on various TREC collections demonstrate that MEDMM not only works much better than DMM, but also outperforms several other state-of-the-art PRF methods, especially on web collections. Moreover, unlike existing PRF models that have to be combined with the original query to perform well, MEDMM can work effectively even without being combined with the original query.
Today's web search systems present users with heterogeneous information coming from sources of different types, also known as verticals. Evaluating such systems is an important but complex task, which is still far from being solved. In this paper we examine the hypothesis that the use of models that capture user search behavior on heterogeneous result pages helps to improve the quality of offline metrics. We propose two vertical-aware metrics based on user click models for federated search and evaluate them using query logs of the Yandex search engine. We show that depending on the type of vertical, the proposed metrics have higher correlation with online user behavior than other state-of-the-art techniques.
Accurate estimation of query aspect weights is an important issue to improve the performance of explicit search result diversification algorithms. For the first time in the literature, we propose using post-retrieval query performance predictors (QPPs) to estimate, for each aspect, the retrieval effectiveness on the candidate document set, and leverage these estimations to set the aspect weights. In addition to utilizing well-known QPPs from the literature, we also introduce three new QPPs that are based on score distributions and hence, can be employed for online query processing in real-life search engines. Our exhaustive experiments reveal that using QPPs for aspect weighting improves almost all state-of-the-art diversification algorithms in comparison to using a uniform weight estimator. Furthermore, the proposed QPPs are comparable or superior to the existing predictors in the context of aspect weighting.
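A hedged sketch of how QPP-based weights can feed a diversification algorithm: aspect weights are derived from the standard deviation of each aspect's top-k scores (one illustrative score-distribution predictor, not necessarily the paper's) and plugged into xQuAD-style greedy selection.

```python
# Hedged sketch: score-distribution QPP for aspect weighting + xQuAD greedy
# selection. Scores and parameters are toy values.
import statistics

def qpp_aspect_weights(aspect_scores, k=5):
    raw = {a: statistics.pstdev(sorted(s.values(), reverse=True)[:k])
           for a, s in aspect_scores.items()}
    total = sum(raw.values()) or 1.0
    return {a: v / total for a, v in raw.items()}

def xquad(query_scores, aspect_scores, weights, n=3, lam=0.5):
    selected, remaining = [], set(query_scores)
    not_covered = {a: 1.0 for a in aspect_scores}      # product of (1 - rel)
    while remaining and len(selected) < n:
        def gain(d):
            div = sum(weights[a] * aspect_scores[a].get(d, 0.0) * not_covered[a]
                      for a in aspect_scores)
            return (1 - lam) * query_scores[d] + lam * div
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
        for a in aspect_scores:
            not_covered[a] *= 1 - aspect_scores[a].get(best, 0.0)
    return selected

query_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6}
aspect_scores = {"a1": {"d1": 0.9, "d2": 0.8, "d4": 0.1},
                 "a2": {"d3": 0.9, "d4": 0.7, "d1": 0.1}}
weights = qpp_aspect_weights(aspect_scores)
print(xquad(query_scores, aspect_scores, weights))
```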
A major challenge in Cross-Language Information Retrieval (CLIR) is the adoption of translation knowledge in retrieval models, as it affects the term weighting which is known to highly impact the retrieval performance. In this paper, we present an analytical study of using translation knowledge in CLIR. In particular, by adopting axiomatic analysis framework, we formulate the impacts of translation knowledge on document ranking as constraints that any cross-language retrieval model should satisfy. We then consider the state-of-the-art CLIR methods and check whether they satisfy these constraints. Finally, we show through empirical evaluation that violating one of the constraints harms the retrieval performance significantly which calls for further investigation.
We report a preliminary study of mobile Web behaviour in a large indoor retail space. By analysing a Web log collected over a one-year period at an inner-city shopping mall in Sydney, Australia, we found that 1) around 60% of registered Wi-Fi users actively browse the Internet and the remaining 40% do not, with around 10% of these users using Web search engines, and around 70% of this Web activity in the investigated mall comes from frequent visitors; 2) the content that indoor users search for is different from the content they consume while browsing; 3) the popularity of future indoor search queries can be predicted with a simple theoretical model based on past queries treated as a weighted directed graph. The work described in this paper underpins applications such as the prediction of users' information needs, retail recommendation systems, and improving the mobile Web search experience.
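Since the abstract does not spell out the prediction model, the sketch below is only an illustrative stand-in: past queries are treated as a weighted directed graph over consecutive queries within a session, and a query's future popularity is scored from its past frequency and weighted in-degree.

```python
# Hedged sketch: query-transition graph from sessions, with a simple
# frequency + weighted in-degree popularity score (illustrative only).
from collections import Counter, defaultdict

def build_query_graph(sessions):
    freq, edges = Counter(), defaultdict(Counter)
    for session in sessions:
        freq.update(session)
        for prev, nxt in zip(session, session[1:]):
            edges[prev][nxt] += 1
    return freq, edges

def predict_popularity(freq, edges, alpha=0.5):
    in_weight = Counter()
    for _, outgoing in edges.items():
        for nxt, w in outgoing.items():
            in_weight[nxt] += w
    return {q: alpha * freq[q] + (1 - alpha) * in_weight[q] for q in freq}

sessions = [["food court", "sushi", "parking"],
            ["sushi", "dessert"],
            ["parking", "food court", "sushi"]]
freq, edges = build_query_graph(sessions)
print(sorted(predict_popularity(freq, edges).items(), key=lambda x: -x[1]))
```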
Given a current news article, we wish to create a succinct query reflecting its content, which may be used to follow the news story over a period of days, or even weeks. In part, the need for succinct queries is occasioned by limitations of commercial social media search engines, which can perform poorly with longer queries. We start by applying established key phrase extraction methods to the article, creating an initial set of candidate query terms. We then generate a series of probe queries, each a subset of these candidate terms, which we apply to search current social media streams. By analyzing the results of these probes, we rank and trim the candidate set to create a succinct query. We present an experimental study of this method based on a collection of news articles taken from March-April 2014, with the resulting succinct queries used to re-query social media one week later.
Canonical correlation analysis (CCA) has been extensively employed in various real-world applications of multi-label annotation. However, the classical CCA raises two major challenges. First, CCA frequently fails to remove noisy and irrelevant features. Second, CCA cannot effectively capture correlations between multiple labels, which are especially beneficial for multi-label learning. In this paper, we propose a novel framework that integrates joint sparsity and a low-rank shared subspace into the least-squares formulation of CCA. Under this framework, multiple label interactions can be uncovered by the shared structure of the input features, and a few highly discriminative features can be selected via a structured sparsity-inducing norm. Owing to the inclusion of the non-smooth row sparsity, a new efficient iterative algorithm is derived with proven convergence. Empirical studies on several popular web image and movie data collections consistently demonstrate the effectiveness of our new formulation in comparison with competing algorithms.
Query Performance prediction aims to evaluate the effectiveness of the results returned by a search system in response to a query without any relevance information. In this paper, we propose a method that considers both magnitude and variance of scores of the ranked list of results to measure the performance of a query. Using six different TREC test sets, we compare our predictor with three of the state-of-the-art techniques. The experimental results show that our method is very competitive. Pairwise comparisons with each of the three other methods show that our predictor performs better in more data sets.
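A minimal sketch of a predictor combining score magnitude and variance; the weighted sum and parameters below are purely illustrative, as the abstract does not give the exact combination used in the paper.

```python
# Hedged sketch: post-retrieval predictor from the mean and standard
# deviation of the top-k retrieval scores (illustrative combination).
import statistics

def score_based_qpp(scores, k=50, alpha=0.5):
    top = sorted(scores, reverse=True)[:k]
    return alpha * statistics.mean(top) + (1 - alpha) * statistics.pstdev(top)

easy_query = [12.1, 11.8, 9.0, 4.2, 4.0, 3.9]     # clear head of strong scores
hard_query = [4.4, 4.3, 4.3, 4.2, 4.2, 4.1]       # flat, indistinct scores
print(score_based_qpp(easy_query), score_based_qpp(hard_query))
```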
Incorporating semantic information into document representation is effective and potentially significant for improving retrieval performance. Recently, the log-bilinear language model (LBL), a form of neural language model, has proved to be an effective way to learn semantic word representations, but its feasibility and effectiveness in information retrieval are largely unknown. In this paper, we study how to efficiently use LBL to improve ad-hoc retrieval. We propose a log-bilinear document language model (LB-DM) within the language modeling framework. The key idea is to learn semantically oriented representations for words, and estimate document language models based on these representations. Noise-contrastive estimation is employed to perform fast training on large document collections. Experimental results on standard TREC collections show that LB-DM performs better than the translation language model and the LDA-based retrieval model.
Similarity search, or finding approximate nearest neighbors, is an important technique in various large-scale information retrieval applications such as document retrieval. Much recent research demonstrates that hashing methods can achieve promising results for large-scale similarity search due to their computational and memory efficiency. However, most existing hashing methods ignore the hidden semantic structure of documents and only use keyword features (e.g., tf-idf) when learning hashing codes. This paper proposes a novel sparse semantic hashing (SpSH) approach that explores the hidden semantic representation of documents in learning their corresponding hashing codes. In particular, a unified framework is designed that captures the hidden semantic structure among the documents with a sparse coding model, while at the same time preserving document similarity via a graph Laplacian. An iterative coordinate descent procedure is then proposed for solving the optimization problem. Extensive experiments on two large-scale datasets demonstrate the superior performance of the proposed approach over several state-of-the-art hashing methods.
With the portable, high-quality cameras built into mobile phones, people now prefer to take photos and upload them from their phones. Content-based image retrieval is an effective way for users to obtain relevant information about a photo. Taking limited bandwidth and connection instability into account, we propose an effective, scalable mobile image retrieval approach in this paper. The proposed algorithm first determines the relevant photos according to visual similarity on the mobile side, then mines salient visual words by exploring saliency across multiple relevant images, and finally determines the contribution order of the salient visual words for scalable retrieval. On the server side, spatial verification is performed to re-rank the results. Compared to existing approaches to mobile image retrieval, our approach transmits less data and reduces the computational cost of spatial verification. Most importantly, when bandwidth is limited, we can transmit a subset of the features according to their contributions to retrieval. Experimental results show the effectiveness of the proposed approach.
Lack of high quality relevance labels is a common challenge in the early stage of search engine development. In media search, due to the high recruiting and training cost, the labeling process is usually conducted by a small number of human judges. Consequently, the generated labels are often limited and biased. On the contrary, the click data that is extracted from a large population of real users is massive and less biased. However, the click data also contains considerable noise. Therefore, more and more researchers have begun to focus on combining those two resources to generate a better ground-truth approximation. In this paper, we present a novel method of generating the relevance labels for media search. The method is based on a generative model that considers human judgment, position, and click status as observations generated from a hidden relevance with multinomial prior. The model considers the position bias with a requirement that the click status depends on both the hidden relevance and the position. We infer the model parameters by using a Gibbs sampling procedure with hyper-parameter optimization. From experiments on the Xbox's data, the newly inferred relevance labels significantly increase the data volume for ranker training and have demonstrated superior performance compared to using the limited human labels only, the click-through-rates only, and the heuristic combination of the two.
Recent research has shown that improving mean retrieval effectiveness (e.g., MAP) may sacrifice retrieval stability across queries, implying a tradeoff between effectiveness and stability. The evaluation of both effectiveness and stability is often based on a baseline model, which could be weak or biased. In addition, the effectiveness-stability tradeoff has not been systematically or quantitatively evaluated over TREC participating systems. These two problems, to some extent, limit our awareness of this tradeoff and its impact on developing future IR models. In this paper, motivated by a recently proposed bias-variance based evaluation, we adopt a strong and unbiased "baseline", which is a virtual target model constructed from the best performance (for each query) among all the participating systems in a retrieval task. We also propose generalized bias-variance metrics, based on which a systematic and quantitative evaluation of the effectiveness-stability tradeoff is carried out over the participating systems in the TREC Ad-hoc Track (1993-1999) and Web Track (2010-2012). We observe a clear effectiveness-stability tradeoff, which has tended to become more pronounced in recent years. This implies that as we have pursued more effective IR systems over the years, stability has become problematic and may have been largely overlooked.
Selecting and aggregating different types of content from multiple vertical search engines is becoming popular in web search. The user vertical intent, the verticals the user expects to be relevant for a particular information need, might not correspond to the vertical collection relevance, the verticals containing the most relevant content. In this work we propose different approaches to define the set of relevant verticals based on document judgments. We correlate the collection-based relevant verticals obtained from these approaches to the real user vertical intent, and show that they can be aligned relatively well. The set of relevant verticals defined by those approaches could therefore serve as an approximate but reliable ground-truth for evaluating vertical selection, avoiding the need for collecting explicit user vertical intent, and vice versa.
In a multi-document setting, graph-based extractive summarization approaches build a similarity graph over the sentences in each cluster of documents and then use graph centrality measures to assess the importance of sentences, where similarity is computed between each pair of sentences. However, it is not clear whether such an approach captures high-order relations among more than two sentences or can distinguish sentences that are descriptive of the cluster from those that describe other clusters. In this paper, we propose to model sentences as hyperedges and words as vertices in a hypergraph, and to combine it with topic signatures to differentiate between descriptive and non-descriptive sentences. To rank sentences, we propose a new random walk over hyperedges that prefers descriptive sentences of the cluster when measuring their centrality scores. Our approach outperforms a number of baselines on the DUC 2001 dataset under the ROUGE metric.
Determining cascade size and the factors affecting cascade size are two fundamental research problems in social network analysis. The commonly considered independent cascade model, when applied to social networks such as Digg, produces a phase-transition phenomenon where the cascade is either very small or very large. This phenomenon can be explained based on the concept of the Giant Propagation Component (GPC). The GPC is defined as a maximally connected component such that, under the independent cascade model, once any node of the component is infected, most of the remaining nodes in the component will eventually become infected with high probability. While the GPC exists in social networks, the phase-transition phenomenon is not observed in the actual cascade size distribution when the information propagation is due to actions such as "like" or "digg".
This paper hypothesizes that in the cascade process, the likelihood of a node being infected changes over time and depends on how far away the node is from the seed. Furthermore, each node is not considered for infection independently by each of its infected friends, because the chance of information propagation through "like" or "digg" does not necessarily increase when more friends like/digg the information. To this end, we develop and simulate a new non-independent infection cascade process. The experimental results show that the proposed cascade process generates a power-law-like cascade size distribution without phase transition, which much more closely resembles the real-world cascade size distribution observed in the Digg social network.
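The flavour of such a non-independent process can be sketched as a simulation in which a node's infection probability saturates with the number of infected friends; the saturation function, graph, and parameters below are illustrative assumptions, not the paper's calibrated model.

```python
# Hedged sketch of a non-independent cascade: infection probability grows
# sublinearly with the number of infected friends and levels off at a cap.
import random
from collections import deque

def saturating_prob(num_infected_friends, base=0.2, cap=0.5):
    # grows with the number of infected friends but levels off at `cap`
    return min(cap, base * (1 + 0.3 * (num_infected_friends - 1)))

def simulate_cascade(graph, seed, rng):
    infected, frontier = {seed}, deque([seed])
    while frontier:
        node = frontier.popleft()
        for nbr in graph[node]:
            if nbr in infected:
                continue
            k = sum(1 for f in graph[nbr] if f in infected)
            if rng.random() < saturating_prob(k):
                infected.add(nbr)
                frontier.append(nbr)
    return len(infected)

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3]}
rng = random.Random(7)
sizes = [simulate_cascade(graph, seed=0, rng=rng) for _ in range(1000)]
print(sum(sizes) / len(sizes))   # average cascade size over many runs
```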
Even today, most interfaces to document clustering present clusters to users and applications as a "bag of descriptive terms", a technique that was proposed two decades ago. Consequently, users and applications are not able to obtain the sophisticated structural knowledge that lies hidden in document clusters. In particular, structural information about the interaction of concepts that the cluster speaks about is completely missing from the bag-of-terms presentation. As the needs of unstructured information management increase, this shortcoming is coming into sharper focus. To address it, we propose a rich representation of document clusters that surfaces the concept interactions within a cluster. We show that these interactions give a "shape" to the cluster. This "shape" is conveniently captured using a directed, colored, vertex-weighted graph, called the shape graph or, simply, the shape of the cluster. We show that shapes convey important structural information about document clusters, and can be computed efficiently.
For online reviews, sentiment explanations refer to the sentences that may suggest detailed reasons of sentiment, which are very important for applications in review mining like opinion summarization. In this paper, we address the problem of ranking sentiment explanations by formulating the process as two subproblems: sentence informativeness ranking and structural sentiment analysis. Tractable inference in joint prediction is performed through dual decomposition. Preliminary experiments on publicly available data demonstrate that our approach obtains promising performance.
It has been shown, both theoretically and empirically, that reasoning about large and expressive ontologies is computationally hard. Moreover, due to the different reasoning algorithms and optimisation techniques employed, each reasoner may be efficient for ontologies with different characteristics. Based on recently developed reasoning-performance prediction models for various reasoners, we present our work in developing a meta-reasoner that automatically selects from a number of state-of-the-art OWL reasoners to achieve optimal efficiency. Our preliminary evaluation shows that the meta-reasoner significantly and consistently outperforms 6 state-of-the-art reasoners and achieves performance close to that of the hypothetical gold-standard reasoner.
Semantic spaces, a useful learning framework for lexical resources, are typically treated as black boxes and applied using geometric and linear algebraic processing tools. We have found that topological methods are useful for exploring the makeup of a semantic space.
Sentiment word identification (SWI) is of high relevance to sentiment analysis technologies and applications. Currently, most SWI methods rely heavily on sentiment seed words, which carry limited sentiment information. Although non-seed approaches based on document-level sentiment labels have emerged, they do not fully consider context information. In this paper, we propose a novel non-seed model called CONR for SWI, based on matrix factorization with co-occurrence neighbor regularization derived from context. Instead of seed words, CONR exploits two important factors for sentiment word identification: sentiment matching and sentiment consistency. Experimental results on four publicly available datasets show that CONR outperforms state-of-the-art methods.
In this research we propose to derive new features based on data samples' local information, with the aim of improving the performance of general supervised learning algorithms. The creation of new features is inspired by the measure of average precision, which is known in information retrieval to be a robust measure that is insensitive to the number of retrieved items. We use the idea of average precision to weight the neighbours of an instance and show that this weighting strategy is insensitive to the number of neighbours in the locality. Information captured in the new features allows a general classifier to learn additional useful peripheral knowledge that is helpful in building effective classification models. We comprehensively evaluate our method on real datasets and the results show substantial improvements in the performance of classifiers including SVM, Bayesian networks, random forest, and C4.5.
With the boom of online social networks, social trust has been used to cluster users in recommender systems, and integrating trust communities into memory-based collaborative filtering algorithms has been shown to improve recommendation accuracy. However, existing trust community mining methods only consider trust relationships and disregard distrust information. In this paper, considering both trust and distrust relationships, an SVD-sign-based community mining method is proposed to process the trust relationship matrix in order to discover the trust communities. A modified trust metric that considers a given user's expertise level in a community is presented to obtain indirect trust values between users. Then, some missing ratings of the given user are complemented by the weighted average preference of his/her trusted neighbors selected from the same community during the random walk procedures. Finally, the prediction for a given item is generated by conventional collaborative filtering. Comparative experiments on the Epinions data set demonstrate that our approach outperforms other state-of-the-art methods in terms of RMSE and RC.
Kernel SVM is prohibitively expensive when dealing with large nonlinear data. While ensembles of linear classifiers have been proposed to address this inefficiency, these methods are time-consuming or lack robustness. We propose an efficient classifier for nonlinear data using a new iterative learning algorithm, which partitions the data into clusters, and then trains a linear SVM for each cluster. These two steps are combined into a graphical model, with the parameters estimated efficiently using the EM algorithm. During training, clustered multi-task learning is used to capture the relatedness among the multiple linear SVMs and avoid overfitting. Experimental results on benchmark datasets show that our method outperforms state-of-the-art methods. During prediction, it also obtains comparable classification performance to kernel SVM, with much higher efficiency.
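A simplified sketch of the cluster-then-linear-SVM idea, assuming scikit-learn: k-means partitions the data, one linear SVM is trained per cluster, and a test point is routed to the SVM of its nearest centroid; the joint EM training and clustered multi-task regularization of the proposed method are omitted.

```python
# Hedged sketch: partition nonlinear data with k-means, train one linear SVM
# per cluster, and route test points by nearest centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=600, noise=0.2, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
svms = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    if len(set(y[mask])) > 1:                      # need both classes to train
        svms[c] = LinearSVC(C=1.0).fit(X[mask], y[mask])

def predict(x):
    c = int(kmeans.predict(np.asarray(x).reshape(1, -1))[0])
    if c in svms:
        return int(svms[c].predict(np.asarray(x).reshape(1, -1))[0])
    # single-class cluster: fall back to its majority label
    return int(np.bincount(y[kmeans.labels_ == c]).argmax())

print(np.mean([predict(x) == label for x, label in zip(X, y)]))
```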
With the prevalence of the Web and social media, users increasingly express their preferences online. In learning these preferences, recommender systems need to balance the trade-off between exploitation, by providing users with more of the "same", and exploration, by providing users with something "new" so as to expand the systems' knowledge. Multi-armed bandit (MAB) is a framework to balance this trade-off. Most of the previous work in MAB either models a single bandit for the whole population, or one bandit for each user. We propose an algorithm to divide the population of users into multiple clusters, and to customize the bandits to each cluster. This clustering is dynamic, i.e., users can switch from one cluster to another, as their preferences change. We evaluate the proposed algorithm on two real-life datasets.
Unlabeled high-dimensional text-image web news data are produced every day, presenting new challenges to unsupervised feature selection on multi-view data. State-of-the-art multi-view unsupervised feature selection methods learn pseudo class labels by spectral analysis, which is sensitive to the choice of similarity metric for each view. For text-image data, the raw text itself contains more discriminative information than a similarity graph, which loses information during construction; the text features can therefore be used directly for label learning, avoiding the information loss incurred by spectral analysis. We propose a new multi-view unsupervised feature selection method in which image local learning regularized orthogonal nonnegative matrix factorization is used to learn pseudo labels, while robust joint $l_{2,1}$-norm minimization is simultaneously performed to select discriminative features, so that cross-view consensus on pseudo labels can be obtained as much as possible. We systematically evaluate the proposed method on multi-view text-image web news datasets crawled from two major US media channels, CNN and FOXNews; our extensive experiments demonstrate the efficacy of the new method over state-of-the-art multi-view and single-view unsupervised feature selection methods.
Recent business studies have shown that social technologies can significantly improve productivity within enterprises by improving access to information, ideas, and collaborators. A manifestation of the growing adoption of enterprise social technologies is the increasing use of enterprise virtual discussions to engage customers and employees. In this paper we present an enterprise discussion analysis system which seeks to enable rapid interactive inference of insights from virtual online enterprise discussions. Rapid understanding is facilitated by extracting a hierarchy of key concepts, which represent a multi-faceted thematic categorization of discussion content, and by identifying high-quality thematic exemplar comments. The concept hierarchy and exemplar comments are presented through an intuitive web user-interface which allows an analyst to quickly navigate through the main concepts and the most relevant comments extracted from the discussion. We present a preliminary validation of system efficacy through user surveys provided to test users.
Medical knowledge extraction has great potential to improve the quality of treatment in hospitals. In this paper, we propose a clinical problem-action relation extraction method based on clinical semantic units and event causality patterns, which presents a chronological view of a patient's problems and a physician's actions. Based on our observations, a clinical semantic unit is defined as a unit of conceptual medical knowledge about a problem and/or action. Since a clinical event is the basic concept of the problem-action relation, events are first detected from clinical texts using conditional random fields. A clinical semantic unit is then segmented from a sentence based on time expressions and the inherent structure of events, and classified into a problem and/or action relation using support vector machines with event causality features. Experimental results on a Korean medical collection show an F-measure of 78.8% when gold-standard clinical events are given. This result shows that the proposed method is effective for extracting clinical problem-action relations.
Identifying user tasks from query logs has garnered considerable interest from the research community lately. Several approaches have been proposed to extract tasks from search sessions. Current approaches segment a user session into disjoint tasks using features extracted from query, session or clicked document text. However, user tasks are more often than not entity-centric, and text-based features do not exploit entities directly for task extraction. In this work, we explore entity-specific task extraction from search logs. We evaluate the quality of the extracted tasks with Session track data. Empirical evaluation shows that terms associated with entity-oriented tasks can not only be used to predict terms in user sessions but also improve retrieval when used for query expansion.
Retail transaction data conveys rich preference information on brands and goods from customers. How to mine the transaction data to provide personalized recommendation to customers becomes a critical task for retailers. Previous recommendation methods either focus on the user-product matrix and ignore the transactions, or only use the partial information of transactions, leading to inferior performance in recommendation. Inspired by association rule mining, we introduce association pattern as a basic unit to capture the correlation between products from both intra- and intertransactions. A Probabilistic model over the Association Patterns (PAP for short) is then employed to learn the potential shopping interests and also to provide personalized recommendations. Experimental results on two real world retail data sets show that our proposed method can outperform the state-of-the-art recommendation methods.
MOOCs attract diverse users with varying habits. Identifying those patterns through clickstream analysis could enable more effective personalized support for student information seeking and learning in that online context. We propose a novel method to characterize types of sessions in MOOCs by mining the habitual behaviors of students within individual sessions. We model learning sessions as a distribution of activities and activity sequences with a topical N-gram model. The representation offers insights into what groupings of habitual student behaviors are associated with higher or lower success in the course. We also investigate how context information, such as time of day or a user's demographic information, is associated with the types of learning sessions.
A recent area in which recommender systems have shown their value is in online discussion forums and question-answer sites. Earlier work in this space has focused on the problem of matching participants to opportunities but has not adequately addressed the problem that in these social contexts, multiple dimensions of constraints must be satisfied, including limitations on capacity and minimal requirements for expertise. In this work, we propose such a constrained question recommendation problem with load balance constraints in discussion forums and use flow based model to generate the optimal solution. In particular, to address the introduced computation complexity, we investigate the concept of submodularity of the objective function and propose a specific submodular method to give an approximated solution. We present experiments conducted on two Massive Open Online Course (MOOC) discussion forum datasets, and demonstrate the effectiveness and efficiency of our submodular method in solving constrained question recommendation tasks.
Previous work on one-class collaborative filtering (OCCF) includes pointwise, pairwise, and content-based methods. The fundamental assumptions made by these approaches are roughly the same: they regard all missing values as negative. However, this is unreasonable, since the missing values are actually a mixture of negative and positive examples; a user may fail to give positive feedback on an item simply because she/he is unaware of it, even though she/he would in fact like it. Furthermore, content-based methods, e.g. collaborative topic regression (CTR), usually require textual content information about items, which is not available in some cases. In this paper, we apply the latent Dirichlet allocation (LDA) model to the OCCF problem. It treats missing values as unknown and models only the observed data, and it does not need content information about items. In our model, items are regarded as words, users as documents, and the user-item feedback matrix as the corpus. Experimental results show that our proposed method outperforms previous methods on various ranking-oriented evaluation metrics.
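A minimal sketch of the proposed analogy, assuming gensim: each user is a "document" whose "words" are the items with observed positive feedback, LDA is fitted, and unseen items are ranked for a user by summing p(topic|user) x p(item|topic); the toy data and parameters are illustrative.

```python
# Hedged sketch: LDA over user "documents" of item "words" for OCCF-style
# recommendation. Assumes gensim is available.
from gensim import corpora, models

user_items = {
    "u1": ["i1", "i2", "i3"],
    "u2": ["i2", "i3", "i4"],
    "u3": ["i5", "i6"],
    "u4": ["i5", "i6", "i7"],
}
dictionary = corpora.Dictionary(user_items.values())
corpus = {u: dictionary.doc2bow(items) for u, items in user_items.items()}
lda = models.LdaModel(list(corpus.values()), num_topics=2,
                      id2word=dictionary, passes=50, random_state=0)

def recommend(user, n=2):
    seen = set(user_items[user])
    topic_dist = dict(lda.get_document_topics(corpus[user], minimum_probability=0))
    scores = {}
    for t, p_topic in topic_dist.items():
        for item_id, p_item in lda.get_topic_terms(t, topn=len(dictionary)):
            item = dictionary[item_id]
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + p_topic * p_item
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("u1"))   # items favored by users with similar feedback
```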
This paper proposes a novel bootstrapping-based framework combined with automatic refinement to extract opinion words and targets. We employ a reasonable set of opinion seed words and pre-defined rules to start the bootstrapping, and leverage statistical word co-occurrence and dependency patterns for propagation between opinion words and targets. A Sentiment Graph Model (SGM) is constructed to evaluate these opinion relations. Furthermore, we employ Automatic Rule Refinement (ARR) to refine the rules that extract false results. By pruning false results and applying the ARR process, we can efficiently alleviate the error propagation problem of traditional bootstrapping-based methods. A preliminary evaluation shows the effectiveness of our method.
Learning users' preferences is critical to enabling personalized recommendation services in various online applications such as e-commerce, entertainment and many others. In this paper, we study how to learn users' preferences from abundant online activities, e.g., browsing and examination, which are usually called implicit feedback since they cannot be directly interpreted as users' likes or dislikes of the corresponding products. Pairwise preference learning algorithms are the state-of-the-art methods for this important problem, but they have two major limitations: low accuracy caused by noise in the observed feedback, and low efficiency caused by non-optimal learning steps in the update rules. In response, we propose a novel adaptive pairwise preference learning algorithm, which addresses both limitations in a single algorithm with a concise and general learning scheme. Specifically, in the proposed learning scheme, we design an adaptive utility function and an adaptive learning step for the two problems, respectively. Empirical studies show that our algorithm achieves significantly better results than the state-of-the-art method on two real-world data sets.
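To make the general update scheme concrete, the sketch below performs BPR-style pairwise matrix factorization with AdaGrad-style per-parameter step sizes as one simple form of adaptive learning step; the paper's adaptive utility function is not reproduced, and all data and hyperparameters are illustrative.

```python
# Hedged sketch: pairwise (BPR-style) preference learning with adaptive
# (AdaGrad-style) per-parameter step sizes.
import numpy as np

def train_pairwise(num_users, num_items, triples, dim=8, epochs=200,
                   lr=0.1, reg=0.01, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (num_users, dim))
    V = rng.normal(0, 0.1, (num_items, dim))
    gU, gV = np.full_like(U, 1e-8), np.full_like(V, 1e-8)   # AdaGrad accumulators
    for _ in range(epochs):
        for u, i, j in triples:                  # user u prefers item i over j
            x_uij = U[u] @ (V[i] - V[j])
            sig = 1.0 / (1.0 + np.exp(x_uij))    # sigma(-x), weight of the gradient
            grad_u = -sig * (V[i] - V[j]) + reg * U[u]
            grad_i = -sig * U[u] + reg * V[i]
            grad_j = sig * U[u] + reg * V[j]
            for G, P, idx, g in ((gU, U, u, grad_u), (gV, V, i, grad_i),
                                 (gV, V, j, grad_j)):
                G[idx] += g * g
                P[idx] -= lr * g / np.sqrt(G[idx])
    return U, V

triples = [(0, 0, 2), (0, 1, 2), (1, 2, 0)]      # (user, preferred, less preferred)
U, V = train_pairwise(num_users=2, num_items=3, triples=triples)
print(U[0] @ V.T)    # user 0 should score items 0 and 1 above item 2
```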
Searching temporal text by focusing only on the time attribute or only on keywords is insufficient today, as users pay close attention to how events evolve over time. Both temporal and textual constraints need to be considered in a single query, which we call a Top-k Interval Keyword Query (TIKQ). In this paper, we present a cloud-based system named INK that supports efficient and effective execution of TIKQs on Hadoop and HBase. In INK, an Adaptive Index Selector (AIS) chooses the better execution plan for each TIKQ based on a proposed cost model, and leverages two novel hybrid index modules (TriI and IS-Tree) to combine keyword and interval filtering seamlessly.
In recent years, community structure has attracted increasing attention in social network analysis. However, the performance of the many approaches to community detection is seldom evaluated with a systematic suite of measurements. Furthermore, few works reveal diverse node features based on the detected community structure. In this paper, we build a tool called CoDEM that performs both quality evaluation of community detection and in-depth mining for pivotal nodes inside communities. The tool integrates several effective community detection approaches, establishes an overall evaluation system, and produces a multi-dimensional ranking of the local importance of nodes. Moreover, it provides a user-friendly interface.
The rapidly increasing RDF data in the Linked Open Data (LOD) community project is a valuable resource for obtaining domain knowledge. However, RDF data of specific topics also shows a trend of being more decentralized and fragmented, which makes it difficult and inefficient for the users to get an overview of a specific topic and retrieve the desired information. In this paper, we demonstrate a novel system called KFM, which can aggregate the distributed RDF data of a topic according to the facets of this topic. KFM provides a new way for users to obtain and explore domain knowledge in the LOD cloud.
Topic modeling is a machine learning technique that identifies latent topics in a text corpus. There are several existing tools that allow end-users to create and explore topic models using graphical user interfaces. In this paper, we present a visual analytics system for dynamic topic models that goes beyond the existing breed of tools. First, it decouples the Web-based user interface from the underlying data sets, enabling exploration of arbitrary text data sets in the Web browser. Second, it allows users to explore dynamic topic models, while existing tools are often limited to static topic models. Finally, it comes with a tool server in the backend that allows the design and execution of scientific workflows to build topic models from any data source. The system is demonstrated by building and exploring a dynamic topic model of CIKM proceedings published since 2001.
In recent years, time series data have become ubiquitous across industries, creating a huge demand for time series analysis such as pattern search. Meanwhile, it is increasingly recognized that analysts can conveniently analyze time series only when pattern search results can be combined with information from relational tables in a programming-free way. Hence, casual users strongly demand that queries involving pattern search be expressible in SQL. However, existing database products that support a time series data type lack the capability to perform pattern search on time series data. This paper presents SearchonTS, an extendable framework for in-database pattern search on time series data. It provides a set of interfaces so that time series indexes and pattern search methods can be added and executed in a uniform, query-optimized manner. SearchonTS is implemented as an extension of IBM Informix and targets a future release of the product. We have implemented index-based pattern search under Euclidean distance (ED) via SearchonTS to demonstrate its usability for developers, and we also provide a real-world scenario with SQL queries involving pattern search to illustrate its convenience for users.
This paper describes an advanced news analytics and exploration system that allows users to visualize trends of entities like politicians, countries, and organizations in continuously updated news articles. Our system improves state-of-the-art text analytics by linking ambiguous names in news articles to entities in knowledge bases like Freebase, DBpedia or YAGO. This step enables indexing entities and interpreting the contents in terms of entities. This way, the analysis of trends and co-occurrences of entities gains accuracy, and, by leveraging the taxonomic type hierarchy of knowledge bases, also expressiveness and usability. In particular, we can analyze not only individual entities, but also categories of entities and their combinations, including co-occurrences with informative text phrases. Our Web-based system demonstrates the power of this approach through insightful anecdotal analysis of recent events in the news.
Smartphones are ubiquitous and becoming more and more sophisticated, with ever-growing computing, networking, and sensing powers. How can we help users form healthy habits, for example by sending a reminder when they have been sitting too long? How can we localize ourselves inside a building and/or find the reception desk? Recognizing physical activities (e.g., sitting, walking, jogging) is a core building block for answering these questions and many more.
We present AcRe, a human activity recognition application for smartphones. AcRe takes motion data from different smartphone sensors (e.g., accelerometer, compass) as input and predicts a user's motion activities (e.g., walking upstairs, standing, sitting) in real time. It also provides additional functionality, such as incorporating user feedback and summarizing daily activities. The application is built on iOS 7.0 and will be released soon in Apple's App Store. We will invite the audience to experiment with AcRe and assess its effectiveness, efficiency, applicability to various domains, and potential for further improvement.
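As a minimal, hypothetical sketch of the general pipeline such an application uses (not the actual AcRe implementation), accelerometer readings can be split into fixed windows, reduced to simple statistical features, and classified with a nearest-neighbor model. Feature choices and labels here are illustrative assumptions.

```python
import numpy as np

def window_features(acc_xyz, win=50):
    """Split an (N, 3) accelerometer stream into windows and extract mean/std features."""
    feats = []
    for start in range(0, len(acc_xyz) - win + 1, win):
        w = acc_xyz[start:start + win]
        feats.append(np.concatenate([w.mean(axis=0), w.std(axis=0)]))
    return np.array(feats)

def knn_predict(train_X, train_y, x, k=3):
    """Classify one feature vector by majority vote among its k nearest neighbors."""
    d = np.linalg.norm(train_X - x, axis=1)
    votes = train_y[np.argsort(d)[:k]].tolist()
    return max(set(votes), key=votes.count)
```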
In this demo, we present Cleanix, a prototype system for cleaning relational Big Data. Cleanix takes data integrated from multiple data sources and cleans it on a shared-nothing machine cluster. The backend system is built on top of an extensible and flexible data-parallel substrate, the Hyracks framework. Cleanix supports various data cleaning tasks such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. We demonstrate that Cleanix is a practical tool that supports effective and efficient data cleaning at large scale.
Internet of Things (IoT) is an emerging paradigm in which physical objects are connected and communicate over the Web. Its capability of assimilating the virtual world and the physical one offers many exciting opportunities. However, how to realize a smooth, seamless integration of the two worlds remains an interesting and challenging topic. In this paper, we showcase an IoT prototype system that enables seamless integration of the virtual and physical worlds and efficient management of things of interest (TOIs), where services and resources offered by things can be easily monitored, visualized, and aggregated by users into value-added services. This paper presents the motivation, system design, implementation, and demonstration scenario of the system.
CrewScout is an expert-team finding system based on the concept of skyline teams and efficient algorithms for finding such teams. Given a set of experts, CrewScout finds all k-expert skyline teams, i.e., teams that are not dominated by any other k-expert team. Dominance between teams is determined by comparing their aggregated expertise vectors. The need for finding expert teams prevails in applications such as question answering, crowdsourcing, panel selection, and project team formation. The new contributions of this paper include an end-to-end system with an interactive user interface that assists users in choosing teams, and a demonstration of its application domains.
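To make the dominance notion concrete, the following small sketch compares aggregated expertise vectors and filters out dominated k-expert teams. The enumeration is brute force and purely illustrative; CrewScout's own algorithms are far more efficient.

```python
from itertools import combinations
import numpy as np

def dominates(u, v):
    """u dominates v if it is at least as good on every skill and strictly better on one."""
    return bool(np.all(u >= v) and np.any(u > v))

def skyline_teams(experts, k):
    """experts: (n, d) skill matrix. Returns index tuples of non-dominated k-expert teams."""
    teams = [(t, experts[list(t)].sum(axis=0)) for t in combinations(range(len(experts)), k)]
    return [t for t, vec in teams
            if not any(dominates(other, vec) for _, other in teams)]

if __name__ == "__main__":
    skills = np.array([[3, 1], [1, 3], [2, 2], [1, 1]])
    print(skyline_teams(skills, 2))   # the team containing the weakest expert is dominated
```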
Wikipedia has become one of the best sources for creating and sharing a massive volume of human knowledge. Much effort has been devoted to generating and enriching structured data by automatic information extraction from unstructured text in Wikipedia. Most, if not all, of the existing work shares the same paradigm: starting with information extraction over the unstructured text data, followed by supervised machine learning. Although remarkable progress has been made, this paradigm has limitations in terms of effectiveness, scalability, and the high cost of labeling.
We present WiiCluster, a scalable platform for automatically generating infoboxes for Wikipedia articles. The heart of our system is an effective cluster-then-label algorithm over a rich set of semi-structured data in Wikipedia articles: linked entities. It is completely unsupervised and thus requires no human labels, and it is effective in generating semantically meaningful summarizations of Wikipedia articles. We further propose a cluster-reuse algorithm to scale up the system. Overall, WiiCluster is able to generate nearly 10 million new facts. We also develop a web-based platform to demonstrate WiiCluster, which enables users to access and browse the generated knowledge.
Wearable devices such as Google Glass are receiving increasing attention and look set to become part of our technical landscape over the next few years. At the same time, lifelogging is a topic that is growing in popularity with a host of new devices on the market that visually capture life experience in an automated manner. In this paper, we describe a visual lifelogging solution for Google Glass that is designed to capture life experience in rich visual detail, yet maintain the privacy of unknown bystanders.
We present an approach called negative face blurring and evaluate it on a lifelogging collection of around nine thousand pictures captured with Google Glass.
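The following is a hedged sketch of the negative face blurring idea: blur every detected face except those recognized as known (e.g., consenting contacts of the lifelogger). The `is_known` predicate is a placeholder for whatever face matcher the real system uses; the detector and blur settings are illustrative.

```python
import cv2

def negative_face_blur(image_bgr, is_known,
                       cascade_path=cv2.data.haarcascades + "haarcascade_frontalface_default.xml"):
    """Obscure unknown bystanders in an image while leaving known faces untouched."""
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        face = image_bgr[y:y + h, x:x + w]
        if not is_known(face):                       # unknown bystander: blur the region
            image_bgr[y:y + h, x:x + w] = cv2.GaussianBlur(face, (51, 51), 0)
    return image_bgr
```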
Today's data management systems increasingly need to support both tensor-algebraic operations (for analysis) and relational-algebraic operations (for data manipulation and integration). Tensor decomposition techniques are commonly used for discovering the underlying structure of multi-dimensional data sets. However, as the relevant data sets grow large, existing in-memory schemes for tensor decomposition become increasingly ineffective, and memory-independent solutions, such as in-database analytics, are needed. We introduce TensorDB, an in-database analytics system that efficiently implements tensor decompositions on chunk-based array data stores. TensorDB includes both static and dynamic in-database tensor decomposition operators. It extends an array database and leverages array operations for data manipulation and integration, supporting complex data processing plans in which multiple relational-algebraic and tensor-algebraic operations are composed with each other.
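As an illustrative sketch of the chunk-based idea only (these are not TensorDB's actual operators), data can be streamed chunk by chunk from an array store while accumulating the statistics a decomposition step needs, so that memory use stays independent of the full data size.

```python
import numpy as np

def chunked_gram(chunks, d):
    """Accumulate X^T X over row chunks of a tall (N, d) matrix stored chunk-wise."""
    gram = np.zeros((d, d))
    for chunk in chunks:              # each chunk is an (n_i, d) block read from the store
        gram += chunk.T @ chunk
    return gram                       # eigendecomposition of `gram` yields principal factors

if __name__ == "__main__":
    d = 4
    blocks = [np.random.rand(100, d) for _ in range(10)]   # stand-in for array-store chunks
    eigvals, _ = np.linalg.eigh(chunked_gram(blocks, d))
    print(eigvals)
```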
TweetMogaz is a news portal platform that generates news reports from social media content. It uses an adaptive information filtering technique for tracking tweets relevant to news topics, such as politics and sports in particular regions. Relevant tweets for each topic are used to generate a comprehensive report about public reaction to ongoing events. Showing a news report about an entire topic may be suboptimal for some users, since users often prefer a story-oriented presentation. In this demonstration, we present a technique for identifying stories within a stream of microblogs on a given topic. Tweets detected for a news story are used to generate a dynamic pseudo-article whose content is updated in real time based on Twitter trends. A pseudo-article consists of a title, a front-page image, a set of tweets on the story, and links to external news articles. The platform is running live and tracks news on hot topics including Egyptian politics, the Syrian conflict, and international sports.
This paper presents TWinChat, a Twitter and Web user interactive chat system that supports simultaneous real-time communication between microbloggers and Web users through both the contents of microblogs and Web pages. TWinChat provides a question-answering interface attached to Web pages, which allows Web users to chat with Twitter users in real time while presenting tweets associated with the Web pages, i.e., simultaneous cross-media communication. In order to map between the heterogeneous media, the system extracts relationships between tweets and Web pages by generating queries based on location names. Thus, our system can effectively present messages from Web users to help Twitter users immediately obtain useful information or knowledge, and it can also effectively present tweets from Twitter users to help Web users easily grasp the current situation in real time.
Large amounts of data often require expensive and time-consuming analysis. Therefore, highly scalable and efficient techniques are necessary to process, analyze and discover useful information. Database sampling has proven to be a powerful method to overcome these limitations. Using only a sample of the original large database brings the benefit of obtaining useful information faster, at the potential expense of lower accuracy. In this paper, we demonstrate VFDS, a novel fast database sampling system that maintains the referential integrity of the data. The system is developed on top of the open-source database management system MySQL. We present various scenarios that demonstrate the effectiveness of VFDS in terms of approximate query answering, sample size, and execution time, on both real and synthetic databases.
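As a toy sketch of referential-integrity-preserving sampling in the spirit of VFDS (the actual algorithm is not reproduced here), one can sample rows from a child table and then pull in every parent row they reference, so that foreign keys in the sample never dangle. Table and column names below are made up for illustration.

```python
import random

def sample_with_integrity(orders, customers, rate=0.1, seed=42):
    """Sample `orders` at the given rate and keep exactly the customers they reference."""
    random.seed(seed)
    sampled_orders = [o for o in orders if random.random() < rate]
    needed_ids = {o["customer_id"] for o in sampled_orders}
    sampled_customers = [c for c in customers if c["id"] in needed_ids]
    return sampled_orders, sampled_customers
```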
This demo presents exploratory keyword search over data graphs by means of semantic facets. The demo starts with a keyword search over data graphs. Answers are first ranked by an existing search engine that considers their textual relevance and semantic structure. The user can then explore the answers through facets of structural patterns (i.e., schemas) as well as through other features. A compact way of presenting answers is also supported and is applicable when looking for a single entity that connects the keywords. The demo is based on a working prototype that users can try on their own. It includes five quite diverse data graphs: three generated from relational databases and two from RDF triples. The demo shows that the system enables users to easily and quickly perform various search tasks by means of exploration, filtering, and summarization.
Our slogan for the proposed Clairvoyant system is "with several clicks, the future is in your hand, the plan comes into your mind". Clairvoyant predicts the future behavior of new videos from only a small amount of early data. The core of the system is the novel shifted shape match prediction (SSMP) algorithm, which is based on a K-Nearest-Neighbor model. Extensive experiments have been conducted on open data sets. The experimental results confirm that the proposed SSMP algorithm is promising and outperforms the baselines with significant improvements under various evaluation methods. A demonstration video has been published at http://1drv.ms/1nyH3hD.
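The details of SSMP are not given in the abstract, so the following is only a hedged reconstruction of its K-Nearest-Neighbor flavor: match a new video's early prefix against historical curves, allowing a small alignment shift, and average the continuations of the closest neighbors. All names and parameters are assumptions.

```python
import numpy as np

def knn_forecast(prefix, history, horizon, k=5, max_shift=3):
    """prefix: early observations of a new video; history: list of full historical curves."""
    m = len(prefix)
    prefix = np.asarray(prefix, dtype=float)
    scored = []
    for curve in history:
        curve = np.asarray(curve, dtype=float)
        best = None
        for s in range(max_shift + 1):               # try small alignment shifts
            if s + m + horizon > len(curve):
                break
            d = np.linalg.norm(curve[s:s + m] - prefix)
            if best is None or d < best[0]:
                best = (d, curve[s + m:s + m + horizon])
        if best is not None:
            scored.append(best)
    neighbors = sorted(scored, key=lambda t: t[0])[:k]
    return np.mean([future for _, future in neighbors], axis=0)
```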
Inventory management refers to tracking the inventory levels, orders, and sales of a retail business. In the current retail market, a tremendous amount of data on stocked goods (items) is generated every day. Due to the increasing volume of transaction data and the correlations among items, it is often non-trivial to manage stocked goods efficiently and effectively. In this demo, we present an intelligent system, called iMiner, to ease the management of enormous inventory data. We utilize distributed computing resources to process the huge volume of inventory data, and incorporate the latest advances in data mining into the system to perform inventory management tasks, e.g., forecasting inventory, detecting abnormal items, and analyzing inventory aging. Since 2014, iMiner has been deployed as the major inventory management platform of ChangHong Electric Co., Ltd, one of the world's largest TV-selling companies, based in China.
The advent of social media has facilitated the study of information diffusion, i.e., how information spreads and exerts influence among users on social graphs. In this demo paper, we present a system for real-time analysis of information diffusion on Twitter; it constructs so-called information cascades that capture how information is propagated from user to user. We face the challenge of managing and presenting large and fast-evolving graph data. For this purpose, we have developed methods for computing and visualizing information flow dynamically, offering rich structural and temporal information. The interface allows users to interact with the dynamic, evolving cascades and gives valuable insights into how information propagates in real time and how users influence each other.
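A deliberately simplified sketch of cascade construction from a retweet stream is given below: each retweet contributes a diffusion edge from the original poster to the retweeter, and the edges grouped by original tweet form that tweet's cascade. Field names are illustrative; the demonstrated system maintains these structures incrementally and visualizes them.

```python
from collections import defaultdict

def build_cascades(retweets):
    """retweets: iterable of (original_tweet_id, source_user, retweeting_user, timestamp)."""
    cascades = defaultdict(list)                  # tweet id -> list of diffusion edges
    for tweet_id, src, dst, ts in sorted(retweets, key=lambda r: r[3]):
        cascades[tweet_id].append((src, dst, ts))
    return cascades

def cascade_sizes(cascades):
    """Number of users reached per cascade (distinct retweeters plus the source)."""
    return {tid: len({u for e in edges for u in e[:2]}) for tid, edges in cascades.items()}
```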
Social networks have become an important information source. Due to their unprecedented success, these systems have to face an exponentially increasing amount of user-generated content. As a consequence, finding relevant users or data matching specific interests is challenging. We present RecLand, a recommender system that takes advantage of the social graph topology and of the available contextual information to recommend users. The graphical interface of RecLand shows recommendations that match users' topical interests and allows them to tune the parameters to adapt the recommendations to their needs.
This demonstration presents MeowsReader, a real-time news ranking and filtering prototype. MeowsReader illustrates how a general class of continuous top-k queries offers a suitable abstraction for modeling and implementing real-time search services over highly dynamic information streams, combining keyword search with real-time web signals about information items. Users express their interest through simple text queries and continuously receive the best matching results in an alert-like environment. The main innovative feature is dynamic item scores, which take into account information decay, real-time web attention, and other online user feedback. Additionally, a trend detection mechanism automatically extracts trending entities from the input streams, which can smoothly be added to user profiles in the form of keyword queries.
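The following is a hedged sketch of what a dynamic item score for continuous top-k news ranking can look like, combining keyword relevance, exponential information decay, and a real-time attention signal. The weighting scheme and half-life are illustrative assumptions, not MeowsReader's actual model.

```python
import time

def dynamic_score(keyword_relevance, attention, published_ts,
                  now=None, half_life_hours=6.0, alpha=0.6, beta=0.4):
    """Blend relevance and attention, then discount by the item's age."""
    now = now if now is not None else time.time()
    age_hours = max(0.0, (now - published_ts) / 3600.0)
    decay = 0.5 ** (age_hours / half_life_hours)   # freshness halves every half_life_hours
    return (alpha * keyword_relevance + beta * attention) * decay
```

Because the decay term changes continuously, items must be re-scored over time, which is exactly what makes the continuous top-k abstraction useful.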
We present AMiner-mini, a distributed academic search and mining system. The system offers intra- and inter-university academic search and mining services. It integrates academic data from multiple sources and performs name disambiguation, a fundamental issue in people search. We employ a two-phase approach that formalizes the disambiguation problem within an HMRF framework, which significantly improves disambiguation performance. Based on the disambiguation results, AMiner-mini offers a people search function that returns experts (or related researchers) for a user-given query. The user can also choose different metrics to rank the search results and explore them along different dimensions. The system has a distributed design: it can be deployed in a university as a stand-alone system for finding the right people working on a research topic, and multiple distributed deployments can be connected via Web services to perform search or mining asynchronously and return combined results. We have deployed the system at Tsinghua University, and feedback from academic users shows that the system works well and achieves its primary objective.
We present DEESSE [1], a tool that enables exploratory and serendipitous exploration, at the entity level, of the content of two different social media sources: Wikipedia, a user-curated online encyclopedia, and Yahoo Answers, a more unconstrained question-answering forum. DEESSE represents the content of each source as an entity network, which is further enriched with metadata about sentiment, writing quality, and topical category. Given a query entity, result entities are retrieved from the network using an algorithm based on a random walk with restart from the query. Following the emerging paradigm of composite retrieval, we organize the results into topically coherent bundles instead of showing them in a simple ranked list.
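Random walk with restart (RWR) is the retrieval primitive named above; a compact sketch of it over an entity network is shown below, iterating p = (1 - c) W^T p + c e to convergence, where e is concentrated on the query entity and W is a row-stochastic adjacency matrix. DEESSE's actual ranking adds further enrichment and bundling on top of this.

```python
import numpy as np

def rwr(W, query_idx, restart=0.15, tol=1e-8, max_iter=200):
    """Random walk with restart on a row-stochastic adjacency matrix W."""
    n = W.shape[0]
    e = np.zeros(n)
    e[query_idx] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W.T @ p + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```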
The Entity Linking (EL) problem consists in automatically linking short fragments of text within a document to entities in a given Knowledge Base such as Wikipedia. Due to its impact on several text-understanding tasks, EL is a hot research topic. The related problem of identifying the most relevant entities mentioned in a document, a.k.a. salient entities (SE), is also attracting increasing interest. Unfortunately, publicly available evaluation datasets that contain accurate, supervised knowledge about mentioned entities and their relevance ranking are currently very poor in both number and quality. This lack makes it very difficult to compare different EL and SE solutions on a fair basis, as well as to devise innovative techniques that rely on these datasets to train machine learning models, in turn used to automatically link and rank entities. In this demo paper we propose a Web-deployed tool that allows the creation of such datasets to be crowdsourced by supporting the collaborative human annotation of semi-structured documents. The tool, called Elianto, is an open-source framework that provides a user-friendly and reactive Web interface to support both EL and SE labelling tasks through a guided two-step process.
We present SmartVenues, a system that recommends nearby venues to a user who visits or lives in a city. SmartVenues models the variation over time of each venue's level of attendance, and uses state-of-the-art time series forecasting algorithms to predict the future attendance of these venues. We use the predicted levels of attendance to infer the popularity of a venue at future points in time, and to provide the user with recommendations at different times of the day. If the users log in with their Facebook account, the recommendations are personalised using the pages they "like". In this demonstrator, we detail the architecture of the system and the data that we collect in real-time to be able to perform the predictions. We also present two different interfaces that build upon our system to display the recommendations: a web-based application and a mobile application.
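The deployed SmartVenues system uses state-of-the-art time series forecasting; as a minimal stand-in only, the sketch below shows a seasonal-average baseline that predicts a venue's attendance at a future hour from its attendance at the same position in the weekly cycle. The hourly-count representation is an assumption for illustration.

```python
def seasonal_average_forecast(hourly_counts, hours_ahead, season=24 * 7):
    """Predict attendance `hours_ahead` hours from now as the average of past values
    observed at the same hour-of-week."""
    n = len(hourly_counts)
    phase = (n + hours_ahead - 1) % season          # hour-of-week of the target time slot
    history = [hourly_counts[i] for i in range(phase, n, season)]
    return sum(history) / len(history) if history else 0.0

# Example: with three weeks of history, forecast attendance 5 hours from now.
counts = [h % 24 for h in range(24 * 7 * 3)]        # synthetic hourly check-in counts
print(seasonal_average_forecast(counts, 5))
```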
Temporal information retrieval has been a topic of great interest in recent years. Despite the efforts conducted so far, most popular search engines remain underdeveloped when it comes to explicitly exploiting temporal information in their search process. In this paper we present GTE-Rank, an online search tool that takes time into account when ranking web search results for time-sensitive queries. GTE-Rank defines the score of a web page as a linear combination of topical and temporal scores, reflecting its relevance in both the topical and temporal dimensions. The resulting system can be explored graphically through a search interface made available for research purposes.
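Since the ranking is described as a linear combination of topical and temporal scores, a direct sketch of that combination is shown below; the component scorers and the weight are internal to GTE-Rank and are assumed here.

```python
def gte_rank_score(topical_score, temporal_score, gamma=0.5):
    """Blend topical and temporal relevance; gamma weights the topical dimension."""
    return gamma * topical_score + (1 - gamma) * temporal_score

# Example: a page that is moderately on-topic but very fresh for a time-sensitive query.
print(gte_rank_score(topical_score=0.6, temporal_score=0.9, gamma=0.7))
```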
Topics automatically derived by topic models are not always easily and clearly interpretable by humans. The most probable top words of a topic may leave room for ambiguous interpretations, especially when the top words are exclusively nouns. We demonstrate how part-of-speech (POS) tagging and collocation analysis of terms can be used to derive linguistic frames that yield more interpretable topic representations. These so-called topic frames are demonstrated as a feature of the TopicExplorer system, which allows users to explore document collections using topic models, visualizations, and keyword search. Demo versions of TopicExplorer are available at http://topicexplorer.informatik.uni-halle.de/ .
We present CONDOR, a tool for managing constraints towards improved data quality. As increasing amounts of heterogeneous data are being generated, integrity constraints are the primary tool for enforcing data integrity, and it is essential that an accurate and up-to-date set of constraints exists to validate that the correct application semantics are being enforced. We consider functional dependencies (FDs), a widely used class of constraints. CONDOR is an integrated system that identifies inconsistent data values (along with suggestions for clean values) and generates repairs to the data and/or the FDs to resolve inconsistencies. We extend the set of FD repair operations proposed in past work by (1) adding a set of attributes to an FD; (2) transforming an FD into a conditional functional dependency (CFD); and (3) identifying redundant attributes in an FD. Our demonstration showcases the visualization and interactive features of CONDOR that help users determine the best repairs for resolving the underlying inconsistencies and improving data quality.
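A small sketch of the basic check such a tool builds on is shown below: detecting tuples that violate a functional dependency X -> Y by grouping on X and flagging groups with more than one Y value. Relation and attribute names are illustrative, and CONDOR's repair generation goes well beyond this detection step.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """rows: list of dicts; lhs, rhs: attribute name tuples for the FD lhs -> rhs."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(r[a] for a in lhs)].add(tuple(r[a] for a in rhs))
    return {x: ys for x, ys in groups.items() if len(ys) > 1}

# Example: the FD zip -> city is violated when one zip maps to two city values.
rows = [{"zip": "10001", "city": "NYC"}, {"zip": "10001", "city": "New York"}]
print(fd_violations(rows, ("zip",), ("city",)))
```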
Held each year in conjunction with one of the largest data management conferences, CIKM, the Eighth ACM International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 14) is organized to bring together researchers interested in the development and application of cutting-edge biomedical and healthcare technology. The purpose of DTMBIO is to foster discussions regarding state-of-the-art applications of data and text mining to biomedical research problems. DTMBIO 14 will help scientists navigate emerging trends and opportunities in the evolving area of informatics-related techniques and problems in the context of biomedical research.
Massive amounts of data are being generated on social media sites, such as Twitter and Facebook. These data can be used to better understand people (e.g., personality traits, perceptions, and preferences) and predict their behavior. As a result, a deeper understanding of users and their behavior can benefit a wide range of intelligent applications, such as advertising, social recommender systems, and personalized knowledge management. These applications will also benefit individual users themselves and optimize their experience across a wide variety of domains, such as retail, healthcare, and education. Since mining and understanding user behavior from social media often requires interdisciplinary effort, including machine learning, text mining, human-computer interaction, and social science, our workshop aims to bring together researchers and practitioners from multiple fields to discuss the creation of deeper models of individual users by mining the content that they publish and the social networking behavior that they exhibit.
There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, emerging robust NLP tools, and an ever growing volume of linked data. These meaningful semantic annotations hold the promise to significantly enhance information access by enhancing the depth of analysis of today's systems. The goal of the ESAIR'14 workshop remains to advance the general research agenda on this core problem, with an explicit focus on one of the most challenging aspects to address in the coming years. The main remaining challenge is on the user's side - the potential of rich document annotations can only be realized if matched by more articulate queries exploiting these powerful retrieval cues - and a more dynamic approach is emerging by exploiting new forms of query autosuggest. How can the query suggestion paradigm be used to encourage searchers to articulate longer queries, with concepts and relations linking their statement of request to existing semantic models? How do entity results and social network data in "graph search" change the classic division between searchers and information and lead to extreme personalization - are you the query? How can we leverage transaction logs and recommendations, and how adaptive should we make the system? What are the privacy ramifications and the UX aspects - how do we avoid alienating users?
The LocWeb 2014 workshop continues a successful workshop series at the intersection of geospatial search, information management, and Web architecture, with a focus on location-aware information access. The workshop reflects the multitude of fields that demand and utilize location features, featuring presentations that look at the topic of location on the Web from an interdisciplinary perspective, including new approaches that deal with or utilize geospatial information.
The PIKM workshop offers Ph.D. students the opportunity to bring their work to an international and interdisciplinary research community and to create a network of young researchers who exchange and develop new and promising ideas. Similarly to CIKM, the PIKM workshop covers a wide range of topics in the areas of databases, information retrieval, and knowledge management.
The ACM 1st International Workshop on Privacy and Security of Big Data (PSBD 2014), held in Shanghai, China on November 7, 2014, in conjunction with the ACM 23rd International Conference on Information and Knowledge Management (CIKM 2014), presents research on privacy and security of big data, an emerging challenge in current database and data mining research. The PSBD 2014 program has two sessions on (i) scalable privacy-preserving and security-control methods for big data processing and (ii) user-oriented and data-oriented privacy methods for big data processing, plus a panel discussing current challenges and future research perspectives in privacy and security of big data.
We organize and present the 5th International Workshop on Web-scale Knowledge Representation, Retrieval and Reasoning (Web-KR 2014) as a continuing effort to discuss and develop theories and techniques that address the barriers to knowledge processing at Web scale. The workshop was held in conjunction with the 2014 ACM International Conference on Information and Knowledge Management (CIKM 2014) on November 3rd, 2014 in Shanghai, China. Compared to previous workshops under the same title, the accepted papers cover an even wider range of topics in the field. The contributions focus on semantic knowledge extraction and representation, knowledge clustering, inconsistency checking, entity relatedness and linking, query suggestions, etc. Many new approaches are proposed to investigate these topics in the context of Web-scale resources. This summary introduces the major contributions of the papers accepted at the Web-KR 2014 workshop.