Over the past few years, there has been enormous interest, in both academia and the finance industry, in applying AI techniques to develop investment strategies. However, generating returns is not always the sole investment objective. Large pension funds, for example, are considerably more risk-averse than profit-seeking. With this observation, we propose a Risk-balanced Deep Portfolio Constructor (RDPC) that takes risk into explicit consideration. RDPC is an end-to-end reinforcement learning-based transformer trained to optimize both returns and risk, with a hard attention mechanism that learns the relationships between asset pairs, imitating the powerful pairs trading strategy widely adopted by many investors. Experiments on real-world data show that RDPC achieves state-of-the-art performance not just on risk metrics such as maximum drawdown, but also on risk-adjusted return metrics including the Sharpe ratio and Calmar ratio.
Graph representation learning (also known as network embedding) has been extensively researched with varying levels of granularity, ranging from nodes to graphs. While most prior work in this area focuses on node-level representation, limited research has been conducted on graph-level embedding, particularly for dynamic or temporal networks. However, learning low-dimensional graph-level representations for dynamic networks is critical for various downstream graph retrieval tasks such as temporal graph similarity ranking, temporal graph isomorphism, and anomaly detection. In this paper, we present a novel method for temporal graph-level embedding that addresses this gap. Our approach involves constructing a multilayer graph and using a modified random walk with temporal backtracking to generate temporal contexts for the graph’s nodes. We then train a “document-level” language model on these contexts to generate graph-level embeddings. We evaluate our proposed model on five publicly available datasets for the task of temporal graph similarity ranking, and our model outperforms baseline methods. Our experimental results demonstrate the effectiveness of our method in generating graph-level embeddings for dynamic networks.
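The walk-based context generation described above can be sketched in a few lines. The snippet below is a simplified, hypothetical stand-in for the paper's modified random walk: walks follow edges with non-decreasing timestamps and back up one step when they hit a temporal dead end; the multilayer construction and the downstream language model are omitted.

```python
import random
from collections import defaultdict

def temporal_contexts(edges, walks_per_node=2, walk_len=5, seed=0):
    """Generate temporal walk contexts from (source, target, time) edges.

    A walk only traverses edges whose timestamps are non-decreasing; when
    a node has no admissible outgoing edge, the walk backtracks one step
    and retries. Simplified illustration, not the paper's exact algorithm.
    """
    rng = random.Random(seed)
    out = defaultdict(list)                    # node -> [(neighbor, t), ...]
    for u, v, t in edges:
        out[u].append((v, t))
    contexts = []
    for start in sorted(out):
        for _ in range(walks_per_node):
            walk, t_now, budget = [start], float("-inf"), walk_len * 4
            while len(walk) < walk_len and budget > 0:
                budget -= 1
                cand = [(v, t) for v, t in out.get(walk[-1], []) if t >= t_now]
                if not cand:
                    if len(walk) == 1:
                        break                  # dead end at the start node
                    walk.pop()                 # temporal backtracking
                    continue
                v, t_now = rng.choice(cand)
                walk.append(v)
            contexts.append(walk)
    return contexts

edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("b", "d", 1)]
print(temporal_contexts(edges))
```

The resulting walks play the role of "sentences" on which a document-level language model can then be trained.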
Given a large corpus of HTML-based emails (or websites, posters, documents) collected from the web, how can we train a model capable of learning from such rich heterogeneous data for HTML-based style recommendation tasks such as recommending useful design styles or suggesting alternative HTML designs? To address this new learning task, we first decompose each HTML document in the corpus into a sequence of smaller HTML fragments where each fragment may consist of a set of HTML entities such as buttons, images, textual content (titles, paragraphs) and stylistic entities such as background-style, font-style, button-style, among others. From these HTML fragments, we then derive a single large heterogeneous hypergraph that captures the higher-order dependencies between HTML fragments and entities in such fragments, both within the same HTML document as well as across the HTML documents in the corpus. We then formulate this new HTML style recommendation task as a hypergraph representation learning problem and propose an approach to solve it. Our approach is able to learn effective low-dimensional representations of the higher-order fragments that consist of sets of heterogeneous entities as well as low-dimensional representations of the individual entities themselves. We demonstrate the effectiveness of the approach across several design style recommendation tasks. To the best of our knowledge, this work is the first to develop an ML-based model for the task of HTML-based email style recommendation.
Representations of temporal networks arising from a stream of edges lie at the heart of the models learned on them and of those models’ performance on downstream applications. While previous work on dynamic modeling and embedding has focused on representing a stream of timestamped edges as a time series of graphs based on a specific time-scale τ (e.g., 1 month), we introduce the notion of an ϵ-graph time series that uses a fixed number of edges for each graph, and show its effectiveness in capturing fundamental structural graph statistics over time. The results indicate that the ϵ-graph time-series representation effectively captures the structural properties of the graphs across time, whereas the commonly used τ-graph time-series representation captures the frequency of edges and temporal patterns with respect to their arrival in application time. These results have many important implications, especially for the design of new GNN-based models for temporal networks and for understanding existing models and their limitations.
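The contrast between the two representations is easy to make concrete: given the same timestamp-ordered edge stream, the ϵ-partitioning produces snapshots of fixed edge count while the τ-partitioning produces snapshots of fixed duration. A minimal sketch (edge triples and parameter values are our own illustration):

```python
def eps_graph_series(edges, eps):
    """Split a timestamp-ordered (u, v, t) edge stream into snapshots of
    exactly `eps` edges each (the epsilon-graph time-series representation)."""
    edges = sorted(edges, key=lambda e: e[2])
    return [edges[i:i + eps] for i in range(0, len(edges), eps)]

def tau_graph_series(edges, tau):
    """Split the same stream into snapshots covering `tau` time units each
    (the conventional tau-graph time-series representation)."""
    edges = sorted(edges, key=lambda e: e[2])
    t0 = edges[0][2]
    buckets = {}
    for u, v, t in edges:
        buckets.setdefault(int((t - t0) // tau), []).append((u, v, t))
    return [buckets[k] for k in sorted(buckets)]

stream = [("a", "b", 0), ("b", "c", 1), ("c", "d", 1),
          ("a", "c", 5), ("d", "a", 6)]
print([len(g) for g in eps_graph_series(stream, 2)])  # uniform-size snapshots
print([len(g) for g in tau_graph_series(stream, 5)])  # sizes track arrival rate
```

Because every ϵ-snapshot has the same number of edges, structural statistics computed on them are directly comparable over time; τ-snapshot sizes instead vary with the edge arrival rate.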
Visualization recommendation systems make understanding data more accessible to users of all skill levels by automatically generating visualizations for users to explore. However, most existing visualization recommendation systems rank all possible visualizations based on their attributes or encodings, which makes it difficult to find the most relevant insights. We therefore introduce a novel class of insight-based visualization recommendation systems that automatically rank and recommend groups of related insights as well as the most important insights within each group. Our approach combines results from different learning-based methods to discover insights automatically and generalizes to a variety of attribute types (e.g., categorical, numerical, and temporal), including non-trivial combinations of these attribute types. To demonstrate the utility of this approach, we implemented an insight-centric visualization recommendation system, SpotLight, and conducted a user study with twelve participants, which showed that users are able to quickly find and understand relevant insights in unfamiliar data.
Email-based scams pose a threat to the personally identifiable information and financial safety of all email users. Within a university environment, the risks are potentially greater: traditional students (i.e., within an age range typical of college students) often lack the experience and knowledge of older email users. By understanding the topics, temporal trends, and other patterns of scam emails targeting universities, these institutions can be better equipped to reduce this threat by improving their filtering methods and educating their users. While anecdotal evidence suggests common topics and trends in these scams, the empirical evidence is limited. Observing that large universities are uniquely positioned to gather and share information about email scams, we built a corpus of 5,155 English language scam emails scraped from information security websites of five large universities in the United States. We use Latent Dirichlet Allocation (LDA) topic modelling to assess the landscape and trends of scam emails sent to university addresses. We examine themes chronologically and observe that topics vary over time, indicating changes in scammer strategies. For example, scams targeting students with disabilities have steadily risen in popularity since they first appeared in 2015, while password scams experienced a boom in 2016 but have lessened in recent years. To encourage further research to mitigate the threat of email scams, we release this corpus for others to study.
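The LDA step above can be sketched end to end. The snippet below runs a tiny collapsed Gibbs sampler over a toy scam-like corpus; it is stdlib-only and purely illustrative (the topic count, hyperparameters, and toy documents are our own choices, and a real study would use a library such as gensim or scikit-learn on the full 5,155-email corpus).

```python
import random
from collections import Counter

def lda_gibbs(docs, k=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA over tokenized documents.
    Returns, per topic, the words ranked by assignment frequency."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    v = len(vocab)
    z = [[rng.randrange(k) for _ in d] for d in docs]  # topic of each token
    dt = [[0] * k for _ in docs]          # document-topic counts
    tw = [Counter() for _ in range(k)]    # topic-word counts
    tn = [0] * k                          # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            dt[d][t] += 1; tw[t][w] += 1; tn[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]               # remove the token's assignment
                dt[d][t] -= 1; tw[t][w] -= 1; tn[t] -= 1
                weights = [(dt[d][j] + alpha) * (tw[j][w] + beta)
                           / (tn[j] + v * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t               # resample and restore counts
                dt[d][t] += 1; tw[t][w] += 1; tn[t] += 1
    return [[w for w, _ in tw[t].most_common()] for t in range(k)]

docs = [t.split() for t in [
    "bank account fee bank account fee",
    "lottery prize winner lottery prize winner",
    "bank password account bank password account",
    "prize claim lottery prize claim lottery",
]]
print(lda_gibbs(docs, k=2))
```

On real data, the per-year topic distributions recovered this way are what make the chronological trend analysis possible.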
Comparative decisions, such as picking between two cars or deciding between two hiking trails, require users to visit multiple webpages and contrast the choices along relevant aspects. Given the impressive capabilities of pre-trained large language models [4, 11], we ask whether they can help automate such analysis. We refer to this task as extractive aspect-based contrastive summarization, which involves constructing a structured summary that compares the choices along relevant aspects. In this paper, we propose a novel method called STRUM for this task that can generalize across domains without requiring any human-written summaries or a fixed aspect list as supervision. Given a set of relevant input webpages, STRUM solves this problem using two pre-trained T5-based large language models: the first fine-tuned for aspect and value extraction, and the second fine-tuned for natural language inference. We showcase the abilities of our method across different domains, identify shortcomings, and discuss questions that we believe will be critical in this new line of research.
The large volumes of data on the Internet provide new opportunities for scientific discovery, especially for promoting data-driven open science research. However, due to the lack of accurate semantic markup, finding relevant data remains difficult. To address this problem, we develop a one-stop dataset service called DataExpo and propose a deep learning method for automatic metadata ingestion. In this demo paper, we describe the system architecture and how DataExpo facilitates dataset discovery, search, and recommendation. To date, DataExpo has indexed over 960,000 datasets from more than 27,000 repositories in the context of the Deep-time Digital Earth Program. Demo visitors can explore our service via https://dataexpo.acemap.info.
In this paper, we present a Multi-Task model for Recommendation and Churn prediction (MT) in the retail banking industry. The model leverages a hard parameter-sharing framework and consists of a shared multi-stack encoder with multi-head self-attention and two fully connected task heads. It is trained on two multi-class classification tasks: predicting product churn and identifying the next-best products (NBP) for users. Our experiments demonstrate the superiority of the multi-task model over its single-task versions, reaching top-1 precision of 78.1% and 77.6% for churn and NBP prediction, respectively. Moreover, we find that the model learns a coherent and expressive high-level representation reflecting user intentions related to both tasks: there is a clear separation between users with acquisitions and users with churn, and acquirers are more tightly clustered than churners. The gradual separability of churning and acquiring users, who diverge in intent, is a desirable property. It provides a basis for model explainability, which is critical to industry adoption, and also enables other downstream applications. These potential benefits, beyond reducing customer attrition and increasing product use (two primary concerns of businesses), make such a model even more valuable.
Clustering is widely employed in various applications and is one of the most useful data mining techniques. In clustering, a similarity measure, which defines how similar a pair of data objects is, plays an important role and is typically chosen in light of the target dataset’s characteristics. However, existing similarity measures (or distances) do not reflect the distribution of data objects in the dataset at all, which may limit clustering accuracy. In this paper, we propose c-affinity, a new notion of similarity that reflects the distribution of objects in the given dataset from a clustering point of view. We design c-affinity so that any two objects have a higher value the more likely they are to belong to the same cluster, as learned from the data distribution. Specifically, we use random walk with restart (RWR) on the k-nearest-neighbor graph of the given dataset to measure (1) how similar a pair of objects is and (2) how densely other objects are distributed between them. Via extensive experiments on sixteen synthetic and real-world datasets, we verify that replacing the existing similarity measure with our c-affinity improves clustering accuracy significantly.
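The RWR ingredient is standard and easy to illustrate. The sketch below runs power iteration on a toy unweighted neighbor graph (a stand-in for the k-nearest-neighbor graph; the graph, restart probability, and iteration count are our own choices): nodes in the source's dense region retain more probability mass than nodes across a bridge, which is exactly the signal c-affinity exploits.

```python
def rwr(adj, source, restart=0.15, iters=100):
    """Random-walk-with-restart scores from `source` over an unweighted
    neighbor graph given as {node: [neighbors]}."""
    nodes = sorted(adj)
    p = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        # restart mass returns to the source; the rest diffuses to neighbors
        nxt = {n: (restart if n == source else 0.0) for n in nodes}
        for n in nodes:
            share = (1 - restart) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p

# Two tight groups joined by a single bridge edge (b - c).
adj = {"a": ["a2", "b"], "a2": ["a", "b"], "b": ["a", "a2", "c"],
       "c": ["b", "d", "d2"], "d": ["c", "d2"], "d2": ["c", "d"]}
scores = rwr(adj, "a")
print(sorted(scores, key=scores.get, reverse=True))
```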
This paper presents the technology behind user next-intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. To explicitly characterize user intent, we propose AlipayKG, an offline concept knowledge graph in the life-service domain that models the historical behaviors of users, the rich content users interact with, and the relations between them. We further introduce a Transformer-based model that integrates expert rules from the knowledge graph to infer the online user’s next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of downstream tasks while retaining explainability.
We present Mirror, an open-source platform for data exploration and analysis powered by large language models. Mirror offers an intuitive natural language interface for querying databases, and automatically generates executable SQL commands to retrieve relevant data and summarize it in natural language. In addition, users can preview and manually edit the generated SQL commands to ensure the accuracy of their queries. Mirror also generates visualizations to facilitate understanding of the data. Designed with flexibility and human input in mind, Mirror is suitable for both experienced data analysts and non-technical professionals looking to gain insights from their data.
Generating and maintaining API documentation with integrity and consistency can be time-consuming and expensive for evolving APIs. To solve this problem, several approaches have been proposed to automatically generate high-quality API documentation by combining knowledge from different web sources. However, current research handles unpopular APIs poorly and cannot generate structured API documentation. Hence, in this poster, we propose a hybrid technique (named gDoc) for the automatic generation of structured API documentation. We first present a fine-grained search-based strategy to generate descriptions for partial API parameters by computing the relevance between various APIs, ensuring the consistency of the API documentation. Then, we employ the cross-modal pre-trained Seq2Seq model M6 to generate a structured API document for each API, treating document generation as a translation problem. Finally, we propose a heuristic algorithm to extract practical parameter examples from API request logs. Experiments on the online system show that our approach significantly improves the effectiveness and efficiency of API documentation generation.
OpenAPI refers to the practice whereby producers offer Application Programming Interfaces (APIs) to help end-users access their data, resources, and services. Generally, an API has many parameters that must be supplied, and it is challenging for users to understand and specify these parameters correctly. This paper develops an API workbench to help users learn and debug APIs. Based on this workbench, we propose several exploratory techniques to reduce the overhead of learning and debugging APIs. We mine knowledge such as parameter characteristics (e.g., enumerability) and constraints (e.g., maximum/minimum value) from massive API call logs to narrow the range of parameter values. Then, we propose a fine-grained approach to enrich the API documentation by extracting dependency knowledge between APIs. Finally, we present a learning-based prediction method to predict API execution results before the API is called, significantly reducing user debugging cycles. Experiments on the online system show that our approach substantially improves the user experience of debugging OpenAPIs.
The MIND dataset, one of the most popular real-world news datasets, has been used in many news recommendation studies, all of which employ the impression log as training data (i.e., impression-based training). In this paper, we claim that the impression log suffers from a preference bias and thus should not be used for model training in news recommendation. We validate our claim via extensive experiments and demonstrate dramatic improvements in the recommendation accuracy of five existing state-of-the-art models, of up to 82.1%, achieved simply by ignoring the impression log. We believe this surprising result provides new insight toward better model training for both researchers and practitioners in the news recommendation area.
Social media generates a rich source of text data with intrinsic user attributes (e.g., age, gender), where different parties benefit from disclosing them. Attribute inference can be cast as a text classification problem, which, however, suffers from labeled-data scarcity. To address this challenge, we propose a data-limited learning model that distills knowledge by adversarially reprogramming a visual transformer (ViT) for attribute inference. Not only does this novel cross-modal model transfer the powerful learning capability of the ViT, but it also leverages unlabeled texts to reduce the demand for labeled data. Experiments on social media datasets demonstrate the state-of-the-art performance of our model on data-limited attribute inference.
Spot instances offered by major cloud vendors allow users to use cloud instances cost-effectively, but with the risk of sudden instance interruption. To enable efficient use of spot instances, cloud vendors provide various datasets that reflect the current status of spot instance services, such as the savings ratio, interruption ratio, and instant availability. However, this information is scattered across sources that require distinct access mechanisms and impose query constraints, so ordinary users find it difficult to use these datasets to optimize spot instance usage. To resolve this issue, we propose a publicly available multi-cloud spot instance dataset service. It will help cloud users and systems researchers use spot instances from multiple cloud vendors to build cost-efficient and reliable environments, expediting cloud systems research.
With the growing abundance of short text content on websites, analyzing and comprehending these short texts has become a crucial task. Topic modeling is a widely used technique for analyzing short text documents and uncovering their underlying topics. However, traditional topic models face difficulties in accurately extracting topics from short texts due to their limited content and sparse nature. To address these issues, we propose an embedding-based topic modeling (EmTM) approach that incorporates word embeddings and hierarchical clustering to identify significant topics. Experimental results on two web short-text datasets, Snippet and News, demonstrate the effectiveness of EmTM. The results indicate the superiority of EmTM over baseline topic models in both classification accuracy and topic coherence metrics.
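The hierarchical clustering step can be illustrated with a minimal average-linkage agglomerative clusterer over toy embedding vectors (the vectors, linkage choice, and cluster count here are our own illustration, not EmTM's exact configuration):

```python
import math

def agglomerative(points, n_clusters):
    """Plain average-linkage agglomerative clustering over embedding
    vectors; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        # average pairwise Euclidean distance between two clusters
        return sum(math.dist(points[i], points[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 2-D "word embeddings" with two clearly separated groups.
emb = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(agglomerative(emb, 2))
```

In an embedding-based topic model, each recovered cluster of word vectors is then summarized (e.g., by its most central words) to form a topic.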
Many people find it difficult to comprehend basic charts on the web, let alone make effective decisions from them. To address this gap, several ML models aim to automatically detect useful insights from charts and narrate them in a simpler textual format. However, most of these solutions can only detect basic factual insights (a.k.a. descriptive insights) that are already present in the chart, which may help with chart comprehension, but not decision-making. In this work, we study whether more advanced predictive and investigative insights can help users understand what will happen next and what actions they should take. These advanced insights can help decision-makers better understand the reasons behind anomaly events, predict future unfolding trends, and recommend possible actions for optimizing business outcomes. Through a study with 18 participants, we found that predictive and investigative insights lead to more insights recorded by users on average and better effectiveness ratings.
Developing semantically aware web services requires comprehensive and accurate ontologies. Evaluating an existing ontology or adapting it is a labor-intensive and complex task for which no automated tools exist. In this paper, we present a tool for the automated evaluation of ontologies that allows one to rapidly assess an ontology’s coverage of a domain and identify specific problems in the ontology’s structure. The tool evaluates the domain coverage and the correctness of parent-child relations of a given ontology based on domain information derived from a text corpus representing that domain, and it provides both overall statistics and detailed analyses of sub-graphs of the ontology. In the demo, we show how these features can be used for the iterative improvement of an ontology.
Product search engines (PSEs) play an essential role in retail websites, as they make it easier for users to retrieve relevant products within large catalogs. Despite continuous progress that has led to increasingly accurate search engines, limited attention has been given to their performance on queries with negations. Indeed, while we would expect different results for the queries “iPhone 13 cover with ring” and “iPhone 13 cover without ring”, popular PSEs return results for the latter query that still contain the unwanted ring component. This limitation of modern PSEs in understanding negations motivates the need for further investigation.
In this work, we start by defining the negation intent in users’ queries. Then, we design a transformer-based model, named Negation Detector for Queries (ND4Q), that reaches near-optimal performance in negation detection (above 95% on accuracy metrics). Finally, having built the first negation detector for product search queries, we propose a negation-aware filtering strategy, named Filtering Irrelevant Products (FIP). The promising experimental results in improving PSE relevance with FIP (+9.41% on nDCG@16 for queries where the negation starts with "without") pave the way for additional research toward negation-aware PSEs.
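To make the task concrete, the sketch below pairs a toy rule-based negation detector (a hypothetical pattern-matching baseline, not the transformer-based ND4Q) with a FIP-style filter that drops products mentioning the negated term. All names and patterns here are our own illustration.

```python
import re

NEG_CUES = r"\b(without|no|not|excluding|minus)\b"

def negated_term(query):
    """Toy stand-in for ND4Q: return the text following a negation cue,
    or None if the query carries no negation intent."""
    m = re.search(NEG_CUES + r"\s+([\w\s-]+)", query.lower())
    return m.group(2).strip() if m else None

def fip_filter(query, products):
    """Sketch of negation-aware filtering: drop products whose title
    mentions the negated term."""
    term = negated_term(query)
    if term is None:
        return products
    return [p for p in products if term not in p.lower()]

products = ["iPhone 13 cover with ring", "iPhone 13 leather cover"]
print(fip_filter("iphone 13 cover without ring", products))
```

A learned detector is needed in practice because negation cues are ambiguous ("no-slip grip" is not a negation of "slip"), which is precisely what ND4Q is trained to disambiguate.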
Recent studies have exploited advanced generative language models to generate natural language explanations (NLE) of why a certain text could be hateful. We propose the Chain of Explanation (CoE) prompting method, which uses heuristic words and the target group to generate high-quality NLE for implicit hate speech. By providing accurate target information, we improve the BLEU score for NLE generation from 44.0 to 62.3. We then evaluate the quality of the generated NLE using various automatic metrics and human annotations of informativeness and clarity.
Generative AI (e.g., Generative Adversarial Networks – GANs) has become increasingly popular in recent years. However, Generative AI introduces significant concerns regarding the protection of Intellectual Property Rights (IPR) (resp. model accountability) pertaining to images (resp. toxic images) and models (resp. poisoned models) generated. In this paper, we propose an evaluation framework to provide a comprehensive overview of the current state of the copyright protection measures for GANs, evaluate their performance across a diverse range of GAN architectures, and identify the factors that affect their performance and future research directions. Our findings indicate that the current IPR protection methods for input images, model watermarking, and attribution networks are largely satisfactory for a wide range of GANs. We highlight that further attention must be directed towards protecting training sets, as the current approaches fail to provide robust IPR protection and provenance tracing on training sets.
In a globalized marketplace, one can access products or services from almost anywhere. However, resolving which product in one language corresponds to which product in another language remains an under-explored problem. We explore this from two perspectives. First, given two products in different languages, how do we assess a similarity that could signal a potential match? Second, given products from various languages, how do we efficiently arrive at a multi-partite clustering that respects cardinality constraints? We describe algorithms for each perspective and integrate them into a promising solution validated on real-world datasets.
Graph neural networks (GNNs) have achieved state-of-the-art performance on graph-related tasks. Previous work observed that GNN performance degrades as the number of layers increases and attributed this phenomenon to over-smoothing caused by stacked propagation. However, we show experimentally and theoretically that it is overfitting, rather than propagation, that causes the performance degradation. We propose a novel framework, layer-adaptive GNN (LAGNN), consisting of two modules, adaptive layer selection and random Droplayer, which adaptively determine the number of layers and thus alleviate overfitting. We attach this general framework to two representative GNNs and achieve consistent improvements on six representative datasets.
Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. Instead, we approach this problem using semi-supervised learning with a large language model (LLM). Our pipeline generates weak financial sentiment labels for Reddit posts with an LLM and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce Chain-of-Thought summaries and forcing it through several reasoning paths helps generate more stable and accurate labels, while using a regression loss further improves distillation quality. With only a handful of prompts, the final model performs on par with existing supervised models. Though production applications of our model are limited by ethical considerations, the model’s competitive performance points to the great potential of using LLMs for tasks that otherwise require skill-intensive annotation.
We present RDF Playground: a web-based tool to assist those who wish to learn or teach about the Semantic Web. The tool integrates functionalities relating to the key features of RDF, allowing users to specify an RDF graph in Turtle syntax, visualise it as an interactive graph, query it using SPARQL, reason over it using OWL 2 RL, and to validate it using SHACL or ShEx. The tool further provides the ability to import and explore data from the Web through a graph-based Linked Data browser. We discuss the design and functionality of the tool, its implementation, and the results of a usability study considering students from a Web of Data course that used it for lab assignments. We conclude with a discussion of these results, as well as future directions that we envisage for improving the tool.
Knowledge hypergraphs, the generalization of knowledge graphs, have attracted increasingly widespread attention due to their natural compatibility with real-world facts. However, link prediction in knowledge hypergraphs remains underexplored despite the ubiquity of n-ary facts in the real world. Several recent representative embedding-based methods for knowledge hypergraph link prediction have proven effective on a series of benchmarks; however, they consider only position (or role) information, ignoring the neighborhood structure among entities and the rich semantic information within each fact. To this end, we propose EnhancE, a model for effective link prediction in knowledge hypergraphs. On the one hand, a more expressive entity representation is obtained by adding both position and neighborhood information to the initial embedding. On the other hand, the rich semantic information of the entities involved in each tuple is incorporated into the relation embedding for an enhanced representation. Extensive experiments on real datasets of both knowledge hypergraphs and knowledge graphs demonstrate the excellent performance of EnhancE compared with a variety of state-of-the-art baselines.
In this work, we demonstrate how to set up a Wikidata SPARQL endpoint on commodity hardware. We achieve this using a novel triple store called qEndpoint, which uses a read-only partition based on HDT and a write partition based on RDF4J. We show that qEndpoint can index and query the entire Wikidata dump (currently 17 billion triples) on a machine with a 600GB SSD, 10 cores, and 10GB of RAM, while keeping query performance comparable to other SPARQL endpoints indexing Wikidata.
In a nutshell, we present the first SPARQL endpoint over Wikidata that can run on commodity hardware while preserving the query runtimes of existing implementations. Our work is a step toward democratizing access to Wikidata as well as to other large-scale Knowledge Graphs published on the Web. The source code of qEndpoint, along with the query workloads, is publicly available.
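For readers unfamiliar with SPARQL over Wikidata, the kind of query such an endpoint serves looks like the following. The endpoint URL below is a placeholder (a real qEndpoint deployment would have its own address), and the sketch only builds the HTTP request rather than sending it.

```python
from urllib.parse import urlencode

# Placeholder endpoint URL; a real qEndpoint deployment will differ.
ENDPOINT = "https://example.org/sparql"

# Count Wikidata items that are instances (P31) of human (Q5).
QUERY = """
SELECT (COUNT(?person) AS ?humans) WHERE {
  ?person <http://www.wikidata.org/prop/direct/P31>
          <http://www.wikidata.org/entity/Q5> .
}
"""

def build_request(endpoint, query):
    """Return the GET URL for a SPARQL query (not executed here)."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

print(build_request(ENDPOINT, QUERY))
```

Aggregation queries like this one, scanning billions of triples, are exactly the workloads for which the HDT read partition pays off.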
This article introduces Metadatamatic, an open-source, online, user-friendly tool for generating the description of a knowledge base. It supports the description of any RDF dataset via a user-friendly web form that does not require prior knowledge of the vocabularies being used, and it can enrich the description with automatically generated statistics if the dataset is accessible from a public SPARQL endpoint. We discuss the models and methods behind the tool and present initial results suggesting that Metadatamatic can help increase the visibility of public knowledge bases.
As one of the most basic services of the Internet, DNS suffers many attacks. Existing attack detection methods rely on learning from malicious samples, so they have difficulty detecting new attacks and long-period attacks. This paper transforms the DNS data flow into time series and proposes a DNS anomaly detection method based on a graph attention network and graph embedding (GAT-DNS). GAT-DNS builds a multivariate time-series model to depict the DNS service status; when the actual flow of a feature exceeds the predicted range, abnormal DNS behavior is deemed to have been found. We also propose vertex dependency to describe the dependency between features; features with high vertex-dependency values are deleted to achieve model compression, which improves system efficiency. Experiments on open datasets show that, compared with the recent AD-Bop and QLAD methods, GAT-DNS improves not only precision, recall, and F1 score, but also the time efficiency of the model.
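The flag-when-outside-the-predicted-range idea can be illustrated with a deliberately simple forecaster: a moving average with a k-sigma band standing in for the learned GAT prediction (the window, threshold, and toy query-rate series are our own choices, not the paper's model).

```python
import statistics

def flag_anomalies(series, window=5, k=3.0):
    """Flag points whose value leaves the mean +/- k*stdev band of the
    previous `window` points -- a simple stand-in for comparing actual
    flow against a learned prediction range."""
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sd = statistics.mean(hist), statistics.pstdev(hist)
        band = max(sd, 1e-9) * k          # avoid a zero-width band
        flags.append(abs(series[i] - mu) > band)
    return flags

qps = [100, 102, 99, 101, 100, 98, 500, 101]  # spike in DNS query rate
print(flag_anomalies(qps))                     # -> [False, True, False]
```

The GAT replaces the moving average with a forecaster that also models cross-feature dependencies, which is what makes the vertex-dependency pruning meaningful.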
Data-centric NLP is a highly iterative process requiring careful exploration of text data throughout the entire model development lifecycle. Unfortunately, existing data exploration tools are not well suited to data-centric NLP because of workflow discontinuity and a lack of support for unstructured text. In response, we propose Weedle, a seamless and customizable exploratory text analysis system for data-centric NLP. Weedle is equipped with built-in text transformation operations and a suite of visual analysis features. With its widgets, users can compose customizable dashboards interactively and programmatically in computational notebooks.
Millions of websites have started to annotate structured data within their HTML pages using the schema.org vocabulary. Popular entity types annotated with schema.org terms are products, local businesses, events, and job postings. The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013 and offers the extracted data for public download in the form of the schema.org data set series. The latest release in the series consists of 106 billion RDF quads describing 3.1 billion entities. The entity descriptions originate from 12.8 million different websites. From a Web Science perspective, the data set series lays the foundation for analyzing the adoption process of schema.org annotations on the Web over the past decade. From a machine learning perspective, the annotations provide a large pool of training data for tasks such as product matching, product or job categorization, information extraction, or question answering. This poster gives an overview of the content of the Web Data Commons schema.org data set series. It highlights trends in the adoption of schema.org annotations on the Web and discusses how the annotations are being used as training data for machine learning applications.
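To give a sense of what this annotation data looks like in a page, the sketch below pulls schema.org JSON-LD blocks out of raw HTML. It is a stdlib-only illustration with a made-up product snippet; a production extractor (such as the Any23-based pipeline used by Web Data Commons) parses the DOM properly and also handles Microdata and RDFa.

```python
import json
import re

def extract_jsonld(html):
    """Pull schema.org JSON-LD annotation blocks out of an HTML page."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    out = []
    for block in re.findall(pattern, html, flags=re.DOTALL | re.IGNORECASE):
        try:
            out.append(json.loads(block))
        except json.JSONDecodeError:
            continue                      # skip malformed annotations
    return out

page = '''<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Acme Anvil", "offers": {"@type": "Offer", "price": "19.99"}}
</script></head><body>...</body></html>'''
print(extract_jsonld(page)[0]["name"])
```

Entity descriptions harvested this way at web scale are what make the data set series usable as training data for product matching and categorization.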
The analogical QA task is a challenging natural language processing problem. When two word pairs are similar in their relationships, we refer to their relations as analogous. Although analogy methods based on word embeddings are well developed, analogical reasoning goes far beyond this scope. At present, methods based on pre-trained language models have explored only the tip of the iceberg. In this paper, we propose a multi-task learning method for the analogical QA task. First, we obtain word-pair representations by leveraging the output embeddings of the [MASK] token in a pre-trained language model. These representations serve two tasks. The first task trains an analogical classifier by supervised learning. The second is an auxiliary task based on relation clustering, which generates relation pseudo-labels for word pairs and trains a relation classifier. Our method guides the model to analyze relation similarity in analogical reasoning without relation labels. Experiments show that our method achieves excellent performance on four analogical reasoning datasets without the help of external corpora or knowledge. On the most difficult dataset, E-KAR, it improves performance by at least 4%.
Questions involving class cardinality comparisons are quite tricky to answer and come with their own challenges. They require some kind of reasoning, since web documents and knowledge bases, indispensable sources of information, rarely store direct answers to questions such as “Are there more astronauts or Physics Nobel Laureates?” We tackle questions on class cardinality comparison by tapping into three sources for absolute cardinalities as well as the cardinalities of orthogonal subgroups of the classes. We propose novel techniques for aggregating signals with partial coverage for more reliable estimates and evaluate them on a dataset of 4005 class pairs, achieving an accuracy of 83.7%.
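The final aggregation-and-comparison step can be illustrated with a minimal sketch. The median aggregator and the example estimates are assumptions for illustration, not the paper's actual technique for handling partial coverage:

```python
from statistics import median

def compare_cardinalities(estimates_a, estimates_b):
    """Aggregate noisy per-source cardinality estimates for two classes
    with a robust statistic (median) and report the larger class."""
    return "A" if median(estimates_a) > median(estimates_b) else "B"

# Made-up estimates from three hypothetical sources per class:
print(compare_cardinalities([560, 600, 540], [210, 230, 220]))  # A
```

A robust aggregator matters here because individual sources may cover only subgroups of a class and thus under- or over-estimate its true cardinality.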
We present Templet: an online question answering (QA) system for Wikidata. Templet is based on the collaboratively-edited repository QAWiki, which collects questions in multiple natural languages along with their corresponding structured queries. Templet generates templates from question–query pairs on QAWiki by replacing key entities with identifiers. Using autocompletion, the user can type a question in natural language, select a template, and, again using autocompletion, select the entities they wish to insert into the template’s placeholders, generating a concrete question, query and results. The main objectives of Templet are: (i) to enable users to answer potentially complex questions over Wikidata using natural language templates and autocompletion; (ii) to encourage users to collaboratively create new templates via QAWiki, which in turn can benefit not only Templet but also other QA systems.
Open-source software (OSS) plays a vital role in the modern software ecosystem. However, the maintenance and sustainability of OSS projects can be challenging. In this paper, we present the CrOSSD project, which aims to build a database of OSS projects and measure their current project “health” status. In the project, we will use both quantitative and qualitative metrics to evaluate the health of OSS projects. The quantitative metrics will be gathered through automated crawling of meta information such as the number of contributors, commits and lines of code. Qualitative metrics will be gathered for selected “critical” projects through manual analysis and automated tools, including aspects such as sustainability, funding, community engagement and adherence to security policies. The results of the analysis will be presented on a user-friendly web platform, which will allow users to view the health of individual OSS projects as well as the overall health of the OSS ecosystem. With this approach, the CrOSSD project provides a comprehensive and up-to-date view of the health of OSS projects, making it easier for developers, maintainers and other stakeholders to understand the health of OSS projects and make informed decisions about their use and maintenance.
New targets keep emerging on social media, and most of them lack labeled data. In this paper, we focus on zero-shot and few-shot stance detection, which aims to identify stances with few or even no training instances. To address the lack of labeled data and implicit stance expressions, we propose a self-supervised data augmentation approach based on coreference resolution. The method is specific to stance detection: it generates more stable data and reduces the variance within and between classes to achieve a balance between validity and robustness. Considering the diversity of comments, we propose a novel multi-task stance detection framework combining target-related fragment extraction and stance detection, which enhances attention on target-related fragments and reduces the noise of other fragments. Experiments show that the proposed approach achieves state-of-the-art performance in zero-shot and few-shot stance detection.
Python is a popular programming language for web development. However, optimizing the performance of Python web applications is a challenging task for developers. This paper presents a new approach to measuring the potential performance gains of upgraded Python web applications. Our approach is based on an interactive service that assists developers in optimizing their Python code through changes to the underlying system. The service uses profiling and visualization techniques to identify performance bottlenecks. We demonstrate and evaluate the effectiveness of our approach through a series of experiments on real-world Python web applications, measuring performance differences between versions and the benefits of migrating at a reduced cost. The results show promising improvements in performance without any required code changes.
The relationship-based approach is an efficient solution strategy for distributed RDF data management. The schema of tables directly affects the system’s storage efficiency and query performance. Most current approaches are based on fixed schemas (e.g., VP, ExtVP). When facing large-scale RDF datasets and complex SPARQL queries requiring many joins, such methods suffer from long pre-processing times and poor query performance. Schemas that achieve Pareto optimality between the system’s space consumption and query efficiency are needed but hard to find. Therefore, we propose mStore, a prototype system with flexible schemas based on schema mining. The intuition behind our approach is to place combinations of highly relevant predicates into the same schema, which can replace costly joins with low-cost selects and thereby improve query performance. The results show that our system performs better on complex query workloads while reducing pre-processing time compared to systems with fixed schema partitioning strategies.
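The intuition of grouping predicates that frequently appear on the same subjects can be sketched roughly as follows. The pairwise co-occurrence counting and the `min_support` threshold are illustrative simplifications, not mStore's actual mining algorithm:

```python
from collections import defaultdict
from itertools import combinations

def cooccurring_predicates(triples, min_support=2):
    """Return predicate pairs that co-occur on at least `min_support` subjects.
    Such pairs are candidates for storage in one wide table, so a query
    touching both predicates becomes a select instead of a join."""
    by_subject = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].add(p)
    counts = defaultdict(int)
    for preds in by_subject.values():
        for pair in combinations(sorted(preds), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_support}

triples = [("s1", "name", "x"), ("s1", "age", "y"),
           ("s2", "name", "z"), ("s2", "age", "w")]
print(cooccurring_predicates(triples))  # {('age', 'name')}
```

With VP-style storage, retrieving both `name` and `age` for a subject requires a join of two predicate tables; storing frequently co-occurring predicates together avoids it.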
Fact-checking is an important tool in fighting online misinformation. However, it requires expert human resources and thus does not scale well on social media, given the constant flow of new content. Crowdsourcing has been proposed to tackle this challenge, as it can scale at a lower cost, but it has so far been studied only in controlled environments. In this demo, we present the Community Notes Observatory, an online system to evaluate the first large-scale effort of crowdsourced fact-checking deployed in practice. We let demo attendees search and analyze tweets that are fact-checked by Community Notes users and compare the crowd’s activity against professional fact-checkers. The attendees will explore evidence of (i) differences in how the crowd and experts select content to be checked, (ii) how the crowd and the experts retrieve different resources to fact-check, and (iii) the edge the crowd shows in fact-checking scalability and efficiency as compared to expert checkers.
Existing recommendation methods based on multi-interest frameworks effectively model users from multiple aspects to represent complex user interests. However, more research still needs to be done on the behavior of users shopping for others. We propose a Multi-Demander Recommendation (MDR) model to learn different people’s interests from a sequence of actions. We first decouple the feature embeddings of items to learn the static preferences of different demanders. Next, a weighted directed global graph is constructed to model the associations among item categories. We partition short sequences by time intervals and look up category embeddings from the graph to capture dynamic intents. Finally, preferences and intents are combined to learn the interests of different demanders. The conducted experiments demonstrate that our model improves the accuracy of recommendations.
Ontology-mediated query answering (OMQA) consists in asking database queries on a knowledge base (KB); a KB is a set of facts, the KB’s database, described by domain knowledge, the KB’s ontology.
FOL-rewritability is the main OMQA technique: it reformulates a query w.r.t. the KB’s ontology so that the evaluation of the reformulated query on the KB’s database computes the correct answers. However, because this technique embeds the domain knowledge relevant to the query into the reformulated query, a reformulated query may be complex and its optimization is the crux of efficiency.
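The FOL-rewritability idea above can be shown with a toy example reduced to atomic class inclusions (far simpler than real Datalog±, description logic, or OWL reformulation): a query atom is expanded with every ontology rule whose head matches it, so evaluating the union of reformulations on the plain database yields the certain answers.

```python
def rewrite(atom, rules):
    """Expand a query atom with ontology rules.
    rules: list of (body, head) pairs meaning 'body implies head'."""
    out = {atom}
    frontier = [atom]
    while frontier:
        a = frontier.pop()
        for body, head in rules:
            if head == a and body not in out:
                out.add(body)        # body facts also entail the query atom
                frontier.append(body)
    return out

# Ontology: every professor is faculty; every faculty member is staff.
rules = [("professor", "faculty"), ("faculty", "staff")]
print(sorted(rewrite("staff", rules)))  # ['faculty', 'professor', 'staff']
```

This also illustrates why reformulated queries grow complex: each rule chain adds another disjunct, which is precisely what optimizers like OptiRef then try to prune.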
We showcase OptiRef, which implements a novel, general optimization framework for efficient query answering on Datalog±, description logic, existential rule, OWL, and RDF/S KBs. OptiRef optimizes reformulated queries by rapidly computing, based on a KB’s database summary, simpler (contained) queries with the same answers. We demonstrate OptiRef’s effectiveness on well-established benchmarks: performance is significantly improved in general, up to several orders of magnitude in the best cases!
Sarcasm is usually emotional and topical. Mining the characteristics of sarcasm semantics across different emotional tendencies and topic expressions helps gain insight into the causes of sarcasm. Most existing work detects sarcasm or topic labels within a supervised learning framework, which requires heavy data annotation. To overcome these challenges, and inspired by the multi-task learning framework, this paper proposes an unsupervised knowledge-enhanced prompt method. The method uses a similarity interaction mechanism to mine the hidden relationship between the sarcasm cause and the topic, and integrates external knowledge, such as syntax and emotion, into the prompting and generation process. Additionally, it identifies the sarcasm cause and topic simultaneously. Experimental results on a real-world dataset verify the effectiveness of the proposed model.
Research on multimedia retrieval has lasted for several decades. However, past efforts have generally focused on single-media retrieval, where the queries and retrieval results belong to the same media (platform) type, such as social media platforms or search engines. In single-media retrieval, users have to select search media or options based on search characteristics such as content, time, or spatial distance; if they carelessly forget to select, they might be unable to retrieve correct results mixed in other media. In this study, we propose a cross-media retrieval system using suggestion generation methods to integrate three search characteristics: the Web (textual content-based retrieval), SNSs (timeliness), and maps (spatial distance-aware retrieval). In our previous research, we attempted to improve search efficiency using clustering methods to provide search results to users through related terms. In this paper, we focus on the search efficiency of multiple search media. We utilize the Google search engine to obtain retrieval content from the Web, Twitter to obtain timely information from SNSs, and Google Maps to obtain geographical information from maps. We cluster the obtained retrieval results to analyze the similarities between them. Then, we generate relevant suggestions and provide them to users. Moreover, we validate the effectiveness of the search results generated by our proposed system.
The web is a treasure trove of data that is increasingly used by computer scientists for building large machine learning models, as well as by non-computer scientists for social studies or marketing analyses. As such, web crawling is an essential tool for both computational and non-computational scientists to conduct research. However, most existing web crawler frameworks and software products either require professional coding skills and lack an easy-to-use graphical user interface, or are expensive and limited in features. They are thus unfriendly to novices and inconvenient for complicated web-crawling tasks.
In this paper, we present EasySpider, an easy-to-use visual web crawler system for designing and executing web crawling tasks without coding. The workflow of a new web crawling task can be visually programmed by following EasySpider’s visual wizard on the target webpages using an intuitive point-and-click interface. The generated crawler task can then be easily invoked locally or as a web service. EasySpider is cross-platform and flexible enough to adapt to different web resources. It also supports advanced configuration for complicated tasks and extensions. The whole system is open-sourced and freely accessible on GitHub, which avoids possible privacy leakage.
The inclusion of explicit personas in generation models has gained significant attention as a means of developing intelligent dialogue agents. However, large pretrained generation models often produce responses that are inconsistent with the persona. We investigate the model generation behavior to identify signs of inconsistency and observe inconsistent behavior patterns. In this work, we introduce contrastive learning into persona-consistent dialogue generation, building on the idea that humans learn not just from positive feedback, but also from identifying and correcting undesirable behaviors. Based on the inconsistent patterns, we design two strategies to construct high-quality negative samples, which are critical for contrastive learning efficacy. Experimental results demonstrate that our method effectively improves the consistency of responses while also improving dialogue quality, on both automatic and human evaluation.
Many facts change over time. Time-sensitive Question Answering (TSQA) answers questions about time-evolving facts to test a model’s reasoning ability in the temporal dimension. Existing methods obtain representations of questions and documents and then compute their similarity to find the answer spans. These methods perform well on simple moment questions, but they struggle with hard duration questions that require temporal relations and temporal numeric comparisons. In this paper, we propose Temporal-aware Multitask Learning (TML) with three auxiliary tasks to tackle them. First, we propose a temporal-aware sequence labeling task that helps the model distinguish temporal expressions by detecting the temporal types of tokens in the document. Then, a temporal-aware masked language modeling task is used to capture the temporal relation between events based on the context. Furthermore, temporal-aware order learning is proposed to inject the ability of numeric comparison into the model. We carried out comprehensive experiments on the TimeQA benchmark to evaluate the performance of our proposed methodology on temporal question answering. TML significantly outperforms the baselines by a relative 10% on the two splits of the dataset.
Today, disinformation (i.e., deliberate misinformation) is omnipresent in all web communication channels, and this socially disruptive mode of web-based information is exploding. Increasingly, various countries are developing advanced methods to spread targeted disinformation. Addressing this flood of disinformation will require refined strategies to capture and evaluate the messages. In this paper, we present both a quantitative and a qualitative analysis of online social and information networks to better evaluate the characteristics of disinformation campaigns. We focus on the case of Russian-generated disinformation, which has been developed to an elevated level. We demonstrate an effective approach based on a new dataset to study the Russian campaign, composed of 14,497 cases of disinformation and the corresponding counter-disinformation. Although this case is of high current relevance, there is very limited published evaluation. We provide a novel analysis and present a methodology to characterize this disinformation. We base our investigation on a spherical k-means algorithm to determine the main topics of the disinformation and to discover the key trends. We employ the DistilBERT model and achieve a high F1-score of 98.8, demonstrating good quantitative capabilities. We propose the methodology as a template for further exploration and analysis.
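Spherical k-means, the clustering algorithm named above, can be sketched generically: document vectors are L2-normalized so that assigning each point to the center with maximal dot product clusters by cosine similarity, and centers are re-projected onto the unit sphere after each update. This is a textbook sketch, not the paper's pipeline:

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Cluster rows of X by cosine similarity on the unit sphere."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to unit sphere
    centers = X[rng.choice(len(X), k, replace=False)]  # init from data points
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)      # nearest center by cosine
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)     # re-project mean to sphere
    return labels

# Two directions in 2D: the first two vectors cluster together, as do the last two.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(spherical_kmeans(X, 2))
```

In topic discovery over text, `X` would typically hold TF-IDF or embedding vectors of the disinformation messages.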
Today, online social networks have a high impact on our society, as more and more people use them to communicate with each other, express their opinions, participate in public discussions, etc. In particular, Twitter is one of the most popular social network platforms that people use for political discussions. This has attracted the interest of many research studies that analyzed social phenomena on Twitter by collecting data, analyzing communication patterns, and exploring the structure of user networks. While previous works share many common methodologies for data collection and analysis, these are mainly re-implemented every time by researchers in a custom way. In this paper, we introduce PyPoll, an open-source Python library that operationalizes common analysis tasks for Twitter discussions. With PyPoll, users can perform Twitter graph mining, calculate the polarization index, and generate interactive visualizations without needing third-party tools. We believe that PyPoll can help researchers automate their tasks by giving them methods that are easy to use. We also demonstrate the use of the library by presenting two use cases: the PyPoll visualization app, an online application for graph visualization and sharing, and the Political Lighthouse, a Web portal for displaying the polarization in various political topics on Twitter.
Knowledge graphs (KGs) are vast collections of machine-readable information, usually modeled in RDF and queried with SPARQL. KGs have opened the door to a plethora of applications such as Web search or smart assistants that query and process the knowledge contained in those KGs. An important, but often disregarded, aspect of querying KGs is query provenance: explanations of the data sources and transformations that made a query result possible. In this article we demonstrate, through a Web application, the capabilities of SPARQLprov, an engine-agnostic method that annotates query results with how-provenance annotations. To this end, SPARQLprov resorts to query rewriting techniques, which make it applicable to already deployed SPARQL endpoints. We describe the principles behind SPARQLprov and discuss perspectives on visualizing how-provenance explanations for SPARQL queries.
Research on web security and privacy frequently relies on tools that analyze a set of websites. One major obstacle to such judicious analysis is obtaining a rock-solid and feature-rich web crawler. For example, the automated analysis of ad-malware campaigns on websites requires crawling a vast set of domains with multiple real web browsers, while simultaneously mitigating bot detection and applying user interactions on websites. Further, current tooling efforts in web crawling and analysis lack the ability to attach various threat analysis frameworks.
In this paper we introduce Katti, which overcomes several of today’s technical hurdles in web crawling. Our tool employs a distributed task queue that efficiently and reliably handles both large crawling and threat analysis requests. Katti extensively collects all available web data through an integrated person-in-the-middle proxy. Moreover, Katti is not limited to a specific use case, allowing users to easily customize our tool to their individual research intents.
Formality and politeness are two of the most commonly studied stylistic dimensions of language; they convey authority, the amount of shared context, and the social distance among communicators, and are known to affect user behavior significantly. Formality in text refers to the type of language used in situations where the speaker is very careful about the choice of words and sentence structure. In this paper, we propose a graph-induced transformer network (GiTN) to automatically detect formality and politeness in text. The proposed model exploits latent linguistic features present in the text to identify the aforementioned stylistic factors. The model is evaluated on multiple datasets across domains. We find that its performance surpasses most baseline systems.
Accessing large-scale structured datasets such as WDC or CORD-19 is very challenging. Even when a single topic (e.g., COVID-19 vaccine efficacy) is of interest, the topical tables in different sources/papers have hundreds of different schemas, depending on the authors, which significantly complicates both finding and querying them.
Here we demonstrate a scalable Meta-profiler system, capable of constructing a structured, standardized interface to a topic of interest in large-scale (semi-)structured datasets. This interface, which we call a Meta-profile, represents a multi-dimensional metadata summary for a selected topic of interest, accumulating all differently structured representations of the topical tables in the dataset. Such Meta-profiles can be used both as rich visualizations and as robust structural query interfaces, simplifying access to large-scale (semi-)structured data for different user segments, such as data scientists and end users.
The increasing application of Artificial Intelligence and Machine Learning models poses potential risks of unfair behaviour and, in light of recent regulations, has attracted the attention of the research community. Several researchers have focused on seeking new fairness definitions or developing approaches to identify biased predictions. These approaches focus solely on a discrete and limited space; only a few analyze the minimum variations required in the user characteristics to ensure a positive outcome for the individuals (counterfactuals). In that direction, the methodology proposed in this paper aims to unveil unfair model behavior using counterfactual reasoning in the case of fairness under unawareness. The method also proposes two new metrics that analyze the (estimated) sensitive information of counterfactual samples with the help of an external oracle. Experimental results on three datasets show the effectiveness of our approach in disclosing unfair behavior of state-of-the-art Machine Learning and debiasing models. Source code is available at https://github.com/giandos200/WWW-23-Counterfactual-Fair-Opportunity-Poster-.
Algospeak refers to social media users intentionally altering or substituting words when creating or sharing online content, for example, using ‘le$bean’ for ‘lesbian’. This study discusses the characteristics of algospeak as a computer-mediated language phenomenon on TikTok with regard to users’ algorithmic literacy and their awareness of how the platform’s algorithms work. We then present results from an interview study with TikTok creators on their motivations to utilize algospeak. Our results indicate that algospeak is used to oppose TikTok’s algorithmic moderation system in order to prevent unjust content violations and shadowbanning when posting about benign yet seemingly unwanted subjects on TikTok. We find that although algospeak helps to prevent such consequences, it often impedes the creation of quality content. We provide an adapted definition of algospeak and new insights into user-platform interactions in the context of algorithmic systems and algorithm awareness.
Wikidata Atlas is an online system that allows users to explore Wikidata items on an interactive global map; for example, users can explore the global distribution of all lighthouses described by Wikidata. Designing such a system poses challenges in terms of scalability, where some classes have hundreds of thousands of instances; efficiency, where visualisations are generated live; freshness, where we want changes on Wikidata to be reflected as they happen in the system; and usability, where we aim for the system to be accessible for a broad audience. Herein we describe the design and implementation of the system in light of these challenges.
Linked-data principles are increasingly adopted to integrate and publish semantically described open data using W3C standards, resulting in a large amount of available resources. In particular, meteorological sensor data have been uplifted into public RDF graphs, such as WeKG-MF, which offers access to a large set of meteorological variables described through spatial and temporal dimensions. Nevertheless, these resources include huge numbers of raw observations that are tedious for lay users to explore and reuse. In this paper, we leverage WeKG-MF to compute important agro-meteorological parameters and views with SPARQL queries. As a result, we deployed a LOD platform as a web application that allows users to navigate, consume, and produce linked datasets of agro-meteorological parameters calculated on the fly.
The privacy of data provided by available sources is one of the major concerns of our era. To address this challenge, the W3C has promoted recommendations for expressing privacy policies. One of these recommendations is the Open Digital Rights Language (ODRL) vocabulary. Although this standard has wide adoption, it is not suitable in domains such as IoT, ubiquitous and mobile computing, or discovery. The reason is that ODRL privacy policies cannot cope with dynamic information coming from external data sources and therefore cannot define privacy restrictions on data that is not already written into the policy beforehand. In this demo paper, a solution to this challenge is presented. We show how ODRL policies can overcome this limitation by being combined with a mapping language for RDF materialisation. The article shows how ODRL policies can consider data coming from an external data source when they are resolved; in particular, a weather forecast API that provides temperature values. The demonstration defines an ODRL policy that grants access to a resource only when the temperature reported by the API is above a certain value.
Communicating ontologies to potential users is still a difficult and time-consuming task. Even for small ones, users need to invest time to determine whether to reuse them. Providing diagrams together with the ontologies facilitates the task of understanding the model from a user perspective. While some tools are available for depicting ontologies, and the code could also be inspected using ontology editors’ graphical interfaces, in many cases, the diagrams are too big or complex. The main objective of this demo is to present Devos, a system to generate ontology diagrams based on different strategies for summarizing the ontology.
Computational notebook environments have drawn broad attention in data-centric research applications, e.g., virtual research environments, for exploratory data analysis and algorithm prototyping. Vanilla computational notebook search solutions have been proposed, but they do not pay much attention to the information needs of scientific researchers. Previous studies either treat computational notebook search as a code search problem or focus on content-based computational notebook search. The queries being considered are neither research-concerning nor diversified, whereas researchers’ information needs are highly specialized and complex. Moreover, relevance evaluation for computational notebooks is tricky and unreliable, since computational notebooks contain fragments of text and code and are usually poorly organized. To address these challenges, we propose CNSVRE, a computational notebook search system for virtual research environments (VREs), with scientific query reformulation and computational notebook summarization. We conduct a user study to demonstrate the system’s effectiveness, efficiency, and user satisfaction.
We introduce MediSage, an AI decision support assistant for medical professionals and caregivers that simplifies the way in which they interact with different modalities of electronic health records (EHRs) through a conversational interface. It provides step-by-step reasoning support to an end-user to summarize patient health, predict patient outcomes and provide comprehensive and personalized healthcare recommendations. MediSage provides these reasoning capabilities by using a knowledge graph that combines general-purpose clinical knowledge resources with the most recent information from the EHR data. By combining the structured representation of knowledge with the predictive power of neural models trained over both EHR and knowledge graph data, MediSage brings explainability by construction and represents a stepping stone into the future through further integration with biomedical language models.
As machine learning (ML) is increasingly integrated into our everyday Web experience, there is a call for transparent and explainable web-based ML. However, existing explainability techniques often require dedicated backend servers, which limit their usefulness as the Web community moves toward in-browser ML for lower latency and greater privacy. To address the pressing need for a client-side explainability solution, we present WebSHAP, the first in-browser tool that adapts the state-of-the-art model-agnostic explainability technique SHAP to the Web environment. Our open-source tool is developed with modern Web technologies such as WebGL that leverage client-side hardware capabilities and make it easy to integrate into existing Web ML applications. We demonstrate WebSHAP in a usage scenario of explaining ML-based loan approval decisions to loan applicants. Reflecting on our work, we discuss the opportunities and challenges for future research on transparent Web ML. WebSHAP is available at https://github.com/poloclub/webshap.
Child sexual abuse (CSA) is a pervasive issue in both online and physical contexts. Social media has grown in popularity as a platform for offering awareness, support, and community for those seeking help or advice regarding CSA. One popular social media platform on which such communities have formed is Reddit. In this study, we use both LDA and a reflexive thematic analysis to understand the types of engagement users have with subreddits aimed at CSA awareness. Through the reflexive thematic analysis, we identified six themes, including strong negative emotions and phrasing, seeking help, personal experiences and their impact, measurement strategies to prevent abuse, provisioning of support, and the problematic nature of the Omegle platform. This research has implications for those creating awareness materials around CSA safety as well as for child advocacy groups.
This paper describes a system for monitoring, in real time, the level of violence on Web platforms through the use of an artificial intelligence model that classifies textual data according to content. The system was successfully implemented and tested during the campaign period of the Brazilian 2022 elections by using it to monitor the attacks directed at thousands of candidates on Twitter. We show that, although an accurate and absolute quantification of violence is not feasible, the system yields differential measures of violence levels that can be useful for understanding human behavior online.
This paper proposes RealGraph+, an improved version of RealGraph that processes large-scale real-world graphs efficiently on a single machine. Via a preliminary analysis, we observe that the original RealGraph does not fully utilize the IO bandwidth provided by NVMe SSDs, a state-of-the-art storage device. To increase the IO bandwidth, we equip RealGraph+ with three optimization strategies that issue more frequent IO requests: (1) user-space IO, (2) asynchronous IO, and (3) SIMD processing. Via extensive experiments with four graph algorithms and six real-world datasets, we show that (1) each of our strategies is effective in increasing the IO bandwidth, thereby reducing the execution time; (2) RealGraph+ with all of our strategies improves on the original RealGraph significantly; (3) RealGraph+ outperforms state-of-the-art single-machine-based graph engines dramatically; and (4) its performance is comparable to or even better than that of distributed-system-based graph engines.
Users are exposed to a large volume of harmful content that appears daily on various social network platforms. One way to protect users is to develop online moderation tools that use Machine Learning (ML) techniques for automatic detection or content filtering. On the other hand, the processing of user data requires compliance with privacy policies. In this paper, we propose a framework for developing content moderation tools in a privacy-preserving manner, where sensitive information stays on the users' devices. For this purpose, we apply Differentially Private Federated Learning (DP-FL), where the training of ML models is performed locally on the users' devices, and only the model updates are shared with a central entity. To demonstrate the utility of our approach, we simulate harmful text classification on Twitter data in a distributed FL fashion, but the overall concept can be generalized to other types of misbehavior, data, and platforms. We show that the performance of the proposed FL framework can be close to that of the centralized approach, for both DP-FL and non-DP FL. Moreover, it performs well even if only a small number of clients (each with a small number of tweets) are available for FL training: when reducing the number of clients (from fifty to ten) or the tweets per client (from 1K to 100), the classifier still achieves a high AUC. Furthermore, we extend the evaluation to four other Twitter datasets that capture different types of user misbehavior and still obtain promising performance (61%-80% AUC). Finally, we explore the overhead on the users' devices during the FL training phase and show that local training does not introduce excessive CPU or memory overhead.
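The server side of the DP-FL setup described above can be sketched as follows: clients train locally and share only model updates, which the server clips and averages with added Gaussian noise. This is an illustrative DP-FedAvg-style round; the paper's exact mechanism, clipping norm, and noise scale are assumptions here, not taken from the abstract.

```python
import math
import random

def clip(update, max_norm):
    # L2-clip a client's model update so no single client dominates
    # (also bounds each client's sensitivity for the DP noise).
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def dp_fedavg_round(client_updates, max_norm=1.0, noise_std=0.1, rng=None):
    """One round of differentially private federated averaging:
    clip each client's update, average the clipped updates, then
    add Gaussian noise to the aggregate (illustrative sketch)."""
    rng = rng or random.Random(0)
    clipped = [clip(u, max_norm) for u in client_updates]
    dim = len(client_updates[0])
    avg = [sum(c[i] for c in clipped) / len(clipped) for i in range(dim)]
    return [a + rng.gauss(0, noise_std / len(clipped)) for a in avg]

# The third client's oversized update is clipped before aggregation:
updates = [[0.2, -0.1], [0.3, 0.0], [5.0, 5.0]]
print(dp_fedavg_round(updates))
```

Raw training data never leaves a device; only the (clipped, noised) aggregate is seen centrally, which is the privacy property the framework relies on.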
Pinterest fashion and home decor searchers often have different style tastes. Some existing work adopts users’ past engagement to infer style preference. These methods cannot help users discover new styles. Other work requires users to provide text or visual signals to describe their style preference, but users often are not familiar with style terms and do not have the right image to start with. In this paper, we propose a reinforcement learning (RL) method to help users explore and exploit style space without requiring extra user input. Experimental results show that our method improves the success rate of Pinterest fashion and home decor searches by 34.8%.
Recent studies have raised the alarm that much online hate speech is implicit. Given its subtle nature, explaining the detection of such hateful speech has been a challenging problem. In this work, we examine whether ChatGPT can be used to provide natural language explanations (NLEs) for implicit hateful speech detection. We design our prompt to elicit concise ChatGPT-generated NLEs and conduct user studies to evaluate their quality by comparing them with human-written NLEs. We discuss the potential and limitations of ChatGPT in the context of implicit hateful speech research.
Graph representation learning models have demonstrated great capability in many real-world applications. Nevertheless, prior research reveals that these models can learn biased representations leading to unfair outcomes. A few works have been proposed to mitigate the bias in graph representations. However, most existing works require substantial time and computing resources for training and fine-tuning. In this demonstration, we propose FairMILE, a framework for efficient fair graph representation learning. FairMILE allows the user to efficiently learn fair graph representations while preserving utility. In addition, FairMILE can work in conjunction with any unsupervised embedding approach based on the user's preference and accommodate various fairness constraints. The demonstration will introduce the methodology of FairMILE, showcase how to set up and run the framework, and demonstrate its effectiveness and efficiency to the audience through both quantitative metrics and visualization.
Edge computing and cloud computing have been utilized in many AI applications in various fields, such as computer vision, NLP, autonomous driving, and smart cities. To benefit from the advantages of both paradigms, we introduce HiDEC, a hierarchical deep neural network (DNN) inference framework with three novel features. First, HiDEC enables the training of a resource-adaptive DNN through the injection of multiple early exits. Second, HiDEC provides a latency-aware inference scheduler, which determines which input samples should exit locally on an edge device based on the exit scores, enabling inference on edge devices with insufficient resources to run the full model. Third, we introduce a dual thresholding approach allowing both easy and difficult samples to exit early. Our experiments on image and text classification benchmarks show that HiDEC significantly outperforms existing solutions.
In recent years, the use of bicycles as a healthy and economical means of transportation has been promoted worldwide. In addition, with the increase in bicycle commuting due to COVID-19, bicycles are attracting attention as a last-mile means of transportation in Mobility as a Service (MaaS). To help ensure a safe and comfortable ride using a smartphone mounted on a bicycle, this study analyzes riders' facial expressions to estimate potential comfort along the route given the surrounding environment, and provides a map on which users can give explicit feedback (FB) after riding. Combining the emotions recognized from facial expressions while riding with this FB, we annotate different locations with comfort levels. We then verify the relationship between locations with high comfort levels, based on the acquired data, and the surrounding environment of those locations using Google Street View (GSV).
This paper introduces a structure-aware method to segment web pages into chunks based on their web structure. We utilize large language models to select the chunks corresponding to a given intent and to generate an abstractive summary. Experiments on a food pantry dataset developed for mitigating food insecurity show that the proposed framework is promising.
Social media posts may go viral and reach large numbers of people within a short period of time. Such posts may threaten the public dialogue if they contain misleading content, making their early detection highly crucial. Previous works proposed their own metrics to label whether a tweet is viral, in order to detect such tweets automatically. However, such metrics may not accurately represent viral tweets or may introduce too many false positives. In this work, we use the ground truth data provided by Twitter's "Viral Tweets" topic to review the current metrics and to propose our own. We find that a tweet is more likely to be classified as viral by Twitter if the ratio of retweets to its author's followers exceeds some threshold; in our experiments, this threshold is 2.16. This rule results in fewer false positives, although it favors smaller accounts. We also propose a transformers-based model for the early detection of viral tweets, which achieves an F1 score of 0.79. The code and the tweet ids are publicly available at: https://github.com/tugrulz/ViralTweets
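The reported decision rule is simple enough to state directly in code: flag a tweet as likely viral when its retweet count exceeds 2.16 times its author's follower count. The zero-follower edge case below is a hypothetical addition, not from the paper.

```python
VIRAL_THRESHOLD = 2.16  # threshold found in the paper's experiments

def is_likely_viral(retweets: int, author_followers: int) -> bool:
    """Flag a tweet as likely viral when its retweet count exceeds
    2.16x its author's follower count. Note this favors smaller
    accounts, as the paper points out."""
    if author_followers == 0:
        # Hypothetical edge-case handling, not specified in the paper.
        return retweets > 0
    return retweets / author_followers > VIRAL_THRESHOLD

print(is_likely_viral(retweets=5000, author_followers=2000))    # → True
print(is_likely_viral(retweets=5000, author_followers=100000))  # → False
```

A large account's tweet needs far more retweets in absolute terms to cross the threshold, which is exactly the bias toward smaller accounts the abstract mentions.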
Recommender systems have achieved impressive results on benchmark datasets. However, the numbers are often influenced by assumptions made about the data and the evaluation mode. This work questions and revises these assumptions to study and improve recommendation quality, particularly for the difficult case of search-based recommendations. Users start with a personally liked item as a query and look for similar items that match their tastes. User satisfaction requires discovering truly unknown items: new authors of books rather than merely more books by known writers. We propose a unified system architecture that combines interaction-based and content-based signals and leverages language models for Transformer-powered predictions. We present new techniques for selecting negative training samples and investigate their performance in the underexplored search-based evaluation mode.
Graph neural networks have achieved state-of-the-art performance on graph-related tasks through layer-wise neighborhood aggregation. Previous works aim to achieve expressive power by designing injective neighborhood aggregation functions in each layer, but such functions are difficult to determine, and their numerous additional parameters make these models difficult to train. Expressive power, however, is determined jointly by the input space and the aggregation function. Instead of designing complex aggregation functions, we propose a simple and effective framework, namely MV-GNN, that improves model expressive power by constructing a new input space. Precisely, MV-GNN samples multi-view subgraphs for each node, and any GNN model can be applied to these views. The representation of the target node is finally obtained by aggregating all views injectively. Two typical GNNs (i.e., GCN and GAT) are adopted as base models in the proposed framework, and we demonstrate the effectiveness of MV-GNN through extensive experiments.
The core challenge in numerous real-world applications is to match an inquiry to the best document from a mutable and finite set of candidates. Existing industry solutions, especially latency-constrained services, often rely on similarity algorithms that sacrifice quality for speed. In this paper we introduce a generic semantic learning-to-rank framework, Self-training Semantic Cross-attention Ranking (sRank). This transformer-based framework uses a linear pairwise loss with mutable training batch sizes, achieves quality gains and high efficiency, and has been applied effectively to two industry tasks at Microsoft over real-world large-scale datasets: Smart Reply (SR) and Ambient Clinical Intelligence (ACI). In Smart Reply, sRank assists live customers with technical support by selecting the best reply from predefined solutions based on consumer and support agent messages. It achieves an 11.7% gain in offline top-one accuracy on the SR task over the previous system, and has enabled a 38.7% reduction in message-composition time in telemetry recorded since its general release in January 2021. In the ACI task, sRank selects relevant historical physician templates that serve as guidance for a text summarization model to generate higher-quality medical notes. It achieves a 35.5% top-one accuracy gain, along with a 46% relative ROUGE-L gain in generated medical notes.
Query categorization at customer-to-customer e-commerce platforms like Facebook Marketplace is challenging due to the vagueness of search intent, noise in real-world data, and imbalanced training data across languages. Its deployment also needs to consider challenges in scalability and downstream integration in order to translate modeling advances into better search result relevance. In this paper we present HierCat, the query categorization system at Facebook Marketplace. HierCat addresses these challenges by leveraging multi-task pre-training of dual-encoder architectures with a hierarchical inference step to effectively learn from weakly supervised training data mined from searcher engagement. We show that HierCat not only outperforms popular methods in offline experiments, but also leads to a 1.4% improvement in NDCG and a 4.3% increase in searcher engagement at Facebook Marketplace Search in two weeks of online A/B testing.
Product attributes can display the selling points of products, helping users find their desired products in search results. However, product attributes are typically incomplete. In e-commerce, products have multimodal features, including original attributes, images, and texts. How to make full use of the multimodal data to complete the missing attributes is the key challenge. To this end, we propose MPKGAC, a powerful three-stream framework that handles multimodal product data for attribute completion. We build a multimodal product knowledge graph (KG) from the multimodal features, and then convert the attribute completion problem into a multimodal KG completion task. MPKGAC encodes each modality separately, fuses them adaptively, and integrates multimodal decoders for prediction. Experiments show that MPKGAC outperforms the best baseline by 6.2% in Hit@1. MPKGAC is employed to enrich selling points of the women’s clothing industry at Alibaba.com.cn and improves the click-through rate (CTR) by a relative 2.14%.
Click behavior is the most widely used form of positive user feedback in recommendation. However, simply treating every click equally in training may suffer from clickbait and title-content mismatch, and thus fail to precisely capture users' real satisfaction with items. Dwell time can be viewed as a high-quality quantitative indicator of user preference for each click, yet existing recommendation models do not fully explore the modeling of dwell time. In this work, we focus on reweighting clicks with dwell time in recommendation. Precisely, we first define a new behavior named valid read, which helps select high-quality click instances for different users and items via dwell time. Next, we propose a normalized dwell time function to reweight click signals in training for recommendation. The click reweighting model achieves significant improvements on both offline and online evaluations in real-world systems.
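The abstract does not give the normalized dwell time function itself, so the following is a minimal hypothetical sketch of the idea: weight each click by its dwell time relative to an assumed item-dependent expected read time, so clickbait-like short clicks are down-weighted while full reads count in full. The function form, the expected-read-time input, and the floor value are all illustrative assumptions.

```python
def click_weight(dwell_seconds: float, expected_read_seconds: float,
                 min_weight: float = 0.1) -> float:
    """Hypothetical normalized dwell-time weight for a click (the
    paper's exact function is not given in the abstract): clicks with
    dwell time at or above the item's expected read time count fully;
    very short clicks, likely clickbait, are clamped to `min_weight`."""
    ratio = dwell_seconds / max(expected_read_seconds, 1e-9)
    return max(min_weight, min(1.0, ratio))

# A 5-second click on an article that takes ~60s to read is down-weighted:
print(click_weight(5, 60))    # → 0.1 (clamped to min_weight)
print(click_weight(45, 60))   # → 0.75
print(click_weight(120, 60))  # → 1.0
```

Such a weight would multiply each click's contribution to the training loss, which is the "reweighting click signals in training" step the abstract describes.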
We study the problem of cross-domain click-through rate (CTR) prediction for recommendation at Taobao. Cross-domain CTR prediction has been widely studied in recent years, yet most attempts ignore the continual learning setting of industrial recommender systems. In light of this, we present a necessary but less-studied problem named Continual Transfer Learning (CTL), which transfers knowledge from a time-evolving source domain to a time-evolving target domain. We propose an effective and efficient model called CTNet to perform CTR prediction under the CTL setting. The core idea behind CTNet is to treat source domain representations as external knowledge for target domain CTR prediction, such that the continually well-trained source and target domain parameters can be preserved and reused during knowledge transfer. Extensive offline experiments and online A/B testing at Taobao demonstrate the efficiency and effectiveness of CTNet. CTNet is now fully deployed online at Taobao, bringing significant improvements.
We study a particular matching task we call Music Cold-Start Matching. In short, given a cold-start song request, we expect to retrieve songs with similar audiences and then quickly push the cold-start song to the audiences of the retrieved songs to warm it up. However, this task has hardly been studied. In this paper, we therefore formalize the problem of Music Cold-Start Matching in detail and propose a solution. During offline training, we attempt to learn high-quality song representations based on song content features. However, we find that supervision signals typically follow a power-law distribution, causing skewed representation learning. To address this issue, we propose a novel contrastive learning paradigm named Bootstrapping Contrastive Learning (BCL) to enhance the quality of learned representations by exerting contrastive regularization. During online serving, to locate the target audiences more accurately, we propose Clustering-based Audience Targeting (CAT), which clusters audience representations to acquire a few cluster centroids and then locates the target audiences by measuring the relevance between the audience representations and the cluster centroids. Extensive experiments on the offline dataset and online system demonstrate the effectiveness and efficiency of our method. We have deployed it on NetEase Cloud Music, affecting millions of users.
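The CAT step can be sketched as follows, assuming precomputed clusters of audience vectors: summarize each cluster by its centroid, then keep audiences whose best centroid similarity clears a threshold. The clustering algorithm, the cosine relevance measure, and the threshold value are illustrative assumptions; the abstract does not specify them.

```python
import math

def centroid(vectors):
    # Mean vector of a cluster of audience representations.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def target_audiences(audience_vecs, clusters, threshold=0.8):
    """Sketch of Clustering-based Audience Targeting: score each
    audience against a few cluster centroids instead of against every
    other audience, and keep those whose best similarity exceeds a
    threshold. (`clusters` holds precomputed groups of vectors.)"""
    centroids = [centroid(c) for c in clusters]
    return [i for i, v in enumerate(audience_vecs)
            if max(cosine(v, c) for c in centroids) > threshold]

clusters = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
audiences = [[0.95, 0.05], [0.5, 0.5], [0.05, 0.95]]
print(target_audiences(audiences, clusters))  # → [0, 2]
```

Scoring against a handful of centroids rather than millions of individual audiences is what makes this efficient enough for online serving.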
Taobao Search consists of two phases: the retrieval phase and the ranking phase. Given a user query, the retrieval phase returns a subset of candidate products for the subsequent ranking phase. Recently, the paradigm of pre-training and fine-tuning has shown its potential for incorporating visual clues into retrieval tasks. In this paper, we focus on solving the problem of text-to-multimodal retrieval in Taobao Search. We observe that users' attention to titles or images varies across products. Hence, we propose a novel Modal Adaptation module for cross-modal fusion, which helps assign appropriate weights to texts and images across products. Furthermore, in e-commerce search, user queries tend to be brief and thus lead to a significant semantic imbalance between user queries and product titles. Therefore, we design a separate text encoder and a Keyword Enhancement mechanism to enrich the query representations and improve text-to-multimodal matching. Combining these designs, we present a novel vision-language (V+L) pre-training method to exploit the multimodal information of (user query, product title, product image). Extensive experiments demonstrate that our retrieval-specific pre-training model (referred to as MAKE) outperforms existing V+L pre-training methods on the text-to-multimodal retrieval task. MAKE has been deployed online and brings major improvements to the retrieval system of Taobao Search.
When a customer sees a movie recommendation, she may buy the ticket right away; this immediate feedback helps improve the recommender system. Alternatively, she may choose to come back later, and this long-term feedback is also modeled to promote user retention. However, long-term feedback comes with non-trivial challenges in understanding user retention: the complicated correlation between current demands and follow-up demands, coupled with the periodicity of services. For instance, before the movie, the customer buys popcorn through the App, which temporally correlates with the initial movie recommendation. Days later, she checks the App for new movies as a weekly routine. To address this complexity with more fine-grained revisit modeling, we propose Time Aware Service Sequential Recommendation (TASSR) for user retention, which is equipped with a multi-task design and an In-category TimeSeqBlock module. Large-scale online and offline experiments demonstrate its significant advantages over competitive baselines.
Embedding-based retrieval (EBR) methods are widely used in modern recommender systems thanks to their simplicity and effectiveness. However, along the journey of deploying and iterating on EBR in production, we have identified some fundamental issues in existing methods. First, when dealing with a large corpus of candidate items, EBR models often have difficulty balancing the performance of distinguishing highly relevant items (positives) from both irrelevant ones (easy negatives) and somewhat related yet not competitive ones (hard negatives). Also, we have little control over the diversity and fairness of the retrieval results because of the “greedy” nature of nearest vector search. These issues compromise the performance of EBR methods in large-scale industrial scenarios.
This paper introduces a simple, proven-in-production solution to overcome these issues. The proposed solution takes a divide-and-conquer approach: the whole set of candidate items is divided into multiple clusters, and we run EBR to retrieve relevant candidates from each cluster in parallel; the top candidates from each cluster are then combined by controllable merging strategies. This approach allows our EBR models to concentrate solely on discriminating positives from mostly hard negatives. It also enables further improvement from a multi-task learning (MTL) perspective: retrieval problems within each cluster can be regarded as individual tasks; inspired by recent successes in prompting and prefix-tuning, we propose an efficient task adaptation technique that further boosts the retrieval performance within each cluster with negligible overhead. The presented solution has been deployed in Kuaishou, one of the most popular short-video streaming platforms in China with hundreds of millions of active users.
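The divide-and-conquer retrieval above ends with a controllable merge of per-cluster result lists. One possible merging strategy (the paper's exact strategies are not specified in the abstract) is a per-cluster quota followed by a round-robin interleave, which gives direct control over how many results each cluster contributes:

```python
def merge_by_quota(per_cluster_results, quotas):
    """Controllable merging sketch: keep up to `quotas[c]` top-scored
    candidates from cluster c's retrieval results, then interleave the
    clusters round-robin so no single cluster dominates the final list.
    (One possible strategy, assumed for illustration.)"""
    # Trim each cluster's list to its quota, preserving score order.
    trimmed = [results[:q] for results, q in zip(per_cluster_results, quotas)]
    merged, i = [], 0
    while any(trimmed):
        bucket = trimmed[i % len(trimmed)]
        if bucket:
            merged.append(bucket.pop(0))
        i += 1
    return merged

# Three clusters retrieved in parallel, each list sorted by score:
per_cluster = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
print(merge_by_quota(per_cluster, quotas=[2, 2, 1]))
# → ['a1', 'b1', 'c1', 'a2', 'b2']
```

Because the quotas are explicit knobs, this kind of merge addresses the diversity-control problem noted earlier: a single dominant cluster can no longer crowd out the rest of the result list.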
Cloud failures have been a major threat to the reliability of cloud services. Many failure prediction approaches have been proposed to predict cloud failures before they actually occur, so that proactive actions can be taken to ensure service reliability. In industrial practice, existing failure prediction approaches mainly focus on utilizing state-of-the-art time series models to enhance the performance of failure prediction but neglect the training strategy. However, as curriculum learning points out, models perform better when they are trained with data ordered from easy to difficult. In this paper, we propose EDITS, a novel training strategy for cloud failure prediction that greatly improves the performance of existing cloud failure prediction models. Our experimental results on industrial and public datasets show that EDITS substantially enhances the performance of cloud failure prediction models. In addition, EDITS also outperforms other curriculum learning methods. More encouragingly, EDITS has been successfully applied to the Microsoft 365 and Azure online service systems, where it has measurably reduced the financial losses caused by cloud failures.
For training implicit collaborative filtering (ICF) models, hard negative sampling (HNS) has become a state-of-the-art solution for obtaining negative signals from massive uninteracted items. However, selecting appropriate hardness levels for personalized recommendations remains a fundamental, yet underexplored, problem. Previous HNS works have primarily adjusted the hardness level by tuning a single hyperparameter. However, applying the same hardness level to every user is unsuitable due to varying user behavioral characteristics, the quantity and quality of user records, and different consistencies of models' inductive biases. Moreover, increasing the number of hyperparameters is not practical due to the massive number of users. To address this important and challenging problem, we propose a model-agnostic and practical approach called hardness-personalized negative sampling (HAPENS). HAPENS uses a two-stage approach: in stage one, it trains the ICF model with a customized objective function that optimizes its worst performance on each user's interacted item set. In stage two, it utilizes these worst performances as personalized hardness levels with a well-designed sampling distribution, and trains the final model with the same architecture. We evaluated HAPENS on a dataset collected from Bing advertising and one public dataset, and the comprehensive experimental results demonstrate its robustness and superiority. Moreover, HAPENS has delivered significant benefits to the Bing advertising system. To the best of our knowledge, we are the first to study this important and challenging problem.
Same-style product retrieval plays an important role in e-commerce platforms, aiming to identify the same products even when they have different text descriptions or images. It can be used to retrieve similar products from different suppliers or to detect duplicate products from one supplier. Common methods use the image as the detected object, but they consider only visual features and overlook the attribute information contained in the textual descriptions; they perform poorly for products in industries where images are less informative, such as machinery, hardware tools, and electronic components, even if an additional text matching module is added. In this paper, we propose a unified vision-language modeling method for e-commerce same-style product retrieval, which is designed to represent one product with its textual descriptions and visual contents. It includes a sampling scheme that collects positive pairs from user click logs under category and relevance constraints, and a novel contrastive loss unit that models the image, text, and image+text representations in one joint embedding space. It is capable of cross-modal product-to-product retrieval, as well as style transfer and user-interactive search. Offline evaluations on annotated data demonstrate its superior retrieval performance, and online testing shows it attracts more clicks and conversions. Moreover, this model has already been deployed online for similar-product retrieval on alibaba.com, the largest B2B e-commerce platform in the world.
Embedding-based Retrieval (EBR) is a powerful search retrieval technique in e-commerce for addressing semantic matches between search queries and products. However, commerce search engines like Facebook Marketplace Search are complex multi-stage systems, with each stage optimized for a different business objective. The search retrieval stage usually focuses on query-product semantic relevance, while search ranking puts more emphasis on up-ranking products for high-quality engagement. As a result, the end-to-end search experience is a combined result of relevance, engagement, and the interaction between different stages of the system. This presents challenges to EBR systems in optimizing overall search experiences. In this paper we present Que2Engage, a search EBR system designed to bridge the gap between retrieval and ranking for better end-to-end optimization. Que2Engage takes a multimodal and multitask approach to infuse contextual information into the retrieval stage and balance different business objectives. We show the effectiveness of our approach via a multitask evaluation framework with thorough baseline comparisons and ablation studies. Que2Engage has been deployed in the Facebook Marketplace Search engine and shows significant improvements in user engagement in two weeks of A/B testing.
Quotas are often used in resource allocation and management scenarios to prevent resource abuse and to increase the efficiency of resource utilization. Quota management is usually implemented through a set of rules maintained by the system administrator. However, maintaining these rules requires deep domain knowledge, and arbitrary rules usually cannot guarantee both high resource utilization and fairness at the same time. In this paper, we propose a reinforcement learning framework to automatically respond to quota requests on cloud computing platforms whose users have distinctive usage characteristics. Extensive experimental results demonstrate the superior performance of our framework in achieving a good trade-off between efficiency and fairness.
A/B tests are the gold standard for evaluating digital experiences on the web. However, traditional “fixed-horizon” statistical methods are often incompatible with the needs of modern industry practitioners as they do not permit continuous monitoring of experiments. Frequent evaluation of fixed-horizon tests (“peeking”) leads to inflated type-I error and can result in erroneous conclusions. We have released an experimentation service on the Adobe Experience Platform based on anytime-valid confidence sequences, allowing for continuous monitoring of the A/B test and data-dependent stopping. We demonstrate how we adapted and deployed asymptotic confidence sequences in a full featured A/B testing platform, describe how sample size calculations can be performed, and how alternate test statistics like “lift” can be analyzed. On both simulated data and thousands of real experiments, we show the desirable properties of using anytime-valid methods instead of traditional approaches.
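The peeking problem motivating anytime-valid methods can be demonstrated with a small simulation: under the null hypothesis (no treatment effect), evaluating a fixed-horizon z-test at many interim points rejects far more often than the nominal 5%. This is an illustrative simulation, not the paper's confidence-sequence construction; the trial counts and checkpoint schedule are arbitrary choices.

```python
import math
import random

def z_stat(samples):
    # One-sample z statistic against a null mean of zero.
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean / math.sqrt(var / n)

def simulate(trials=400, n=400, peeks=20, z_crit=1.96, seed=7):
    """Under the null, compare the false-positive rate of a single
    fixed-horizon z-test with naive 'peeking' at interim checkpoints,
    rejecting the first time the statistic crosses the critical value."""
    rng = random.Random(seed)
    fixed_fp = peek_fp = 0
    for _ in range(trials):
        data = [rng.gauss(0, 1) for _ in range(n)]
        if abs(z_stat(data)) > z_crit:
            fixed_fp += 1
        checkpoints = range(n // peeks, n + 1, n // peeks)
        if any(abs(z_stat(data[:k])) > z_crit for k in checkpoints):
            peek_fp += 1
    return fixed_fp / trials, peek_fp / trials

fixed_rate, peek_rate = simulate()
print(f"fixed-horizon: {fixed_rate:.2f}, with peeking: {peek_rate:.2f}")
```

The fixed-horizon rate stays near the nominal 0.05, while the peeking rate is several times larger; anytime-valid confidence sequences are designed so that continuous monitoring keeps the error guarantee intact.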
Google My Business (GMB) is a platform that hosts business profiles, which are displayed when a user issues a relevant query on Google Search or Google Maps. GMB businesses provide a wide variety of services, from home cleaning and repair to legal consultation. However, the exact details of the service provided (a.k.a. job types) are often missing from business profiles. This places the burden of finding these details on the users. To alleviate this burden, we built a pipeline to automatically extract job types from business websites. We share the various challenges we faced while developing this pipeline, and how we effectively addressed them by (1) utilizing structured content to tackle the cold-start problem in dataset collection; (2) exploiting context information to improve model performance without hurting scalability; and (3) formulating the extraction problem as a retrieval task to improve generalizability, efficiency, and coverage. The pipeline has been deployed for over a year and is scalable enough to be periodically refreshed. The extracted job types are serving users of Google Search and Google Maps, with significant improvements in both precision and coverage.
Recommender systems usually rely on observed user interaction data to build personalized recommendation models, assuming that the observed data reflect user interest. However, a user may also interact with an item out of conformity, the need to follow popular items. Most previous studies neglect users' conformity and entangle interest with it, which may cause recommender systems to fail to provide satisfying results. Therefore, from a cause-effect view, disentangling these interaction causes is a crucial issue. It also helps with out-of-distribution (OOD) problems, where training and test data are out-of-distribution. Nevertheless, this is quite challenging, as we lack the signal to differentiate interest from conformity; the data sparsity of pure causes and the items' long-tail problem hinder disentangled causal embedding. In this paper, we propose DCCL, a framework that adopts contrastive learning to disentangle these two causes via sample augmentation for interest and conformity respectively. Furthermore, DCCL is model-agnostic and can be easily deployed in any industrial online system. Extensive experiments are conducted on two real-world datasets, and DCCL outperforms state-of-the-art baselines on top of various backbone models in various OOD environments. We also demonstrate the performance improvements through online A/B testing on Kuaishou, a billion-user-scale short-video recommender system.
Retrieving relevant items that match users' queries from a billion-scale corpus forms the core of industrial e-commerce search systems, in which embedding-based retrieval (EBR) methods are prevailing. These methods adopt a two-tower framework to learn embedding vectors for the query and item separately, and thus leverage efficient approximate nearest neighbor (ANN) search to retrieve relevant items. However, existing EBR methods usually ignore inconsistent user behaviors in industrial multi-stage search systems, resulting in insufficient retrieval efficiency and a low commercial return. To tackle this challenge, we propose to improve EBR methods by learning Multi-level Multi-Grained Semantic Embeddings (MMSE). We propose multi-stage information mining to exploit the ordered, clicked, unclicked, and randomly sampled items in practical user behavior data, and then capture query-item similarity via a post-fusion strategy. We then propose multi-grained learning objectives that integrate the retrieval loss, with its global comparison ability, and the ranking loss, with its local comparison ability, to generate semantic embeddings. Both experiments on a real-world billion-scale dataset and online A/B tests verify the effectiveness of MMSE in achieving significant performance improvements on metrics such as offline recall and online conversion rate (CVR).
Query intent classification, which aims to assist customers in finding desired products, has become an essential component of e-commerce search. Existing query intent classification models either design more elaborate models to enhance the representation learning of queries or explore label graphs and multi-task learning to help models learn external information. However, these models cannot capture multi-granularity matching features between queries and categories, which makes it hard for them to bridge the expression gap between informal queries and categories.
This paper proposes a Multi-granularity Matching Attention Network (MMAN), which contains three modules: a self-matching module, a char-level matching module, and a semantic-level matching module, to comprehensively extract features from the query and a query-category interaction matrix. In this way, the model can eliminate the difference in expression between queries and categories for query intent classification. We conduct extensive offline and online A/B experiments, and the results show that MMAN significantly outperforms strong baselines, demonstrating its superiority and effectiveness. MMAN has been deployed in production and brings great commercial value to our company.
Recently, short-video platforms have achieved rapid user growth by recommending interesting content to users. The objective of the recommendation is to optimize user retention, thereby driving the growth of DAU (Daily Active Users). Retention is long-term feedback accumulated over multiple interactions between users and the system, and it is hard to decompose the retention reward into individual items or lists of items. Traditional point-wise and list-wise models are therefore unable to optimize retention. In this paper, we adopt reinforcement learning methods to optimize retention, as they are designed to maximize long-term performance. We formulate the problem as an infinite-horizon request-based Markov Decision Process, where the objective is to minimize the accumulated time interval between sessions, which is equivalent to improving app open frequency and user retention. However, current reinforcement learning algorithms cannot be directly applied in this setting due to the uncertainty, bias, and long delay inherent to user retention. We propose a novel method, dubbed RLUR, to address these challenges. Both offline and live experiments show that RLUR can significantly improve user retention. RLUR has been fully launched in the Kuaishou app for a long time and achieves consistent performance improvements on user retention and DAU.
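The objective above, minimizing the accumulated time interval between sessions, can be sketched as a standard discounted return in which the per-step reward is the negative inter-session gap (a simplification of ours; RLUR's actual reward design may differ):

```python
# Toy sketch: reward = -(hours until the user's next session), so maximizing
# the discounted return is equivalent to minimizing accumulated session gaps,
# i.e., increasing app open frequency.
def discounted_return(session_gaps_hours, gamma=0.99):
    ret = 0.0
    for t, gap in enumerate(session_gaps_hours):
        ret += (gamma ** t) * (-gap)   # shorter gaps give a higher return
    return ret

# A user who returns quickly yields a higher return than one who lapses.
assert discounted_return([2.0, 3.0]) > discounted_return([10.0, 12.0])
```

This framing makes clear why retention is hard for point-wise models: the gap reward arrives only after the session ends and cannot be attributed to any single recommended item.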
Sequential recommender models are essential components of modern industrial recommender systems. These models learn to predict the next items a user is likely to interact with based on their interaction history on the platform. Most sequential recommenders, however, lack a higher-level understanding of the user intents that often drive user behaviors online. Intent modeling is thus critical for understanding users and optimizing long-term user experience. We propose a probabilistic modeling approach that formulates user intent as latent variables, inferred from user behavior signals using variational autoencoders (VAE). The recommendation policy is then adjusted according to the inferred user intent. We demonstrate the effectiveness of latent user intent modeling via offline analyses as well as live experiments on a large-scale industrial recommendation platform.
Modeling high-level user intent in recommender systems can improve performance, although it is often difficult to obtain a ground truth measure of this intent. In this paper, we investigate a novel way to obtain such an intent signal by leveraging resource pages associated with a particular task. We jointly model product interactions and resource page interactions to create a system which can recommend both products and resource pages to users. Our experiments consider the domain of home improvement product recommendation, where resource pages are DIY (do-it-yourself) project pages from Lowes.com. Each DIY page provides a list of tools, materials, and step-by-step instructions to complete a DIY project, such as building a deck, installing cabinets, and fixing a leaking pipe. We use this data as an indicator of the intended project, which is a natural high-level intent signal for home improvement shoppers. We then extend a state-of-the-art system to incorporate this new intent data, and show a significant improvement in the ability of the system to recommend products. We further demonstrate that our system can be used to successfully recommend DIY project pages to users. We have taken initial steps towards deploying our method for project recommendation in production on the Lowe’s website and for recommendations through marketing emails.
Model evolution and data updating are two common phenomena in large-scale real-world machine learning applications, e.g., ads and recommendation systems. To adapt, real-world systems typically retrain on all available data and perform online learning on recently available data, updating the models periodically with the goal of better serving performance. In this paper, we propose a novel framework, named Confidence Ranking, which designs the optimization objective as a ranking function over two different models. Our confidence ranking loss allows direct optimization of the logit outputs under different convex surrogates of metrics such as AUC and accuracy, depending on the target task and dataset. Our experiments show that introducing the confidence ranking loss outperforms all baselines on CTR prediction tasks over public and industrial datasets. This framework has been deployed in the advertisement system of JD.com to serve the main traffic in the fine-rank stage.
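One way to picture a ranking objective between two models is the following hedged sketch (our own simplification; the paper's exact surrogate and pairing scheme may differ): a convex logistic surrogate that pushes the trained model's logit above a frozen reference model's logit on positives, and below it on negatives.

```python
import math

def confidence_ranking_loss(logit_new, logit_ref, label):
    # For positives (label=1), reward logit_new > logit_ref;
    # for negatives, reward logit_new < logit_ref.
    margin = (logit_new - logit_ref) if label == 1 else (logit_ref - logit_new)
    return math.log1p(math.exp(-margin))  # logistic surrogate, convex in the margin

# A larger correct margin yields a smaller loss.
assert confidence_ranking_loss(2.0, 0.0, 1) < confidence_ranking_loss(0.5, 0.0, 1)
assert confidence_ranking_loss(-1.0, 0.0, 0) < confidence_ranking_loss(1.0, 0.0, 0)
```

Other convex surrogates (hinge, squared hinge) can be swapped in for the `log1p(exp(...))` term depending on whether the target metric is AUC-like or accuracy-like.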
Identifying the fraud risk of applications on web platforms is a critical challenge that requires both effectiveness and interpretability. In these high-stakes web applications, especially in financial scenarios, decision rules have been extensively used due to the rising requirements for explainable artificial intelligence (XAI). In this work, we develop a rule learning framework with rule mining and rule refining modules that addresses learning efficiency and class imbalance while making the decision rules more broadly and simply applicable to risk management scenarios. On four benchmark datasets and two large-scale datasets, the classification performance, interpretability, and scalability of the framework are demonstrated, achieving at least a 26.2% relative improvement over state-of-the-art (SOTA) models. The system is currently used by hundreds of millions of users and handles an enormous number of transactions at Ant Group, one of the largest mobile payment platforms in the world.
Causal effect estimation has received increasing attention in the past few years. Tree-based causal methods have been widely used for this problem due to their robustness and explainability. However, most existing methods are limited to running on a single machine, making it difficult to scale to the hundreds of millions of records typical of industrial scenarios. This paper proposes DGBCT, a Distributed Gradient Boosting Causal Tree, to tackle this problem; the contribution of this paper is threefold. First, we extend the original GBCT method to a multi-treatment setting and take monotonic constraints into consideration, so that more typical industrial necessities can be addressed within our framework. Moreover, we implement DGBCT on a 'Controller-Coordinator-Worker' framework, in which a dual failover mechanism is achieved and commendable flexibility is ensured. In addition, empirical results show that DGBCT significantly outperforms state-of-the-art causal trees and has a near-linear speedup as the number of workers grows. The system is currently deployed in Alipay to support daily business tasks that involve hundreds of millions of users.
In e-commerce, images are widely used to display intuitive information about items, and image selection significantly affects the user's click-through rate (CTR). Most existing work takes the CTR as the target for finding an appropriate image. However, these methods are challenging to deploy online efficiently. Moreover, the selected images may be unrelated to the item yet profitable for CTR, resulting in the undesirable phenomenon of enticing users to click on the item. To address these issues, we propose a novel two-stage pipeline with a content-based recall model and a CTR-based ranking model. The first is realized as a joint method based on a title-image matching model and a multi-modal knowledge graph embedding learning model. The second is a CTR-based visually aware scoring model that incorporates entity textual information and entity images. Experimental results show the effectiveness and efficiency of our method in offline evaluations. After a month of online A/B testing on the travel platform Fliggy, our method achieves a relative improvement of 5% over seller selection on CTCVR in the search scenario, and further improves pCTR from 3.48% for human picks to 3.53% in the recommendation scenario.
We propose a novel adaptation of graph-based active learning for customer address resolution or de-duplication, with the aim of determining whether two addresses represent the same physical building. For delivery systems, improving address resolution positively impacts multiple downstream systems such as geocoding, route planning, and delivery time estimation, leading to an efficient and reliable delivery experience for both customers and delivery agents. Our proposed approach jointly leverages address text, past delivery information, and concepts from graph theory to retrieve informative and diverse record pairs to label. We empirically show the effectiveness of our approach on manually curated datasets across addresses from India (IN) and the United Arab Emirates (UAE), achieving an absolute improvement in recall on average across IN and UAE while preserving precision over the existing production system. We also introduce delivery point (DP) geocode learning for cold-start addresses as a downstream application of address resolution. In addition to offline evaluation, we performed online A/B experiments which show that when the production model is augmented with actively learnt record pairs, delivery precision improves and delivery defects decrease on average across shipments from IN and UAE.
We present a few-shot intent detection model for an enterprise's conversational dialogue system. The model uses an intent topological tree to guide the search for the user intent using large language models (LLMs). Intents are resolved based on semantic similarities between user utterances and the text descriptions of the internal nodes of the intent tree or the intent examples in its leaf nodes. Our results show that an off-the-shelf language model can work reasonably well in a large enterprise deployment without fine-tuning, and its performance can be further improved with fine-tuning as more domain-specific data becomes available. We also show that the fine-tuned language model matches and outperforms state-of-the-art (SOTA) results in resolving conversation intents without training classifiers. With the use of a topological intent tree, our model provides more interpretability, cultivating people's trust in its decisions.
Query AutoComplete (QAC) helps customers complete their search queries quickly by suggesting completed queries. QAC on e-commerce sites usually employs Learning to Rank (LTR) approaches based on customer behaviour signals such as clicks and conversion rates to optimize business metrics. However, these approaches do not explicitly optimize for the quality of suggested queries, which results in a lack of navigational suggestions such as product categories and attributes, e.g., "sports shoes" and "white shoes" for the query "shoes". We propose to improve the quality of query suggestions by introducing navigational suggestions without impacting the business metrics. For this purpose, we augment the customer behaviour (CB) based objective with a Query-Quality (QQ) objective and assemble them with trainable mixture weights to define a multi-objective optimization function. We optimize this multi-objective function with the ALMO algorithm to obtain a model robust to any mixture weight. We show that this formulation improves query relevance on an e-commerce QAC dataset by at least 13% over the baseline Deep Pairwise LTR (DeepPLTR), with minimal impact on MRR, and results in a lift of 0.26% in GMV in an online A/B test. We also evaluated our approach on public search log datasets and observed improved query relevance when using query coherence as the QQ objective.
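The mixture of the CB and QQ objectives can be sketched as a simple convex combination (names and the scalar form are ours for illustration; in the paper the weight is trainable and the ALMO procedure seeks robustness across weights):

```python
# Minimal sketch of assembling two objectives with a mixture weight.
# alpha in [0, 1] trades off business signals (CB) against suggestion quality (QQ).
def mixed_loss(loss_cb, loss_qq, alpha):
    return alpha * loss_cb + (1.0 - alpha) * loss_qq

assert mixed_loss(1.0, 0.0, 1.0) == 1.0          # pure CB objective
assert mixed_loss(1.0, 0.0, 0.0) == 0.0          # pure QQ objective
assert abs(mixed_loss(2.0, 4.0, 0.5) - 3.0) < 1e-9
```

Training a single model that performs well for any value of `alpha`, rather than fixing one, is what distinguishes the robust multi-objective formulation from simply tuning a static weight.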
Click-through rate (CTR) estimation serves as a pivotal module in various online services. Previous studies mainly apply CTR models to recommendation or online advertising. Yet CTR is also critical in information retrieval, since the CTR probability can serve as a valuable feature for a query-document pair. In this paper, we study the CTR task in the account search scenario in WeChat, where users search for official accounts or mini programs corresponding to an organization. Despite the large number of existing CTR models, directly applying them to our task is inappropriate, since the account retrieval task has a number of specific characteristics. For example, unlike traditional user-centric CTR models, in our task CTR prediction is query-centric and does not model user information. In addition, queries and accounts are short texts and rely heavily on prior knowledge and semantic understanding. These characteristics require a specially designed CTR model. To this end, we propose a novel CTR prediction model named Knowledge eNhanced hIerarchical Fusion nEtwork (KNIFE). Specifically, to tackle the prior-information problem, we mine the knowledge graph of accounts as side information; to enhance the representations of queries, we construct a bipartite graph over queries and accounts. In addition, a hierarchical network structure is proposed to fuse the representations of different information in a fine-grained manner. Finally, the representations of queries and accounts are obtained from this hierarchical network and fed into the CTR model together with other features for prediction. We conduct extensive experiments against 12 existing models on two industrial datasets. Both offline and online A/B test results indicate the effectiveness of KNIFE.
With the development of recommender systems, mixing multiple item sequences from different sources has become an increasingly common need, and an integrated ranking stage with re-ranking models has been proposed for this task. However, existing methods ignore the relations between the sequences, thus converging to local optima over the interaction session. To resolve this challenge, in this paper we propose a new model named NFIRank (News Feed Integrated Ranking with reinforcement learning), formulating the whole interaction session as an MDP (Markov Decision Process). Sufficient offline experiments verify the effectiveness of our model. In addition, we deployed our model on Huawei Browser and gained 1.58% improvement in CTR over the baseline in an online A/B test. Code will be available at https://gitee.com/mindspore/models/tree/master/research/recommend/NFIRank.
Large-scale e-commerce platforms usually contain multiple business fields, requiring industrial algorithms to characterize user intents across multiple domains. Numerous efforts have been made in user multi-domain intent modeling to achieve state-of-the-art performance. However, existing methods mainly focus on domains with rich user information, so applying them to domains with sparse or rare user behavior meets with mixed success. Hence, in this paper, we propose a novel method named Meta-generator enhanced multi-Domain model (MetaDomain) to address this issue. MetaDomain mainly includes two steps: 1) multi-domain user intent representation and 2) multi-domain user intent fusion. Specifically, for intent representation, we use the gradient information from a domain intent extractor to train the domain intent meta-generator, where the domain intent extractor takes users' sequence features as input and the meta-generator takes users' basic features as input, hence gaining the capability of generating users' intent under sparse behavior. Afterward, for intent fusion, a domain graph is used to represent the high-order multi-domain connectivity. Extensive experiments have been carried out on Meituan, a real-world industrial platform. Both offline and rigorous online A/B tests at billion-level data scale demonstrate the superiority of the proposed MetaDomain method over state-of-the-art baselines. Furthermore, compared with a method using multi-domain sequence features, MetaDomain reduces serving latency by 20%. MetaDomain has been deployed in Meituan, one of the largest Online-to-Offline (O2O) platforms worldwide.
CTR and CVR are critical factors in personalized applications, and many methods jointly estimate them via multi-task learning to alleviate the ultra-sparsity of conversion behaviors. However, it is still difficult to predict CVR accurately and robustly due to the limited and even biased knowledge extracted by a single model tower optimized on insufficient conversion samples. In this paper, we propose a task adaptive multi-learner (TAML) framework for joint CTR and CVR prediction. We design a hierarchical task adaptive knowledge representation module with different experts to capture knowledge at different granularities, which can effectively exploit the commonalities between the CTR and CVR estimation tasks while keeping their unique characteristics. We apply multiple learners to extract data knowledge from various views and fuse their predictions to obtain accurate and robust scores. To facilitate knowledge sharing across learners, we further perform self-distillation, using the fused scores to teach the individual learners. Thorough offline and online experiments show the superiority of TAML on different ad ranking tasks, and we have deployed it on Huawei's online advertising platform to serve the main traffic.
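The fuse-then-distill idea can be sketched as follows (a simplification of ours, not TAML's exact losses: learner probabilities are averaged into a fused score, and each learner is additionally pulled toward that fused teacher via a Bernoulli KL term):

```python
import math

def fuse(probs):
    # fused "teacher" score: the mean of the learners' predicted probabilities
    return sum(probs) / len(probs)

def kl_bernoulli(p, q, eps=1e-12):
    # KL divergence between two Bernoulli distributions, clipped for stability
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

learners = [0.2, 0.3, 0.4]                # each learner's predicted probability
teacher = fuse(learners)                  # fused score used for serving
distill = sum(kl_bernoulli(teacher, p) for p in learners) / len(learners)

assert abs(teacher - 0.3) < 1e-9
assert distill >= 0.0                     # KL is non-negative
assert kl_bernoulli(teacher, teacher) == 0.0
```

The distillation term is minimized when every learner agrees with the fused score, which is how the ensemble's robustness is shared back into each individual learner.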
Digital technology organizations routinely use online experiments (e.g., A/B tests) to guide their product and business decisions. In e-commerce, we often measure changes to transaction- or item-based business metrics such as Average Basket Value (ABV), Average Basket Size (ABS), and Average Selling Price (ASP); yet it remains a common pitfall to ignore the dependency between the value/size of transactions/items during experiment design and analysis. We present empirical evidence on such dependency, its impact on measurement uncertainty, and the practical implications for A/B test outcomes if left unmitigated. By making the evidence available, we hope to raise awareness of the pitfall among experimenters in e-commerce and hence encourage the adoption of established mitigation approaches. We also share lessons learned when incorporating selected mitigation approaches into our experimentation analysis platform currently in production.
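One established way to account for this dependency, shown here purely as an illustration (the abstract does not specify which mitigation the authors adopt), is the delta method for ratio metrics: ABV-style metrics are ratios of per-unit sums, so their variance must include the covariance between numerator and denominator that a naive per-transaction estimate ignores.

```python
import numpy as np

def ratio_metric_variance(values, counts):
    """Delta-method variance of (sum of values) / (sum of counts),
    with one (values, counts) pair per randomization unit (e.g., per user)."""
    values, counts = np.asarray(values, float), np.asarray(counts, float)
    n = len(values)
    mu_v, mu_c = values.mean(), counts.mean()
    var_v, var_c = values.var(ddof=1), counts.var(ddof=1)
    cov = np.cov(values, counts, ddof=1)[0, 1]
    # First-order Taylor expansion of the ratio around (mu_v, mu_c):
    return (var_v / mu_c**2
            - 2 * mu_v * cov / mu_c**3
            + mu_v**2 * var_c / mu_c**4) / n

rng = np.random.default_rng(1)
counts = rng.poisson(3, size=500) + 1                  # baskets per user
values = counts * rng.gamma(2.0, 10.0, size=500)       # value correlated with size
assert ratio_metric_variance(values, counts) > 0.0
```

Dropping the covariance term here is exactly the pitfall the abstract describes: when basket value and basket count are positively correlated, the naive variance mis-states the uncertainty of the ABV estimate.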
Interacting with voice assistants such as Amazon Alexa to aid in day-to-day tasks has become a ubiquitous phenomenon in modern households. These voice assistants often have screens to provide visual content (e.g., images, videos) to their users. There is an increasing trend of users shopping or searching for products using these devices, yet these voice assistants do not support commands or queries that contain visual references to the content shown on screen (e.g., "blue one", "red dress"). We introduce a novel multi-modal visual shopping experience in which the voice assistant is aware of the visual content shown on the screen and assists the user in item selection via natural-language multi-modal interactions. We detail a practical, lightweight end-to-end system architecture spanning model fine-tuning, deployment, and skill invocation on an Amazon Echo family device with a screen. We also define a niche "Visual Item Selection" task and evaluate whether we can effectively leverage publicly available multi-modal models, and the embeddings they produce, for the task. We show that open-source contrastive embeddings such as CLIP and ALBEF achieve non-trivial zero-shot accuracy on the "Visual Item Selection" task on an internally collected visual shopping dataset. By further fine-tuning the embeddings, we obtain gains of 8.6% to 24.0% in relative accuracy over a baseline. The technology that enables our visual shopping assistant is available as an Alexa Skill in the Alexa Skills store.
Recommender systems are increasingly successful at recommending personalized content to users. However, these systems often capitalize on popular content. There is also a continuous evolution of user interests that needs to be captured, but no direct way to systematically explore users' interests. This also tends to affect the overall quality of the recommendation pipeline, as training data is generated from the candidates presented to the user. In this paper, we present a framework for exploration in large-scale recommender systems to address these challenges. It consists of three parts: first, user-creator exploration, which focuses on identifying the best creators that users are interested in; second, an online exploration framework; and third, a feed composition mechanism that balances exploration and exploitation to ensure an optimal prevalence of exploratory videos. Our methodology can be easily integrated into an existing large-scale recommender system with minimal modifications. We also analyze the value of exploration by defining relevant metrics around user-creator connections and understanding how exploration helps the overall recommendation pipeline, with strong online gains in creator and ecosystem value. In contrast to the regression in user engagement metrics generally seen while exploring, our method achieves significant improvements of 3.50% in strong creator connections and a 0.85% increase in novel creator connections. Moreover, our work has been deployed in production on Facebook Watch, a popular video discovery and sharing platform serving billions of users.
Learning large-scale industrial recommender system models by fitting them to historical user interaction data makes them vulnerable to conformity bias. This may be due to a number of factors, including the fact that user interests may be difficult to determine and that many items are often interacted with based on ecosystem factors other than their relevance to the individual user. In this work, we introduce CAM2, a conformity-aware multi-task ranking model to serve relevant items to users on one of the largest industrial recommendation platforms. CAM2 addresses these challenges systematically by leveraging causal modeling to disentangle users’ conformity to popular items from their true interests. This framework is generalizable and can be scaled to support multiple representations of conformity and user relevance in any large-scale recommender system. We provide deeper practical insights and demonstrate the effectiveness of the proposed model through improvements in offline evaluation metrics compared to our production multi-task ranking model. We also show through online experiments that the CAM2 model results in a significant 0.50% increase in aggregated user engagement, coupled with a 0.21% increase in daily active users on Facebook Watch, a popular video discovery and sharing platform serving billions of users.
Many recommendation systems rely on point-wise models, which score items individually. However, a point-wise model scoring a video cannot account for the other videos recommended in the same query. As a result, diversity has to be introduced through heuristic-based rules, which cannot capture user preferences or make balanced trade-offs between diversity and item relevance. In this paper, we propose a novel method that introduces diversity by modeling the impact of low diversity on users' engagement with individual items, thereby accounting for both diversity and relevance when adjusting item scores. The proposed method is designed to be easily pluggable into existing large-scale recommender systems while introducing minimal changes to the recommendation stack. Our models show significant improvements in offline metrics based on normalized cross-entropy loss compared to production point-wise models. Our approach also shows a substantial increase of 1.7% in topline engagement coupled with a 1.5% increase in daily active users in an A/B test with live traffic on Facebook Watch, which translates into millions of additional daily active users for the product.
Multi-Source Domain Adaptation (MSDA) is widely used in machine learning scenarios with domain shift between labeled source domains and unlabeled target domains. Conventional MSDA methods are built on the strong hypothesis that data samples from the same source belong to the same domain with the same latent distribution. However, in practice, sources and latent domains are not necessarily in one-to-one correspondence. To tackle this problem, we propose a novel Multi-source Reconstructed Domain Adaptation (MRDA) framework for MSDA. We use an Expectation-Maximization (EM) mechanism that iteratively reconstructs the source domains to recover the latent domains and performs domain adaptation on the reconstructed domains. Specifically, in the E-step, we cluster the samples from multiple sources into different latent domains, with a soft assignment strategy to avoid cluster imbalance. In the M-step, we freeze the latent domains clustered in the E-step and optimize the objective function for domain adaptation, using a global-specific feature extractor to capture both domain-invariant and domain-specific features. Extensive experiments demonstrate that our approach can reconstruct source domains and perform domain adaptation on them effectively, significantly outperforming state-of-the-art (SOTA) baselines (e.g., 1% to 3.1% absolute improvement in AUC).
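The alternating E-step/M-step structure can be sketched with a toy soft-clustering skeleton (our simplification: soft k-means in feature space stands in for the E-step, and a centroid update stands in for the M-step, which in MRDA would instead train the adaptation model on the frozen reconstructed domains):

```python
import numpy as np

def soft_assign(features, centroids, temperature=1.0):
    # E-step: soft responsibilities of each sample over latent domains;
    # soft (rather than hard) assignment helps mitigate cluster imbalance.
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def m_step_centroids(features, resp):
    # M-step (toy version): update centroids under the frozen assignments.
    return (resp.T @ features) / resp.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)),      # samples from one latent domain
               rng.normal(2, 0.5, (50, 2))])      # samples from another
C = rng.normal(size=(2, 2))                       # initial latent-domain centroids
for _ in range(10):
    R = soft_assign(X, C)
    C = m_step_centroids(X, R)

assert R.shape == (100, 2)
assert np.allclose(R.sum(axis=1), 1.0)            # valid soft assignments
```

The key structural point is the freeze: responsibilities computed in the E-step are held fixed while the M-step optimizes, so each half of the iteration solves a simpler sub-problem.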
With the explosion of Internet product users, locating the faulty back-end applications among many after a customer complaint has become an essential issue for improving user experience. However, existing solutions mostly rely on manual testing to infer the fault, severely limiting their efficiency. In this paper, we transform the problem of locating faulty applications into two subproblems and propose a fully automated framework. In the first stage, we design a scorecard model to evaluate the semantic relevance between applications and customer complaints. In the second stage, topology graphs that reflect the actual calling relationships and engineering connection relationships between applications are used to evaluate topology relevance. Specifically, we employ a multi-graph co-learning framework constrained by a consistency-independence loss and an engineering-theory-driven clustering strategy for unsupervised learning on the graphs. Combining semantic and topology relevance, we can comprehensively locate the relevant faulty applications. Experiments on the Alipay dataset show that our method gains significant improvements in both model performance and efficiency.
E-commerce platforms provide entrances for customers to enter mini-apps that can meet their specific shopping requirements. Trigger items displayed on entrance icons can attract more entries. However, conventional Click-Through-Rate (CTR) prediction models, which ignore the user's instant interest in the trigger item, cannot be applied to this new recommendation scenario, dubbed Trigger-Induced Recommendation in Mini-Apps (TIRA). Moreover, due to the high stickiness of some customers to mini-apps, existing trigger-based methods that over-emphasize the importance of triggers are undesirable for TIRA, since a large portion of customer entries stem from routine shopping habits rather than triggers. We identify that the key to TIRA is to extract customers' personalized entering intention and weigh the impact of triggers based on this intention. To achieve this goal, we convert CTR prediction for TIRA into a separable estimation form and present the Deep Intention-Aware Network (DIAN) with three key elements: 1) an Intent Net that estimates the user's entering intention, i.e., whether he/she is affected by the trigger or by habit; 2) a Trigger-Aware Net and 3) a Trigger-Free Net that estimate CTRs given that the user's intention is toward the trigger item or the mini-app, respectively. Through joint learning, DIAN can both accurately predict user intention and dynamically balance the results of trigger-free and trigger-based recommendations. Experiments show that DIAN advances state-of-the-art performance on a large real-world dataset and brings a 9.39% lift in online Item Page Views and 4.74% in CTR for Juhuasuan, a famous mini-app of Taobao.
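The separable estimation form described above can be sketched as an intention-weighted mixture (the exact parameterization here is ours for illustration; DIAN learns all three components jointly with neural networks):

```python
# Hypothetical sketch: the final CTR mixes the trigger-aware and trigger-free
# estimates, weighted by the Intent Net's estimated probability that the user
# entered because of the trigger rather than out of habit.
def dian_ctr(p_trigger_intent, ctr_trigger_aware, ctr_trigger_free):
    return (p_trigger_intent * ctr_trigger_aware
            + (1.0 - p_trigger_intent) * ctr_trigger_free)

# A habitual visitor (low trigger intent) is scored mostly by the trigger-free net.
assert abs(dian_ctr(0.1, 0.9, 0.2) - 0.27) < 1e-9
assert dian_ctr(0.0, 0.9, 0.2) == 0.2   # pure habit: trigger estimate ignored
assert dian_ctr(1.0, 0.9, 0.2) == 0.9   # pure trigger intent
```

This decomposition is what lets the model down-weight the trigger for sticky, habit-driven customers instead of over-emphasizing it for everyone.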
User behaviors on an e-commerce app not only contain different kinds of feedback on items but also sometimes reveal cognitive clues about the user's decision-making. To understand the psychological procedure behind user decisions, we introduce the behavior path and propose to match the user's current behavior path with historical behavior paths to predict user behaviors on the app. Further, we design a deep neural network for behavior path matching and address three difficulties in modeling behavior paths: sparsity, noise interference, and accurate matching. In particular, we leverage contrastive learning to augment user behavior paths, use behavior path self-activation to alleviate the effect of noise, and adopt a two-level matching mechanism to identify the most appropriate candidate. Our model shows excellent performance on two real-world datasets, outperforming state-of-the-art CTR models. Moreover, our model has been deployed on the Meituan food delivery platform and has accumulated 1.6% improvement in CTR and 1.8% improvement in advertising revenue.
This paper presents WAM Studio, an open-source, online Digital Audio Workstation (DAW) that takes advantage of several W3C Web APIs, such as Web Audio, WebAssembly, Web Components, Web MIDI, and Media Devices. It also uses the Web Audio Modules proposal, which has been designed to facilitate the development of interoperable audio plugins (effects, virtual instruments, virtual piano keyboards as controllers, etc.) and host applications. DAWs are feature-rich software and therefore particularly complex to develop in terms of design, implementation, performance, and ergonomics. Very few commercial online DAWs exist today, and the only open-source examples lack features (no support for interoperable plugins, for example) and do not take advantage of the recent possibilities offered by modern W3C APIs (e.g., AudioWorklets/WebAssembly). WAM Studio was developed as an open-source technology demonstrator to showcase the potential of the web platform made possible by these APIs. The paper highlights some of the difficulties we encountered (e.g., limitations due to the sandboxed and constrained environments that are web browsers, latency compensation, etc.). An online demo and a GitHub repository with the source code are available.
In this paper, I discuss arguments for and against building for the Web. I look at three extraordinary examples of apps built for the Web and analyze the reasons their creators provided for doing so. I then look at the decline of interest in cross-platform app frameworks, with the exception of Flutter, which leads me to two research questions: RQ1, "Why do people not fully bet on PWAs?", and RQ2, "Why is Flutter so popular?". My hypothesis for why developers don't bet on the Web more frequently is that in many cases they (or their non-technical reporting lines) don't realize how powerful it has become. To counter that, I introduce a Web app and a browser extension that demonstrate the Web's capabilities.
The inclusion of autistic people can be supported by a mobile app that provides information without a human mediator, making information perception more liberating for people on the spectrum. This paper is an overview of a doctoral work dedicated to the development of a web-based mobile tool for supporting the inclusion of people on the autism spectrum. The work includes UX/UI research conducted with psychiatry experts, a web information retrieval study, and neural question-answering research. Currently, the study results comprise several mobile app layouts, a retriever-reader model design, and a fine-tuned neural network for extractive question answering. Source code and other resources are available at https://github.com/vifirsanova/empi.
The study of framing bias on the Web is crucial in our digital age, as the framing of information can influence human behavior and decisions on critical issues such as health or politics. Traditional frame analysis requires a curated set of frames derived from manual content analysis by domain experts. In this work, we introduce a frame analysis approach based on pretrained Transformer models that lets us capture frames in an exploratory manner, beyond predefined frames. In experiments on two public online news and social media datasets, we show that our approach identifies underexplored conceptualizations, such as health-related content being framed in terms of beliefs by conspiracy media, while mainstream media is instead concerned with science. We anticipate our work to be a starting point for further research on exploratory computational framing analysis using pretrained Transformers.
The research addresses a topic whose precise boundaries are yet to be defined: criminal accountability for conduct committed in the Metaverse. Following a short introduction motivating why this issue must be considered pivotal both for the Web and for society, the main problem raised by the research is identified: whether an action taken against a person that in real life would constitute criminal conduct can be considered a crime in the Metaverse as well. A short assessment of the (so far very limited) state of the art, as well as the proposed approach and methodology, is then given. Finally, the contribution presents its current results and concludes that countries are strongly encouraged to adapt their criminal frameworks to the Metaverse, and that the international community should consider the topic a priority on its agenda. Nevertheless, further experimental and research work remains to be done.
Knowledge Graphs (KGs) form the backbone of many knowledge-dependent applications such as search engines and digital personal assistants. KGs are generally constructed either manually or automatically, using a variety of extraction techniques applied over multiple data sources. Due to the diverse quality of these data sources, anomalies are likely to be introduced into any KG, so it is unrealistic to expect a perfect archive of knowledge. Given how large KGs can be, manual validation is impractical, necessitating an automated approach to anomaly detection in KGs. To improve KG quality, and to identify interesting and abnormal triples (edges) and entities (nodes) that are worth investigating, we introduce SEKA, a novel unsupervised approach to detecting anomalous triples and entities in a KG using both the structural characteristics and the content of the graph's edges and nodes. While an anomaly can be an interesting or unusual discovery, such as a fraudulent transaction requiring human intervention, anomaly detection can also identify potential errors. We propose a novel approach, named the Corroborative Path Algorithm, to generate a matrix of semantic features, which we then use to train a one-class Support Vector Machine to identify abnormal triples and entities with no dependency on external sources. We evaluate SEKA on four real-world KGs, demonstrating its ability to detect anomalies and to outperform comparative baselines.
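The detection step described above (training a one-class SVM on a feature matrix, with no external supervision) can be sketched as follows. This is a generic illustration, not SEKA's implementation: the toy two-dimensional features stand in for the semantic feature matrix that the Corroborative Path Algorithm would produce.

```python
# Sketch: flag anomalous triples by training a one-class SVM on a
# per-triple feature matrix, with no external supervision.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 2))                  # one (toy) feature row per triple
shifts = np.array([[9, 9], [-9, 9], [9, -9], [-9, -9], [0, 13]])
features[:5] = features[:5] + shifts                  # five injected abnormal triples

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(features)

labels = clf.predict(features)                        # +1 = normal, -1 = anomalous
print(int((labels == -1).sum()), "of", len(labels), "triples flagged")
```

With `nu=0.05`, roughly 5% of the triples at most are treated as training outliers; the five injected rows land well outside the learned boundary and are flagged.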
In recent years, knowledge graph embeddings have achieved great success. Many methods have been proposed and have achieved state-of-the-art results on various tasks. However, most current methods exhibit one or more of the following problems: (i) they consider only fact triplets, ignoring the ontology information of knowledge graphs; (ii) the obtained embeddings do not contain much semantic information, so using them for semantic tasks is problematic; (iii) they do not enable large-scale training. In this paper, we propose a new algorithm that incorporates the ontology of knowledge graphs and partitions the knowledge graph based on classes to include more semantic information for the parallel training of large-scale knowledge graph embeddings. Our preliminary results show that our algorithm performs well on several popular benchmarks.
This study investigates statements of participation in exploitative animal activities on the social media website Twitter. The data include posts (tweets) related to two exploited species: the sloth (N=32,119) and the elephant (N=15,160). Tweets for each of these case studies were examined and labeled. The initial results reveal several features of interaction with exploited species. Namely, a high number of tweets indicate that individuals participated in exploited-species activities during vacations in destinations that double as native countries for the species. The data also indicate that a large number of these activities take place at fairs, carnivals, and circuses. These initial results shed light on trends in human participation in activities with exploited species, and will offer insight to stakeholders seeking to bolster education programs and quantify the level of animal exploitation.
An information gap exists across Wikipedia's language editions, with a considerable proportion of articles available in only a few languages. As an illustration, it has been observed that 10 languages account for half of all available Wikipedia articles, despite the existence of 330 Wikipedia language editions. To address this issue, this study presents an approach for identifying the information gap between the different language editions of Wikipedia. The proposed approach employs Latent Dirichlet Allocation (LDA) to analyze linked entities in a cross-lingual knowledge graph and determine topic distributions for Wikipedia articles in 28 languages. The distance between paired articles across language editions is then calculated. The potential applications of the proposed algorithm to detecting sources of information disparity in Wikipedia are discussed, and directions for future research are put forward.
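The final step, calculating the distance between paired articles' topic distributions, can be illustrated with a small sketch. The Jensen-Shannon distance and the four-topic vectors below are assumptions for illustration; the abstract does not specify the metric used.

```python
# Sketch: distance between the LDA topic distributions of an article
# pair across two language editions (larger = bigger content gap).
import numpy as np
from scipy.spatial.distance import jensenshannon

def article_gap(topics_a, topics_b):
    """Jensen-Shannon distance between two topic distributions (0 = identical)."""
    return jensenshannon(topics_a, topics_b, base=2)

en = np.array([0.70, 0.20, 0.05, 0.05])   # hypothetical topic mix, English edition
de = np.array([0.10, 0.15, 0.60, 0.15])   # hypothetical topic mix, German edition

print(article_gap(en, en))                # 0.0: no gap
print(article_gap(en, de))                # clearly above 0: content diverges
```

With base 2, the distance lies in [0, 1], which makes gaps comparable across article pairs.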
The current era is dominated by intelligent Question Answering (QA) systems that can instantly answer almost any question, saving users search time and increasing throughput and precision in the applied domain. A vast amount of work is being carried out on QA systems to deliver better content satisfying users' information needs. Since QA systems are ascending the cycle of emerging technologies, there are potential research gaps that can be explored. QA systems form a significant part of conversational artificial intelligence, giving rise to a new research pathway, i.e., Conversational Question Answering (CQA) systems. We propose to design and develop a CQA system leveraging hypergraph-based techniques. The approach focuses on multi-turn conversation and multiple contexts to gauge users' exact information needs and deliver better answers. We further aim to address "supporting evidence-based retrieval" for fact-based, responsible answer generation. Since a QA system requires large amounts of data and processing, we also intend to investigate hardware performance for effective system utilization.
Previous research has identified phenomena such as cyberbystander intervention and various other forms of responses to aggressive or hateful behaviours online. In the online media ecosystem, some people from marginalized communities and their allies have attempted to enhance organic engagement by participating in organized activism, which is sometimes characterized as "non-complementary" or "indirect". This paper attempts to identify, recognize, and label this phenomenon, as well as provide suggestions for further research in this area.
Signed network graphs provide a way to model complex relationships and interdependencies between entities: negative edges allow for a deeper study of social dynamics. One approach to achieving balance in a network is to model the sources of conflict through structural balance. Current methods focus on computing the frustration index or finding the largest balanced clique, but these do not account for multiple ways to reach a consensus or scale well for large, sparse networks. In this paper, we propose an expansion of the frustration cloud computation and compare various tree-sampling algorithms that can discover a high number of diverse balanced states. Then, we compute and compare the frequencies of balanced states produced by each. Finally, we investigate these techniques’ impact on the consensus feature space.
Reflex-in is a sound installation that uses brain-wave streams to create musical compositions within the Web environment in real time. The work incorporates various state-of-the-art Web technologies, including Web Audio, WebSocket, WebAssembly, and WebGL.
The music generated from the algorithm - mapping brain wave signal to musical events - aims to produce a form of furniture music that is relaxing and meditative, possibly therapeutic. This effect can be further enhanced through binaural beats or other forms of auditory stimulation, also known as “digital drugs,” which can be enabled through the user interface. The system represents a potential avenue for the development of closed-loop brain-computer interfaces by using the listener’s own brain waves as the source of musical stimuli, which can be used for therapeutic or medical purposes.
Web browsers have come a long way since their inception, evolving from a simple means of displaying text documents over the network to complex software stacks with advanced graphics and network capabilities. As personal computers grew in popularity, developers jumped at the opportunity to deploy cross-platform games with centralized management and a low barrier to entry. Simply going to the right address is now enough to start a game. From text-based to GPU-powered 3D games, browser gaming has evolved to become a strong alternative to traditional console and mobile-based gaming, targeting both casual and advanced gamers. Browser technology has also evolved to accommodate more demanding applications, sometimes even supplanting functions typically left to the operating system. Today, websites display rich, computationally intensive, hardware-accelerated graphics, allowing developers to build ever-more impressive applications and games.
In this paper, we present the evolution of browser gaming and the technologies that enabled it, from the release of the first text-based games in the early 1990s to current open-world and game-engine-powered browser games. We discuss the societal impact of browser gaming and how it has allowed a new target audience to access digital gaming. Finally, we review the potential future evolution of the browser gaming industry.
This paper presents a personal chronicle of internet access in Cuba from the perspective of a visitor to the island, told across three time periods: 1997, 2010, and 2021. The story describes how the island first connected to the internet in the 90s, how internet access evolved throughout the 2000s, and ends with the role the internet played in the government protests of July 11, 2021. The article analyzes how internet access in Cuba has changed over the decades and its effects on civil society, discussing issues such as Cuba's technological infrastructure, internet censorship, and free expression.
Wikidata, now a decade old, is the largest public knowledge graph, with data on more than 100 million concepts contributed by over 560,000 editors. It is widely used in applications and research. At its launch in late 2012, however, it was little more than a hopeful new Wikimedia project, with no content, almost no community, and a severely restricted platform. Seven years earlier still, in 2005, it was merely a rough idea of a few PhD students, a conceptual nucleus that had yet to pick up many important influences from others to turn into what is now called Wikidata. In this paper, we try to recount this remarkable journey, and we review what has been accomplished, what has been given up on, and what is yet left to do for the future.
The Web has grown considerably since its inception and opened up a multitude of opportunities for people all around the world for work, leisure, and learning. These opportunities were limited to western audiences early on, but globalization has now put almost the entire world online. While there is a growing social understanding and acknowledgment of various gender and ethnic groups in society, we still have a long way to go toward achieving equity in gender and ethnic representation, especially in the workplace. In this paper, we attempt to quantify the diversity and evenness, in terms of gender and ethnicity, of The WebConference participants over its 30-year history. The choice is motivated by the monumental contribution of this conference to the evolution of the web. In particular, we study the gender and ethnicity of program committee members, authors, and other speakers at the conference between 1994 and 2022. We also generate the co-speaker network over the three decades to study how closely the speakers work with each other. Our findings show that we still have a long way to go before achieving fair representation at The WebConference, especially for female participants and individuals from non-White, non-Asian ethnicities.
The internet has ingrained itself into every aspect of our lives, but there's one aspect of the digital world that some take for granted. Did you ever notice that many links, specifically hyperlinks, are blue? When a coworker casually asked me why links are blue, I was stumped. As a user experience designer who has created websites since 2001, I've always made my links blue. I have advocated for the specific shade of blue, and for the consistent application of blue, yes, but I've never stopped and wondered, why are links blue? It was just a fact of life. Grass is green and hyperlinks are blue. Culturally, we associate links with the color blue so much that in 2016, when Google changed its links to black, it created quite a disruption.
But now, I find myself all consumed by the question, WHY are links blue? WHO decided to make them blue? WHEN was this decision made, and HOW has this decision made such a lasting impact?
Mosaic, an early browser released by Marc Andreessen and Eric Bina on January 23, 1993, had blue hyperlinks. To truly understand the origin and evolution of hyperlinks, I took a journey through technology history and interfaces to explore how links were handled before color monitors, and how interfaces and hyperlinks rapidly evolved once color monitors became an option.
The year 2023 marks the thirty-second anniversary of the World Wide Web being announced.
In the intervening years, the web has become an essential part of the fabric of society. Part of that is that huge amounts of information that used to be available (only) on paper are now available (only) electronically. One of the dangers of this is that owners of information often treat the data as ephemeral, and delete old information once it becomes out of date. As a result, society is at risk of losing large parts of its history.
So it is time to assess how we use the web, how it has been designed, and what we should do to ensure that in one hundred years time (and beyond) we will still be able to access, and read, what we are now producing. We can still read 100 year-old books; that should not be any different for the web.
This paper takes a historical view of the web, and discusses the web from its early days: why it was successful compared with other similar systems emerging at the time, the things it did right, the mistakes that were made, and how it has developed into the web we know today; to what extent it meets the requirements for such an essential part of society's infrastructure; and what still needs to be done.
This paper summarizes the content of the 28 tutorials that have been given at The Web Conference 2023.
This paper introduces the 20 workshops that were organized at The Web Conference 2023.
In this paper, we describe our intuitions about how language technologies can contribute to creating new ways of enhancing the accessibility of exhibits in cultural contexts, by exploiting knowledge about the history of our senses and the link between perception and language.
We evaluate the performance of five multi-class classification models for the task of sensory recognition and introduce the DEEP Sensorium (Deep Engaging Experiences and Practices - Sensorium), a multidimensional dataset that combines cognitive and affective features to inform systematic methodologies for augmenting exhibits with multi-sensory stimuli.
For each model, using different feature sets, we show that features expressing the affective dimension of words, combined with sub-lexical features, perform better than uni-dimensional training sets.
Olfactory experience has great power to awaken human memories and emotions, in some cases even surpassing vision. Studies have shown that olfactory scene descriptions in images and text can also arouse the human olfactory imagination, yet few studies have approached the related problems from the perspective of computer vision and NLP. This paper proposes a multimodal model that detects similar olfactory experiences in paired text-image samples. The model operates in two stages, coarse-grained and fine-grained. It first uses feature fusion based on pre-trained CLIP for coarse-grained matching, obtaining a preliminary feature extractor that bootstraps fine-grained matching; it then uses similarity computation based on stacked cross attention for fine-grained matching to obtain the final feature extractor, which in turn improves coarse-grained matching. Finally, we manually build a list of approximate olfactory nouns during fine-grained matching, which not only yields significantly better performance when fed back into the fine-grained matching process but can also serve future research. Experiments on the MUSTI task dataset of MediaEval 2022 show that both matching stages of the proposed model perform well, with F1 scores exceeding the existing baseline models.
Influencers are followed on social media platforms by relatively small groups of people under a common theme. Unlike global celebrities, influencers are challenging to place into general categories of fame (e.g., Politics, Religion, Entertainment) because of their overlapping and narrow reach to people interested in these categories.
In this paper, we focus on categorizing influencers based on their followers. We exploit the top-1K Twitter celebrities to identify the common interest among an influencer's followers as their category. We annotate the top one thousand celebrities with categories of popularity, language, and location. Such categorization is essential for targeted marketing, recommending experts, and similar applications. We define a novel FollowerSimilarity between the set of followers of an influencer and those of a celebrity, and propose an inverted index to calculate similarity values efficiently. We exploit the similarity score in a K-Nearest Neighbor classifier and visualize the top celebrities over a neighborhood-embedded space.
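The inverted-index idea can be sketched as follows. Jaccard overlap stands in for the paper's FollowerSimilarity (whose exact definition is not given here), and all account names and categories are hypothetical.

```python
# Sketch: inverted index from follower -> followed accounts, so that
# follower-set overlaps are computed only where they are nonzero.
from collections import Counter, defaultdict

celeb_followers = {                      # hypothetical follower sets
    "celebA": {"u1", "u2", "u3"},
    "celebB": {"u2", "u4"},
    "celebC": {"u5", "u6"},
}
celeb_category = {"celebA": "Politics", "celebB": "Politics",
                  "celebC": "Entertainment"}

# Inverted index: follower -> celebrities they follow.
index = defaultdict(set)
for celeb, followers in celeb_followers.items():
    for u in followers:
        index[u].add(celeb)

def categorize(influencer_followers, k=2):
    """Majority category among the k most follower-similar celebrities."""
    overlap = Counter()                  # shared-follower counts via the index
    for u in influencer_followers:
        for celeb in index.get(u, ()):
            overlap[celeb] += 1
    sims = {c: n / len(influencer_followers | celeb_followers[c])   # Jaccard
            for c, n in overlap.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    return Counter(celeb_category[c] for c in nearest).most_common(1)[0][0]

print(categorize({"u1", "u2", "u4"}))    # Politics
```

The index keeps the computation sparse: a celebrity's similarity is touched only if at least one follower is shared, which is what makes the scheme efficient at the top-1K scale.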
Online misinformation has become a major concern in recent years, and it was further emphasized during the COVID-19 pandemic. Social media platforms, such as Twitter, can be serious vectors of online misinformation. In order to better understand the spread of fake news, lies, deceptions, and rumours, we analyze the correlations between the following textual features in tweets: emotion, sentiment, political bias, stance, veracity, and conspiracy theories. We train several transformer-based classifiers on multiple datasets to detect these textual features and identify potential correlations using conditional distributions of the labels. Our results show that the online discourse on some topics, such as COVID-19 regulations or conspiracy theories, is highly controversial and reflects the actual U.S. political landscape.
We present a novel algorithm for multilingual text clustering built upon two well-studied techniques: multilingual aligned embeddings and community detection in graphs. The aim of our algorithm is to discover, through clustering, the underlying topics in a multilingual dataset. We present both a numerical evaluation using the silhouette and V-measure metrics and a qualitative evaluation, for which we propose a new systematic approach. Our algorithm shows robust overall performance, and its results were empirically validated by an analyst. The work presented here was done in the context of a large multilingual public consultation, for which our algorithm was deployed and used on a daily basis.
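A minimal sketch of the two-step pipeline, under stated assumptions: synthetic vectors replace real multilingual aligned embeddings, the cosine threshold of 0.9 is arbitrary, and NetworkX's greedy modularity algorithm stands in for whichever community detection method the authors use.

```python
# Sketch: similarity graph over (synthetic) aligned embeddings, then
# community detection; each community is read as a topic cluster.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(1)
# Two synthetic "topics": 10 documents around each of two centroids.
docs = np.vstack([rng.normal(1.0, 0.1, (10, 8)),
                  rng.normal(-1.0, 0.1, (10, 8))])

# Connect documents whose cosine similarity exceeds a threshold.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
sim = unit @ unit.T
G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] > 0.9:
            G.add_edge(i, j, weight=float(sim[i, j]))

communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])   # the two topic clusters
```

Because the embeddings are aligned across languages, documents about the same topic land in the same graph community regardless of their source language.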
This is the 9th edition of the workshop series "AW4City – Web Applications and Smart Cities", which started in Florence in 2015 and has taken place every year since in conjunction with the WWW conference series. Last year the workshop was held virtually in Lyon, France. The series investigates the role of the Web and of Web applications in delivering on smart city (SC) promises and in SC growth. This year, the workshop focuses on the role of the Web in social coherence. Cities appear to play a crucial role in securing humanity against social threats and in generating sustainable and circular cities. In this regard, cities attempt to secure social sustainability and coherence (e.g., dealing with affordable energy, poverty, hunger, and equal opportunities in education, jobs, and health) and to enhance their performance in order to become friendlier and able to host their increasing populations. Additionally, new types of business appear (e.g., for smart energy), while the co-existence of autonomous things and people generates another challenge that cities have started facing. This workshop aims to demonstrate how Web applications and Web intelligence can serve communities.
Smart city platforms operate as central points of access for locally collected data. The smart city hub (SCHub) introduces a new concept that aims to homogenize data, service, human, and material flows in cities. A proof of concept is based on non-typical data flows, such as those collected by biotechnological activities related to diet-related non-communicable diseases (NCDs). NCDs are responsible for 1 in 5 deaths globally. Most diet-related NCDs are linked to the gut microbiome, the microbial community that resides in our gastrointestinal tract. The imbalance or loss of microbiome diversity is one of the main factors leading to NCDs, affecting various functions including energy metabolism, intestinal permeability, and brain function. Gut dysbiosis is reflected in altered concentrations of short-chain fatty acids (SCFAs) produced by the gut microbiota. A microcapsule system can play the role of a sensor that collects data from the local community and transmits it to the SCHub, so that doctors receive the appropriate patient information and define the appropriate treatment, and so that the city can process anonymized information and measure the community's health in dietary terms. A prototype with a biosensor that correlates the amount of gut SCFAs with gut microbiome functional capacities is presented in this paper, together with a use-case scenario that engages the SCHub.
As remote-sensing becomes more actively utilized in the environmental sciences, our research continues the efforts in adapting smart cities by using civilian UAVs and drones for land surface temperature (LST) analysis. Given the increased spatial resolution that this technology provides as compared to standard satellite measurements, we sought to further study the urban heat island (UHI) effect – specifically when it comes to heterogeneous and dynamic landscapes such as the Charleston peninsula. Furthermore, we sought to develop a method to enhance the spatial resolution of publicly available LST temperature data (such as those measured from the Landsat satellites) by building a machine learning model utilizing remote-sensed data from drones. While we found a high correlation and an accurate degree of prediction for areas of open water and vegetation (respectively), our model struggled when it came to areas containing highly impervious surfaces. We believe, however, that these findings further illustrate the discrepancy between high and medium spatial resolutions, and demonstrate how urban environments specifically are prone to inaccurate LST measurements and are uniquely in need of an industry pursuit of higher spatial resolution for hyperlocal environmental sciences and urban analysis.
In this era of rapid urbanization and our endeavors to create more smart cities, it is crucial to keep track of how our society and neighborhoods are being impacted, and important to make conscious decisions that keep sustainability in harmony. Multiple frameworks exist to evaluate how sustainable a place is, be it a city or a region; one such framework is the Circles of Sustainability. Though these frameworks offer good solutions, collecting the relevant data to make them widely usable is a challenge. This paper focuses on this specific issue by taking the methodology introduced in the framework and applying it practically to better understand how sustainable our cities and society are. We present a unique web-based application that uses publicly accessible data to compute sustainability scores and ranks for every city, and presents the results in an intelligent, easy-to-comprehend visual interface. The paper also discusses the technical difficulties associated with creating such an application, including data collection, data processing, data integration, and the scoring algorithm. It concludes by discussing the need for such practical solutions to promote sustainable urban development.
An environmental assessment (EA) report describes and assesses the environmental impact of the activities involved in the development of a project. As such, EA is a key tool for sustainability. Improving information access to EA reporting is a billion-euro untapped business opportunity to build an engaging, efficient digital experience for EA. We aim to become a landmark initiative in making this experience come true, by transforming the traditional manual assessment of numerous heterogeneous reports by experts into a computer-assisted approach. Specifically, a knowledge graph that represents and stores facts about EA practice allows what has so far been accessible only manually to become machine-readable, and thereby enables downstream information access services. This paper describes the ongoing process of building DreamsKG, a knowledge graph that stores relevant data- and expert-driven EA reporting and practice in Denmark. Representation of cause-effect relations in EA and integration of the Sustainable Development Goals (SDGs) are among its prominent features.
The circular economy aims to reduce value loss and avoid waste by extending the life span of materials and products, including circulating materials or product parts before they become waste. Circular economy models (e.g., circular value networks) are typically complex and networked, involving different cross-industry domains. In the context of a circular value network, multiple actors may be involved, such as suppliers, manufacturers, recyclers, and product end-users, and there may be various flows of resources, energy, information, and value throughout the network. This means we face the challenge that data and information from cross-industry domains in a circular economy model are not built on common ground and, as a result, are difficult for both humans and machines to understand and use. Using ontologies to represent domain knowledge can enable actors and stakeholders from different industries in the circular economy to communicate in a common language. The knowledge domains involved include circular economy, sustainability, materials, products, manufacturing, and logistics. The objective of this paper is to survey the landscape of current ontologies for these domains, which will enable us in the future to explore what existing knowledge can be adapted or reused to develop ontologies for circular value networks.
In this work, we propose a sustainable path-finding application for grain transportation during the ongoing Russian military invasion of Ukraine. The application builds a suite of algorithms to find possible optimal paths for transporting the grain that remains in Ukraine, using the KNowledge Acquisition and Representation Methodology (KNARM) and the KnowWhereGraph. Currently, we are working towards creating an ontology that will allow for a more effective heuristic approach by incorporating the lessons learned from the KnowWhereGraph. The aim is to enhance the path-finding process and provide more accurate and efficient results. In the future, we will continue exploring and implementing new techniques that can further improve the sustainability of path-finding applications with a knowledge graph backend for grain transportation through hazardous and adversarial environments. The code is available upon the reviewers' request; it cannot be made public due to the sensitive nature of the data.
The utility of a search system for its users can be further enhanced by providing personalized results and recommendations within the search context. However, research discussions around these aspects of search remain fragmented across different conferences and workshops. Hence, this workshop aims to bring together researchers and practitioners from industry and academia to discuss the algorithmic and system challenges in search personalization and in recommending effectively within the search context.
Personalized Search (henceforth called P10d Search) focuses on delivering user-specific search results based on previous purchases. A search engine retrieves results based on a defined relevancy algorithm: when a user searches for a keyword, the engine constructs the search query over the defined searchable fields/attributes along with the configured relevancy algorithm. The position of an item in the search results is determined by the search algorithm based on the search term. The results are further refined or ranked using click-stream signals, product features, and market data to provide more relevant results. Personalisation, by contrast, provides a ranked list of items for a given user based on past purchases; it is agnostic of the search query and takes the user ID, cart additions, site taxonomy, and the user's shopping history as input signals. In summary, the search engine queries data based on relevance, while the personalisation engine retrieves items based purely on purchases. The goal of personalised search is to enhance the search results by adding personalised results without affecting search relevance.
Counterfactual learning to rank via inverse propensity weighting is the most popular approach to train ranking models using biased implicit user feedback from logged search data. Standard click propensity estimation techniques rely on simple models of user browsing behavior that primarily account for the attributes of the presentation context that affect whether the relevance of an item to the search context is observed. Most notably, the inherent effect of the listwise presentation of the items on users’ propensity for engagement is captured in the position of the presented items on the search result page. In this work, we enrich this position-bias-based click propensity model by proposing an observation model that further incorporates the underlying search intent, as reflected in the user’s click pattern in the search context. Our approach does not require an intent prediction model based on the content of the search context. Instead, we rely on a simple, yet effective, non-causal estimate of the user’s browsing intent from the number of click events in the search context. We empirically characterize the distinct rank decay patterns of the estimated click propensities in the characterized intent classes. In particular, we demonstrate a sharper decay of click propensities in top ranks for the intent class identified by sparse user clicks, and a higher likelihood of observing clicks in lower ranks for the intent class identified by a higher number of user clicks. We show that the proposed intent-aware propensity estimation technique helps train ranking models with more effective personalization and generalization power, as demonstrated by empirical results for a ranking task on a major e-commerce platform.
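The intent-conditioned observation model above can be illustrated with a minimal sketch; the decay exponents and the click-count threshold below are hypothetical stand-ins for the estimates characterized in the paper:

```python
import numpy as np

# Hypothetical decay exponents: the sparse-click intent class decays
# faster in rank than the dense-click class, as described above.
DECAY = {"sparse": 1.5, "dense": 0.7}

def intent_class(num_clicks, threshold=2):
    """Non-causal intent estimate from the number of clicks in the context."""
    return "sparse" if num_clicks <= threshold else "dense"

def propensity(rank, intent):
    """Rank-decaying observation propensity, conditioned on the intent class."""
    return float(rank) ** (-DECAY[intent])

def ipw_weights(clicked_ranks, num_clicks):
    """Inverse-propensity weights for the clicked items of one search context."""
    intent = intent_class(num_clicks)
    return np.array([1.0 / propensity(r, intent) for r in clicked_ranks])
```

Lower-ranked clicks receive larger inverse-propensity weights, and the sparse-click intent class down-weights low ranks more aggressively, matching the sharper decay described above.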
At Netflix, personalization plays a key role in several aspects of our user experience, from ranking titles to constructing an optimal Homepage. Although personalization is a well established research field, its application to search presents unique problems and opportunities. In this paper, we describe the evolution of Search personalization at Netflix, its unique challenges, and provide a high level overview of relevant solutions.
Streaming media has become a popular medium for consumers of all ages, with people spending several hours a day streaming videos, games, music, or podcasts across devices. Most global streaming services have introduced Machine Learning (ML) into their operations to personalize consumer experience, improve content, and further enhance the value proposition of streaming services. Despite the rapid growth, there is a need to bridge the gap between academic research and industry requirements and build connections between researchers and practitioners in the field. This workshop aims to provide a unique forum for practitioners and researchers interested in Machine Learning to get together, exchange ideas and get a pulse for the state of the art in research and burning issues in the industry.
Video downscaling is an important component of adaptive video streaming, which tailors streaming to screen resolutions of different devices and optimizes picture quality under varying network conditions. With video downscaling, a high-resolution input video is downscaled into multiple lower-resolution videos. This is typically done by a conventional resampling filter like Lanczos. In this talk, we describe how we improved Netflix video quality by developing neural networks for video downscaling and deploying them at scale.
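For context, the conventional Lanczos filter mentioned above has a standard closed-form kernel; the 1-D downscaler below is a naive illustrative sketch, not the deployed pipeline:

```python
import math

def lanczos_kernel(x, a=3):
    """Standard Lanczos-a resampling kernel: a*sin(pi*x)*sin(pi*x/a)/(pi*x)^2 on |x| < a."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def downscale_1d(samples, factor, a=3):
    """Naive 1-D Lanczos downscale by an integer factor (illustrative only)."""
    out = []
    for i in range(len(samples) // factor):
        center = i * factor + (factor - 1) / 2.0
        acc, norm = 0.0, 0.0
        for j in range(len(samples)):
            w = lanczos_kernel((j - center) / factor, a)
            acc += w * samples[j]
            norm += w
        out.append(acc / norm if norm else 0.0)
    return out
```

The learned neural downscaler described in the talk replaces this fixed kernel with filters optimized for perceived quality at the lower resolution.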
Customers search for movie and series titles released across the world on streaming services like primevideo.com (PV) and netflix.com (Netflix). In non-English-speaking countries like India, Nepal, and many others, regional titles are transliterated from the native language to English and are searched in English. Given that multiple transliterations are possible for almost all titles, searching for a regional title can be a very frustrating customer experience if these nuances are not handled correctly by the search system. Typing errors make the problem even more challenging. Streaming services use spell correction and auto-suggest/auto-complete features to address this issue to a certain extent. Auto-suggest fails when users search for keywords outside its scope. Spell correction is effective at correcting common typing errors, but because these titles do not follow strict grammar rules and new titles are constantly added to the catalog, spell correction has limited success.
With recent progress in deep learning (DL), dense retrieval based on embedding vectors is used extensively to retrieve semantically relevant documents for a given query. In this work, we use dense retrieval to address the noise introduced by transliteration variations and typing errors, to improve the retrieval of regional media titles. In the absence of any relevant dataset to test our hypothesis, we created a new dataset of 40K query-title pairs from PV search logs. We also created a baseline by benchmarking PV’s performance on the test data. We present an extensive study of the impact of (1) pre-training, (2) data augmentation, (3) the positive-to-negative sample ratio, and (4) the choice of loss function on retrieval performance. Our best model shows a 51.24% improvement in Recall@16 over the PV baseline.
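A minimal sketch of dense retrieval and the Recall@16 metric reported above; the embeddings here are placeholders, and any query/title encoder could supply them:

```python
import numpy as np

def retrieve_top_k(query_vec, title_vecs, k=16):
    """Rank catalog titles by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    t = title_vecs / np.linalg.norm(title_vecs, axis=1, keepdims=True)
    scores = t @ q
    return np.argsort(-scores)[:k]

def recall_at_k(results, relevant_idx, k=16):
    """1.0 if the relevant title appears in the top-k results, else 0.0."""
    return 1.0 if relevant_idx in results[:k] else 0.0
```

Averaging `recall_at_k` over a set of query-title pairs yields the Recall@16 figure used to compare models against the PV baseline.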
To improve Amazon Music podcast services and customer engagement, we introduce Entity-Linked Topic Extraction (ELTE) to identify well-known entity and event topics from podcast episodes. An entity can be a person, organization, work of art, etc., while an event, such as the Opioid epidemic, occurs at specific point(s) in time. ELTE first extracts key phrases from episode title and description metadata. It then uses entity linking to canonicalize them against the Wikipedia knowledge base (KB), ensuring that the topics exist in the real world. ELTE also models NIL-predictions for entity or event topics that are not in the KB, as well as topics that are not of entity or event type. To test the model, we construct a podcast topic database of 1166 episodes from various categories. Each episode comes with a Wiki-link-annotated main topic or a NIL-prediction. ELTE produces the best overall Exact Match (EM) score of 0.84, with by far the best EM of 0.89 among entity- or event-type episodes, as well as NIL-predictions for episodes without an entity or event main topic (EM score of 0.86).
Millions of pieces of content are created daily on platforms like YouTube, Facebook, and TikTok. Most such large-scale recommender systems are data-demanding, so content embeddings take substantial time to mature. This problem is aggravated when no behavioral data is available for new content. Poor-quality recommendations for these items lead to user dissatisfaction and a short content shelf-life. In this paper we propose MEMER (Multimodal Encoder for Multi-signal Early-stage Recommendations), a solution that utilises the multimodal semantic information of content to generate better-quality embeddings for early-stage items. We demonstrate the flexibility of the framework by extending it to various explicit and implicit user actions. Using these learnt embeddings, we conduct offline and online experiments to verify its effectiveness. The predicted embeddings show significant gains in online early-stage experiments for both videos and images (videos: 44% relative gain in click-through rate, 46% relative gain in explicit engagements, 9% relative gain in successful video plays, 20% relative reduction in skips; images: 56% relative gain in explicit engagements). This also compares well against the performance of mature embeddings (83.3% RelaImpr (RI) in Successful Video Play, 97.8% RelaImpr in Clicks).
Recommender systems are widely used in many Web applications to recommend items that are relevant to a user’s preferences. However, focusing on exploiting user preferences while ignoring exploration leads to biased feedback and hurts the user’s experience in the long term. The Multi-Armed Bandit (MAB) framework was introduced to balance the tradeoff between exploitation and exploration. By utilizing context information in the reward function, contextual bandit algorithms achieve better performance than context-free bandit algorithms. However, existing contextual bandit algorithms either assume a linear relation between the expected reward and the context features, which limits their representation power, or use a deep neural network in the reward function, which is impractical to implement. In this paper, we propose a new contextual bandit algorithm, DeepLinUCB, which leverages the representation power of a deep neural network to transform the raw context features in the reward function. Specifically, this deep neural network is dedicated to the recommender system, which is efficient and practical in real-world applications. Furthermore, we conduct extensive experiments in our online recommender system using requests from real-world scenarios and show that DeepLinUCB is efficient and outperforms other bandit algorithms.
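The LinUCB-style scoring that DeepLinUCB builds on can be sketched as follows; the fixed tanh transform stands in for the learned deep network, and the dimensions and exploration weight alpha are illustrative assumptions:

```python
import numpy as np

def transform(x, W):
    """Stand-in for the deep network that transforms raw context features."""
    return np.tanh(W @ x)

class LinUCBArm:
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)    # ridge-regularized design matrix
        self.b = np.zeros(dim)  # reward-weighted feature sum
        self.alpha = alpha      # exploration weight

    def ucb(self, z):
        """Expected reward plus exploration bonus for transformed context z."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return theta @ z + self.alpha * np.sqrt(z @ A_inv @ z)

    def update(self, z, reward):
        self.A += np.outer(z, z)
        self.b += reward * z
```

At serving time, the arm with the highest `ucb` score is recommended; the exploration bonus shrinks as an arm accumulates observations, shifting the policy from exploration to exploitation.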
The International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment (Sci-K 2023) is now running its third edition. The Sci-K workshop is a venue that brings together researchers and practitioners from different disciplines (including, but not limited to, Digital Libraries, Information Extraction, Machine Learning, Semantic Web, Knowledge Engineering, Natural Language Processing, Scholarly Communication, Science of Science, Scientometrics and Bibliometrics), as well as professionals from the industry, to explore innovative solutions and ideas for the production and consumption of Scientific Knowledge Graphs and assessing the research impact. The workshop has called for high-quality submissions around the three main themes of research, related to scientific knowledge: representation, discovery, and assessment.
In response to the call for papers, the workshop received outstanding submissions from researchers in 15 different countries: United States of America, Germany, United Kingdom, Ireland, Sweden, Canada, India, Brazil, Australia, Italy, Slovenia, Bulgaria, Denmark, Ethiopia, and Norway. Each paper was reviewed by at least three members of the programme committee. Given the quality and the interesting topics covered by the submissions, we accepted 10 papers.
Sci-K 2023 builds on two previous successful editions and keeps attracting a combined pool of attendees. The first edition (Sci-K 2021) was held on 13 April 2021 in conjunction with The Web Conference 2021; its program consisted of two keynote talks and the presentation of 11 research papers. The second edition (Sci-K 2022) took place on 26 April 2022 at The Web Conference 2022; its program included the presentation of 5 long papers, 4 short papers, and 2 vision papers, along with 2 keynote speeches and a panel on “What’s next after Microsoft Academic Graph?”.
The full program as well as the list of accepted papers can be found on the Sci-K website: https://sci-k.github.io/2023/.
Representation learning is the first step in automating tasks such as research paper recommendation, classification, and retrieval. Due to the accelerating rate of research publication, together with the recognised benefits of interdisciplinary research, systems that help researchers discover and understand relevant works from beyond their immediate school of knowledge are vital. This work explores different methods of research paper representation (or document embedding) to identify those methods that are capable of preserving the interdisciplinary implications of research papers in their embeddings. In addition to evaluating state-of-the-art methods of document embedding in an interdisciplinary citation prediction task, we propose a novel Graph Neural Network architecture designed to preserve the key interdisciplinary implications of research articles in citation network node embeddings. Our proposed method outperforms other GNN-based methods in interdisciplinary citation prediction, without compromising overall citation prediction performance.
The Bridge2AI project, funded by the National Institutes of Health, involves researchers from different disciplines and backgrounds to develop well-curated AI health data and tools. Understanding cross-disciplinary and cross-organizational collaboration at the individual, team, and project levels is critical. In this paper, we matched Bridge2AI team members to the PubMed Knowledge dataset to retrieve their health-related publications. We built the collaboration network for Bridge2AI members and all of their collaborators and identified the researchers with the highest degree centrality and betweenness centrality. Our findings suggest that Bridge2AI members need to strengthen internal collaborations and boost mutual understanding within the project. We also applied machine learning methods to cluster all the researchers and labeled the publication topics in the different clusters. Finally, by identifying the gender and racial diversity of researchers, we found that teams with higher racial diversity receive more citations, and individuals with gender-diverse collaborators publish more papers.
The size of the National Aeronautics and Space Administration (NASA) Science Mission Directorate (SMD) data catalog is growing exponentially, allowing researchers to make discoveries. However, making discoveries is challenging and time-consuming due to the size of the data catalogs and because many concepts and datasets are only indirectly connected. This paper proposes a pipeline to generate knowledge graphs (KGs) representing different NASA SMD domains. These KGs can serve as the basis for dataset search engines, saving researchers time and supporting them in finding new connections. We collected textual data and used several modern natural language processing (NLP) methods to create the nodes and edges of the KGs. We explore the cross-domain connections, discuss our challenges, and provide future directions to inspire researchers working on similar challenges.
Scientific data collected in the oceanographic domain is invaluable to researchers when performing meta-analyses and examining changes over time in oceanic environments. However, many of the data samples and subsequent analyses published by researchers are not uploaded to a repository leaving the scientific paper as the only available source. Automated extraction of scientific data is, therefore, a valuable tool for such researchers. Specifically, much of the most valuable data in scientific papers are structured as tables, making these a prime target for information extraction research. Using the data relies on an additional step where the concepts mentioned in the tables, such as names of measures, units, and biological species, are identified within a domain ontology. Unfortunately, state-of-the-art table extraction leaves much to be desired and has not been attempted on a large scale on oceanographic papers. Furthermore, while entity linking in the context of a full paragraph of text has been heavily researched, it is still lacking in this harder task of linking single concepts. In this work, we present an annotated benchmark dataset of data tables from oceanographic papers. We further present the result of an evaluation on the extraction of these tables and the linking of the contained entities to the domain and general-purpose knowledge bases using the current state of the art. We highlight the challenges and quantify the performance of current tools for table extraction and table-concept linking.
Link prediction between two nodes is a critical task in graph machine learning. Most approaches are based on variants of graph neural networks (GNNs) that focus on transductive link prediction and have high inference latency. However, many real-world applications require fast inference over new nodes in inductive settings, where no connectivity information is available for these nodes. Node features therefore provide an essential alternative in this scenario. To that end, we propose Graph2Feat, which enables inductive link prediction by exploiting knowledge distillation (KD) through the Student-Teacher learning framework. In particular, Graph2Feat learns to match the representations of a lightweight student multi-layer perceptron (MLP) with those of a more expressive teacher GNN while learning to predict missing links based on the node features, thus attaining both the GNN’s expressiveness and the MLP’s fast inference. Furthermore, our approach is general: it is suitable for transductive and inductive link prediction on different types of graphs, regardless of whether they are homogeneous or heterogeneous, directed or undirected. We carry out extensive experiments on seven real-world datasets including homogeneous and heterogeneous graphs. Our experiments demonstrate that Graph2Feat significantly outperforms SOTA methods in terms of AUC and average precision on homogeneous and heterogeneous graphs. Finally, Graph2Feat has the minimum inference time compared to the SOTA methods, with a 100x speedup compared to GNNs. The code and datasets are available on GitHub.
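The student-teacher matching idea can be illustrated with a deliberately simplified sketch: a closed-form linear student stands in for the MLP, and the representation-matching objective is solved by least squares (all inputs here are illustrative placeholders, not the paper's architecture):

```python
import numpy as np

def fit_student(features, teacher_emb):
    """Fit a linear student to reproduce the teacher GNN's node embeddings.

    A closed-form least-squares stand-in for the MLP student trained with a
    representation-matching loss.
    """
    W, *_ = np.linalg.lstsq(features, teacher_emb, rcond=None)
    return W

def link_score(W, feat_u, feat_v):
    """Inductive link score from node features alone (no connectivity needed)."""
    zu, zv = feat_u @ W, feat_v @ W
    return 1.0 / (1.0 + np.exp(-(zu @ zv)))  # sigmoid of embedding dot product
```

Because scoring requires only node features and a feature transform, new nodes can be scored without running the (slower) teacher GNN, which is the source of the inference speedup.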
In order to advance academic research, it is important to assess and evaluate the academic influence of researchers and the findings they produce. Citation metrics are universally used to evaluate researchers. Amongst the several variations of citation metrics, the h-index proposed by Hirsch has become the leading measure. Recent work shows that the h-index is not an effective measure of scientific impact due to changing authorship patterns. This can be mitigated by using the h-index of a paper to compute the h-index of an author. We show that using fractional allocation of the h-index gives better results. In this work, we reapply two indices based on the h-index of a single paper, referred to as the hp-index and the hp-frac-index. We run large-scale experiments in three different fields with about a million publications and 3,000 authors. Our experiments show that the hp-frac-index provides a ranking distinct from the h-index, and that it performs better than the h-index at assigning higher ranks to awarded researchers.
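For reference, the standard h-index computation, together with one simplified reading of fractional allocation in which each paper's citation credit is divided by its author count (the paper's exact hp-index and hp-frac-index definitions follow the cited prior work, so this is only an illustrative sketch):

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for i, c in enumerate(counts, start=1) if c >= i)

def fractional_h_index(papers):
    """Hypothetical fractional variant: each paper's credit is its
    citations divided by its number of authors."""
    credited = [citations / num_authors for citations, num_authors in papers]
    return h_index(credited)
```

With single-author papers the fractional variant reduces to the plain h-index; heavily co-authored papers contribute proportionally less credit, which is the intuition behind correcting for changing authorship patterns.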
Model card reports provide a transparent description of machine learning models, including information about their evaluation, limitations, intended use, etc. Federal health agencies have expressed an interest in model card reports for research studies using machine-learning-based AI. Previously, we developed an ontology model for model card reports to structure and formalize these reports. In this paper, we demonstrate a Java-based library (OWL API, FaCT++) that leverages our ontology to publish computable model card reports. We discuss future directions and other use cases that highlight the applicability and feasibility of ontology-driven systems to support FAIR challenges.
In the present academic landscape, the process of collecting data is slow, and the lax infrastructures for data collaborations lead to significant delays in coming up with and disseminating conclusive findings. Therefore, there is an increasing need for a secure, scalable, and trustworthy data-sharing ecosystem that promotes and rewards collaborative data-sharing efforts among researchers, and a robust incentive mechanism is required to achieve this objective. Reputation-based incentives, such as the h-index, have historically played a pivotal role in the academic community. However, the h-index suffers from several limitations. This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher’s scientific contributions. Utilizing the Microsoft Academic Graph and machine learning techniques, the SCIENCE-index predicts the progress made by a researcher over their career and provides a soft incentive for sharing their datasets with peer researchers. To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter. DataCite, a database of openly available datasets, proxies this parameter, which is further enhanced by including a researcher’s data-sharing activity. Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index. We observe that it results in a much more even spread of evaluations. The SCIENCE-index is a crucial component in constructing a decentralized protocol that promotes trust-based data sharing, addressing the current inequity in dataset sharing. The work outlined in this paper provides the foundation for assessing scientific contributions in future data-sharing spaces powered by decentralized applications.
Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise present resulting from the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object detection based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object detection based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem, guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD, our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
Researchers seeking to comprehend the state-of-the-art innovations in a particular field of study must examine recent patents and scientific articles in that domain. Innovation ecosystems consist of interconnected information about entities such as researchers, institutions, projects, products, and technologies. However, representing such information in a machine-readable format is challenging because concepts like "knowledge" are not easily represented. Nonetheless, even a partial representation of innovation ecosystems provides valuable insights. Therefore, representing innovation ecosystems as knowledge graphs (KGs) would enable advanced data analysis and generate new insights. To this end, we propose InnoGraph, a framework that integrates multiple heterogeneous data sources to build a Knowledge Graph of the worldwide AI innovation ecosystem.
E-commerce features like easy cancellations, returns, and refunds can be exploited by bad actors or uninformed customers, leading to revenue loss for the organization. One such problem faced by e-commerce platforms is Return To Origin (RTO), where the user cancels an order while it is in transit for delivery. In such a scenario, the platform faces logistics and opportunity costs. Traditionally, models trained on historical trends are used to predict the propensity of an order becoming RTO. The sociology literature has highlighted clear correlations between socio-economic indicators and users’ tendency to exploit systems for financial advantage. Social media profiles contain information about location, education, and profession, which has been shown to be an estimator of socio-economic condition. We believe combining social media data with e-commerce information can lead to improvements in a variety of tasks like RTO detection, recommendation, fraud detection, and credit modeling. In our proposed system, we find the public social profile of an e-commerce user and extract socio-economic features. Internal data fused with the extracted social features is used to train an RTO detection model. Our system demonstrates a performance improvement in RTO detection of 3.1% in precision and 19.9% in recall. Our system directly impacts bottom-line revenue and shows the applicability of social re-identification in e-commerce.
We have developed a conversational assistant called the Decision Assistant (DA) to help customers make purchase decisions. To answer customer queries successfully, we use a question-answering (QnA) system that retrieves data from product pages and extracts answers. With various data sources available on the product pages, we deal with unique challenges such as different terminologies and data formats for successful answer retrieval. In this paper, we propose two different bi-encoder architectures for retrieving data from each of the two data sources considered: product descriptions and specifications. The proposed architectures beat the baseline approaches while maintaining high recall and low latency in production. We envision that the proposed approaches can be widely applicable to other e-commerce QnA systems.
Product search for online shopping should be season-aware, i.e., presenting seasonally relevant products to customers. In this paper, we propose a simple yet effective solution to improve seasonal relevance in product search by incorporating seasonality into language models for semantic matching. We first identify seasonal queries and products by analyzing implicit seasonal contexts through time-series analysis over the past year. Then we introduce explicit seasonal contexts by enhancing the query representation with a season token according to when the query is issued. A new season-enhanced BERT model (SE-BERT) is also proposed to learn the semantic similarity between the resulting seasonal queries and products. SE-BERT utilizes Multi-modal Adaption Gate (MAG) to augment the season-enhanced semantic embedding with other contextual information such as product price and review counts for robust relevance prediction. To better align with the ranking objective, a listwise loss function (neural NDCG) is used to regularize learning. Experimental results validate the effectiveness of the proposed method, which outperforms existing solutions for query-product relevance prediction in terms of NDCG and Price Weighted Purchases (PWP).
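The explicit seasonal context described above amounts to prepending a season token to the query at issue time; a minimal sketch, assuming hypothetical northern-hemisphere month-to-season buckets and token strings:

```python
from datetime import date

# Hypothetical season buckets and token strings (illustrative only).
SEASONS = {12: "[winter]", 1: "[winter]", 2: "[winter]",
           3: "[spring]", 4: "[spring]", 5: "[spring]",
           6: "[summer]", 7: "[summer]", 8: "[summer]",
           9: "[fall]", 10: "[fall]", 11: "[fall]"}

def season_enhance(query, issued_on):
    """Prepend a season token to the query text before semantic matching."""
    return f"{SEASONS[issued_on.month]} {query}"
```

The season-enhanced query string, rather than the raw query, is then fed to the semantic matching model so that seasonality can influence the learned query-product similarity.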
Commercial search engines use different semantic models to augment lexical matches. These models provide candidate items for a user’s query from a target space of millions to billions of items. Models with different inductive biases produce relatively different predictions, making it desirable to launch multiple semantic models in production. However, latency and resource constraints make simultaneously deploying multiple models impractical. In this paper, we introduce a distillation approach, called Blend and Match (BM), to unify two different semantic search models into a single model. We use a Bi-encoder semantic matching model as our primary model and propose a novel loss function to incorporate eXtreme Multi-label Classification (XMC) predictions as the secondary model. Our experiments, conducted on two large-scale datasets collected from a popular e-commerce store, show that our proposed approach significantly improves the recall of the primary Bi-encoder model by 11% to 17% with a minimal loss in precision. We show that traditional knowledge distillation approaches result in sub-optimal performance for our problem setting, and that our BM approach yields rankings comparable to strong Rank Fusion (RF) methods, which would otherwise require deploying multiple models.
Building machine learning models can be a time-consuming process that often takes several months to implement in typical business scenarios. To ensure consistent model performance and account for variations in data distribution, regular retraining is necessary. This paper introduces a solution for improving online customer service in e-commerce by presenting a universal model for predicting labels based on customer questions, without requiring training. Our novel approach involves using machine learning techniques to tag customer questions in transcripts and create a repository of questions and corresponding labels. When a customer requests assistance, an information retrieval model searches the repository for similar questions, and statistical analysis is used to predict the corresponding label. By eliminating the need for individual model training and maintenance, our approach reduces both the model development cycle and costs. The repository only requires periodic updating to maintain accuracy.
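The retrieve-and-vote idea can be sketched as follows; the string-similarity retriever and the choice of k are illustrative stand-ins for the production information retrieval model and statistical analysis:

```python
from collections import Counter
import difflib

def predict_label(question, repository, k=3):
    """Retrieve the k most similar logged questions and vote on their labels.

    `repository` is a list of (question, label) pairs mined from transcripts;
    the similarity function and k here are illustrative choices.
    """
    ranked = sorted(
        repository,
        key=lambda pair: difflib.SequenceMatcher(
            None, question.lower(), pair[0].lower()).ratio(),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because prediction is a lookup plus vote rather than a trained classifier, adding a new label only requires updating the repository, which is the maintenance advantage described above.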
Structured interviews are used in many settings, importantly in market research on topics such as brand perception, customer habits, or preferences, which are critical to product development, marketing, and e-commerce at large. Such interviews generally consist of a series of questions that are asked to a participant. These interviews are typically conducted by skilled interviewers, who interpret the responses from the participants and can adapt the interview accordingly. Using automated conversational agents to conduct such interviews would enable reaching a much larger and potentially more diverse group of participants than currently possible. However, the technical challenges involved in building such a conversational system are relatively unexplored. To learn more about these challenges, we convert a market research multiple-choice questionnaire to a conversational format and conduct a user study. We address the key task of conducting structured interviews, namely interpreting the participant’s response, for example, by matching it to one or more predefined options. Our findings can be applied to improve response interpretation for the information elicitation phase of conversational recommender systems.
Online stores in the US offer a unique scenario for Cross-Lingual Information Retrieval (CLIR) due to the mix of Spanish and English in user queries. Machine Translation (MT) provides an opportunity to lift relevance by translating Spanish queries to English before delivering them to the search engine. However, polysemy-derived problems, high latency, and context scarcity in product search make generic MT an impractical solution. The wide diversity of products in marketplaces injects non-translatable entities, loanwords, ambiguous morphemes, cross-language ambiguity, and a variety of Spanish dialects into the communication between buyers and sellers, posing a threat to the accuracy of MT. In this work, we leverage domain adaptation on a simplified Neural Machine Translation (NMT) architecture to make both latency and accuracy suitable for e-commerce search. Our NMT model is fine-tuned on a mixed-domain corpus based on engagement data, expanded with catalog back-translation techniques. Beyond accuracy, and given that translation is not the goal but the means to relevant results, the problem of Query Translatability is addressed by a classifier that decides whether the translation should be automatic or explicitly requested. We assembled these models into a query translation system that we tested and launched at Walmart.com, with a statistically significant lift in Spanish GMV and an nDCG gain of +70% for Spanish queries.
The use of pretrained embeddings has become widespread in modern e-commerce machine learning (ML) systems. In practice, however, we have encountered several key issues when using pretrained embeddings in a real-world production system, many of which cannot be fully explained by current knowledge. Unfortunately, we find that there is a lack of a thorough understanding of how pretrained embeddings work, especially their intrinsic properties and interactions with downstream tasks. Consequently, it becomes challenging to make interactive and scalable decisions regarding the use of pretrained embeddings in practice.
Our investigation leads to two significant discoveries about using pretrained embeddings in e-commerce applications. First, we find that the design of the pretraining and downstream models, particularly how they encode and decode information via embedding vectors, can have a profound impact. Second, we establish a principled perspective on pretrained embeddings through the lens of kernel analysis, which can be used to evaluate their predictability, interactively and scalably. These findings help to address the practical challenges we faced and offer valuable guidance for the successful adoption of pretrained embeddings in real-world production. Our conclusions are backed by solid theoretical reasoning, benchmark experiments, and online testing.
The main task of an e-commerce search engine is to semantically match the user query to the product inventory and retrieve the most relevant items that match the user’s intent. This task is not trivial as often there can be a mismatch between the user’s intent and the product inventory for various reasons, the most prevalent being: (i) the buyers and sellers use different vocabularies, which leads to a mismatch; (ii) the inventory doesn’t contain products that match the user’s intent. To build a successful e-commerce platform it is of paramount importance to be able to address both of these challenges. To do so, query rewriting approaches are used, which try to bridge the semantic gap between the user’s intent and the available product inventory. Such approaches use a combination of query token dropping, replacement and expansion. In this work we introduce a novel Knowledge Graph-enhanced neural query rewriting approach in the e-commerce domain. We use a relationship-rich product Knowledge Graph to infuse auxiliary knowledge into a transformer-based query rewriting deep neural network. Experiments on two tasks, query pruning and complete query rewriting, show that our proposed approach significantly outperforms a baseline BERT-based query rewriting solution.
Online testing is indispensable in decision making for information retrieval systems. Interleaving emerges as an online testing method with orders of magnitude higher sensitivity than the pervading A/B testing. It merges the compared results into a single interleaved result to show to users, and attributes user actions back to the systems being tested. However, its pairwise design also brings practical challenges to real-world systems, in terms of effectively comparing multiple (more than two) systems and interpreting the magnitude of raw interleaving measurement. We present two novel methods to address these challenges and make interleaving practically applicable. The first method infers the ordering of multiple systems based on interleaving pairwise results with false discovery control. The second method estimates A/B effect size based on interleaving results using a weighted linear model that adjusts for uncertainties of different measurements. We showcase the effectiveness of our methods in large-scale e-commerce experiments, reporting as many as 75 interleaving results, and provide extensive evaluations of their underlying assumptions.
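The second method can be illustrated with a minimal sketch: map each interleaving measurement to an A/B effect size through a single scaling factor, fit by inverse-variance weighted least squares so that noisier measurements count less. The function name, the one-parameter linear form, and the weighting scheme are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def fit_scaling(interleaving, ab_effects, variances):
    """Weighted least squares fit of ab ≈ beta * interleaving.

    Weights are 1/variance, so measurements with higher uncertainty
    contribute less to the fitted scaling factor beta.
    """
    x = np.asarray(interleaving, dtype=float)
    y = np.asarray(ab_effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    # Closed-form weighted least squares for a no-intercept linear model.
    return (w * x * y).sum() / (w * x * x).sum()
```

With equal variances this reduces to ordinary least squares; unequal variances shift the fit toward the better-measured experiments.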
Recently, there has been increasing adoption of differential-privacy-guided algorithms for privacy-preserving machine learning tasks. However, the use of such algorithms comes with trade-offs in terms of algorithmic fairness, which has been widely acknowledged. Specifically, we have empirically observed that the classical collaborative filtering method, trained by differentially private stochastic gradient descent (DP-SGD), results in a disparate impact on user groups with respect to different user engagement levels. This, in turn, causes the original unfair model to become even more biased against inactive users. To address the above issues, we propose DP-Fair, a two-stage framework for collaborative-filtering-based algorithms. Specifically, it combines differential privacy mechanisms with fairness constraints to protect user privacy while ensuring fair recommendations. The experimental results, based on Amazon datasets, and user history logs collected from Etsy, one of the largest e-commerce platforms, demonstrate that our proposed method exhibits superior performance in terms of both overall accuracy and user group fairness on both shallow and deep recommendation models compared to vanilla DP-SGD.
We introduce a Reinforcement Learning Psychotherapy AI Companion that generates topic recommendations for therapists based on patient responses. The system uses Deep Reinforcement Learning (DRL) to generate multi-objective policies for four different psychiatric conditions: anxiety, depression, schizophrenia, and suicidal cases. We present our experimental results on the accuracy of recommended topics using three different scales of working alliance ratings: task, bond, and goal. We show that the system is able to capture the real data (historical topics discussed by the therapists) relatively well, and that the best performing models vary by disorder and rating scale. To gain interpretable insights into the learned policies, we visualize policy trajectories in a 2D principal component analysis space and transition matrices. These visualizations reveal distinct patterns in the policies trained with different reward signals and trained on different clinical diagnoses. Our system’s success in generating DIsorder-Specific Multi-Objective Policies (DISMOP) and interpretable policy dynamics demonstrates the potential of DRL in providing personalized and efficient therapeutic recommendations.
An ultimate goal of recommender systems (RS) is to improve user engagement. Reinforcement learning (RL) is a promising paradigm for this goal, as it directly optimizes the overall performance of sequential recommendation. However, many existing RL-based approaches induce huge computational overhead, because they require not only the recommended items but also all other candidate items to be stored. This paper proposes an efficient alternative that does not require the candidate items. The idea is to model the correlation between user engagement and items directly from data. Moreover, the proposed approach considers randomness in user feedback and termination behavior, which are ubiquitous for RS but rarely discussed in RL-based prior work. With online A/B experiments on real-world RS, we confirm the efficacy of the proposed approach and the importance of modeling the two types of randomness.
Social networks are platforms where content creators and consumers share and consume content. The edge recommendation system, which determines who a member should connect with, significantly impacts the reach and engagement of the audience on such networks. This paper emphasizes improving the experience of inactive members (IMs) who do not have a large connection network by recommending better connections. To that end, we propose a multi-objective linear optimization framework and solve it using accelerated gradient descent. We report our findings regarding key business metrics related to user engagement on LinkedIn, a professional network with over 850 million members.
We evaluate two popular local explainability techniques, LIME and SHAP, on a movie recommendation task. We discover that the two methods behave very differently depending on the sparsity of the data set, where sparsity is defined by the amount of historical viewing data available to explain a movie recommendation for a particular data instance. We find that LIME does better than SHAP in dense segments of the data set and SHAP does better in sparse segments. We trace this difference to the differing bias-variance characteristics of the underlying estimators of LIME and SHAP. We find that SHAP exhibits lower variance in sparse segments of the data compared to LIME. We attribute this lower variance to the completeness constraint property inherent in SHAP and missing in LIME. This constraint acts as a regularizer and therefore increases the bias of the SHAP estimator but decreases its variance, leading to a favorable bias-variance trade-off especially in high sparsity data settings. With this insight, we introduce the same constraint into LIME and formulate a novel local explainability framework called Completeness-Constrained LIME (CLIME) that is superior to LIME and much faster than SHAP.
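The completeness constraint can be added to LIME-style surrogate fitting as an equality constraint on the weighted least-squares problem: the attributions must sum to a prescribed total (in SHAP, the model output minus the baseline). The sketch below solves the constrained problem via its KKT system; the function name and interface are hypothetical, not CLIME's published code.

```python
import numpy as np

def constrained_weighted_lstsq(Z, y, weights, total):
    """Weighted least squares with a completeness-style constraint.

    Minimizes sum_i weights[i] * (Z[i] @ w - y[i])**2
    subject to w.sum() == total, by solving the KKT system
    with a single Lagrange multiplier.
    """
    d = Z.shape[1]
    W = np.diag(weights)
    A = Z.T @ W @ Z
    b = Z.T @ W @ y
    # KKT system: [[A, 1], [1^T, 0]] @ [w; lam] = [b; total]
    K = np.zeros((d + 1, d + 1))
    K[:d, :d] = A
    K[:d, d] = 1.0
    K[d, :d] = 1.0
    rhs = np.append(b, total)
    sol = np.linalg.solve(K, rhs)
    return sol[:d]  # the constrained attribution vector
```

Because the constraint removes one degree of freedom, the surrogate cannot drift arbitrarily in sparse neighborhoods, which is consistent with the variance reduction the abstract describes.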
We consider the problem of Stochastic Contextual Multi-Armed Bandits (CMABs) initialised with historical data. Initialisation with historical data is an example of data-driven regularisation which should, in theory, accelerate the convergence of CMABs. However, in practice, we have little to no control over the underlying generation process of such data, which may exhibit some pathologies, possibly impeding the convergence and the stability of the algorithm. In this paper, we focus on two main challenges: selection bias and data corruption. We propose two new algorithms to solve these specific issues: LinUCB with historical data and offline balancing (OB-HLinUCB) and Robust LinUCB with corrupted historical data (R-HLinUCB). We derive their theoretical regret bounds and discuss their computational performance using real-world datasets.
Recommender systems are used to suggest items to users based on the users’ preferences. Such systems often deal with massive item sets and incredibly sparse user-item interactions, which makes it very challenging to generate high-quality personalized recommendations. Reinforcement learning (RL) is a framework for sequential decision making and naturally formulates recommender-system tasks: recommending items as actions in different user and context states to maximize long-term user experience. We investigate two RL policy parameterizations that generalize sparse user-item interactions by leveraging the relationships between actions: parameterizing the policy over action features as a softmax or Gaussian distribution. Our experiments on synthetic problems suggest that the Gaussian parameterization—which is not commonly used on recommendation tasks—is more robust to the set of action features than the softmax parameterization. Based on these promising results, we propose a more thorough investigation of the theoretical properties and empirical benefits of the Gaussian parameterization for recommender systems.
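The two parameterizations can be sketched in a few lines: the softmax policy scores each action by a state-feature dot product, while the Gaussian policy treats the state's output as a mean in action-feature space and weights each action by its density under that Gaussian. Both functions below are simplified illustrations with assumed names, not the paper's implementation.

```python
import numpy as np

def softmax_policy(state_vec, action_feats):
    """Softmax over state-action compatibility scores (one per action)."""
    logits = action_feats @ state_vec
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def gaussian_policy(state_vec, action_feats, sigma=1.0):
    """Weight each action by an isotropic Gaussian density centered at
    the state's preferred point in action-feature space."""
    d2 = ((action_feats - state_vec) ** 2).sum(axis=1)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    return p / p.sum()
```

The Gaussian form depends on distances in feature space rather than raw dot products, which is one intuition for why it may be less sensitive to the particular choice of action features.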
Notifications are a core feature of mobile applications. They inform users about a variety of events happening in their communities. Users may take immediate action to visit the app or ignore a notification depending on its timing and relevance. In this paper, we present the design, implementation, and evaluation of DeepPvisit, a novel probabilistic deep learning survival method for modeling interactions between a user visit and a mobile notification decision, targeting notification volume and delivery time optimization, and driving long-term user engagement. Offline evaluations and online A/B test experiments show DeepPvisit outperforms the existing survival regression model and the other baseline models and delivers better business metrics online.
LinkedIn is building a skill graph to power a skill-first talent marketplace. Constructing a skill graph from a flat list is not a trivial task, especially via human curation. In this paper, we leverage the pre-trained large language model BERT to achieve this through semantic understanding of synthetically generated texts used as training data. We automatically create positive and negative labels from the seed skill graph. The training data are encoded into embeddings by pre-trained language models and consumed by a downstream classification module that classifies the relationships between skill pairs.
Personalization is essential in e-commerce, with item recommendation as a critical task. In this paper, we describe a hybrid embedding-based retrieval system for real-time personalized item recommendations on Instacart. Our system addresses unique challenges in the multi-source retrieval system, and includes several key components to make it highly personalized and dynamic. Specifically, our system features a hybrid embedding model that includes a long-term user interests embedding model and a real-time session-based model, which are combined to capture users’ immediate intents and historical interactions. Additionally, we have developed a contextual bandit solution to dynamically adjust the number of candidates from each source and optimally allocate retrieval slots given a limited computational budget. Our modeling and system optimization efforts have enabled us to provide highly personalized item recommendations in real-time at scale to all our customers, including new and long-standing users.
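The contextual bandit idea for allocating retrieval slots across sources can be illustrated with a simple Beta-Bernoulli Thompson sampling sketch: sample a success rate per source, split the slot budget in proportion to the samples, and update the posteriors from engagement feedback. Class and method names are assumptions for illustration; Instacart's production solution is contextual and more elaborate.

```python
import numpy as np

class SourceAllocator:
    """Thompson sampling over retrieval sources under a fixed slot budget."""

    def __init__(self, n_sources):
        # Beta(1, 1) priors: one success/failure count pair per source.
        self.alpha = np.ones(n_sources)
        self.beta = np.ones(n_sources)

    def allocate(self, total_slots):
        """Split the slot budget in proportion to sampled success rates."""
        samples = np.random.beta(self.alpha, self.beta)
        share = samples / samples.sum()
        slots = np.floor(share * total_slots).astype(int)
        # Give any rounding remainder to the currently best-looking source.
        slots[np.argmax(samples)] += total_slots - slots.sum()
        return slots

    def update(self, source, clicks, impressions):
        """Posterior update from observed engagement on one source."""
        self.alpha[source] += clicks
        self.beta[source] += impressions - clicks
```

Over time, sources whose candidates earn more engagement receive a larger share of the limited retrieval budget, while the sampling step keeps some exploration alive.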
Digital twin technology has revolutionized the state-of-the-art practice in many industries, and digital twins have a natural application to modeling cancer patients. By simulating patients at a more fundamental level than conventional machine learning models, digital twins can provide unique insights by predicting each patient's outcome trajectory. This has numerous associated benefits, including patient-specific clinical decision-making support and the potential for large-scale virtual clinical trials. Historically, it has not been feasible to use digital twin technology to model cancer patients because of the large number of variables that impact each patient's outcome trajectory, including genotypic, phenotypic, social, and environmental factors. However, the path to digital twins in radiation oncology is becoming possible due to recent progress, such as multiscale modeling techniques that estimate patient-specific cellular, molecular, and histological distributions, and modern cryptographic techniques that enable secure and efficient centralization of patient data across multiple institutions. With these and other future scientific advances, digital twins for radiation oncology will likely become feasible. This work discusses the likely generalized architecture of patient-specific digital twins and digital twin networks, as well as the benefits, existing barriers, and potential gateways to the application of digital twin technology in radiation oncology.
SocialNLP is an inter-disciplinary area of natural language processing (NLP) and social computing. SocialNLP has three directions: (1) addressing issues in social computing using NLP techniques; (2) solving NLP problems using information from social networks or social media; and (3) handling new problems related to both social computing and natural language processing. The 11th SocialNLP workshop is held at TheWebConf 2023. We accepted nine papers with an acceptance ratio of 56%. We sincerely thank all authors, program committee members, and workshop chairs for their great contributions and help in this edition of the SocialNLP workshop.
The widespread use of social media for expressing personal thoughts and emotions makes it a valuable resource for identifying individuals at risk of suicide. Existing sequential learning-based methods have shown promising results. However, these methods may fail to capture global features. Due to their inherent ability to learn from interconnected data, graph-based methods can address this gap. In this paper, we present a new graph-based hierarchical attention network (GHAN) that uses a graph convolutional neural network with an ordinal loss to improve suicide risk identification on social media. Specifically, GHAN first captures global features by constructing three graphs that encode semantic, syntactic, and sequential contextual information. The encoded textual features are then fed to an attentive transformer encoder and optimized hierarchically with an ordinal classification layer to factor in increasing suicide risk levels for suicide risk detection. Experimental results show that the proposed GHAN outperformed state-of-the-art methods on a public Reddit dataset.
Humor is a cognitive construct that predominantly evokes the feeling of mirth. During the COVID-19 pandemic, the situations that arose out of the pandemic were so incongruous to the world we knew that even factual statements often drew humorous reactions. In this paper, we present a dataset of 2510 samples hand-annotated with labels such as humor style, type, theme, target and stereotypes formed or exploited while creating the humor, in addition to 909 memes. Our dataset comprises Reddit posts, comments, Onion news headlines, real news headlines, and tweets. We evaluate the task of humor detection and maladaptive humor detection on state-of-the-art models, namely RoBERTa and GPT-3. The finetuned models trained on our dataset show significant gains over zero-shot models, including GPT-3, when detecting humor. Even though GPT-3 is good at generating meaningful explanations, we observed that it fails to detect maladaptive humor due to the absence of overt targets and profanities. We believe that the presented dataset will be helpful in designing computational methods for topical humor processing as it provides a unique sample set to study the theory of incongruity in a post-pandemic world. The data is available to the research community at https://github.com/smritae01/Covid19_Humor.
Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. In this work, we conduct a case study approaching this problem with semi-supervised learning using a large language model (LLM). We select Reddit as the target social media platform due to its broad coverage of topics and content types. Our pipeline first generates weak financial sentiment labels for Reddit posts with an LLM and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce Chain-of-Thought summaries and forcing it through several reasoning paths helps generate more stable and accurate labels, while training the student model using a regression loss further improves distillation quality. With only a handful of prompts, the final model performs on par with existing supervised models. Though production applications of our model are limited by ethical considerations, the model’s competitive performance points to the great potential of using LLMs for tasks that otherwise require skill-intensive annotation.
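The label-generation step above pairs two ideas: self-consistency (sample several chain-of-thought runs and aggregate) and regression-loss distillation (train the student against the aggregated soft label). A minimal sketch, with hypothetical function names and a [-1, 1] sentiment scale assumed for illustration:

```python
import numpy as np

def aggregate_cot_scores(scores):
    """Self-consistency aggregation: average the sentiment scores from
    several sampled chain-of-thought runs into one soft label in [-1, 1]."""
    return float(np.clip(np.mean(scores), -1.0, 1.0))

def regression_distill_loss(student_pred, soft_label):
    """Squared-error distillation loss against the LLM's soft label,
    in the spirit of the regression loss the abstract describes."""
    return (student_pred - soft_label) ** 2
```

Averaging over reasoning paths smooths out individual unstable generations, which matches the abstract's observation that multiple paths yield more stable labels.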
The COVID-19 pandemic has had a profound impact on the global community, and vaccination has been recognized as a crucial intervention. To gain insight into public perceptions of COVID-19 vaccines, survey studies and the analysis of social media platforms have been conducted. However, existing methods lack consideration of individual vaccination intentions or status and the relationship between public perceptions and actual vaccine uptake. To address these limitations, this study proposes a text classification approach to identify tweets indicating a user’s intent or status on vaccination. A comparative analysis between the proportions of tweets from different categories and real-world vaccination data reveals notable alignment, suggesting that tweets may serve as a precursor to actual vaccination status. Further, regression analysis and time series forecasting were performed to explore the potential of tweet data, demonstrating the significance of incorporating tweet data in predicting future vaccination status. Finally, clustering was applied to the tweet sets with positive and negative labels to gain insights into the underlying focus of each stance.
Social Media (SM) has become a stage for people to share thoughts, emotions, opinions, and almost every other aspect of their daily lives. This abundance of human interaction makes SM particularly attractive for social sensing. Especially during polarizing events such as political elections or referendums, users post information and encourage others to support their side, using symbols such as hashtags to represent their attitudes. However, many users choose not to attach hashtags to their messages, use a different language, or show their position only indirectly. Thus, automatically identifying their opinions becomes a more challenging task. To uncover these implicit perspectives, we propose a collaborative filtering model based on Graph Convolutional Networks that exploits the textual content in messages and the rich connections between users and topics. Moreover, our approach only requires a small annotation effort compared to state-of-the-art solutions. Nevertheless, the proposed model achieves competitive performance in predicting individuals’ stances. We analyze users’ attitudes ahead of two constitutional referendums in Chile in 2020 and 2022. Using two large Twitter datasets, our model achieves improvements of 3.4% in recall and 3.6% in accuracy over the baselines.
Human social behaviour has been observed to adhere to certain structures. One such structure, the Ego Network Model (ENM), has been found almost ubiquitously in human society. Recently, this model has been extended to include signed connections. While the unsigned ENM has been rigorously observed for decades, the signed version is still somewhat novel and lacks the same breadth of observation. Therefore, the main aim of this paper is to examine this signed structure across various categories of individuals from a swathe of culturally distinct regions. Minor differences in the distribution of signs across the SENM can be observed between cultures. However, these can be overwhelmed when the network is centred around a specific topic. Indeed, users who are engaged with specific themes display higher levels of negativity in their networks. This effect is further supported by a significant negative correlation between the number of "general" topics discussed in a network and that network’s percentage of negative connections. These findings suggest that the negativity of communications and relationships on Twitter is highly dependent on the topics being discussed and, furthermore, that these relationships are more likely to be negative when they are based around a specific topic.
Pain is one of the most prevalent reasons for seeking medical attention in the United States. Understanding how different communities report and express pain can aid in directing medical efforts and in advancing precision pain management. Using a large-scale self-report survey data set on pain from Gallup (2.5 million surveys) and social media posts from Twitter (1.8 million tweets), we investigate a) if Twitter posts could predict community-level pain and b) how expressions of pain differ across communities in the United States. Beyond observing an improvement of over 9% (in Pearson r) when using Twitter language over demographics to predict community-level pain, our study reveals that the discourse on pain varied significantly across communities in the United States. Evangelical Hubs frequently post about God, lessons from struggle, and prayers when expressing pain, whereas Working Class Country posts about regret and extreme endurance. Academic stresses, injuries, painkillers, and surgeries were the most commonly discussed pain themes in College Towns; Graying America discussed therapy, used emotional language around empathy and anger, and posted about chronic pain treatment; the African American South posted about struggles, patience, and faith when talking about pain. Our study demonstrates the efficacy of using Twitter to predict survey-based self-reports of pain across communities and has implications in aiding community-focused pain management interventions.
The information ecosystem today is noisy, and rife with messages that contain a mix of objective claims and subjective remarks or reactions. Any automated system that intends to capture the social, cultural, or political zeitgeist, must be able to analyze the claims as well as the remarks. Due to the deluge of such messages on social media, and their tremendous power to shape our perceptions, there has never been a greater need to automate these analyses, which play a pivotal role in fact-checking, opinion mining, understanding opinion trends, and other such downstream tasks of social consequence. In this noisy ecosystem, not all claims are worth checking for veracity. Such a check-worthy claim, moreover, must be accurately distilled from subjective remarks surrounding it. Finally, and especially for understanding opinion trends, it is important to understand the stance of the remarks or reactions towards that specific claim. To this end, we introduce a COVID-19 Twitter dataset, and present a three-stage process to (i) determine whether a given Tweet is indeed check-worthy, and if so, (ii) which portion of the Tweet ought to be checked for veracity, and finally, (iii) determine the author’s stance towards the claim in that Tweet, thus introducing the novel task of topic-agnostic stance detection.
Language model pre-training has led to state-of-the-art performance in text summarization. While a variety of pre-trained transformer models are available nowadays, they are mostly trained on documents. In this study we introduce self-supervised pre-training to enhance the BERT model’s semantic and structural understanding of dialog texts from social media. We also propose a semi-supervised teacher-student learning framework to address the common issue of limited available labels in summarization datasets. We empirically evaluate our approach on extractive summarization task with the TWEETSUMM corpus, a recently introduced dialog summarization dataset from Twitter customer care conversations and demonstrate that our self-supervised pre-training and semi-supervised teacher-student learning are both beneficial in comparison to other pre-trained models. Additionally, we compare pre-training and teacher-student learning in various low data-resource settings, and find that pre-training outperforms teacher-student learning and the differences between the two are more significant when the available labels are scarce.
This half-day workshop on Cryptoasset Analytics allows researchers from different disciplines to present their newest findings related to cryptoassets. This workshop is relevant for the Web research community for two reasons. First, on a technical level, fundamental concepts of cryptoassets are increasingly integrated with Web technologies. Second, we witness the formation of socio-technical cryptoasset ecosystems, which are tightly connected to the Web. The program will feature a mix of invited talks and a selection of peer-reviewed submissions. Topics range from empirical studies, over analytics methods and tools, to case studies, datasets, and cross-cutting issues like legal or ethical aspects.
Blockchain explorers are important tools for quick look-ups of on-chain activities. However, as centralized data providers, their reliability remains under-studied. As a case study, we investigate Beaconcha.in, a leading explorer serving Ethereum’s proof-of-stake (PoS) update. According to the explorer, we find that more than 75% of slashable Byzantine actions were not slashed. Since Ethereum relies on the “stake-and-slash” mechanism to align incentives, this finding would at its face value cause concern over Ethereum’s security. However, further investigation reveals that all the apparent unslashed incidents were erroneously recorded due to the explorer’s mishandling of consensus edge cases. Besides the usual message of using caution with centralized information providers, our findings also call for attention to improving the monitoring of blockchain systems that support high-value applications.
As the decentralized finance industry gains traction, governments worldwide are creating or modifying legislation to regulate such financial activities. To avoid these new rules, decentralized finance enterprises may shop for fiscally advantageous jurisdictions. This study explores global tax evasion opportunities for decentralized finance enterprises. Opportunities are identified by considering various jurisdictions’ tax laws on cryptocurrencies along with their corporate income tax rates, corporate capital gains tax rates, level of financial development and level of cryptocurrency adoption. They are visualized with the uniform manifold approximation and projection for dimension reduction (UMAP) technique. The study results show that there exists a substantial number of tax evasion opportunities for decentralized finance enterprises through both traditional offshore jurisdictions and crypto-advantageous jurisdictions. The latter jurisdictions are usually considered high-tax fiscal regimes; however, given that they do not apply their tax laws to cryptocurrencies, tax evasion opportunities arise, especially in jurisdictions with high financial development and high cryptocurrency adoption. Further research should investigate these new opportunities and how they are evolving. Understanding the global landscape surrounding tax evasion opportunities in decentralized finance represents a first step toward preventing corporate capital flight of cryptocurrencies.
In the world of cryptocurrencies, public listing of a new token often generates significant hype, in many cases causing its price to skyrocket in a few seconds. In this scenario, timing is crucial to determine the success or failure of an investment opportunity. In this work, we present an in-depth analysis of sniper bots, automated tools designed to buy tokens as soon as they are listed on the market. We leverage GitHub open-source repositories of sniper bots to analyze their features and how they are implemented. Then, we build a dataset of Ethereum and BNB Smart Chain (BSC) liquidity pools to identify addresses that serially take advantage of sniper bots. Our findings reveal 14,029 sniping operations on Ethereum and 1,395,042 on BSC that bought tokens for a total of $10,144,808 and $18,720,447, respectively. We find that Ethereum operations have a higher success rate but require a larger investment. Finally, we analyze token smart contracts to identify mechanisms that can hinder sniper bots.
The increasing adoption of Digital Assets (DAs), such as Bitcoin (BTC), raises the need for accurate option pricing models. Yet, existing methodologies fail to cope with the volatile nature of the emerging DAs. Many models have been proposed to address the unorthodox market dynamics and frequent disruptions in the microstructure caused by the non-stationarity and peculiar statistics of DA markets. However, they are either prone to the curse of dimensionality, as additional complexity is required to employ traditional theories, or they overfit historical patterns that may never repeat. Instead, we leverage recent advances in market regime (MR) clustering with the Implied Stochastic Volatility Model (ISVM) on a very recent dataset covering BTC options on the popular trading platform Deribit. Time-regime clustering is a temporal clustering method that clusters the historic evolution of a market into different volatility periods, accounting for non-stationarity. ISVM can incorporate investor expectations in each of the sentiment-driven periods by using implied volatility (IV) data. In this paper, we apply this integrated time-regime clustering and ISVM method (termed MR-ISVM) to high-frequency data on BTC options. We demonstrate that MR-ISVM helps overcome the burden of complex adaptation to jumps in higher-order characteristics of option pricing models. This allows us to price the market based on the expectations of its participants in an adaptive fashion and put the procedure into action on a new dataset covering previously unexplored DA dynamics.
A Smart Legal Contract (SLC) is a specialized digital agreement comprising natural language and computable components. The Accord Project provides an open-source SLC framework containing three main modules: Cicero, Concerto, and Ergo. Currently, we need lawyers, programmers, and clients to work together with great effort to create a usable SLC using the Accord Project. This paper proposes a pipeline to automate the SLC creation process with several Natural Language Processing (NLP) models to convert law contracts to the Accord Project’s Concerto model. After evaluating the proposed pipeline, we discovered that our NER pipeline accurately detects CiceroMark from Accord Project template text with an accuracy of 0.8. Additionally, our Question Answering method can extract one-third of the Concerto variables from the template text. We also delve into some limitations and possible future research for the proposed pipeline. Finally, we describe a web interface enabling users to build SLCs. This interface leverages the proposed pipeline to convert text documents to Smart Legal Contracts by using NLP models.
Over recent years, large knowledge bases have been constructed to store massive knowledge graphs. However, these knowledge graphs are highly incomplete. To solve this problem, we propose a web-based question answering system with multimodal fusion of unstructured and structured information, to fill in missing information for knowledge bases. To utilize unstructured information from the Web for knowledge graph construction, we design multimodal features and question templates to extract missing facts, which can achieve good quality with very few questions. The question answering system also employs structured information from knowledge bases, such as entity types and entity-to-entity relatedness, to help improve extraction quality. To improve system efficiency, we utilize a few query-driven techniques for web-based question answering to reduce the runtime and provide fast responses to user queries. Extensive experiments have been conducted to demonstrate the effectiveness and efficiency of our system.
Ontologies conceptualize domains and are a crucial part of web semantics and information systems. However, re-using an existing ontology for a new task requires a detailed evaluation of the candidate ontology as it may cover only a subset of the domain concepts, contain information that is redundant or misleading, and have inaccurate relations and hierarchies between concepts. Manual evaluation of large and complex ontologies is a tedious task. Thus, a few approaches have been proposed for automated evaluation, ranging from concept coverage to ontology generation from a corpus. Existing approaches, however, are limited by their dependence on external structured knowledge sources, such as a thesaurus, as well as by their inability to evaluate semantic relationships. In this paper, we propose a novel framework to automatically evaluate the domain coverage and semantic correctness of existing ontologies based on domain information derived from text. The approach uses a domain-tuned named-entity-recognition model to extract phrasal concepts. The extracted concepts are then used as a representation of the domain against which we evaluate the candidate ontology’s concepts. We further employ a domain-tuned language model to determine the semantic correctness of the candidate ontology’s relations. We demonstrate our automated approach on several large ontologies from the oceanographic domain and show its agreement with a manual evaluation by domain experts and its superiority over the state-of-the-art.
Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG) followed by reasoning using Graph Neural Networks (GNNs). This faces two major challenges: (i) it is difficult to capture all the information from the QA in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements on the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with corresponding latent representations from the LM, and (ii) Context-Aware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate our results on OpenBookQA, CommonsenseQA and MedQA-USMLE and see that GrapeQA shows consistent improvements over its LM + KG predecessor (QA-GNN in particular) and large improvements on OpenBookQA.
Large Language Models (LLMs), with their advanced architectures and training on massive language datasets, contain unexplored knowledge. One method to infer this knowledge is through the use of cloze-style prompts. Typically, these prompts are manually designed because the phrasing of these prompts impacts the knowledge retrieval performance, even if the LLM encodes the desired information. In this paper, we study the impact of prompt syntax on the knowledge retrieval capacity of LLMs. We use a template-based approach to paraphrase simple prompts into prompts with a more complex grammatical structure. We then analyse the LLM performance for these structurally different but semantically equivalent prompts. Our study reveals that simple prompts work better than complex forms of sentences. The performance across the syntactical variations for simple relations (1:1) remains best, with a marginal decrease across different typologies. These results reinforce that simple prompt structures are more effective for knowledge retrieval in LLMs and motivate future research into the impact of prompt syntax on various tasks.
AI-based systems, especially those based on machine learning technologies, have become central in modern societies. Meanwhile, users and legislators are becoming aware of privacy issues. Users are increasingly reluctant to share their sensitive information, and new laws have been enacted to regulate how private data is handled (e.g., the GDPR).
Federated Learning (FL) has been proposed to develop better AI systems without compromising users' privacy and the legitimate interests of private companies. Although still in its infancy, FL has already shown significant theoretical and practical results, making it one of the hottest topics in the machine learning community.
Given the considerable potential in overcoming the challenges of protecting users’ privacy while making the most of available data, we propose a workshop on Federated Learning Technologies (FLT) at TheWebConf 2023.
The goal of this workshop is to focus the attention of the TheWebConf research community on addressing the open questions and challenges in this thriving research area. Given the broad range of competencies in the TheWebConf community, the workshop will welcome foundational contributions as well as contributions expanding the scope of these techniques, such as improvements in the interpretability and fairness of the learned models.
The metaverse, still at the stage of innovation and exploration, faces both the dilemma of data collection and the problem of private data leakage, which can seriously hinder its widespread deployment. Fortunately, federated learning (FL) offers a solution to these problems. FL is a distributed machine learning paradigm with privacy-preserving features designed for large numbers of edge devices. Federated learning for the metaverse (FL4M) will therefore be a powerful tool, because FL allows edge devices to participate in training tasks locally using their own data, computational power, and model-building capabilities. Applying FL to the metaverse not only protects the data privacy of participants but also reduces the need for high computing power and large memory on servers. To date, there have been many studies on FL and on the metaverse, respectively. In this paper, we review some of the early advances of FL4M, which will be a research direction with unlimited development potential. We first introduce the concepts of the metaverse and FL. We then discuss in detail the convergence of FL with key metaverse technologies, such as big data, communication technology, the Internet of Things, edge computing, blockchain, and extended reality. Finally, we discuss some key challenges and promising directions of FL4M. In summary, we hope that our up-to-date brief survey can help people better understand FL4M and build a fair, open, and secure metaverse.
Federated Learning (FL), also known as collaborative learning, is a distributed machine learning approach that collaboratively learns a shared prediction model without explicitly sharing private data. When dealing with sensitive data, privacy measures need to be carefully considered. Optimizers play a major role in accelerating the learning process given the high dimensionality and non-convexity of the search space. The data partitioning in FL can be assumed to be either IID (independent and identically distributed) or non-IID. In this paper, we experiment with the impact of applying different adaptive optimization methods to FL frameworks in both IID and non-IID setups. We analyze the effects of label and quantity skewness, learning rate, and local client training on the learning process of the optimizers as well as on the overall performance of the global model. We evaluate the FL hyperparameter settings on biomedical text classification tasks on two datasets: ADE V2 (Adverse Drug Effect, 2 classes) and Clinical-Trials (reasons to stop trials, 17 classes).
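The interaction between local client training and a server-side adaptive optimizer can be sketched as follows. This is an illustrative FedAvg-style round with an Adam-like server update (in the spirit of FedAdam); the function names, hyperparameters, and plain-list weight representation are our own simplifications, not the paper's code:

```python
def client_update(w, grads, lr=0.01):
    """Simulated local SGD on one client: apply each local gradient step."""
    for g in grads:
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def fedadam_round(global_w, client_ws, m, v, server_lr=0.1,
                  beta1=0.9, beta2=0.99, eps=1e-3):
    """One server round: average the client deltas into a pseudo-gradient,
    then apply an Adam-style adaptive update to the global model."""
    n = len(client_ws)
    delta = [sum(cw[i] - global_w[i] for cw in client_ws) / n
             for i in range(len(global_w))]
    m = [beta1 * mi + (1 - beta1) * di for mi, di in zip(m, delta)]
    v = [beta2 * vi + (1 - beta2) * di * di for vi, di in zip(v, delta)]
    new_w = [wi + server_lr * mi / (vi ** 0.5 + eps)
             for wi, mi, vi in zip(global_w, m, v)]
    return new_w, m, v
```

In a non-IID setup, clients' deltas disagree more, which is exactly where the choice of adaptive server optimizer starts to matter.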
Trustworthy artificial intelligence (AI) technology has revolutionized daily life and greatly benefited human society. Among various AI technologies, Federated Learning (FL) stands out as a promising solution for diverse real-world scenarios, ranging from risk evaluation systems in finance to cutting-edge technologies like drug discovery in life sciences. However, challenges around data isolation and privacy threaten the trustworthiness of FL systems. Adversarial attacks against data privacy, learning algorithm stability, and system confidentiality are particularly concerning in the context of distributed training in federated learning. Therefore, it is crucial to develop FL in a trustworthy manner, with a focus on robustness and privacy. In this survey, we propose a comprehensive roadmap for developing trustworthy FL systems and summarize existing efforts from two key aspects: robustness and privacy. We outline the threats that pose vulnerabilities to trustworthy federated learning across different stages of development, including data processing, model training, and deployment. To guide the selection of the most appropriate defense methods, we discuss specific technical solutions for realizing each aspect of Trustworthy FL (TFL). Our approach differs from previous work that primarily discusses TFL from a legal perspective or presents FL from a high-level, non-technical viewpoint.
Aggregating pharmaceutical data in the drug-target interaction (DTI) domain can potentially deliver life-saving breakthroughs. It is, however, notoriously difficult due to regulatory constraints and commercial interests [5, 18]. This work proposes the application of federated learning, which is reconcilable with the industry's constraints: it does not require sharing any information that would reveal the entities' data, nor any high-level summary of it. When used on a representative GraphDTA model and the KIBA dataset, it achieves up to 15% improved performance relative to the best available non-privacy-preserving alternative. Our extensive battery of experiments shows that, unlike in other domains, the non-IID data distribution in the DTI datasets does not deteriorate FL performance. Additionally, we identify a material trade-off between the benefits of adding new data and the cost of adding more clients.
The TempWeb workshop series has a long-standing tradition as a co-located event at The Web Conference. Its main objective is to provide a venue for researchers of all domains (IE/IR, Web mining, etc.) in which the temporal dimension opens an entirely new range of challenges and possibilities. To this end, research published at TempWeb covers (and has covered in its previous editions) the full spectrum of longitudinal studies using web content. This ranges from “low-level” network analysis (e.g., Internet traffic) and graph analysis (e.g., in social media) up to “high-level” content and concept analysis, from micro-content (e.g., tweets) up to entire web archives. As such, TempWeb has become a potential object of longitudinal studies in its own right. Aiming at the investigation of infrastructures, scalable methods, and innovative software for aggregating, querying, and analyzing heterogeneous data at Web scale, TempWeb has developed into a forum for a community from science and industry. Since longitudinal aspects of web content analysis are becoming more and more relevant for analysts from various domains, the studies, tools, and demonstrations are not only relevant for computer science, but also potentially interesting for sociology, marketing, environmental studies, politics, etc. Keeping up with its “tradition”, the current edition again covers the full spectrum of temporal Web analytics at various levels of granularity.
New web content is published constantly, and although protocols such as RSS can notify subscribers of new pages, they are not always implemented or actively maintained. A more reliable way to discover new content is to periodically re-crawl the target sites. Designing such “content discovery crawlers” has important applications, for example, in web search, digital assistants, business, humanitarian aid, and law enforcement. Existing approaches assume that each site of interest has a relatively small set of unknown “source pages” that, when refreshed, frequently provide hyperlinks to the majority of new content. The state of the art (SOTA) uses ideas from the multi-armed bandit literature to explore candidate sources while simultaneously exploiting known good sources. We observe, however, that the SOTA uses a sub-optimal algorithm for balancing exploration and exploitation. We trace this back to a mismatch between the space of actions that the SOTA algorithm models and the space of actions that the crawler must actually choose from. Our proposed approach, the Thompson crawler (named after the Thompson sampler that drives its refresh decisions), addresses this shortcoming by more faithfully modeling the action space. On a dataset of 4,070 source pages drawn from 53 news domains over a period of 7 weeks, we show that, on average, the Thompson crawler discovers 20% more new pages, finds pages 6 hours earlier, and uses 14 fewer refreshes per 100 pages discovered than the SOTA.
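The exploration/exploitation trade-off behind such refresh decisions can be illustrated with a basic Beta-Bernoulli Thompson sampler. This is only a minimal sketch of the general idea; the paper's contribution is precisely a richer, more faithful model of the crawler's action space than this one:

```python
import random

class ThompsonCrawler:
    """Pick which candidate source page to refresh next by Thompson
    sampling a Beta posterior over each page's probability of yielding
    new content on refresh (illustrative sketch only)."""

    def __init__(self, pages):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per page.
        self.stats = {p: [1, 1] for p in pages}

    def choose(self):
        # Sample a plausible yield rate for each page; refresh the argmax.
        draws = {p: random.betavariate(a, b) for p, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, page, found_new_links):
        # The refresh either surfaced new content (success) or it did not.
        a, b = self.stats[page]
        self.stats[page] = [a + 1, b] if found_new_links else [a, b + 1]
```

Sampling from the posterior, rather than always refreshing the empirically best page, keeps some refresh budget flowing to under-explored candidate sources.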
The COVID-19 pandemic has had a significant impact on human behaviors, and how it influenced people's interest in cultural products remains an unsolved problem. While prior studies mostly adopt subjective surveys to find an answer, such methods suffer from high cost, limited sample size, and subjective bias. Inspired by the rich user-oriented data available on the Internet, this work explores the possibility of leveraging users' search logs to reflect their underlying interest in cultural products. To further examine how COVID-19 mobility policies might have influenced changes in cultural interest, we propose a new regression discontinuity design that also has the potential to predict the recovery phase of people's interest in cultural products. By analyzing 1592 search interest time series in 6 countries, we found different patterns of change in interest in movies, music, and art during the COVID-19 pandemic, but a clear overall increase. Across the six countries we studied, changes in interest in cultural products were strongly correlated with mobility: as mobility declined, interest in movies, music, and art increased by an average of 35, 27 and 20, respectively, with these changes lasting at least eight weeks.
Searching for local communities is an important research challenge that allows for personalized community discovery and supports advanced data analysis in various complex networks, such as the World Wide Web, social networks, and brain networks. The evolution of these networks over time has motivated several recent studies to identify local communities in temporal networks. Given a set of query nodes, community search aims to find a densely connected subgraph containing those nodes. However, existing community search approaches in temporal networks have two main limitations: (1) they adopt pre-defined subgraph patterns to model communities, and thus cannot find communities that do not conform to these patterns in real-world networks, and (2) they only aggregate disjoint structural information to measure quality, missing the dynamics of connections and temporal properties. In this paper, we propose a query-driven Temporal Graph Convolutional Network (CS-TGN) that can capture flexible community structures by learning from ground-truth communities in a data-driven manner. CS-TGN first combines the local query-dependent structure and the global graph embedding in each snapshot of the network and then uses a GRU cell with contextual attention to learn the dynamics of interactions and update node embeddings over time. We demonstrate how this model can be used for interactive community search in an online setting, allowing users to evaluate the found communities and provide feedback. Experiments on real-world temporal graphs with ground-truth communities validate the superior quality of the solutions obtained and the efficiency of our model in both temporal and interactive static settings.
Timeline summarization (TLS) is a challenging research task that requires distilling extensive and intricate temporal data into a concise, easily comprehensible representation. This paper proposes a novel approach to timeline summarization using Abstract Meaning Representations (AMRs), a graphical representation of text in which nodes are semantic concepts and edges denote relationships between concepts. With AMR, sentences with different wordings but similar semantics have similar representations. To exploit this feature for timeline summarization, a two-step sentence selection method that leverages features extracted from both the AMRs and the text is proposed. First, an AMR is generated for each sentence. Sentences with no named entities are then filtered out, and those with the highest number of named entities are kept. In the second step, the sentences to appear in the timeline are selected based on two scores: the Inverse Document Frequency (IDF) of AMR nodes, combined with the score obtained by applying a keyword extraction method to the text. Our experimental results on the TLS-Covid19 test collection demonstrate the potential of the proposed approach.
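The second selection step can be sketched as follows, assuming each sentence has already been reduced to a bag of AMR concept labels. The helper names and the mixing weight are illustrative choices, not the paper's exact scoring:

```python
import math

def idf_scores(concept_sets):
    """IDF of each AMR concept across the collection of sentences."""
    n = len(concept_sets)
    df = {}
    for concepts in concept_sets:
        for c in set(concepts):
            df[c] = df.get(c, 0) + 1
    return {c: math.log(n / d) for c, d in df.items()}

def sentence_score(concepts, idf, keyword_score, weight=0.5):
    """Combine the mean IDF of a sentence's AMR nodes with a
    text-based keyword score (weight is an illustrative mixing knob)."""
    amr_part = sum(idf.get(c, 0.0) for c in concepts) / max(len(concepts), 1)
    return amr_part + weight * keyword_score
```

Because semantically similar sentences share AMR concepts regardless of wording, the IDF term rewards sentences carrying concepts that are rare across the collection.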
This paper presents a qualitative analysis of the recoverability of various webpages on the live web, using their archived counterparts as a baseline. We used a heterogeneous dataset consisting of four web archive collections, each with varying degrees of content drift. We were able to recover a small number of webpages previously thought to have been lost and analyzed their content and evolution. Our analysis yielded three types of lost webpages: 1) those that are not recoverable (with three subtypes), 2) those that are fully recoverable, and 3) those that are partially recoverable. The analysis presented here attempts to establish clear definitions and boundaries between the different degrees of webpage recoverability. By using a few simple methods, web archivists could discover the new locations of web content previously deemed lost and include it in future crawling efforts, leading to more complete web archives with less content drift.
Influence campaigns pose a threat to fact-based reasoning, erode trust in institutions, and tear at the fabric of our society. In the 21st century, influence campaigns have rapidly evolved, taking on new online identities. Many of these propaganda campaigns are persistent and well-resourced, making their identification and removal both hard and expensive. Social media companies have predominantly aimed to counter the threat of online propaganda by prioritizing the moderation of "coordinated inauthentic behavior". This strategy focuses on identifying orchestrated campaigns explicitly intended to deceive, rather than individual social media accounts or posts. In this paper, we study the Twitter footprint of a multi-year influence campaign linked to the Russian government. Drawing from the influence model, a generative model that describes the interactions between networked Markov chains, we demonstrate how temporal correlations in the sequential decision processes of individual social media accounts can reveal coordinated inauthentic activity.
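As a crude illustration of the underlying intuition, that coordinated accounts leave temporal correlations in one another's activity, one can compute lagged correlations between per-account activity series. The influence model itself is a richer generative model over networked Markov chains; this sketch only captures the pairwise, fixed-lag version of the idea:

```python
def lagged_corr(x, y, lag=1):
    """Pearson correlation between one account's activity series and a
    `lag`-shifted copy of another's; persistently high values across many
    account pairs hint at coordination (a crude proxy for the influence
    model's temporal-correlation signal)."""
    xs, ys = x[:len(x) - lag], y[lag:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0
```

An account that consistently echoes another account's posting pattern one step later scores near 1.0, whereas organic, independent activity washes out toward 0.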
Today, coding skills are among the most sought-after competencies worldwide, often also for non-computer scientists. Because of this trend, community contribution-based question-and-answer (Q&A) platforms have become prominent venues for finding solutions to programming problems. Stack Overflow has been the most popular platform for technical questions for years; recently, however, some programming-related subreddits of Reddit have become a cornerstone for questions and discussions. This work investigates developers' behavior and community formation around the twenty most popular programming languages. We examined two consecutive years of programming-related questions from Stack Overflow and Reddit, performing a longitudinal study of users' posting activity and their higher-order interaction patterns abstracted via hypergraphs. Our analysis highlighted crucial differences in how these Q&A platforms are used. In line with previous literature, it emphasized the steady decline of Stack Overflow in favor of more community-friendly platforms, such as Reddit, which has been growing rapidly lately.
As a growing number of policies are adopted to address the substantial rise in urbanization, there is a significant push for smart governance, bringing transparency to decision-making and enabling greater public involvement. The thriving concept of smart governance goes beyond cities, ultimately aiming at a smart planet. Ordinances (local laws) affect our lives with regard to health, business, and more. This is particularly notable during major events such as the recent pandemic, which may lead to rapid changes in ordinances pertaining, for instance, to public safety, disaster management, and recovery phases. However, many citizens view ordinances as impervious and complex. This position paper proposes a research agenda enabling novel forms of ordinance content analysis over time and temporal web question answering (QA) for both legislators and the broader public. Alongside this, we aim to analyze social media posts to track public opinion before and after the introduction of ordinances. Challenges include addressing concepts that change over time and infusing subtle human reasoning into mining, which we aim to address by harnessing terminology evolution methods and commonsense knowledge sources, respectively. We aim to make the results of the historical ordinance mining and event-driven analysis seamlessly accessible, relying on a robust semantic understanding framework to flexibly support web QA.
This paper shares our observations based on our three-year experience organizing the FinWeb workshop series. In addition to the widely discussed topic of content analysis, we note two tendencies in FinTech applications: customers' behavior analysis and finance-oriented LegalTech. We also briefly share our ideas on research directions toward a reliable and trustworthy FinWeb from the investment perspective.
Existing financial datasets are mostly composed of official documents, statements, news articles, and so forth; so far, little attention has been paid to the numerals in financial social media comments. Therefore, this paper presents CFinNumAttr, a Chinese financial numeral attribute dataset built by annotating stock reviews and comments collected from social networking platforms. We conduct several experiments on CFinNumAttr with state-of-the-art methods to assess the importance of financial numeral attributes. The experimental results on CFinNumAttr show that the numeral attributes in social reviews and comments carry rich semantic information, and that the numeral clue extraction and attribute classification tasks can greatly improve financial text understanding.
Towards the intelligent understanding of table-text data in the finance domain, previous research explores numerical reasoning over table-text content with Question Answering (QA) tasks. A general framework is to extract supporting evidence from the table and text and then perform numerical reasoning over extracted evidence for inferring the answer. However, existing models are vulnerable to missing supporting evidence, which limits their performance. In this work, we propose a novel Semantic-Oriented Hierarchical Graph (SoarGraph) that models the semantic relationships and dependencies among the different elements (e.g., question, table cells, text paragraphs, quantities, and dates) using hierarchical graphs to facilitate supporting evidence extraction and enhance numerical reasoning capability. We conduct our experiments on two popular benchmarks, FinQA and TAT-QA datasets, and the results show that our SoarGraph significantly outperforms all strong baselines, demonstrating remarkable effectiveness.
During the last decades, Anti-Financial Crime (AFC) entities and financial institutions have put constantly increasing effort into reducing financial crime and detecting fraudulent activities, which are changing and developing in extremely complex ways. We propose an anomaly detection approach based on network analysis to help AFC officers navigate the high load of information that is typical of AFC data-driven scenarios. Experimenting on a large financial dataset of more than 80M cross-country wire transfers, we leverage the properties of complex networks to develop a tool for explainable anomaly detection that can help identify outliers potentially engaged in malicious activities according to financial regulations. We identify a set of network metrics that provide useful insights on individual nodes; by tracking the evolution over time of the metric-based node rankings, we are able to highlight sudden and unexpected changes in the roles of individual nodes that deserve further attention from AFC officers. Such changes can hardly be noticed with current AFC practices, which sometimes lack a higher-level, global vision of the system. This approach represents a preliminary step in the automation of AFC and AML processes, facilitating the work of AFC officers by providing them with a top-down view of the picture emerging from financial data.
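The rank-tracking idea can be sketched as follows, assuming per-snapshot node metrics (e.g., degree or a centrality score) are given as dictionaries. The shift threshold is an illustrative knob, not the paper's rule:

```python
def rank_nodes(metric):
    """Rank node ids by a network metric, highest value first."""
    return sorted(metric, key=metric.get, reverse=True)

def rank_shifts(prev_metric, curr_metric, threshold=3):
    """Flag nodes whose metric-based rank jumped by at least `threshold`
    positions between two consecutive snapshots; a positive shift means
    the node climbed the ranking (illustrative anomaly rule)."""
    prev = {n: r for r, n in enumerate(rank_nodes(prev_metric))}
    curr = {n: r for r, n in enumerate(rank_nodes(curr_metric))}
    return {n: prev[n] - curr[n] for n in curr
            if n in prev and abs(prev[n] - curr[n]) >= threshold}
```

A node that suddenly leaps from the bottom of a centrality ranking to the top between snapshots is exactly the kind of role change an AFC officer would want surfaced.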
Aspect-based summarization of legal case files related to regulating bodies allows different stakeholders to efficiently consume the information of interest therein. In this paper, we propose a multi-step process to achieve this. First, we explore the semantic sentence segmentation of SEBI case files via classification. We also propose a dataset of Indian legal adjudicating orders tagged with carefully crafted domain-specific sentence categories, developed with the help of legal experts. We experiment with various machine learning and deep learning methods for this multi-class classification. Then, we examine the performance of numerous summarization methods on the segmented documents to generate persona-specific summaries. Finally, we develop a pipeline that uses the best methods in both sub-tasks to achieve high recall.
Repo (repurchase agreement) trading provides easy access to short-term financing secured by a pledge of collateral and plays an important role in the global financial system. However, repo traders face many tough challenges in their job, from managing complex financial transactions to keeping up with changing market trends and regulations. Besides the difficult and tedious processes that consume much time and energy, repo traders must stay up to date with the various laws, regulations, and financial trends that may affect their job, all while being exposed to a variety of market risks. As a leader in the FinTech industry, Ant Group launched a new initiative to alleviate the burden on repo traders at MyBank. By leveraging many existing platform technologies, such as AI ChatBot and forecasting platforms, and through the collective work of various engineering groups, we are able to create a ChatBot that communicates with human traders in natural language and creates electronic contracts based on the negotiated terms, equipped with trading strategies informed by forecasting results. The fully automatic workflow not only frees our traders from tedious routines but also reduces potential human errors. At the same time, it enables refined portfolio and risk management, while opening up the possibility of applying neural network-based trading strategies and yielding greater returns compared to the traditional workflow reliant on human experience. Our system has evolved beyond providing services to our own traders and is now a fully commercialized product covering other types of interbank trading.
Deep learning and the Web of Things (WoT) have become powerful tools for web engineering, leading to increased investigation and publication of research related to deep learning in web engineering. Therefore, this workshop is titled "The 3rd International Workshop on Deep Learning for the Web of Things" for the Web Conference 2023 (WWW’23). This workshop features five research articles: (1) "Multiple-Agent Deep Reinforcement Learning for Avatar Migration in Vehicular Metaverses", (2) "Web 3.0: Future of the Internet", (3) "Weighted Statistically Significant Pattern Mining", (4) "DSNet: Efficient Lightweight Model for Video Salient Object Detection for IoT and WoT Applications", and (5) "The Human-Centric Metaverse: A Survey".
Vehicular Metaverses are widely considered the next Internet revolution: a 3D virtual world with immersive virtual-real interaction for passengers and drivers. In vehicular Metaverse applications, avatars are digital representations of on-board users that obtain and manage immersive vehicular services (i.e., avatar tasks) in Metaverses and the data they generate. However, traditional Internet of Vehicles (IoV) data management solutions carry serious data security and privacy risks. Fortunately, blockchain-based Web 3.0 enables avatars to have an ownership identity, so that the data owned by users can be managed securely in a decentralized and transparent manner. To ensure users' immersive experiences and securely manage their data, avatar tasks often require significant computing resources. It is therefore impractical for vehicles to process avatar tasks locally; massive computational resources are needed to support them. To this end, offloading avatar tasks to nearby RoadSide Units (RSUs) is a promising solution to avoid computation overload. To ensure real-time and continuous Metaverse services, avatar tasks should be migrated among RSUs as the vehicle moves, yet it is challenging for vehicles to independently decide whether or not to migrate according to current and future avatar states. Therefore, in this paper, we propose a new avatar task migration framework for vehicular Metaverses. We formulate the avatar task migration problem as a Partially Observable Markov Decision Process (POMDP) and apply a Multi-Agent Deep Reinforcement Learning (MADRL) algorithm to dynamically make migration decisions for avatar tasks. Numerical results show that our proposed algorithm outperforms existing baselines for avatar task migration and enables immersive vehicular Metaverse services.
With the rapid growth of the Internet, human daily life has become deeply bound to it. To take advantage of the massive amounts of data and information on the Internet, the Web architecture is continuously being reinvented and upgraded. From the static, informative character of Web 1.0 to the dynamic, interactive features of Web 2.0, scholars and engineers have worked hard to make the Internet world more open, inclusive, and equal. Indeed, the next generation of Web evolution (i.e., Web 3.0) is already arriving and shaping our lives. Web 3.0 is a decentralized Web architecture that is more intelligent and safer than its predecessors. By completely reconstructing the Internet and IT infrastructure, the risks posed by monopolists or criminals will be greatly reduced. In short, Web 3.0 addresses web data ownership through distributed technology. It will optimize the Internet world from economic, cultural, and technological perspectives, and promote novel content production methods, organizational structures, and economic forms. However, Web 3.0 is not yet mature and remains disputed. Herein, this paper presents a comprehensive survey of Web 3.0, with a focus on current technologies, challenges, opportunities, and outlook. We first give a brief overview of the history of the World Wide Web as well as several differences among Web 1.0, Web 2.0, Web 3.0, and Web3. Then, some technical implementations of Web 3.0 are illustrated in detail, and we discuss the revolution and benefits that Web 3.0 brings. Finally, we explore several challenges and open issues in this promising area.
Pattern discovery (aka pattern mining) is a fundamental task in data science. Statistically significant pattern mining (SSPM) is the task of finding useful patterns that occur statistically more often in databases of one class than of another. Existing SSPM formulations do not consider the weight of each item, yet in the real world different items/objects vary in significance. Therefore, in this paper, we introduce the Weighted Statistically Significant Pattern Mining (WSSPM) problem and propose a novel WSSpm algorithm to solve it. We present a new framework that effectively mines weighted statistically significant patterns by combining a weighted upper-bound model with multiple hypothesis testing. We also propose a new weighted support threshold that satisfies the requirements of WSSPM and prove its correctness and completeness. Moreover, our weighted support threshold and modified weighted upper bound effectively shrink the search space. Finally, experimental results on several real datasets show that the WSSpm algorithm performs well in terms of execution time and memory usage.
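As an illustration of the weighting idea, one common formulation from the weighted pattern mining literature scales a pattern’s plain support by the mean weight of its items. The `weighted_support` function, the transactions, and the item weights below are all invented toy values, and the paper’s exact definition may differ:

```python
# Hypothetical sketch: weighted support of an itemset, assuming the common
# formulation wsup(X) = sup(X) * mean(weight of items in X).

def support(pattern, transactions):
    """Fraction of transactions containing every item in `pattern`."""
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)

def weighted_support(pattern, transactions, weights):
    """Plain support scaled by the mean item weight of the pattern."""
    mean_w = sum(weights[i] for i in pattern) / len(pattern)
    return support(pattern, transactions) * mean_w

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
weights = {"a": 0.9, "b": 0.4, "c": 0.6}

# {a, b} appears in 2 of 4 transactions; its mean item weight is 0.65.
print(round(weighted_support({"a", "b"}, transactions, weights), 3))
```

A weighted support threshold would then prune any pattern whose weighted support falls below a chosen bound before significance testing.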
The most challenging aspects of deploying deep models in IoT and embedded systems are their extensive computational complexity and long training and inference times. Although various lightweight versions of state-of-the-art models are being designed, maintaining the performance of such models is difficult. To overcome these problems, an efficient, lightweight Deformable Separable Network (DSNet) is proposed for video salient object detection, aimed mainly at mobile and embedded vision applications. DSNet is equipped with a Deformable Convolution Network (DeCNet), a Separable Convolution Network (SCNet), and a Depth-wise Attention Response Propagation (DARP) module, which together let it maintain the trade-off between accuracy and latency. The proposed model generates saliency maps considering background and foreground simultaneously, making it perform better in unconstrained scenarios (such as partial occlusion, deformable backgrounds/objects, and illumination effects). Extensive experiments conducted on six benchmark datasets demonstrate that the proposed model outperforms state-of-the-art approaches in terms of computational complexity, number of parameters, and latency.
In the era of the Web of Things, the Metaverse is expected to be the landing site for the next generation of the Internet; related technologies and applications have grown in popularity in recent years and have gradually become a focus of Internet research. The Metaverse, as a link between the real and virtual worlds, can provide users with immersive experiences. As the concept grows in popularity, many scholars and developers have begun to focus on the Metaverse’s ethics and core values. This paper argues that the Metaverse should be centered on humans; that is, humans constitute the main body of the Metaverse. Accordingly, we begin by introducing the Metaverse’s origins, characteristics, and related technologies, as well as the concept of the human-centric Metaverse (HCM). Second, we discuss how human-centricity manifests in the Metaverse. Finally, we discuss current issues in the construction of the HCM. Throughout, we provide a detailed review of the applications of human-centric technologies in the Metaverse, along with relevant HCM application scenarios. We hope that this paper provides researchers and developers with directions and ideas for human-centric Metaverse construction.
White supremacist extremist groups are a significant domestic terror threat in many Western nations. These groups harness the Internet to spread their ideology via online platforms: blogs, chat rooms, forums, and social media, which can inspire violence offline. In this work, we study the persistence and reach of white supremacist propaganda in both online and offline environments. We also study patterns in narratives that cross over from online to offline environments, or vice versa. From a geospatial analysis, we find that offline propaganda is geographically widespread in the United States, with a slight tendency toward Northeastern states. Propaganda that spreads the farthest and lasts the longest has a patriotic framing and is short, memorable, and repeatable. Through text comparison methods, we show that online propaganda typically precedes the appearance of the same propaganda in offline flyers, banners, and graffiti. We hope that this study sheds light on the characteristics of persistent white supremacist narratives both online and offline.
Nowadays, false and unverified information on social media sways individuals’ perceptions during major geopolitical events and threatens the quality of the whole digital information ecosystem. Since the Russian invasion of Ukraine, several fact-checking organizations have been actively involved in verifying stories related to the conflict that circulated online. In this paper, we leverage a public repository of fact-checked claims to build a methodological framework for automatically identifying false and unsubstantiated claims spreading on Twitter in February 2022. Our framework consists of two sequential models: first, the claim detection model identifies whether tweets incorporate a (false) claim among those considered in our collection; then, the claim retrieval model matches the tweets with fact-checked information by ranking verified claims according to their relevance to the input tweet. Both models are based on pre-trained language models, fine-tuned for a text classification task and an information retrieval task, respectively. To validate the effectiveness of our methodology, we consider 83 verified false claims that spread on Twitter during the first week of the invasion and manually annotate 5,872 tweets according to the claim(s) they report. Our experiments show that the proposed methodology outperforms standard baselines for both claim detection and claim retrieval. Overall, our results highlight how social media providers could effectively leverage semi-automated approaches to identify, track, and eventually moderate false information that spreads on their platforms.
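The retrieval stage can be sketched with a toy encoder. The paper fine-tunes pre-trained language models, whereas this minimal example substitutes a bag-of-words cosine similarity; the claims and the tweet are invented examples:

```python
# Sketch of claim retrieval: rank fact-checked claims by similarity to an
# input tweet. A bag-of-words cosine stands in for the learned encoder.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(tweet, verified_claims):
    """Return verified claims sorted by relevance to the tweet."""
    scored = [(cosine(embed(tweet), embed(c)), c) for c in verified_claims]
    return [c for s, c in sorted(scored, reverse=True)]

claims = [
    "ghost of kyiv shot down six planes",   # invented example claims
    "bridge photo is from a 2015 event",
]
print(retrieve("video shows the ghost of kyiv downing planes", claims)[0])
```

In the full pipeline, this ranking step only runs on tweets that the claim detection classifier has already flagged as carrying a known claim.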
Recent advancements in machine learning and computer vision have led to the proliferation of Deepfakes. As technology democratizes over time, there is an increasing fear that novice users can create Deepfakes to discredit others and undermine public discourse. In this paper, we conduct user studies to understand whether participants with advanced computer skills and varying levels of computer science expertise can create Deepfakes of a person saying a target statement using limited media files. We conduct two studies; in the first study (n = 39) participants try creating a target Deepfake in a constrained time frame using any tool they desire. In the second study (n = 29) participants use pre-specified deep learning-based tools to create the same Deepfake. We find that for the first study, of the participants successfully created complete Deepfakes with audio and video, whereas for the second user study, of the participants were successful in stitching target speech to the target video. We further use Deepfake detection software tools as well as human examiner-based analysis to classify the successfully generated Deepfake outputs as fake, suspicious, or real. The software detector classified of the Deepfakes as fake, whereas the human examiners classified of the videos as fake. We conclude that creating Deepfakes is a simple enough task for a novice user given adequate tools and time; however, the resulting Deepfakes are not sufficiently real-looking and are unable to completely fool detection software as well as human examiners.
Users are exposed daily to a large volume of harmful content on various social network platforms. One approach to protecting users is developing online moderation tools that use Machine Learning (ML) techniques for automatic detection or content filtering. At the same time, processing user data requires compliance with privacy policies. This paper proposes a privacy-preserving Federated Learning (FL) framework for online content moderation that incorporates Central Differential Privacy (CDP). We simulate the FL training of a classifier for detecting tweets with harmful content and show that the performance of the FL framework can be close to that of the centralized approach. Moreover, performance remains high even if only a small number of clients (each with a small number of tweets) are available for FL training. When reducing the number of clients (from fifty to ten) or the tweets per client (from 1K to 100), the classifier can still achieve AUC. Furthermore, we extend the evaluation to four other Twitter datasets that capture different types of user misbehavior and still obtain promising performance (61%–80% AUC).
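A minimal sketch of the FL-with-CDP idea, assuming the common server-side recipe of clipping each client update, averaging, and adding Gaussian noise; the update vectors, clip norm, and noise scale below are illustrative choices, not the paper’s settings:

```python
# Sketch of federated averaging with Central Differential Privacy: the
# server clips each client update, averages, and adds Gaussian noise.
import math, random

def clip(update, max_norm):
    """Scale an update so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def cdp_aggregate(client_updates, max_norm=1.0, noise_std=0.1, seed=0):
    rng = random.Random(seed)
    clipped = [clip(u, max_norm) for u in client_updates]
    dim = len(clipped[0])
    avg = [sum(u[d] for u in clipped) / len(clipped) for d in range(dim)]
    # Clipping bounds each client's influence, so the added noise yields
    # a differential privacy guarantee at the aggregate level.
    return [a + rng.gauss(0.0, noise_std) for a in avg]

updates = [[0.5, -0.2], [3.0, 4.0], [-0.1, 0.3]]  # one vector per client
print([round(v, 2) for v in cdp_aggregate(updates)])
```

The server applies the noisy average to the global model each round; individual tweets never leave the clients.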
Locating and characterizing polarization is one of the most important steps toward a healthier web ecosystem. Finding groups of nodes that form strongly stable agreements and engage in collective conflicts with other groups is an important problem in this context. Previous works approach this problem by finding balanced subgraphs in which a polarity measure is optimized, which yields large subgraphs without a clear notion of agreement or conflict. In real-world signed networks, balanced subgraphs are often not polarized, as in the case of a subgraph with only positive edges. To remedy this, we leverage the notion of cohesion: we find pairs of cohesively polarized communities where each node in a community is positively connected to nodes in the same community and negatively connected to nodes in the other community. To capture cohesion along with polarization, we define a new measure, dichotomy. We leverage balanced triangles, which model cohesion and polarization at the same time, to design a heuristic that produces good seedbeds for polarized communities in real-world signed networks. Then, we introduce the electron decomposition, which finds cohesively polarized communities with high dichotomy scores. In an extensive experimental evaluation, we show that our method finds cohesively polarized communities and outperforms state-of-the-art methods with respect to several measures. Moreover, our algorithm is more efficient than existing methods and practical for large-scale networks.
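The role of balanced triangles can be illustrated on a toy signed graph: a triangle with an even number of negative edges is balanced, and the two-negative-edge case exhibits both cohesion (the positive edge) and polarization (the two negative edges). The graph below is invented:

```python
# Sketch: count balanced triangles in a signed graph, split into the
# all-positive (cohesive) and two-negative (polarized) cases.
from itertools import combinations

def balanced_triangles(nodes, signs):
    """`signs` maps an undirected edge (frozenset of 2 nodes) to +1 or -1."""
    cohesive = polarized = 0
    for tri in combinations(nodes, 3):
        edges = [frozenset(p) for p in combinations(tri, 2)]
        if not all(e in signs for e in edges):
            continue  # the three nodes do not form a triangle
        negatives = sum(1 for e in edges if signs[e] == -1)
        if negatives == 0:
            cohesive += 1
        elif negatives == 2:
            polarized += 1
    return cohesive, polarized

signs = {frozenset(e): s for e, s in [
    (("a", "b"), 1), (("a", "c"), 1), (("b", "c"), 1),  # all-positive triangle
    (("a", "d"), -1), (("b", "d"), -1),                 # d opposes a and b
]}
print(balanced_triangles("abcd", signs))  # (1, 1)
```

A seedbed heuristic in the spirit of the paper would favor node pairs that participate in many such triangles.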
With the growing ubiquity of the Internet and access to media-based social media platforms, the risks associated with media content sharing on social media, and the need for safety measures against such risks, have grown paramount. At the same time, risk is highly contextualized, especially when it comes to media content youth share privately on social media. In this work, we conducted qualitative content analyses of risky media content flagged by youth participants and research assistants of similar ages to explore the contextual dimensions of youth online risks. The contextual risk dimensions were then used to inform semi- and self-supervised state-of-the-art vision transformers to automate the identification of risky images shared by youth. We found that vision transformers are capable of learning complex image features for use in automated risk detection and classification. The results of our study serve as a foundation for designing contextualized and youth-centered machine-learning methods for automated online risk detection.
Social media have been deliberately used for malicious purposes, including political manipulation and disinformation. Most research focuses on high-resource languages, yet malicious actors share content across countries and languages, including low-resource ones. Here, we investigate whether and to what extent malicious actors can be detected in low-resource language settings. We discovered that a high number of accounts posting in Tagalog were suspended as part of Twitter’s crackdown on interference operations after the 2016 US Presidential election. By combining text embedding and transfer learning, our framework can detect, with promising accuracy, malicious users posting in Tagalog without any prior knowledge or training on malicious content in that language. We first learn an embedding model for each language, namely a high-resource language (English) and a low-resource one (Tagalog), independently. Then, we learn a mapping between the two latent spaces to transfer the detection model. We demonstrate that the proposed approach significantly outperforms state-of-the-art models and yields marked advantages in settings with very limited training data, the norm when detecting malicious activity on online platforms.
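The latent-space mapping step can be sketched with toy 2-D vectors. Real systems learn the map between high-dimensional text embeddings, often with an orthogonality constraint; a plain least-squares linear map fitted on a small seed set shows the idea (all vectors below are invented):

```python
# Sketch of cross-lingual transfer: learn a linear map W from a
# low-resource embedding space into a high-resource one, so a detector
# trained in the high-resource space can score mapped vectors.

def fit_linear_map(src, tgt):
    """Least-squares 2x2 map W with src @ W ~= tgt (normal equations)."""
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(src, tgt):
        for i in range(2):
            for j in range(2):
                xtx[i][j] += x[i] * x[j]
                xty[i][j] += x[i] * y[j]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    inv = [[xtx[1][1] / det, -xtx[0][1] / det],
           [-xtx[1][0] / det, xtx[0][0] / det]]
    return [[sum(inv[i][k] * xty[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_map(w, x):
    return [x[0] * w[0][0] + x[1] * w[1][0], x[0] * w[0][1] + x[1] * w[1][1]]

# Toy seed pairs: the target space is the source rotated by 90 degrees.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
w = fit_linear_map(src, tgt)
print([round(v, 6) for v in apply_map(w, [2.0, 0.0])])  # [0.0, 2.0]
```

Here the fitted map exactly recovers the rotation, so an unseen source vector lands where the target space expects it.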
Fraud and other types of adversarial behavior are serious problems on customer-to-customer (C2C) e-commerce platforms, where harmful behaviors by bad actors erode user trust and safety. Many modern e-commerce integrity systems utilize machine learning (ML) to detect fraud and bad actors. We discuss the practical problems faced by integrity systems that utilize data associated with user interactions with the platform. Specifically, we focus on the challenge of representing user interaction events and aggregating their features. We compare the performance of two paradigms for handling feature temporality when training ML models: hand-engineered temporal aggregation and learned aggregation using a sequence encoder. We show that a model which learns a time aggregation using a sequence encoder outperforms models trained on handcrafted aggregations on the fraud classification task with a real-world dataset.
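The hand-engineered baseline can be sketched as windowed aggregations over a user’s event history; the event fields and window sizes below are illustrative, and the learned alternative would instead feed the raw event sequence to a sequence encoder:

```python
# Sketch of hand-engineered temporal aggregation: turn a variable-length
# list of interaction events into fixed features over time windows.
from datetime import datetime, timedelta

def aggregate(events, now, windows=(1, 7, 30)):
    """Per-window event count and total amount, e.g. over the last 7 days."""
    feats = {}
    for days in windows:
        start = now - timedelta(days=days)
        recent = [e for e in events if e["ts"] >= start]
        feats[f"count_{days}d"] = len(recent)
        feats[f"amount_{days}d"] = sum(e["amount"] for e in recent)
    return feats

now = datetime(2023, 6, 1)
events = [  # invented interaction events for one user
    {"ts": datetime(2023, 5, 31), "amount": 20.0},
    {"ts": datetime(2023, 5, 20), "amount": 5.0},
    {"ts": datetime(2023, 4, 1), "amount": 300.0},
]
print(aggregate(events, now))
```

The fixed feature vector feeds a standard classifier; the paper’s finding is that replacing this step with a learned sequence encoder improves fraud classification.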
The spread of radicalization through the Internet is a growing problem. We are witnessing a rise in online hate groups, inspiring an impressionable and vulnerable population towards extreme actions in the real world. In this paper, we study the spread of hate sentiments in online forums by collecting 1,973 long comment threads (30+ comments per thread) posted on dark-web forums and containing a combination of benign posts and radical comments on the Islamic religion. This framework allows us to leverage network analysis tools to investigate sentiment propagation through a social network. By combining sentiment analysis, social network analysis, and graph theory, we aim to shed light on the propagation of hate speech in online forums and the extent to which such speech can influence individuals. The results of the intra-thread analysis suggest that sentiment tends to cluster within comment threads, with around 75% of connected members sharing similar sentiments; they also indicate that online forums can act as echo chambers where people with similar views reinforce each other’s beliefs and opinions. The inter-thread analysis shows that 64% of connected threads share similar sentiments, suggesting similarities between the ideologies present in different threads and a likely wider network of individuals spreading hate speech across forums. Finally, we plan to extend this work to a larger dataset, which could provide further insights into the spread of hate speech in online forums and how to mitigate it.
Cybergrooming is a serious cybercrime that primarily targets youths through online platforms. Although reactive predator detection methods have been studied, proactive victim protection and crime prevention can also be achieved through vulnerability analysis of potential youth victims. Despite its significance, vulnerability analysis has not been thoroughly studied in the data science literature, while several social science studies have used survey-based methods. To address this gap, we investigate humans’ social-psychological traits and quantify key vulnerability factors to cybergrooming by analyzing text features from the Linguistic Inquiry and Word Count (LIWC) tool. Through pairwise correlation studies, we quantify the degree to which key vulnerability dimensions to cybergrooming manifest in youths’ conversational features. Our findings reveal that victims show negative correlations with family and community traits, contrasting with previous social survey studies that identified family relationships or social support as key vulnerability factors. We discuss the current limitations of text mining analysis and suggest cross-validation methods to increase the validity of research findings. Overall, this study provides valuable insights into the vulnerability factors to cybergrooming and highlights the importance of adopting multidisciplinary approaches.
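The pairwise correlation step can be sketched with Pearson’s r between a LIWC category score and a victim label; the scores and labels below are invented toy values, chosen only to mirror the direction of the reported negative correlation with family traits:

```python
# Sketch: Pearson correlation between a LIWC "family" word rate and a
# binary victim label across conversations. All values are invented.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

family_rate = [0.9, 0.7, 0.8, 0.2, 0.1, 0.3]  # LIWC "family" scores
is_victim   = [0,   0,   0,   1,   1,   1]    # 1 = victim conversation
print(round(pearson(family_rate, is_victim), 3))  # -0.965
```

A strongly negative r, as in this toy data, would indicate that victim conversations use family-related language less often.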
Social media platforms provide fertile ground for investigating the processes of identity creation and communication that shape individual and public opinion. The computational methods used in social network analysis have opened the way for new approaches to be used to understand the psychological and social processes that occur when users take part in online social movements or digital activism. The research in this paper takes an interdisciplinary approach bridging social identity and deindividuation theories to show how shared, individual social identities merge into a collective identity using computational techniques. We demonstrate a novel approach to evaluating the emergence of collective identity by measuring: 1) the statistical similarity of discussion topics within online communities and 2) the strength of these communities by examining network modularity and assortative properties of the network. To accomplish this, we examined the online connective action campaign of the #stopthesteal movement that emerged during the 2020 U.S. Presidential Election. Our dataset consisted of 838,395 tweets posted by 178,296 users collected from January 04, 2020, to January 31, 2021. The results show that the network becomes more cohesive and topic similarity increases within communities leading up to and just after the elections (event 1) and the U.S. Capitol riot (event 2). Taking this multi-method approach of measuring content and network structure over time helps researchers and social scientists understand the emergence of a collective community as it is being constructed. The use of computational methods to study collective identity formation can help researchers identify the behaviors and social dynamics emerging from this type of cyber-collective movement that often serve as catalysts for these types of events. Finally, this research offers a new way to assess the psycho-social drivers of participant behaviors in cyber collective action.
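The network-strength measurement can be illustrated with Newman modularity of a given community partition; the edges and partition below are a toy example, not the #stopthesteal data:

```python
# Sketch: Newman modularity Q = sum_c (e_c/m - (d_c/2m)^2), where e_c is
# the number of intra-community edges and d_c the total degree of
# community c. Higher Q indicates a more cohesive community structure.
def modularity(edges, communities):
    m = len(edges)
    node_comm = {n: c for c, nodes in enumerate(communities) for n in nodes}
    intra = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[node_comm[u]] += 1
        degree_sum[node_comm[v]] += 1
        if node_comm[u] == node_comm[v]:
            intra[node_comm[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2
               for e, d in zip(intra, degree_sum))

edges = [("a", "b"), ("b", "c"), ("a", "c"),   # two tight triangles
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]                           # one bridge between them
print(round(modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}]), 3))
```

Tracking this quantity over successive time slices, alongside topic similarity within each community, is the kind of two-pronged measurement the study describes.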
For more than a decade, scholars have been investigating the flow of disinformation on social media around societal events such as elections. In this paper, we analyze the Twitter traffic related to the US 2020 pre-election debate and ask whether it mirrors the electoral system. The U.S. electoral system provides that, regardless of the actual vote gap, the presidential candidate who receives more votes in a state ‘takes’ that state. Critics of this system have pointed out that election campaigns can be more intense in particular key states needed for victory, the so-called swing states. Our intuition is that the election debate may cause more traffic on Twitter, and probably be more plagued by misinformation, when associated with swing states. The results mostly confirm this intuition. About 88% of the entire traffic can be associated with swing states, and links to non-trustworthy news are shared far more in swing-related traffic than in safe-related traffic. Considering traffic origin instead, non-trustworthy tweets generated by automated accounts, so-called social bots, are mostly associated with swing states. Our work sheds light on the role an electoral system plays in the evolution of online debates, with disinformation and social bots in the spotlight.
Twitter bots amplify target content in a coordinated manner to make it appear popular, an astroturfing attack. Such attacks promote certain keywords to push them into Twitter trends and make them visible to a broader audience. Past work on such fake trends revealed a new astroturfing attack, named ephemeral astroturfing, that employs a distinctive bot behavior in which bots post and delete generated tweets in a coordinated manner. Because such bots can be mass-annotated reliably, they are a convenient source of ground truth for bot research. In this paper, we detect and disclose over 212,000 such bots targeting Turkish trends, which we name astrobots, and analyze their activity and suspension patterns. We found that Twitter purged those bots en masse six times since June 2018; however, the adversaries reacted quickly and deployed new bots from accounts created years earlier. We also found that many such bots do not post tweets apart from promoting fake trends, which makes it challenging for bot detection methods to detect them. Our work provides insights into platforms’ content moderation practices and bot detection research. The dataset is publicly available at https://github.com/tugrulz/EphemeralAstroturfing.
Social media platforms have significantly impacted almost every aspect of people’s lives, effectively shaping global mindsets and opinions on political, economic, and social events. Such waves of opinion formation include propaganda and misinformation. Online propaganda influences the emotional and psychological orientation of people. Remarkable advances in Machine Learning and Natural Language Processing have helped in analyzing the emotional and psychological effects of cyber social threats such as propaganda campaigns on different nations, specifically in the Middle East, where rates of dispute have risen since the Arab Spring and the ongoing crises. In this paper, we present an approach to detect propaganda, and the associated emotional and psychological aspects, from social media news headlines that carry such contextualized cyber social attacks. We created a new dataset of headlines containing propaganda tweets, and another dataset of potential emotions that the audience might experience when exposed to such propaganda headlines. We believe this is the first research to address the detection of emotional reactions linked to propaganda types on social media in the Middle East.
Knowledge Graphs have become a foundation for sharing data on the web and building intelligent services across many sectors, including within some of the most successful corporations in the world. The over-centralisation of data on the web, however, has been raised as a concern by a number of prominent researchers in the field. For example, at the beginning of 2022 a €2.7B civil lawsuit was launched against Meta on the basis that it has abused its market dominance to impose unfair terms and conditions on UK users in order to exploit their personal data. Data centralisation can lead to a number of problems, including lock-in/siloing effects, lack of user control over personal data, limited incentives and opportunities for interoperability and openness, and the resulting detrimental effects on privacy and innovation. A number of diverse approaches and technologies exist for decentralising data, such as federated querying and distributed ledgers. The main question, though, is: what does decentralisation really mean for web data and Knowledge Graphs? What are the main issues and tradeoffs involved? These questions and others are addressed in this workshop.
The Data Catalogue Vocabulary (DCAT) standard is a popular RDF vocabulary for publishing metadata about data catalogs and a valuable foundation for creating Knowledge Graphs. It has widespread application in the (Linked) Open Data and scientific communities. However, DCAT does not specify a robust mechanism for creating and maintaining persistent identifiers for datasets; it relies on Internationalized Resource Identifiers (IRIs), which are not necessarily unique, resolvable, or persistent. This impedes findability, citability, and traceability of derived and aggregated data artifacts. As a remedy, we propose a decentralized identifier registry in which persistent identifiers are managed by a set of collaborating distributed nodes. Every node gives full access to all identifiers, since an unambiguous state is shared across all nodes. This facilitates a common view of the identifiers without the need for a (virtually) centralized directory. To support this architecture, we propose a data model and network methodology based on a distributed ledger and the W3C recommendation for Decentralized Identifiers (DIDs). We implemented our approach as a working prototype on a five-peer test network based on Hyperledger Fabric.
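The shared-state idea can be sketched by deriving identifiers deterministically from the metadata itself, so every replica mints the same identifier for the same record. The `did:example` method name, the metadata fields, and the in-memory registry are all illustrative stand-ins for the Hyperledger Fabric prototype:

```python
# Sketch: a replicated identifier registry where DID-style identifiers
# are derived from a hash of canonicalized dataset metadata, so all
# nodes agree on the identifier without a central directory.
import hashlib, json

def mint_did(metadata):
    canonical = json.dumps(metadata, sort_keys=True).encode()
    return "did:example:" + hashlib.sha256(canonical).hexdigest()[:16]

class Registry:
    """Append-only map of identifier -> metadata, as one node's replica."""
    def __init__(self):
        self.entries = {}

    def register(self, metadata):
        did = mint_did(metadata)
        if did not in self.entries:  # idempotent: re-registering is a no-op
            self.entries[did] = metadata
        return did

node_a, node_b = Registry(), Registry()
meta = {"title": "Air quality 2022", "publisher": "City of Bonn"}
# Both replicas mint the same identifier for the same metadata.
print(node_a.register(meta) == node_b.register(meta))  # True
```

In the actual architecture, the ledger consensus (rather than a shared hash function alone) guarantees that all peers converge on one unambiguous registry state.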
The Open Digital Rights Language (ODRL) is a widely adopted standard for expressing privacy policies. This article presents several challenges identified in the context of the European project AURORAL, in which ODRL is used to express privacy policies for Smart Communities and Rural Areas. The article argues that some of these challenges are best addressed directly by the ODRL standardisation group, while others can be tackled independently; for the latter, the authors present potential solutions, in particular for incorporating dynamic values from external data sources into privacy policies. The final challenge remains an open research question, since it revolves around the interoperability of privacy policies that belong to different systems and are expressed in different privacy languages.
Increasing interest in decentralisation for data and processing on the Web brings with it the need to re-examine methods for verifying data and behaviour for scalable multi-party interactions. We consider factors relevant to verification of querying activity on knowledge graphs in a Trusted Decentralised Web, and set out ideas for future research in this area.
When students interact with an online course, the routes they take when navigating through the course can be captured. Learning Analytics is the process of measuring, collecting, recording, and analysing this Student Activity Data. Predictive Learning Analytics, a sub-field of Learning Analytics, can help to identify students who are at risk of dropping out or failing, as well as students who are close to a grade boundary. Course tutors can use the insights provided by the analyses to offer timely assistance to these students. Despite its usefulness, there are privacy and ethical issues with the typically centralised approach to Predictive Learning Analytics. In this positioning paper, it is proposed that the issues associated with Predictive Learning Analytics can be alleviated, in a framework called EMPRESS, by combining 1) self-sovereign data, where data owners control who legitimately has access to data pertaining to them, 2) Federated Learning, where the data remains on the data owner’s device and/or the data is processed by the data owners themselves, and 3) Graph Convolutional Networks for Heterogeneous graphs, which are examples of knowledge graphs.
The concepts for dataspaces range from database management systems to cross-company platforms for data and applications. In this short paper, we present the “Solid Data Space” (SDS), a concept for dataspaces that builds on top of the (Semantic) Web and Social Linked Data (Solid). Existing Web technologies and Linked Data principles form the foundation for open, decentralized networks for sovereign data exchange between citizens, organizations, and companies. Domain-specific dataspace implementations can extend the agreements for communication and collaboration to enable specific functionality. We compare the SDS with the principles and components of the emerging International Data Spaces to identify similarities and point out technological differences.
In today’s Internet, almost any party can share data with any other. However, creating frameworks and regulated realms for data sharing becomes very complex when multiple parties are involved and complicated regulation comes into play. As a solution, data spaces were introduced to enable participating parties to share data among themselves in an organized, regulated, and standardized way. However, contract data processors, acting as data space participants, are currently unable to execute data requests on behalf of their contract partners. Here we show that an on-behalf-of actor model can easily be added to existing data spaces, and we demonstrate how this extension can be realized using verifiable credentials. We provide a sample use case and a detailed sequence diagram, and discuss the necessary architectural adaptations and additions to established protocols. With the extensions explained in this work, numerous real-life use cases that previously could not technically be realized can now be covered, enabling future data spaces to support more dynamic and complex real-world use cases.
This position paper proposes a hybrid architecture for secure and efficient data sharing and processing across dynamic data spaces. On the one hand, current centralized approaches are plagued by issues such as lack of privacy and control for users, high costs, and poor performance, making them unsuitable for the data spaces prevalent in Europe and various industries, which are decentralized at the conceptual and physical levels while centralized in the underlying implementation. On the other hand, decentralized systems face challenges with limited knowledge of, and control over, the global system, fair resource utilization, and data provenance. Our proposed Semantic Data Ledger (SDL) approach combines the advantages of both architectures to overcome their limitations. SDL allows users to choose the best combination of centralized and decentralized features, providing a decentralized infrastructure for the publication of structured data with machine-readable semantics. It supports expressive structured queries, secure data sharing, and payment mechanisms based on an underlying autonomous ledger, enabling the implementation of economic models and fair-use strategies.
While the concept of “data spaces” is no longer new, its specific application to individuals and personal data management is still undeveloped. This short paper presents a vision for “personal data spaces” in the shape of a work-in-progress description of them and some of the conceptual and implementation features envisioned. It is offered for discussion, debate, and improvement by professionals, policymakers, and researchers operating in the intersection of data spaces and personal data management.
Our vision paper outlines a plan to improve the future of semantic interoperability in data spaces through the application of machine learning. The use of data spaces, where data is exchanged among members in a self-regulated environment, is becoming increasingly popular. However, the current manual practices of managing metadata and vocabularies in these spaces are time-consuming, prone to errors, and may not meet the needs of all stakeholders. By leveraging the power of machine learning, we believe that semantic interoperability in data spaces can be significantly improved. This involves automatically generating and updating metadata, which results in a more flexible vocabulary that can accommodate the diverse terminologies used by different sub-communities. Our vision for the future of data spaces addresses the limitations of conventional data exchange and makes data more accessible and valuable for all members of the community.
The vast majority of artificial intelligence practitioners overlook the importance of documentation when building and publishing models and datasets. Driven by the recent focus on the explainability and fairness of AI models, several frameworks have been proposed, such as Model Cards and Data Cards, to support the appropriate reuse of those models and datasets. In addition, with the introduction of the dataspace concept for gathering similar datasets in one place, similar Model Cards, Data Cards, Service Cards, and Dataspace Cards could be linked to extract helpful information for better decision-making about which model and data can be used for a specific application. This paper makes the case for a Semantic Web approach to exchanging Model/Data Cards as Linked Data or knowledge graphs in a dataspace, making them machine-readable. We discuss the basic concepts and propose a schema for linking Data Cards and Model Cards within a dataspace. In addition, we introduce the concept of a dataspace card, which can serve as a starting point for extracting knowledge about the models and datasets in a dataspace. This helps build trust in, and reuse of, models and data among companies and individuals participating as publishers or consumers of such assets.
Modern data management is evolving from centralized integration-based solutions to a non-integration-based process of finding, accessing and processing data, as observed within dataspaces. Common reference dataspace architectures assume that sources publish their own domain-specific schema. These schemas, also known as semantic models, can only be partially created automatically and require oversight and refinement by human modellers. Non-expert users, such as mechanical engineers or municipal workers, often have difficulty building models because they are faced with multiple ontologies, classes, and relations, and existing tools are not designed for non-expert users. The PLASMA framework consists of a platform and auxiliary services that focus on providing non-expert users with an accessible way to create and edit semantic models, combining automation approaches and support systems such as a recommendation engine. It also provides data conversion from raw data to RDF. In this paper we highlight the main features, like the modeling interface and the data conversion engine. We discuss how PLASMA as a tool is suitable for building semantic models by non-expert users in the context of dataspaces and show some applications where PLASMA has already been used in data management projects.
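The raw-data-to-RDF conversion this abstract mentions could be sketched as follows; this is not PLASMA's actual code, and all IRIs, field names, and the mapping structure are illustrative assumptions.

```python
# Minimal sketch of the conversion step: a semantic model built by the
# (non-expert) user maps raw field names to ontology property IRIs, and
# each mapped field becomes one RDF triple. Unmapped columns are skipped
# rather than guessed. All names and IRIs are invented for illustration.

def record_to_triples(record, semantic_model, subject_iri):
    """Emit one (subject, property, literal) triple per mapped field."""
    triples = []
    for field, value in record.items():
        prop = semantic_model.get(field)
        if prop is not None:
            triples.append((subject_iri, prop, f'"{value}"'))
    return triples

# A sensor reading mapped with a two-property semantic model:
model = {
    "temp": "<http://example.org/onto#temperature>",
    "ts":   "<http://example.org/onto#timestamp>",
}
triples = record_to_triples(
    {"temp": 21.5, "ts": "2023-01-01", "raw_id": 7},
    model, "<http://example.org/reading/1>")
```

The design point is that the hard part, building the `model` mapping, is exactly where recommendation engines and support systems can help a non-expert user; the conversion itself is then mechanical.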
The exponential growth in data production has led to increasing demand for high-quality data-driven services. Additionally, the benefits of data-driven analysis are vast and have significantly propelled research in many fields. Data sharing benefits scientific advancement, as it promotes transparency and collaboration, accelerates research, and aids in making informed decisions. The European strategy for data aims to create a single data market that ensures Europe’s global competitiveness and data sovereignty. Common European Data Spaces ensure that data from different sources are available in the economy and society, while data providers (e.g., hospitals and scientists) control data access. The National Research Data Infrastructure for Personal Health Data (NFDI4Health) initiative is a prime example of an effort focused on data from clinical trials and public health studies. Collecting and analyzing this data is essential to developing novel therapies, comprehensive care approaches, and preventive measures in modern healthcare systems.
This work describes distributed data analysis services and components that adhere to the FAIR data principles (Findable, Accessible, Interoperable, and Reusable) within the data space environment. We focus on distributed analytics functionality in Gaia-X-based data spaces. Gaia-X offers a trustworthy federation of data infrastructure and service providers for European countries.
With the advent and pervasiveness of the Internet of Things (IoT), big data and cloud computing technologies, digitalization in enterprises and factories has rapidly increased in the last few years. Digital platforms have emerged as an effective mechanism for enabling the management and sharing of data from various companies. To enable sharing of data beyond platform boundaries, new platform federation approaches are increasingly being defined. This makes enabling the federation of digital platforms a key requirement for large-scale dataspaces. Therefore, the identification of platform federation requirements and building blocks for such dataspaces needs to be addressed systematically. In this paper, we systematically explore the high-level requirements for enabling a federation of digital platforms in the manufacturing domain and identify a set of building blocks. We integrate the requirements and building blocks into the notion of dataspaces. The identified requirements and building blocks act as a blueprint for designing and instantiating new dataspaces, thereby speeding up the development process and reducing costs. We present a case study to illustrate how the use of common building blocks can act as common guiding principles and result in more complete and interoperable implementations.
Multimodal knowledge graphs have the potential to enhance data spaces by providing a unified and semantically grounded structured representation of multimodal data produced by multiple sources. With the ability to integrate and analyze data in real-time, multimodal knowledge graphs offer a wealth of insights for smart city applications, such as monitoring traffic flow, air quality, public safety, and identifying potential hazards. Knowledge enrichment can enable a more comprehensive representation of multimodal data and intuitive decision-making with improved expressiveness and generalizability. However, challenges remain in effectively modelling the complex relationships between and within different types of modalities in data spaces and infusing common sense knowledge from external sources. This paper reviews the related literature and identifies major challenges and key requirements for effectively developing multimodal knowledge graphs for data spaces, and proposes an ontology for their construction.
The circular economy (CE) is essential to achieving a sustainable future through resource conservation and climate protection. Efficient use of materials and products over time is a critical aspect of CE, helping to reduce CO2 emissions, waste and resource consumption. The Digital Product Passport (DPP) is a CE-specific approach that contains information about components and their origin, and can also provide environmental and social impact assessments. However, creating a DPP requires collecting and analyzing data from many different stakeholders along the supply chain and even throughout the product lifecycle. In this paper, we present a concept for the SPACE_DS, which is a data space for circular economy data. A key point here is that the SPACE_DS enables the creation of DPPs by especially considering privacy and security concerns of data providers.
Capturing, visualizing and analyzing provenance data to better understand and support analytic reasoning processes is a rapidly growing research field named analytic provenance. Provenance data includes the state of a visualization within a tool as well as the interactions the user performed while working with the tool. Research in this field has produced many new approaches that generate data for specific tools and use cases. However, since real analysis use cases involve a variety of tools and analytic tasks, building an interoperable baseline data corpus for investigating the transferability of different approaches remains a problem. In this paper, we present a visionary data space architecture for integrating and processing analytic provenance data in a unified way using semantic modeling. We discuss emerging challenges and research opportunities to realize such a vision using semantic models in data spaces to enable analytic provenance data interoperability.
The term dataspace was coined two decades ago and has evolved since then. Definitions range from (i) an abstraction for data management in an identifiable scope over (ii) a multi-sided data platform connecting participants in an ecosystem to (iii) interlinking data towards loosely connected (global) information. Many implementations and scientific notions follow different interpretations of the term dataspace, but agree on some use of semantic technologies. For example, dataspaces such as the European Open Science Cloud and the German National Research Data Infrastructure are committed to applying the FAIR principles [11, 16]. Dataspaces built on top of Gaia-X are using semantic methods for service Self-Descriptions. This paper investigates ongoing dataspace efforts and aims to provide insights on the definition of the term dataspace, the usage of semantics and FAIR principles, and future directions for the role of semantics in dataspaces.
Provenance has been studied extensively to explain existing and missing results for many applications while focusing on scalability and usability challenges. Recently, techniques that efficiently compute a compact representation of provenance have been introduced. In this work, we introduce a practical solution that computes a sample of provenance for existing results without computing full provenance. Our technique computes a sample of provenance based on the distribution of provenance with respect to the query result, which is estimated from the distribution of the input data while taking the correlation among the data into account. A preliminary evaluation demonstrates that, compared to the naive approach, our method efficiently computes a sample of (potentially large) provenance with low error.
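The general idea of allocating a provenance sample from an estimated distribution, rather than from full provenance, can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: the per-group fractions and candidate tuples are made-up inputs standing in for what the real technique estimates from the input data.

```python
import random

# Hedged sketch: stratified sampling of provenance tuples per query-result
# group, with per-group sample sizes allocated from an *estimated*
# provenance distribution instead of from fully computed provenance.

def sample_provenance(groups, estimated_dist, total_sample, seed=0):
    """groups: result group -> candidate provenance tuples.
    estimated_dist: result group -> estimated fraction of all provenance."""
    rng = random.Random(seed)
    sample = {}
    for g, tuples in groups.items():
        k = max(1, round(total_sample * estimated_dist.get(g, 0)))
        k = min(k, len(tuples))            # never oversample a group
        sample[g] = rng.sample(tuples, k)
    return sample

# One result group with ~90% of the provenance, one with ~10%:
groups = {"A": list(range(100)), "B": list(range(10))}
est = {"A": 0.9, "B": 0.1}
s = sample_provenance(groups, est, total_sample=20)
```

The estimate only determines sample sizes, so no full provenance needs to be materialized before sampling; the error of the sample then depends on how good the estimated distribution is.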
In the field of data-driven research and analysis, the quality of results largely depends on the quality of the data used. Data cleaning is a crucial step in improving the quality of data. Still, it is equally important to document the steps made during the data cleaning process to ensure transparency and enable others to assess the quality of the resulting data. While provenance models such as W3C PROV have been introduced to track changes and events related to any entity, their use in documenting the provenance of data-cleaning workflows can be challenging, particularly when mixing different types of documents or entities in the model. To address this, we propose a conceptual model and analysis that breaks down data-cleaning workflows into process abstraction and workflow recipes, refining operations to the column level. This approach provides users with detailed provenance information, enabling transparency, auditing, and support for data cleaning workflow improvements. Our model has several features that allow static analysis, e.g., to determine the minimal input schema and expected output schema for running a recipe, to identify which steps violate the column schema requirement constraint, and to assess the reusability of a recipe on a new dataset. We hope that our model and analysis will contribute to making data processing more transparent, accessible, and reusable.
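The static analyses named in this abstract (minimal input schema, expected output schema, schema-requirement violations) can be sketched under the assumption that each recipe step declares the columns it reads, adds, and drops; the step contents below are invented for illustration and are not the paper's model.

```python
# Hedged sketch of column-level static analysis over a data-cleaning
# recipe. Each step is a dict with optional "reads"/"adds"/"drops" lists.

def analyze_recipe(steps):
    required, schema, dropped, violations = set(), set(), set(), []
    for i, step in enumerate(steps):
        for col in step.get("reads", []):
            if col in dropped:
                violations.append((i, col))  # reads a column dropped earlier
            elif col not in schema:
                required.add(col)            # must come from the input
                schema.add(col)
        schema |= set(step.get("adds", []))
        for col in step.get("drops", []):
            schema.discard(col)
            dropped.add(col)
    return required, schema, violations

recipe = [
    {"reads": ["name"], "adds": ["name_clean"], "drops": ["name"]},
    {"reads": ["age"]},
    {"reads": ["name"]},   # violates the column schema requirement
]
required, output, violations = analyze_recipe(recipe)
```

Running the analysis on the toy recipe shows its reusability value: before applying the recipe to a new dataset, one can check that the dataset provides the `required` columns and that no step reads a column a previous step dropped.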
We present a provenance model for the generic workflow of numerical Lattice Quantum Chromodynamics (QCD) calculations, which constitute an important component of particle physics research. These calculations are carried out on the largest supercomputers worldwide with data in the multi-PetaByte range being generated and analyzed. In the Lattice QCD community, a custom metadata standard (QCDml) that includes certain provenance information already exists for one part of the workflow, the so-called generation of configurations.
In this paper, we follow the W3C PROV standard and formulate a provenance model that includes both the generation part and the so-called measurement part of the Lattice QCD workflow. We demonstrate the applicability of this model and show how the model can be used to answer some provenance-related research questions. However, many important provenance questions in the Lattice QCD community require extensions of this provenance model. To this end, we propose a multi-layered provenance approach that combines prospective and retrospective elements.
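A toy version of how the two workflow parts could be expressed as W3C PROV statements (entities, activities, usage and generation relations) and then queried retrospectively is sketched below. All identifiers are illustrative; they are not taken from QCDml or from the model in the paper.

```python
# Hedged sketch: the generation part produces a gauge configuration, and
# the measurement part consumes it to produce an observable. PROV
# statements are represented as plain dicts for self-containment.

doc = [
    {"type": "entity",   "id": "ex:gauge-configuration"},
    {"type": "activity", "id": "ex:generation"},
    {"type": "wasGeneratedBy", "entity": "ex:gauge-configuration",
     "activity": "ex:generation"},
    # measurement part consumes the generated configuration
    {"type": "activity", "id": "ex:measurement"},
    {"type": "used", "activity": "ex:measurement",
     "entity": "ex:gauge-configuration"},
    {"type": "entity", "id": "ex:correlator"},
    {"type": "wasGeneratedBy", "entity": "ex:correlator",
     "activity": "ex:measurement"},
]

def derivation_chain(doc, entity):
    """Which activity generated this entity, and which inputs did it use?"""
    gen = next(s for s in doc
               if s["type"] == "wasGeneratedBy" and s["entity"] == entity)
    used = [s["entity"] for s in doc
            if s["type"] == "used" and s["activity"] == gen["activity"]]
    return gen["activity"], used
```

A query like `derivation_chain(doc, "ex:correlator")` answers the retrospective question of how a measurement result came to be; the multi-layered approach proposed in the paper would add prospective statements describing the planned workflow alongside such retrospective ones.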
Organisations have to comply with environmental regulations to protect the environment and meet internationally agreed climate change targets. To assist organisations, processes and standards are being defined to manage these compliance obligations. They typically rely on the notion of an Environmental Management System (EMS), defined as a reflective framework allowing organisations to set and manage their goals and demonstrate that they follow due processes in order to comply with prevailing regulations. The importance of these obligations is highlighted by the fact that failing to comply may lead to significant liabilities for organisations. An EMS framework, typically structured as a set of documents and spreadsheets, contains a record of continuously evolving regulations, teams, stakeholders, actions and updates. However, the maintenance of an EMS is often human-driven, and therefore error-prone despite the meticulousness of environmental officers, and further requires external human auditing to check its validity. To avoid greenwashing, but also to contain the burden and cost of compliance, it is desirable for these claims to be checked by trusted automated means. Provenance is ideally suited to tracking the changes occurring in an EMS, allowing queries to determine precisely which compliance objective is prevailing at any point in time, whether it is being met, and who is responsible for it. Thus, this paper has a dual aim: first, it investigates the benefits of provenance for EMS; second, it presents the application of an emerging approach, “Provenance-By-Design”, which automatically converts a specification of an EMS data model and its provenance into a data backend, a service for processing and querying EMS provenance data, a client-side library to interact with such a service, and a simple user interface allowing developers to navigate the provenance.
The application of a Provenance-By-Design approach to EMS applications results in novel opportunities for a provenance-based EMS; we present our preliminary reflection on their potential.
A Deep Learning (DL) life cycle involves several data transformations, such as performing data pre-processing, defining datasets to train and test a deep neural network (DNN), and training and evaluating the DL model. Choosing a final model requires DL model selection, which involves analyzing data from several training configurations (e.g., hyperparameters and DNN architectures). Tracing training data back to pre-processing operations can provide insights into the model selection step. Provenance is a natural solution for representing data derivations across the whole DL life cycle. However, there are challenges in integrating the provenance of these different steps. A few approaches exist for capturing and integrating provenance data from the DL life cycle, but they require that the same provenance capture solution be used across all steps, which can limit interoperability and flexibility when choosing the DL environment. Therefore, in this work, we present a prototype for provenance data integration using different capture solutions. We show use cases where the integrated provenance from the pre-processing and training steps reveals how data pre-processing decisions influenced model selection. Experiments were performed using real-world datasets to train a DNN and provided evidence of the integration between the considered steps, answering queries such as how the data used to train a model that achieved a specific result was processed.
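The integration idea, joining provenance exported by two different capture tools on shared dataset identifiers, can be sketched as follows; the record fields and values are assumptions for illustration, not the prototype's actual format.

```python
# Hedged sketch: provenance from a pre-processing capture tool and from an
# ML experiment tracker are joined on the dataset identifier so that a
# model can be traced back to the pre-processing operations that produced
# its training data. All field names and values below are invented.

preprocessing_prov = [   # e.g., exported by a script-level capture tool
    {"output": "train.csv", "operation": "impute_missing", "input": "raw.csv"},
]
training_prov = [        # e.g., exported by an experiment-tracking tool
    {"model": "model-v3", "dataset": "train.csv", "accuracy": 0.91},
]

def trace_model(model_id, training_prov, preprocessing_prov):
    """Answer: how was the data used to train this model processed?"""
    run = next(r for r in training_prov if r["model"] == model_id)
    ops = [p["operation"] for p in preprocessing_prov
           if p["output"] == run["dataset"]]
    return {"dataset": run["dataset"], "preprocessing": ops}
```

Because the join key is just a shared identifier, each step can keep its own capture solution; only the exported records need a common vocabulary, which is the interoperability point the abstract makes.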
Data provenance has lately raised much attention across disciplines, as it has been shown that enriching data with provenance information leads to better credibility and renders data more FAIR, fostering data reuse. The biomedical domain has also recognised the potential of provenance capture. However, several obstacles prevent efficient, automated, and machine-interpretable enrichment of biomedical data with provenance information, such as data heterogeneity, complexity, and sensitivity. Here, we explain how clinical data in Germany are transferred from hospital information systems into a data integration centre to enable secondary use of patient data, and how they can be reused as research data. Considering the complex data infrastructures in hospitals, we indicate obstacles and opportunities in collecting provenance information along heterogeneous data processing pipelines. To express provenance data, we indicate the usage of the Fast Healthcare Interoperability Resources (FHIR) provenance resource for healthcare data. In addition, we consider already existing approaches from other research fields and standards communities. As a step towards high-quality standardised clinical research data, we propose to develop a ‘MInimal Requirements for Automated Provenance Information Enrichment’ (MIRAPIE) guideline. As a community project, MIRAPIE should generalise provenance information concepts to allow world-wide applicability, possibly beyond the health care sector.
Good scientific work requires comprehensible, transparent and reproducible research. One way to ensure this is to include all data relevant to a study or evaluation when publishing an article. This data should be at least aggregated or anonymized, at best compact and complete, but always resilient.
In this paper we present ProSA, a system for calculating the minimal necessary data set, called a sub-database. For this, we combine the Chase — a set of algorithms for transforming databases — with additional provenance information. We describe the provenance implementation guided by the ProSA pipeline and show its use in generating an optimized sub-database. Further, we demonstrate what the ProSA GUI looks like and present some applications and extensions.
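One way to convey the sub-database idea is with why-provenance: each published result carries its witnesses, and a minimal sub-database keeps one witness per result. The greedy sketch below is an illustration of that idea only, not ProSA's Chase-based implementation, and the tuple names are invented.

```python
# Hedged sketch: extract a small sub-database that still yields the
# published query results, given why-provenance (per result, the sets of
# input tuples that each suffice to derive it).

def minimal_subdatabase(why_provenance):
    """why_provenance: result -> list of witnesses (frozensets of tuples).
    Greedy: for each result, keep the witness adding the fewest new tuples."""
    sub = set()
    for witnesses in why_provenance.values():
        best = min(witnesses, key=lambda w: len(w - sub))
        sub |= best
    return sub

wp = {
    "r1": [frozenset({"t1", "t2"}), frozenset({"t3"})],
    "r2": [frozenset({"t3", "t4"})],
}
sub = minimal_subdatabase(wp)
```

The greedy choice is not guaranteed globally minimal (the underlying set-cover-style problem is hard), which is one reason a principled Chase-plus-provenance approach such as ProSA's is attractive.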
Prior work on explaining missing (unexpected) query results identifies which parts of the query or data are responsible for the erroneous result, or repairs the query or data to fix such errors. The problem of generating repairs is typically expressed as an optimization problem, i.e., a single repair is returned that is optimal with respect to some criterion, such as minimizing the repair’s side effects. However, such an optimization objective may not faithfully model a user’s (often hard to formalize) notion of which repair is “correct”. In this paper, we motivate hybrid explanations and repairs, i.e., repairs that fix both the query and the data. Instead of returning one “optimal” repair, we argue for an approach that empowers the user to explore the space of possible repairs effectively. We also present a proof-of-concept implementation and outline open research problems.
Financial Literacy (FL) initiatives, aimed at young people in formal or informal learning spaces, are advocated and implemented in several countries, encouraged since 2005 by the Organisation for Economic Co-operation and Development (OECD). In Brazil, the teaching and learning process in several areas has been stimulated through academic competitions, generally called Knowledge Olympics, which are essentially student contests that aim to encourage students, find talent, and awaken interest in the field of knowledge presented in the competition. It was precisely for this purpose that the Brazilian Investment Olympics (OBInvest) was born, aiming to democratize access to education and promote reflection on economic and financial issues from a FL perspective for high school students from all over the country. One of OBInvest’s objectives is to help boost the development of computational tools that provide easier access to fundamental data for decision-making in the field of finance. However, from the tools developed by OBInvest, it was noted that the creation of new educational tools would be enhanced by the use of datasets enriched with provenance and aligned with the FAIR principles. This work offers a computational strategy based on data science techniques that is easy to use and provides curated data series through a reproducible pipeline, using open data on financial reports from publicly listed Brazilian companies provided by the Brazilian Securities and Exchange Commission, the Comissão de Valores Mobiliários (CVM). While exploring related work, we found only a few academic works that use CVM data, with limited results, which motivated the development of a tool called DRE-CVM, built with a focus on the Python language, the Pandas library, the KNIME workflow platform, and Jupyter integrated development environments, running on the Anaconda3 platform in a Docker container.
It is also possible to run the experiment in the Google Colaboratory cloud environment. The pipeline is capable of executing reproducible workflows using curated, FAIRified data annotated with retrospective provenance metadata for the financial statements of publicly traded Brazilian companies. The artifact uses pipelines that can be reused by students and other parties interested in finance to study the behavior of a company’s time-series results and thus introduce research on predicting future results. The latest executable version of the DRE-CVM experiment can be accessed through the Zenodo website at https://doi.org/10.5281/zenodo.7110653 and can be reproduced using a Docker container available in a DockerHub repository. Several improvements can be incorporated into the presented work; the main suggestions for future work are: (i) perform more substantial analyses on the created dataset, such as predicting results based on the history of reported results; (ii) retrieve other types of information made available by the CVM, to be used during the activities of the Brazilian Investment Olympics; (iii) adapt the Docker image so that it can be executed in the My Binder cloud environment, to further improve reproducibility.
Containers are lightweight mechanisms for the isolation of operating system resources. They are realized by activating a set of namespaces. Given the use of containers in scientific computing, tracking and managing provenance within and across containers is becoming essential for debugging and reproducibility. In this work, we examine the properties of container provenance graphs that result from auditing containerized applications. We observe that the generated container provenance graphs are hypergraphs, because one resource may belong to one or more namespaces. We examine the hierarchical behavior of the PID, mount, and user namespaces, which are the most commonly activated, and show that even when represented as hypergraphs, the resulting container provenance graphs are acyclic. We experiment with recently published container logs and identify hypergraph properties.
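The core observation, resources belonging to several namespaces modelled as hypergraph vertices, plus an acyclicity check over the provenance edges, can be sketched as below. The namespace and resource names are made up, and the cycle check is a standard topological sort rather than the paper's analysis.

```python
from collections import defaultdict, deque

# Hedged sketch: each vertex carries a set-valued namespace label (making
# the provenance graph a hypergraph), and used/generated edges between
# vertices are checked for acyclicity with Kahn's algorithm.

vertices = {
    "proc:42":  {"pid_ns": "ns1", "user_ns": "nsA"},   # in two namespaces
    "file:cfg": {"mnt_ns": "ns2"},
    "proc:43":  {"pid_ns": "ns1", "user_ns": "nsA"},
}
edges = [("proc:42", "file:cfg"), ("file:cfg", "proc:43")]

def is_acyclic(vertices, edges):
    indeg = {v: 0 for v in vertices}
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
    queue = deque(v for v, d in indeg.items() if d == 0)
    seen = 0
    while queue:
        v = queue.popleft()
        seen += 1
        for w in adj[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen == len(vertices)   # all vertices sorted => no cycle
```

Acyclicity matters in practice because it means standard lineage traversals over container provenance terminate, even though the namespace structure makes the graph a hypergraph.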