AI is no longer just a component of Web systems; it is increasingly shaping the experiences users have online. From search and recommendation to conversational and generative interfaces, AI is redefining how people interact with content at Web scale. In this keynote, I reflect on how recent advances in AI, including deep learning and generative models, are reshaping the design space of Web technologies and the systems that support them. Drawing on insights from developing AI-driven systems at Spotify, I discuss how search and recommendation are evolving into interactive, intent-aware, and user-controllable experiences that support exploration and discovery. The talk highlights emerging system paradigms, deployment challenges, and open research questions around building such experiences at scale, and reflects on the implications for the design of future Web systems and interactions.
We often treat data as an infinitely reusable resource: a single dataset can support many analyses, train multiple models, and be shared widely without apparent cost. This talk argues that in important ways, data is not endlessly reusable. Instead, in certain contexts, data behaves like a consumable resource that degrades with use.
The clearest example arises in the presence of privacy concerns. Fundamental results show that any informative public analysis of personal data inevitably leaks some information about the underlying individuals, and that these privacy losses accumulate across repeated uses of the same or overlapping datasets [1]. If some level of privacy is to be preserved, this imposes intrinsic limits on how many times data can be used. In joint work under submission, we connect this perspective to the mosaic effect from legal scholarship, arguing that privacy risks arise not only from combining data pieces, but also from combining seemingly innocuous data uses [6]. This view suggests regulatory and technical approaches that treat data use itself as a rival good [4].
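The accumulation of privacy loss across uses can be made concrete with a budget accountant. Below is a minimal sketch using basic sequential composition, under which the epsilon parameters of successive analyses simply add; the class name and interface are illustrative, not drawn from the work cited:

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition.

    Each analysis of the dataset spends some epsilon; once the total
    budget is exhausted, no further analyses are permitted. This makes
    the 'data as a consumable resource' view operational.
    """

    def __init__(self, total_eps):
        self.remaining = total_eps

    def spend(self, eps):
        # Refuse any analysis that would exceed the remaining budget.
        if eps > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= eps


# A dataset with total budget 1.0 supports only finitely many analyses.
budget = PrivacyBudget(1.0)
budget.spend(0.6)
budget.spend(0.3)
```

More refined accountants (advanced composition, Renyi accounting) give tighter totals, but the qualitative picture is the same: use is rival.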
Data can also degrade with use even when privacy is not at stake. A line of work on adaptive data analysis shows that repeatedly querying the same dataset can lead to overfitting: results that appear valid on the dataset but fail to generalize to the underlying distribution, even when the dataset is very large [2, 3, 5]. In both privacy and generalization, each interaction with a dataset consumes part of a limited resource, constraining future computations.
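The generalization failure caused by adaptive reuse is easy to reproduce in simulation. In the sketch below (all parameters are illustrative), features and labels are pure noise, yet features selected adaptively on the dataset appear to "predict" the labels on that same dataset, while a fresh sample from the identical distribution shows no signal:

```python
import random

random.seed(0)
n, d = 500, 200  # samples and candidate features; labels are pure noise

# Features and labels are independent coin flips: nothing can generalize.
X = [[random.choice([-1, 1]) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]

def corr(j, X, y):
    # Empirical correlation of feature j with the labels.
    return sum(X[i][j] * y[i] for i in range(len(y))) / len(y)

# Adaptive step 1: query the dataset once per feature, keep "promising" ones.
selected = [j for j in range(d) if corr(j, X, y) > 0]

# Adaptive step 2: reuse the SAME dataset to score the aggregated predictor.
def predict(row):
    return 1 if sum(row[j] for j in selected) > 0 else -1

train_acc = sum(predict(X[i]) == y[i] for i in range(n)) / n

# Evaluate on fresh data from the same (pure-noise) distribution.
X2 = [[random.choice([-1, 1]) for _ in range(d)] for _ in range(n)]
y2 = [random.choice([-1, 1]) for _ in range(n)]
fresh_acc = sum(predict(X2[i]) == y2[i] for i in range(n)) / n

# train_acc is typically well above 0.5; fresh_acc stays near chance.
print(train_acc, fresh_acc)
```

Each adaptive query "spent" some of the dataset's validity; the apparent accuracy is an artifact of reuse, exactly the degradation the talk describes.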
Recognizing data degradation opens a range of research directions, including systems for tracking and budgeting data use, algorithmic techniques to mitigate degradation, the role of synthetic data and data curators, and new models of non-worst-case adaptive computation. Together, these directions work towards a data ecosystem that explicitly accounts for data degradation.
Opinion dynamics models how the publicly expressed opinions of users in a social network coevolve according to their neighbors' opinions as well as their own intrinsic opinions. Motivated by the real-world manipulation of social networks during the 2016 US elections and the 2019 Hong Kong protests, a growing body of work models the effects of a strategic actor who interferes with the network to induce disagreement or polarization. We lift the assumption of a single strategic actor by introducing a model in which any subset of network users can manipulate network outcomes. They do so by acting according to a fictitious intrinsic opinion. Strategic actors can have conflicting goals and push competing narratives. We characterize the Nash equilibrium of the resulting meta-game played by the strategic actors. Experiments on real-world social network datasets from Twitter, Reddit, and Political Blogs show that strategic agents can significantly increase polarization and disagreement, as well as increase the "cost" of the equilibrium. To this end, we give worst-case upper bounds on the Price of Misreporting (analogous to the Price of Anarchy). Finally, we give efficient learning algorithms for the platform to (i) detect whether strategic manipulation has occurred, and (ii) learn who the strategic actors are. Our algorithms are accurate on the same real-world datasets, suggesting how platforms can take steps to mitigate the effects of strategic behavior.
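The kind of manipulation studied here can be illustrated in the standard Friedkin-Johnsen model, where equilibrium expressed opinions solve (I + L) z = s for intrinsic opinions s and graph Laplacian L. In the toy sketch below (the graph, the misreported value, and the combined polarization-plus-disagreement index are illustrative, not the paper's setup), a single agent reporting a fictitious extreme intrinsic opinion shifts the equilibrium and inflates polarization and disagreement:

```python
import numpy as np

# Path graph on 4 nodes; L is its graph Laplacian.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A

def fj_equilibrium(s):
    # Friedkin-Johnsen equilibrium: expressed opinions solve (I + L) z = s.
    return np.linalg.solve(np.eye(len(s)) + L, s)

def polarization_disagreement(z):
    # Polarization (variance around the mean) plus disagreement (z^T L z).
    zc = z - z.mean()
    return zc @ zc + z @ L @ z

s_true = np.array([0.2, 0.4, 0.6, 0.8])
z_honest = fj_equilibrium(s_true)

# Node 0 acts according to a fictitious, extreme intrinsic opinion.
s_fake = s_true.copy()
s_fake[0] = -2.0
z_strategic = fj_equilibrium(s_fake)
```

Comparing the index before and after the misreport shows the single strategic agent strictly worsening the network-level outcome, the quantity the Price of Misreporting bounds in the worst case.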
We investigate majority voting where agents possess private information about an unobservable ground truth that determines their preferences. In such settings, agents may hold different preferences and (collectively) engage in counterintuitive strategic behaviors. Previous work assumes strategic behavior occurs either without coordination or with unlimited coordination, yielding overly inclusive or exclusive predictions about voting outcomes. We incorporate coordination ability, the largest coalition size at which agents could strategically coordinate, into the analysis. Under the ex-ante Bayesian k-strong equilibrium framework, where no group of at most k agents can benefit from deviation, we provide closed-form characterizations of when informed majority decisions, i.e., the decision favored by the majority if the ground truth were common knowledge, are achievable. Specifically, we determine (1) when all k-strong equilibria reach the informed majority decision and (2) when at least one such equilibrium exists. These conditions depend on three factors: coordination ability, the fraction of majority agents, and the information structure. The boundary for the second question exhibits surprising complexity: it is non-continuous, non-linear, and segmental. Our results reveal this complicated landscape and provide refined predictions for strategic behavior across different coordination levels.
Graph rationalization methods aim to improve the explainability of Graph Neural Networks by identifying critical subgraphs (rationales) for task prediction. Motivated by increasing concerns over data privacy, federated graph rationalization has recently gained traction as a novel research area. However, in federated settings, data heterogeneity across clients exacerbates shortcut learning, where models rely on spurious and client-specific features rather than invariant causal rationales. Existing solutions, such as environment-aware data augmentation, suffer from low-quality environment representations. To address this, we propose DiffGR, a Difference-based sample selection strategy for federated Graph Rationalization. DiffGR selects samples where local and global models exhibit the highest prediction discrepancies, as these likely reflect strong shortcut reliance, enabling more accurate environment representations. Additionally, we introduce a mutual information (MI) inspired environment-conditioned data augmentation method that minimizes MI between environments and predictions while maximizing MI between rationales and predictions. Experiments on real-world and synthetic datasets demonstrate the effectiveness of DiffGR in improving rationale quality and model robustness in federated settings. Code is available at https://github.com/yuelinan/Codes-of-DiffGR.
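The difference-based selection step can be sketched minimally by assuming the discrepancy is measured as per-sample KL divergence between the local and global models' predicted class distributions (the paper's exact discrepancy measure may differ):

```python
import numpy as np

def select_discrepant(local_probs, global_probs, k):
    # Per-sample KL(local || global); a large divergence flags samples where
    # the local model likely relies on client-specific shortcut features.
    eps = 1e-12
    kl = np.sum(local_probs * np.log((local_probs + eps) / (global_probs + eps)),
                axis=1)
    return np.argsort(-kl)[:k]  # indices of the k most discrepant samples

# Sample 0: local and global disagree sharply; sample 1: they agree exactly.
local_p = np.array([[0.9, 0.1], [0.5, 0.5]])
global_p = np.array([[0.1, 0.9], [0.5, 0.5]])
picked = select_discrepant(local_p, global_p, k=1)
```

The selected samples would then feed the environment-conditioned augmentation stage described above.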
The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieves only implicit graph–text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.
Unsupervised graph domain adaptation (UGDA) aims to transfer knowledge from a labeled source graph to an unlabeled target graph, addressing the performance degradation caused by distributional shifts in node attributes and graph structures across domains. Despite recent progress, existing UGDA approaches still face two key challenges: (C1) Data-level: Most methods rely on a single source domain, overlooking the complementary knowledge that could be leveraged from multiple sources. (C2) Model-level: Many UGDA models emphasize complex, handcrafted graph neural network (GNN) architectures, while simpler yet effective designs built on a propagation (P) and transformation (T) pipeline remain underexplored. To address these challenges, in this paper, we propose a novel approach that leverages a concise propagation-transformation pipeline for multi-source unsupervised Graph Domain Adaptation, dubbed CPT-GDA, to better capture complementary knowledge from multiple sources in an efficient manner. Specifically, the proposed CPT-GDA adopts a dual-branch GNN architecture with different depths of propagation but the same P-T patterns, which enables the model to efficiently learn node representations to mitigate domain discrepancy. Meanwhile, to facilitate effective knowledge transfer across graphs, we derive three optimization objectives: (1) the classifier loss to learn discriminative representations; (2) the alignment loss weighted by the graph Wasserstein distance to align the structure and feature distributions; and (3) the pseudo-label loss to refine target node representations. Extensive experiments on real-world datasets confirm that the proposed method outperforms recent state-of-the-art baselines, demonstrating its effectiveness.
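The P-T design pattern the paper builds on can be sketched with an SGC-style pipeline: parameter-free propagation over the symmetrically normalized adjacency (P), followed by a linear transformation (T). The dual-branch idea then amounts to sharing the transformation across two propagation depths (the graph, depths, and dimensions below are illustrative, not the paper's configuration):

```python
import numpy as np

def propagate(A, X, k):
    # P step: apply D^{-1/2}(A + I)D^{-1/2} to the features k times.
    A_hat = A + np.eye(len(A))
    d = A_hat.sum(1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))
    for _ in range(k):
        X = A_norm @ X
    return X

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # toy 3-node graph
X = rng.normal(size=(3, 4))                              # node features
W = rng.normal(size=(4, 2))                              # shared T step

# Dual branches: same transformation W, different propagation depths.
h_shallow = propagate(A, X, 1) @ W
h_deep = propagate(A, X, 4) @ W
```

Because propagation is parameter-free, both branches stay cheap; only the shared transformation is learned, which is the efficiency argument behind such pipelines.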
Graph incremental learning aims to sequentially adapt models to evolving graphs while mitigating catastrophic forgetting. This problem becomes particularly challenging due to the simultaneous occurrence of covariate and label distribution shifts, introduced by newly emerging node classes, changes in node feature styles, and additional edges. To address these challenges, we propose DINGLE, a novel framework for both class and domain (class-domain) incremental learning on graphs. DINGLE consists of two key modules: a representation decoupler, which disentangles node representations into domain-invariant semantic factors for classification and domain-specific variation factors, and a teacher-student knowledge distillation module, which facilitates knowledge transfer across tasks while mitigating catastrophic forgetting through memory replay. By leveraging a Representative Node Feature (RNF) bank and an Encoder Parameters (EP) bank, DINGLE ensures effective knowledge retention and adaptation. Extensive experiments on 5 real-world datasets demonstrate that DINGLE outperforms 11 state-of-the-art baselines in class-domain incremental learning, improving classification accuracy while effectively preventing forgetting across tasks.
Liquid Time-Constant networks (LTCs), a type of continuous-time graph neural network, excel at modeling irregularly-sampled dynamics but are fundamentally confined to Euclidean space. This limitation introduces significant geometric distortion when representing real-world graphs with inherent non-Euclidean structures (e.g., hierarchies and cycles), degrading representation quality. To overcome this limitation, we introduce the Riemannian Liquid Spatio-Temporal Graph Network (RLSTG), a framework that unifies continuous-time liquid dynamics with the geometric inductive biases of Riemannian manifolds. RLSTG models graph evolution through an Ordinary Differential Equation (ODE) formulated directly on a curved manifold, enabling it to faithfully capture the intrinsic geometry of both structurally static and dynamic spatio-temporal graphs. Moreover, we provide rigorous theoretical guarantees for RLSTG, extending stability theorems of LTCs to the Riemannian domain and quantifying its expressive power via state trajectory analysis. Extensive experiments on real-world benchmarks demonstrate that, by combining advanced temporal dynamics with a Riemannian spatial representation, RLSTG achieves superior performance on graphs with complex structures. Project Page: https://rlstg.github.io
Recently, Graph Foundation Models (GFMs) have emerged as a central focus in the field of graph learning due to their strong generalizability to various unseen graphs. However, existing GFMs typically work under the homophily assumption, and the exploration of universality on heterophilic graphs is still in its early stages. In fact, even in homophilic graphs, there exists limited yet informative heterophilic information that is not fully exploited by current GFMs. Moreover, due to the requirement for universality, the heterophily issue faced by GFMs is more challenging than in classical graph learning, as it requires training a single model to adapt to varying structures, features, and tasks. Classic heterophilic graph learning methods are primarily based on node-level homophily or heterophily. However, we highlight that homophily and heterophily exist not only at the node semantic level, but also at a finer granularity across individual feature dimensions. This finding enables GFMs to adapt to heterophilic graphs and better utilize the small amount of heterophilic information in homophilic graphs. Based on this, we propose the Topology-aware Feature Sorting Graph Foundation Model (TFSGFM), which employs a feature-level topology-aware sorting strategy and a dual-channel graph neural network framework, enabling unified modeling of both feature and structure. Extensive experiments demonstrate the strong generalizability of TFSGFM. The source code is available at https://github.com/hedongxiao-tju/TFSGFM.
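The observation that homophily varies per feature dimension can be illustrated with a crude proxy: for each dimension, measure how often a node's feature sign agrees with its neighbor's across edges (this measure is purely illustrative; the paper's feature-level sorting strategy is more involved):

```python
import numpy as np

def feature_homophily(edges, X):
    # Per-dimension sign agreement across edges: values near 1 suggest the
    # dimension behaves homophilically, values near 0 heterophilically.
    agree = [(np.sign(X[u]) == np.sign(X[v])).astype(float) for u, v in edges]
    return np.mean(agree, axis=0)

# Two nodes joined by one edge: dimension 0 agrees in sign, dimension 1 flips.
X = np.array([[1.0, -1.0],
              [2.0, 1.0]])
scores = feature_homophily([(0, 1)], X)
```

Sorting or routing dimensions by such scores is one way a dual-channel model could treat homophilic and heterophilic signal differently.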
Clustering is a fundamental task in graph data mining, including both node-level and graph-level clustering. While the former has been extensively explored to capture local structures and features, the latter has gained attention for its ability to capture global relationships and high-level abstractions. However, existing methods often address these two tasks in isolation, which not only wastes computational resources but also fails to fully leverage the knowledge from both levels to improve each other, hindering consistent performance improvement. To this end, we propose a novel Unified Graph Clustering Network called UGCN, which employs both local and global graph information to address node- and graph-level clustering collaboratively. In detail, we design a dual-branch projector that performs joint learning at both node and graph levels. The first branch extracts node-level features and projects them into distinct cluster layers, where the derived prototypes are used to refine graph attributes and highlight clustering-friendly substructures. In parallel, the second branch captures subgraph embeddings and aggregates them into discriminative graph-level representations. We align the two branches through joint contrastive objectives to establish a bidirectional interaction: refined prototypes guide subgraph and graph-level clustering, while graph-level pseudo-labels provide feedback to enhance node-level clustering. Extensive experimental results across seven datasets demonstrate that our method significantly outperforms existing state-of-the-art approaches.
Dynamic graph anomaly detection (DGAD) is essential for identifying anomalies in evolving graphs across domains such as finance and social networks. Recently, generalist graph anomaly detection (GAD) models have shown promising results. They are pretrained on multiple source datasets and generalize across domains. While effective on static graphs, they struggle to capture evolving anomalies in dynamic graphs. Moreover, the continuous emergence of new domains and the lack of labeled data further challenge generalist DGAD. Effective cross-domain DGAD requires both domain-specific and domain-agnostic anomalous patterns. Importantly, these patterns evolve temporally within and across domains. Building on these insights, we propose DP-DGAD, a DGAD model with Dynamic Prototypes (DP) that captures evolving domain-specific and domain-agnostic patterns. Firstly, DP-DGAD extracts dynamic prototypes, i.e., evolving representations of normal and anomalous patterns, from temporal ego-graphs and stores them in a memory buffer. The buffer is selectively updated to retain general, domain-agnostic patterns while incorporating new domain-specific ones. Then, an anomaly scorer compares incoming data with the dynamic prototypes to flag both general and domain-specific anomalies. Finally, DP-DGAD employs confidence-guided memory buffer updating for effective adaptation to the target domain. Extensive experiments demonstrate state-of-the-art performance across ten real-world datasets from different domains.
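The prototype-based scoring step can be sketched as nearest-prototype matching: an incoming embedding is scored by its cosine similarity to the closest stored prototype, and only confidently normal embeddings are admitted into the memory buffer (the function names and threshold below are illustrative, not the paper's):

```python
import numpy as np

def anomaly_score(x, prototypes):
    # Distance to the nearest normal prototype via cosine similarity:
    # 0 means a perfect match, larger values mean more anomalous.
    sims = prototypes @ x / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(x))
    return 1.0 - sims.max()

def maybe_update(buffer, x, score, threshold=0.1):
    # Confidence-gated update: only embeddings scored as confidently normal
    # enter the buffer, protecting it from contamination by anomalies.
    if score < threshold:
        buffer.append(x)
    return buffer
```

In a dynamic setting the prototype set itself would be re-extracted over time, which is what makes the prototypes "dynamic" in the sense described above.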
By mimicking the brain's efficient spiking encoding paradigm, spiking graph neural networks exhibit significant potential for efficient graph data analysis. Due to the inherent expressive limitations of the binary spiking signals adopted in spiking encoding, existing models typically enhance their expressiveness by integrating numerous real-valued multiplication-additions or high-latency encoding. However, such integrations compromise the core efficiency superiority of spiking models, limiting their scalability in real-world applications. To simultaneously reconcile considerable expressiveness and efficiency, we propose E2SGNN, a novel network comprising a dual-scale modulated spiking backbone and a latency-dynamic optimization module. The former backbone integrates global and local real-valued graph modulations into spiking graph convolution, enabling discriminative dual-scale neighbor embedding in the encoding process. It both breaks through binary spiking signals' expressive limitations and improves the content expressiveness of spiking graph representations, while retaining low-latency and addition-only efficiency advantages. Moreover, to further reduce latency redundancy for higher efficiency, the latter module adaptively customizes the latency for each graph based on data complexity. In this way, our network can generate graph representations both expressively and efficiently. Experiments on various datasets demonstrate the superiority of our network in expressiveness and efficiency.
Every graph hides a tree: through tree decomposition, a foundational tool in modern graph theory with broad applications (e.g., in computational power networks), any network can be unfolded into a hierarchy of overlapping vertex bags whose backbone is a tree. Leveraging this powerful lens, we propose Topological Decomposition for Self-supervised Learning (TopDSL), a framework that injects multi-scale signals into graph representation learning. Concretely, we: 1) decompose the input graph into tree structures with bags representing local structural contexts; 2) compute bag-level roles via closeness centrality for nodes and local edge betweenness for edges, and aggregate these scores across bags to capture context-dependent importance (e.g., local structural bridges); 3) convert the resulting importance and attribute-stability scores into a context-aware augmentation policy that adaptively perturbs nodes, edges, and features, preserving local bridges, honoring multi-community vertices, and attenuating noisy global hubs; 4) construct a new structural similarity loss for contrastive learning, which fuses traditional graph-based proximity with a novel tree-based similarity derived from node co-occurrence in decomposition bags; and 5) demonstrate that our framework achieves superior performance over state-of-the-art baselines on various graph learning benchmarks.
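The bag-level roles of step 2 can be sketched directly: restrict the graph to one bag and compute closeness centrality within that induced subgraph. Below is a from-scratch BFS version for unweighted graphs (a real implementation would likely use a graph library; the adjacency-dict interface is illustrative):

```python
from collections import deque

def closeness_in_bag(adj, bag):
    """Closeness centrality of each node within the subgraph induced by a bag.

    adj: dict mapping node -> list of neighbors (whole graph)
    bag: iterable of nodes forming one decomposition bag
    """
    bag = set(bag)
    out = {}
    for s in bag:
        # BFS restricted to the bag's induced subgraph.
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v in bag and v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total = sum(dist.values())
        # Standard closeness: (reachable nodes) / (sum of distances).
        out[s] = (len(dist) - 1) / total if total else 0.0
    return out
```

Aggregating these per-bag scores across all bags containing a node yields the context-dependent importance used by the augmentation policy.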
Doxing, the disclosure of personal information without consent, has evolved from sporadic acts of online vigilantism into a structured and commodified practice. In Chinese cyberspace, this shift has produced Doxing-as-a-Service (DaaS), a commercial model in which personal data is retrieved, organized, and traded as on-demand products. This industrialization of privacy violation lowers the barriers to doxing and amplifies its social harms, posing new challenges for building a responsible and safe web. Yet little is known about how DaaS operates or sustains itself, motivating our systematic, data-driven examination of its ecosystem and practices. This paper provides the first systematic study of the Chinese DaaS ecosystem. Analyzing 25,972 messages and 13.22 million subscriber links from 100 major channels on Telegram, we demystify its organization, operations, and user engagement. We find that the DaaS ecosystem operates through a three-tier supply chain linking data providers, service operators, and end users. Operators sustain illicit businesses through six major service categories, persistent advertising, and crypto-based payments, while users interact via specialized group spaces that enable real-time matching, large-scale identity exposure, and community-driven fraud mitigation. Our findings reveal a mature, resilient underground data market operating within mainstream messaging platforms, highlighting new challenges for online privacy and exposing critical vulnerabilities in platform governance and content moderation. This study provides empirical evidence for developing effective regulatory frameworks and accountability mechanisms to mitigate commodified online harms on encrypted messaging services.
With the rapid growth of Web-based academic publications, more and more papers are being published annually, making it increasingly difficult to find relevant prior work. Citation prediction aims to automatically suggest appropriate references, helping scholars navigate the expanding scientific literature. Here we present CiteRAG, the first comprehensive retrieval-augmented generation (RAG)-integrated benchmark for evaluating large language models on academic citation prediction, featuring a multi-level retrieval strategy, specialized retrievers, and generators. Our benchmark makes four core contributions: (1) We establish two instances of the citation prediction task with different granularity. Task 1 focuses on coarse-grained list-specific citation prediction, while Task 2 targets fine-grained position-specific citation prediction. To enhance these two tasks, we build a dataset containing 7,267 instances for Task 1 and 8,541 instances for Task 2, enabling comprehensive evaluation of both retrieval and generation. (2) We construct a three-level large-scale corpus with 554k papers spanning many major subfields, using an incremental pipeline. (3) We propose a multi-level hybrid RAG approach to citation prediction, fine-tuning embedding models with contrastive learning to capture complex citation relationships, paired with specialized generation models. (4) We conduct extensive experiments across state-of-the-art language models, including closed-source APIs, open-source models, and our fine-tuned generators, demonstrating the effectiveness of our framework. Our open-source toolkit enables reproducible evaluation and focuses on academic literature, providing the first comprehensive evaluation framework for citation prediction and serving as a methodological template for other scientific domains. Our source code and data are released at https://github.com/LQgdwind/CiteRAG.
Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce **Med-R2**, a novel LLM physician framework that adheres to the *Evidence-Based Medicine (EBM)* process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that **Med-R2** achieves an improvement of 13.27% over vanilla RAG methods and even a 4.55% enhancement compared to fine-tuning strategies, without incurring additional training costs. Furthermore, we find that our LLaMA3.1-70B + Med-R2 surpasses frontier models, including GPT-4o, Claude3.5-Sonnet and DeepSeek-V3 by 1.05%, 6.14% and 1.91%. Med-R2 effectively enhances the capabilities of LLMs in the medical domain.
Query rewriting is a fundamental technique in information retrieval (IR). It typically employs the retrieval result as relevance feedback to refine the query and thereby addresses the vocabulary mismatch between user queries and relevant documents. Traditional pseudo-relevance feedback (PRF) and its vector-based extension (VPRF) improve retrieval performance by leveraging top-retrieved documents as relevance feedback. However, they are constructed based on two major hypotheses: the relevance assumption (top documents are relevant) and the model assumption (rewriting methods need to be designed specifically for particular model architectures). While recent large language model (LLM)-based generative relevance feedback (GRF) enables model-free query reformulation, it either suffers from severe LLM hallucination or, again, relies on the relevance assumption to guarantee rewriting quality. To overcome these limitations, we introduce an assumption-relaxed framework: Generalized Pseudo Relevance Feedback (GPRF), which performs model-free, natural language rewriting based on retrieved documents, not only eliminating the model assumption but also reducing dependence on the relevance assumption. Specifically, we design a utility-oriented training pipeline with reinforcement learning to ensure robustness against noisy feedback. Extensive experiments across multiple benchmarks and retrievers demonstrate that GPRF consistently outperforms strong baselines, establishing it as an effective and generalizable framework for query rewriting.
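Classical PRF, which GPRF generalizes, is well captured by the Rocchio update: mix the centroid of the top-retrieved document vectors into the query vector. A minimal dense-vector version (the parameter values are conventional defaults, not the paper's):

```python
import numpy as np

def rocchio_prf(q, docs, top_k=3, alpha=1.0, beta=0.75):
    # Score documents by inner product with the query, take the top-k as
    # pseudo-relevant (the "relevance assumption"), and mix their centroid
    # into the query vector.
    scores = docs @ q
    top = np.argsort(-scores)[:top_k]
    feedback = docs[top].mean(axis=0)
    return alpha * q + beta * feedback
```

A toy call: with `q = [1, 0]` and `docs = [[0, 1], [1, 0]]`, `rocchio_prf(q, docs, top_k=1)` pulls the query toward its best-matching document. The update bakes in exactly the two assumptions named above: the top documents are treated as relevant, and the rewrite lives in a particular model's vector space.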
Retrieval-augmented generation (RAG) has become a cornerstone for enhancing large language models (LLMs) with real-time information from the Web, but its performance often heavily depends on the quality of the retrieved documents. Given that RAG systems frequently draw from vast and often noisy Web corpora, ensuring the reliability of retrieved content is paramount. While rerankers improve the factual accuracy of the RAG system by elevating the proportion of ground-truth documents (GD) in high-ranked results, the shift in document-type distributions during reranking remains unclear, hindering the understanding of the reranker's behavior. To bridge this gap, we conduct an empirical study to categorize documents and compare their distribution before and after reranking. We reveal a counterintuitive finding: though rerankers improve the proportion of GD, they also significantly increase the proportion of harmful documents (HD) in top-ranked retrieved documents. This not only narrows the potential context window for ranking the GD higher but also increases the risk of HD misleading the LLMs, potentially leading to the generation and propagation of misinformation across Web platforms. Motivated by this finding, we propose a risk-aware reranking method for RAG with LLMs, which balances risk and benefit during reranking. Given a query, the RAG framework first retrieves relevant documents. Then, our approach quantifies the potential beneficial and harmful impacts of various documents on the LLMs' generation. To estimate the impacts, we conduct a dual-aspect document impact assessment via information gain, which employs risk clipping to avoid numerical fluctuations in the estimation. Finally, we conduct the reranking according to the potential impact of each document, enabling the reranker to significantly reduce the HD proportion.
Experiments and analysis across multiple models and datasets, including Wikipedia, web news, and research papers, show the effectiveness of our method. Our code is available at https://github.com/lzz335/hidden_risk_of_reranking.
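The reranking step described above can be sketched as a clipped benefit-risk trade-off: estimate each document's beneficial and harmful impact, clip the risk term to damp numerical fluctuations, and sort by the difference (the scoring form, the weight `lam`, and the clipping bound are all illustrative; the paper estimates impacts via information gain):

```python
import numpy as np

def risk_aware_rerank(benefit, risk, lam=1.0, clip=5.0):
    # Clip the risk estimates so outliers cannot dominate the ordering,
    # then trade estimated benefit off against estimated harm.
    r = np.clip(risk, -clip, clip)
    scores = benefit - lam * r
    return np.argsort(-scores)  # document indices, best first

# Document 1 looks slightly more beneficial but carries a huge risk estimate;
# after clipping and trade-off, document 0 is ranked first.
order = risk_aware_rerank(np.array([1.0, 2.0]), np.array([0.0, 10.0]))
```

A plain relevance reranker would invert this ordering, which is exactly the harmful-document promotion the empirical study documents.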
Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at https://github.com/BUPT-Reasoning-Lab/NeocorRAG.
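The abstract does not give RCR's exact formula; one plausible reading, sketched below purely for illustration, is the fraction of retrieval successes that the downstream model converts into correct answers (the paper's actual definition may differ):

```python
def recall_conversion_rate(answers_correct, retrieval_hits):
    # Hypothetical form: among queries where retrieval succeeded (hit),
    # what fraction were answered correctly? A declining value as recall
    # rises would indicate retrieval gains that reasoning fails to convert.
    hits = [c for c, h in zip(answers_correct, retrieval_hits) if h]
    return sum(hits) / len(hits) if hits else 0.0
```

For example, with four queries where retrieval hit on the first two but only the first was answered correctly, the rate is 0.5.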
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning and web-scale search. On today's Web, where users increasingly expect to locate not only relevant pages but also specific long videos or fine-grained clips, existing benchmarks fall short due to limited video duration, low-quality captions, and coarse annotation granularity. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. The code is available at https://github.com/TechNomad-ds/LoVR-benchmark/.
The rapid spread of misinformation on social media has underscored the importance of automatic fact-checking. Existing fact-checking pipelines typically rely on multi-stage frameworks involving evidence retrieval and claim verification. However, these methods face two major challenges: (1) the retrieval process often introduces noisy evidence, which compromises the reliability of the final veracity prediction; and (2) the verification models may overlook critical factual details, resulting in hallucinated conclusions. To address these issues, we propose SLED, a fact-checking framework with Self-supervised denoising evidence retrieval and LLM-Enhanced Debate-based verification. In the retrieval stage, SLED leverages a trained verifier to assess the credibility and necessity of retrieved evidence, enabling the elimination of noisy evidence. In the verification stage, SLED prompts the LLM to generate dual-perspective reasoning and simulates a multi-agent debate, followed by distillation into a lightweight model for final veracity prediction. Experiments on the CHEF and HOVER datasets demonstrate that SLED achieves state-of-the-art results in complex fact verification scenarios.
Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released at https://github.com/JJJJerry/WWW2026-MCoT-MVS.
Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations and knowledge gaps in Large Language Models (LLMs) for knowledge-intensive tasks by incorporating external web-based knowledge. However, when integrating diverse yet potentially conflicting web-sourced information, RAG systems are prone to knowledge conflicts that manifest as incorrect or inconsistent model behaviors, ultimately leading to unreliable responses. To address this challenge, this paper proposes Conflict-Aware RAG, a general training framework that leverages the model's inherent conflict-sensing capability to build a more robust RAG system via phased optimization. At the core of this framework lies ConScore, a conflict signal that quantifies the model's awareness of potential knowledge conflicts by comparing generative probabilities across distinct knowledge sources. This signal then guides both the construction of training data and a multi-stage optimization workflow: In the Supervised Fine-Tuning (SFT) stage, conflict features are employed to select representative distracting documents, laying the groundwork for core RAG capabilities; in the Direct Preference Optimization (DPO) stage, high-quality preference pairs are constructed using the conflict signal to boost the model's robustness against distracting knowledge; and in the Reranking stage, conflict confidence and information gain are integrated to synergistically optimize the collaboration mechanism between the retriever and LLM. Experiments on six knowledge-intensive question answering (QA) datasets demonstrate that Conflict-Aware RAG significantly outperforms mainstream baselines. Further ablation studies and quantitative analyses validate the method's stability and generalization, laying the foundation for robust RAG systems.
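As a rough illustration of a conflict signal in the spirit of ConScore (the abstract says only that it compares generative probabilities across distinct knowledge sources, not how), one can compare the model's generative log-probability of the same answer with and without a retrieved document; the function and its inputs below are assumptions, not the paper's definition.

```python
def conscore(logp_with_doc, logp_without_doc):
    """Illustrative conflict signal: the gap between the model's generative
    log-probability of its answer conditioned on a retrieved document versus
    on parametric knowledge alone. A large absolute gap suggests the document
    conflicts with (or overrides) what the model already believes."""
    return abs(logp_with_doc - logp_without_doc)

# Document agrees with the model's prior: small gap, low conflict.
low = conscore(-1.2, -1.3)
# Document contradicts the prior: the answer's probability collapses, high conflict.
high = conscore(-6.0, -1.3)
print(low, high)
```

Such a scalar could then rank distracting documents for SFT data selection or score preference pairs for DPO, as the phased workflow in the abstract describes.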
The deployment of self-hosted large language models (LLMs) has experienced unprecedented growth, driven by demands for enhanced data privacy and control. Yet such deployments rely on diverse web services whose vulnerabilities, although mentioned in a few studies, remain largely underexplored, in conflict with basic security tenets. From a systematic perspective, we propose LENS, a framework that explores and exploits vulnerabilities in self-hosted LLM services for comprehensive security evaluation. LENS integrates profiling and filtering, endpoint knowledge construction, and attack graph modeling for the automatic discovery, probing, and exploitation of public-facing LLM deployment targets, respectively. We conducted an extensive empirical evaluation of real-world self-hosted LLM services across 16 mainstream platforms, 71,249 discovered deployment targets, and 307 API endpoints. Both quantitative and qualitative evidence reveal the prevalence of security vulnerabilities across different self-hosted LLM services. Notably, 75% of responsive targets allow web API interactions without authentication, enabling exploits such as injection attacks (97% for Ollama), unauthenticated access (20.2% for AnythingLLM), and default credential abuse (60.6% for Dify). We have responsibly reported these findings to the relevant communities and obtained 7 CVE IDs, including 4 critical vulnerabilities (CVSS > 9.0) and 2 high-severity ones.
The security of web services increasingly relies on accurate detection of advanced, previously unseen attacks hidden within complex host activities. Provenance-based intrusion detection systems (PIDSes) offer a promising foundation for this task by capturing rich causal and structural relationships across processes, files, and network interactions. However, recent studies show that these graph-driven methods remain vulnerable to graph manipulation attacks, where adversaries subtly alter provenance graphs to evade detection, which limits their practical deployment.
To address this challenge, we present ProvGuard, a robust anomaly detection framework that couples logic-aware multi-view augmentation with contrastive representation learning. Instead of applying arbitrary structural perturbations, ProvGuard employs Logic-Aware Noise Injection (LNI) to generate semantically valid graph views that preserve the causal semantics of provenance data. These views are then leveraged in a Logic-Preserving Contrastive Learning module, enabling the model to learn representations invariant to benign transformations yet sensitive to adversarial inconsistencies. Extensive evaluations on multiple provenance datasets show that ProvGuard surpasses state-of-the-art detectors in resisting graph manipulation attacks while maintaining high detection accuracy and efficiency, achieving an average F1-score above 96% with less than a 10% AUC drop.
Graph Neural Networks (GNNs) have become a pivotal framework for modeling graph-structured data, enabling a wide range of applications from social network analysis to molecular chemistry. By integrating large language models (LLMs), text-attributed graphs (TAGs) enhance node representations with rich textual semantics, significantly boosting the expressive power of graph-based learning. However, this synergy introduces critical vulnerabilities in both topology and text. Although specialized attack methods have been designed for each of these aspects, no work has yet unified them into a comprehensive approach. In this work, we propose the Interpretable Multi-Dimensional Graph Attack (IMDGA), a human-centric framework orchestrating multi-level perturbations across graph structure and textual features. IMDGA utilizes three tightly integrated modules to craft attacks that balance interpretability and impact, enabling a deeper understanding of Graph-LLM vulnerabilities. Through rigorous theoretical analysis and comprehensive empirical evaluations on diverse datasets and architectures, IMDGA demonstrates superior interpretability, attack effectiveness, stealthiness, and robustness compared to existing methods. By exposing these underexplored semantic vulnerabilities, our work offers valuable insights for improving Graph-LLM resilience. Our code is available at https://github.com/bwfan-bit/IMDGA.
Website Fingerprinting (WF) attacks aim to infer the websites visited by Tor users by analyzing patterns in encrypted network traffic. However, most existing WF attacks are evaluated on traffic collected in controlled environments with fixed configurations, failing to reflect the complexity and variability of real-world conditions. In practice, traffic is far more dynamic and diverse due to heterogeneous network conditions, the large number of subpages within individual websites, and continuous evolution of website content. These factors increase intra-class variability and induce temporal feature drift, which ultimately degrades the long-term effectiveness of existing attacks. In this paper, we propose TraVerse, an LLM-based representation learning framework designed to achieve robust WF attacks under real-world conditions. TraVerse applies architectural adaptation and large-scale fine-tuning on diverse unlabeled traffic to learn generalizable and resilient representations that remain effective in dynamic and evolving environments. Furthermore, TraVerse integrates a lightweight classifier atop the LLM-derived representations, enabling accurate website identification and efficient few-shot adaptation with minimal model updates. We prototype TraVerse and conduct comprehensive evaluations using real-user traffic. Experimental results show that TraVerse improves Accuracy@3 by an average of 176.3% and weighted F1 by 343.3% over state-of-the-art baselines, while maintaining strong performance throughout a three-month longitudinal evaluation.
Artificial intelligence (AI) models are increasingly deployed directly in web browsers to enable low-latency, privacy-preserving inference. While this shift offers significant usability and scalability benefits, it also exposes model code and parameters to untrusted environments, leaving them vulnerable to theft, reverse engineering, and tampering. Our analysis demonstrates that existing JavaScript-based inference frameworks are highly susceptible to model extraction, posing serious security and intellectual property risks. To address this gap, we present WAMO, a WebAssembly-based obfuscation framework that secures browser-side AI models. WAMO introduces a comprehensive conversion pipeline that translates mainstream model formats into Wasm-native modules, applying model-specific obfuscation at the Wasm layer to target weights, operators, and computation graphs. This design shifts model execution from easily inspected JavaScript assets to hardened Wasm binaries, significantly raising the difficulty of static and dynamic analysis. Evaluation shows that WAMO increases cyclomatic complexity by 71.0% and Halstead effort by 455.57%, while incurring < 1% accuracy loss and no inference slowdown.
Open-source ecosystems such as NPM and PyPI are increasingly targeted by supply chain attacks, yet existing detection methods either depend on fragile handcrafted rules or data-driven features that fail to capture evolving attack semantics. We present IntelGuard, a retrieval-augmented generation (RAG) based framework that integrates expert analytical reasoning into automated malicious package detection. IntelGuard constructs a structured knowledge base from over 8,000 threat intelligence reports, linking malicious code snippets with behavioral descriptions and expert reasoning. When analyzing new packages, it retrieves semantically similar malicious examples and applies LLM-guided reasoning to assess whether code behaviors align with intended functionality. Experiments on 4,027 real-world packages show that IntelGuard achieves 99% accuracy and a 0.50% false positive rate, while maintaining 96.5% accuracy on obfuscated code. Deployed on PyPI.org, it discovered 54 previously unreported malicious packages, demonstrating interpretable and robust detection guided by expert knowledge.
Video captioning aims to describe the content of a given video with condensed natural language sentences. The task is challenging owing to its high requirements for visual-textual relevance and multimodal fusion understanding. Previous works primarily focus on visual content modeling, often overlooking the rich semantic correlations between the visual and textual modalities, which results in an incomplete understanding of the multimodal context and suboptimal caption accuracy. In this paper, we propose a multimodal graph conditioned diffusion model for video captioning, named MGCDVc. The idea behind our model is to combine graph-based relational reasoning with diffusion-based generative modeling to jointly model cross-modal relationships and capture latent semantic structure. Specifically, we learn a set of latent concept anchors to bridge the visual and textual modality nodes, enabling the construction of a weighted multimodal graph. We then introduce a graph conditioned diffusion strategy that generates the textual semantic nodes and associated edges under a graph-structure-aware condition. Furthermore, a soft pruning mechanism is designed to filter out low-quality nodes, further refining the generated multimodal graph to provide more accurate semantic structural guidance for caption generation. Experimental results on several popular datasets demonstrate that our model achieves better performance on the video captioning task.
Long-term memory is critical for dialogue systems that support continuous, sustainable, and personalized interactions. However, existing methods rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, leading to limited adaptability across query categories and high computational overhead. In this paper, we propose HingeMem, a boundary-guided long-term memory that operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any such element changes, HingeMem draws a boundary and writes the current segment, thereby reducing redundant operations and preserving salient context. To enable robust and efficient retrieval under diverse information needs, HingeMem introduces query-adaptive retrieval mechanisms that jointly decide (a) what to retrieve: the query-conditioned routing over the element-indexed memory; and (b) how much to retrieve: the retrieval depth, controlled by the estimated query type. Extensive experiments across LLM scales (from 0.6B to production-tier models, e.g., Qwen3-0.6B to Qwen-Flash) on LOCOMO show that HingeMem achieves approximately a 20% relative improvement over strong baselines without query-category specification, while reducing computational cost (a 68% reduction in question-answering token cost compared to HippoRAG2). Beyond advancing memory modeling, HingeMem's adaptive retrieval makes it a strong fit for web applications requiring efficient and trustworthy memory over extended interactions.
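The boundary-triggered write rule described in the HingeMem abstract (draw a boundary whenever person, time, location, or topic changes) admits a direct sketch; the turn dictionaries and the upstream element extractor that fills them are assumptions for illustration.

```python
def segment_by_boundaries(turns, elements=("person", "time", "location", "topic")):
    """Boundary-triggered segmentation: start a new memory segment whenever
    any tracked element changes between consecutive dialogue turns."""
    segments, current = [], []
    prev = None
    for turn in turns:
        key = tuple(turn.get(e) for e in elements)
        if prev is not None and key != prev:
            segments.append(current)  # element changed: write the segment
            current = []
        current.append(turn["text"])
        prev = key
    if current:
        segments.append(current)
    return segments

turns = [
    {"text": "Met Alice at the cafe.", "person": "Alice", "time": "Mon",
     "location": "cafe", "topic": "catch-up"},
    {"text": "She recommended a book.", "person": "Alice", "time": "Mon",
     "location": "cafe", "topic": "catch-up"},
    {"text": "Later, called Bob about the trip.", "person": "Bob", "time": "Mon",
     "location": None, "topic": "trip"},
]
print(segment_by_boundaries(turns))  # two segments: Alice/cafe turns, then the Bob turn
```

Writing only at boundaries, rather than summarizing every turn, is what reduces the redundant memory operations the abstract mentions.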
Web-scale platforms and online services rely on log-based anomaly detection to safeguard availability, latency SLOs, and user experience. In real-world web interactions, system logs often exhibit irregular temporal intervals, bursty densities, and heterogeneous semantics, which pose significant challenges for log anomaly detection. Existing methods such as LSTM and Transformer assume a fixed input window, which conflicts with the inherently irregular nature of system logs. Moreover, most prior works build a single-view representation, overlooking the multi-relational nature of logs. To overcome these challenges, we propose DyLogNet, a dynamic multi-relational graph framework for log anomaly detection. Specifically, this framework constructs a density-aware dynamic graph with variable-length windows, and represents logs from three relational perspectives: temporal co-occurrence, semantic similarity, and anomaly tendency. Next, we design a cross-layer attention mechanism that integrates heterogeneous structures to highlight the most relevant relations and enhance event representations. Furthermore, a cross-snapshot memory injection module updates global memory through a recurrent unit and injects it into current graph representations via an affine transformation, enabling temporal continuity. Experiments on three public log datasets demonstrate that DyLogNet outperforms state-of-the-art methods, especially in few-shot scenarios.
The multimodal Chinese idiom reading comprehension task aims to select the most appropriate idiom from a candidate list based on the given text and image. This poses a significant challenge, as the model must comprehend each Chinese idiom accurately. Existing multimodal Chinese idiom reading comprehension methods primarily focus on aligning contextual text and images, while overlooking two key attributes of Chinese idioms: (1) there is a discrepancy between the literal and metaphorical meanings of Chinese idioms; and (2) the same Chinese idiom has different meanings in different scenarios, which requires targeted understanding by experts who specialize in different fields. To address these challenges, we rethink the solution to the multimodal idiom reading comprehension task from a metaphorical perspective and propose a framework named MePe. First, we propose a literal-metaphorical semantic graph that systematically transforms the implicit discrepancy between the literal and metaphorical meanings of Chinese idioms into structured explicit relationships, thereby making metaphorical meanings more understandable. Then, we propose a mixture of idiom experts consisting of a literal idiom expert and a metaphorical idiom expert. Through division of labor and collaboration among these experts, we achieve an understanding of the dual meanings of Chinese idioms across different scenarios. Finally, we employ the maximum mean discrepancy to adjust the variance between the literal and metaphorical semantic features of Chinese idioms. By mapping these features onto a shared reproducing kernel Hilbert space, the model can better distinguish between the two based on contextual clues. Extensive experiments demonstrate that MePe achieves state-of-the-art performance on the MChIRC dataset.
With the rapid growth of multi-modal content on the Web, robust vision-language models are essential for semantic understanding and classification of web images under diverse and dynamic contexts, supporting Web applications such as multimedia search and recommendation. Prompt learning has proven effective for enhancing vision-language models in semantic image classification tasks. However, previous methods often suffer from poor generalization: the learned prompts tend to overfit the base classes seen during training, leading to poor performance on unseen classes and under distribution shifts. This issue is especially challenging in Web-scale data, where new classes emerge and distributions shift dynamically. To address these limitations, we propose PLIKD, a novel prompt learning method that integrates instance-aware knowledge distillation for robust Web-scale semantic image classification. Specifically, PLIKD introduces an instance-aware knowledge extraction module, which leverages multi-modal large language models through a step-by-step strategy to extract external knowledge for each image instance. To incorporate this extracted knowledge, PLIKD further introduces an instance-aware knowledge distillation module, which consists of two key steps: (1) a dual-teacher strategy for robust and informative knowledge distillation, and (2) fine-grained cross-modal alignment via Smooth and Sparse Optimal Transport. Extensive experiments demonstrate that PLIKD significantly improves generalization to both seen and unseen classes, and remains robust under distribution shifts, outperforming existing state-of-the-art methods on Web-scale semantic image classification.
Information diffusion prediction aims to forecast the temporal spread of opinions and behaviors by identifying potential adopters. Existing methods typically treat information diffusion as a sequence of individual adoptions and rely on computationally expensive pairwise (one-to-one) influence computations, often restricting predictions to just the next adopter. This individual-level paradigm both misrepresents real-world collective (many-to-many) influences and suffers from a critical efficiency trade-off: to remain feasible, such models must truncate long diffusion histories, thereby overlooking early initiators and opinion leaders. To overcome these limitations, we formalize a more practical task, Group-based Information Diffusion Prediction, and propose GRID, an effective and scalable framework. Specifically, GRID first learns group-oriented graph embeddings via a task-regularized information bottleneck objective, which amplifies key influence pathways and produces reliable user embeddings for group identification. Built on these embeddings, the core GroupAttn module captures inter-group influence while reducing complexity from quadratic to linear in cascade length. This enables the modeling of ultra-long cascades (exceeding 10,000 users) without truncation while preserving representational fidelity within a provable error bound. Finally, a group-wise objective guides the model to predict semantically meaningful future groups. Extensive experiments on four real-world datasets show that GRID outperforms ten state-of-the-art baselines by an average of 10.65% in accuracy, while achieving an order-of-magnitude gain in efficiency and extending the supported cascade length by up to 10 times.
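One generic way to obtain the quadratic-to-linear reduction the abstract attributes to GroupAttn, sketched here as an assumption rather than the module's actual design, is to pool users into group summaries and let each user attend over those few summaries instead of over every other user:

```python
import numpy as np

def group_attention(user_emb, group_ids, n_groups):
    """Sketch: mean-pool users into group summaries, then attend over the
    n_groups summaries. Attention cost is N x n_groups, i.e. linear in
    cascade length N, instead of the N x N cost of user-to-user attention."""
    d = user_emb.shape[1]
    # Mean-pool each group's members into one summary vector.
    summaries = np.zeros((n_groups, d))
    np.add.at(summaries, group_ids, user_emb)
    counts = np.bincount(group_ids, minlength=n_groups).clip(min=1)
    summaries /= counts[:, None]
    # Each user attends over the group summaries (softmax over n_groups).
    scores = user_emb @ summaries.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ summaries

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))      # a long cascade of 1000 users
gids = rng.integers(0, 8, size=1000)   # assigned to 8 groups
out = group_attention(emb, gids, 8)
print(out.shape)
```

With 10,000 users and a few dozen groups, the score matrix stays thousands of times smaller than full pairwise attention, which is what makes untruncated ultra-long cascades tractable.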
Spamming activities (e.g., fake reviews, click farming, and deceptive content promotion) are increasingly conducted through collusive groups that exploit collective dynamics to manipulate platform metrics and mislead users, posing serious threats to the fairness, credibility, and functionality of online systems. To counteract these harmful behaviors, the task of spam detection has emerged as a critical area of research. However, existing detection methods generally remain limited in three key aspects: (i) They treat detection as a standard classification task, where representation learning and optimization are loosely coupled and suboptimal for capturing complex behaviors; (ii) They rely primarily on individual-level representation modeling, making it difficult to detect collective cheating strategies; (iii) They lack dedicated objective functions explicitly designed to characterize group-level spamming activities. To overcome these limitations, we introduce a collusion-aware Set-level learning framework (SetDet) that redefines the spam Detection task as a unified setwise optimization problem. Our approach offers three core advantages: (i) It enables end-to-end optimization by jointly learning representations and performing detection in a single, integrated process; (ii) It incorporates a model-level design for collusion representation, effectively capturing the temporal and relational patterns of coordinated spam; (iii) It pioneers a dedicated set-level optimization criterion that aligns closely with the structural characteristics of group-based cheating behaviors and accounts for class imbalance in real-world scenarios. Extensive experiments confirm the generalizability and superior performance of our framework across diverse spam scenarios and collusion strategies.
Embedded AI applications usually require compact on-device models that can continually adapt to new tasks. However, recent studies have revealed that neural networks trained on non-stationary data streams gradually lose their ability to adapt to new tasks, a phenomenon known as plasticity loss. Moreover, to enable neural networks to run on resource-constrained embedded devices, model pruning is commonly applied for model compression, which may further affect their plasticity. To conduct efficient model adaptation on new tasks on embedded devices, we propose Plasticity-aware Continual Pruning (PaCP), a novel framework that operates in two stages. First, a pre-deployment stage uses a plasticity-aware strategy to prune the model while optimizing its initial structure for future adaptability. Second, during continual learning, the model's capacity is temporarily expanded at task boundaries to efficiently learn new information, before plasticity-aware pruning restores its compact form. Extensive experiments on multiple continual learning benchmarks demonstrate that PaCP significantly outperforms existing plasticity-maintenance methods and, remarkably, even surpasses non-pruned models lacking explicit plasticity preservation.
The World Wide Web increasingly relies on intelligent services that require accurate time series forecasting, from urban mobility platforms to adaptive web-based decision systems. In practice, building effective forecasting models typically requires abundant high-quality data, which may not always be available in all cities due to sensing limitations or data sparsity. To address this challenge, transfer learning methods aim to transfer knowledge from data-rich source cities to data-scarce target cities. However, source and target data distributions are often not identical: while some patterns from source cities may be beneficial, others can be irrelevant or even misleading. Existing transfer learning methods generally train the target model using all available source data without explicitly distinguishing between useful and non-useful knowledge, which may hinder performance. In this work, we propose xRAG4TS, a novel framework that integrates Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) for cross-city time series forecasting. xRAG4TS introduces a Cross-City Selective Retriever Module that filters semantically relevant historical patterns and documents from data-rich source cities, and incorporates them as structured prompts in an LLM Inference Module to guide forecasting in data-scarce target cities. By enabling selective, interpretable, and context-aware knowledge transfer, our method enhances robustness and scalability in web-oriented spatio-temporal applications. Extensive experiments on real-world cross-city datasets demonstrate that xRAG4TS significantly outperforms state-of-the-art baselines, highlighting its potential for powering adaptive and trustworthy web services under severe data scarcity.
In an edge-assisted federated learning (FL) system, edge servers aggregate client models into intermediate models, enabling the cloud server to produce the global model in a communication-efficient manner. However, existing edge-assisted systems fail to accommodate model-heterogeneous clients, i.e., clients running models with different architectures. To tackle this problem, this paper proposes FedBridge, a novel edge-assisted system that enables FL across model-heterogeneous clients through a two-tier knowledge-sharing mechanism. It deploys an expandable fusion model on each edge server to fuse the knowledge from heterogeneous client models through knowledge distillation. Meanwhile, it employs a contrastive loss to mitigate data heterogeneity by aligning the logits of the fusion model with those of the global model. On the cloud server, it employs a block-based aggregation method to merge the fusion models transmitted from the edge servers. We conduct extensive experiments with three models on two widely used public datasets to evaluate the performance of FedBridge. The results demonstrate that, compared to state-of-the-art systems, FedBridge accelerates model convergence by up to 6.3x and improves model accuracy by 6.2%-17.7%, with an 89.51%-96.13% reduction in communication overhead and a 48.12%-61.77% reduction in memory overhead.
Digital maps are crucial for web-based location services. The continuous collection of vehicle trajectories has made trajectory data a vital source for map inference. State-of-the-art (SOTA) methods formulate trajectory-based map inference as an image processing task: they first rasterize trajectories into density-based images, then extract keypoints and infer their connections to construct the road map. Despite performing well on standard road structures, these methods still suffer from low topological accuracy in complex scenarios because (i) the discrete rasterized representation struggles to capture road adjacency in multi-level road structures, and (ii) the limited contextual awareness of the keypoint-based inference strategy leads to connectivity misjudgments in dense road areas. To address these limitations, we propose D2Map, a Dual-view Map Inference Framework via Primal-Dual Graphs Co-generation. To precisely encode road adjacency, we introduce the serialized trajectory view as a complement to the rasterized view to reflect traversable relationships between roads, and devise a strategy-adaptive fusion module that dynamically selects and executes the optimal fusion operator to integrate the dual-view representations, yielding map element embeddings. To eliminate connectivity errors, we extend road map modeling from a keypoint-centric primal graph to joint primal and dual graphs. In the dual graph, roads are explicitly modeled as nodes, enabling context-aware topology inference. A co-generation strategy is then employed to jointly infer both graphs while maintaining their geometric consistency. Extensive experiments on two real-world datasets demonstrate the superiority of D2Map, which outperforms SOTA baselines by 11.44% in the TOPO metric.
The rapid growth of web content has spurred the widespread adoption of on-device AI assistants powered by large language models (LLMs). However, deploying and personalizing these assistants in real-world environments remains challenging due to limited annotation budgets and scarce on-device fine-tuning resources. Existing edge–cloud collaboration frameworks typically rely on costly cloud-based supervision or perform full-layer fine-tuning, leading to inefficiencies in both computation and adaptation. To address these limitations, we propose EcoTune, a budget-constrained framework for efficient edge–cloud collaborative adaptation. EcoTune jointly optimizes representative data selection for cloud annotation and selective on-device model adaptation within a unified closed-loop process. Specifically, it employs a multi-armed bandit–based strategy to identify high-value user interactions for cloud supervision and a layer importance–driven adaptation mechanism to update only critical components of the small language model (SLM). This coordinated optimization enables dynamic, resource-efficient personalization under stringent annotation and tuning budgets. Experiments on real-world testbeds demonstrate that EcoTune achieves a 20%-60% reduction in annotation costs and significantly lowers fine-tuning memory consumption compared to state-of-the-art baselines, providing a practical and scalable solution for personalized on-device LLMs.
Mobile real-time video streaming (RTVS) demands ultra-low latency to preserve content timeliness. Packet loss in mobile networks significantly inflates frame latency and thus degrades the quality of experience (QoE). As a promising solution, Forward Error Correction (FEC) encoding has been widely deployed in RTVS systems to recover from packet loss by introducing redundancy. However, existing schemes focus on per-frame FEC protection and fail to optimize QoE because they cannot precisely allocate redundancy to handle burst loss events, which typically occur at the single-frame level but can be smoothed out at the multi-frame level. We propose Breath, an adaptive FEC scheme that dynamically adjusts the protection boundary based on network and video dynamics. We have implemented Breath in an RTVS system and evaluated it in emulated mobile networks using network traces collected from the production system. Results show that, compared to state-of-the-art FEC schemes, Breath reduces the deadline miss rate by 17.2%-22.5% while improving the average video bitrate by 10.6%-14.2%.
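The intuition that burst losses smooth out over a multi-frame protection boundary can be illustrated with a back-of-the-envelope redundancy calculation; this is not Breath's algorithm, and the square-root safety margin is an assumption for illustration.

```python
import math

def redundancy_packets(data_packets, loss_rate, window_frames):
    """Illustrative FEC sizing: protect a window of frames jointly, covering
    the expected losses plus a safety margin. The margin grows only like
    sqrt(total packets), so per-frame redundancy shrinks as the protection
    boundary widens and bursts average out across frames."""
    total = data_packets * window_frames
    expected_losses = total * loss_rate
    margin = 2 * math.sqrt(total * loss_rate * (1 - loss_rate))
    return math.ceil((expected_losses + margin) / window_frames)  # per frame

# Per-frame protection must budget each frame for its own worst case;
# a 4-frame protection boundary needs fewer redundancy packets per frame.
print(redundancy_packets(20, 0.05, 1))  # 3 redundancy packets per frame
print(redundancy_packets(20, 0.05, 4))  # 2 redundancy packets per frame
```

The saved redundancy budget is what can be reinvested in video bitrate, which is the trade-off the evaluation quantifies.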
Collaborative learning between edge servers (e.g., base stations) and end devices (e.g., drones) enables simultaneous model training in web applications through knowledge sharing. The resulting models effectively reduce service latency. However, existing approaches either assume isomorphic models on edge servers and end devices or incur substantial transmission overhead when training. Moreover, edge servers are often unable to access data from end devices in a timely manner due to long-distance constraints or strict data privacy regulations. This paper proposes a Prototype Augmentation-based Edge-end Collaborative Learning method (PAECL). It simultaneously trains heterogeneous edge and end models in the absence of data on edge servers by transmitting only augmented class-wise feature vectors (prototypes), significantly reducing communication overhead compared to sharing models, data, or logits. Specifically, on end devices, prototype-implied latent knowledge is augmented via local prototype contrast and global prototype alignment. On edge servers, prototypes are further augmented to produce bounded virtual vectors by mixing them with random noise, and the augmented prototypes are then delivered to generative models to provide data during edge model training. Through simulations and field experiments, PAECL achieves the highest accuracy for edge and end models under limited training resources and reduces the transmission burden by a factor of at least 297 compared to existing edge-end heterogeneous learning methods.
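The prototype idea above can be sketched concretely: a prototype is simply the class-wise mean feature vector, and the server-side augmentation mixes it with bounded random noise. The mixing coefficient and noise distribution here are illustrative assumptions, not PAECL's actual parameters.

```python
import numpy as np

def class_prototypes(features, labels):
    """Class-wise mean feature vectors (prototypes)."""
    labels = np.array(labels)
    return {c: features[labels == c].mean(axis=0) for c in sorted(set(labels.tolist()))}

def augment_prototype(proto, alpha=0.8, rng=None):
    """Mix a prototype with bounded uniform noise, a rough stand-in
    for the server-side augmentation that produces virtual vectors."""
    rng = rng or np.random.default_rng(0)
    noise = rng.uniform(-1, 1, size=proto.shape)
    return alpha * proto + (1 - alpha) * noise

feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = [0, 0, 1, 1]
protos = class_prototypes(feats, labels)   # {0: [2, 0], 1: [0, 3]}
aug = augment_prototype(protos[0])
```

Transmitting one vector per class instead of model weights or raw data is what drives the communication savings the abstract reports.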
Large Language Models (LLMs) increasingly underpin intelligent web applications, from chatbots to search and recommendation, where efficient specialization is essential. Low-Rank Adaptation (LoRA) enables such adaptation with minimal overhead, while federated LoRA allows web service providers to fine-tune shared models without data sharing. However, in privacy-sensitive deployments, clients inject varying levels of differential privacy (DP) noise, creating privacy heterogeneity that misaligns individual incentives and global performance. In this paper, we propose WinFLoRA, a privacy-heterogeneous federated LoRA framework that uses noise-aware aggregation weights as incentives. Specifically, the noise levels of clients are estimated from the uploaded LoRA adapters. A larger weight indicates greater influence on the global model and better downstream task performance, rewarding lower-noise contributions. By up-weighting low-noise updates, WinFLoRA improves global accuracy while accommodating clients' heterogeneous privacy requirements. Consequently, WinFLoRA aligns heterogeneous client utility in terms of privacy and downstream performance with global model objectives without third-party involvement. Extensive evaluations demonstrate that across multiple LLMs and datasets, WinFLoRA achieves up to 52.58% higher global accuracy and up to 2.56× higher client utility than state-of-the-art benchmarks. Source code is publicly available at https://github.com/koums24/WinFLoRA.git.
Recent advances in large language models (LLMs) have enabled more semantic-aware recommendations through natural language generation. Existing LLM for recommendation (LLM4Rec) methods mostly operate in a System 1-like manner, relying on superficial features to match similar items based on click history, rather than reasoning through deeper behavioral logic. This often leads to superficial and erroneous recommendations. Motivated by this, we propose ThinkRec, a thinking-based framework that shifts LLM4Rec from an intuitive system to a rational system. First, ThinkRec introduces a thinking activation mechanism by injecting synthetic reasoning traces, making the recommendation process resemble the Chain of Thought (CoT) reasoning of LLMs. This mechanism analyzes interaction histories, identifies user preferences, and makes decisions based on target items. Furthermore, considering the highly diverse distribution of recommendation data, we propose an instance-wise expert fusion mechanism to reduce the reasoning difficulty. By dynamically assigning weights to expert models based on users' latent features, ThinkRec adapts its reasoning path to individual users, thereby enhancing precision and personalization. Extensive experiments on various real-world web user behavior preference datasets demonstrate that ThinkRec significantly outperforms baselines in terms of recommendation accuracy and interpretability, providing superior recommendations based on a deeper understanding of user intent and a more rigorous reasoning process. Code is available at https://github.com/Yu-Qi-hang/ThinkRec.
Despite their success in Collaborative Filtering (CF), Graph Convolutional Networks (GCNs) are often viewed as Low-pass Graph Filters (LGFs) with task-specific supervision, while the influence of the underlying graph's spectral properties on LGF performance remains underexplored. Our analysis reveals that the performance of LGFs is strongly affected by algebraic connectivity. When connectivity is strong, LGFs tend to perform well; when it is weak, their effectiveness diminishes noticeably. This spectral sensitivity highlights an important limitation of existing models. To address this limitation, we propose Graph Booster, a learnable module that adaptively improves graph connectivity by reweighting edges. Unlike heuristic preprocessing, Graph Booster identifies bottleneck edges via spectral embeddings and adjusts their weights with a monotonic network guided by a lightweight graph connectivity regularizer. Integrated into the LightGCN framework, our model BoostGCN achieves improvements over state-of-the-art methods, underscoring the significance of algebraic connectivity for graph-based CF.
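For readers unfamiliar with the term, algebraic connectivity is the second-smallest eigenvalue of the graph Laplacian (the Fiedler value). A minimal NumPy sketch makes the strong-vs-weak contrast concrete; the path-graph and complete-graph examples are illustrative and not from the paper.

```python
import numpy as np

def algebraic_connectivity(adj):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))
    return eigvals[1]

# A 4-node path graph (weakly connected) vs. the complete graph K4.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
complete = np.ones((4, 4)) - np.eye(4)

lam_path = algebraic_connectivity(path)          # approx 0.586
lam_complete = algebraic_connectivity(complete)  # exactly 4.0
```

Reweighting bottleneck edges upward raises this eigenvalue, which is the quantity Graph Booster's connectivity regularizer targets.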
Quantum computing is an emerging research area. This paper investigates why and how quantum computing can be integrated into recommender systems. Although some existing recommendation methods explore quantum concepts, they either remain theoretical without empirical validation or provide limited insight into the use of quantum computing for designing core functions in recommendation. To fill these gaps, we first analyze the potential advantages of quantum computing for two key components (i.e., representation learning and matching learning) in recommender algorithms and formulate corresponding hypotheses. Then, based on our analysis and quantum computing operations, we propose three quantum-enhanced recommendation paradigms. To show the extensibility of our paradigms, we further apply them to the graph-based and social recommendation scenarios. We conduct extensive experiments on six real-world datasets, comparing our methods with various baselines. Experimental results not only validate our hypotheses but also show the strong performance of our proposed methods.
As users' preferences evolve over time, personalized online services increasingly rely on sequential recommender systems to predict future interactions by modeling patterns in historical user behavior. However, existing methods for sequential recommendation (SR) face two key challenges: they struggle to simultaneously leverage collaborative, semantic, and rating information, and the use of hard labels during training provides limited supervision. In this paper, we introduce SEAR, an LLM-powered Sequential recommEndation framework via fusion of collAborative, semantic, and Rating information. The proposed deep model comprises an embedding layer and a sequence encoder. The embedding layer transforms user-item interactions into three types of embeddings: collaborative, semantic, and rating. The sequence encoder then integrates these embeddings and identifies sequential patterns to model user representations. To enhance the utilization of item semantics, we integrate a large language model (LLM) to extract LLM embeddings. These embeddings are then employed to initialize the semantic embedding layer, collaborative embedding layer, and item embeddings. To capture more nuanced user behavior patterns, we generate preference-weighted soft labels based on the next k interactions. Extensive experiments validate the effectiveness of SEAR, and ablation studies further highlight the distinct contributions of the collaborative, semantic, and rating information.
Federated recommendation systems (FRSs) have recently gained widespread attention due to their ability to train collaborative recommendation models without exchanging raw user data. However, existing FRSs face a severe challenge of data sparsity, which manifests at both the user and item levels. First, user data sparsity: some users may only have a small number of interactions with items, struggling to adequately train the personalized user embedding locally. Second, item data sparsity: some items may only receive a small number of user ratings, causing the global model to lack knowledge about them. Considering these, we propose the Knowledge Enhanced Federated Recommendation System, named KE-FedRS, of which the core idea is to enhance the knowledge of users with few interactions and items with few ratings at both the local and global levels. Specifically, at the local level, we introduce an auxiliary user embedding and aggregate it by averaging across similar users, thereby enriching the knowledge of the local user embedding. At the global level, we propose a hybrid client selection strategy based on item embedding discrepancies, prioritizing clients that exhibit greater divergence in item embeddings from others, thus enhancing the knowledge of items with fewer interactions in the global model. We conduct comprehensive experiments on four real-world datasets, and the results show that the proposed method consistently outperforms baseline approaches in terms of HR@10 and NDCG@10.
Watch time prediction (WTP) has emerged as a pivotal task in short video recommendation systems, designed to quantify user engagement through continuous interaction modeling. Predicting users' watch times on videos often encounters fundamental challenges, including wide value ranges and imbalanced data distributions, which can lead to significant estimation bias when directly applying regression techniques. Recent studies have attempted to address these issues by converting the continuous watch time estimation into an ordinal regression task. While these methods demonstrate partial effectiveness, they exhibit notable limitations: (1) The discretization process frequently relies on bucket partitioning, inherently reducing prediction flexibility and accuracy. (2) The interdependencies among different partition intervals remain underutilized, missing opportunities for effective error correction.
Inspired by language modeling paradigms, we propose a novel Generative Regression (GR) framework that reformulates WTP as a sequence generation task. Our approach employs structural discretization to enable nearly lossless value reconstruction while maintaining prediction flexibility. Through carefully designed vocabulary construction and label encoding schemes, each watch time is bijectively mapped to a token sequence. To mitigate the training-inference discrepancy caused by teacher-forcing, we introduce a curriculum-learning strategy with embedding mixup that gradually transitions from guided to free-generation modes. We evaluate our models extensively on two public datasets and a large-scale offline industrial dataset, as well as in an online A/B test on the Kuaishou App with over 400 million daily active users (DAU); GR consistently outperforms existing state-of-the-art approaches. Our code is available at https://github.com/snailma0229/GR.git.
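The bijective mapping from a watch time to a token sequence can be sketched with a simple digit-level scheme. This is only one possible instantiation of "structural discretization"; GR's actual vocabulary and encoding are more elaborate, and the fixed-point precision below is an assumption for illustration.

```python
def encode_watch_time(seconds, precision=1):
    """Map a watch time to a token sequence (one token per digit)
    after fixed-point scaling. The mapping is bijective, so the
    value can be reconstructed losslessly."""
    scaled = round(seconds * 10 ** precision)
    return [f"<{d}>" for d in str(scaled)]

def decode_watch_time(tokens, precision=1):
    """Invert the encoding: concatenate digits and undo the scaling."""
    digits = "".join(t.strip("<>") for t in tokens)
    return int(digits) / 10 ** precision

tokens = encode_watch_time(37.5)   # ['<3>', '<7>', '<5>']
value = decode_watch_time(tokens)  # 37.5
```

Because every value round-trips exactly, such a scheme avoids the accuracy loss of coarse bucket partitioning while still letting a sequence model generate the prediction token by token.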
The persistent challenges of data sparsity and class imbalance have long limited the development of recommender systems. Fortunately, line graph theory offers a novel perspective to overcome these issues. By transforming the user-item interaction bipartite graph into a line graph, the problems of data sparsity and class imbalance are elegantly reformulated as those of insufficient labeled nodes and imbalanced label distribution in the line graph domain. This reformulation allows us to directly apply mature techniques from node classification and imbalanced graph learning to address these core challenges. Inspired by this insight, we propose a Line Graph Data Augmentation (LGDA) strategy, which features two distinct characteristics. Firstly, it is a plug-and-play module that resolves data sparsity and imbalance without modifying the underlying recommendation framework. Secondly, it employs a targeted augmentation and confidence filtering mechanism to generate high-quality, balanced augmented data. Extensive experiments on four real-world datasets validate that LGDA effectively alleviates data sparsity and class imbalance, leading to significant improvements in both recommendation performance and system robustness.
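The line-graph reformulation above has a compact definition: each user-item interaction becomes a node, and two nodes are adjacent when the interactions share an endpoint. A small self-contained sketch (the toy interactions are invented for illustration):

```python
from itertools import combinations

def line_graph(interactions):
    """Build the line graph of a user-item bipartite graph: each
    interaction (user, item) becomes a node; two nodes are adjacent
    iff the interactions share a user or an item."""
    nodes = list(interactions)
    edges = set()
    for a, b in combinations(nodes, 2):
        if a[0] == b[0] or a[1] == b[1]:  # same user or same item
            edges.add((a, b))
    return nodes, edges

# Toy interactions: (user, item)
inter = [("u1", "i1"), ("u1", "i2"), ("u2", "i2")]
nodes, edges = line_graph(inter)
```

In the transformed graph, sparse interactions become sparsely labeled nodes, which is what lets node-classification and imbalanced-graph-learning techniques apply directly.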
Federated cross-domain recommendation (Federated CDR) aims to collaboratively learn personalized recommendation models across heterogeneous domains while preserving data privacy. Recently, large language model (LLM)-based recommendation models have demonstrated impressive performance by leveraging LLMs' strong reasoning capabilities and broad knowledge. However, adopting LLM-based recommendation models in Federated CDR scenarios introduces new challenges. First, there exists a risk of overfitting with domain-specific local adapters. The magnitudes of locally optimized parameter updates often vary across domains, causing biased aggregation and overfitting toward domain-specific distributions. Second, unlike traditional recommendation models (e.g., collaborative filtering, bipartite graph-based methods) that learn explicit and comparable user/item representations, LLMs encode knowledge implicitly through autoregressive text generation training. This poses additional challenges for effectively measuring the cross-domain similarities under heterogeneity. To address these challenges, we propose an LLM-based framework for federated cross-domain recommendation, FeDecider. Specifically, FeDecider tackles the challenge of scale-specific noise by disentangling each client's low-rank updates and sharing only their directional components. To handle the need for flexible and effective integration, each client further learns personalized weights that achieve the data-aware integration of updates from other domains. Extensive experiments across diverse datasets validate the effectiveness of our proposed FeDecider.
Accurately detecting super hosts that establish connections to a large number of distinct peers is important for mitigating web attacks and ensuring high quality of web service. Existing sketch-based approaches estimate the number of distinct connections, called flow cardinality, according to full IP addresses, while ignoring the fact that a malicious or victim super host often communicates with hosts within the same subnet, resulting in high false positive rates and low accuracy. Though hierarchical-structure based approaches could capture flow cardinality within a subnet, they inherently suffer from high memory usage. To address these limitations, we propose SegSketch, a segmented cardinality estimation approach that employs a lightweight halved-segment hashing strategy to infer common prefix lengths of IP addresses, and estimates cardinality within a subnet to enhance detection accuracy under constrained memory size. Experiments driven by real-world traces demonstrate that SegSketch improves F1-Score by up to 8.04× compared to state-of-the-art solutions, particularly under small memory budgets.
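Why subnet-level grouping matters can be seen with a toy version of the counting problem. The exact sets below stand in for the memory-bounded sketch a real implementation like SegSketch would use, and the flows are fabricated for illustration.

```python
from collections import defaultdict

def subnet_cardinality(flows, prefix_len=24):
    """Count distinct peers of each source host, grouped by the
    peer's subnet prefix. A super host scanning one subnet shows a
    single prefix with very high cardinality."""
    table = defaultdict(lambda: defaultdict(set))
    for src, dst in flows:
        octets = dst.split(".")
        prefix = ".".join(octets[: prefix_len // 8])
        table[src][prefix].add(dst)
    return {src: {p: len(peers) for p, peers in prefixes.items()}
            for src, prefixes in table.items()}

# One host scanning a /24, one host with ordinary traffic.
flows = [("10.0.0.1", f"192.168.1.{i}") for i in range(50)]
flows += [("10.0.0.2", "192.168.1.7"), ("10.0.0.2", "172.16.0.9")]
card = subnet_cardinality(flows)
```

Grouping by prefix makes the scanner's 50 distinct peers in `192.168.1.0/24` stand out immediately, whereas a counter keyed only on full addresses gives no hint that the peers cluster in one subnet.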
Attribute-Specific Fashion Retrieval (ASFR) aims to improve fine-grained image retrieval by focusing on specific attributes. However, existing patch-based attention and Transformer methods often misalign with irregular attribute regions and are prone to background noise, limiting their ability to capture subtle, pixel-level microstructures. To tackle these challenges, we propose Super Fashion, the first ASFR framework that adopts superpixel tokens within a Transformer architecture. Super Fashion initially employs an attribute-guided attention mechanism to extract attribute-related features, which in turn guide the cropping of semantically meaningful image regions. Superpixel segmentation is then leveraged on these regions to generate compact, semantically coherent superpixel tokens. By incorporating modality-specific embeddings for both attribute and superpixel tokens, the superpixel token-based Transformer facilitates adaptive interaction and fusion, thereby enhancing attribute localization and discrimination. Extensive experiments on FashionAI, DARN, and DeepFashion demonstrate relative overall MAP improvements of 1.84%, 9.27%, and 9.35% over prior SOTA. Super Fashion offers a new solution for web-based image retrieval.
Logs serve as a primary source of information for engineers to diagnose failures in large-scale online service systems. Log parsing, which extracts structured events from massive unstructured log data, is a critical first step for downstream tasks like anomaly detection and failure diagnosis. With advances in large language models (LLMs), leveraging their strong text understanding capabilities has proven effective for accurate log parsing. However, existing LLM-based log parsers all focus on the constant part of logs, ignoring the potential contribution of the variable part to log parsing. This constant-centric strategy brings four key problems. First, log grouping and sampling are inefficient when only constant information is used. Second, a constant-based cache leads to a relatively large number of LLM invocations, lowering log parsing accuracy and efficiency. Third, the large number of constant tokens consumed in prompts leads to high LLM invocation costs. Finally, these methods retain only placeholders in the results, losing the system visibility that variable information in logs provides.
To address these problems, we propose a variable-centric log parsing strategy named VarParser. Through variable contribution sampling, a variable-centric parsing cache, and adaptive variable-aware in-context learning, our approach can efficiently capture the variable parts of logs and leverage their contributions to parsing. By introducing variable units, we preserve rich variable information, enhancing the integrity of log parsing results. Extensive evaluations on large-scale datasets demonstrate that VarParser achieves higher accuracy compared to existing methods, significantly improving parsing efficiency while reducing the LLM invocation costs.
Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.
Knowledge tracing (KT) aims to personalize online education on large-scale web-based platforms by modeling students' evolving knowledge states from their interaction sequences. However, most KT models rely on a single encoder architecture (e.g., self-attention or RNN), whose fixed inductive biases fail to capture the diversity of learning behaviors. Specifically, student learning unfolds across multiple timescales, and interaction sequences contain diverse frequency components ranging from short-term variations to long-term trends. Our data-driven analysis reveals that existing encoders exhibit characteristic frequency biases (e.g., self-attention tends to emphasize low-frequency patterns), highlighting the limitations of any single architecture. To address this problem, we propose FA-KT, a frequency-aware mixture of heterogeneous experts framework. FA-KT combines self-attention, Mamba, CNN, and LSTM experts, each with complementary frequency biases. A frequency-aware router analyzes each sequence's frequency characteristics and adaptively combines experts to create dynamic, personalized encoders for individual students. Across five benchmark datasets, FA-KT consistently outperforms 20 strong KT baselines in predicting future performance. Code is available at https://pykt.org/.
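The kind of frequency signal such a router could consume can be sketched with a plain FFT: split the power spectrum of a student's binary response sequence into bands and report each band's share of power. The band count and the toy sequences are illustrative assumptions, not FA-KT's actual router design.

```python
import numpy as np

def frequency_profile(correct_seq, bands=3):
    """Share of spectral power in each of `bands` equal frequency
    bands of a binary response sequence, after removing the mean."""
    x = np.asarray(correct_seq, dtype=float)
    x = x - x.mean()                       # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    power = power[1:]                      # drop the zero-frequency bin
    chunks = np.array_split(power, bands)
    totals = np.array([c.sum() for c in chunks])
    return totals / totals.sum()

slow = [0] * 8 + [1] * 8    # long-term trend: power in the low band
fast = [0, 1] * 8           # rapid alternation: power in the high band
p_slow = frequency_profile(slow)
p_fast = frequency_profile(fast)
```

A student whose correctness drifts slowly and one who alternates rapidly produce opposite profiles, giving a router a concrete basis for weighting low-frequency-biased experts (e.g., self-attention) differently from high-frequency-biased ones.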
Predictive modeling on web-scale tabular data presents significant scalability challenges for industrial applications, often involving billions of instances and hundreds of heterogeneous numerical features. The inherent complexities of these features—characterized by anisotropy, heavy-tailed distributions, and non-stationarity—not only impose bottlenecks on the training efficiency and scalability of mainstream models like Gradient Boosting Decision Trees (GBDTs), but also compel practitioners into laborious, inefficient, and expert-dependent manual feature engineering. To systematically address this challenge, we introduce KMLP, a novel hybrid deep architecture. KMLP synergistically integrates a shallow Kolmogorov-Arnold Network (KAN) as a front-end with a Gated Multilayer Perceptron (gMLP) as the backbone. The KAN front-end leverages its learnable activation functions to automatically model complex non-linear transformations for each input feature in an end-to-end manner, thereby automating feature representation learning. Subsequently, the gMLP backbone efficiently captures high-order interactions among these refined representations. Extensive experiments on multiple public benchmarks and an ultra-large-scale industrial web dataset with billions of samples demonstrate that KMLP achieves state-of-the-art (SOTA) performance. Crucially, our findings reveal that KMLP's performance advantage over strong baselines like GBDTs becomes more pronounced as the data scale increases. This validates KMLP as a scalable and adaptive deep learning paradigm, offering a promising path forward for modeling large-scale, dynamic web tabular data.
Identifying central claims from long documents is a fundamental yet challenging task in automated fact-checking pipelines. Manual extraction at the document level is costly and requires domain expertise, while existing automatic methods for claim extraction and evaluation tend to overlook critical dimensions that human fact-checkers consider essential for determining claim quality. In this paper, we propose a novel Human Value-Aligned framework that enables zero-shot document-level claim extraction and evaluation by aligning with expert preferences. We first elicit a structured set of Human Value Alignment (HVA) dimensions from expert annotations and incorporate them into prompt design, instructing large language models (LLMs) to extract high-quality claims that align with expert values. To assess the quality of extracted claims, we further introduce an LLM-based automatic evaluator that scores claims across HVA dimensions and quantifies alignment with expert-written claims. Furthermore, we propose a multi-level agreement metric to evaluate the reliability of the automatic HVA evaluator. Experimental results show that our method significantly improves central claim extraction performance, achieving state-of-the-art chrF and P@N scores. Moreover, the proposed HVA evaluator achieves high agreement with human judgments and offers interpretable dimension-level assessments of extracted claims. The HVA framework establishes a reliable and scalable approach to human-aligned document-level claim extraction and evaluation in real-world scenarios.
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and how different levels of contextual information shape their predictive behavior. Using a simple but representative repurchase scenario, we benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. Two key findings emerge. First, while LLMs surpass lightweight statistical baselines, they consistently underperform dedicated machine-learning models, showing their limited ability to capture quantitative temporal structure. Second, although moderate context can improve LLM accuracy, adding further user-level detail degrades performance. These results challenge the assumption that ''more context leads to better reasoning.'' Our study highlights fundamental limitations of today's LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.
Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on iterative online experiments or parameter tuning over abstracted statistical models, which can be inaccurate, time-consuming and potentially impair players' experience. Although simplified offline simulation systems are frequently employed as alternatives, their low fidelity constrains agents' ability to faithfully replicate real players' reasoning processes and behavioral responses to interventions. To address these limitations, we propose a generative agent-based MMO simulation system with hundreds of agents empowered by Large Language Models (LLMs). By applying Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data, we adapt LLMs from general priors to game-specific domains, enabling realistic and interpretable player decision-making. In parallel, a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost-efficient framework for data-driven numerical design optimization.
With the rise of modern search and recommendation platforms, insufficient collaborative information of cold-start items exacerbates the Matthew effect of existing platform items, challenging platform diversity and becoming a longstanding issue. Existing methods align items' side content with collaborative information to transfer collaborative signals from high-popularity items to cold-start items. However, these methods account for neither the asymmetry between collaboration and content nor the fine-grained differences among items. To address these issues, we propose COINS, an item representation enhancement approach based on fused alignment of semantic IDs. Specifically, we use RQ-OPQ encoding to quantize item content and collaborative information, followed by a two-step alignment: RQ encoding transfers shared collaborative signals across items, while OPQ encoding learns items' differentiated information. Comprehensive offline experiments on large-scale industrial datasets demonstrate COINS's superiority, and rigorous online A/B tests confirm statistically significant improvements.
With the evolution of generative models, generated images are closer to reality, raising concerns about information authenticity and malicious misuse. Invisible watermarks offer a practical approach to detecting and tracing them. However, image watermarking inevitably introduces quality degradation, yet most existing methods primarily focus on improving watermark robustness. To address this limitation, we propose AWMA-MoE, a framework that enhances the quality of generated images while preserving strong watermark robustness. Specifically, we design an attention-based adapter that adaptively embeds watermarks with spatially varying strengths across image regions. Building upon this, we introduce an MoE architecture that leverages diverse experts to further improve image quality while retaining watermark robustness. Experiments demonstrate that AWMA-MoE can reduce the distortion of generated images and exhibit competitive watermark performance, thus striking an improved balance for generated-image watermarking tasks and better linking post-hoc and in-generation methods.
Most existing content issue detection models are built upon Multimodal Large Language Models (MLLMs). Although such approaches achieve high accuracy, they often suffer from poor timeliness, limited feature richness, and high serving costs: MLLM-based models are typically updated only monthly or quarterly and rely mainly on intrinsic video features (e.g., sampled frames and captions), underutilizing post-hoc user feedback and aggregated video and author-side signals. To address these challenges, we propose a daily updated ranking model for Content Issue Detection based on a Cross-Order Representation Fusion (CORF) architecture. The model supports day-level updates and integrates user behavior features, video- and author-side information, and MLLM-generated scores as inputs. Compared with MLLM-based approaches, our model is significantly smaller and can efficiently score all videos on a daily basis. Experiments on the Adult-Content category show higher recall at the same level of precision, and online A/B tests further demonstrate reduced user exposure to such content issues, contributing to a safer and more positive viewing experience.
Event forecasting is inherently influenced by multifaceted considerations, including international relations, regional historical dynamics, and cultural contexts. However, existing LLM-based approaches employ single-model architectures that generate predictions along a singular explicit trajectory, constraining their ability to capture diverse geopolitical nuances across complex regional contexts. To address this limitation, we introduce ThinkTank-ME, a novel Think Tank framework for Middle East event forecasting that emulates collaborative expert analysis in real-world strategic decision-making. To facilitate expert specialization and rigorous evaluation, we construct POLECAT-FOR-ME, a Middle East–focused event forecasting benchmark. Experimental results demonstrate the superiority of multi-expert collaboration in handling complex temporal geopolitical forecasting tasks. The code is available at https://github.com/LuminosityX/ThinkTank-ME.
Multimodal disinformation may arise from the image, the text, or their cross-modal consistency, and these aspects can mislead either alone or in combination. However, existing research has not considered the full range of such combinations, leaving current methods vulnerable to unseen combinations. To address this gap, we propose a tri-axis formulation that explicitly separates image veracity, text veracity, and cross-modal consistency, yielding an 8-way space that fully enumerates all cases and turns multimodal disinformation detection, formerly a coarse, multi-label problem, into a well-defined, mutually exclusive classification task. To support this formulation, we construct OctantFake, an 8-way dataset sourced from 16 datasets with complete label coverage and broad class diversity. Experiments show that existing methods struggle on this challenging dataset. Building on this dataset, we introduce OctantAgent, an LVLM-based parallel framework with three modules, image check, text check, and consistency check, whose predictions are fused into an 8-way decision, achieving clear improvements over existing LVLM-based methods. Together, these contributions introduce an interpretable and exhaustive 8-way perspective that provides a solid foundation for analyzing multimodal disinformation.
Inspired by advances in LLMs, reasoning-enhanced sequential recommendation performs multi-step deliberation before making final predictions, unlocking greater potential for capturing user preferences. However, current methods are constrained by static reasoning trajectories that are ill-suited for the diverse complexity of user behaviors. They suffer from two key limitations: (1) a static reasoning direction, which uses flat supervision signals misaligned with human-like hierarchical reasoning, and (2) a fixed reasoning depth, which inefficiently applies the same computational effort to all users, regardless of pattern complexity. This rigidity leads to suboptimal performance and significant computational waste. To overcome these challenges, we propose DTRec, a novel and effective framework that explores the Dynamic reasoning Trajectory for Sequential Recommendation along both direction and depth. To guide the direction, we develop Hierarchical Process Supervision (HPS), which provides coarse-to-fine supervisory signals to emulate the natural, progressive refinement of human cognitive processes. To optimize the depth, we introduce the Adaptive Reasoning Halting (ARH) mechanism that dynamically adjusts the number of reasoning steps by jointly monitoring three indicators. Extensive experiments on three real-world datasets demonstrate the superiority of our approach, achieving up to a 24.5% performance improvement over strong baselines while simultaneously reducing computational cost by up to 41.6%.
We propose a novel MLaaS Dataset Generator (MDG) framework that creates configurable and reproducible datasets for evaluating Machine Learning as a Service (MLaaS) selection and composition. MDG simulates realistic MLaaS behaviour by training and evaluating diverse model families across multiple real-world datasets and data distribution settings. It records detailed functional attributes, quality of service metrics, and composition-specific indicators, enabling systematic analysis of service performance and cross-service behaviour. Using MDG, we generate more than ten thousand MLaaS service instances and construct a large-scale benchmark dataset suitable for downstream evaluation. We also implement a built-in composition mechanism that models how services interact under varied Internet of Things conditions. Experiments demonstrate that datasets generated by MDG enhance selection accuracy and composition quality compared to existing baselines. MDG provides a practical and extensible foundation for advancing data-driven research on MLaaS selection and composition.
We consider the problem of graph sparsification while preserving the test accuracy of Graph Neural Networks (GNNs). Prior work in this area is often motivated by the Lottery Ticket Hypothesis, which aims to prune redundant edges. However, these sparsification approaches typically operate as black boxes and provide no justification for which edges are removed. In contrast, edge importance scores obtained from GNN explanation methods provide a principled and interpretable basis for sparsification. In particular, we show that Shapley value–based explainers such as GNNShap enable effective sparsification, allowing up to 80% of edges to be removed without degrading model accuracy. We show that Shapley values are well-suited for this task due to their robustness in identifying less influential edges, resulting in sparse yet faithful subgraphs that are efficient for downstream applications.
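A minimal sketch of the explanation-driven sparsification step the abstract describes: given per-edge importance scores (e.g. Shapley values from an explainer; the function and score values here are hypothetical toy stand-ins, not the paper's pipeline), keep only the top-scoring fraction of edges.

```python
import numpy as np

def sparsify_by_importance(edge_index: np.ndarray,
                           edge_scores: np.ndarray,
                           keep_fraction: float = 0.2) -> np.ndarray:
    """Keep only the highest-scoring fraction of edges.

    edge_index:  (2, E) array of source/target node ids.
    edge_scores: (E,) importance scores, e.g. Shapley values
                 from a GNN explainer (hypothetical here).
    """
    num_keep = max(1, int(round(keep_fraction * edge_scores.size)))
    # Indices of the top-scoring edges (argpartition avoids a full sort).
    top = np.argpartition(-edge_scores, num_keep - 1)[:num_keep]
    return edge_index[:, np.sort(top)]

# Toy graph with 6 edges; keep the top ~1/3 by score.
edges = np.array([[0, 0, 1, 2, 3, 4],
                  [1, 2, 2, 3, 4, 5]])
scores = np.array([0.9, 0.1, 0.05, 0.8, 0.02, 0.3])
print(sparsify_by_importance(edges, scores, keep_fraction=1/3))
```

The same routine applies regardless of how the scores are produced; the paper's point is that Shapley-based scores make the retained subgraph both sparse and faithful.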
To improve resource utilization in Cloud clusters and reduce operational costs, this paper proposes a two-level collaborative Vertical Pod Autoscaler (VPA) strategy for offline tasks in hybrid deployment environments. Offline tasks feature significant runtime variation and low periodicity—characteristics that make existing VPA strategies ineffective. The proposed approach combines coarse-grained adjustments (dynamically recommending resource settings based on cluster-wide utilization trends) with fine-grained adjustments (performing instance-level resource recalibration via sliding windows). It addresses the averaging perspective limitation of coarse-grained methods without requiring predictive models or prior knowledge, ensuring interpretability and operational simplicity. Production deployment verifies that the strategy elevates offline resource utilization to near-target levels, achieving an 8.62% improvement in offline resource utilization and a 41.5% increase in offline task deployment volume under Out-of-Memory (OOM) and computational constraints.
Attention mechanisms are essential to the success of Large Language Models (LLMs). In practice, models often overemphasize semantically low-value tokens, forming attention sinks while failing to capture truly informative tokens. Existing inference-time optimization methods mainly rely on static adjustments or attention redistribution, which often disrupt the correspondence between attention distribution and the actual semantics of the input, leading to a loss of semantic consistency and degraded performance. To address this problem, we propose PAOSC, a plug-and-play attention optimization model designed to maintain semantic consistency by dynamically adjusting attention. PAOSC employs a generator to identify informative tokens and a discriminator to optimize the generator via policy gradients based on confidence changes and loss fluctuations. Experiments on eight LLMs show up to a 9.68% improvement in the F1 score. On the constructed HTTP-RL dataset, PAOSC eliminates 18% of low-value tokens, improving inference efficiency while maintaining semantic consistency. Our code is available at https://github.com/ChangLi000/PAOSC.
Generative images have proliferated on Web platforms in social media and online copyright distribution scenarios, and semantic watermarking has increasingly been integrated into diffusion models to support reliable provenance tracking and forgery prevention for web content. Traditional noise-layer-based watermarking, however, remains vulnerable to inversion attacks that can recover embedded signals. To mitigate this, recent content-aware semantic watermarking schemes bind watermark signals to high-level image semantics, constraining local edits that would otherwise disrupt global coherence. Yet, large language models (LLMs) possess structured reasoning capabilities that enable targeted exploration of semantic spaces, allowing locally fine-grained but globally coherent semantic alterations that invalidate such bindings. To expose this overlooked vulnerability, we introduce a Coherence-Preserving Semantic Injection (CSI) attack that leverages LLM-guided semantic manipulation under embedding-space similarity constraints. This alignment enforces visual-semantic consistency while selectively perturbing watermark-relevant semantics, ultimately inducing detector misclassification. Extensive empirical results show that CSI consistently outperforms prevailing attack baselines against content-aware semantic watermarking, revealing a fundamental security weakness of current semantic watermark designs when confronted with LLM-driven semantic perturbations.
We provide the first anonymity-preserving algorithm for a centralized decision maker in linear bandit-based multi-user systems. Our algorithm employs successive elimination techniques for linear bandits to build an assignment multi-graph (from users to arms) along with a greedy matching algorithm that efficiently allocates the arms to users. We provide lower and upper bounds for this problem, showing that our algorithm is regret optimal up to a √(CK) factor.
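A toy sketch of the matching step: once elimination has left each user with a set of still-plausible arms (the assignment multi-graph), a greedy pass can allocate distinct arms, serving the most constrained users first. The data layout and tie-breaking here are illustrative assumptions, not the paper's algorithm.

```python
def greedy_match(pref_lists):
    """Greedily assign each user an arm from its set of
    still-plausible arms, trying users with the fewest
    options first.

    pref_lists: {user: set of candidate arms} -- a hypothetical
    toy stand-in for elimination-based candidate sets.
    """
    assignment = {}
    used = set()
    for user in sorted(pref_lists, key=lambda u: len(pref_lists[u])):
        for arm in sorted(pref_lists[user]):
            if arm not in used:
                assignment[user] = arm
                used.add(arm)
                break
    return assignment

print(greedy_match({"u1": {0, 1}, "u2": {0}, "u3": {1, 2}}))
# {'u2': 0, 'u1': 1, 'u3': 2}
```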
Behavioral patterns captured in embeddings learned from interaction data are pivotal across various stages of production recommender systems. However, in the initial retrieval stage, practitioners face an inherent tradeoff between embedding expressiveness and the scalability and latency of serving components, resulting in the need for representations that are both compact and expressive. To address this challenge, we propose a training strategy for learning high-dimensional sparse embedding layers in place of conventional dense ones, balancing efficiency, representational expressiveness, and interpretability. To demonstrate our approach, we modified the production-grade collaborative filtering autoencoder ELSA, achieving up to 10× reduction in embedding size with no loss of recommendation accuracy, and up to 100× reduction with only a 2.5% loss. Moreover, the active embedding dimensions reveal an interpretable inverted-index structure that segments items in a way directly aligned with the model's latent space, thereby enabling integration of segment-level recommendation functionality (e.g., 2D homepage layouts) within the candidate retrieval model itself. Source code, additional results, and a live demo are available at https://github.com/zombak79/compressed_elsa.
Long-term time series forecasting (LTSF) remains challenging due to the trade-off between parallel efficiency and sequential modeling of temporal coherence. Direct multi-step forecasting (DMS) methods enable fast, parallel prediction of all future horizons but often lose temporal consistency across steps, while iterative multi-step forecasting (IMS) preserves temporal dependencies at the cost of error accumulation and slow inference. To bridge this gap, we propose Back to the Future (BTTF), a simple yet effective framework that enhances forecasting stability through look-ahead augmentation and self-corrective refinement. Rather than relying on complex model architectures, BTTF revisits the fundamental forecasting process and refines a base model by ensembling the second-stage models augmented with their initial predictions. Despite its simplicity, our approach consistently improves long-horizon accuracy and mitigates the instability of linear forecasting models, achieving accuracy gains of up to 58% and demonstrating stable improvements even when the first-stage model is trained under suboptimal conditions. These results suggest that leveraging model-generated forecasts as augmentation can be a simple yet powerful way to enhance long-term prediction, even without complex architectures.
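The two-stage idea can be sketched with plain least-squares models: a stage-2 model sees the input window *augmented with* the stage-1 forecast, so it can learn to correct it. Window sizes, the sine toy series, and the linear base model are illustrative assumptions; the paper's framework is model-agnostic.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    # Least-squares fit with a bias column.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ w

# Toy series: lookback window of 4 predicts a horizon of 2 steps.
series = np.sin(np.arange(400) * 0.1) + 0.05 * rng.standard_normal(400)
L, H = 4, 2
X = np.stack([series[i:i + L] for i in range(len(series) - L - H)])
Y = np.stack([series[i + L:i + L + H] for i in range(len(series) - L - H)])

# Stage 1: direct multi-step base model.
w1 = fit_linear(X, Y)
Y1 = predict(w1, X)

# Stage 2: refine using the input window augmented with the
# stage-1 forecast (the "look-ahead augmentation").
X2 = np.hstack([X, Y1])
w2 = fit_linear(X2, Y)
Y2 = predict(w2, X2)

mse1 = float(np.mean((Y1 - Y) ** 2))
mse2 = float(np.mean((Y2 - Y) ** 2))
print(mse1, mse2)
```

Because the stage-2 feature set contains the stage-1 forecast, the refined fit can never be worse than the base fit on the training data; any generalization gain is an empirical question, which is what the paper studies.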
General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query–table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms the existing baselines with an average R@1 improvement of 16.54%. Under cross-domain evaluation, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at https://github.com/yumeow0122/CGPT.
Accurate estimation of delivery time (EDT) is a critical factor in web e-commerce user experience. The pursuit of higher EDT accuracy has predominantly centered on designing increasingly complex model architectures. While valuable, this architecture-centric paradigm creates a tension between its high iteration costs and the industrial demand for agile deployment. This work, therefore, explores a complementary dimension: enhancing model performance by optimizing the learning process itself. We propose EDTF, a novel, plug-and-play composite learning framework that empowers existing models by augmenting their learning objective. EDTF first transforms the traditional regression problem into a structured ordinal classification task to address the training difficulties inherent in direct regression and preserve temporal order. It then introduces a cross-view consistency paradigm, decomposing the prediction task into two related views: the macroscopic end-to-end delivery time and the microscopic next-hop duration. By enforcing a self-supervised signal that aligns the sum of future next-hop durations with the overall EDT, our framework enables models to learn more robust temporal representations without extra features. Extensive experiments on a large-scale industrial dataset show that EDTF, as a plugin, consistently enhances performance and accelerates convergence across five diverse architectures. Critically, an EDTF-optimized model has been successfully deployed in a live production environment, demonstrating significant improvements over its predecessor. This work thus presents a validated and valuable new paradigm for the economical and efficient application of web services reliant on trajectory-based forecasting, from e-commerce to ride-hailing and food delivery.
Money laundering enables malicious actors to integrate illegal profits into the legitimate economy and has long been a central concern in financial regulation. Blockchain systems introduce new channels for laundering through decentralized, pseudonymous, and cross-border asset transfers. In this context, blockchain exploiters often rely on laundering to conceal fund origins and enable cash-out.
While prior work has focused on detecting suspicious accounts or transactions, the behavioral patterns underlying laundering practices remain underexplored. This paper provides a behavioral perspective on post-exploit laundering on Ethereum. We use on-chain tracing to reconstruct token flows originating from exploiter-controlled addresses. We then define a set of behavioral metrics covering financial trajectories, temporal dynamics, structural topology, and value dispersion. Our empirical study reveals recurring patterns, including rapid fund movement, shallow transfer structures, and broad dispersion. These patterns exhibit measurable regularities, suggesting that behavioral dynamics could be leveraged to enhance existing laundering detection frameworks.
The proliferation of highly realistic deepfake videos threatens public trust and the integrity of digital information. However, detecting sophisticated deepfakes requires analysis beyond surface-level visual artifacts. We propose Harmonizing Action Units with Temporal-contextual Embeddings (HAUTE), integrating physiological muscle dynamics with holistic semantic context through adaptive attention mechanisms. HAUTE captures temporal Action Unit coordination patterns and high-level contextual embeddings, enabling the model to reveal synthesis-induced inconsistencies imperceptible to isolated modalities. Extensive experiments demonstrate state-of-the-art performance with strong cross-dataset adaptability, particularly on commercial tool-based high-quality deepfakes, advancing trustworthy content verification for web ecosystems.
Self-explaining recommenders enhance user trust by providing justifications for their suggestions. Sequence-aware models have advanced the field by leveraging user interaction history to personalize recommendations and explanations. However, generative models often struggle with sparse data, producing repetitive or irrelevant explanations. This paper explores the optimal methods for infusing rich textual information from past user interactions directly into the item embeddings to feed a user reasoning path leading to personalized explanations. We conduct a comprehensive analysis of various techniques, including: (1) multiple text aggregation strategies to pool fine-grained attributed item opinions into user-aggregated item text representations; (2) several fusion mechanisms to combine text and collaborative modalities, from early fusion to a late fusion approach within the Transformer architecture; and (3) different training regimes for explanation generation. Experiments on three real-world datasets demonstrate which steps to follow in order to successfully leverage textual information into a sequence-aware explainable recommendation model and boost recommendation performance as well as explanation quality.
Modern web interfaces increasingly support complex decision workflows, such as travel planning and multi-criteria selection, yet remain largely static and insensitive to users' moment-to-moment cognitive states during interaction. Travel planning, in particular, requires users to synthesize dispersed information under multiple constraints, making it a representative high-load interactive decision task. This study presents MACA (Multi-Agent Cognitive Adaptation), a framework that enables real-time cognitive adaptation in web-based decision environments by integrating hierarchical Monte Carlo Tree Search with a Planner–Critic–Executor multi-agent architecture. MACA continuously estimates users' emotional and attentional states using facial expression analysis (ResEmoteNet) and gaze stability tracking (MediaPipe), and uses these signals to regulate agent collaboration, reasoning depth, and feedback pacing during interaction. We evaluated MACA in a 2×2 within-subject study (N = 30) comparing Single versus Multi-agent and Fixed versus Adaptive configurations. Results show that the Multi-Adaptive condition significantly improved decision quality (F(3,116) = 2.96, p = 0.035) while reducing mental effort (F(3,116) = 2.82, p = 0.042), yielding a 10.7% gain in decision efficiency without increasing cognitive burden. These findings demonstrate that multimodal user-state sensing combined with cooperative multi-agent reasoning can enhance interactive web-based decision making while maintaining user well-being.
Mixture-of-Experts (MoE) models are central to scaling Large Language Models (LLMs), but stateless and compute-intensive routing repeatedly re-explores expert assignments for similar inputs, causing computational redundancy and unstable behavior in web-scale applications like search and dialogue. We reframe expert routing as a retrieval-augmented process and propose the Retrieval-Memory Synergy Mixture-of-Experts (RMS-MoE), which integrates a Co-Activation Memory (CAM) to store and retrieve effective expert teams and a learnable, input-dependent gate to fuse retrieved priors with live routing predictions, enabling consistent expert coordination for semantically related inputs. Extensive experiments on web-scale QA and dialogue tasks show that RMS-MoE achieves a 26% latency reduction, a 2.7-point accuracy gain, and a 3.3% improvement in routing stability, demonstrating architectural memory as a principled path toward more efficient, stable, and scalable LLMs.
Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare zero-shot, meta-learning, supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model- and data-dependent: meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning often reduces accuracy or calibration quality. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and where its limitations lie.
Retrieval-augmented generation (RAG) is an effective approach to enhancing the factual accuracy of radiology reports. However, existing methods primarily model coarse-grained image–report correspondences, ignoring semantic relations among reports that capture hierarchical and fine-grained pathological knowledge. As a result, the learned representations fail to reflect detailed clinical semantics, causing factual inconsistencies in generated reports. Therefore, we propose a multi-granularity knowledge-integrated RAG framework for radiology reports. Specifically, we utilize multi-granularity semantic similarities, derived from the text modality, to adjust the original cross-modal contrastive learning loss. This guides the multimodal retriever to learn a finer-grained clinical semantic alignment. Then, we utilize cross attention to obtain enhanced visual features by integrating the retrieved reports with the original images, thus enhancing the factual accuracy of report generation. The effectiveness of our method was verified on two widely used benchmarks, achieving superior performance in both language generation and key clinical metrics.
Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs), increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction-tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web-like narratives using the PEACE-Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer- and head-level activation differences distinguish high- from low-trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly decodable trust signals and fine-tuning effects that refine rather than restructure these representations. The strongest associations emerge with appraisals of fairness, certainty, and self-accountability, dimensions central to human trust formation online. These findings suggest that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trustworthy AI systems in the web ecosystem. Code and appendix are available at: https://github.com/GerardYeo/TrustworthinessLLM.
Active retrieval-augmented generation (RAG) triggers external knowledge retrieval during generation based on model-side uncertainty signals to support knowledge-intensive, multi-hop reasoning. However, existing methods often retrieve only after producing a complete answer, failing to surface and fill information gaps in time; moreover, relying on a single internal signal as the trigger cannot adequately capture the multifaceted nature of uncertainty. We therefore propose a conflict-aware active RAG framework. We first decompose complex questions into a sequence of step-level sub-problems. At each step, we quantify local distributional uncertainty via a sliding-window peak token entropy, and estimate cross-sample consensus via the variation ratio computed over multiple Monte Carlo samples. After calibrating both signals onto a probabilistic scale, we quantify their conflict using a symmetric, bounded divergence over Bernoulli parameters, and fuse the three quantities into a single uncertainty score that gates retrieval. Experiments demonstrate the effectiveness of our framework.
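The three signals the abstract names can be sketched directly: a sliding-window peak token entropy, a variation ratio over Monte Carlo samples, and a symmetric bounded divergence (Jensen-Shannon here) between the two calibrated signals viewed as Bernoulli parameters. The squashing, fusion weights, and threshold below are illustrative assumptions, not the paper's calibration.

```python
import math
from collections import Counter

def peak_token_entropy(probs_window):
    """Max per-token entropy over a window of next-token
    distributions (list of probability lists)."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return max(entropy(p) for p in probs_window)

def variation_ratio(samples):
    """1 - frequency of the modal answer across MC samples."""
    _, mode_count = Counter(samples).most_common(1)[0]
    return 1.0 - mode_count / len(samples)

def bernoulli_js(p, q):
    """Symmetric, bounded Jensen-Shannon divergence between
    two Bernoulli parameters (in nats)."""
    def kl(a, b):
        eps = 1e-12
        a = min(max(a, eps), 1 - eps)
        b = min(max(b, eps), 1 - eps)
        return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def should_retrieve(probs_window, samples, threshold=0.35):
    # Calibrate both signals onto [0, 1] (illustrative squashing).
    u_local = 1 - math.exp(-peak_token_entropy(probs_window))
    u_consensus = variation_ratio(samples)
    conflict = bernoulli_js(u_local, u_consensus)
    score = (u_local + u_consensus + conflict) / 3  # simple fusion
    return score > threshold
```

A confident step (peaked distribution, unanimous samples) stays below the gate, while a flat distribution with disagreeing samples triggers retrieval.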
Health misinformation poses serious public health risks, yet little is known about how ordinary users verify such content in real time. In this study, we investigate how behavioral fact-checking strategies, specifically lateral reading (consulting multiple external sources) versus vertical reading (spending more time on fewer sources), affect accuracy in detecting health-related misinformation. We conducted a large-scale online experiment with 1,842 participants using a standalone web platform designed for the experiment that logs respondents' behavioral signals (tab/window switches and time spent outside the experimental interface). This design enables indirect measurement of fact-checking in an ecologically valid setting. Using linear mixed-effects models with semiparametric bootstrap confidence intervals, we find that lateral reading significantly improves accuracy (β = 0.09, 95% CI [0.079, 0.181], p = 0.001), while vertical reading has no effect (β = -0.01, p = 0.332). Crucially, confirmation bias, operationalized as alignment between statement valence and user attitude, does not moderate this relationship (p = 0.876). This suggests that lateral reading remains effective even when users encounter ideologically congruent misinformation. Our results demonstrate that simple, scalable behavioral cues like tab-switch frequency can serve as reliable indicators of verification quality. For web platforms and digital literacy initiatives, this implies that nudging users to ''open another tab'' may be a lightweight, bias-resistant intervention to improve misinformation resilience, especially in health contexts where errors carry real-world consequences.
Off-policy evaluation (OPE) is widely used to compare contextual bandit policies in recommender systems. While there have been many recent methodological developments proposing novel OPE schemes, they are typically validated in synthetic environments that do not necessarily possess the structure of real-world datasets. In this paper, we consider the inverse propensity score (IPS) method and its modifications, and study how empirical conclusions inferred from the data depend on evaluation pipelines. We show that, even in synthetic environments, rankings of different estimators are sensitive to random seeds, log generators, and sample size. Using a popular benchmark, the Open Bandit Dataset, we analyze logging behavior and data characteristics that may violate the i.i.d. assumptions of log generation.
Retrieval-Augmented Generation (RAG) systems are increasingly deployed in web-based educational environments, yet transparency is often treated as a primarily ethical, and too often optional, concern rather than a foundational one. This paper presents design patterns for building transparent RAG systems, derived from developing and deploying SAGE-RAI, an advanced multi-purpose RAG system, in an educational context. Through systematic evaluation combining quantitative rating data (n=26, mean rating=4.62/5) and qualitative interviews (n=4), we demonstrate that transparency serves dual pedagogical and ethical functions. Our empirical findings reveal high user satisfaction (92.3% rating 4-5 stars) while identifying critical tensions between AI assistance and learning independence. Our findings suggest that as RAG systems increasingly mediate access to web-based knowledge, transparency must evolve from an optional feature to an architectural requirement.
Medical concepts, the core entities in Electronic Health Records (EHRs), provide essential inputs for clinical decision-making systems. However, most existing healthcare models still rely on massive concept-specific embedding tables, resulting in substantial memory overhead. Recent studies compress medical concepts into discrete code sequences for memory efficiency, but their flat semantic quantization fails to explicitly encode the hierarchical structure of medical ontologies, thereby limiting clinical interpretability. To this end, we propose MedRQ, an ontology-driven residual vector quantization framework that aligns discrete codes with multi-level clinical ontologies. By incorporating hierarchical supervision into the quantization process, MedRQ generates compact and ontology-consistent concept representations that generalize seamlessly across healthcare prediction tasks. Experiments on two real-world EHR datasets demonstrate that MedRQ significantly outperforms state-of-the-art baselines while reducing memory usage.
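The residual quantization step can be sketched as follows; the hierarchical (ontology-aligned) supervision is omitted, and the codebook sizes and random embeddings are hypothetical. Each level picks the nearest code and passes the remaining residual to the next level, so the code sequence forms a coarse-to-fine description of the concept.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: at each level, pick the
    nearest code and quantize the remaining residual.

    x:         (d,) concept embedding.
    codebooks: list of (K, d) arrays, one per hierarchy level
               (e.g. chapter -> category -> concept in an ontology).
    Returns the per-level code indices and the reconstruction.
    """
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return codes, recon

rng = np.random.default_rng(0)
d = 8
books = [rng.standard_normal((16, d)) for _ in range(3)]  # 3 levels
x = rng.standard_normal(d)
codes, recon = rvq_encode(x, books)
print(codes, float(((x - recon) ** 2).sum()))
```

Memory savings come from storing three small codebooks plus short code sequences instead of one embedding row per concept.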
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user–item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
Recent advances in vision–language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. We investigate this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, and focus on improving its reliability in web-based control. Through initial experimentation, we observe three key challenges: (i)~inaccurate localization of target elements, the cursor, and their relative positions, (ii)~sensitivity to instruction phrasing, and (iii)~an overoptimistic bias toward its own actions, often assuming they succeed rather than analyzing their actual outcomes. To address these issues, we fine-tune Qwen2.5-VL-32B for a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training pipeline consists of two stages: (1)~teaching the model to determine whether the cursor already hovers over the target element or whether movement is required, and (2)~training it to execute a single command (a mouse move or a mouse click) at a time, verifying the resulting state of the environment before planning the next action. Evaluated on a custom benchmark of single-click web tasks, our approach increases success rates from 86% to 94% under the most challenging setting.
Short-video recommendation presents unique challenges, such as modeling rapid user interest shifts from implicit feedback, but progress is constrained by a lack of large-scale open datasets that reflect real-world platform dynamics. To bridge this gap, we introduce the VK Large Short-Video Dataset (VK-LSVD), the largest publicly available industrial dataset of its kind. VK-LSVD offers an unprecedented scale of over 40 billion interactions from 10 million users and almost 20 million videos over six months, alongside rich features including content embeddings, diverse feedback signals, and contextual metadata. Our analysis supports the dataset's quality and diversity. The dataset's immediate impact is confirmed by its central role in the live VK RecSys Challenge 2025. VK-LSVD provides a vital, open dataset to use in building realistic benchmarks to accelerate research in sequential recommendation, cold-start scenarios, and next-generation recommender systems.
Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, we introduce NERdME: 200 manually annotated README files with over 10,000 labeled spans across 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment demonstrates that entities derived from READMEs can support artifact discovery and metadata integration.
In large-scale e-commerce platforms, Conversion Rate (CVR) prediction is crucial for recommender systems, yet existing approaches face a fundamental granularity mismatch: models operate at the item level while users purchase at the fine-grained Stock Keeping Unit (SKU) level. This mismatch discards fine-grained user-intent signals and introduces a price-inconsistency bias due to the gap between static exposure prices and actual transaction prices. While direct SKU-level modeling would resolve these issues, it is impractical for industrial deployment due to extreme data sparsity and prohibitive inference costs.
To address these challenges, we propose PSKU4Rec, a novel framework that operates at the Price SKU (PSKU) granularity, aggregating SKUs by price to retain critical price signals while reducing sparsity sixfold. PSKU4Rec consists of: (1)~a PSKU-aware prediction network that models intra-item PSKU contextual information and captures user PSKU preferences; and (2)~a PSKU-aware application module that generates personalized estimated transaction prices for pCVR refinement and enables personalized main-image display. Offline experiments on a dataset collected from the Taobao App show substantial improvements in CVR prediction accuracy and price consistency, and online A/B testing further validates the effectiveness of our approach.
Trusted multi-view classification (TMC) aims to improve prediction reliability by integrating evidence from multiple views. Existing TMC methods extract evidence from each view independently and use a regularization term to shape the evidence distribution. However, they typically enforce a uniform regularization objective across all views, overlooking critical view-specific biases: intra-view class ambiguity caused by confusable features, and inter-view quality disparities reflected in evidence uncertainty. To address these issues, we propose an adaptive regularization strategy that enhances robustness on two levels. At the intra-view level, it quantifies feature ambiguity to apply targeted relaxation to confusable classes, preventing over-penalization of inherent uncertainty. At the inter-view level, it evaluates relative view quality to impose stronger constraints on unreliable views and suppress noise from low-quality ones. Extensive experiments across multiple benchmarks demonstrate the superiority and reliability of the proposed method.
Tabular data drive most real-world machine learning applications, yet building general-purpose models for them remains difficult. Mixed numeric and categorical fields, weak feature structure, and limited labeled data make scaling and generalization challenging. To this end, we introduce Orion-Bix, a tabular foundation model that combines biaxial attention with meta-learned in-context reasoning for few-shot tabular learning. Its encoder alternates standard, grouped, hierarchical, and relational attention, fusing their outputs through multi-CLS summarization to capture both local and global dependencies efficiently. A label-aware in-context learning (ICL) head adapts on the fly and scales to large label spaces via hierarchical decision routing. Delivered as a scikit-learn–compatible foundation model, it outperforms gradient-boosting baselines and remains competitive with state-of-the-art tabular foundation models on public benchmarks, showing that biaxial attention with episodic meta-training enables robust, few-shot-ready tabular learning.
This study presents the first large-scale comparison of persuasion techniques in crowd- versus professionally written debunks. Using extensive datasets from Community Notes (CNs), EUvsDisinfo, and the Database of Known Fakes (DBKF), we quantify the prevalence and types of persuasion techniques across these fact-checking ecosystems. Contrary to the prior hypothesis that community-produced debunks rely more heavily on subjective or persuasive wording, we find no evidence that CNs contain a higher average number of persuasion techniques than professional fact-checks. We additionally identify systematic rhetorical differences between CNs and professional debunking efforts, reflecting differences in institutional norms and topical coverage. Finally, we examine how the crowd evaluates persuasive language in CNs and show that, although notes with more persuasive elements receive slightly higher overall helpfulness ratings, crowd raters are effective at penalising particularly problematic rhetorical devices.
User interactions on e-commerce platforms are inherently diverse, involving behaviors such as clicking, favoriting, adding to cart, and purchasing. The transitions between these behaviors offer valuable insight into user-item interactions, serving as a key signal for understanding evolving preferences. Consequently, there is growing interest in leveraging multi-behavior data to better capture user intent. Recent studies have explored sequential modeling of multi-behavior data, many relying on transformer-based architectures with quadratic time complexity. While effective, these approaches often incur high computational costs, limiting their applicability in large-scale industrial systems with long user sequences. To address this challenge, we propose the Transition-Aware Graph Attention Network (TGA), a linear-complexity approach for modeling multi-behavior transitions. Unlike standard transformers that treat all behavior pairs equally, TGA constructs a structured sparse graph by identifying informative transitions from three perspectives: (a)~item-level transitions, (b)~category-level transitions, and (c)~neighbor-level transitions. Built on this structured graph, TGA employs a transition-aware graph attention mechanism that jointly models user-item interactions and behavior transition types, enabling more accurate capture of sequential patterns while maintaining computational efficiency. Experiments show that TGA outperforms state-of-the-art models while significantly reducing computational cost. Notably, TGA has been deployed in a large-scale industrial production environment, where it delivers significant improvements in key business metrics.