WWW Companion '26: Companion Proceedings of the ACM Web Conference 2026


SESSION: History of the Web

Welcome from the History of the Web Track Chairs

The History of the Web track highlights how key historical decisions, institutional incentives, and socio-technical shifts have shaped the Web's evolution, and how those choices continue to influence the systems we build and rely on today. This year's program traces a connected trajectory from early information retrieval and ranking, through the rise of user experience, engagement, and platform dynamics, to the Semantic Web's vision of machine-understandable data and knowledge graphs. It then brings that historical arc into direct conversation with the current wave of large language models and agentic AI, showing how contemporary AI capabilities are deeply entangled with the Web as both infrastructure and archive: a source of training data, a medium for evaluation, and a space where social impact is felt. Across talks and the panel, the program emphasizes recurring ideas and persistent tensions, such as relevance versus reliability, openness versus governance, and scale versus accountability, and shows how historical perspective can sharpen our understanding of what is genuinely new, what is cyclical, and what remains unresolved. Ultimately, the track positions history as an analytical lens for guiding more responsible, informed, and durable choices in future Web research and innovation.

Evolving Notions of Success on the Web: From Retrieval to Experience

The early Web was centered on documents, hyperlinks, and retrieval, with success largely defined in terms of efficient access to relevant information. As the Web expanded in scale, content, and modes of interaction, its focus gradually shifted toward users and their experiences, bringing attention, engagement, and satisfaction to the forefront. This shift marked a fundamental transformation in how Web systems are designed, evaluated, and reasoned about.

This talk offers a historical perspective on how the Web's goals and success metrics evolved in response to content abundance, changing patterns of use, and rising user expectations. It traces how classical notions of relevance, rooted in information retrieval, broadened over time to encompass user intent, context, satisfaction, and trust. These evolving priorities influenced not only how information is ranked and surfaced, but also how online experiences are structured and sustained over time.

The talk connects these conceptual shifts to the emergence of experience-oriented Web platforms, where success is increasingly measured through longer-term signals of value rather than isolated interactions. It also reflects on how recent advances in AI, particularly large-scale machine learning and generative models, are reinforcing and accelerating these trends by enabling more interactive, adaptive, and intent-aware experiences. The talk concludes by discussing what this historical trajectory reveals about today's attention-driven Web, and how understanding the evolution of success metrics can inform the design and governance of future online systems.

From Lexical and Semantic to Predictive Search: Evolution or Involution?

We cover the evolution of web search from lexical and semantic search to today's common generative AI-powered chatbots, which predict knowledge. To do so, we need to dive deep into the technical mechanics of vector models; content, link and usage features; entities; embeddings; large language models; and neural information retrieval, including dense retrieval and retrieval-augmented generation (RAG).

To understand the impact of these new tools we need to look at bigger philosophical AI questions: What are we losing when we let machines "guess" for us? Are we widening the digital divide? Are we getting a little too lazy? Will we lose cognitive skills? Should we respect content authorship?

We end with predictive search, that is, predicting answers using large language models. Should we use these models as search engines? Should we trust their answers? And should we use them as relevancy judges?

On the Historic Roots of Agentic AI in Semantic Web Services

In our short talk, we point out that (1) the Semantic Web vision already provided some of the underlying ideas of today's endeavours around agentic AI, and (2) that abstractions, approaches and methods developed in the Web Services and Semantic Web Services communities can be beneficially applied to agentic AI. At the same time, we argue that looking back to history reveals certain still unresolved challenges.

SESSION: Tutorials

SESSION: Demonstrations

STIndex: A Context-Aware Multi-Dimensional Spatiotemporal Information Extraction System

Extracting structured knowledge from unstructured data still faces practical limitations: entity and event extraction pipelines remain brittle, knowledge graph construction requires costly ontology engineering, and cross-domain generalization is rarely production-ready. In contrast, space and time provide universal contextual anchors that naturally align heterogeneous information and benefit downstream tasks such as retrieval and reasoning. We introduce STIndex, an end-to-end system that structures unstructured content into a multidimensional spatiotemporal data warehouse. Users define domain-specific analysis dimensions with configurable hierarchies, while large language models perform context-aware extraction and grounding. STIndex integrates document-level memory, geocoding correction, and quality validation, and offers an interactive analytics dashboard for visualization, clustering, burst detection, and entity network analysis. In evaluation on a public health benchmark, STIndex improves spatiotemporal entity extraction F1 by 4.37% (GPT-4o-mini) and 3.60% (Qwen3-8B). A live demonstration and open-source code are available at https://stindex.ai4wa.com/dashboard.

SJBP: A Platform to Launch a Novel Jailbreak Attack on Large Language Models Based on Content's Spatial Distribution

Attackers craft malicious prompts to induce large language models (LLMs) to generate content that violates legal or ethical norms, posing serious security risks as LLMs are increasingly integrated into web applications. Prior jailbreak attacks mainly rely on obfuscating prompt intent (e.g., role-playing), but recent safety alignment has made LLMs more robust to such strategies. In this paper, we introduce SpatialJB, a novel spatial jailbreak targeting LLMs embedded in web applications. By distributing harmful content across space, SpatialJB bypasses existing safety defenses. Extensive evaluations across commercial LLMs show that SpatialJB achieves a 95% attack success rate on GPT-4, 7 times higher than widely used black-box jailbreak techniques like Base64. Furthermore, current guardrails catch under 12% of spatial attacks, underscoring SpatialJB's stealthiness. To address this emerging threat, we develop a post-defense mechanism, SpatialD, which analyzes spatial characteristics in model outputs to identify harmful content. Our demonstration further guides users in launching SpatialJB on multiple commercial LLMs (e.g., Qwen, GPT) within realistic web-based settings, illustrating its behavior and broad applicability. Our demo video can be accessed at: https://1drv.ms/v/s!ApaP6YaJA87pgk-Rk7ZCvPeYcIZa?e=J8kbg0.

SESSION: 1st Workshop on Applied AI and Multimodal Visualization Technologies (AAIMVT)

1st Workshop on Applied AI and Multimodal Visualization Technologies (AAIMVT 2026)

The 1st Workshop on Applied AI and Multimodal Visualization Technologies (AAIMVT 2026) was held in conjunction with The ACM Web Conference 2026 in Dubai, UAE. This workshop addresses the growing need for integrating applied artificial intelligence (AI) with multimodal visualization technologies in web-based environments. As AI systems increasingly operate over heterogeneous inputs, including text, images, audio, video, and sensor data, visualization and interactive web interfaces become essential for explainability, interpretability, trust, and actionable decision support. This workshop combined peer-reviewed paper presentations with structured brainstorming sessions to accelerate interdisciplinary exchange and to identify concrete, web-driven research directions for multimodal AI and visualization. The accepted papers illustrate both methodological advances and applied impact across different domains. Contributions include (i) a privacy-preserving voice-cloning and speech-synthesis platform with a unified web interface, (ii) a homonym-aware "living" cybersecurity ontology platform designed to bridge governance and operational semantics via web-based knowledge management and visualization, (iii) robust harmful-meme detection under missing modalities using shared representation learning, (iv) vision-language region understanding to reconstruct static infographics into editable Google Slides, (v) graph-neural approaches for spatial–temporal inference in indoor 5G coverage mapping from sparse UAV sensing, and (vi) a large-scale web-accessible multilingual dataset for Vietnamese medicinal biodiversity. Together, these works highlight AAIMVT's focus on web-based AI systems and the multimodal visualization techniques required to make them reliable, interpretable, and useful in real-world settings.

VOICY: A Privacy-Centric Modular Architecture for Zero-Shot Voice Cloning and Fine-Tuned Speech Synthesis

Recent advancements in neural text-to-speech (TTS) have enabled high-fidelity voice cloning, allowing for the synthesis of realistic speech from minimal reference audio. However, the majority of state-of-the-art solutions rely on cloud-based inference via proprietary APIs, raising significant data privacy concerns regarding biometric data retention and misuse. This paper presents "Voicy", a locally executed, modular TTS platform capable of both Zero-Shot Voice Cloning and Domain-Adapted Fine-Tuning. Utilizing the Coqui XTTS-v2 architecture, the system allows users to synthesize speech from short reference clips (5–10 seconds) or train robust models on custom datasets (10–30 minutes) using Low-Rank Adaptation (LoRA). We propose a comprehensive architecture that integrates secure authentication, an automated dataset preprocessing pipeline leveraging OpenAI Whisper, and an embedded analytics engine for audio similarity evaluation. Experimental results on consumer-grade hardware (Alienware 17 R4) demonstrate a cosine similarity score exceeding 0.998 for fine-tuned models, validating the efficacy of local inference for professional-grade, privacy-preserving speech synthesis. Furthermore, we demonstrate the system's scalability via consistent inference latency and usability through a unified web interface, benchmarking favourably against existing local solutions in both efficiency and perceptual quality.

Bridging the Semantic Gap: A Homonym-Aware Ontology Platform for Cybersecurity Interoperability

The cybersecurity domain exhibits a substantial semantic gap between the operational reality of threat detection and the available static compliance standards (e.g., ISO 27001, NIST SP 800-53). In addition, a disconnect is evident between high-level business goals to protect assets (e.g., Security Services such as Confidentiality) and low-level technical root causes (e.g., Vulnerabilities such as Buffer Overflow). Our previous Systematic Literature Review (SLR) identified numerous existing ontologies, along with a persistent structural deficiency: the lack of a dynamic, interoperable platform that supports safe, continuous mapping among Security Services, Security Mechanisms, Attacks, Exploited Vulnerabilities, Attackers, and Assets, with continuous community evaluation. Existing solutions limit practitioners to static, read-only standards for achieving security services, resulting in a lack of agility and a semantic void between the how and the why. This paper proposes a wiki-style, agile, and homonym-aware editing prototype system, which can lead to a novel hybrid cybersecurity ontology with a management and visualisation platform to bridge this semantic gap. We implement the Golden Chain of knowledge identified in prior SLRs (Security Services → Security Mechanisms → Attacks → Vulnerabilities → Attackers → Assets) through three core technical contributions: (a) a Concept-Term Separated Architecture that has the potential to resolve synonymy and homonymy; (b) an Additive Seeder Engine that enables systematic, preserving distributed updates; and (c) a Hybrid Contribution Model that bridges CSV → JSON batch processing with granular Web-based expert review. We evaluate this prototype by unifying a number of concepts from different sources (NIST, MITRE, ISO, CyBOK) into a single, cohesive, graph-ready knowledge base. Our results demonstrate that this "living" ontology approach can successfully interpret homonyms (e.g., "Virus" as malware vs. pathogen) and facilitate cross-domain knowledge retrieval.

Spatial-Temporal Inference of Indoor RSSI and CQI for 3D Coverage Mapping using Dynamic Graph Attention Networks

The reliability of indoor 5G connectivity remains a critical bottleneck in modern urban infrastructure, predominantly due to the complex attenuation caused by architectural elements in high-rise environments. While Unmanned Aerial Vehicles (UAVs) offer a versatile platform for external Radio Frequency (RF) sensing, deriving an accurate internal signal landscape from sparse exterior measurements presents a significant inversion problem. To address this, we introduce a novel framework leveraging a Dynamic Graph Attention Network (DGAT) to reconstruct indoor 5G signal quality. Deviating from traditional Euclidean-based connectivity models, our approach infers a latent connectivity structure governed by feature affinity, thereby capturing the non-linear propagation characteristics inherent to complex buildings. This learned topology supports a deep residual propagation encoder, which feeds into a bifurcated decoding mechanism specifically optimized for the divergent physical behaviors of Received Signal Strength Indicator (RSSI) and Channel Quality Indicator (CQI). Furthermore, prediction variance is mitigated through a meta-ensemble fusion layer. Experimental validation on a multi-level 5G dataset confirms that the proposed architecture significantly outperforms existing location-aware baselines, achieving a reduction in Root Mean Square Error (RMSE) of 76% for RSSI and 89% for CQI, with R² fidelity scores surpassing 0.99. By effectively bridging the gap between non-invasive aerial sensing and interior coverage estimation, this framework provides a robust foundation for next-generation urban wireless planning.

Viet Medi Species 2026: A Web-Accessible Multilingual Dataset for Vietnamese Medicinal Biodiversity

While online biodiversity platforms have made species data more accessible, they may not be particularly useful for those who do not speak English or for users focused on a particular region. In Vietnam, there is a wealth of knowledge about medicinal plants, but that knowledge is scattered, plants are named inconsistently, and the knowledge is poorly connected to the larger global systems for biodiversity. In this paper, we introduce Viet Medi Species 2026, a massive multilingual image dataset tailored for medicinal plants in Vietnam. Using a large-scale extraction method, the scientific names of the plants listed in the Vietnamese Medicinal Plant Catalogue were matched with the GBIF (Global Biodiversity Information Facility) taxonomic backbone, and the Vietnamese vernacular names of the species were gathered. We validated the images that were collected and compiled 310,647 images of 4,799 species from multiple kingdoms using a completely reproducible asynchronous pipeline. The dataset captures valuable metadata that is tied to Vietnamese culture and still connects to GBIF ID numbers. We also provide an online Explorer that allows access to and searching for images in Vietnamese. We hope this work provides a reusable template for integrating these undocumented languages into global datasets of biodiversity, and thus facilitates further research into identifying medicinal plants, biodiversity informatics and AI-assisted conservation.

Robust Harmful Meme Detection under Missing Modalities via Shared Representation Learning

Internet memes are powerful tools for communication, capable of spreading political, psychological, and sociocultural ideas. However, they can be harmful and can be used to disseminate hate toward targeted individuals or groups. Although previous studies have focused on designing new detection methods, these often rely on modal-complete data, such as text and images. In real-world settings, however, modalities like text may be missing due to issues like poor OCR quality, making existing methods sensitive to missing information and leading to performance deterioration. To address this gap, in this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of harmful meme detection methods in the presence of modal-incomplete data. Specifically, we propose a new baseline method that learns a shared representation for multiple modalities by projecting them independently. These shared representations can then be leveraged when data is modal-incomplete. Experimental results on two benchmark datasets demonstrate that our method outperforms existing approaches when text is missing. Moreover, these results suggest that our method allows for better integration of visual features, reducing dependence on text and improving robustness in scenarios where textual information is missing. Our work represents a significant step forward in enabling the real-world application of harmful meme detection, particularly in situations where a modality is absent.
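
The idea of projecting each modality independently into a shared space, so that a missing modality can simply be left out, can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the linear projections, the toy weights in PROJECTIONS, and the choice to fuse by averaging whichever projections are available.

```python
from typing import Dict, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def project(matrix: Matrix, features: Vector) -> Vector:
    """Linear projection: one output value per matrix row."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def shared_representation(
    inputs: Dict[str, Optional[Vector]],
    projections: Dict[str, Matrix],
) -> Vector:
    """Average the projections of whichever modalities are present (assumed fusion rule)."""
    projected = [
        project(projections[name], feats)
        for name, feats in inputs.items()
        if feats is not None  # a missing modality is skipped, not imputed
    ]
    if not projected:
        raise ValueError("at least one modality is required")
    dim = len(projected[0])
    return [sum(vec[i] for vec in projected) / len(projected) for i in range(dim)]


# Hypothetical toy weights: 3-d image features and 2-d text features -> 2-d shared space.
PROJECTIONS = {
    "image": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    "text": [[2.0, 0.0], [0.0, 2.0]],
}

both = shared_representation({"image": [1.0, 2.0, 3.0], "text": [0.5, 0.5]}, PROJECTIONS)
image_only = shared_representation({"image": [1.0, 2.0, 3.0], "text": None}, PROJECTIONS)
```

Because each modality has its own projection, the image-only call still yields a vector in the same shared space that a downstream classifier would consume.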

SESSION: The Workshop on Adverse Impacts and Collateral Effects of Artificial Intelligence Technologies (AiOfAi)

SESSION: 4th Workshop on Augmented Intelligence in Technology-Assisted Review Systems (ALTARS 2026)

Train Large-Use Small: Optimizing within the Theoretical Limits of Embedding-based Retrieval

Technology-Assisted Review (TAR) systems are becoming indispensable in specialized domains that demand extensive document retrieval with high precision. The use of Large Language Models (LLMs) has significantly expanded the potential of TAR systems to handle huge document corpora and extract relevant candidates. Semantic similarity or relevance using high-dimensional dense vector representations of input text from large models forms a core component in Information Retrieval tasks. Recent studies on the theoretical limitations of such single-vector embedding-based retrieval approaches have shown markedly poor performance of state-of-the-art models even on small retrieval-oriented datasets with relatively simple queries.

In this work, we explore whether the theoretical limitation is solely responsible for the performance degradation of large models, or whether it is also affected by the curse of dimensionality. To this end, we show that simple dimensionality reduction techniques can significantly boost the retrieval performance of models by around 7%. Even within the theoretical constraints, we show that a meta-embedding technique can further improve performance by 2%, thereby reducing the performance gap between single-vector models and multi-vector ones. In line with existing literature, we observe that larger models tend to fare much better than smaller models, and hence conclude that we should definitely "train large" (models) but "use small" (dimensionality).
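
As a rough illustration of "use small", the sketch below reduces embeddings before ranking by cosine similarity. The abstract does not name the reduction technique, so a Gaussian random projection is used here purely as a stand-in. The function names, dimensions, and seeds are hypothetical, and the random vectors replace real model embeddings.

```python
import math
import random
from typing import List

Vector = List[float]


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[Vector]:
    """Build a Gaussian random projection matrix with out_dim rows."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]


def reduce_dim(vec: Vector, projection: List[Vector]) -> Vector:
    """Map a high-dimensional vector to len(projection) dimensions."""
    return [sum(w * x for w, x in zip(row, vec)) for row in projection]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: Vector, docs: List[Vector]) -> List[int]:
    """Return document indices ordered from most to least similar."""
    scores = [cosine(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy 64-d "embeddings"; the query is a copy of document 2.
rng = random.Random(42)
docs = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(5)]
query = docs[2]

projection = make_projection(in_dim=64, out_dim=8)
small_docs = [reduce_dim(d, projection) for d in docs]
small_query = reduce_dim(query, projection)
ranking = rank(small_query, small_docs)
```

The same projection must be applied to queries and documents so that similarities are compared in one reduced space.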

Towards Automating Articles Screening Processes Using Chain-of-Thought Large Language Models

The rapid expansion of the biomedical literature has made manual screening for meta-analyses increasingly hard. While automation attempts exist, they often rely on active-learning powered binary classification, which is still time intensive; or simple summarization, frequently struggling with factual consistency and completeness of the information reported. This work presents a novel pipeline using Reasoning-Large Language Models (Chain-of-Thought) to automate the summarization of informative elements from full-text biomedical documents, focusing on PIO elements (Patient/Population, Intervention, and Outcome) in a human-in-the-loop scenario. PDF documents are segmented into structured XML semantic chunks. Then, the pipeline employs a Large Language Model, where several prompts are adaptively used based on the semantic complexity of the chunks, measured using Effective Rank. Consensus-based filtering mechanisms are then used to synthesize multiple candidate responses into a single, high-fidelity summary.
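
The abstract does not define how Effective Rank is computed or how it drives prompt selection. One common definition takes the exponential of the Shannon entropy of the normalized singular values of a matrix; the sketch below assumes that definition and assumes the singular values come from some embedding matrix of the chunk. The threshold and the prompt labels in choose_prompt are hypothetical.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(entropy) of the singular-value distribution (assumed definition)."""
    positive = [s for s in singular_values if s > 0]
    if not positive:
        return 0.0
    total = sum(positive)
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)


def choose_prompt(rank: float, threshold: float = 2.0) -> str:
    """Hypothetical routing rule: richer chunks get the more detailed prompt."""
    return "detailed" if rank > threshold else "concise"


flat = effective_rank([1.0, 1.0, 1.0, 1.0])   # energy spread evenly over 4 directions
peaked = effective_rank([5.0, 0.0, 0.0])      # all energy in a single direction
```

Under this definition, a chunk whose representation spreads evenly over k directions has an effective rank close to k, while a chunk dominated by one direction has an effective rank close to 1.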

We evaluated the pipeline on the EBM-NLP dataset using multiple LLMs. Results demonstrate that an 8B parameter model, when integrated into our pipeline, achieves BERTScore semantic precision exceeding 0.88 and BERTScore semantic recall exceeding 0.91, rivaling much larger models. A subsequent human evaluation by expert researchers confirmed the high factuality and completeness of the extracted information. These findings suggest that reasoning-enhanced pipelines are a promising tool to significantly reduce screening time while maintaining the semantic consistency required for rigorous evidence synthesis.
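
For readers unfamiliar with the metric, BERTScore precision and recall rest on greedy matching of token embeddings by cosine similarity. The sketch below shows only that core step on hand-made 2-d vectors. It is an illustration, not the evaluation code used in the paper, and it omits the contextual encoder, optional idf weighting, and baseline rescaling of the full metric.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_precision_recall(
    candidate: List[Vector], reference: List[Vector]
) -> Tuple[float, float]:
    """Precision: each candidate token takes its best match in the reference.
    Recall: each reference token takes its best match in the candidate."""
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    return precision, recall


# Toy token embeddings: the candidate covers only one of two reference tokens.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]
precision, recall = greedy_precision_recall(candidate, reference)
```

In this toy case everything the candidate says is supported by the reference, so precision is high, but half of the reference is not covered, so recall is lower.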

SESSION: 6th International Workshop on Computational Methods for Online Discourse Analysis (BeyondFacts'26)

6th International Workshop on Computational Methods for Online Discourse Analysis (BeyondFacts'26)

This workshop focuses on the convergence of computational and interdisciplinary methods for analyzing online discourse, such as claims, arguments, and opinions on contentious issues. As mis- and disinformation, bias, and echo chambers continue to grow, NLP-based techniques—including argument mining, stance detection, and fact verification—have become increasingly important. At the same time, these tasks depend on strong conceptual foundations that span communication studies, computational linguistics, and computer science. BeyondFacts aims to bring together researchers from a wide range of communities—such as the social sciences, political science, computational journalism, and computer science—to advance the machine-based interpretation and analysis of societal debates using methods from web mining, AI, and NLP.

EmeraldApp: An AI-Driven Tool for Detecting Greenwashing in Sustainability Claims

Greenwashing, i.e., presenting misleading environmental claims, undermines the trustworthiness of sustainability reporting. We present EmeraldApp, a tool designed to verify corporate sustainability claims. EmeraldApp classifies claims as greenwashing, not greenwashing, or abstains, and generates fact-based justifications. The tool supports different LLM pipelines: EM-RAG, which retrieves evidence from EmeraldDB, a vector database of ESG (Environmental, Social, Governance) report chunks; EM-KGRAG, which retrieves evidence from EmeraldGraph, a domain-specific knowledge graph; EM-HYBRID, which combines the other two retrieval pipelines; and EM-NR, without retrieval capabilities. EmeraldApp supports quick claim checks for end users, while advanced configuration enables pipeline comparison and deeper inspection of factual evidence and justification. Previously checked claims are stored in ClaimsDB and are accessible through the EmeraldApp to improve auditability and transparency.

Towards Understanding Professional AI Assistant Use: Human-in-the-Loop Topic Modeling of Multi-turn Conversations

Understanding what people use AI assistants for in the workplace is essential for assessing their impact and informing future design. We present a novel framework, HILTON (Human-In-the-Loop Topic mOdeling of multi-turn coNversations), that combines the summarization capabilities of large language models (LLMs) with unsupervised, non-parametric clustering. The framework first summarizes multi-turn conversations between users and AI assistants, then clusters these summaries into coherent topics, which are further organized hierarchically and dynamically updated as new data arrives. To ensure quality and practical relevance, we employ a human-in-the-loop process: annotators post-process discovered clusters and provide feedback, which is then used to refine topics and create evaluation data for future iterations. We further train a lightweight statistical classifier that enables efficient topic inference for unseen conversations, making the pipeline suitable for real-world deployment. Our results show that our approach produces coherent topic structures suitable for workplace analytics of AI assistants. We open-source our code for the application.

Proactive Misinformation Forecasting via Incremental Topic Detection and Sentiment Polarization

The unprecedented scale and velocity of unverified content on online social media platforms (OSMPs) threatens public discourse, emergency response, and democratic processes. This paper operationalizes a socio-cognitive hypothesis, that polarization and confirmation bias create "fertile ground" for misinformation, into an engineering framework for proactive topic risk forecasting. We introduce a semi-supervised pipeline that continuously tracks topics in a tweet stream and measures topic-specific sentiment divergence as an early warning signal. Our core methodology integrates: (i) Incremental Clustering with Burst Detection (ICBD), an online clustering algorithm that adapts to topic drift via exponentially weighted centroid updates and identifies volatility through a statistical burst detector; and (ii) a Topic-specific Sentiment Classifier (TSC) that conditions sentiment inference on topic context, reducing ambiguity in polarizing language. The resulting polarization metrics are used as predictive features in ensemble veracity models. On two large-scale COVID-19 Twitter collections, adding polarization-aware features yields consistent gains across baselines and reaches an F1-score of 0.879 with XGBoost. Beyond accuracy, we demonstrate a one-month lead-time forecasting protocol where polarization features improve F1 by 11.9%, validating their suitability for preemptive moderation and public-information interventions.
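
The two ingredients named for ICBD, exponentially weighted centroid updates and a statistical burst detector, can be sketched as follows. The abstract gives no formulas, so the update rule with smoothing factor alpha, the z-score test, and the threshold of 3.0 are assumptions chosen for illustration rather than the authors' exact method.

```python
import math
from typing import List


def update_centroid(centroid: List[float], point: List[float], alpha: float = 0.1) -> List[float]:
    """Exponentially weighted update: recent points pull the centroid by a factor alpha."""
    return [(1.0 - alpha) * c + alpha * x for c, x in zip(centroid, point)]


def is_burst(history: List[int], current: int, z_threshold: float = 3.0) -> bool:
    """Flag the current count if it lies far above the historical mean (z-score test)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = sum(history) / len(history)
    variance = sum((h - mean) ** 2 for h in history) / (len(history) - 1)
    std = math.sqrt(variance)
    if std == 0.0:
        return current > mean
    return (current - mean) / std > z_threshold


# A topic centroid drifting toward a new tweet embedding (toy 2-d vectors).
centroid = update_centroid([0.0, 0.0], [1.0, 2.0], alpha=0.5)

# Hypothetical per-window tweet counts for one topic.
calm = is_burst([10, 12, 11, 9, 10], 12)
spike = is_burst([10, 12, 11, 9, 10], 40)
```

A larger alpha lets a centroid follow topic drift faster at the cost of stability, which is the usual trade-off when tuning exponentially weighted updates.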

SESSION: Joint Workshop on Diffusion of Harmful Content on Online Web and ... (DHOW-MiLLA)

DHOW-MiLLA: Joint Workshop on Diffusion of Harmful Content on Online Web and Countering Misinformation in the Age of LLMs and Agents

With the advancement of digital technologies and gadgets, online content has become easily accessible. At the same time, harmful content has also spread widely. Different types of harmful content are present on various platforms in multiple languages. The topic of harmful content is broad and covers multiple research directions. Users of platforms are affected by all of them. In research, the different forms are mostly analysed separately, e.g. misinformation, cyber-bullying and hate speech. Most research has been conducted for only one platform, for a monolingual situation or on a particular issue. Counter-measures like blocking or down-ranking can make harmful content spreaders switch platforms and languages to continue reaching a user base. Harmful content does not only appear on social media but also on news media. Spreaders share harmful content in posts, news articles, comments and hyperlinks. There is a great need to study harmful content across platforms, languages, and topics. We brought the research on harmful content under one umbrella such that different approaches and novel methods can be shared. The workshop covers the currently ongoing issues of war and elections. We propose the workshop, DHOW: Diffusion of Harmful Content on Online Web, which brings together the research on different topics of harmful content. We expect to discuss innovative research work and future research directions.

Deception Factories: Industrial-Scale Cloaking in Harmful Synthetic Media

Generative AI has drastically lowered the cost of producing and deploying deceptive online content. Scam and disinformation operations increasingly rely on synthetic and heavily manipulated media, yet most existing work still focuses on individual artifacts or detection methods rather than how real campaigns are structured and scaled. In this paper, we analyze a large collection of real-world fraudulent and disinformation campaigns built around AI-generated and manipulated media. We observe a wide range of campaign sizes, from small operations to campaigns reaching industrial scale. We find that scaling is not achieved by creating new semantic content, but primarily by mass-producing superficial variants of a small number of templates. We further show that large campaigns systematically rely on cloaking: a relatively stable core fraudulent payload is embedded in a much larger volume of auxiliary and decoy content, which dominates the overall campaign structure and becomes the main driver of volume as campaigns grow. Overall, our results highlight a shift from isolated deepfake artifacts to industrialized deception operations, where scale is achieved through automation, reuse, and structural manipulation of content. This perspective motivates defenses that target campaign-level behavior rather than individual files.

RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking

The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA is publicly available at https://github.com/kalendsyang/RAMA.

ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms

The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes substantially improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80% accuracy under channel-specific training (with 13% improvement over BERT and F1-score of 76%). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.

Context-Aware Silhouette-Based Age Estimation for Child Sexual Abuse Material (CSAM) Detection

Experts in the Association of Internet Hotline Providers (INHOPE) work to quickly identify and remove Child Sexual Abuse Material (CSAM) from the Internet. A key factor in report classification is determining the age group of individuals in the material. We introduce a Multitask Paired Learning approach for silhouette-based age classification in CSAM detection. Unlike existing methods, our approach does not require face visibility and leverages both individual and relative human features within a photo. We tested seven architecture variants using 10-fold cross-validation on real reported materials. Results show that multitask learning reduces the error rate by 3%-9% for minor-adult classification. Further improvements are reported by the application of Pair Learning, achieving a total reduction of 14%-21%. Our method is statistically significant and improves both binary and 3-class age classification, providing a valuable tool to support expert assessments.

On VLMs for Diverse Tasks in Multimodal Meme Classification

The use of multimodal memes to spread hatred, propaganda, and violence across social and digital media necessitates effective content moderation, which can be addressed through AI-based meme analysis. In this paper, we present a comprehensive and systematic analysis of Vision-Language Models (VLMs) for disparate meme classification tasks, and introduce a novel approach: Combining VLM Explanation to Fine-tune LLMs (CoVExFiL). In the proposed CoVExFiL, we generated a VLM-based understanding of the meme images and used this information to fine-tune Large Language Models (LLMs) based on the embedded meme text. Our contributions are threefold: (1) Benchmarking VLMs using diverse prompting strategies for these sub-tasks; (2) Evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) Proposing the novel CoVExFiL approach, where detailed meme interpretations generated by VLMs are utilized to train smaller language models (LLMs), thereby significantly improving classification. Following extensive experimentation, we observed that CoVExFiL improved the baseline performance by 8.34%, 3.52%, and 26.24% for sarcasm, offensive content, and sentiment classification, respectively. These findings shed light on the capabilities and shortcomings of VLMs, while also establishing CoVExFiL as a promising strategy for advancing meme understanding. The code is available at https://github.com/gavit21/Memes-Understanding-with-VLMs. CAUTION: This paper may contain harmful content.

An Interpretable Agentic Framework for Multimodal Hate Video Analysis with Explicit Evidence Attribution

The proliferation of harmful and toxic content in social media videos poses significant risks to users and online communities. While many state-of-the-art approaches rely on supervised classifiers trained on large benchmark datasets, such models are often evaluated primarily on predictive performance and provide limited transparency for deployment scenarios where explanation and auditability are required. We present an evidence-centric multimodal video analysis system that treats visual, audio, textual, and contextual signals as independent sources of evidence rather than fused features. The system extracts object detections and on-screen text from video frames, speech transcripts from audio, and toxicity scores and named entities from text, while contextual information is obtained from a fixed offline retrieval corpus. All signals are stored in a shared evidence structure and prioritized using a deterministic ranking mechanism that emphasizes semantic contribution, explicitly models cross-modal agreement and conflict, and represents modality absence by marking non-contributory signals as null. Final decisions are produced through a robust orchestration layer that supports a fully deterministic baseline and an optional constrained large language model-based reasoner with fallback guarantees. We evaluate the system on HateMM, ImpliHateVid, and a manually annotated YTHate dataset of 200 real-world YouTube videos. Across all datasets, the system exhibits consistently high recall for hateful content, reflecting a deliberate design choice to reduce false negatives in moderation support settings. Qualitative analysis indicates improved sensitivity to implicit and context-dependent hate, including cases involving visual symbols, late-occurring speech, and cross-modal inconsistencies.

From Generation to Detection: Leveraging Empirically Derived Linguistic Hints for LLM-Based Fake News Detection

The rapid advancement and widespread use of Large Language Models (LLMs) have raised concerns about their potential to generate persuasive and deceptive content at scale. As LLM-generated text becomes increasingly indistinguishable from human writing, identifying linguistic features in LLM-generated fake news is critical for both the detection and mitigation of fake news. This research examines linguistic differences between real news and LLM-generated fake news headlines across four dimensions: toxicity, sentiment, moral framing, and lexical similarity. Using prompt engineering, we created five datasets of AI-generated fake news headlines (approximately 22,000 each) and compared them with a dataset of real news headlines of the same size. Our analysis shows that LLM-generated fake news are more toxic, negative, and subjective, and rely more heavily on authority-based language. To verify the effectiveness of these linguistic features, we conducted classification experiments using two state-of-the-art LLMs (GPT-4o-mini and GPT-5-mini) under zero-shot, few-shot, and linguistically guided conditions. Our results show that while zero-shot performance is modest (GPT-4o-mini mean F1 = 67.89%, GPT-5-mini mean F1 = 58.47%), both few-shot and linguistic hint approaches achieve consistent and robust improvements over zero-shot, with mean F1 scores in a similar range (approximately 70%), indicating that linguistic markers can be systematically leveraged to improve automated fake news detection. Overall, these findings suggest that LLMs produce consistent linguistic features and that such features can be effectively exploited in scalable fake news detection strategies.

Exploring the Domain Sensitivity of AI Chatbots in Correcting Misinformation

Misinformation arises when individuals confidently maintain incorrect factual beliefs. The issue affects social systems and is highly challenging to rectify. AI-based chatbots have been increasingly proposed as scalable corrective tools. While prior research has primarily focused on enhancing chatbot effectiveness through design features such as anthropomorphism or expertise cues, less is known about how chatbot effectiveness varies across different misinformation domains of health, science, and politics. We conducted an experimental study in which participants were exposed to misinformation from three domains (health, science, and politics) and received corrective responses from a constant AI chatbot. A mixed design with a 2 (group: experimental vs. control) × 3 (domain: health, science, political) interaction was employed, involving N = 259 participants. Results indicated that the chatbot was more effective in reducing belief in health and science misinformation than in political misinformation, with significant domain-based differences in corrective impact. These findings highlight the domain-sensitive nature of AI-based misinformation interventions and underscore the importance of contextual factors when deploying automated correction tools.

SESSION: International Workshop on Foundations and Architectures for the Agentic Web (FAAW)

The 1st International Workshop on Foundations and Architectures for the Agentic Web

The Agentic Web is emerging as billions of AI agents discover, communicate, and coordinate across the open Web, shifting from isolated models to Web-integrated entities. This workshop explores the state of the art, open challenges, and emerging research directions in the foundations and architectures of the Agentic Web. It provides a forum for researchers and practitioners to examine interoperable architectures, protocols, and standards enabling AI agents to operate as first-class Web entities. Beyond technical interoperability, it also addresses economic and societal mechanisms including reputation, governance, accountability, and large-scale coordination.

Real-Time Procedural Learning From Experience for AI Agents

Learning how to do things from trial and error in real time is a hallmark of biological intelligence, yet most LLM-based agents lack mechanisms to acquire procedural knowledge after deployment. We propose Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS), a lightweight post-training learning mechanism that stores the consequences of actions and retrieves them by jointly matching environmental and internal states of past episodes to the current state. PRAXIS augments agentic action selection with retrieved state-action-result exemplars that are generated in real time. When evaluated on the REAL web browsing benchmark, PRAXIS improves task completion accuracy, reliability, and cost efficiency across different foundation model backbones, and shows preliminary generalization to unseen tasks in similar environments. These results demonstrate that PRAXIS enables the practical adoption of AI agents in fast-evolving stateful environments by helping them learn new procedures effectively.

An Agentic System for Automated Polymer Property Prediction in Low-Code Environments

The polymer industry faces increasing complexity in material design and property characterization, often requiring advanced data handling skills that domain experts may lack. This work presents Poly-WorkFlowX, an integrated agentic system combining machine learning and low-code automation to bridge the gap between materials science and artificial intelligence. We propose an architecture where a conversational agent, powered by Large Language Models (LLMs), orchestrates data workflows based on natural language instructions. The system integrates a custom Multi-Task Learning model for polymer property prediction directly into the n8n low-code platform. This approach enables non-expert researchers to predict physico-chemical properties from molecular representations (PSMILES) and automatically generate reproducible data processing pipelines. We demonstrate the system's efficiency through a complexity analysis, highlighting the reduction in manual configuration steps required to deploy scientific workflows.

Can Large Language Models Govern Themselves? A Technical, Ethical, and Governance Perspective

The rapid deployment of Large Language Models (LLMs) in increasingly autonomous roles has raised fundamental questions about governance, accountability, and control. A recurring question is whether LLMs can govern themselves. This paper examines self-governance in LLMs from technical, ethical, and philosophical perspectives. We propose an operational definition of self-governance grounded in self-monitoring, self-regulation, and internal value alignment. We analyze architectural mechanisms that enable partial self-governance, such as internal feedback loops, embedded constraints, and meta-learning. We argue that while LLMs can implement forms of operational self-regulation, they cannot achieve full normative self-governance due to limitations in moral agency, value grounding, and accountability. We conclude that LLM self-governance should be understood as a complement to, rather than a replacement for, human-centered governance frameworks.

A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web

The evolution of Large Language Models (LLMs) and the software agents built on them (AI agents) marks a turning point in the transition from a human-centric Web to an "Agentic Web" driven by AI agents. However, for AI-Generated Content (AIGC), which is expected to dominate the Web, there is currently no mechanism for agents to verify its reliability, reproducibility, or license compliance during generation. This lack of transparency risks causing chained hallucinations and compliance violations through the reuse of AIGC. Consequently, a framework to manage the provenance and generation conditions of AIGC is essential. In this paper, we present a framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, thoughts, model information, hyperparameters, and confidence. The metadata is enveloped together with verifiable credentials to support the reliable assessment and reuse of AIGC. This framework enables efficient curation of structured AIGC and facilitates its safe use for applications such as fine-tuning and knowledge distillation.

WAB: Overcoming Memory, Network, and Security Walls in Native Agentic Browsers with WebAssembly

In the era of the agentic web, autonomous web agents are increasingly expected to execute multi-step workflows within the browser, achieving interactive latency while keeping sensitive user context local. We call a browser built for this setting a native agentic browser, which treats agent execution as a first-class workload. In practice, native agent execution within browsers repeatedly hits three runtime walls. (1) A memory wall: agents need low-tail-latency access to evolving semantic context, but browser memory and persistent storage are constrained and heterogeneous. (2) A network wall: stepwise tool invocation turns long-horizon workflows into RTT-amplified pipelines whose end-to-end latency and variance compound with each step. (3) A security wall: untrusted content and prompt injection can steer agents toward high-impact side effects without enforceable least-privilege boundaries. To tackle them, we explore WebAssembly (Wasm) as a promising solution because Wasm offers a portable execution target, a sandboxed execution environment, and efficient performance for compute-intensive kernels. Specifically, we design and implement WAB as a proof of concept, which integrates three mechanisms. (i) Context: in-browser semantic retrieval over tiered storage, loading context within Wasm memory on demand; (ii) Actions: a Wasm-based execution module that batches multiple action steps per LLM call, reducing round-trip time between the browser and cloud-hosted LLMs; (iii) Authority: an allowlisted capability API for the Wasm sandbox, with explicit grants and revocation for external accesses. We conduct a case study of WAB on a representative WebQA workflow, highlighting the potential of a Wasm-centered runtime to mitigate these walls in a coherent design.

How to Measure Node Importance in the Agentic Web

The rapid progress of large language models is giving rise to an agentic web, a paradigm in which autonomous agents, rather than humans, become the primary actors of the internet. In such systems, identifying which agents play a central role is a fundamental problem, as importance metrics govern the spread of information, influence, and other network processes [1].

Classical centrality measures, such as PageRank and betweenness centrality, were developed for human-centric networks and may fail to capture the utility-driven dynamics of agentic systems. We propose a game-theoretic perspective in which agents seek to maximize informational utility while minimizing communication costs under progressive information degradation. Within this framework, we introduce Information Spreading Betweenness Centrality (ISBC), a directed and distance-aware measure that models cumulative information decay along shortest paths.

We validate ISBC through a controlled synthetic experiment simulating LLM-mediated information propagation on news data annotated with semantic highlights. The results show that ISBC identifies vertices that are more effective in terms of both information throughput and semantic preservation than those selected by PageRank or classical betweenness centrality. These findings highlight the need for centrality measures tailored to the dynamics of autonomous agent communication.

SESSION: International Workshop on Federated Foundation Models for the Web 2026 (FL@FM)

SESSION: Graph-enhanced LLMs for Trustworthy Web Data Management (GLOW)

The First Workshop on Graph-enhanced LLMs for trustwOrthy Web data management (GLOW'26)

The widespread adoption of Large Language Models (LLMs) is reshaping how Web data is accessed, interpreted, and managed at scale. By enabling natural language interaction and flexible reasoning over heterogeneous information sources, LLMs introduce new opportunities for Web data management, while also raising fundamental challenges related to reliability, factual accuracy, explainability, and accountability, especially when mediating access to structured and semi-structured Web data. Graph-based representations, such as knowledge graphs and property graphs, offer a complementary foundation by providing explicit structure, semantics, and relational context. Integrating graph-based models with LLMs enables more grounded reasoning processes and supports validation, transparency, and reliability in Web-scale systems, reflecting a broader shift toward hybrid neural–symbolic approaches. The Graph-enhanced LLMs for Trustworthy Web Data Management (GLOW) workshop focuses on the systematic investigation of graph-enhanced LLM approaches for Web data management, emphasizing architectural, methodological, and system-level perspectives. By addressing trustworthiness and explainability within Web-scale data infrastructures, GLOW contributes to the development of frameworks for accountable and reliable Web-based intelligent systems.

SMART: Spatio-Temporal Attention-based Large Language Model for Real-Time Traffic Prediction

Traffic prediction remains challenging due to the complex spatial and temporal dependencies inherent in road networks. Recently, large language models (LLMs) have shown strong potential for traffic forecasting; however, most existing methods primarily focus on temporal patterns and fail to capture non-sequential, proximity-driven spatial interactions. As a result, these models struggle to accurately represent dynamic traffic patterns. To address these limitations, this work introduces SMART, a spatio-temporal attention-based large language model for real-time traffic prediction. SMART integrates Partially Frozen Attention (PFA) to preserve pre-trained temporal reasoning while selectively adapting to evolving and dynamic traffic conditions. It further incorporates graph-based attention, which embeds adjacency information directly into the attention mask, enabling spatially connected nodes to interact regardless of sequence order and effectively addressing non-sequential spatial dependencies. To enhance robustness, DoRA-based fine-tuning is employed to enable efficient and stable adaptation of the attention layers. Experiments on real-world traffic datasets demonstrate that SMART effectively captures complex spatio-temporal patterns and consistently outperforms the state-of-the-art models.
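
The idea of embedding adjacency information directly into the attention mask can be illustrated with a minimal sketch. This is not the SMART implementation: the function name `graph_masked_attention`, the use of plain NumPy, and the choice to add self-loops are all assumptions made for illustration. It only shows how a mask derived from a road-network adjacency matrix lets connected nodes attend to each other regardless of their position in the input sequence.

```python
import numpy as np

def graph_masked_attention(q, k, v, adjacency):
    """Scaled dot-product attention restricted to graph neighbours.

    Hypothetical helper for illustration only; q, k, v have shape
    (num_nodes, dim) and adjacency has shape (num_nodes, num_nodes).
    """
    n = adjacency.shape[0]
    # Assumption: every node may also attend to itself (self-loops).
    allowed = adjacency.astype(bool) | np.eye(n, dtype=bool)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so that softmax assigns them zero weight.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Four sensors on a line: 0 - 1 - 2 - 3 (toy adjacency, not real data).
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
x = np.random.default_rng(0).normal(size=(4, 8))
out, w = graph_masked_attention(x, x, x, adj)
```

In this toy setup, sensor 0 places zero attention weight on sensors 2 and 3 because they are not adjacent to it, whatever order the sensors appear in.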

Incorporating Graph-Constrained RAG Pipeline for Medicine data

Recent advancements in LLMs have shifted research from conventional document-based search engines to semantic, context-aware question-answering services. Understanding the relationships between the input query and related information is a crucial phase in this knowledge-intensive task. The aim of this paper is to engineer a knowledge graph-based LLM for scientific knowledge retrieval that produces fluent results with zero hallucinations. The proposed architecture incorporates a graph-constrained RAG pipeline, which serves as a natural language interface, whereas the knowledge graph is the mediator of facts and claims. We evaluated the proposed system on a curated dataset of samples, achieving retrieval precision of 71% with accurate types of KG nodes. We found that the hybrid system outperforms a standalone LLM on all evaluation metrics.

ConceptFormer: Towards Graph-Native Grounding of Large Language Models via Latent Concept Injection

Retrieval Augmented Generation (RAG) has enjoyed increased attention in the recent past, yet recent advancements in Large Language Models (LLMs) have highlighted the difficulty of ensuring trustworthiness and factuality. Current RAG methodologies often modify the internal architecture of pre-trained language models (PLMs) or rely on textifying knowledge graphs (KGs), which results in lossy linearization and context saturation. This paper introduces ConceptFormer, a neuro-symbolic approach to ground LLMs in structured knowledge from the Web of Data (e.g., Wikidata), without altering their internal structure or relying on textual input of KGs. ConceptFormer operates in the LLM embedding vector space, creating and injecting concept vectors that encapsulate the topological structure of the KG nodes directly. Trained in conjunction with a frozen LLM, ConceptFormer generates a comprehensive lookup table that maps KG nodes to their respective concept vectors. The approach aims to enhance the factual reliability of LLMs by enabling them to process these concept vectors natively, thus enriching them with structured world knowledge in a graph-native manner. Our experiments demonstrate that the addition of concept vectors to GPT-2 0.1B substantially increases its factual recall ability (Hit@10) by up to 272% when tested on sentences from Wikipedia and up to 348% on synthetically generated sentences. Even injecting only a single concept vector into the prompt increases factual recall ability (Hit@10) by up to 213% on Wikipedia sentences, significantly outperforming RAG with graph textification. This demonstrates that preserving topological structure in latent space is more effective for factuality than textual linearization, while coincidentally reducing token consumption by 130x.

Better Negatives, Better Predictions: Negative Sample Selection Strategies for Enhancing Biomedical KG Edge Classification

The selection of negative samples plays a crucial role in a wide range of machine learning algorithms and is particularly critical in edge classification tasks, where this choice has a direct impact on predictive performance. In this paper, we propose a set of strategies for generating negative edges in large, heterogeneous biomedical knowledge graphs, tailored to different link prediction scenarios. In these graphs, the absence of an observed edge does not necessarily indicate the absence of a relationship; instead, it may simply reflect missing or undiscovered knowledge. Leveraging latent-space graph embeddings, we analyze the impact of different negative sample selection strategies that account for both node types and edge semantics. Our initial experiments on two biomedical knowledge graphs demonstrate substantial improvements in classification performance, independent of the underlying predictive model, highlighting the robustness and effectiveness of the proposed approach. Results show that our strategies for generating negative edges in a knowledge graph outperform random negative sampling, yielding statistically significant improvements in balanced accuracy. Code and data for reproducing experiments are available at https://github.com/SLIMlaboratory/glow26 and https://zenodo.org/records/18074722.

(wt) GC-DPG: Graph-Constrained Dual-Phase Generation for Safe and Verifiable Chinese Medical Question Answering

Large language models (LLMs) have shown strong performance on open-domain question answering, but their tendency to hallucinate unverifiable or unsafe medical content limits their applicability in clinical decision support and patient education. In contrast, high-quality medical knowledge graphs (KGs) encode curated relations between diseases, symptoms, drugs, examinations, and diet. We propose GC-DPG (Graph-Constrained Dual-Phase Generation), a two-phase framework that tightly couples an LLM with a disease-centric KG for Chinese medical QA. In the first phase, GC-DPG performs graph-based answer planning: given a user question and a disease KG, it constructs a structured, disease-specific Answer Plan that enumerates which KG facts are permissible for each detected intent (e.g., symptoms, checks, drugs). In the second phase, GC-DPG performs graph-constrained generation and consistency checking: it converts the Answer Plan into a constrained prompt that explicitly restricts the LLM to the planned entities, then applies a bidirectional graph-answer consistency checker to compute a support ratio and detect plan-violating hallucinations. We instantiate GC-DPG on a large Chinese Disease KG and evaluate it on a disease-centric QA dataset. Empirical results show that GC-DPG substantially improves entity-level precision and reduces hallucinations compared to strong KG-RAG and LLM-only baselines, indicating that explicit graph-based planning and guardrails are crucial for making LLM-based medical QA both accurate and verifiable.
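
The abstract does not define the support ratio, so the following is only a sketch of what a bidirectional plan-versus-answer check could look like. The function `check_consistency`, the key names, and both ratio definitions are assumptions for illustration: the share of answer entities that appear in the plan (answer to graph), and the share of planned entities that the answer actually uses (graph to answer). Entity extraction is assumed to have happened upstream.

```python
def check_consistency(answer_entities, plan_entities):
    """Compare entities mentioned in an answer against a planned entity set.

    Hypothetical helper: the two ratios below are illustrative
    definitions, not the ones used by GC-DPG.
    """
    answer, plan = set(answer_entities), set(plan_entities)
    supported = answer & plan
    return {
        # Answer -> graph: how much of the answer is backed by the plan.
        "support_ratio": len(supported) / len(answer) if answer else 1.0,
        # Graph -> answer: how much of the plan the answer covers.
        "plan_coverage": len(supported) / len(plan) if plan else 1.0,
        # Entities in the answer that the plan did not permit.
        "violations": sorted(answer - plan),
    }

# Toy example with made-up entities.
plan = ["fever", "cough", "ibuprofen"]
answer = ["fever", "ibuprofen", "amoxicillin"]
report = check_consistency(answer, plan)
```

Here two of the three answer entities are in the plan, and `amoxicillin` is reported as a plan-violating mention.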

ReGA: Zero-Overhead Graph Alignment for Structural Hallucination Detection Without Generation

Large language models are increasingly deployed to query structured web data through retrieval-augmented generation over knowledge graphs. Yet even when retrieval succeeds, models routinely generate structurally inconsistent answers where correct entities are linked in wrong directions or correct roles are reversed. Classical graph isomorphism is brittle under paraphrase while neural judges based on text generation add latency, cost, and opacity. We introduce ReGA, a cascading framework that treats hallucination detection as semantic alignment plus structural verification rather than generation. For most cases, a hybrid pattern-based extractor converts text to graphs in sub-millisecond time, enabling real-time verification. Feature-based ReGA computes interpretable swap indicators in 14 ms on CPU, achieving strong discrimination on functional hallucinations (AUC = 0.90) and role reversals (AUC = 0.80). On symmetric domains where simple features fail, Deep ReGA, a neural graph matcher trained via structural energy minimization, recovers directional discrimination (AUC = 0.85), reaching 0.98 in structured domains. On a controlled probe distinguishing "Microsoft bought Activision" from "Activision bought Microsoft", Deep ReGA assigns energy 0.00 to the correct direction versus 1.82 to the inverted one. We evaluate on 10,000 samples across five hallucination categories. While an LLM judge (Llama-3.1-70B) achieves near-perfect accuracy at 1.2 seconds per sample, the ReGA cascade achieves strong structural discrimination at 14-17 ms, a 70-88x speedup suitable for real-time self-refinement during generation.
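
To make the notion of a direction swap concrete, here is a deliberately simple sketch. It is not ReGA's extractor or its feature set: the function `swap_indicator` and the triple representation are assumptions, and exact string matching stands in for the semantic alignment the abstract describes. It checks whether a claimed (subject, relation, object) triple appears in a reference graph as stated or only in the reversed direction.

```python
def swap_indicator(claim, kg_triples):
    """Classify a claimed triple against a reference set of directed triples.

    Hypothetical helper for illustration; returns "supported",
    "reversed", or "unknown".
    """
    subject, relation, obj = claim
    kg = set(kg_triples)
    if (subject, relation, obj) in kg:
        return "supported"
    # Same entities and relation, but the roles are swapped.
    if (obj, relation, subject) in kg:
        return "reversed"
    return "unknown"

# Toy reference graph built from the probe mentioned in the abstract.
kg = [("Microsoft", "bought", "Activision")]
correct = swap_indicator(("Microsoft", "bought", "Activision"), kg)
inverted = swap_indicator(("Activision", "bought", "Microsoft"), kg)
```

A check of this kind needs no text generation, which is the property that makes structural verification cheap; handling paraphrase and symmetric relations is where the learned components described in the abstract would come in.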

Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering

Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.

SESSION: The 2nd Workshop on Human-Centered Recommender Systems (HCRS)

The 2nd Workshop on Human-Centered Recommender Systems

Recommender systems shape how people discover information, form opinions, and connect with society. Yet, as their influence grows, traditional metrics, e.g., accuracy, clicks, and engagement, no longer capture what truly matters to humans. The workshop on Human-Centered Recommender Systems (HCRS) calls for a paradigm shift from optimizing engagement toward designing systems that truly understand, involve, and benefit people. It brings together researchers in recommender systems, human-computer interaction, AI safety, and social computing to explore how human values, e.g., trust, safety, fairness, transparency, and well-being, can be integrated into recommendation processes. Centered around three thematic axes (Human Understanding, Human Involvement, and Human Impact), HCRS features keynotes, panels, and papers covering topics from LLM-based interactive recommenders to societal welfare optimization. By fostering interdisciplinary collaboration, HCRS aims to shape the next decade of responsible and human-aligned recommendation research.

Post-Training Fairness Control: A Single-Train Framework for Dynamic Fairness in Recommendation

Despite growing efforts to mitigate unfairness in recommender systems, existing fairness-aware methods typically fix the fairness requirement at training time and provide limited post-training flexibility. However, in real-world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single-train framework that enables post-training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness-conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user-level regularization term that guarantees user-wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness-accuracy curves than state-of-the-art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at https://github.com/weixinchen98/Cofair.
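
For readers unfamiliar with the metric, demographic parity in its standard binary-decision form compares the rate of positive outcomes across groups. The sketch below computes that gap; the function name `demographic_parity_gap` and the toy data are assumptions, and the formulation used for Cofair in a recommendation setting may differ from this textbook version.

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between groups.

    predictions: iterable of 0/1 decisions.
    groups: iterable of group labels, aligned with predictions.
    Hypothetical helper for illustration only.
    """
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred
    rates = [positives[g] / totals[g] for g in totals]
    # A gap of 0 means every group receives positives at the same rate.
    return max(rates) - min(rates)

# Toy data: group "a" gets 3 of 4 positives, group "b" gets 1 of 4.
preds = [1, 1, 0, 1, 0, 0, 0, 1]
grps = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, grps)
```

On this toy data the gap is 0.75 - 0.25 = 0.5; a fairness-controlled model would aim to shrink this value as the requested fairness level increases.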

RecoWorld: Building Simulated Environments for Agentic Recommender Systems

We present RecoWorld, a blueprint for building simulated environments tailored to agentic recommender systems. Such environments give agents a proper training space where they can learn from errors without impacting real users. RecoWorld distinguishes itself with a dual-view architecture: a simulated user and an agentic recommender engage in multi-turn interactions aimed at maximizing user retention. The user simulator reviews recommended items, updates its mindset, and when sensing potential user disengagement, generates reflective instructions. The agentic recommender adapts its recommendations by incorporating these user instructions and reasoning traces, creating a dynamic feedback loop that actively engages users. This process leverages the exceptional reasoning capabilities of modern LLMs. We explore diverse content representations within the simulator, including text-based, multimodal, and semantic ID modeling, and discuss how multi-turn RL enables the recommender to refine its strategies through iterative interactions. RecoWorld also supports multi-agent simulations, allowing creators to simulate the responses of targeted user populations. It marks an important first step toward recommender systems where users and agents collaboratively shape personalized information streams. We envision new interaction paradigms where "user instructs, recommender responds," jointly optimizing user retention and engagement.

STCRank: Spatio-temporal Collaborative Ranking for Interactive Recommender System at Kuaishou E-shop

As a popular e-commerce platform, Kuaishou E-shop provides precise personalized product recommendations to tens of millions of users every day. To better respond to real-time user feedback, we have deployed an interactive recommender system (IRS) alongside our core homepage recommender system. This IRS is triggered by a user click on the homepage and generates a series of highly relevant recommendations based on the clicked item to meet focused browsing demands. Unlike traditional e-commerce RecSys, the full-screen UI and immersive swipe-down functionality present two distinct challenges for a regular ranking system. First, there is explicit interference (overlap or conflict) between ranking objectives, i.e., conversion, view, and swipe down, because these behaviors intrinsically co-occur under immersive browsing and swipe-down functionality. Second, the ranking system is prone to temporal greedy traps in sequential recommendation slot transitions, which are caused by the full-screen UI design. To alleviate these challenges, we propose a novel spatio-temporal collaborative ranking (STCRank) framework that achieves collaboration among multiple objectives within one slot (spatial) and across multiple sequential recommendation slots (temporal). In the multi-objective collaboration (MOC) module, we push the Pareto frontier by mitigating objective overlaps and conflicts. In the multi-slot collaboration (MSC) module, we achieve global optima over the overall sequence of slots with a dual-stage look-ahead ranking mechanism. Extensive experiments demonstrate that our proposed method brings about purchase and DAU co-growth. The proposed system has been deployed at Kuaishou E-shop since June 2025.

Fairness at a Glance: Can We Audit Model Fairness Before Training Completes?

Fairness auditing is crucial for ensuring the accountability of AI systems, but current auditing practices evaluate the fairness of models after full model training, forcing developers to complete numerous training runs before identifying those that meet fairness requirements. This post-hoc paradigm makes fairness assurance time-consuming and computationally costly. In this work, we explore whether model fairness can be estimated earlier—either before training begins or during its early stages. Our analysis across multiple datasets shows that it is difficult to predict the final fairness based solely on pre-training configurations. Fortunately, once training starts, fairness trajectories exhibit consistent and predictable patterns, with early-stage fairness strongly correlated with final fairness. These findings suggest that fairness auditing can shift from purely post-hoc evaluation to early-stage assessment. By detecting and stopping unfair models early, our approach can save 73.8% of the training cost in the fairness audit workflow.

Orthogonal Meta-learning Boosted Bayesian Optimization for Uncertain Multi-Objective Recommendation

Recommender systems (RSs) have become integral to our digital experiences, shaping how we access and engage with information in various domains. While early research focused primarily on improving recommendation accuracy, this singular focus has led to unintended consequences such as echo chambers and limited user experiences. Drawing parallels with autonomous driving, we introduce a novel framework that defines five distinct levels of autonomy for RSs, ranging from simple rule-based accuracy-objective RSs to personalized behavior-based uncertain multi-objective RSs (i.e., users may have diverse needs regarding, e.g., accuracy, diversity, and fairness). Accordingly, we propose to automatically identify and optimize multiple objectives based on individual user needs, so that RSs function as more ethical and intelligent user-centric assistants. To address the challenges in exploring such uncertainty in RSs, we present a novel Bayesian optimization (BO) method to capture users' personalized preferences for different objectives and the uncertain relationships among objectives. Moreover, to improve the efficiency and effectiveness of BO learning, we design a novel orthogonal meta-learning paradigm, speeding up the optimization process by exploiting shared knowledge across similar tasks and reducing conflicts among objectives by exploring their orthogonal information. Finally, we empirically demonstrate the effectiveness of our proposed method in optimizing uncertain multi-objectives for individual users, paving the way for more user-centric RSs.

Evaluating LLM Recommendations Under Stress

Large Language Models (LLMs) are increasingly deployed in high-stakes settings where errors are costly and decisions must be made under time pressure, ambiguity, and strict constraints. However, most existing benchmarks evaluate model performance under relaxed conditions and fail to capture how LLM behavior degrades under stress. This paper presents a structured framework for stress-testing LLM robustness across five real-world scenarios with increasing difficulty. We evaluate three widely used models—ChatGPT (GPT-5.1), Gemini 3, and Claude 4.5 Sonnet—using a 45-item rubric assessing accuracy, safety, logical consistency, and constraint adherence. To quantify robustness, we introduce a Stress Sensitivity Index (SSI) that measures performance degradation as task difficulty increases. Our results show that ChatGPT remains largely stable under stress, Gemini exhibits moderate degradation, and Claude displays substantial sensitivity, particularly in constraint-following tasks. These findings highlight the importance of stress-oriented evaluation for understanding LLM reliability prior to their deployment in real-world, high-stress human environments.

Beyond Engagement: A Human-Centered Evaluation Framework for LLM-Powered Recommender Systems

Recommender systems are at the heart of how individuals engage with information, products, and opportunities across the web. Although recent progress, including the integration of large language models (LLMs), has improved personalization and interaction, most state-of-the-art recommender systems are still assessed through metrics largely centered on engagement, such as clicks, dwell time, or conversions. Such metrics commonly fail to address what matters most from a human experience standpoint, such as trust, agency, transparency, and overall satisfaction.

In this work, we develop a human-centered assessment framework for LLM-assisted recommenders that reorients evaluation practices in long-term, engagement-focused assessment contexts towards more human-aligned frameworks. To assess the utility of our framework, we conduct a brief qualitative assessment of an LLM-assisted recommendation tool, investigating trade-offs between engagement-focused design choices and more human-centered ones. These design choices include ranking strategies, diversity concerns, trust mechanisms, and transparency techniques, which we explore in more depth within our qualitative assessment framework.

Our analysis draws attention to persistent discrepancies between engagement optimization and human-centered outcomes, and demonstrates how LLM-based recommender systems can be evaluated and designed in more socially aligned ways compared to engagement-centric approaches. We end with a discussion of the implications of our results for recommender system design and how a human-centered assessment is crucial to the next generation of socially aligned recommender systems.

Knowledge Graph Editing Model for Enhancing Recommendation Systems

Knowledge Graphs (KGs) play a pivotal role in enhancing recommendation systems by providing structured, semantically rich relationships between items and entities. However, real-world KGs often contain noisy, redundant, or task-irrelevant associations that may obscure meaningful patterns and negatively impact recommendation accuracy. To address this challenge, we propose a knowledge editing recommendation framework that refines KGs through two complementary components. The first component, Knowledge Filtering, employs diffusion models to denoise KG structures while preserving their semantic integrity. The second component, Task-Specific Triplet Reconstruction, identifies and enhances task-relevant triples through attention-based scoring and masked self-supervised reconstruction. Extensive experiments conducted on three publicly available datasets demonstrate that our framework consistently outperforms strong baselines, confirming the effectiveness of our approach in producing cleaner KGs and better user-item representations.

Learning to Alleviate Familiarity Bias in Video Recommendation

Modern video recommendation systems aim to optimize user engagement and platform objectives, yet often face structural exposure imbalances caused by behavioral biases. In this work, we focus on the post-ranking stage and present LAFB (Learning to Alleviate Familiarity Bias), a lightweight and model-agnostic framework designed to mitigate familiarity bias in recommendation outputs. LAFB models user–content familiarity using discrete and continuous interaction features, and estimates personalized debiasing factors to adjust user rating prediction scores, thereby reducing the dominance of familiar content in the final ranking. We conduct large-scale offline evaluations and online A/B testing in a real-world recommendation system, under a unified serving stack that also compares LAFB with deployable popularity-oriented remedies. Results show that LAFB increases novel watch-time share and improves exposure for emerging creators and overall content diversity, while maintaining stable overall watch time and short-term satisfaction. LAFB has already been launched in the post-ranking stage of YouTube's recommendation system, demonstrating its effectiveness in real-world applications.

SESSION: LLM & Agents for Recommendation Systems (LARS)

LLM and Agents for Recommendation Systems (LARS)

Digital marketplaces spanning e-commerce, content platforms, and service ecosystems are undergoing a fundamental shift from centralized recommendation algorithms to autonomous multi-agent coordination. Intelligent agents now plan, negotiate, and interact emergently across product discovery, personalized recommendations, fulfillment optimization, and conversational commerce. Industry deployment of these agents is accelerating. However, this shift surfaces critical challenges. Current frameworks lack interpretability for emergent coordination when multiple agents interact. Robustness guarantees fail when cascading failures propagate across multi-agent systems. Most critically, evaluation remains anchored in short-term metrics (click-through rates, conversions, session engagement) that cannot capture coordination quality, long-horizon trust, fairness across populations, or multi-stakeholder alignment beyond immediate revenue. This workshop brings together researchers and practitioners from recommender systems, information retrieval, multi-agent learning, mechanism design, and platform operations to establish principles for transparent, robust, and trustworthy agent-based systems. We invite contributions on agent architectures, coordination mechanisms, evaluation frameworks, trust and safety, and real-world deployment experiences that advance how autonomous agents can enhance rather than compromise marketplace integrity and user welfare.

LLM-augmented Hybrid Recommendation System for Structured Classification of Customs Commodity Descriptions

This paper presents a pipeline for structured information extraction and commodity classification from unstructured customs declaration texts using a sparse Mixture-of-Experts large language model with deterministic post-processing and rule-based validation, and integrates the extracted outputs into an interactive recommendation system that supports customs officers in classification and decision-making tasks. A Mixtral-8×7B decoder-only architecture is adapted to the customs domain using Low-Rank Adaptation, enabling parameter-efficient fine-tuning while processing long inputs of up to 32k tokens. On a corpus of 847 Russian-language customs product descriptions, the model produces syntactically valid JSON outputs in 98.1% of cases, with deterministic post-processing reducing the invalid rate to 1.06%. Fine-tuning with LoRA on the original dataset yields validation losses in the range 0.015–0.025, while training on an augmented dataset of approximately 106 records reduces validation loss to below 0.002, at the cost of increased training time. Inference benchmarks show that the sparse MoE model, activating 2 of 8 experts per token, achieves higher scores than dense baselines under comparable inference budgets, with gains of approximately 10–15 percentage points on reasoning-oriented tasks. The results also identify limitations related to semantic correctness, numeric and unit interpretation, routing stability, and uncertainty estimation, indicating directions for further development of language-model-based systems for regulatory and decision-support applications.
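
The phrase "activating 2 of 8 experts per token" can be illustrated with a minimal sketch of top-k routing. This is a common formulation of sparse Mixture-of-Experts gating (a softmax taken over the selected router logits only), not the paper's code; the function name and the toy logits are assumptions.

```python
import math


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the largest selected logit for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 0.2]
routes = top_k_route(logits, k=2)
```

Only the experts listed in `routes` would run for this token, and their outputs would be combined using the returned weights; the remaining six experts are skipped, which is where the inference savings come from.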

GALRec: Aligning Graph with Large Language Model for Sequential Recommendation

In recent years, Large Language Models (LLMs) have achieved significant success across various domains due to their powerful generative capabilities and impressive comprehension abilities. Consequently, many researchers have sought to integrate LLMs with recommender systems to leverage the extensive world knowledge embedded in these generative models, addressing challenges such as data sparsity. However, most LLM-assisted recommenders primarily focus on item-specific features while neglecting the latent information arising from user-item interactions, resulting in suboptimal performance. A straightforward approach is to model these interactions as graphs for input, but the heterogeneous nature of the interaction and item feature spaces makes direct fusion challenging. To address this issue, we propose GALRec, a Graph-Augmented LLM Recommender that uses a novel tuning strategy, GAL-tuning, successfully incorporating collaborative information to LLM-based sequential recommender systems. This strategy contains a graph encoder to capture complex interaction data and integrate it with an LLM predictor. To further improve the performance of this two-part recommender, we introduce a mutual alignment method, GAL-alignment, incorporating graph-to-LLM and LLM-to-graph alignment, which optimizes the framework by minimizing the gap between graphs and LLMs. Experiments show that GALRec enhances performance in two sequential recommendation tasks—item recommendation and preference prediction—across two widely-used datasets and in two well-established scenarios: full dataset fine-tuning and few-shot learning, proving the method's effectiveness and data efficiency.

Smart Book Seeker: Agent-Augmented Retrieval System for Metadata-Sparse Libraries

Many libraries catalog books with titles alone, lacking rich metadata such as subject headings or abstracts. This metadata sparsity creates a cascade of retrieval challenges: users struggle to formulate effective queries without knowing exact titles, vocabulary mismatch causes relevant books to be missed, and users often lack domain expertise to evaluate whether retrieved results truly match their needs. To address these limitations, we propose the Smart Book Seeker Agent-Augmented Retrieval System for Metadata-Sparse Libraries (SBS-AARS), a novel multi-agent collaborative architecture comprising three coordinated components. First, a User-Needs Agent clarifies ambiguous search requirements through interactive dialogue. Second, a Librarian Agent formulates and executes keyword-based retrieval to obtain candidate books. Third, a User Agent acts as a domain-knowledge-equipped proxy that collaborates with the Librarian Agent through multi-turn discussions to evaluate and select optimal results—the key innovation that compensates for users' frequent lack of expertise in assessing retrieval outcomes. Experimental evaluation on 20,000 Chinese books demonstrates substantial improvements over baseline methods: our system achieves 87% hit rate and 57% precision at top-5 results, representing 28% and 90% relative improvements over traditional keyword-based retrieval, while maintaining 55% precision and recall at top-10 results. These findings validate that multi-agent collaboration, particularly the use of an LLM-based agent as a user's knowledge proxy, effectively overcomes metadata sparsity in library book discovery.
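
The reported hit rate and precision at top-k can be computed as in the following minimal sketch. It assumes the usual definitions (a hit means at least one relevant book in the top k; precision is the share of the top k that is relevant); the function names and toy book identifiers are hypothetical, and the paper may define the metrics differently.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for items, rel in zip(retrieved, relevant)
               if any(item in rel for item in items[:k]))
    return hits / len(retrieved)


def precision_at_k(retrieved, relevant, k):
    """Mean over queries of (relevant items in the top k) / k."""
    per_query = [sum(1 for item in items[:k] if item in rel) / k
                 for items, rel in zip(retrieved, relevant)]
    return sum(per_query) / len(per_query)


# Two toy queries: ranked results and the set of truly relevant books for each.
retrieved = [["b1", "b2", "b3"], ["b4", "b5", "b6"]]
relevant = [{"b2"}, {"b9"}]
hit_rate = hit_rate_at_k(retrieved, relevant, k=3)
precision = precision_at_k(retrieved, relevant, k=3)
```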

Ethical Decision-Making under Moral Uncertainty

Decision-making under uncertainty is central to many real-world applications. The multi-armed bandit (MAB) framework offers an effective model for balancing exploration and exploitation in such environments. In practice, decisions must frequently balance conflicting moral principles, introducing moral uncertainty, where it is unclear which ethical framework (e.g., utilitarianism, deontological ethics) should be prioritized. This paper proposes a novel algorithm that incorporates moral uncertainty into the MAB framework. The algorithm extends the Upper Confidence Bound (UCB) approach by assigning dynamic moral weights to competing ethical frameworks, allowing it to adapt as it learns from outcomes. We derive theoretical results, including upper bounds on regret, and discuss the performance of the algorithm under different levels of moral uncertainty. Our results show that the proposed algorithm successfully balances ethical considerations and performance, offering a flexible solution for decision-making under moral uncertainty in a range of applications.
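
As a rough illustration of the idea, the sketch below runs the standard UCB1 rule on a scalar reward formed as a weighted sum of per-framework rewards. It is not the paper's algorithm: the weights here are fixed, whereas the abstract describes dynamic weights whose update rule is not given, and the function names, the deterministic toy rewards, and the exploration constant are all assumptions.

```python
import math


def ucb_select(counts, means, t, c=2.0):
    """UCB1: try every arm once, then pick the highest mean plus bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_moral_ucb(arm_rewards, weights, rounds):
    """arm_rewards[a][f] is the toy, deterministic reward of arm a as judged
    by ethical framework f; weights[f] is the credence placed in framework f."""
    n_arms = len(arm_rewards)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        arm = ucb_select(counts, means, t)
        # Collapse the per-framework rewards into one scalar.
        reward = sum(w * r for w, r in zip(weights, arm_rewards[arm]))
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts


# Arm 0 pleases only framework 0, arm 1 only framework 1, arm 2 is a compromise.
arm_rewards = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.8]]
weights = [0.5, 0.5]  # hypothetical equal credence in both frameworks
counts = run_moral_ucb(arm_rewards, weights, rounds=1000)
```

With equal weights the compromise arm has the highest weighted reward (0.8 versus 0.5), so it ends up pulled most often; shifting the weights strongly towards one framework would favor the arm that framework prefers.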

Triple Modality Fusion: Aligning Visual, Textual, and Graph Data for LLM-based Multi-Behavior Recommender Systems

Large language models (LLMs) and Vision Language Models (VLMs) have recently transformed recommender systems by leveraging textual and visual information to capture user behaviors and item characteristics for recommendation. However, relying solely on these modalities overlooks the rich relational data that underpins user–item interactions. In this paper, we present Triple Modality Fusion (TMF), an end-to-end framework that seamlessly integrates visual, textual, and user behavior graph data into an LLM-based recommender system. TMF encodes multi-behavior signals as graph embeddings along with textual and visual embeddings. A modality fusion module is built with self-attention and cross-attention mechanisms to project disparate modality representations into a unified embedding space. We fine-tune an LLM-based recommender with the token-aware prompt designed for multi-behavior recommendations and convert user behavior sequences into natural language prompts. Extensive evaluations on three benchmark datasets demonstrate that TMF significantly outperforms state-of-the-art models, delivering robust recommendation quality. TMF has been deployed in the Walmart recommendation production environment to generate item candidates for marketing campaigns, reaching 19.7% and 25.5% improvement in Electronics and Sports categories.

AID: Hypothesis-Grounded Probing for Conversational Recommendation

In many enterprise settings, state-of-the-art conversational recommendation systems often underperform, not due to limitations in their underlying architectures, but because users lack the ability to formulate effective queries. This information asymmetry, that is, users lacking visibility into what knowledge exists in the system, leads them to ask structurally ambiguous queries that map to multiple interpretations, contributing to generic responses that often fail to address users' actual needs.

To solve this, we propose Assisted Intelligent Dialogue (AID), a knowledge-graph conditioned, multi-agent framework that treats clarification as a structured decision problem. In our setting, the recommended items are ranked solution bundles: the smallest set of documents and steps needed to resolve the user's issue, while keeping the number of turns and probes low. Our key insight is that ambiguity is often structural, and not always linguistic. For instance, a query like "my application is slow" may map to distinct graph regions (resource contention, bandwidth constraints, configuration errors), each requiring different diagnostic paths. In such scenarios, AID reasons over these competing hypothesis subgraphs, probing only when multiple interpretations remain plausible, with targeted questions designed to collapse the hypothesis space efficiently.

The framework orchestrates four specialized agents through a stateful dialogue loop with adaptive retrieval. We introduce two novel metrics: Hypothesis Resolution Rate (HRR) for disambiguation success, and Unnecessary Probe Rate (UPR) for probing waste.

Our experiments on enterprise document recommendation (2,169 documents, 2,156 queries) show that AID resolves 89% of ambiguous queries within two probes, improving answer completeness by 34% over a single-turn RAG baseline. AID is deployed at Cisco, handling real customer queries daily, with internal testing showing improved resolution outcomes. This work provides guidance for deploying conversational recommendation systems that bridge the gap between what users ask and what knowledge bases contain.

LLMs as Orchestrators: Constraint-Compliant Multi-Agent Optimization for Recommendation Systems

Recommendation systems must optimize multiple objectives while satisfying hard business constraints such as fairness and coverage. For example, an e-commerce platform may require every recommendation list to include items from multiple sellers and at least one newly listed product; violating such constraints, even once, is unacceptable in production. Prior work on multi-objective recommendation and recent LLM-based recommender agents largely treat constraints as soft penalties or focus on item scoring and interaction, leading to frequent violations in real-world deployments. How to leverage LLMs for coordinating constrained optimization in recommendation systems remains underexplored. We propose DualAgent-Rec, an LLM-coordinated dual-agent framework for constrained multi-objective e-commerce recommendation. The framework separates optimization into an Exploitation Agent that prioritizes accuracy under hard constraints and an Exploration Agent that promotes diversity through unconstrained Pareto search. An LLM-based coordinator adaptively allocates resources between agents based on optimization progress and constraint satisfaction, while an adaptive ε-relaxation mechanism guarantees feasibility of final solutions. Experiments on the Amazon Reviews 2023 dataset demonstrate that DualAgent-Rec achieves 100% constraint satisfaction and improves Pareto hypervolume by 4–6% over strong baselines, while maintaining competitive accuracy–diversity trade-offs. These results indicate that LLMs can act as effective orchestration agents for deployable and constraint-compliant recommendation systems.
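
Pareto hypervolume, the metric reported above, measures the region of objective space dominated by a solution set relative to a reference point. The sketch below computes it for two maximized objectives (for instance accuracy and diversity); the function name, the reference point, and the toy solutions are assumptions, and the paper's exact setup may differ.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded below by `reference`,
    with both objectives to be maximized."""
    rx, ry = reference
    # Keep only points that strictly improve on the reference in both objectives.
    candidates = [(x, y) for x, y in points if x > rx and y > ry]
    # Sweep from the largest first objective to the smallest.
    candidates.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ry
    for x, y in candidates:
        if y > best_y:
            # Add the horizontal strip that this point newly dominates.
            area += (x - rx) * (y - best_y)
            best_y = y
    return area


# Three mutually non-dominated toy solutions plus one dominated solution.
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
hv = hypervolume_2d(front, reference=(0.0, 0.0))
hv_with_dominated = hypervolume_2d(front + [(1.0, 1.0)], reference=(0.0, 0.0))
```

Dominated solutions add nothing to the value, so a larger hypervolume reflects a front that is better or more spread out, which is why the metric suits accuracy-diversity comparisons.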

RobustExplain: Evaluating Robustness of LLM-Based Explanation Agents for Recommendation

Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems, acting as explanation agents that reason over user behavior histories. While prior work has focused on explanation fluency and relevance under fixed inputs, the robustness of LLM-generated explanations to realistic user behavior noise remains largely unexplored. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences, raising concerns about explanation stability and user trust. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations. RobustExplain introduces five realistic user behavior perturbations evaluated across multiple severity levels and a multi-dimensional robustness metric capturing semantic, keyword, structural, and length consistency. Our goal is to establish a principled, task-level evaluation framework and initial robustness baselines, rather than to provide a comprehensive leaderboard across all available LLMs. Experiments on four representative LLMs (7B–70B) show that current models exhibit only moderate robustness, with larger models achieving up to 8% higher stability. Our results establish the first robustness benchmarks for explanation agents and highlight robustness as a critical dimension for trustworthy, agent-driven recommender systems at web scale.

Campaign-2-PT-RAG: LLM-Guided Semantic Product Type Attribution for Scalable Campaign Ranking

E-commerce campaign ranking models require large-scale training labels indicating which users purchased due to campaign influence. However, generating these labels is challenging because campaigns use creative, thematic language that does not directly map to product purchases. Without clear product-level attribution, supervised learning for campaign optimization remains limited. We present Campaign-2-PT-RAG, a scalable label generation framework that constructs user–campaign purchase labels by inferring which product types (PTs) each campaign promotes. The framework first interprets campaign content using large language models (LLMs) to capture implicit intent, then retrieves candidate PTs through semantic search over the platform taxonomy. A structured LLM-based classifier evaluates each PT's relevance, producing a campaign-specific product coverage set. User purchases matching these PTs generate positive training labels for downstream ranking models. This approach reframes the ambiguous attribution problem into a tractable semantic alignment task, enabling scalable and consistent supervision for downstream tasks such as campaign ranking optimization in production e-commerce environments. Experiments on internal and synthetic datasets, validated against expert-annotated campaign–PT mappings, show that our LLM-assisted approach generates high-quality labels with 78–90% precision while maintaining over 99% recall.

TCR: A Trust-Aligned Text-Only Cross-Attention Ranker Distilling Multimodal LLM Reasoning for Recommendation Agents

Large language and vision–language models enable semantic reasoning but remain impractical for real-time recommendation and search systems. We present TCR (Text-Only Cross-Attention Ranker), a lightweight DeBERTa-based re-ranking agent distilled from multiple vision–language large language models through multi-teacher soft-label fusion. TCR compresses multimodal reasoning into a compact text-only model fine-tuned with LoRA, achieving a 98.5 percent parameter reduction and maintaining production-grade latency while retaining near-teacher ranking quality on 4.4 million real-world query–content pairs. By aligning model calibration and transparency with user-perceived trust, TCR bridges deep model reasoning and deployable, trust-aware recommendations for creative content and interactive search. This work demonstrates how multi-teacher distillation and parameter-efficient fine-tuning make human-aligned, low-latency recommendation systems feasible at production scale.

SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

Large Language Models (LLMs) can generate creative and engaging narratives from user-specified input, but maintaining coherence and emotional depth throughout these AI-generated stories remains a challenge. In this work, we propose SCORE, a framework for Story Coherence and Retrieval Enhancement, designed to detect and resolve narrative inconsistencies. By tracking key item statuses and generating episode summaries, SCORE uses a Retrieval-Augmented Generation (RAG) approach to identify related episodes and enhance the overall story structure. Experimental results from testing multiple LLM-generated stories demonstrate that SCORE significantly improves the consistency and stability of narrative coherence compared to baseline GPT models, providing a more robust method for evaluating and refining AI-generated narratives.

SESSION: Fourth International Workshop on Multimodal Content Analysis for Social Good (MM4SG)

SESSION: Third International Workshop on Prompt Engineering for Pre-Trained Language Models (PromptEng)

[PromptEng] Third International Workshop on Prompt Engineering for Pre-Trained Language Models

The recent achievements and availability of Large Language Models have paved the road to a new range of applications and use-cases. Pre-trained language models are now being involved at scale in many fields where they were previously absent. More specifically, the progress made by causal generative models has opened the door to using them through textual instructions, a.k.a. prompts. Unfortunately, prompt effectiveness is highly dependent on the exact phrasing used, and therefore practitioners need to adopt fail-retry strategies. Based on the success of the past editions, this third international workshop on prompt engineering gathers practitioners (both from Academia and Industry) to exchange good practices, optimizations, results, and novel paradigms for the design of efficient and safe prompts.

SESSION: The 2nd International Workshop on Social Science Meets Web Data: ... (R2CASS)

Social Bias Benchmarks for Large Language Models

Pretrained multilingual models exhibit social biases similar to those observed in models trained on English-only data. This presentation is based on a systematic review of recent research that extends bias evaluation and mitigation beyond English, with a focus on multilingual and non-English contexts. We analyze existing benchmarks through the lenses of linguistic diversity, cultural awareness, and the evaluation metrics and mitigation strategies they employ. Our review highlights key gaps in prevailing methodological choices, while also cataloguing common challenges and solutions in adapting bias benchmarks across languages and cultures.

SESSION: 4th International Workshop on AI and Semantic Technologies ... (SemTech)

SemTech 2026: The 4th International Workshop on AI and Semantic Technologies for the Scientific, Technical, and Legal Web

The rapid expansion of scientific, technical, and legal data on the Web, ranging from patents and technical reports to scholarly articles, presents significant challenges for large-scale semantic processing. These data sources are inherently heterogeneous and semi-structured, often embedding complex entities such as figures and tables that cannot directly be processed by machines. SemTech 2026 addresses these challenges by exploring the intersection of Semantic Web technologies, Natural Language Processing (NLP), and Large Language Models (LLMs). In particular, the workshop emphasizes research at the intersection of symbolic AI and LLMs, investigating how structured knowledge representations and reasoning can enhance the interpretability, robustness, and reliability of data-driven language models.

The LOPE Method: Improving Consistent Property Extraction for Scientific Knowledge Graphs Using LLMs

In the era of Generative AI, Scientific Knowledge Graphs (SKGs) have gained substantial importance as they provide a structured data foundation for fact grounding and scientific verification. They have become a cornerstone for Retrieval-Augmented Generation (RAG) and help detect and mitigate hallucinations of Large Language Models (LLMs), one of the major challenges for creating reliable outputs. However, creating comprehensive, content-rich SKGs remains a significant challenge, as current automated methods often fail to capture the semantic depth required to describe research contributions accurately. Conversely, manual crowdsourcing approaches are often time-consuming and error-prone, leading to semantic inconsistencies in the data. To address these limitations, we present the LOPE (LLM-driven Ontology-based Property Extraction) method. Our automated approach advances LLM-based property extraction by combining semantically optimized prompting with a high-performance open-weight model and a vector-based ontology matching step. By aligning extracted terms to a standardized vocabulary, our solution improves semantic consistency compared to existing crowdsourcing approaches, thereby increasing the machine actionability and interoperability of the resulting SKG for downstream scientific analysis. The paper concludes with an evaluation using a validated LLM judge, demonstrating that LOPE outperforms baseline methods by a highly significant margin.

ORKG Properties Ontology Consolidated: LLM-Driven Refinement of Crowdsourced Knowledge for Machine-Actionability

In the era of Generative AI, Scientific Knowledge Graphs (SKGs) have become instrumental for grounding Retrieval-Augmented Generation (RAG), especially for mitigating hallucinations. As a prominent example, the Open Research Knowledge Graph (ORKG) serves as a foundational element of emerging research data infrastructures. It distinguishes itself by modeling not only metadata but also the semantic content of research publications as fine-grained triples. However, the quality and utility of these triples heavily depend on the predicates defined in the ORKG ontology. Currently, the ontology suffers from quality degradation inherent to uncontrolled crowdsourcing, such as widely duplicated properties, inconsistent naming conventions, and ambiguous semantics, thereby inhibiting machine-actionability. In this paper, we introduce the Consolidated ORKG Properties Ontology (OPO-Consolidated), a consolidated schema designed to resolve these inconsistencies. We present a semi-automated workflow that combines Large Language Models (LLMs) with human engineering to systematically clean and restructure the existing property set. Our evaluation, utilizing an established Gold Standard dataset of 153 research papers, demonstrates that OPO-Consolidated substantially improves schema conciseness (by resolving synonymy and redundancy) and semantic consistency (by enforcing uniform naming conventions) while maintaining the valuable semantic coverage of the data. Furthermore, we show that OPO-Consolidated ensures backward compatibility with existing data, providing a seamless migration path while establishing a machine-actionable foundation for established ORKG comparisons and future downstream tasks.

Beyond the Rules: Understanding the Design Logic of Internet Standards

RFCs form the backbone of Internet protocols, yet the reasoning behind their design—shaped by years of debate, trade-offs, and practical compromises within IETF working groups (WGs)—remains buried in informal discussions and inaccessible to most. These rationales are known only to protocol authors, leaving learners with limited understanding and implementers prone to misinterpreting ambiguous specifications. To address this gap, we present a Retrieval-Augmented Generation (RAG) framework tailored for RFC design rationale extraction and generation. Our system retrieves rationale-rich context from extensive IETF email archives and WG deliberations, then generates clear explanations of the motivations and constraints behind protocol design decisions. By making this knowledge explicit, our work marks a milestone toward transparent Internet standards, enabling deeper understanding and more accurate implementations. Our tailored retriever improves targeted email discovery, especially at larger search scales, while returning much shorter contexts to reduce token cost for RAG.

The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite being foundational to more complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks by systematically varying option-label formats (alphabetic, numeric, Roman) while preserving semantic equivalence across four experimental paradigms. (1) With explicit instructions, semantically invariant label changes induce large performance shifts (e.g., -30.45% for Roman versus numeric), revealing strong instruction-format bias. (2) Removing instructions further degrades performance (up to -10.84%) and amplifies label sensitivity, highlighting the importance of explicit guidance. (3) When option content is removed, models fail to consistently meet random-choice baselines except with numeric labels, indicating weak adherence to atomic directives under semantic underspecification. (4) Three-shot exemplars yield no significant gains in robustness or output fidelity, and generation analyses highlight persistent label violations, particularly for non-numeric formats. Although larger models achieve higher overall accuracy, instruction adherence remains inconsistent across scales, demonstrating that improved task performance does not entail reliable instruction-following. These findings expose limitations of current instruction-tuning paradigms and motivate evaluation methods and training strategies that explicitly target atomic instruction-following.
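
The label-format manipulation described above can be illustrated with a minimal sketch. The abstract does not give the prompt template, so the function names (`make_labels`, `build_prompt`) and the instruction wording below are hypothetical; only the three label formats (alphabetic, numeric, Roman) and the with/without-instruction conditions come from the text.

```python
# Hypothetical sketch: render the same options under different label formats.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]


def make_labels(n, style):
    """Return n option labels in the requested style."""
    if not 1 <= n <= len(ROMAN):
        raise ValueError("this sketch supports 1 to 10 options")
    if style == "alphabetic":
        return [chr(ord("A") + i) for i in range(n)]
    if style == "numeric":
        return [str(i + 1) for i in range(n)]
    if style == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label style: {style}")


def build_prompt(question, options, style, with_instruction=True):
    """Build a multiple-choice prompt; option content is identical across styles."""
    labels = make_labels(len(options), style)
    lines = [question]
    lines += [f"{label}. {option}" for label, option in zip(labels, options)]
    if with_instruction:
        # Assumed instruction wording, not taken from the paper.
        lines.append(
            "Answer with only the label of the correct option ("
            + ", ".join(labels) + ")."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    for style in ("alphabetic", "numeric", "roman"):
        print(build_prompt("2 + 2 = ?", ["3", "4", "5", "22"], style))
        print()
```

Because only the labels change, any accuracy difference between the rendered prompts can be attributed to label format, not to the question content.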

Saliency-Guided Embedding Alignment for Query-to-Document Legal Case Retrieval

Legal Case Retrieval (LCR) aims to find relevant cases given a query. Most existing approaches use a Document-to-Document (D2D) paradigm, treating a full legal document as the query. In real scenarios, however, users provide short descriptions, forming a Query-to-Document (Q2D) paradigm where D2D-trained embeddings struggle due to length mismatch. To address this, we propose a Q2D-oriented framework that improves short-query and long-document matching through segment selection and semi-supervised training. We introduce a Chunk Saliency Assessor (CSA) to identify legal document segments most relevant to a query. Building on this, we design a two-stage fine-tuning strategy: (1) Legal Inverse Cloze Task (LICT) to strengthen understanding of legal structures, and (2) Legal Content-Aligned Embedding Tuning (LCAET) to align query and document semantics. Experiments across multiple crime categories, with category-specific fine-tuning, show significant improvements in retrieval accuracy. Ablation results confirm the effectiveness of our alignment mechanism, showing that it precisely identifies query-relevant segments and bridges the semantic gap.

SESSION: The 16th Temporal Web Analytics Workshop (TempWeb)

SESSION: Trustworthy Foundation Models for Web Intelligence: Causal Perspectives and Challenges (TrustFM)

Trustworthy Foundation Models for Web Intelligence: Causal Perspectives and Challenges

Foundation Models (FMs) are increasingly underpinning critical Web applications, from search and recommendation systems to social media analytics. Ensuring the trustworthiness of these models—covering aspects like fairness, transparency, causality, and robustness—is paramount, especially when trained on heterogeneous, dynamic, and massive web-scale data. This workshop provides a focused, cross-disciplinary forum to explore the emerging challenges in this space, with a specific emphasis on Causal Reasoning as a principled framework for enhancement and evaluation.

Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild

As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence.

This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples "authenticity" from "attribution" to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy.

Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance (r=-0.10). Instead, "fake news familiarity" emerges as a candidate mediator (r=0.35), suggesting that exposure may function as adversarial training for human discriminators. We identify a "fluency trap" where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text.

These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.

LLM and AI Agent: Golden Rule is All We Need

We propose that treating others as one would wish to be treated (the Golden Rule) can serve as a universal foundation for moral reasoning. We first formalize this principle within game-theoretic frameworks and show that it enables the derivation of core ethical values such as fairness, honesty, and compassion. We demonstrate that adherence to the Golden Rule promotes cooperative behavior and social equilibria. We also model how the rule can accommodate diverse moral perspectives and cultural variation, and explore its usefulness in ethical decision-making within a reinforcement learning system. Finally, we argue that the Golden Rule offers an interpretable foundation for aligning autonomous agents with human values.

Hierarchical Prompt Contrastive Learning for Weakly Supervised Histopathology Segmentation

Weakly supervised histopathology segmentation, which generates pixel-level masks from image-level labels, is an active topic in the computer vision community because pixel-level annotations are expensive and time-intensive. Inter-class homogeneity and intra-class heterogeneity are challenging issues for histopathological image segmentation. Recent works addressing these issues mainly focus on learning intra-class similarity; as a result, spurious features of each category are captured and the discriminative ability of the feature representation for each category decreases. In this paper, we propose a hierarchical prompt contrastive method for histopathological image segmentation with a hierarchical text prompt contrastive loss and a hierarchical class prompt contrastive loss. Specifically, for the hierarchical text prompt contrastive loss, we align the features of different layers from the image encoder to the adaptive text prompt for each category by adding linear layers following a frozen text encoder. We then generate several classification activation maps to produce high-confidence pseudo segmentation masks, which are aligned to text features for contrastive learning with the foundation model to generate more confident segmentation masks. For the hierarchical class prompt contrastive loss, the subclass prompts are obtained by dividing each category into several clusters. An intra-class contrastive strategy and an inter-class contrastive strategy are then developed by aligning the subclass prompt to the image. These contrastive losses are adopted to guarantee that the structures within each category are preserved and the discriminative ability of different categories is improved. Extensive experiments show that the proposed method outperforms the state-of-the-art methods on the popular benchmark datasets.

TRACE: Transparent Web Reliability Assessment with Contextual Explanations

In an era of AI-generated misinformation flooding the web, existing tools struggle to empower users with nuanced, transparent assessments of content credibility. They often default to binary (true/false) classifications without contextual justifications, leaving users vulnerable to disinformation. We address this gap by introducing TRACE: Transparent Reliability Assessment with Contextual Explanations, a unified framework that performs two key tasks: (1) it assigns a fine-grained, continuous reliability score (from 0.1 to 1.0) to web content, and (2) it generates a contextual explanation for its assessment. The core of TRACE is the TrueGL-1B model, fine-tuned on a novel, large-scale dataset of over 140,000 articles. This dataset's primary contribution is its annotation with 35 distinct continuous reliability scores, created using a Human-LLM co-creation and data poisoning paradigm. This method overcomes the limitations of binary-labeled datasets by populating the mid-ranges of reliability. In our evaluation, TrueGL-1B consistently outperforms other small-scale LLM baselines and rule-based approaches on key regression metrics, including MAE, RMSE, and R². The model's high accuracy and interpretable justifications make trustworthy information more accessible. To foster future research, our code and model are made publicly available at https://github.com/zade90/TrueGL.

Multi-Branch Cooperation Networks for Enhanced Click-Through Rate Prediction in Large-Scale E-Commerce Search

Existing Click-Through Rate (CTR) prediction models use various feature interaction techniques, each with unique strengths, but relying on a single type limits their ability to capture complex relationships. Recent research shows that effective CTR models often combine an MLP network with a dedicated feature interaction network in a two-parallel structure. However, the interplay and cooperative dynamics between different streams or branches remain under-researched. In this work, we introduce a novel Multi-Branch Cooperation Network (MBCnet) which enables multiple branch networks to collaborate with each other for better complex feature interaction modeling. Specifically, MBCnet consists of three branches: the Extensible Feature Grouping and Crossing (EFGC) branch, which promotes the model's memorization ability of specific feature combinations, and the low-rank Cross Net branch and Deep branch, which enhance explicit and implicit feature crossing for generalization. A novel cooperation scheme across these branches is proposed based on two formulated objectives: branch co-teaching that encourages well-learned branches to support poorly-learned ones on specific training samples, and moderate differentiation that encourages branches to maintain a reasonable level of difference in their feature representations on the same inputs. This cooperation strategy improves learning through mutual knowledge sharing and boosts the discovery of diverse feature interactions across branches. Extensive experiments on large-scale industrial datasets and an online A/B test demonstrate MBCnet's superior performance.

Struggling at the Start: Structural Causes of Decoding Difficulty in Code Generation

Large Language Models (LLMs) demonstrate strong performance in code generation, yet generation errors remain prevalent. Prior analyses primarily focus on final outputs and aggregate performance metrics, with limited examination of the decoding process that gives rise to these outputs. To address this, we conduct a systematic token-level comparison between natural language and code generation, and identify consistent differences in their decoding behaviors. We find that decoding difficulty in code is not uniformly distributed, but is concentrated at structurally critical positions, particularly at line-initial tokens. Based on these observations, we propose an interpretation in which decoding difficulty is associated with increased predictive uncertainty at high-level structural decision points during generation. To probe this interpretation, we introduce a lightweight prompt-level intervention that provides structural guidance, enabling a controlled diagnostic analysis without modifying LLMs or decoding strategies. Experiments on HumanEval and MBPP show that this intervention consistently reduces predictive entropy at line-initial positions, highlighting the localized nature of decoding difficulty and motivating future work on structure-aware code generation.
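
As a rough illustration of the measurement this abstract describes, the sketch below compares mean predictive entropy at line-initial tokens with the mean over all other positions. It is a toy under stated assumptions, not the authors' pipeline: the function names are hypothetical, a token counts as line-initial when it is the first token or follows a token ending in a newline, and the next-token distributions are assumed to be given as plain probability lists.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_entropy(tokens, distributions):
    """Return (mean entropy at line-initial tokens, mean entropy elsewhere).

    tokens[i] is the token that was generated at step i, and
    distributions[i] is the model's probability distribution at that step.
    """
    line_initial, other = [], []
    at_line_start = True  # the first generated token starts a line
    for token, dist in zip(tokens, distributions):
        bucket = line_initial if at_line_start else other
        bucket.append(entropy(dist))
        # Assumption: a token ending in "\n" closes the current line.
        at_line_start = token.endswith("\n")

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(line_initial), mean(other)


if __name__ == "__main__":
    tokens = ["def", " f", "():\n", "    return", " 1"]
    uncertain = [0.25, 0.25, 0.25, 0.25]  # toy high-entropy distribution
    confident = [1.0]                     # toy zero-entropy distribution
    dists = [uncertain, confident, confident, uncertain, confident]
    print(split_entropy(tokens, dists))
```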

SESSION: 2nd International Workshop on Transformative Insights in Multi-faceted Evaluation (TIME)

Rethinking Object Detection and Tracking

Recent years have witnessed a profound transformation in object detection and tracking, driven by advances in transformers, diffusion models, multimodal learning, and large-scale pretraining. Beyond performance gains, the field is undergoing a conceptual shift, from closed-set, task-isolated pipelines toward open-world, multi-task, and semantically grounded visual perception systems. This survey provides a review of very recent object detection and tracking research, systematically analyzing more than one hundred representative works across 2D, 3D, multi-view, multimodal, and vision-language settings. By consolidating models, datasets, evaluation protocols, and targeted challenges, we expose cross-task patterns that are often overlooked in existing surveys. Our analysis shows several emerging trends: the convergence of detection and tracking into unified formulations, the growing role of generative and diffusion-based temporal modeling, the rise of open-vocabulary and language-conditioned tracking, and the increasing importance of uncertainty modeling and multimodal fusion in 3D and adverse environments. In addition, we provide a quantitative analysis of dataset usage, evaluation metrics, and challenge prevalence over time, highlighting how benchmark choices and metric design shape research directions. The survey concludes by identifying open problems and underexplored intersections, such as scalable open-world tracking, unified evaluation across modalities, and principled handling of uncertainty and semantics, that point toward the next phase of visual perception research. By offering both breadth and synthesis, this work aims to serve as a reference and a roadmap for future advances in object detection and tracking.

How Quantum Period Finding Breaks Rivest Shamir Adleman Algorithms

The Rivest Shamir Adleman (RSA) algorithm underpins much of modern digital security. It protects messages, passwords, and web services. However, advances in quantum computing pose a long-term threat to RSA, as quantum algorithms can exploit its mathematical structure. In this paper, we analyze how quantum computation can undermine the RSA algorithm using an explainable, dictionary-based quantum emulation framework. In this approach, each quantum state is represented as dict[k] = a, where k is a bitstring and a is a complex amplitude, enabling transparent tracking of quantum state evolution. We emulate key quantum gates through update rules. The Hadamard gate creates superposition by splitting dictionary keys; phase gates modify amplitudes via complex rotations; and controlled-X and Toffoli gates perform conditional bit flips. These operations are consistent with standard quantum gate behaviour while remaining easy to analyse. We define RSA variables such as prime_p, prime_q, modulus_n, public_key_e, private_key_d, cipher_c, input_x, and period_r. We demonstrate how quantum processes can identify periodic structure, factor the modulus, and recover the private key. By revealing the full attack path, this research provides an interpretable view of quantum threats to the RSA algorithm. The proposed framework supports a useful understanding of quantum security risks. It also highlights the importance of transitioning toward post-quantum cryptographic systems that ensure long-term security for all users.
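
The dictionary representation and the attack path described above can be sketched in a few lines. This is a minimal illustration, not the paper's framework: the function names are hypothetical, the gate rules are the standard textbook ones, and the period is found by classical brute force as a stand-in for the quantum period-finding step. The variable names prime_p, prime_q, modulus_n, public_key_e, input_x, and period_r follow the abstract; the toy values (modulus 15, base 7, exponent 3) are assumptions chosen for readability.

```python
import cmath
import math


def hadamard(state, qubit):
    """Split each key into two keys that differ in `qubit` (superposition)."""
    scale = 1 / math.sqrt(2)
    new_state = {}
    for key, amp in state.items():
        for bit in "01":
            new_key = key[:qubit] + bit + key[qubit + 1:]
            # The |1> -> |1> branch picks up a minus sign.
            sign = -1 if key[qubit] == "1" and bit == "1" else 1
            new_state[new_key] = new_state.get(new_key, 0) + sign * scale * amp
    # Drop keys whose amplitudes cancelled out.
    return {k: a for k, a in new_state.items() if abs(a) > 1e-12}


def phase(state, qubit, theta):
    """Rotate the amplitude by exp(i*theta) wherever `qubit` is 1."""
    return {k: a * cmath.exp(1j * theta) if k[qubit] == "1" else a
            for k, a in state.items()}


def controlled_x(state, control, target):
    """Flip `target` in every key where `control` is 1."""
    new_state = {}
    for key, amp in state.items():
        if key[control] == "1":
            flipped = "0" if key[target] == "1" else "1"
            key = key[:target] + flipped + key[target + 1:]
        new_state[key] = new_state.get(key, 0) + amp
    return new_state


def find_period(input_x, modulus_n):
    """Classical stand-in for quantum period finding: smallest r with x^r = 1 mod n.

    Assumes gcd(input_x, modulus_n) == 1, otherwise the loop never terminates.
    """
    period_r, value = 1, input_x % modulus_n
    while value != 1:
        value = (value * input_x) % modulus_n
        period_r += 1
    return period_r


def factor_from_period(input_x, period_r, modulus_n):
    """Derive the two prime factors from an even period, or None if unlucky."""
    if period_r % 2 == 1:
        return None
    half = pow(input_x, period_r // 2, modulus_n)
    prime_p = math.gcd(half - 1, modulus_n)
    prime_q = math.gcd(half + 1, modulus_n)
    if prime_p in (1, modulus_n) or prime_q in (1, modulus_n):
        return None
    return tuple(sorted((prime_p, prime_q)))


def recover_private_key(public_key_e, prime_p, prime_q):
    """private_key_d is the inverse of e modulo (p-1)(q-1); needs Python 3.8+."""
    return pow(public_key_e, -1, (prime_p - 1) * (prime_q - 1))


if __name__ == "__main__":
    bell = controlled_x(hadamard({"00": 1.0}, 0), 0, 1)
    print(bell)  # two keys, "00" and "11", with equal amplitudes
    period_r = find_period(7, 15)
    print(period_r, factor_from_period(7, period_r, 15))
```

Keeping the state as a plain dictionary means every intermediate step can be printed and inspected, which is the interpretability property the abstract emphasizes; the cost is that the dictionary can grow exponentially with the number of qubits.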

Orchestrating Heterogeneous Experts: A Scalable MoE Framework with Anisotropy-Preserving Fusion

In cross-border e-commerce, search relevance modeling faces the dual challenge of extreme linguistic diversity and fine-grained semantic nuances. Existing approaches typically rely on scaling up a single monolithic Large Language Model (LLM). However, our empirical analysis reveals that single models suffer from uneven capability distributions across regions, for example excelling in English while underperforming in specific Southeast Asian languages. In this work, we shift the paradigm from scaling a single model to orchestrating heterogeneous experts. We propose a scalable Coarse-grained Mixture-of-Experts (MoE) framework that leverages the inherent complementarity of distinct open-source LLMs (e.g., Qwen, Gemma) without expensive pre-training. Unlike standard token-level MoE, our framework dynamically routes entire queries to specialized experts and, crucially, employs an Information-Preserving Concatenation Fusion strategy. We theoretically posit that preserving the distinct embedding manifolds of heterogeneous experts, rather than compressing them via weighted averaging, is essential for capturing complex relevance signals in a multi-model latent space. On datasets spanning six Southeast Asian markets, our MoE improves AUC by 0.72 percentage points over a dense baseline with the same active parameters. Meanwhile, the optimized pipeline achieves 13.72 queries per second (QPS), a 9% throughput improvement.

Measuring Privacy Risks and Tradeoffs in Financial Synthetic Data Generation

We explore the privacy-utility tradeoff of synthetic data generation schemes on tabular financial datasets, a domain characterized by high regulatory risk and severe class imbalance. We consider representative tabular data generators, including autoencoders, generative adversarial networks, diffusion, and copula synthesizers. To address the challenges of the financial domain, we provide novel privacy-preserving implementations of GAN and autoencoder synthesizers. We evaluate whether and how well the generators simultaneously achieve data quality, downstream utility, and privacy, with comparison across balanced and imbalanced input datasets. Our results offer insight into the distinct challenges of generating synthetic data from datasets that exhibit severe class imbalance and mixed-type attributes.

SESSION: International Workshop on Trustworthy Multimodal Learning for Social Media Analysis (TML)

International Workshop on Trustworthy Multimodal Learning for Social Media Analysis

With the rapid proliferation of social media platforms, the web has become a large repository of multimodal content, including text, images, audio, and video. However, the scale and noise of social media data also make multimodal fusion and processing challenging. To address these issues, the International Workshop on Multimodal Deep Learning for Social Media Analysis (IMDL) aims to bring together researchers, practitioners, and industry experts to discuss the latest advancements, challenges, and future directions in analyzing multimodal social media content using Large Multimodal Models (LMMs), as well as to rigorously evaluate their performance and safety for real-world deployment on web platforms. This workshop emphasizes two themes: (i) multimodal social media analysis with LMMs, focusing on multimodal fusion and information alignment; and (ii) performance and safety evaluation for real-world deployment, covering instruction following and generation quality as well as risks such as hallucinations and "jailbreak" attacks. This edition received 14 submissions from 7 countries (China, Singapore, the United States, India, the United Arab Emirates, Canada, and Russia). We accepted 10 papers (acceptance rate: 71.4%), including 7 workshop full papers and 3 workshop short papers.

GeoPBR: Multimodal-Guided PBR Material Generation from a Single View

Existing 3D texture generation methods largely follow text-driven paradigms; however, the inherent semantic ambiguity of natural language makes precise control over material appearance fundamentally challenging. In addition, most approaches output RGB textures that bake in static illumination and shadows, resulting in a mismatch with the physical consistency required for dynamic relighting. These limitations indicate that single-modality guidance and RGB-only representations are insufficient for controllable material synthesis. To address this, we propose a multimodal collaborative generation framework that integrates image, text, and geometric constraints. Reference images provide non-verbalizable high-frequency texture cues beyond textual descriptions, while geometric priors serve as structural anchors for spatially consistent texture synthesis. Furthermore, the generated RGB appearance is decoupled into industry-standard PBR maps, including albedo, metallic, and roughness, enabling relightable 3D assets in dynamic Web environments. Overall, our framework aims to reduce semantic uncertainty in material generation and improve the practical usability and stability of generated assets under varying lighting conditions.

HGSA: Heterogeneity-Guided Sarcasm Adaptation for Video Sentiment Analysis via Polarity Disentanglement

With the continuous evolution of social networks, social media comments have become increasingly semantically rich and diverse. Among these, sarcasm represents a critical component of daily human expression and constitutes a major bottleneck in the field of multimodal sentiment detection. Sarcasm is an intricate phenomenon, as the sentiment conveyed by sarcastic samples often contradicts or conflicts with their explicit informational content. Traditional solutions typically treat multimodal sentiment and sarcasm as a collective information source, utilizing fused sentiment features to extract and refine sarcasm information through various mechanisms. However, such approaches face an inherent challenge: the fusion of sentiment leads to the dilution and homogenization of opposing polarities, resulting in a pervasive ambiguity in semantic polarity.

This study investigates multimodal information interaction and the polarity distribution of sarcasm features. Based on these observations, we propose the Heterogeneity-Guided Sarcasm Adaptation (HGSA) framework. The core of HGSA lies in the Modality-Specific Polarized Refinement (MSPR) mechanism. Critically departing from conventional fusion strategies, MSPR utilizes independent, unfused unimodal sentiment features as polarity guides to perform parallel polarity disentanglement on unified sarcasm features. This approach effectively steers and amplifies the sarcasm-related polarity of sentiment focal points within general sarcasm representations. Subsequently, the Heterogeneity-Aware Sarcasm Integration (HASI) module aggregates these refined, polarity-oriented features to generate heterogeneity-salient sarcasm-aware features, achieving precise enhancement of sentiment features.

Extensive experiments on the MUStARD and MUStARD++ benchmarks demonstrate the superiority of the HGSA framework. Our methodology provides a more effective iterative framework for video sentiment analysis and scientifically validates that explicitly maintaining and leveraging modality-conflict polarity is a critical strategy for addressing phenomena dependent on inter-modal incongruity, such as sarcasm. Furthermore, its effectiveness as a plug-and-play mechanism for mainstream architectures proves that explicitly leveraging modality-conflict polarity is a robust and versatile strategy for addressing complex inter-modal incongruity in video sentiment analysis.

Sparse Causal Latent Features for Robust Multimodal Learning under Distribution Shifts

Multimodal learning systems often suffer from significant performance degradation when facing distribution shifts. The root cause is that these models learn features that exhibit spurious correlations with the labels. This paper introduces sparse causal latent features (SCLF), a framework that enforces robustness through explicit feature selection. It utilizes a gating mechanism to precisely select a small number of stable causal features, thereby enhancing the model's generalization ability in out-of-distribution scenarios. By integrating adversarial training, an invariant risk minimization penalty, and feature stability scoring, SCLF can effectively identify and preserve causal features that remain stable across environments. Experimental results demonstrate that SCLF achieves an accuracy of 73.0% ± 1.1% in strongly negatively correlated out-of-distribution environments, while maintaining an in-domain accuracy of 74.0% ± 0.5%. These results highlight the potential of SCLF for reliable and trustworthy multimodal learning.

Evidential Mixture-of-Experts for Sentiment Regression

Multimodal sentiment analysis aims to understand human emotions by integrating text, audio, and visual information. Existing methods often struggle with noisy or missing modalities and rarely account for the reliability of each modality. In this paper, we propose E-MoE, an evidence-aware mixture-of-experts framework that models sentiment prediction as an evidential inference problem. Specifically, we first employ modality-specific experts to estimate positive and negative evidences for each modality, capturing both sentiment tendencies and uncertainty. Then, an uncertainty-driven fusion mechanism aggregates these evidences, assigning higher weight to reliable modalities while mitigating the impact of noisy or missing inputs. Finally, the aggregated evidences parameterize a Beta distribution from which the final sentiment prediction is derived. Extensive experiments on CMU-MOSI and CMU-MOSEI demonstrate that E-MoE outperforms state-of-the-art approaches, effectively integrating heterogeneous modalities while providing interpretable and robust sentiment predictions.
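
The evidential pipeline outlined above (per-modality evidence, uncertainty-driven fusion, Beta-parameterized prediction) can be sketched as follows. The abstract does not give the formulas, so everything concrete here is an assumption: the common subjective-logic mapping alpha = positive evidence + 1 and beta = negative evidence + 1, the uncertainty 2 / (alpha + beta), fusion weights of one minus uncertainty, and the Beta mean as the prediction. Function names are hypothetical.

```python
def beta_params(pos_evidence, neg_evidence):
    """Assumed mapping from non-negative evidence to Beta(alpha, beta)."""
    return pos_evidence + 1.0, neg_evidence + 1.0


def uncertainty(pos_evidence, neg_evidence):
    """Assumed uncertainty mass: 1.0 with no evidence, shrinking as evidence grows."""
    alpha, beta = beta_params(pos_evidence, neg_evidence)
    return 2.0 / (alpha + beta)


def fuse(evidence_by_modality):
    """Weight each modality's evidence by its reliability (1 - uncertainty)."""
    weights = {m: 1.0 - uncertainty(p, n)
               for m, (p, n) in evidence_by_modality.items()}
    total = sum(weights.values())
    if total == 0:
        return 0.0, 0.0  # no modality carries any evidence
    pos = sum(weights[m] * evidence_by_modality[m][0] for m in weights) / total
    neg = sum(weights[m] * evidence_by_modality[m][1] for m in weights) / total
    return pos, neg


def predict(evidence_by_modality):
    """Return the Beta mean in [0, 1]; 0.5 means neutral or fully uncertain."""
    alpha, beta = beta_params(*fuse(evidence_by_modality))
    return alpha / (alpha + beta)


if __name__ == "__main__":
    evidence = {
        "text": (8.0, 2.0),   # strong, mostly positive evidence
        "audio": (0.0, 0.0),  # missing modality: no evidence, weight 0
        "visual": (1.0, 1.0), # weak, balanced evidence
    }
    print(predict(evidence))
```

Under these assumptions a missing modality contributes zero evidence and therefore zero weight, so it cannot pull the fused prediction toward either polarity.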

PLM: Point-Language Maps for Zero-shot Object Goal Navigation

Object Goal Navigation requires agents to locate targets in unknown 3D environments through user instructions. Under this circumstance, 3D-aware agents can enhance navigation performance via fine-grained spatial understanding. However, off-the-shelf methods demand high-dimensional 3D scene representations to learn this capability, incurring substantial training costs. To address this, we propose Point-Language Maps (PLM), a 3D-aware zero-shot framework that achieves physical-semantic joint reasoning without training. Our core innovation lies in leveraging pre-trained 3D vision-language models (VLMs) to encode environmental point clouds and textual instructions into a unified semantic-geometric space, thereby constructing a language-conditioned 3D potential map. By combining this map with the cost-utility method, PLM can achieve efficient exploration in a zero-shot manner while maintaining 3D semantic awareness. Experiments on three mainstream datasets demonstrate that PLM: (I) exhibits a significant improvement in navigation success rate compared with baseline methods (e.g., an improvement of 2.0% to 29.6% on HM3D); (II) shows competitiveness in navigation efficiency with the best performing method; (III) manifests adapter-style compatibility with various 3D VLMs.

Task-Aligned Unlearning for Multimodal Large Language Models

Multimodal large language models (MLLMs) are prone to memorizing and revealing sensitive personal information, raising concerns about privacy and safety. Existing MLLM unlearning methods typically rely on a single forgetting objective, which often leads to unbalanced forgetting across tasks, degraded generation quality, or unnecessary loss of general ability. We propose TAU, a multimodal unlearning framework that optimizes multiple task-aligned objectives corresponding to heterogeneous tasks in multimodal benchmarks. Although all training data are uniformly formatted as VQA-style triplets, MLLMs are evaluated through distinct task settings on MLLMU-Bench, including classification, free-form generation, and cloze-style completion. TAU decomposes unlearning into three task-aligned objectives that target different behaviors: reducing discriminative confidence for classification, inducing controlled refusal generation, and suppressing entity–attribute recall under both image-conditioned and text-only prompts. These objectives are jointly optimized with a retain loss to effectively forget designated information while preserving non-target knowledge. Experiments on different MLLMs demonstrate that TAU achieves more balanced and effective unlearning across tasks.

Unsupervised Multi-Source Causal Feature Selection

Multi-source feature selection (MSFS) aims to identify informative features from data collected across multiple sources, where each source contains distinct samples but shares a common feature space. Despite the growing interest in this setting, most existing unsupervised feature selection methods are designed for single-source data and are therefore suboptimal for multi-source scenarios that exhibit sample heterogeneity, as they fail to capture shared underlying structures across sources. Moreover, current approaches primarily focus on statistical correlations while overlooking the underlying causal relationships between features. Consequently, the selection process is susceptible to confounding factors and source-specific biases, which can induce spurious associations and degrade feature selection performance. To address these challenges, we propose a novel MSFS approach, called Unsupervised Multi-source Causal Feature Selection (UMCFS). Specifically, UMCFS employs adversarial balancing to learn domain-invariant sample weights and enables causal reweighting that mitigates confounding-induced spurious correlations without label supervision. Furthermore, it explicitly separates shared global feature importance from source-specific deviations to enhance the reliability of feature evaluation across heterogeneous sources. Extensive experimental results demonstrate that UMCFS consistently outperforms state-of-the-art MSFS methods on benchmark datasets.

EmoTg: Modeling Social Response to Telegram Posts Using LLM and NLP Features

This paper presents a pipeline for analyzing Telegram posts and predicting audience reactions in the absence of comment data. Raw messages were processed using both NLP methods and GPT-based structured prompts to extract linguistic, stylistic, semantic, and emotional features. Comment-level responses were independently annotated using a dedicated prompt to generate thread-level labels capturing sentiment, toxicity, stance, humor, and conflict. These labels were then used as targets in a post-to-comment inference task formulated as a multi-task classification problem. XGBoost models achieved up to 0.85 accuracy for emotional attributes (cuteness, conflict) and 0.71 for toxic comment prediction, confirming that post content contains reliable signals of potential audience behavior. SHAP-based explainability showed that LLM-derived features are interpretable and behaviorally meaningful. Overall, the approach demonstrates that content dynamics in comment threads can be anticipated from post features alone, enabling proactive moderation, risk assessment, and socially aware AI support for online communication.

SESSION: Emerging Trends in Web Advertising (WebAds)

SESSION: 12th International Smart City Workshop -- Data-Driven Smart Cities (WebAndTheCity)

WebAndTheCity 12th International Smart City Workshop - Data-Driven Smart Cities

This is the 12th edition of the workshop series "Web and the City – The Web and Smart Cities", the successor to the series of workshops that started in Florence in 2015 and has taken place every year since in conjunction with the WWW conference series. Each year the focus of the workshop is updated; this year it is data-driven smart cities. In the era of IoT, AI and agentic AI integration, and the citiverse (a metaverse for people-centric cities), cities are being transformed into urban environments that use data as a foundational asset to improve decision-making, optimize services, and enhance citizen well-being. At the same time, data is processed using various techniques and methods, which influence the outcomes. This workshop aims to explore how the Web supports this transformation and how technologies can improve smart cities.

A Toolset for Modelling Non-Mediated Governance

This discussion paper summarizes the visionary concepts of Non-Mediated Governance (nm-Gov) and Liquid Democracy and describes a diagramming technique for modelling use cases, in order to better understand how data in nm-Gov can be translated into concrete government action.

Boltzmann Interaction Models with Gibbs Decision Policies in Multi-Agent Delivery Systems

This paper introduces the Boltzmann Interaction Multi-Drone Coordination (BIMDC) framework, a thermodynamically grounded approach to decentralized coordination in data-driven smart city environments. We consider urban drone fleets as self-organizing cyber-physical agents that make stochastic local decisions based on energy functions encoding travel cost, service imbalance, and inter-agent interference. Agent behavior follows a Gibbs policy, where a temperature parameter T regulates the exploration–exploitation trade-off and gives rise to distinct coordination regimes. Using a mean-field formulation, we connect local interaction energies to macroscopic order parameters, enabling analytical insight into city-scale fleet behavior. Experiments in a simulated urban delivery scenario (L=50, M=12, N ≤ 24) reveal clear temperature–density dependencies: highly coordinated service allocation at low T (Φ≈0.99), degraded coordination at intermediate T ≈ 1.5 (Φ≈0.7–0.8), and partial recovery at higher temperatures. The resulting phase diagram Φ(T, p) provides a principled tool for adaptive fleet management, capacity planning, and analysis in smart city logistics.
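As a rough illustration of what a Gibbs decision policy looks like in general, the sketch below samples an action with probability proportional to exp(-E/T). It is not the BIMDC implementation: the function names, the linear form of the energy, and the weights are hypothetical and chosen only to mirror the three energy terms named in the abstract.

```python
import math
import random


def energy(travel_cost, imbalance, interference, weights=(1.0, 1.0, 1.0)):
    """Hypothetical linear energy combining the three terms named in the abstract."""
    w_travel, w_imbalance, w_interference = weights
    return (w_travel * travel_cost
            + w_imbalance * imbalance
            + w_interference * interference)


def gibbs_policy(energies, temperature):
    """Return action probabilities p_i proportional to exp(-E_i / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Shift by the minimum energy so the exponentials cannot overflow.
    e_min = min(energies)
    unnormalized = [math.exp(-(e - e_min) / temperature) for e in energies]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def sample_action(energies, temperature, rng=random):
    """Draw one action index according to the Gibbs probabilities."""
    probs = gibbs_policy(energies, temperature)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]


# Three candidate delivery targets with made-up cost terms.
candidates = [energy(1.0, 0.0, 0.0), energy(1.0, 1.0, 0.0), energy(2.0, 1.0, 1.0)]
low_t = gibbs_policy(candidates, temperature=0.1)     # near-greedy choice
high_t = gibbs_policy(candidates, temperature=100.0)  # near-uniform choice
```

At low T the probability mass concentrates on the lowest-energy action (exploitation), while at high T the distribution approaches uniform (exploration), which is the trade-off the temperature parameter controls.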

LLM-Enabled Participatory Platforms for People-Centric Smart Cities

Existing digital participatory platforms are often constrained to basic consultation and voting workflows, limiting their ability to sustain large-scale, inclusive, and deliberative civic engagement in smart cities. This article addresses these limitations by proposing a novel human-in-the-loop architecture that leverages Large Language Models and 6G connectivity to enable richer co-creation in people-centric smart cities. The primary contribution of this work is the development of a socio-technical framework that delineates the boundaries of model autonomy across four critical participatory stages: proposal drafting, convergence, large-scale synthesis, and behavioral moderation. By grounding this framework in the requirements of transparency, accountability, and representational fairness, the study identifies specific design constraints needed to mitigate algorithmic and automation biases. Our analysis provides a roadmap for institutional adoption, demonstrating that for Large Language Models to preserve democratic legitimacy, their use must be explicitly bounded and open to contestation. These findings offer actionable insights for researchers and urban planners aiming to integrate generative artificial intelligence into robust, transparent, and inclusive 6G-enabled participatory ecosystems.

Super-Resolution of Urban Socioeconomic Indicators via Graph-Based Recommender Systems

Detailed socioeconomic insights are essential for effective urban policy, yet traditional census data remains static, costly, and restricted to coarse spatial resolutions due to privacy constraints. To bridge this "granularity gap", we introduce a Socioeconomic Super-Resolution framework that infers fine-grained indicators from user-generated digital traces. We propose a Spatial-GNN framework that learns business representations from user-business interaction graphs, explicitly enriching them with semantic categories and coarse geographical context. We show that these embeddings, trained solely on coarse aggregate labels (Postal Codes), capture sufficient latent signal to recover fine-grained attributes at the Census Block Group level. Experiments on the Yelp dataset confirm that our approach effectively disaggregates urban data, with the learned embeddings naturally capturing spatial and socioeconomic homophily. This work bridges Recommender Systems and Urban Computing, offering a scalable methodology for high-resolution demographic inference.

Closed-Loop Intelligence Architectures for Data-Driven Smart Cities

Data-driven smart cities increasingly rely on predictive analytics and digital platforms to manage complex and dynamic urban challenges. However, predictive intelligence is often developed in isolation and remains weakly connected to operational decision-making and coordinated action, limiting its effectiveness in time-critical contexts. This paper presents a closed-loop intelligence architecture for smart cities that integrates predictive digital twins, web-based data and service platforms, and reliable communication mechanisms into a continuous operational cycle. The proposed architecture connects data collection and AI/ML-based prediction with decision-support tools and coordinated action, enabling predictive insights to be operationalised and continuously refined through feedback from real-world interventions. By closing the loop between sensing, analytics, decision support, and action, the approach supports a transition from reactive to preventive, resilient, and human-centric smart city operation.

A Contrastive Learning and Prompt-Driven Framework for Low-Resource User Geolocation

Social media user geolocation is a fundamental yet challenging problem due to the scarcity of geotagged data and the heterogeneity of online user information. To address these challenges, we propose FewUser, a contrastive learning and prompt-driven framework for few-shot social media user geolocation. FewUser aligns user and location representations through a dual-objective framework that jointly optimizes contrastive and matching losses with hard negative mining, enabling robust geolocation under limited supervision. The model comprises a user representation module that fuses heterogeneous social media inputs via a pre-trained language model (PLM) and a lightweight user encoder, and a geographical prompting module that employs hard, soft, and semi-soft prompts to bridge PLM semantics with location-specific knowledge. To facilitate few-shot and cross-platform evaluation, we construct two new datasets, TwiU and FliU, featuring rich and standardized user- and post-level metadata. Extensive experiments on TwiU, FliU, and two public benchmarks demonstrate that FewUser consistently outperforms competitive baselines in various few-shot settings.
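To make the idea of a contrastive objective with hard negative mining more concrete, here is a minimal sketch using a generic InfoNCE-style loss over cosine similarities. It is an assumption-laden illustration, not FewUser's actual loss: the function names, the temperature `tau`, the top-k selection rule, and the toy embeddings are all hypothetical, and the matching loss mentioned in the abstract is not modelled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hard_negatives(user, negatives, k):
    """Keep the k negative locations most similar to the user (the hardest ones)."""
    ranked = sorted(negatives, key=lambda loc: cosine(user, loc), reverse=True)
    return ranked[:k]


def info_nce(user, positive, negatives, tau=0.1):
    """InfoNCE-style loss: -log softmax of the positive pair among all candidates."""
    logits = [cosine(user, positive) / tau]
    logits += [cosine(user, neg) / tau for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings: one user, its true location, and three wrong locations.
user = [1.0, 0.0]
true_location = [1.0, 0.0]
wrong_locations = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]

hardest = hard_negatives(user, wrong_locations, k=1)
loss_hard = info_nce(user, true_location, hardest)
loss_easy = info_nce(user, true_location, [[-1.0, 0.0]])
```

In this toy setup the hard negative yields a much larger loss than the easy one, which illustrates why mining hard negatives can give a stronger training signal when labelled data is scarce.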

Big Data and AI in Smart City Negotiations: A Literature Review

This work-in-progress paper presents a literature review that examines the most cited studies on the use of big data and artificial intelligence (AI) in negotiations and public policy, drawing on sources indexed in Scopus, Web of Science, and ScienceDirect. Spanning disciplines including urban planning, environmental governance, health policy, media studies, behavioral science, organizational studies, and computer science, the reviewed literature highlights the growing integration of large-scale datasets, algorithmic tools, and AI systems into policy design and negotiation processes. The findings indicate that big data and AI not only support decision-making but increasingly shape negotiation dynamics by redefining uncertainty, feasibility, and legitimacy, while also altering power relations among negotiating actors. Four interrelated themes emerge from the literature: (1) the role of big data in structuring negotiation spaces, (2) the influence of algorithms on public discourse and agenda-setting, (3) the use of AI tools in decision-making and negotiated outcomes, and (4) the continuing role of human agency amid algorithmic mediation. Across these themes, evidence suggests that analytical capacity functions as a source of structural bargaining power, often pre-configuring options before deliberation begins. At the same time, algorithmic curation of information shapes public debate and the legitimacy of policy choices, complicating consensus-building processes. While AI systems can enhance efficiency and analytical rigor, they also embed normative assumptions and risk reinforcing existing inequalities if governance, transparency, and inclusiveness are insufficiently addressed. The review concludes that, rather than replacing human negotiators, AI systems interact with and reshape human agency, raising critical governance challenges for democratic accountability, fairness, and power asymmetries in contemporary policy negotiations.

SESSION: Zero-knowledge Proof and Blockchain for WEB 4.0: Advancing the Post-quantum and Decentralized Era