DocEng '25: Proceedings of the 2025 ACM Symposium on Document Engineering

Full Citation in the ACM Digital Library

SESSION: Keynote Talks

Detecting and Documenting Plagiarism and GenAI Use

Debora Weber-Wulff

Despite there being many software systems that announce that they can detect plagiarism and AI-generated text, they do not actually work as many people suppose or desire. Unauthorized content generation, whether plagiarism, ghostwriting, or AI-generated text, comes in many varieties and not all are easy to detect.

There are also multiple algorithms used to detect the different kinds of unauthorized content generateion, and they do not always produce the same results. There are, however, interesting forensic indicators that can point to plagiarism, ghostwriting, or AI-use. The software systems can be used as one potential tool, but not as a decision system for determining plagiarism. It is, however, very easy to document some forms of plagiarism.

It is a different story for ghostwritten or AI-generated texts. There is no proof to be found that a text was produced by another person or a large language model, only probabilities. Depending on the use case, the amount of false positives and false negatives can also preclude the use of such detection systems. Here, too, there are forensic indicators that can show the probable use of large language models or a ghostwriter, but they still cannot provide absolute proof of use.

Issues in Document Security

Charles Nicholas

When hitherto separate areas of science intersect, research opportunities tend to pop up. So it is with the fields of Document Engineering and Cybersecurity. We present an overview of certain problems that are related to these fields. This overview includes a brief summary of recent and ongoing work in our lab, which in turn includes some that has appeared at previous DocEng conferences. We summarize recent work on detecting and dealing with malicious PDF files, construction of useful malware data sets, and certain applications of tensor decomposition in the analysis of such data. We also describe some ongoing work in malware clustering. In recent months the topic of AI-generated documents, especially software, has created much discussion, and we will comment on this. We will conclude by pointing out certain themes in our work, as well as certain outstanding problems.

SESSION: Tutorials

LLM-assisted Automatic Feature Extraction for Document Understanding and Analytics

Sirisha Velampalli

With the exponential growth in the volume and complexity of structured and semi-structured documents, organizations face increasing challenges in extracting meaningful features for downstream analytics. This tutorial introduces a cutting-edge approach to automatic feature extraction using Large Language Models (LLMs), integrating generative AI with document engineering workflows.

Participants will explore how LLMs can extract actionable insights from varied document types such as invoices, medical notes, legal filings, and audit reports. We demonstrate the integration of LLMs with machine learning models in end-to-end pipelines involving document parsing, feature extraction, and analytics.

Real-world scenarios covered include business intelligence and globally significant domains such as: (1) detecting corruption from audit trails, (2) analyzing pregnancy risks from clinical narratives, (3) uncovering food adulteration in lab reports, (4) investigating accidents through unstructured reports, (5) monitoring war or geopolitical crises via leaked or public documents.

This tutorial illustrates how global challenges---often rooted in complex documentation---can be addressed by transforming unstructured data into impactful solutions with social, economic, and ethical relevance.

Well-Tagged PDF and Universal Accessibility with LATEX

Frank Mittelbach
Ulrike Fischer
David Carlisle
Joseph Wright

SESSION: Editorials

Celebrating 25 Years of Document Engineering

Ethan V. Munson

Binarizing Photographed Document Images 2025 Quality, Time and Space Assessment

Gustavo P. Chaves
Thaylor Vieira
Gabriel de F. P e Silva
Rafael Dueire Lins
Steven J. Simske

Image binarization is fundamental for document image processing. The performance of binarization algorithms depends on several factors that range from the quality of the digitalization devices to the intrinsic features of the document itself and the kind and intensity of the noises present in the image. This assessment on binarizing photographed documents evaluated the quality, time, space, and performance of five new algorithms and ninety-eight "classical" algorithms. The test data set is composed of laser and deskjet printed documents, photographed using six widely used mobile devices with the strobe flash on, off, and in auto modes under three different angles and places of capture.

SESSION: Document Information Retrieval

Session details: Document Information Retrieval

Patrick Healy

Exploiting Query Reformulation and Reciprocal Rank Fusion in Math-Aware Search Engines

Besat Kassaie
Andrew Kane
Frank Wm. Tompa

Mathematical formulas introduce complications to the standard approaches used in information retrieval. By studying how traditional (sparse) search systems perform in matching queries to documents, we hope to gain insights into which features in the formulas and in the accompanying natural language text signal likely relevance.

In this paper, we focus on query rewriting for the ARQMath benchmarks recently developed as part of CLEF, the Conference and Labs of the Evaluation Forum. In particular, we improve mathematical community question answering applications by using responses from a large language model (LLM) to reformulate queries. Beyond simply replacing the query by the LLM response or concatenating the response to the query, we explore whether improvements accrue from the LLM selecting a subset of the query terms, augmenting the query with additional terms, or re-weighting the query terms. We also examine whether such query reformulation is equally advantageous for math features extracted from formulas and for keyword terms. As a final step, we use reciprocal rank fusion (RRF) to combine several component approaches in order to improve ranking results. In two experiments involving real-world mathematical questions, we show that combining four strategies for term selection, term augmentation, and term re-weighting improves nDCG'@1000 by 5%, MAP'@1000 by 7%, and P'@10 by more than 9% over using the question as given.

Mining a Century of Swiss Trademark Data

Daniel Travaglia
Jesper Findahl
Marco D'Ambros
Andrea Mocci
Raphael Parchet

This paper presents an approach for extracting trademark registration events from the Swiss Official Gazette of Commerce (SOGC), an official daily journal published by the Swiss Confederation since January 1883. Until 2001, the data is only available as scanned documents, which constitute the target dataset of this study. Our approach is composed of a chain of three steps based on state-of-the-art deep learning techniques. We leverage image classification to identify pages containing trademarks (macro segmentation); we apply object detection to identify the portion of the page corresponding to a registration event (micro segmentation); last, we perform information extraction using a document AI technique. We obtain a dataset of ca. 500,000 trademark registration events, extracted from a corpus of 430,000 pages. Each step of our workflow has relatively high accuracy: the macro and micro segmentation steps show precision and recall greater than 95% on a manually constructed dataset. The dataset offers a unique historical perspective on trademark registrations in Switzerland that is not available from any other source. Showcasing what can be achieved with the extracted information, we provide answers to a set of preliminary economics questions.

OPERA: An Environment Extending Coreference Annotation to Relations Between Entities

Antoine Boiteau
Yann Mathet
Antoine Widlöcher

The availability of annotated corpora on coreference is a requirement for linguistics and NLP. This presupposes the availability of suitable annotation environments. Yet most of annotations tools for coreference are based on annotation models with limited expressiveness. We present here the OPERA annotation tool, which is based on an extended model for coreference annotation that, in addition of allowing work on referring expressions in the text (a widespread feature), enables the relations between entities to be annotated and characterized.

Topic Modeling and Link-Prediction for Material Property Discovery

Ryan C. Barron
Maksim E. Eren
Valentin Stanev
Cynthia Matuszek
Boian S. Alexandrov

Link prediction is a key network analysis technique that infers missing or future relations between nodes in a graph, based on observed patterns of connectivity. Scientific literature networks and knowledge graphs are typically large, sparse, and noisy, and often contain missing links, potential but unobserved connections, between concepts, entities, or methods. Here, we present an AI-driven hierarchical link prediction framework that integrates matrix factorization to infer hidden associations and steer discovery in complex material domains. Our method combines Hierarchical Nonnegative Matrix Factorization (HNMFk), Boolean matrix factorization (BNMFk) with automatic model selection. These discrete factors are then fused with Logistic matrix factorization (LMF), we use to construct a three-level topic tree from a 46,862-document corpus focused on 73 transition-metal dichalcogenides (TMDs). This class of materials has been studied in a variety of physics fields and has a multitude of current and potential applications.

An ensemble BNMFk + LMF approach fuses discrete interpretability with probabilistic scoring. The resulting HNMFk clusters map each material onto coherent research themes, such as superconductivity, energy storage, and tribology, and highlight missing or weakly connected links between topics and materials, suggesting novel hypotheses for cross-disciplinary exploration. We validate our method by removing publications about superconductivity in well-known superconductors, and demonstrate that the model correctly predicts their association with the superconducting TMD clusters. This highlights the ability of the method to find hidden connections in a graph of material to latent topic associations built from scientific literature. This is especially useful when examining a diverse corpus of scientific documents covering the same class of phenomena or materials but originating from distinct communities and perspectives. The inferred links generating new hypotheses, produced by our method, are exposed through an interactive Streamlit dashboard, designed for scientific discovery.

SESSION: Optical Character Recognition

Session details: Optical Character Recognition

Steve Simske

Improving Lightweight Named Entity Recognition in Handwritten Documents by Predicting Pyramidal Histograms of Characters

David Villanova-Aparisi
Carlos D. Martínez-Hinarejos
Verónica Romero
Moisés Pastor-Gadea

Named Entity Recogniton (NER) consists of tagging parts of an unstructured text containing particular semantic information. When applied to handwritten documents, it is possible to do it as a two-step approach in which Handwritten Text Recognition (HTR) is performed prior to tagging the automatic transcription. However, it is also possible to do both tasks simultaneously by using an HTR model that learns to output the transcription and the tagging symbols. In this paper, we focus on improving the one-step approach by introducing the auxiliary task of predicting Pyramidal Histograms of Characters (PHOC) in a Convolutional Recurrent Neural Network (CRNN) model. Moreover, given the recent rise of models that digest large amounts of data, we also study the usage of synthetic data to pretrain the proposed architecture. Our experiments show that by pretraining the PHOC-based architecture on synthetic data substantial improvements can be made in both transcription and tagging quality without compromising the computational cost of the decoding step. The resulting model matches the NER performance of the state-of-the-art while keeping its lightweight nature.

Text Image Super-Resolution for Improved OCR in Real-Life Scenarios using Swin Transformers

Philipp Hildebrandt
Maximilian Schulze
Sarel Cohen
Vanja Doskoč
Raid Saabni
Tobias Friedrich

Text recognition in real-life images poses a challenging task due to blur, distortion, and low resolution. This work presents an innovative method integrating image super-resolution, image restoration, and optical character recognition techniques to enhance text recognition in real-life photographs. We specifically reviewed the processing of the TextZoom dataset and utilized transfer learning on an improved version of the image super-resolution model, SwinIR. The findings of our experiment show that our text recognition scores are better than the current best scores, and there is a significant rise in the peak signal-to-noise ratio while dealing with deformed low-resolution images from the TextZoom dataset. This approach outperforms earlier research in the domain of scene text image super-resolution and offers a promising resolution for text recognition in real-life images. The code can be accessed at this location: https://github.com/Phimanu/TextSR

Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval

Alexander Most
Joseph Winjum
Manish Bhattarai
Shawn Jones
Nishath Rajiv Ranasinghe
Ayan Biswas
Dan O'Malley

Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing the reliability and utility of Large Language Models (LLMs) by grounding responses in external documents. Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text. However, even state-of-the-art OCRs can introduce errors, especially in degraded or complex documents. Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR. This study presents a systematic comparison between a vision-based RAG system (ColPali) and more traditional OCR-based pipelines utilizing Llama 3.2 (90B) and Nougat OCR across varying document qualities. Beyond conventional retrieval accuracy metrics, we introduce a semantic answer evaluation benchmark to assess end-to-end question-answering performance. Our findings indicate that while vision-based RAG performs well on documents it has been fine-tuned on, OCR-based RAG is better able to generalize to unseen documents of varying quality. We highlight the key trade-offs between computational efficiency and semantic accuracy, offering practical guidance for RAG practitioners in selecting between OCR-dependent and vision-based document retrieval systems in production environments.

A Proposal of Post-OCR Spelling Correction Using Monolingual Byte-level Language Models

Sávio Santos de Araújo
Byron Leite Dantas Bezerra
Arthur Flor de Sousa Neto

This work presents a proposal for a spelling corrector using monolingual byte-level language models (Monobyte) for the post-OCR task in texts produced by Handwritten Text Recognition (HTR) systems. We evaluate three Monobyte models, based on Google's ByT5, trained separately on English, French, and Brazilian Portuguese. The experiments evaluated three datasets with 21st century manuscripts: IAM, RIMES, and BRESSAY. In the IAM, Monobyte achieves reductions of 2.24% in character error rate (CER) and 26.37% in word error rate (WER). In RIMES, reductions are 13.48% (CER) and 33.34% (WER), while in BRESSAY, Monobyte improves CER by 12.78% and WER by 40.62%. The BRESSAY results surpass results reported in previous works using a multilingual ByT5 model. Our findings demonstrate the effectiveness of byte-level tokenization in noisy text and underscore the potential of computationally efficient, monolingual models. Code is availabled at https://github.com/savi8sant8s/monobyte-spelling-corrector.

Old Greek OCR Result Correction Using LLMs

Andreas Evaggelatos
Konstantinos Palaiologos
Basilis Gatos
Panagiotis Kaddas
Aikaterini Christopoulou
Vassilis Katsouros

Recognition of historical documents is still an active research field due to the relatively low recognition accuracy achieved when processing old fonts or low-quality images. In this work, we investigate the use of Large Language Models (LLMs) for the correction of the OCR for old Greek documents. We examine two different old Greek datasets, one machine printed and one typewritten, using a Deep Network based OCR together with several known and easy-to-use LLMs for the correction of the result. Additionally, we synthetically produce erroneous texts and change the LLM prompts in order to further study the behavior of LLMs for correcting old Greek noisy text. Experimental results highlight the potential of LLMs for OCR correction of old Greek documents especially for the cases that the recognition results are relatively poor.

SESSION: Document Organization and Generation

Session details: Document Organization and Generation

Ethan Munson

A Hybrid, Neuro-symbolic Approach for Scholarly Knowledge Organization

Hassan Hussein
Allarad Oelen
Sören Auer

The rapid development of generative AI leveraging neural models, particularly with the introduction of large language models (LLMs), has fundamentally advanced natural language processing and generation. However, such neural models are non-deterministic, opaque, and tend to confabulate. Knowledge Graphs (KGs) on the other hand contain factual information represented in a symbolic way for humans and machines following formal knowledge representation formalisms. However, the creation and curation of KGs is time-consuming, cumbersome, and resource-demanding. A key research challenge now is how to synergistically combine both formalisms with the human in the loop (Hybrid AI) to obtain structured and machine-processable knowledge in a scalable way. We introduce an approach for a tight integration of Humans, Neural Models (LLM), and Symbolic Representations (KG) for the semiautomatic creation and curation of Scholarly Knowledge Graphs. Our approach, while demonstrated in the scholarly context, establishes generalizable principles for neuro-symbolic integration that can be adapted to other domains. We implement and integrate our approach comprising an intelligent user interface and prompt templates for interaction with an LLM in the Open Research Knowledge Graph. We perform a thorough analysis of our approach and implementation with a user evaluation to assess the merits of the neuro-symbolic, hybrid approach for organizing scholarly knowledge.

Preserving Measurement Data Records Long-term: A Field Study on Information Management in the Wake of the 1986 Chernobyl Disaster

Uwe M. Borghoff
Peter Rödig

The long-term preservation of digital documents containing scientifically relevant data from measurements remains challenging. We focus on two critical aspects: first, the use of highly specialized--- sometimes proprietary and obsolete---formats; and second, the limited usefulness of raw data without contextual information. In the context of a field study on the preservation of gamma-spectroscopic food radioactivity measurements collected after the Chernobyl disaster in 1986, we examine these aspects and describe the problems encountered in an archival environment. We emphasize the implementation of the FAIR guiding principles for scientific data management and stewardship, 1 in particular the principle that "data are described with rich metadata" including "descriptive information about the context." We use the OAIS (Open Archival Information System) reference model as a methodological framework.

Towards More Homogeneous Paragraphs

Didier Verna

Paragraph justification is based primarily on shrinking or stretching the interword blanks. While the blanks on a line are all scaled by the same amout, the amount in question varies from line to line. The quality of a paragraph's typographic color largely depends on the aforementioned variation being as small as possible. Yet, TEX'S paragraph justification algorithm addresses this problem in a rather coarse fashion. In this paper, we propose a refinement to the algorithm allowing to improve the situation without disturbing the general behavior of the algorithm too much, and without the need for manual intervention. We analyze the impact of our refinement on a large number of experiments through several statistical estimators. We also exhibit a number of typographical traits relatedto whitespace distribution that we believe may contribute to our perception of homogeneousness.

MathML and other XML Technologies for Accessible PDF from LATEX

Frank Mittelbach
Ulrike Fischer
David Carlisle
Joseph Wright

In this paper we describe the current approach to using MathML within Tagged PDF to enhance the accessibility of mathematical (STEM) documents. While MathML is specified by the PDF 2.0 specification as a standard namespace for PDF Structure Elements, the interaction of MathML, which is defined as an XME vocabulary, and PDF Structure Elements (which are not defined as XME) is left unspecified by the PDF standard. This has necessitated the development of formalizations to interpret and validate PDF Structure Trees as XME, which are also introduced in this paper.

Measuring temporal gains in assisted document transcription

Shad Mohammad
Előd Egyed-Zsigmond
Franck Lebourgeois
Michiel Streijger
Michela Bussotti
Luis Tovar Pimentel
Vincent Paillusson

Transcribing ancient manuscripts is a very time consuming task mainly done by specialists like historians and paleographers. In order to minimize time spent on transcribing, computing tools can be used to assist researchers in this process, but, to our knowledge, there are no studies that evaluate and quantify precisely their actual benefits in terms of human time spent on the transcription. This paper presents a study on quantifying the temporal efficiency of different transcription and segmentation workflows. The main contribution of our work is the extension of a popular existing open-source text transcription web platform: eScriptorium with tools and methods to measure time spent on the transcription task. We explore and compare the efficiency of three workflows: fully manual, using a default annotation model, and using a finetuned model for both ancient manuscript segmentation and transcription. In our experiments, we aim to observe and compare the temporal gains achieved by each workflow, highlighting the trade-offs between manual and automated processing of transcriptions. The paper describes the design of the tracing layer, presents the methodology and the results. This work is conducted as part of the ChEDiL1 French ANR project.

SESSION: DocEng Demonstrations

Session details: DocEng Demonstrations

Cerstin Mahlow

A Comprehensive AI-Powered Editing and Typesetting Platform for Enhancing Academic Writing

Jie Wang

This demo introduces Doenba Edit, a user-friendly, AI-powered platform developed by Librum Technologies, Inc., designed for seamless editing and typesetting of academic writing within a word processor-like interface. It supports the entire academic writing workflow, from outlining and idea development to drafting, revising, typesetting, and cross-referencing, assisted by integrated AI tools at every stage. This offers a comprehensive solution for enhancing both the quality and efficiency of producing scholarly work.

Use Case Demonstration @ DocEng2025: Conversation-Driven Multi-LLM Framework for Web Document Sentiment Analysis

Dominik Opitz
Andreas Hamm

In this use case demonstration we show how a system of collaborative Large Language Models (LLMs) can be applied to the task of analyzing the sentiment of online news articles. The emergence of LLMs has proven to be highly valuable in interpreting unstructured text, offering nuanced and context-aware insights. While they can not fully replace traditional machine learning approaches for sentiment analysis, our approach illustrates how collaborative LLM architectures can enrich the explainability and trustworthiness of the outcomes.

The Di2Win Document Intelligence Platform

Afonso Ferreira
Cleber Zanchettin
Romulo Andrade
Byron Leite Dantas Bezerra

We present the Di2Win Document Intelligence Platform (DIP). This modular AI-driven pipeline transforms raw document images --- captured by scanners or mobile phones --- into structured data and business actions in a single pass. The system comprises five loosely-coupled micro-services: (1) image-quality verification using a contrast-invariant model that flags blur, skew, and illumination issues above 100 ms per page; (2) document classification via a Transformer-base model with layout embeddings, delivering top-k types with calibrated confidence; (3) information extraction through i) Dilbert, a multimodal Token-Layout-Language model fine-tuned on weakly-labeled forms or ii) Delfos, a Large Language Model Mixture of Experts fine-tuned with well-defined prompts; (4) DataDrift, a powerful rules engine to avoid inconsistent outputs concerning the business process; and (5) process automation orchestrated by a Business Process Model Notation (BPMN) plus a Robot Process Automation (RPA) engine that routes results to databases, APIs, or human-review queues. All AI components are orchestrated through a messaging service to control the information flow, and the application exposes REST/gRPC endpoints to communicate with outside consumers. This enables the hot-swapping of models without downstream code changes by plugging a new message consumer into the messaging system. This also provides horizontal scalability since to increase the application throughput, we only need to add new AI engine consumers to the messaging system. Deployed in banking, insurance, and healthcare, the Di2Win DIP has processed more than 30 million pages, reducing average handling time by 79% and re-keying errors by 86 %, speeding up the workflows up to ten times. Our DocEng demonstration allows attendees to upload documents, observe live quality and confidence dashboards, and edit extracted fields with immediate feedback to the active-learning loop.

SESSION: Document Classification

Session details: Document Classification

Besat Kassaie

SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification

Shahriar Shayesteh
Mukund Srinath
Lee Matheson
Lu Xian
Sinjoy Saha
C. Lee Giles
Shomir Wilson

One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195, 495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.1

Robust Image Classifiers Fail Under Shifted Adversarial Perturbations

Fatemeh Amerehi
Patrick Healy

Non-robustness of image classifiers to subtle, adversarial perturbations is a well-known failure mode. Defenses against such attacks are typically evaluated by measuring the error rate on perturbed versions of the natural test set, quantifying the worst-case performance within a specified perturbation budget. However, these evaluations often isolate specific perturbation types, underestimating the adaptability of real-world adversaries who can modify or compose attacks in unforeseen ways. In this work, we show that models considered robust to strong attacks, such as AutoAttack, can be compromised by a simple modification of the weaker FGSM attack, where the adversarial perturbation is slightly transformed prior to being added to the input. Despite the attack's simplicity, robust models that perform well against standard FGSM become vulnerable to this variant. These findings suggest that current defenses may generalize poorly beyond their assumed threat models and can achieve inflated robustness scores under narrowly defined evaluation settings.

Document Classification using File Names

Zhijian Li
Stefan Larson
Kevin Leach

Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets and computational resources associated with analyzing whole documents. In this paper, we present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method, to accurately and efficiently classify documents based solely on their file names, which substantially reduces inference time. Experiments on two datasets introduced in this paper show that our file name classifiers correctly predict more than 90% of in-scope documents with 99.63% and 96.57% accuracy while being 442x faster than more complex models such as DiT. Our results demonstrate that incorporating lightweight file name classification as a front-end to document analysis pipelines can efficiently process vast document datasets in critical scenarios, enabling fast and more reliable document classification.

Spurious Cues in RVL-CDIP and Tobacco3482 Document Classification: The Case of ID Codes

Stefan Larson
Sharad Duwal
Brian Vilnrotter
Gayatri Chakkithara
Vedant Padwal
Kevin Leach

RVL-CDIP and Tobacco3482 are commonly used document classification benchmarks, but recent work on explainability has revealed that ID codes stamped on the documents in these datasets may be used by machine learning models to learn shortcuts on the classification task. In this paper, we present an in-depth investigation into the influence and impact of these ID codes on model performance. We annotate ID codes in documents from RVL-CDIP and Tobacco3482 and find that shallow learning models can achieve classification accuracy scores of roughly 40% on RVL-CDIP and 60% on Tobacco3482 using only features derived from the ID codes. We also find that a state-of-the-art document classifier sees a performance drop of 11 accuracy points on RVL-CDIP when ID codes are removed from the data. Finally, we train an ID code detection model in order to remove ID codes from RVL-CDIP and Tobacco3482 and make this data publicly available.

SESSION: Document Analysis and Generation

Session details: Document Analysis and Generation

Didier Verna

BioReadNet: A Transformer-Driven Hybrid Model for Target Audience-Aware Biomedical Text Readability Assessment

Anya Amel Nait Djoudi
Patrice Bellot
Adrian-Gabriel Chifu

The perception of the readability of biomedical texts varies depending on the reader's profile, a disparity further amplified by the intrinsic complexity of these documents and the unequal distribution of health literacy within the population. Although 72% of Internet users consult medical information online, a significant proportion have difficulty understanding it. To ensure that texts are accessible to a diverse audience, it is essential to assess readability. However, conventional readability formulas, designed for general texts, do not take this diversity into account, underlining the need to adapt evaluation tools to the specific needs of biomedical texts and the heterogeneity of readers. To address this gap, we propose a novel readability assessment method tailored to three distinct audiences: expert adults, non-expert adults, and children. Our approach is built upon a structured, bilingual biomedical corpus of 20,008 documents (8,854 in French, 11,154 in English), compiled from multiple sources to ensure diversity in both content and audience. Specifically, the French corpus combines texts from Cochrane and Wikipedia/Vikidia, both of which are subsets of the CLEAR corpus, while the English corpus merges documents from the Cochrane Library, Plaba, and Science Journal for Kids. For each original expert-level text, domain specialists produced simplified variants calibrated specifically to the comprehension abilities of non-expert adults or children. Every document is therefore explicitly labeled by its target audience. Leveraging this resource, we trained a diverse suite of classifiers, from classical approaches (e.g., XGBoost, SVM) to classifiers built upon language models (e.g., BERT, CamemBERT, BioBERT, DrBERT). We then designed a hybrid architecture "BioReadNet" that integrates transformer embeddings with expert-driven linguistic features, achieving a macro-averaged F1 score of 0.987.

Visual Large Language Models for Graphics Understanding: A Case Study on Floorplan Images

Valeria Nardoni
Kimiya Noor Ali
Zahra Ziran
Simone Marinai

This study explores the use of Vision Large Language Models (VLLMs) for identifying items in complex graphical documents. In particular, we focus on looking for furniture objects (e.g. beds, tables, and chairs) and structural items (doors and windows) in floorplan images. We evaluate one object detection model (YOLO) and state-of-the-art VLLMs on two datasets featuring diverse floorplan layouts and symbols. The experiments with VLLMs are performed with a zero-shot setting, meaning the models are tested without any training or fine-tuning, as well as with a few-shot approach, where examples of items to be found in the image are given to the models in the prompt. The results highlight the strengths and limitations of VLLMs in recognizing architectural elements, providing guidance for future research in the use multimodal vision-language models for graphics recognition.

Designing Visual Tools for Writing Process Analysis

Cerstin Mahlow

Understanding how texts are produced is crucial not only for the development of theoretical models and writing strategies, but also for practical applications. However, the writing process itself--- including intermediate versions, copy-paste actions, input from co-authors or LLMs---remains invisible in the final text. This study addresses this gap by visualizing fine-grained keystroke logging data to capture both the product (final text) and the process (writer's actions) at sentence and text level. We design and implement custom JavaScript visualizations of linguistically processed keystroke logging data. Our pilot study examines data from nine students writing under identical conditions; we analyze temporal, spatial, and structural aspects of writing. The results reveal diverse, nonlinear writing strategies and suggest that individualized process visualizations can inform both document engineering and writing analytics. The novel visualization types we present demonstrate how process and product can be meaningfully integrated.

Synthetic Document Generation with Full Annotation: A Framework Utilizing Open-Weight Large Language Models

Pablo Melendez Abarca
Clemens Havas

Advances in generative AI allow to create synthetic semi-structured documents with ease. Proprietary large language models (LLM) can now generate realistic receipts (a type of semi-structured data) with a prompt; however, their usefulness for training document understanding models is limited without accurate annotations. This work presents a framework that uses open-weight LLMs to create fully-annotated receipts. The framework can be extended or modified, and includes a step for self-assessment. We use the open dataset of receipts, SROIE, to test the usefulness of the generated receipts, and show that mixing both datasets can improve information extraction up to 32.9% for specific fields of the SROIE dataset.

An Adaptive Agentic Tool Building Architecture leveraging Expert-in-the-Loop Guidance, applied to Document Generation

Xavier Daull
Elisabeth Murisasco
Patrice Bellot
Emmanuel Bruno
Vincent Martin

We introduce a general-purpose agentic architecture with expert-in-the-loop guidance that iteratively learns to create tools for searching information and generating documents while minimizing time for task domain adaptation and human feedback. We illustrate it with preliminary experiments on scientific synthesis processes (e.g. state-of-the-art research papers, patents) and could be applied to various problems requiring long structured and nuanced answers.

SESSION: Document Trust and Security

Session details: Document Trust and Security

Charlotte Curtis

Reinforcing Document Privacy in Nigeria: A Framework for Trust in National Data Systems

Fatima-Taslima Hassan
Richey Okoh-Michael

Nigeria has seen gradual growth in this era of digital transformation. Currently, it is being used to improve governance and drive business processes within the country. There have been significant efforts from the Nigerian government to encourage businesses to adopt technology in running day-to-day operations. This does not come without challenges. Document privacy and data protection remain major challenges in Nigeria's digital transformation journey. Despite the decreeing of the Nigeria Data Protection Act (NDPA) in 2023, the country continues to experience systemic document breaches, insider threats, and fragmented data infrastructure. This paper demonstrates how Nigeria's challenges differ fundamentally from intentional database separation in developed countries. The technical framework proposes integrated solutions. This paper focuses on the institutional weaknesses undermining privacy in national document systems, including overreliance on paper-based processes, a lack of harmonized databases, and corruption in IT governance.

Document Encryption in Practice: A Comparative Framework and Evaluation

Isaac Henry Teuscher
Benjamin Schooley

Document files with sensitive information are used across nearly every industry. In recent years, cyberattacks have resulted in millions of sensitive documents being exposed. Although document encryption methods exist, they are often flawed in terms of usability, security, or deployability. We present a structured framework for evaluating document encryption methods, adapting the usability-deployability-security ("UDS") model to the document encryption context. We apply this framework to compare current methods, performing a comprehensive evaluation of nine document protection methods, including password-based, passwordless, and cloud-based approaches. Our analysis across 15 design properties highlights the benefits and limitations of current methods. We propose strategies and design recommendations to address key limitations such as memory-wise effort, granular protection, and shareability.

Hierarchical Clustering of the SOREL Malware Corpus

Raguvir S
Charles Nicholas

We discuss the use of hierarchical clustering to identify similar specimens in a large malware corpus. Clustering of any kind requires the use of a distance function, and evaluation of clustering algorithms requires criteria that involve some sort of ground truth. We use Jaccard Distance as the ground truth, and we compare the results of clustering when using MinHash and SuperMinHash, both of which are approximations of Jaccard, while supposedly being faster. This work therefore is a study of this tradeoff between speed and clustering quality.