DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering

Full Citation in the ACM Digital Library

Binarization of photographed documents image quality, processing time and size assessment

Today, over eighty percent of the world's population owns a smart-phone with an in-built camera, and they are very often used to photograph documents. Document binarization is a key process in many document processing platforms. This competition on binarizing photographed documents assessed the quality, time, space, and performance of five new algorithms and sixty-four "classical" and alternative algorithms. The evaluation dataset is composed of offset, laser, and deskjet printed documents, photographed using six widely-used mobile devices with the strobe flash on and off, under two different angles and places of capture.

How did dennis ritchie produce his PhD thesis?: a typographical mystery

Dennis Ritchie, the creator of the C programming language and, with Ken Thompson, the co-creator of the Unix operating system, completed his Harvard PhD thesis on recursive function theory in early 1968. But for unknown reasons, he never officially received his degree, and the thesis itself disappeared for nearly 50 years. This strange set of circumstances raises at least three broad questions:

• What was the technical contribution of the thesis?

• Why wasn't the degree granted?

• How was the thesis prepared?

This paper investigates the third question: how was a long and typographically complicated mathematical thesis produced at a very early stage in the history of computerized document preparation?

Graphical document representation for french newsletters analysis

Document analysis is essential in many industrial applications. However, engineering natural language resources to represent entire documents is still challenging. Besides, available resources in French are scarce and do not cover all possible tasks, especially in specific business applications. In this context, we present a French newsletter dataset and its use to predict the good or bad impact of newsletters on readers. We propose a new representation of newsletters in the form of graphs that consider the newsletters' layout. We evaluate the relevance of the proposed representation to predict a newsletter's performance in terms of open and click rates using graph analysis methods.

A cascaded approach for page-object detection in scientific papers

In recent years, Page Object Detection (POD) has become a popular document understanding task, proving to be a non-trivial task given the potential complexity of documents. The rise of neural networks facilitated a more general learning approach to this task. However, in the literature, the different objects such as formulae, or figures among others, are generally considered individually. In this paper, we describe the joint localisation of six object classes relevant to scientific papers, namely isolated formulae, embedded formulae, figures, tables, variables and references. Through a qualitative analysis of these object classes, we note a hierarchy among the classes and propose a new localisation approach, using two, cascaded You Only Look Once (YOLO) networks. We also present a new data set consisting of labelled bounding boxes for all six object classes. This data set combines two commonly used data sets in the literature for formulae localisation, adding to the document images in these data sets the labels for figures, tables, variables and references. Using this data set, we achieve an average F1-score of 0.755 across all classes, which is comparable to the state-of-the-art for the object classes when considered individually for localisation.

From print to online newspapers on small displays: a layout generation approach aimed at preserving entry points

Simply transposing the print newspapers into digital media can not be satisfactory because they were not designed for small displays. One key feature lost is the notion of entry points that are essential for navigation. By focusing on headlines as entry points, we show how to produce alternative layouts for small displays that preserve entry points quality (readability and usability) while optimizing aesthetics and style. Our approach consists in a relayouting approach implemented via a genetic-inspired approach. We tested it on realistic newspaper pages. For the case discussed here, we obtained more than 2000 different layouts where the font was increased by a factor of two. We show that the quality of headlines is globally much better with the new layouts than with the original layout. Future work will tend to generalize this promising approach, accounting for the complexity of real newspapers, with user experience quality as the primary goal.

Long-term lifecycle-related management of digital building documents: towards a holistic and standard-based concept for a technical and organizational solution in building authorities

The long-term lifecycle-related management of digital building information is essential to improve the overall quality of public built assets. However, this management task still poses great challenges for building authorities, as they are usually responsible for large, heterogeneous and long-lived built assets with countless of data sets and documents that are increasingly changing from analogue to digital representations. These digital collections are characterized by complex dependencies, by numerous different, sometimes highly specialized and proprietary formats and also by their inappropriate organization. The major challenge is to ensure completeness, consistency and usability over the entire lifecycle of buildings or their associated digital data and documents.

In this paper, we present an approach for a holistic and standard-based concept for a technical and organizational solution in building authorities. Holistic means integrating concepts for the long-term usability of digital building information, taking into account the framework conditions described in building authorities, including the introduction of BIM (building information modeling). To this end, we outline how the concepts of the consolidated and widely accepted ISO-standardized reference model OAIS (open archive information system) can be applied to a building-specific information architecture.

First, we sketch the history of electronic data processing in the building sector and introduce the essential concepts of OAIS. Then, we illustrate typical major actors and their (future) IT systems, including systems intended for OAIS-compliant long-term usability. Next, we outline major (future) software components and their interactions and assignment to lifecycle phases. Finally, we delineate how the generic information model of OAIS can be used.

In summary, ensuring the long-term usability of digital information in the building sector will remain a grand challenge, but our proposed approach to the systematic application and further refinement of the OAIS reference model can help to better organize future discussions as well as research, development and implementation activities.

We conclude with some suggestions for further research based on the concepts of the OAIS reference model, such as refining information models or developing information repositories needed for long-term interpretation of digital objects.

Theory entity extraction for social and behavioral sciences papers using distant supervision

Theories and models, which are common in scientific papers in almost all domains, usually provide the foundations of theoretical analysis and experiments. Understanding the use of theories and models can shed light on the credibility and reproducibility of research works. Compared with metadata, such as title, author, keywords, etc., theory extraction in scientific literature is rarely explored, especially for social and behavioral science (SBS) domains. One challenge of applying supervised learning methods is the lack of a large number of labeled samples for training. In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus consisting of more than 4500 automatically annotated sentences containing theory/model mentions. We use this corpus to train models for theory extraction in SBS papers. We compared four deep learning architectures and found the RoBERTa-BiLSTM-CRF is the best one with a precision as high as 89.72%. The model is promising to be conveniently extended to domains other than SBS. The code and data are publicly available at

Tab this folder of documents: page stream segmentation of business documents

In the midst of digital transformation, automatically understanding the structure and composition of scanned documents is important in order to allow correct indexing, archiving, and processing. In many organizations, different types of documents are usually scanned together in folders, so it is essential to automate the task of segmenting the folders into documents which then proceed to further analysis tailored to specific document types. This task is known as Page Stream Segmentation (PSS). In this paper, we propose a deep learning solution to solve the task of determining whether or not a page is a breaking-point given a sequence of scanned pages (a folder) as input. We also provide a dataset called TABME (TAB this folder of docuMEnts) generated specifically for this task. Our proposed architecture combines LayoutLM and ResNet to exploit both textual and visual features of the document pages and achieves an F1 score of 0.953. The dataset and code used to run the experiments in this paper are available at the following web link:

Modifying PDF sewing patterns for use with projectors

Print-at-home PDF sewing patterns have gained popularity over the last decade and now represent a significant proportion of the home sewing pattern market. Recently, an all-digital workflow has emerged through the use of ceiling-mounted projectors, allowing for patterns to be projected directly onto fabric. However, PDF patterns produced for printing are not suitable for projecting.

This paper presents PDFStitcher, an open-source cross-platform graphical tool that enables end users to modify PDF sewing patterns for use with a projector. The key functionality of PDFStitcher is described, followed by a brief discussion on the future of sewing pattern file formats and information processing.

SeNMFk-SPLIT: large corpora topic modeling by semantic non-negative matrix factorization with automatic model selection

As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Contrary to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv.

Downstream transformer generation of question-answer pairs with preprocessing and postprocessing pipelines

We present a method to perform a downstream task of transformers on generating question-answer pairs (QAPs) from a given article. We first finetune pretrained transformers on QAP datasets. We then use a preprocessing pipeline to select appropriate answers from the article, and feed each answer and the relevant context to the finetuned transformer to generate a candidate QAP. Finally we use a postprocessing pipeline to filter inadequate QAPs. In particular, using pretrained T5 models as transformers and the SQuAD dataset as the finetruning dataset, we obtain a finetuned T5 model that outperforms previous models on standard performance measures over the SQuAD dataset. We then show that our method based on this finetuned model generates a satisfactory number of QAPs with high qualities on the Gaokao-EN dataset assessed by human judges.

Academic writing and publishing beyond documents

Research on writing tools stopped in the late 1980s when Microsoft Word had achieved monopoly status. However, the development of the Web and the advent of mobile devices are increasingly rendering static print-like documents obsolete. In this vision paper we reflect on the impact of this development on scholarly writing and publishing. Academic publications increasingly include dynamic elements, e.g., code, data plots, and other visualizations, which clearly requires other tools for document production than traditional word processors. When the printed page no longer is the desired final product, content and form can be addressed explicitly and separately, thus emphasizing the structure of texts rather than the structure of documents. The resulting challenges have not yet been fully addressed by document engineering.

Optical character recognition with transformers and CTC

Text recognition tasks are commonly solved by using a deep learning pipeline called CRNN. The classical CRNN is a sequence of a convolutional network, followed by a bidirectional LSTM and a CTC layer. In this paper, we perform an extensive analysis of the components of a CRNN to find what is crucial to the entire pipeline and what characteristics can be exchanged for a more effective choice. Given the results of our experiments, we propose two different architectures for the task of text recognition. The first model, CNN + CTC, is a combination of a convolutional model followed by a CTC layer. The second model, CNN + Tr + CTC, adds an encoder-only Transformers between the convolutional network and the CTC layer. To the best of our knowledge, this is the first time that a Transformers have been successfully trained using just CTC loss. To assess the capabilities of our proposed architectures, we train and evaluate them on the SROIE 2019 data set. Our CNN + CTC achieves an F1 score of 89.66% possessing only 4.7 million parameters. CNN + Tr + CTC attained an F1 score of 93.76% with 11 million parameters, which is almost 97% of the performance achieved by the TrOCR using 334 million parameters and more than 600 million synthetic images for pretraining.

Optical character recognition guided image super resolution

Recognizing disturbed text in real-life images is a difficult problem, as information that is missing due to low resolution or out-of-focus text has to be recreated. Combining text super-resolution and optical character recognition deep learning models can be a valuable tool to enlarge and enhance text images for better readability, as well as recognize text automatically afterwards. We achieve improved peak signal-to-noise ratio and text recognition accuracy scores over a state-of-the-art text super-resolution model TBSRN on the real-world low-resolution dataset TextZoom while having a smaller theoretical model size due to the usage of quantization techniques. In addition, we show how different training strategies influence the performance of the resulting model.

Anonymizing and obfuscating PDF content while preserving document structure

The portable document format (PDF) is both versatile and complex, with a specification exceeding well over a thousand pages. For independent developers writing software that reads, displays, or transforms PDFs, it is difficult to comprehensively account for all of the potential variations that might exist in the wild. Compounding this problem are the usage agreements that often accompany purchased and proprietary PDFs, preventing end users from uploading a troublesome document as part of a bug report and limiting the set of test cases that can be made public for open source development.

In this paper, pdf-mangler is presented as a solution to this problem. The goal of pdf-mangler is to remove information in the form of text, images, and vector graphics while retaining as much of the document structure and general visual appearance as possible. The intention is for pdf-mangler to be deployed as part of an automated bug reporting tool for PDF software.

Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at

Detecting malware using text documents extracted from spam email through machine learning

Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In this work, we present and make publicly available "Spam Email Malware Detection - 600" (SEMD-600), a new dataset, based on Bruce Guenter's, for malware detection in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.

Triplet transformer network for multi-label document classification

Multi-label document classification is the task of assigning one or more labels to a document, and has become a common task in various businesses. Typically, current state-of-the-art models based on pretrained language models tackle this task without taking the textual information of label names into account, therefore omitting possibly valuable information. We present an approach that leverages this information stored in label names by reformulating the problem of multi label classification into a document similarity problem. To achieve this, we use a triplet transformer network that learns to embed labels and documents into a joint vector space. Our approach is fast at inference, classifying documents by determining the closest and therefore most similar labels. We evaluate our approach on a challenging real-world dataset of a German radio-broadcaster and find that our model provides competitive results compared to other established approaches.

Chinese public procurement document harvesting pipeline

We present a processing pipeline for Chinese public procurement document harvesting, with the aim of producing strategic data with greater added value. It consists of three micro-modules: data collection, information extraction, database indexing. The information extraction part is implemented through a hybrid system which combines rule-based and machine learning approaches. Rule-based method is used for extracting information with presenting recurring morphological features, such as dates, amounts and contract awardee information. Machine learning method is used for trade detection in the title of procurement documents.