DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020

SESSION: Tutorial and Challenges

Machine Interpretation of Sketched Documents

Sketches and drawings have an integral role in the communication of ideas and form concepts. In this tutorial we focus on the use of sketches in the design and manufacturing process. Here, the drawing adapts from an initial, rough concept sketch to precise machine-drawings. While several computer-aided design tools exist which support the use of different drawings, the ambiguity that is typically associated with initial sketches often means that the initial concept sketch needs to be redrawn according to the specific interfaces used within these systems. This tutorial discusses the challenges that exist in the interpretation of sketches and presents drawing interpretation techniques that address some of these challenges.

DocEng'2020 Time-Quality Competition on Binarizing Photographed Documents

Document image binarization is a key process in many document processing platforms. The DocEng'2020 Time-Quality Competition on Binarizing Photographed Documents assessed the performance of eight new algorithms and also 41 other "classical" algorithms. Besides the quality of the binary image, the execution time of the algorithms was assessed. The evaluation dataset is composed of 32 documents photographed using four widely-used mobile devices with the strobe flash on and off, under several different angles of capture.
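As an illustration of what a "classical" global binarization algorithm looks like, the sketch below implements Otsu's method on a flat list of 8-bit gray levels. The abstract does not name the algorithms that were assessed, so the choice of Otsu's method, the function names, and the ink/paper convention are all assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

A single global threshold like this is fast, which matters when execution time is assessed alongside quality, but it may cope poorly with the uneven illumination that a strobe flash produces.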

DocEng'2020 Competition on Extractive Text Summarization

The DocEng'2020 Competition on Extractive Text Summarization assessed the performance of six new methods and fourteen classical algorithms for extractive text summarization. The systems were evaluated on the CNN-Corpus, the largest test set available today for single-document extractive summarization, using two different strategies and both the ROUGE and direct match measures.
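The two kinds of measure mentioned above can be sketched roughly as follows. This is a simplified illustration, not the competition's evaluation code: `rouge1_recall` computes unigram-overlap recall on whitespace tokens, and `direct_match` assumes that the measure counts system-selected sentences that also appear in the gold-standard summary, with sentences identified by their index in the document.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clip each token's credit by how often it occurs in the candidate.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())


def direct_match(selected, gold):
    """Number of selected sentence indices that are also in the gold summary."""
    return len(set(selected) & set(gold))
```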

SESSION: Full papers

PDF2LaTeX: A Deep Learning System to Convert Mathematical Documents from PDF to LaTeX

The mathematical contents of scientific publications in PDF format cannot be easily analyzed by regular PDF parsers and OCR tools. In this paper, we propose a novel OCR system called PDF2LaTeX, which extracts math expressions and text in both postscript and image-based PDF files and translates them into LaTeX markup. As a preprocessing step, PDF2LaTeX first renders a PDF file into its image format, and then uses projection profile cutting (PPC) to analyze the page layout. The analysis of math expressions and text is based on a series of deep learning algorithms. First, it uses a convolutional neural network (CNN) as a binary classifier to detect math image blocks based on visual features. Next, it uses a conditional random field (CRF) to detect math-text boundaries by incorporating semantics and context information. In the end, the system uses two different models based on a CNN-LSTM neural network architecture to translate image blocks of math expressions and plaintext into the LaTeX representations. For testing, we created a new dataset composed of 102 PDF pages collected from publications on and compared the performance between PDF2LaTeX and the state-of-the-art commercial software InftyReader. The experiment results showed that the proposed system achieved a better recognition accuracy (81.1%) measured by the string edit distance between the predicted LaTeX and the ground truth.
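To make the accuracy measure concrete, the sketch below computes the Levenshtein (string edit) distance between a predicted LaTeX string and its ground truth, and turns it into a score between 0 and 1. The normalization by the longer string's length is an assumption for this example; the abstract does not state how the distance is converted into the reported accuracy.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def edit_accuracy(predicted, truth):
    """1 minus the edit distance normalized by the longer length (assumed)."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, truth) / longest
```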

Change Detection on JATS Academic Articles: An XML Diff Comparison Study

XML is currently a well-established and widely used document format. It is used as a core data container in collaborative writing suites and other modern information architectures. The extraction and analysis of differences between two XML document versions is an attractive topic, and has already been tackled by several research groups. The goal of this study is to compare 12 existing state-of-the-art and commercial XML diff algorithms by applying them to JATS documents in order to extract and evaluate changes between two versions of the same academic article. Understanding changes between two article versions is important not only regarding data, but also semantics. Change information consumers in our case are editorial teams, and thus they are more generally interested in change semantics than in the exact data changes. The existing algorithms are evaluated on the following aspects: their edit detection suitability for both text and tree changes, execution speed, memory usage and delta file size. The evaluation process is supported by a Python tool available on GitHub.

Direct Sampling of Multiview Line Drawings for Document Retrieval

Engineering drawings, scientific data, and governmental document repositories rely on degraded two-dimensional images to represent physical three-dimensional objects. The collection of two-dimensional multiview images is generated from a set of known camera positions that are aimed directly at the target object. These images provide a convenient method for representing the original physical object but significantly degrade the interpretability of the object. The multiview images from the document repositories may be integrated to reconstruct an approximation of the original physical object as a point cloud. We show that retrieval methods for documents are improved by directly sampling point clouds from the multiview image set to reconstruct the original physical object. We compare the retrieval results from direct image retrieval, multiview convolutional neural networks (MVCNN), and point clouds reconstructed from sampled images. To evaluate these models, we trained them on line drawings generated from models in the ShapeNet Core data set. We show that retrieval of the reconstructed object is more accurate than single image retrieval or multiview image set retrieval.

Cardinal Graph Convolution Framework for Document Information Extraction

Graph Convolutional Networks (GCN) have been recognized as successful for processing pseudo-spatial graph representations of the underlying structure of documents. We present Cardinal Graph Convolutional Networks (CGCN), an efficient and flexible extension of GCNs with cardinal-direction awareness of spatial node arrangement. The formulation of CGCNs retains the traditional GCN permutation invariance, ensuring directional neighbors are involved in learning abstract representations, even in the absence of a proper ordering of the nodes. We show that CGCNs achieve state-of-the-art results on an invoice information extraction task, jointly learning a word-level tagging as well as document meta-level classification and regression. We also present a new multiscale Inception-like CGCN block-layer, as well as a Conv-Pool-DeConv-DePool UNet-like architecture, both of which increase the receptive field. We demonstrate the utility of CGCNs on private and public datasets, with respect to several baseline models: sequential LSTM, transformer classifier, non-cardinal GCNs, and an image-convolutional approach.

Order out of Chaos: Construction of Knowledge Models from PDF Textbooks

Textbooks are educational documents created, structured and formatted by domain experts with the main purpose to explain the knowledge in the domain to a novice. Authors use their understanding of the domain when structuring and formatting the content of a textbook to facilitate this explanation. As a result, the formatting and structural elements of textbooks carry the elements of domain knowledge implicitly encoded by their authors. Our paper presents an extendable approach towards automated extraction of this knowledge from textbooks taking into account their formatting rules and internal structure. We focus on PDF as the most common textbook representation format; however, the overall method is applicable to other formats as well. The evaluation experiments examine the accuracy of the approach, as well as the pragmatic quality of the obtained knowledge models using one of their possible applications -- semantic linking of textbooks in the same domain. The results indicate high accuracy of model construction on symbolic, syntactic and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.

An Assessment of Sentence Simplification Methods in Extractive Text Summarization

The unprecedented growth of textual content on the Web has made it essential to develop automatic or semi-automatic techniques that help people find valuable information in such a huge heap of text data. Automatic text summarization is one such technique, and it is being pointed out as a viable solution in such a chaotic scenario. Extractive text summarization, in particular, selects a set of sentences from a text according to specific criteria. Strategies for extractive summarization can benefit from preprocessing techniques that emphasize the relevance or informativeness of sentences with respect to the selection criteria. This paper tests such a hypothesis using sentence simplification methods. Four methods are used to simplify a corpus of news articles in English: a rule-based method, an optimization method, a supervised deep learning model and an unsupervised deep learning model. The simplified outputs are summarized using 14 sentence selection strategies. The combinations of simplification and summarization methods are compared with the baseline --- the summarized corpus without previous simplification --- with a quantitative analysis, which suggests that sentence compression with restrictions and models learned from large parallel corpora tend to perform better and yield gains over summarization without prior simplification.

SESSION: Short papers and Application notes

Improving query expansion strategies with word embeddings

Representation learning has been a fruitful area in recent years, driven by the growing interest in deep learning methods. In particular, word representation learning, a.k.a. word embeddings, has triggered progress in different natural language processing (NLP) tasks. Despite the success of word embeddings in tasks such as named entity recognition or textual entailment, their use is still embryonic in query expansion. In this work, we examine the usefulness of word embeddings to represent queries and documents in query-document matching tasks. For this purpose, we use a re-ranking strategy. The re-ranking phase is conducted using representations of queries and documents based on word embeddings. We introduce IDF average word embeddings, a new text representation strategy based on word embeddings, which allows us to create a query vector representation that gives higher relevance to informative terms during the process. Experimental results on TREC benchmark datasets show that our proposal consistently achieves the best results in terms of MAP.
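One plausible reading of "IDF average word embeddings" is a weighted mean of word vectors in which each term's weight is its inverse document frequency, so that rare, informative terms dominate the query vector. The sketch below follows that reading; the exact weighting used in the paper is not given in the abstract, and the function names, the `log(N / df)` form of IDF, and the toy embeddings are assumptions.

```python
import math


def idf_table(documents):
    """Map each term to log(N / df) over a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def idf_average_embedding(tokens, embeddings, idf):
    """IDF-weighted mean of the vectors of the tokens that have both an
    embedding and an IDF value; returns a zero vector if no weight is left."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings or token not in idf:
            continue
        weight = idf[token]
        weight_sum += weight
        for k, value in enumerate(embeddings[token]):
            total[k] += weight * value
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


def cosine(u, v):
    """Cosine similarity, usable to re-rank documents against a query vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In a re-ranking setup, documents retrieved by a first-stage ranker would be re-ordered by `cosine` between the query vector and each document vector.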

Interactive and Scalable visualization framework for Version-aware XML documents

The Extensible Markup Language (XML) is widely used to store, retrieve, and share digital documents. Recently, a form of Version Control System has been applied to the language, resulting in Version-Aware XML and allowing for enhanced portability and scalability. While Version Control Systems are able to keep track of changes made to documents, we think that there is untapped potential in the technology. In this paper, we define a set of requirements for visualization tools in a modern version control system. We also present an interactive and scalable visualization framework to represent Version-Aware-related data that helps users visualize and understand version control data, delete specific revisions of a document, and access a comprehensive overview of the entire versioning history.

We evaluated our interface prototype by conducting semi-structured usability tests and questionnaires to obtain both qualitative and quantitative feedback from volunteers with a general technological background.

A Framework to Evaluate Webpage Segment Recognizers

Many webpages are inaccessible because they cannot be read by non-sighted human users, nor can they be appropriately indexed by search engines. Using tools that can identify and classify the functional elements of pages, it is possible to create accessible versions. A way to achieve this goal is to extract segments and assign the right semantic role to them.

We evaluate tools that read webpages and automatically recognize and classify the functional elements, also called segments, and create a measure of their accuracy so tools can be compared.

We present a proof-of-concept of a methodology: a method and tools to measure accuracy, and a procedure to build ground truth based on labels by the webpages' designers.

Our contribution is an evaluative framework to assess how well tools that parse the functional units in webpages and assign semantic roles to those units can satisfy the needs of search engines, non-sighted users, researchers and developers.

Short Text Stream Clustering via Frequent Word Pairs and Reassignment of Outliers to Clusters

Short text stream clustering is an important but challenging task since massive amounts of text are generated from different social media. Given streams of texts, the proposed method clusters the streams of texts based on the frequently occurring word pairs (not necessarily consecutive) in texts. It detects outliers in the clusters and reassigns the outliers to appropriate clusters using the semantic similarity between the outliers and the clusters based on the dynamically computed similarity thresholds. Thus the proposed method efficiently deals with the concept drift problem. Experimental results demonstrate that the proposed approach outperforms the state-of-the-art short text stream clustering algorithms by a statistically significant margin on several short text datasets.
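The idea of frequently co-occurring, not necessarily consecutive, word pairs can be illustrated with a short sketch. It only covers pair extraction and counting; the clustering step, the outlier reassignment, and the dynamic similarity thresholds described above are not reproduced. The function names and the `min_count` cutoff are assumptions.

```python
from collections import Counter
from itertools import combinations


def word_pairs(text):
    """All unordered pairs of distinct words in a text, adjacent or not."""
    words = sorted(set(text.lower().split()))
    return set(combinations(words, 2))


def frequent_pairs(texts, min_count=2):
    """Word pairs that co-occur in at least min_count texts of the stream."""
    counts = Counter()
    for text in texts:
        counts.update(word_pairs(text))
    return {pair for pair, count in counts.items() if count >= min_count}
```

Texts that share a frequent pair are natural candidates for the same cluster, while texts containing no frequent pair are natural candidates for outlier handling.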

Parsing a markup language that supports overlap and discontinuity

Text As Graph Markup Language (TAGML) is a recently developed markup language that offers core support for overlapping and discontinuous markup. Designing and implementing a markup language technology stack that supports overlap poses numerous challenges; the most prominent being that the markup language cannot be expressed in a recursive context-free (CF) grammar. In this short paper we discuss our experiments with parsing TAGML based on a context-sensitive grammar. Our current approach implements an attribute grammar, which consists of a CF grammar with semantic actions. We discuss the advantages and disadvantages of our approach, and sketch several alternative methods.

COVID-19 Kaggle Literature Organization

The world has faced the devastating outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), or COVID-19, in 2020. Research in the subject matter was fast-tracked to such a point that scientists were struggling to keep up with new findings. With this increase in the scientific literature, there arose a need for organizing those documents. We describe an approach to organize and visualize the scientific literature on or related to COVID-19 using machine learning techniques so that papers on similar topics are grouped together. By doing so, the navigation of topics and related papers is simplified. We implemented this approach using the widely recognized CORD-19 dataset to present a publicly available proof of concept.

A Framework for Extracted View Maintenance

When information extraction programs (extractors) are applied to documents, they create relations that store facts found in the documents. In this work, we formalize and address the problem of keeping such extracted relations consistent with source documents that are arbitrarily updated. We define three classes of document updates, namely those that are irrelevant, autonomously computable, and pseudo-irrelevant with respect to a given extractor. Finally, we propose algorithms to detect pseudo-irrelevant document updates with respect to extractors that are expressed as document spanners, a model of information extraction inspired by SystemT.

HTR-Flor++: A Handwritten Text Recognition System Based on a Pipeline of Optical and Language Models

Offline Handwritten Text Recognition (HTR) is a challenging computer vision task in which images are the only source of information. Several approaches to optical models have been developed, such as Hidden Markov Models (HMM) or recurrent Bidirectional/Multidimensional layers. The current state of the art consists of combined deep learning techniques, the Convolutional Recurrent Neural Networks (CRNN), in which recurrent layers still suffer from the vanishing gradient problem when processing very long texts. Moreover, high-performance models generally have millions of trainable parameters and a high computational cost. Recently, however, a new optical model architecture, Gated-CNN, demonstrated improvements that complement CRNN modeling. In this work, we present a new small architecture for HTR (based on Gated-CNN) integrated with two language model steps, at the character and word levels, respectively. We used 9 state-of-the-art approaches and validated the results using the public IAM dataset. The proposed model surpasses the results obtained by different approaches in the literature, reaching recognition rates of CER 2.7% and WER 5.6%, which means an improvement of 13% over the best results on the IAM dataset.

Assessing Causality Structures learned from Digital Text Media

In this paper we describe a framework to uncover potential causal relations between event mentions from streaming text of news media. This framework relies on a dataset of manually labeled events to train a recurrent neural network for event detection. It then creates a time series of event clusters, where clusters are based on BERT contextual word embedding representations of the identified events. Using this time series dataset, we assess four methods based on Granger causality for inferring causal relations. Granger causality is a statistical concept of causality that is based on forecasting. It states that a cause occurs before the effect, and the cause produces unique changes in the effect, so past values of the cause help predict future values of the effect. The four analyzed methods are the pairwise Granger test, VAR(1), BigVar and SiMoNe. The framework is applied to the New York Times dataset, which covers news for a period of 246 months. This preliminary analysis delivers important insights into the nature of each method, identifies differences and commonalities, and points out some of their strengths and weaknesses.
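For reference, the pairwise Granger test can be written out in its standard textbook form. The version below uses a single lag, an assumption chosen to match the VAR(1) setting named above; the abstract does not state the lag order used. Here $x_t$ and $y_t$ are two event-cluster time series, and $T$ is the number of observations used in the regressions.

```latex
% Restricted model: y is predicted from its own past only.
y_t = \alpha_0 + \alpha_1 y_{t-1} + \varepsilon_t

% Unrestricted model: the past of x is added as a predictor.
y_t = \alpha_0 + \alpha_1 y_{t-1} + \beta_1 x_{t-1} + u_t

% Null hypothesis: x does not Granger-cause y.
H_0 : \beta_1 = 0

% F statistic from the residual sums of squares of the two fits
% (1 restriction, 3 parameters in the unrestricted model).
F = \frac{(\mathrm{RSS}_r - \mathrm{RSS}_u) / 1}{\mathrm{RSS}_u / (T - 3)}
```

A large $F$ value means that adding $x_{t-1}$ reduces the forecast error for $y_t$ by more than chance would explain, so $H_0$ is rejected and $x$ is said to Granger-cause $y$.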

The Old Bailey and OCR: Benchmarking AWS, Azure, and GCP with 180,000 Page Images

The Proceedings of the Old Bailey is a corpus of over 180,000 page images of court records printed from April 1674 to April 1913 and presents a comprehensive challenge for Optical Character Recognition (OCR) services. The Old Bailey is an ideal benchmark for historical document OCR, representing more than two centuries of variations in documents, including spellings, formats, and printing and preservation qualities. In addition to its historical and sociological significance, the Old Bailey is filled with imperfections that reflect the reality of coping with large-scale historical data. Most importantly, the Old Bailey contains human transcriptions for each page, which can be used to help measure OCR accuracy. Since humans do make mistakes in transcriptions, the relative performance of OCR services will be more informative than their absolute performance. This paper compares three leading commercial OCR cloud services: Amazon Web Services's Textract (AWS); Microsoft Azure's Cognitive Services (Azure); and Google Cloud Platform's Vision (GCP). Benchmarking involved downloading over 180,000 images, executing the OCR, and measuring the error rate of the OCR text against the human transcriptions. Our results found that AWS had the lowest median error rate, Azure had the lowest median round trip time, and GCP had the best combination of a low error rate and a low duration.

ServiceMarq: Extracting Service Contributions from Call for Papers

In an era where large numbers of academic research papers are submitted to conferences and journals, the voluntary service of academics in managing them is indispensable. The call for contributions of research papers, whether sent by e-mail or published as a webpage, not only solicits research work from scientists but also lists the names of the researchers and their roles in managing the conference. Tracking such information, which showcases the researchers' leadership qualities, is becoming increasingly important. Here we present ServiceMarq, a system which proactively tracks service contributions to conferences. It performs focused crawling for website-based calls for papers, and integrates archival and natural language processing libraries to achieve both high precision and recall in extracting information. Our results indicate that aggregated service contribution gives an alternative but correlated picture of institutional quality compared against standard bibliometrics. In addition, we have developed a proof of concept website to track service contributions and is available at and our github repo is available at

COVIDSeer: Extending the CORD-19 Dataset

We develop an enhanced version of the CORD-19 dataset released by the Allen Institute for AI. Tools in the SeerSuite project are used to exploit information in original articles not directly provided in the CORD-19 datasets. We add 728 new abstracts, 70,102 figures and 31,446 tables with captions that are not provided in the current data release. We also built a vertical search engine, COVIDSeer, based on the new dataset we created. COVIDSeer has a relatively simple architecture with features like keyword filtering and similar paper recommendation. The goal was to provide a system and dataset that can help scientists better navigate through the literature concerning COVID-19. The enriched dataset can serve as a supplement to the existing dataset. The search engine, which offers keyphrase-enhanced search, will hopefully help biomedical and life science researchers, medical students, and the general public to more effectively explore coronavirus-related literature. The entire data set and the system will be made open source.

Automatic Generation of Electrical Plan Documents from Architectural Data

This paper explores a novel application of document generation: the automatic creation of residential electrical plans from architectural data. Electrical plan documents, crucial to all residential construction and renovation projects, are currently generated manually by electrical designers who must adhere to the local electrical code and follow industry best practices. The designers decide the type and location of all household electrical devices and outlets based on the architectural floor plans. This is a tedious, highly repetitive, and time-consuming task. We propose a procedural approach to automate the generation of residential electrical plans via a stack-based finite state machine model that mimics the electrical designer's thought process. The system receives 2D architectural data (e.g. wall location) as input and yields a customized electrical plan as output. Experimental results on a variety of architectural layouts of bathrooms, bedrooms, and kitchens are very promising and demonstrate the approach's functionality and usefulness. This paper paves the way for new algorithmic tools facilitating the design cycle of building projects.