DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023

Full Citation in the ACM Digital Library

SESSION: Keynote Talks

Making a Difference through Data Visualisation

Kinsetsu make a difference in critical sectors. Through automated visualisation of locations, environments, and assets, Kinsetsu optimise service flows, unburden teams, and improve safety and outcomes for citizens. Joanne will discuss how Kinsetsu has provided actionable insights through their IoT platform and sensor data within Healthcare, Defence and Local Government, working with customers to visualise real-time data, including digital mapping and analytics to manage anomalies within their local and remote environments, and bringing positive change. Joanne will demonstrate their "living documents" that dynamically represent the here and now, providing data at the right time to the right person in the right place, when and where they need it.

The Evolution and Growth of Engineering Documents for Consumer Engagement

Product and service differentiation by virtue of delivering a unique personalized consumer experience is considered by many to be the next competitive battleground. Due to the e-commerce sales surge during the pandemic, the consumer's first tangible interaction with a brand is now, more often than not, the receipt of a package purchased online. Brands increasingly view the consumer touchpoints with the brand, its packaging and labels, as critical to delivering and managing an event-based experience.

How you "engineer" your users to engage is the human factor/behavioral element of document engineering. This presentation will look at real-life examples of how brands are evolving their strategies around event-based experiences, both to deliver new brand marketing and consumer experience strategies and to create data sets from consumer engagement that address age-old business problems and challenges as well as new and emerging ones.

SESSION: Tutorials

Looking Beneath the Surface: The Science and Applications of Eye-Gaze Tracking for Assessing Visual Attention

The purpose of visual media is to convey information, ideas, concepts and emotions, and for this reason, the effectiveness of visual media can be assessed by how well it captures the attention of, and engages, its audience. Eye movement patterns have long been recognised as providing valuable insights into the cognitive processes that underlie attention, learning and memory, and as such they may shed light on how viewers engage with visual media. The process of measuring and analysing the movements of a person's eyes is called eye-gaze tracking, a powerful tool with a broad range of applications, not only in studying how people interact with visual content, but also in domains such as healthcare, driving, gaming, and many others. Thanks to advancements in technology, modern eye-gaze trackers have evolved into much less intrusive and more comfortable devices than their scary predecessors. This has also made eye-gaze trackers, whether screen-based, head-mounted, or embedded within VR headsets, more accessible and easy to use. In view of the increasing popularity of eye-gaze tracking for studying attention and engagement, this tutorial aims to explore this technology from different angles, including its development over the years, its technical workings, the metrics that may be used to quantify visual attention, and several application domains.

The tutorial will provide an overview of eye-gaze tracking, covering the state-of-the-art, the different types of technologies in use, the key eye movements and gaze tracking metrics in quantifying attention, the factors to consider when choosing an eye-gaze tracking device, and several eye-gaze tracking applications.

This tutorial is intended for all researchers, with or without a background in the domain, who are interested in learning more about eye-gaze tracking and its use in assessing visual attention.

Reviewer #2 Must Be Stopped!: Or, The Art of Providing Good Reviews

Love it or hate it, the peer review process (whether open, blind, or even double-blind) has become the standard and accepted way of assessing the quality of papers before publication, be it for a conference, journal, or book. Indeed, forming the program committee is an essential part of any conference organisation, and a good program committee may well be the differentiator from peer conferences. However, we have all been the recipients of a less than stellar/helpful review: from the snarky ones to the one-liners, these reviews can be demoralising and can give the peer-review process a bad reputation! The scope of this tutorial is thus to encourage researchers to become more involved in the peer-review process by joining program committees, and to promote good practices that collectively strengthen its quality.

The program committee consists of a group of individuals who voluntarily review submissions, vetting them so that the papers selected for publication are not only relevant to the field but also provide novel insights into problems within it. A truly excellent review, whether recommending acceptance or non-acceptance, motivates the authors to do more in the area their submission addresses. Individuals who form part of a program committee tend to be experts in their field, such that they can provide constructive critique of the submitted papers. But while it is easy to rely on well-established researchers, it is also important that program committees are refreshed with new and young researchers, as this keeps the research community moving onwards. Additionally, like any good mentor/mentee relationship, younger reviewers may be better able to point out "blind spots" to compelling new research that authors more "set in their ways" might have missed or omitted.

The tutorial will be divided into two parts. In the first part, we will provide an overview of the reviewing process, tips on how to read a paper for review and on using tools such as Google Scholar to verify references and novelty, as well as tips on writing the review to provide constructive criticism while avoiding the pitfall of becoming the infamous Reviewer #2. We will provide sample reviews from reviewers who have won "best reviewer" awards at conferences similar to DocEng.

The second part of the tutorial will take a more practical approach as participants will be able to carry out a mock review of a paper. Throughout the tutorial, the DocEng reviewing process will be used as an example. Participation and discussions from attendees will be encouraged!

This tutorial is intended for young researchers who are interested in starting to review papers, more mature researchers interested in improving the quality of the peer-review process, as well as anyone interested in finding out more about the reviewing process adopted at DocEng, and even potentially becoming part of the DocEng program committee.

SESSION: DocEng 2023 Challenge

Quality, Space and Time Competition on Binarizing Photographed Document Images

Document image binarization is a fundamental step in many document processes. No binarization algorithm performs well on all types of document images, as the different kinds of digitalization devices and the physical noise present in the document, or acquired in the digitalization process, alter their performance. Besides that, processing time is also an important factor that may restrict an algorithm's applicability. This competition on binarizing photographed documents assessed the quality, time, and space performance of five new algorithms and sixty-four "classical" and alternative algorithms. The evaluation dataset is composed of laser and deskjet printed documents, photographed using six widely-used mobile devices with the strobe flash on and off, under two different angles and places of capture.

SESSION: Document Modelling, Management and Representation

Session details: Document Modelling, Management and Representation

Dynamic Topic Modeling with Tensor Decomposition as a Tool to Explore the Legal Precedent Relevance Over Time

A precedent is a textual citation of prior court decisions. This undoubtedly offers great value in a common-law-based judicial system where courts are bound to their previous rulings, such as in the United States, Canada, and India. In those countries, precedent relevance detection is an issue that has attracted considerable attention, with studies proposing Network Science techniques for relevance measurement in which decisions and their relationships are represented as network structures. However, those methods fail to capture precedent relevance in the Brazilian scenario due to the massive and increasing number of decisions issued yearly. The Brazilian Supreme Court (STF), the highest judicial body in Brazil, has produced more than a million decisions over the last decade. Therefore, we propose an interpretable and cost-effective process to explore precedent through latent topics that emerge, evolve, and fade over time in a collection of historical documents. To do so, we explore dynamic topic modeling with tensor decomposition as a tool to investigate the legal changes embedded in those decisions over time. We base our study on the individual decisions published by the STF between 2000 and 2018. Additionally, through experiments, we explore the proposed process in different scenarios to investigate precedent citations over the STF's recent history, and how those citations correlate with legal named entities, such as legislative references. The experiments showed the process's capability to produce coherent and interpretable results for temporal analysis of precedent citations in larger collections of documents. Also, it presents the potential to support further studies in the legal domain.

Static Pruning for Multi-Representation Dense Retrieval

Dense retrieval approaches are challenging the prevalence of inverted index-based sparse representation approaches for information retrieval systems. Different families have arisen: single representations for each query or passage (such as ANCE or DPR), or multiple representations (usually one per token), as exemplified by the ColBERT model. While ColBERT is effective, it requires significant storage space for each token's embedding. In this work, we aim to prune the embeddings for tokens that are not important for effectiveness. We show that, by adapting standard uniform and document-centric static pruning methods to embedding-based indexes, while retaining their focus on low-IDF tokens, we can attain large improvements in space efficiency while maintaining high effectiveness. In experiments conducted on the MSMARCO passage ranking task, removing all embeddings corresponding to the 100 most frequent BERT tokens reduces the index size by 45%, with limited impact on effectiveness (e.g. no statistically significant degradation of NDCG@10 or MAP on the TREC 2020 queryset). Similarly, on TREC Covid, we observed a 1.3% reduction in nDCG@10 for a 38% reduction in total index size.
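
A minimal sketch of the uniform pruning idea described above, assuming the index is held as per-document token-id lists with matching NumPy embedding matrices; the function and variable names are illustrative, not the paper's code:

    import numpy as np
    from collections import Counter

    def prune_index(doc_token_ids, doc_embeddings, k=100):
        """Drop per-token embeddings whose tokens are among the k most frequent
        (i.e. low-IDF) tokens in the collection."""
        freq = Counter(t for toks in doc_token_ids for t in toks)
        frequent = {t for t, _ in freq.most_common(k)}
        pruned_ids, pruned_embs = [], []
        for toks, embs in zip(doc_token_ids, doc_embeddings):
            keep = [i for i, t in enumerate(toks) if t not in frequent]
            pruned_ids.append([toks[i] for i in keep])
            pruned_embs.append(embs[keep])      # embs is a (num_tokens, dim) array
        return pruned_ids, pruned_embs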

Genetic Generative Information Retrieval

Documents come in all shapes and sizes and are created by many different means, including, nowadays, generative language models. We demonstrate that a simple genetic algorithm can improve generative information retrieval by using a document's text as a genetic representation, a relevance model as a fitness function, and a large language model as a genetic operator that introduces diversity through random changes to the text to produce new documents. By "mutating" highly relevant documents and "crossing over" content between documents, we produce new documents of greater relevance to a user's information need --- validated in terms of estimated relevance scores from various models and via a preliminary human evaluation. We also identify challenges that demand further study.
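
The loop below sketches this idea under stated assumptions: relevance(query, doc), llm_mutate(doc) and llm_crossover(doc_a, doc_b) are hypothetical callables standing in for the relevance model and the LLM-based genetic operators, and the elitist selection scheme is an illustration rather than the paper's exact procedure (it assumes at least two seed documents).

    import random

    def genetic_retrieval(query, seed_docs, relevance, llm_mutate, llm_crossover,
                          generations=5, population=20):
        """Document text is the genome, the relevance model is the fitness
        function, and an LLM supplies mutation and crossover."""
        pop = list(seed_docs)
        for _ in range(generations):
            ranked = sorted(pop, key=lambda d: relevance(query, d), reverse=True)
            parents = ranked[:population // 2]          # keep the fittest half
            children = []
            while len(parents) + len(children) < population:
                a, b = random.sample(parents, 2)
                child = llm_crossover(a, b)             # blend content of two parents
                if random.random() < 0.3:               # occasional mutation
                    child = llm_mutate(child)
                children.append(child)
            pop = parents + children
        return max(pop, key=lambda d: relevance(query, d))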

Improving Zero-Shot Text Matching for Financial Auditing with Large Language Models

Auditing financial documents is a very tedious and time-consuming process. As of today, it can already be simplified by employing AI-based solutions to recommend relevant text passages from a report for each legal requirement of rigorous accounting standards. However, these methods need to be fine-tuned regularly, and they require abundant annotated data, which is often lacking in industrial environments. Hence, we present ZeroShotALI, a novel recommender system that leverages a state-of-the-art large language model (LLM) in conjunction with a domain-specifically optimized transformer-based text-matching solution. We find that a two-step approach of first retrieving a number of best matching document sections per legal requirement with a custom BERT-based model and second filtering these selections using an LLM yields significant performance improvements over existing approaches.
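
A hedged sketch of the two-step approach, where bert_scorer(requirement, section) and llm_judge(requirement, section) are hypothetical stand-ins for the fine-tuned BERT-based matcher and the LLM relevance check; the real system's prompts and thresholds are not reproduced here.

    def zero_shot_match(requirement, sections, bert_scorer, llm_judge, top_k=10):
        """Step 1: rank report sections with the BERT-style matcher.
        Step 2: keep only the top candidates the LLM judges relevant."""
        ranked = sorted(sections, key=lambda s: bert_scorer(requirement, s), reverse=True)
        return [s for s in ranked[:top_k] if llm_judge(requirement, s)]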

SESSION: Document Recognition, Summarisation and Inference

Session details: Document Recognition, Summarisation and Inference

WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation

Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular for table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods working from input document images to generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of the current methods, such as OCR errors. With this research (dataset and code are made available), we hope to encourage new research for the processing and management of inter- and intra-document collections.

Automatically Inferring the Document Class of a Scientific Article

We consider the problem of automatically inferring the (LaTeX) document class used to write a scientific article from its PDF representation. Applications include improving the performance of information extraction techniques that rely on the style used in each document class, or determining the publisher of a given scientific article. We introduce two approaches: a simple classifier based on hand-coded document style features, as well as a CNN-based classifier taking as input the bitmap representation of the first page of the PDF article. We experiment on a dataset of around 100k articles from arXiv, where labels come from the source LaTeX document associated with each article. Results show the CNN approach significantly outperforms the one based on simple document style features, reaching over 90% average F1-score on a task to distinguish among several dozen of the most common document classes.

Exploiting Label Dependencies for Multi-Label Document Classification Using Transformers

We introduce in this paper a new approach to improve deep learning-based architectures for multi-label document classification. Dependencies between labels are an essential factor in the multi-label context. Our proposed strategy takes advantage of the knowledge extracted from label co-occurrences. The proposed method consists of adding a regularization term to the loss function used for training the model, in a way that incorporates the label similarities given by the label co-occurrences, encouraging the model to jointly predict labels that are likely to co-occur and not to predict labels that are rarely present with each other. This allows the neural model to better capture label dependencies. Our approach was evaluated on three datasets: the standard AAPD dataset, a corpus of scientific abstracts; Reuters-21578, a collection of news articles; and a newly proposed multi-label dataset called arXiv-ACM. Our method demonstrates improved performance, setting a new state-of-the-art on all three datasets.
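
One plausible form of such a regularizer is sketched below in PyTorch; the exact term used in the paper may differ. S is assumed to be an (L, L) label-similarity matrix derived from training-set co-occurrence counts, and the penalty pulls the predicted probabilities of frequently co-occurring labels towards each other.

    import torch
    import torch.nn.functional as F

    def cooccurrence_regularized_loss(logits, targets, S, lam=0.1):
        """Multi-label BCE loss plus a co-occurrence-based regularization term."""
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        p = torch.sigmoid(logits)                      # (batch, L) label probabilities
        diff = p.unsqueeze(2) - p.unsqueeze(1)         # (batch, L, L) pairwise differences
        reg = (S.unsqueeze(0) * diff.pow(2)).mean()    # penalise disagreement on similar labels
        return bce + lam * reg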

Character Relationship Mapping in Major Fictional Works Using Text Analysis Methods

Determining the relationships between characters is an important step in analyzing fictional works. Knowing character relationships can be useful when summarizing a work and may also help to determine authorship. In this paper, scores are generated for pairs of characters in fictional works, which can be used to classify whether or not two characters have a relationship. An SVM is used to predict relationships between characters. Character pairs farther from the decision boundary often had stronger relationships than those closer to it. The relative rank of the relationships may serve additional literary and authorship-related purposes.
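
A small illustrative sketch of this setup using scikit-learn; the pairwise features below (co-occurrence, dialogue adjacency, shared scenes) and the toy numbers are assumptions for the example, not the paper's actual features or data:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Hypothetical features per character pair: [co-occurrence count,
    # dialogue adjacency count, shared-scene count]; label 1 = relationship.
    X_train = np.array([[42, 17, 8], [3, 0, 1], [25, 9, 4], [1, 1, 0]])
    y_train = np.array([1, 0, 1, 0])

    svm = LinearSVC().fit(X_train, y_train)
    X_new = np.array([[30, 12, 5], [2, 0, 0]])
    labels = svm.predict(X_new)               # has a relationship or not
    strength = svm.decision_function(X_new)   # distance from the boundary as a rough strength rank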

SESSION: Visual Document Analysis

Session details: Visual Document Analysis

Multi-Task CTC for Joint Handwriting Recognition and Character Bounding Box Prediction

Deep learning-based models continue to push the frontier of OCR and Handwritten Text Recognition (HTR), both in terms of increasing accuracy and minimizing the burden of data annotation. For example, the CTC loss [3] allows for line-level training without the explicit spatial positions of the characters and words. Other recent works have further reduced the annotation effort by training systems on entire paragraphs or pages of text without line-level annotations. While these techniques have greatly increased accuracy and significantly reduced the data collection burden, useful outputs beyond just the Unicode character prediction have been lost. Because these systems have fewer explicit segmentation steps, they do not produce character-level or word-level bounding boxes. As a result, downstream applications such as highlighting, redaction tools, and basic document correction are not directly supported and must be obtained with other methods.

In this work we propose a novel technique to augment existing CTC-based OCR/HTR methods to output other attributes, such as character bounding boxes, without any added inference computation overhead or changes to the neural network. We demonstrate our technique for two types of tasks: fully supervised and self-supervised. We achieve state-of-the-art results on CASIA-HWDB for the fully supervised task of character bounding box prediction. We then explore a self-supervised method for learning character bounding boxes without manually created annotations and demonstrate promising results.

Using YOLO Network for Automatic Processing of Finite Automata Images with Application to Bit-Strings Recognition

The recognition of handwritten diagrams has drawn attention in recent years because of its potential applications in many areas, especially when used for educational purposes. Although there are many online approaches, advances in deep object detector networks have made offline recognition an attractive option, allowing simple inputs such as paper-drawn diagrams. In this paper, we have tested the YOLO network, including its version with fewer parameters, YOLO-Tiny, for the recognition of images of finite automata. This recognition was applied to the development of an application that recognizes bit-strings used as input to the automaton: given an image of a transition diagram, the user enters a sequence of bits and the system analyzes whether the automaton recognizes the sequence or not. Using two finite automata datasets, we have evaluated the detection and recognition of finite automata symbols as well as bit-string processing. With regard to the diagram symbol detection task, experiments on a handwritten finite automata image dataset returned 82.04% and 97.20% for average precision and recall, respectively.
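
Once states, transitions and accepting-state markers have been recovered from the detected diagram symbols, checking whether the automaton recognizes a bit-string reduces to a simple simulation. The snippet below is a generic illustration with a made-up two-state automaton, not one of the diagrams from the datasets used in the paper:

    def accepts(transitions, start, accepting, bits):
        """Simulate a deterministic finite automaton on a bit-string."""
        state = start
        for b in bits:
            state = transitions.get((state, b))
            if state is None:          # missing transition: reject
                return False
        return state in accepting

    # Example: automaton accepting bit-strings with an even number of 1s.
    T = {("q0", "0"): "q0", ("q0", "1"): "q1",
         ("q1", "0"): "q1", ("q1", "1"): "q0"}
    print(accepts(T, "q0", {"q0"}, "1011"))    # False: three 1s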

Layout Analysis of Historic Architectural Program Documents

In this paper, we introduce and make publicly available the CRS Visual Dataset, a new dataset consisting of 7,029 pages of human-annotated and validated scanned archival documents from the field of 20th-century architectural programming; and ArcLayNet, a fine-tuned machine learning model based on the YOLOv6-S object detection architecture. Architectural programming is an essential professional service in the Architecture, Engineering, Construction, and Operations (AECO) Industry, and the documents it produces are powerful instruments of this service. The documents in this dataset are the product of a creative process; they exhibit a variety of sizes, orientations, arrangements, and modes of content, and are underrepresented in current datasets. This paper describes the dataset and narrates an iterative process of quality control in which several deficiencies were identified and addressed to improve the performance of the model. In this process, our key performance indicators, mAP@0.5 and mAP@0.5:0.95, both improved by approximately 10%.

OntG-Bart: Ontology-Infused Clinical Abstractive Summarization

Automating the process of clinical text summarization could save clinicians' reading time and reduce their fatigue, while acknowledging the necessity of keeping human professionals in the loop. This paper addresses clinical text summarization, aiming to incorporate ontology concept relationships into the summarization process via a Graph Neural Network (GNN). Specifically, we propose a model that extends BART's encoder-decoder framework with a GNN encoder and multi-head attention layers in the decoder, producing ontology-aware summaries. The GNN interacts with the textual encoder, influencing their mutual representations. The model's effectiveness is validated on two real-world radiology datasets. We also present an ablation study to elucidate the impact of varied graph configurations and an error analysis aimed at pinpointing potential areas for future improvements.

POSTER SESSION: Poster Lightning Talks

Session details: Poster Lightning Talks

Algorithm Parallelism for Improved Extractive Summarization

While much work on abstractive summarization has been conducted in recent years, including state-of-the-art summarizations from GPT-4, extractive summarization's lossless nature continues to provide advantages, preserving the style and often the key phrases of the original text as meant by the author. Libraries for extractive summarization abound, with a wide range of efficacy. Some do not perform much better, or perform even worse, than random sampling of sentences extracted from the original text. This study breathes new life into classical algorithms by proposing parallelism through an implementation of a second-order meta-algorithm in the form of the Tessellation and Recombination with Expert Decisioner (T&R) pattern, taking advantage of the abundance of already-existing algorithms and dissociating their individual performance from the implementer's biases. The resulting summaries obtained using T&R are better than those of any of the component algorithms.
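
A simplified sketch of how such a second-order combination can work, assuming each base extractor is a callable that nominates sentence indices and the expert decisioner scores a sentence from its text and vote count; the actual T&R pattern is richer than this voting scheme.

    from collections import defaultdict

    def tessellate_and_recombine(sentences, extractors, expert, k=3):
        """Each extractor nominates sentences (tessellation); the expert decisioner
        recombines the nominations into the final extractive summary."""
        votes = defaultdict(int)
        for extract in extractors:
            for idx in extract(sentences):
                votes[idx] += 1
        ranked = sorted(votes, key=lambda i: expert(sentences[i], votes[i]), reverse=True)
        return [sentences[i] for i in sorted(ranked[:k])]   # keep original sentence order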

YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition

Optical Character Recognition (OCR) from document photos taken by cell phones is a challenging task. Most OCR methods require prior binarization of the image, which can be difficult to achieve when documents are captured with various mobile devices in unknown lighting conditions. For example, shadows cast by the camera or the camera holder on a hard copy can jeopardize the binarization process and hinder the next OCR step. In the case of highly uneven illumination, binarization methods using global thresholding simply fail, and state-of-the-art adaptive algorithms often deliver unsatisfactory results. In this paper, we present a new binarization algorithm using two complementary local adaptive passes and taking advantage of the color components to improve results over current image binarization methods. The proposed approach gave remarkable results at the DocEng'22 competition on the binarization of photographed documents.

LYLAA: A Lightweight YOLO based Legend and Axis Analysis method for CHART-Infographics

Chart Data Extraction (CDE) is a complex task in document analysis that involves extracting data from charts to support various applications, such as document mining, medical diagnosis, and accessibility for the visually impaired. CDE is challenging due to the intricate structure and specific semantics of charts, which include elements such as the title, axes, legend, and plot elements. The existing solutions for CDE have not yet satisfactorily addressed these issues. In this paper, we focus on two critical subtasks in CDE, Legend Analysis and Axis Analysis, and present a lightweight YOLO-based method for detection and domain-specific heuristic algorithms (Axis Matching and Legend Matching) for matching. We evaluate the efficacy of our proposed method, LYLAA, on a real-world dataset, the ICPR2022 UB PMC dataset, and observe promising results compared to the competing teams in the ICPR2022 CHART-Infographics competition. Our findings showcase the potential of our proposed method in the CDE process.

Read-Write-Learn: Self-Learning for Handwriting Recognition

Handwriting recognition relies on supervised data for training. Annotations typically include both the written text and the author's identity to facilitate the recognition of a particular style. A large annotation set is required for robust recognition, which is not always available in historical texts and low-annotation languages. To mitigate this challenge, we propose the Read-Write-Learn framework. In this setting, we augment the training process of handwriting recognition with a language model and a handwriting generator. Specifically, in the first reading step, we employ a language model to identify text that is likely detected correctly by the recognition model. Then, in the writing step, we generate more training data in the same writing style. Finally, in the learning step, we use the newly generated data in the same writing style to finetune the recognition model. Our Read-Write-Learn framework allows the recognition model to incrementally converge on the new style. Our experiments on historical handwritten documents demonstrate the benefits of the approach, and we present several examples to showcase improved recognition.

AI-powered Resume-Job matching: A document ranking approach using deep neural networks

This study focuses on the importance of well-designed online matching systems for job seekers and employers. We treat resumes and job descriptions as documents, calculate their similarity to determine the suitability of applicants, and rank a set of resumes based on their similarity to a specific job description. We employ Siamese Neural Networks, comprised of identical sub-network components, to evaluate the semantic similarity between documents. Our novel architecture integrates various neural network components, where each sub-network incorporates multiple layers such as CNN, LSTM and attention layers to capture sequential, local and global patterns within the data. The LSTM and CNN components are applied concurrently and merged, and the resulting output is then fed into a multi-head attention layer. These layers extract features and capture document representations. The extracted features are then combined to form a unified representation of the document. We leverage pre-trained language models to obtain embeddings for each document, which serve as a lower-dimensional representation of our input data. The model is trained on a private dataset of 268,549 real resumes and 4,198 job descriptions from twelve industry sectors, resulting in a ranked list of matched resumes. We performed a comparative analysis involving our model, Siamese CNNs (S-CNNs), Siamese LSTM with Manhattan distance, and a BERT-based sentence transformer model. By combining the power of language models and the novel Siamese architecture, this approach leverages both strengths to improve document ranking accuracy and enhance the matching process between job descriptions and resumes. Our experimental results demonstrate that our model outperforms the other models in terms of performance.
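
A rough PyTorch sketch of how one Siamese branch with concurrent CNN and LSTM components feeding a multi-head attention layer might be wired up; the layer sizes, the merge strategy and the cosine-similarity head are assumptions for illustration rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class BranchEncoder(nn.Module):
        """One Siamese branch: CNN and LSTM applied in parallel, merged,
        then passed through multi-head attention and mean-pooled."""
        def __init__(self, emb_dim=128, hidden=64, heads=4):
            super().__init__()
            self.cnn = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(emb_dim, hidden // 2, batch_first=True, bidirectional=True)
            self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

        def forward(self, x):                    # x: (batch, seq_len, emb_dim) embeddings
            c = torch.relu(self.cnn(x.transpose(1, 2))).transpose(1, 2)
            l, _ = self.lstm(x)
            merged = torch.cat([c, l], dim=-1)   # (batch, seq_len, 2 * hidden)
            a, _ = self.attn(merged, merged, merged)
            return a.mean(dim=1)                 # pooled document representation

    class SiameseMatcher(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = BranchEncoder()       # weights shared across both inputs

        def forward(self, resume_emb, job_emb):
            r, j = self.encoder(resume_emb), self.encoder(job_emb)
            return torch.cosine_similarity(r, j, dim=-1)   # ranking score per pair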

Automatically Labeling Cyber Threat Intelligence reports using Natural Language Processing

Attribution provides valuable intelligence in the face of Advanced Persistent Threat (APT) attacks. By accurately identifying the culprits and actors behind the attacks, we can gain more insights into their motivations, capabilities, and potential future targets. Cyber Threat Intelligence (CTI) reports are relied upon to attribute these attacks effectively. These reports are compiled by security experts and provide valuable information about threat actors and their attacks.

We are interested in building a fully automated APT attribution framework. An essential step in doing so is the automated processing and extraction of information from CTI reports. However, CTI reports are largely unstructured, making extraction and analysis of the information a difficult task.

To begin this work, we introduce a method for automatically highlighting a CTI report with the main threat actor attributed within the report. This is done using a custom Natural Language Processing (NLP) model based on the spaCy library. The study also showcases the performance and effectiveness of the various PDF-to-text Python libraries used in this work. Additionally, to evaluate the effectiveness of our model, we experimented with a dataset of 605 English documents, which were randomly collected from various sources on the internet and manually labeled. Our method achieved an accuracy of 97%. Finally, we discuss the challenges associated with processing these documents automatically and propose some methods for tackling them.
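
As a rough illustration of the labeling step, the snippet below uses spaCy's rule-based EntityRuler as a stand-in for the custom trained model described above; the actor names in the patterns and the sample sentence are made up for the example.

    import spacy
    from collections import Counter

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([{"label": "THREAT_ACTOR", "pattern": "APT29"},
                        {"label": "THREAT_ACTOR", "pattern": "Lazarus Group"}])

    def main_actor(report_text):
        """Return the most frequently mentioned threat actor in a CTI report."""
        doc = nlp(report_text)
        actors = [ent.text for ent in doc.ents if ent.label_ == "THREAT_ACTOR"]
        return Counter(actors).most_common(1)[0][0] if actors else None

    print(main_actor("The campaign was attributed to APT29; APT29 reused its loader."))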

Privacy Now or Never: Large-Scale Extraction and Analysis of Dates in Privacy Policy Text

The General Data Protection Regulation (GDPR) and other recent privacy laws require organizations to post their privacy policies, and place specific expectations on organisations' privacy practices. Privacy policies take the form of documents written in natural language, and one of the expectations placed upon them is that they remain up to date. To investigate legal compliance with this recency requirement at a large scale, we create a novel pipeline that includes crawling, regex-based extraction, candidate date classification and date object creation to extract updated and effective dates from privacy policies written in English. We then analyze patterns in policy dates using four web crawls and find that only about 40% of privacy policies online contain a date, thereby making it difficult to assess their regulatory compliance. We also find that updates in privacy policies are temporally concentrated around passage of laws regulating digital privacy (such as the GDPR), and that more popular domains are more likely to have policy dates as well as more likely to update their policies regularly.
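
A minimal sketch of the regex-based extraction step, assuming the policy has already been converted to plain English text; the patterns and date formats below cover only two common phrasings and are illustrative, not the paper's full pipeline.

    import re
    from datetime import datetime

    DATE_RE = re.compile(
        r"(?:last\s+updated|effective\s+date)[:\s]+"
        r"(\w+\s+\d{1,2},\s+\d{4}|\d{1,2}/\d{1,2}/\d{4})",
        re.IGNORECASE)

    def extract_policy_dates(text):
        """Find candidate updated/effective dates and keep the ones that parse."""
        dates = []
        for raw in DATE_RE.findall(text):
            for fmt in ("%B %d, %Y", "%m/%d/%Y"):
                try:
                    dates.append(datetime.strptime(raw, fmt).date())
                    break
                except ValueError:
                    continue
        return dates

    print(extract_policy_dates("This policy was last updated: May 25, 2018."))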

Tabular Corner Detection in Historical Irish Records

The process of extracting relevant data from historical handwritten documents can be time-consuming and challenging. In Ireland, from 1864 to 1922, government records regarding births, deaths, and marriages were documented by local registrars using printed tabular structures. Leveraging this systematic approach, we employ a neural network capable of segmenting scanned versions of these record documents. We sought to isolate the corner points with the goal of extracting the vital tabular elements and transforming them into consistently structured standalone images. By achieving uniformity in the segmented images, we enable more accurate row and column segmentation, enhancing our ability to isolate and classify individual cell contents effectively. This process must accommodate varying image qualities, different tabular orientations and sizes resulting from diverse scanning procedures, as well as faded and damaged ink lines that naturally occur over time.

SESSION: Applications and User Experiences

Session details: Applications and User Experiences

Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability

Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as the GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability, such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web and find that privacy policy URLs are available on only 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. To enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English-language privacy policies on the web and the distribution of these documents across top-level domains and sectors of commerce. We estimate the lower bound on the number of English-language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
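
For the capture-recapture step, the classical Lincoln-Petersen estimator gives a population estimate from two crawls; the numbers below are made up purely to illustrate the arithmetic and are not figures from the paper.

    def lincoln_petersen(n1, n2, m):
        """n1 policies found in crawl 1, n2 in crawl 2, m found in both crawls."""
        return (n1 * n2) / m

    print(round(lincoln_petersen(600_000, 550_000, 110_000)))   # 3000000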

A PDF Malware Detection Method Using Extremely Small Training Sample Size

Machine learning-based methods for PDF malware detection have grown in popularity because of their high levels of accuracy. However, many well-known ML-based detectors require a large number of specimen features to be collected before making a decision, which can be time-consuming. In this study, we present a novel, distance-based method for detecting PDF malware. Notably, our approach needs significantly less training data than traditional machine learning or neural network models. We evaluated our method using the Contagio dataset and found that it can detect 90.50% of malware samples with only 20 benign PDF files used for model training. To show statistical significance, we report results with a 95% confidence interval (CI). We evaluated our model's performance across multiple metrics, including accuracy, F1 score, precision, and recall, alongside false positive, false negative, true positive, and true negative rates. This paper highlights the feasibility of using distance-based methods for PDF malware detection, even with limited training data, thereby offering a promising direction for future research.
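
Since the abstract does not spell out the distance measure, the sketch below shows a generic version of the idea: profile a handful of benign PDFs by their feature statistics and flag documents far from that profile. The feature names and the threshold are assumptions, not the paper's method.

    import numpy as np

    def fit_benign_profile(benign_features):
        """Fit a simple profile (centroid and spread) from a few benign PDF feature
        vectors, e.g. counts of /JS, /OpenAction, objects, and streams."""
        X = np.asarray(benign_features, dtype=float)
        return X.mean(axis=0), X.std(axis=0) + 1e-9

    def is_malicious(x, centroid, scale, threshold=3.0):
        """Flag a PDF whose scaled distance from the benign centroid is too large."""
        d = np.linalg.norm((np.asarray(x, dtype=float) - centroid) / scale)
        return d > threshold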

Deep-learning for dysgraphia detection in children handwritings

Early identification of dysgraphia in children is crucial for timely intervention and support. Traditional methods, such as the Brave Handwriting Kinder (BHK) test, which relies on manual scoring of handwritten sentences, are both time-consuming and subjective, posing challenges for accurate and efficient diagnosis. In this paper, an approach for dysgraphia detection that leverages smart pens and deep learning techniques is proposed, automatically extracting visual features from children's handwriting samples. To validate the solution, samples of children's handwriting have been gathered and several interviews with domain experts have been conducted. The approach has been compared with an algorithmic version of the BHK test and with the assessments of several elementary school teachers gathered through interviews.

A document format for sewing patterns

Sewing patterns are a form of technical document, requiring expertise to author and understand. Digital patterns are typically produced and sold as PDFs with human-interpretable vector graphics, but there is little consistency or machine-readable metadata in these documents. A custom file format would enable digital pattern manipulation tools to enhance or replace a paper-based workflow.

In this vision paper, basic sewing pattern components and modification processes are introduced, and the limitations of current PDF patterns are highlighted. Next, an XML-based sewing pattern document format is proposed to take advantage of the inherent relationships between different pattern components. Finally, document security and authenticity considerations are discussed.

SESSION: Document Content Analysis

Session details: Document Content Analysis

Addressing the gap between current language models and key-term-based clustering

This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e. document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e. key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive user-guided quality clustering of document collections.

Synchronous Recognition of Music Images Using Coupled N-Gram Models

Handwritten music recognition investigates the use of technologies to automatically transcribe handwritten music pieces that are only available in image format and make them accessible to the general public. Many historical music pieces are composed of a music part and a lyrics part. Handwritten music recognition has focused mainly on transcribing the music elements in historical images, but there exist many pieces where both music and lyrics are present and of relevance. The recognition of music and lyrics is generally carried out as separate tasks. In many historical documents, the two parts are synchronized at line level and loosely at word level. These two elements are strongly related, each one affecting the other. Discovering this relation may be very relevant for improving recognition results in both parts and for further steps such as music analysis, composition analysis, etc. This paper introduces a preliminary system that transcribes synchronously and simultaneously both the music and lyrics elements of handwritten historical music images. The results obtained over a historical manuscript dataset show that this system achieves an improvement of up to 15.4% in symbol rate on stave recognition and an average improvement of approximately 7.6% when the music and lyrics parts are jointly considered.

Technology-Assisted Review for Spreadsheets and Noisy Text

In a large-scale eDiscovery effort, human assessors participated in a technology-assisted review ("TAR") process employing a modified version of Grossman and Cormack's Continuous Active Learning® ("CAL®") tool to review Excel spreadsheets and poor-quality OCR text (defined as a 30-50% Markov error rate). In the legal industry, these documents are typically considered inappropriate for the application of TAR and, consequently, are usually the subject of exhaustive manual review. Our results assuage this concern by showing that a CAL TAR process, using feature engineering techniques adapted from spam filtering, can achieve satisfactory results on Excel spreadsheets and noisy OCR text. Our findings are cause for optimism in the legal industry: adding these document classes to TAR datasets will make large reviews more manageable and less costly.
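
For readers unfamiliar with the protocol, the sketch below shows a generic, much-simplified continuous active learning loop rather than the authors' modified CAL® tool; character n-gram TF-IDF stands in for the spam-filtering-style feature engineering, and review(doc) represents the human assessor's relevance judgment.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def cal_review(docs, review, seed_labels, batch=10, rounds=5):
        """Train on the documents labelled so far, route the highest-scoring
        unreviewed documents to the assessor, retrain, and repeat."""
        vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4), max_features=50000)
        X = vec.fit_transform(docs)
        labels = dict(seed_labels)        # doc index -> 0/1 relevance (needs both classes)
        for _ in range(rounds):
            idx = list(labels)
            clf = LogisticRegression(max_iter=1000).fit(X[idx], [labels[i] for i in idx])
            scores = clf.predict_proba(X)[:, 1]
            pending = [i for i in np.argsort(-scores) if i not in labels][:batch]
            if not pending:
                break
            for i in pending:
                labels[i] = review(docs[i])
        return labels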