In this tutorial, we consider the key algorithms, approaches, and considerations for tagging both unstructured and structured text for downstream use. This includes summarization, in which text information is compressed for more efficient archiving, searching, and clustering. We focus in particular on automatic text summarization, covering the most important milestones of six decades of research in this area.
Visual Text Analytics has been an active area of interdisciplinary research (http://textvis.lnu.se/). This interactive tutorial is designed to give attendees an introduction to the area of information visualization, with a focus on linguistic visualization. After an introduction to the basic principles of information visualization and visual analytics, this tutorial will give an overview of the broad spectrum of linguistic and text visualization techniques, as well as their application areas [3]. This will be followed by a hands-on session that will allow participants to design their own visualizations using tools (e.g., Tableau) or libraries (e.g., d3.js), or by applying sketching techniques [4]. Some sample datasets will be provided by the instructor. Besides general techniques, special access will be provided to use the VisArgue framework [1] for the analysis of selected datasets.
The DChanges series of workshops focuses on changes in all their aspects and applications: algorithms to detect changes, models to describe them and techniques to present them to the users are only some of the topics that are investigated. This year, we would like to focus on collaboration tools for non-textual documents.
The workshop is open to researchers and practitioners from industry and academia. We would like to provide a platform to discuss and explore the state of the art in the field of document changes. One of the goals of this year's edition is to review the outcomes of the last four editions and to develop plans for the future.
There is currently much discussion and research on topics such as open access, alternative publishing models, semantic publishing, peer review, data sharing, reproducible science, etc.; in short, how we can bring scholarly publishing in line with modern technologies and expectations.
In the past, the document engineering community's participation in defining the future directions has been rather limited, which is surprising, as many document-centric issues related to scientific publishing still remain unresolved. Therefore, the main goal of this workshop, which will be held at DocEng 2018, is to stimulate discussion on this topic among experts in the document engineering field and provide a forum for the exchange of ideas.
The second goal of the workshop is more hands-on: for generating the Post-Proceedings, we will be trialling a new workflow based on some of the technologies discussed. The results will be reported to the DocEng Steering Committee and recommendations will be made to future conferences.
The last ten years have witnessed an enormous increase in the application of "deep learning" methods to both spoken and textual natural language processing. Have they helped? With respect to some well-defined tasks such as language modelling and acoustic modelling, the answer is most certainly affirmative, but those are mere components of the real applications that are driving the increasing interest in our field. In many of these real applications, the answer is surprisingly that we cannot be certain because of the shambolic evaluation standards that have been commonplace --- long before the deep learning renaissance --- in the communities that specialized in advancing them.
This talk will consider three examples in detail: sentiment analysis, text-to-speech synthesis, and summarization. We will discuss empirical grounding, the use of inferential statistics alongside the usual, more engineering-oriented pattern recognition techniques, and the use of machine learning in the process of conducting an evaluation itself.
The objective of high-recall information retrieval (HRIR) is to identify substantially all information relevant to an information need, where the consequences of missing or untimely results may have serious legal, policy, health, social, safety, defence, or financial implications. To find acceptance in practice, HRIR technologies must be more effective---and must be shown to be more effective---than current practice, according to the legal, statutory, regulatory, ethical, or professional standards governing the application domain. Such domains include, but are not limited to, electronic discovery in legal proceedings; distinguishing between public and non-public records in the curation of government archives; systematic review for meta-analysis in evidence-based medicine; separating irregularities and intentional misstatements from unintentional errors in accounting restatements; performing "due diligence" in connection with pending mergers, acquisitions, and financing transactions; and surveillance and compliance activities involving massive datasets. HRIR differs from ad hoc information retrieval where the objective is to identify the best, rather than all relevant information, and from classification or categorization where the objective is to separate relevant from non-relevant information based on previously labeled training examples. HRIR is further differentiated from established information retrieval applications by the need to quantify "substantially all relevant information"; an objective for which existing evaluation strategies and measures, such as precision and recall, are not particularly well suited.
This paper proposes an interactive multimedia authoring tool called STEVE (Spatio-Temporal View Editor) and a new multimedia model called SIMM (Simple Interactive Multimedia Model). STEVE aims at allowing users with no knowledge of multimedia authoring languages and models to create hypermedia applications for web and digital TV systems in a user-friendly way. Compared with existing multimedia authoring tools, STEVE is the only tool that allows ordinary users to export hypermedia applications to HTML5 and NCL documents. STEVE uses an event-based temporal synchronization model called SIMM that exactly fits its needs. SIMM provides high-level temporal, spatial and interactivity relations to make authoring with STEVE easier. Usability tests show that, according to users, STEVE allowed them to create multimedia applications and export them as HTML5 and NCL documents in a few minutes without programming.
The text processing tool LATEX has prevailed as a standard in many fields of the exact sciences; it is evident that LATEX is likely to be here to stay. From that perspective, it is important to explore the best possible ways to support the author in efficiently editing documents. There have been several approaches that provide graphical editing support for LATEX. We argue that a true WYSIWYG (What You See Is What You Get) approach is a justified requirement for future systems, and we present here the first cloud-based true WYSIWYG editor. It allows the author to edit the document in its print form directly in a web-based PDF viewer. Building such a system creates unique challenges compared to existing approaches. We identify these challenges and name workable solutions. We also provide a usability evaluation of the new system. In short, our finding is that editing LATEX directly in the PDF view is possible for a wide range of edits and valuable for many major user groups and use cases; hence it is a fair requirement for future top-of-the-line LATEX editors.
This paper describes the BumbAR approach for composing multimedia presentations and evaluates it through a qualitative study based on the Technology Acceptance Model (TAM). The BumbAR proposal is based on the event-condition-action model of the Nested Context Model (NCM) and explores the use of augmented reality and real-world objects (markers) as an innovative user interface to specify the behavior and relationships between the media objects in a presentation. The qualitative study aimed at measuring the users' attitude towards using BumbAR and an augmented reality environment for authoring multimedia presentations. The results show that the participants found the BumbAR approach both useful and easy to use, while most of them (66.67%) found the system more convenient than traditional desktop-based authoring tools.
This paper examines the document aspects of object-based broadcasting. Object-based broadcasting augments traditional video and audio broadcast content with additional (temporally-constrained) media objects. The content of these objects -- as well as their temporal validity -- are determined by the broadcast source, but the actual rendering and placement of these objects can be customized to the needs/constraints of the content viewer(s). The use of object-based broadcasting enables a more tailored end-user experience than the one-size-fits-all of traditional broadcasts: the viewer may be able to selectively turn off overlay graphics (such as statistics) during a sports game, or selectively render them on a secondary device. Object-based broadcasting also holds the potential for supporting presentation adaptivity for accessibility or for device heterogeneity.
From a technology perspective, object-based broadcasting resembles a traditional IP media stream, accompanied by a structured multimedia document that contains timed rendering instructions. Unfortunately, the use of object-based broadcasting is severely limited because of the problems it poses for the traditional television production workflow (and in particular, for use in live television production). The traditional workflow places graphics, effects and replays as immutable components in the main audio/video feed originating from, for example, a production truck outside a sports stadium. This single feed is then delivered near-live to the homes of all viewers. In order to effectively support dynamic object-based broadcasting, the production workflow will need to retain a familiar creative interface to the production staff, but also allow the insertion and delivery of a differentiated set of objects for selective use at the receiving end.
In this paper we present a model and implementation of a dynamic system for supporting object-based broadcasting in the context of a motor sport application. We define a new multimedia document format that supports dynamic modifications during playback; this allows editing decisions by the producer to be activated by agents at the receiving end of the content. We describe a prototype system to allow playback of these broadcasts and a production system that allows live object-based control within the production workflow. We conclude with an evaluation of a trial using near-live deployment of the environment, using content from our partners, in a sport environment.
PDF is the established format for the exchange of final-form print-oriented documents on the Web, and for a good reason: it is the only format that guarantees the preservation of layout across different platforms, systems and viewing devices. Its main disadvantage, however, is that a document, once converted to PDF, is very difficult to edit. As of today (2018), there is still no universal format for the exchange of editable formatted text documents on the Web; users can only exchange the application's source files, which do not benefit from the robustness and portability of PDF.
This position paper describes how we can engineer such an editable format based on some of the principles of PDF. We begin by analysing the current status quo, and provide a summary of current approaches for editing existing PDFs, other relevant document formats, and ways to embed the document's structure into the PDF itself. We then ask ourselves what it really means for a formatted document to be editable, and discuss the related problem of enabling WYSIWYG direct manipulation even in cases where layout is usually computed or optimized using offline or batch methods (as is common with long-form documents).
After defining our goals, we propose a framework for creating such editable portable documents and present a prototype tool that demonstrates our initial steps and serves as a proof of concept. We conclude by providing a roadmap for future work.
In this paper, we propose a high-recall active document retrieval system for a class of applications involving query documents, as opposed to key terms, and domain-specific document corpora. The output of the model is a list of documents retrieved based on the domain expert feedback collected during training. A modified version of the Bag of Words (BoW) representation and a semantic ranking module, based on Google n-grams, are used in the model. The core of the system is a binary document classification model which is trained through a continuous active learning strategy. In general, finding or constructing training data for this type of problem is very difficult due to either the confidentiality of the data or the need for domain expert time to label data. Our experimental results on the retrieval of Calls for Papers based on a manuscript demonstrate the efficacy of the system for this application and its performance compared to other candidate models.
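The continuous active learning strategy described above can be illustrated with a small sketch: a classifier is retrained after each batch of expert relevance judgments and the highest-scoring unreviewed documents are queued for the next round. This is a minimal, assumed workflow in Python (scikit-learn), not the authors' implementation; `label_fn` is a hypothetical stand-in for the domain expert, and the semantic n-gram ranking module is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

def continuous_active_learning(query_doc, corpus, label_fn, batch=10, rounds=20):
    """Rank `corpus` by relevance to `query_doc`; `label_fn` stands in for the expert (returns 0/1)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([query_doc] + corpus)
    query, docs = X[0], X[1:]

    labels = {}  # doc index -> 0/1 relevance judgment from the expert
    # seed review batch: documents most similar to the query document
    seed_scores = np.asarray((docs @ query.T).todense()).ravel()
    for i in seed_scores.argsort()[::-1][:batch]:
        labels[i] = label_fn(corpus[i])

    scores = seed_scores
    for _ in range(rounds):
        if len(set(labels.values())) < 2:   # need both classes before training
            break
        idx = list(labels)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(docs[idx], [labels[i] for i in idx])
        scores = clf.predict_proba(docs)[:, 1]
        # ask the expert about the highest-scoring documents not yet reviewed
        candidates = [i for i in scores.argsort()[::-1] if i not in labels][:batch]
        if not candidates:
            break
        for i in candidates:
            labels[i] = label_fn(corpus[i])

    return scores.argsort()[::-1]   # final ranking over the whole corpus
```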
Document engineering employs practices of modeling and representation. Enactment of these practices relies on shared metaphors. However, choices driven by metaphor often receive less attention than those driven by factors critical to developing working systems, such as performance and usability. One way to remedy this issue is to take a historical approach, studying cases without a guiding concern for their ongoing development and maintenance. In this paper, we compare two historical case studies of "failed" designs for hypertext on the Web. The first case is netomat (1999), a Web browser created by the artist Maciej Wisniewski, which responded to search queries with dynamic multimedia streams culled from across the Web and structured by a custom markup language. The second is the XML Linking Language (XLink), a W3C standard to express hypertext links within and between XML documents. Our analysis focuses on the relationship between the metaphors used to make sense of Web documents and the hypermedia structures they compose. The metaphors offered by netomat and XLink stand as alternatives to metaphors of the "page" or the "app." Our intent here is not to argue that any of these metaphors are superior, but to consider how designers' and engineers' metaphorical choices are situated within a complex of already existing factors shaping Web technology and practice. The results provide insight into underexplored interconnections between art and document engineering at a critical moment in the history of the Web, and demonstrate the value for designers and engineers of studying "paths not taken" during the history of the technologies we work on today.
Any organization that works with documents eventually faces the tedious task of classifying large volumes of them, which is the first step towards information retrieval and data mining. Classifying such a large number of documents into multiple classes requires considerable time and labor, so a system that can classify these documents with acceptable accuracy would be of great help in document engineering. We have created multiple classifiers for document classification and compared their accuracy on raw and processed data. We have gathered data used in a corporate organization as well as publicly available data for comparison. The data is processed by removing stop-words, and stemming is applied to produce root words. Multiple traditional machine learning techniques, namely Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest, and Multi-Layer Perceptron, are used for the classification of documents. The classifiers are applied to raw and processed data separately and their accuracy is noted. In addition, a deep learning technique, the Convolutional Neural Network, is used to classify the data and its accuracy is compared with that of the traditional machine learning techniques. We are also exploring hierarchical classifiers for the classification of classes and subclasses. The system classifies the data faster and with better accuracy than manual classification. The results are discussed in the results and evaluation section.
We present some results from a joint project between HP Labs, Cardiff University and Dyfed Powys Police on predictive policing. Applications of various techniques from recommender systems and text mining to the problem of crime pattern recognition are demonstrated. Our main idea is to treat crime records for different regions and time periods as a corpus of text documents whose words are crime types. We apply tools from NLP and text document classification to analyse different regions in time and space. We evaluate the performance of several text similarity measures and document clustering algorithms.
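As an illustration of the core idea only, and not the project's actual code, the following sketch treats each region/time-period as a "document" whose "words" are crime types and then applies ordinary text tools (TF-IDF, cosine similarity, hierarchical clustering). The input data and names are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
import numpy as np

# hypothetical input: one space-separated string of crime types per region/month
region_docs = {
    "region_A_2017_01": "burglary burglary vehicle_theft assault",
    "region_B_2017_01": "shoplifting shoplifting criminal_damage",
    "region_A_2017_02": "burglary vehicle_theft vehicle_theft",
}

names = list(region_docs)
tfidf = TfidfVectorizer(token_pattern=r"\S+")      # crime types play the role of words
X = tfidf.fit_transform(region_docs.values())

dist = 1.0 - cosine_similarity(X)                  # region/period distance matrix
np.fill_diagonal(dist, 0.0)
clusters = fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=2, criterion="maxclust")
for name, c in zip(names, clusters):
    print(name, "-> cluster", c)
```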
We present a novel marketing method for consumer trend detection from online user-generated content, motivated by a gap identified in the market research literature. Existing approaches for trend analysis are generally based on the rating of trends by industry experts through survey questionnaires, interviews, or similar instruments. These methods have proved to be inherently costly and often suffer from bias. Our approach is based on the use of information extraction techniques to identify trends in large aggregations of social media data. It is a cost-effective method that reduces the possibility of errors associated with the design of the sample and the research instrument. The effectiveness of the approach is demonstrated in an experiment performed on restaurant review data. The accuracy of the results is at the level of current approaches for both information extraction and market research.
Combining text and mathematics when searching in a corpus with extensive mathematical notation remains an open problem. Recent results for Tangent-3 on the math and text retrieval task at NTCIR-12, for example, have room for improvement, even though formula retrieval appeared to be fairly successful.
This paper explores how to adapt the state-of-the-art BM25 text ranking method to work well when searching for math together with text. Following the approach proposed for the Tangent math search system, we use symbol layout trees to represent math formulae. We extract features from the symbol layout trees to serve as search terms to be ranked using BM25 and then explore the effects on retrieval performance of various classes of features. Based on the results, we recommend which features can be used effectively in a conventional text-based retrieval engine. We validate our overall approach using an NTCIR-12 math and text benchmark.
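A minimal sketch of the general approach, with simplified assumptions: each formula's symbol layout tree is flattened into symbol and symbol-pair tokens that are ranked by a self-contained BM25 implementation alongside ordinary text terms. The tree encoding and feature names are illustrative, not the exact Tangent feature set.

```python
import math
from collections import Counter

def layout_tree_features(tree):
    """tree = (symbol, [(relation, subtree), ...]); emit symbol and symbol-pair tokens."""
    sym, children = tree
    feats = [sym]
    for rel, child in children:
        feats.append(f"{sym}-{rel}-{child[0]}")       # e.g. "x-superscript-2"
        feats.extend(layout_tree_features(child))
    return feats

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 over token lists; docs is a list of token lists."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# toy usage: "x^2 + y" as a layout tree, indexed together with surrounding text terms
formula = ("plus", [("arg", ("x", [("superscript", ("2", []))])), ("arg", ("y", []))])
doc = layout_tree_features(formula) + ["quadratic", "identity"]
query = layout_tree_features(("x", [("superscript", ("2", []))]))
print(bm25_scores(query, [doc]))
```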
Millions of customer reviews for products are available online across hundreds of different websites. These reviews have a tremendous influence on the purchase decisions of new customers and on creating a positive brand image. Understanding which product issues are critical in determining product ratings is crucial for marketing teams. We have developed a solution that derives deep insights from customer reviews, going significantly beyond keyword-based analysis. Our solution can identify key customer issues voiced in the reviews and the impact of each of these on the final rating that a customer gives the product. This insight is very actionable, as it helps identify which customer concerns are responsible for bad ratings of products.
This paper presents two different prediction models for Mathematical Expression Constraints (ME-Con) in technical publications. Based on the assumption of independent probability distributions, two types of features are used for the analysis: FS, based on the ME symbols, and FW, based on the words adjacent to MEs. The first prediction model is based on an iterative greedy scheme aiming to optimize the performance goal. The second scheme is based on naïve Bayesian inference over the two different feature types, considering the likelihood of the training data. The first model achieved an average F1 score of 69.5% (based on tests made on an Elsevier dataset). The second prediction model using FS achieved an F1 score of 82.4% and an accuracy of 81.8%. It achieved similar yet slightly higher F1 scores than the first model for the word stems of FW, but a slightly lower F1 score for the Part-of-Speech (POS) tags of FW.
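The second scheme's general shape can be sketched as a naïve Bayes model that treats the two feature types as conditionally independent. The feature extraction, example data and smoothing below are placeholders, not the paper's actual configuration.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (fs_tokens, fw_tokens, label) with label in {0, 1}."""
    prior = Counter(lbl for _, _, lbl in examples)
    counts = {ft: defaultdict(Counter) for ft in ("FS", "FW")}
    for fs, fw, lbl in examples:
        counts["FS"][lbl].update(fs)
        counts["FW"][lbl].update(fw)
    return prior, counts

def predict_nb(fs, fw, prior, counts, alpha=1.0):
    best, best_lp = None, -math.inf
    for lbl in prior:
        lp = math.log(prior[lbl] / sum(prior.values()))
        for ft, toks in (("FS", fs), ("FW", fw)):     # independence across feature types
            c = counts[ft][lbl]
            total = sum(c.values())
            vocab = len({t for l in counts[ft] for t in counts[ft][l]})
            for t in toks:                            # Laplace-smoothed log-likelihood
                lp += math.log((c[t] + alpha) / (total + alpha * vocab))
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

examples = [(["\\sum", "="], ["total", "over"], 1),
            (["x", "+"], ["where", "denotes"], 0)]
prior, counts = train_nb(examples)
print(predict_nb(["\\sum"], ["total"], prior, counts))
```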
Uncontrolled variants and duplicate content are ongoing problems in component content management; they decrease the overall reuse of content components. Similarity analyses can help to clean up existing databases and identify problematic texts; however, the large amount of data and the intentional variants in technical texts make this a challenging task.
We tackle this problem by using an efficient cosine similarity algorithm which leverages semantic information from XML-based information models. To verify our approach we built a browser-based prototype which can identify intentional variants by weighting semantic text properties with high performance. The prototype was successfully deployed in an industry project with a large-scale content corpus.
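A minimal sketch of the idea, assuming the semantic information from the XML-based information model can be expressed as per-element term weights that feed a standard cosine comparison; the element names and weights are illustrative only, not the prototype's actual weighting scheme.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

ELEMENT_WEIGHTS = {"title": 3.0, "shortdesc": 2.0, "p": 1.0}   # assumed semantic weights

def weighted_vector(xml_text):
    """Term-frequency vector where each term is weighted by its enclosing element."""
    vec = Counter()
    for elem in ET.fromstring(xml_text).iter():
        w = ELEMENT_WEIGHTS.get(elem.tag, 1.0)
        for term in (elem.text or "").lower().split():
            vec[term] += w
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc1 = "<topic><title>Oil filter replacement</title><p>Remove the oil filter.</p></topic>"
doc2 = "<topic><title>Filter replacement</title><p>Remove and replace the filter.</p></topic>"
print(cosine(weighted_vector(doc1), weighted_vector(doc2)))
```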
We introduce a system to automatically manage photocopies made from copyrighted printed materials. The system monitors photocopiers to detect the copying of pages from copyrighted publications. Such activity is tallied for billing purposes. Access rights to the materials can be verified to prevent printing. Digital images of the copied pages are checked against a database of copyrighted pages. To preserve the privacy of the copying of non-copyrighted materials, only digital fingerprints are submitted to the image matching service. A problem with such systems is the creation of the database of copyrighted pages. To facilitate this, our system maintains statistics on clusters of similar unknown page images along with their copy sequence. Once such a cluster has grown to a sufficient size, a human inspector can determine whether those page sequences are copyrighted. The system has been tested with hundreds of thousands of pages from conference proceedings and with millions of randomly generated pages. Retrieval accuracy has been around 99%, even with copies of copies or double-page copies.
N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
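An illustrative sketch of the hash-gram idea (not the authors' implementation): rather than counting every distinct n-gram, n-grams are hashed into a fixed-size table of counters and the top-k buckets serve as features, trading exactness for speed and bounded memory. The data and parameters below are toy values.

```python
import zlib
import numpy as np

def hashgram_topk(byte_strings, n=4, table_size=2**20, k=1000):
    """Count byte n-grams into a fixed-size hash table and return the top-k bucket ids."""
    table = np.zeros(table_size, dtype=np.int64)
    for data in byte_strings:
        for i in range(len(data) - n + 1):
            table[zlib.crc32(data[i:i + n]) % table_size] += 1   # bucket, not exact n-gram
    return {int(b) for b in np.argsort(table)[-k:]}              # most frequent buckets

def featurize(data, buckets, n=4, table_size=2**20):
    """Feature vector = counts of n-grams falling into the selected buckets."""
    index = {b: j for j, b in enumerate(sorted(buckets))}
    feats = np.zeros(len(buckets))
    for i in range(len(data) - n + 1):
        h = zlib.crc32(data[i:i + n]) % table_size
        if h in index:
            feats[index[h]] += 1
    return feats

corpus = [b"MZ\x90\x00 binary blob", b"another binary blob MZ\x90\x00"]
buckets = hashgram_topk(corpus, k=8)
print(featurize(corpus[0], buckets))
```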
Engineering design makes use of freehand sketches to communicate ideas, allowing designers to externalise form concepts quickly and naturally. Such sketches serve as working documents which demonstrate the evolution of the design process. For the product design to progress, however, these sketches are often redrawn using computer-aided design tools to obtain virtual, interactive prototypes of the design. Although there are commercial software packages which extract the required information from freehand sketches, such packages typically do not handle the complexity of the sketched drawings, particularly the visual cues that are introduced to the sketch to aid the human observer in interpreting it. In this paper, we tackle one such complexity, namely the use of shading and shadows, which help portray spatial and depth information in the sketch. To this end, we propose a vectorisation algorithm based on trainable COSFIRE filters for the detection of junction points and the subsequent tracing of line paths to create a topology graph as a representation of the sketched object form. The vectorisation algorithm is evaluated on 17 sketches containing different shading patterns, drawn by different sketchers specifically for this work. Using these sketches, we show that the vectorisation algorithm can handle drawings with straight or curved contours containing shadow cues, reducing the salient point error at junction locations by 91% compared with the off-the-shelf Harris-Stephens corner detector, while the overall vectorial representation of the sketch achieved an average F-score of 0.92 against the ground truth. These results demonstrate the effectiveness of the proposed approach.
End-to-end Optical Character Recognition (OCR) systems are heavily used to convert document images into machine-readable text. Commercial and open-source OCR systems (such as Abbyy, OCRopus, and Tesseract) have traditionally been optimized for contemporary documents like books, letters, memos, and other end-user documents. However, these systems perform far less well on historical document images, which contain degradations like non-uniform shading, bleed-through, and irregular layout; such degradations usually do not exist in contemporary document images.
The open-source anyOCR is an end-to-end OCR pipeline that contains state-of-the-art techniques required for digitizing degraded historical archives with high accuracy. However, high accuracy comes at the cost of high computational complexity, which results in 1) long runtimes that limit the digitization of big collections of historical archives and 2) high energy consumption, which is the most critical limiting factor for portable devices with a constrained energy budget. Therefore, we are targeting energy-efficient and high-throughput acceleration of the anyOCR pipeline. General-purpose computing platforms fail to meet these requirements, which makes custom hardware design mandatory. In this paper, we present a new concept named iDocChip. It is a portable hybrid hardware-software FPGA-based accelerator characterized by a small footprint, high power efficiency that allows its use in portable devices, and high throughput that makes it possible to process big collections of historical archives in real time without affecting accuracy.
In this paper, we focus on binarization, which is the second most critical step in the anyOCR pipeline after the text-line recognizer that we presented in a previous publication [21]. The anyOCR system makes use of a Percentile-Based Binarization (PBB) method that is suitable for overcoming degradations like non-uniform shading and bleed-through. To the best of our knowledge, we propose the first hardware architecture for the PBB technique. Based on the new architecture, we present a hybrid hardware-software FPGA-based accelerator that outperforms the existing anyOCR software implementation running on an i7-4790T in terms of runtime by a factor of 21, while achieving an energy efficiency of 10 images/J, which is higher than that achieved by low-power embedded processors, with negligible loss of recognition accuracy.
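As a software illustration of percentile-based binarization (the paper's contribution is, of course, the hardware architecture), the sketch below thresholds each pixel against local percentiles of its neighbourhood, which copes with non-uniform shading better than a single global threshold. The window size, percentiles and threshold formula are assumptions, not necessarily the exact PBB formulation used by anyOCR.

```python
import numpy as np
from scipy.ndimage import percentile_filter

def percentile_binarize(gray, window=51, low_pct=10, high_pct=90, k=0.5):
    """Binarize a grayscale image using per-pixel thresholds from local percentiles."""
    gray = gray.astype(np.float32)
    lo = percentile_filter(gray, low_pct, size=window)    # local "ink" estimate
    hi = percentile_filter(gray, high_pct, size=window)   # local "background" estimate
    threshold = lo + k * (hi - lo)                        # per-pixel threshold
    return (gray > threshold).astype(np.uint8) * 255      # 255 = background (white)

# usage with a synthetic shaded page: dark text strokes on a brightness gradient
page = np.tile(np.linspace(120, 250, 200, dtype=np.float32), (100, 1))
page[40:60, 50:150] -= 80                                  # fake text strokes
binary = percentile_binarize(page)
```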
Musical notation is a means of passing on performance instructions to others with fidelity. Composers, however, often introduced embellishments to the music they performed, notating these embellishments with symbols next to the relevant notes. In time, these symbols, known as ornaments, and their interpretation became standardized, such that there are acceptable ways of interpreting an ornament. Although music books may contain footnotes which express the ornament in full notation, these remain cumbersome to read. Ideally, a music student would have the possibility of selecting ornamented notes and expressing them as full notation. The student should also have the possibility to collapse the expressed ornament back to its symbolic representation, giving the student the possibility of also becoming familiar with playing from the ornamented score. In this paper, we propose a complete pipeline that achieves this goal. We compare the use of COSFIRE and template matching for optical music recognition to identify and extract musical content from the score. We then express the score using MusicXML and design a simple user interface which allows the user to select ornamented notes, view their expressed notation and decide whether they want to retain the expressed notation, modify it, or revert to the symbolic representation of the ornament. The performance results that we achieve indicate the effectiveness of our proposed approach.
Documents often do not exist in isolation but are implicitly or explicitly linked to parts of other documents. However, due to a multitude of proprietary document formats with rather simple link models, today's possibilities for creating hyperlinks between snippets of information in different document formats are limited. In previous work, we presented a dynamically extensible cross-document link service that overcomes the limitations of the simple link models supported by most existing document formats. Based on a plug-in mechanism, our link service enables linking across different document types. In this paper, we assess the extensibility of our link service by integrating a number of document formats as well as third-party document viewers. We illustrate the flexibility of creating advanced hyperlinks across these document formats and viewers that cannot be realised with existing linking solutions or the link models of existing document formats. A user study further investigates the user experience when creating and navigating cross-document hyperlinks.
Structural analysis is a text analysis technique that helps uncover the association and opposition relationships between the terms of a text. It is used in particular in the humanities and social sciences. The technique is usually applied by hand, with pen and paper as support. However, as any combination of words in the raw text may be considered an association or opposition relationship, applying the technique by hand in a readable way can quickly prove overwhelming for the analyst. In this paper, we propose Evoq, an application that provides support to structural analysts in their work. Furthermore, we present interactive visualizations representing the relationships between terms. These visualizations help create alternative representations of text, as advocated by structural analysts. We conducted two usability evaluations that showed great potential for Evoq as a structural analysis support tool and for the use of alternative representations of texts in the analysis.
The E-marketplace is a common place where entities situated in different contexts conduct business electronically. Since sellers and buyers may be located in areas with different languages, customs and even business standards, business documents may be heterogeneously edited and parsed in different contexts. However, so far, no satisfactory approaches have been implemented to transfer a document from one context to another without generating ambiguity, and disputes may arise due to different interpretations of the same document. Thus, it is important to guarantee consistent understanding among different contexts. This paper proposes a cross-context semantic document exchange approach, the Tabdoc approach, as a novel strategy for implementing semantic interoperability. It guarantees consistent business document understanding and realizes automatic cross-context document processing. The experimental results demonstrate promising performance improvements over state-of-the-art methods.
Document editing has migrated in the last decade from a mostly individual activity to a shared activity among multiple persons. The World Wide Web and other communication means have contributed to this evolution. However, collaboration via the web has shown a tendency to centralize information, making it accessible to subsequent uses and abuses, such as surveillance, marketing, and data theft.
Traditionally, access control policies have been enforced by a central authority, usually the server hosting the content, a single point of failure. We describe a novel scheme for collaborative editing in which clients enforce access control through the use of strong encryption. Encryption keys are distributed as the portion of a URI which is not shared with the server, enabling users to adopt a variety of document security workflows. This system separates access to the information ("the key") from the responsibility of hosting the content ("the carrier of the vault"), allowing privacy-conscious editors to enjoy a modern collaborative editing experience without relaxing their requirements.
The paper presents CryptPad, an open-source reference implementation featuring a variety of editors that employ the described access control methodology. We detail approaches for implementing a variety of features required for user productivity in a manner that satisfies user-defined privacy concerns.
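A minimal sketch of the access-control idea, assuming the key can be carried in the URI fragment, which browsers do not transmit to the server. It uses the Python `cryptography` package's Fernet recipe purely for illustration; CryptPad itself is a JavaScript application with its own key-derivation and sharing scheme, and the storage helpers here are hypothetical in-memory stand-ins.

```python
from urllib.parse import urlsplit
from cryptography.fernet import Fernet

_SERVER_STORE = {}                                   # in-memory stand-in for the server

def store_on_server(ciphertext):
    doc_id = str(len(_SERVER_STORE))                 # the server only ever sees ciphertext
    _SERVER_STORE[doc_id] = ciphertext
    return doc_id

def fetch_from_server(doc_id):
    return _SERVER_STORE[doc_id]

def create_pad(server_url, plaintext):
    key = Fernet.generate_key()                      # generated client-side, never uploaded
    doc_id = store_on_server(Fernet(key).encrypt(plaintext.encode()))
    # the fragment (after '#') is not transmitted in HTTP requests to the server
    return f"{server_url}/pad/{doc_id}#{key.decode()}"

def open_pad(url):
    parts = urlsplit(url)
    key = parts.fragment.encode()                    # recovered locally from the URI
    doc_id = parts.path.rsplit("/", 1)[-1]
    return Fernet(key).decrypt(fetch_from_server(doc_id)).decode()

url = create_pad("https://pad.example.org", "meeting notes: agree on release date")
print(open_pad(url))
```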
There are several domains in which documents are made of reusable pieces. Template languages have been widely studied by the document engineering community to deal with common structures and textual fragments. However, templating mechanisms are often hidden in mainstream word processors and even unknown to common users.
This paper presents a pattern-based language for templates, serialized in HTML and exploited in a user-friendly WYSIWYG editor for writing technical documentation. We discuss the deployment of the editor by an engineering company in the railway domain, as well as some generalized lessons learned about templates.
We present ARCHANGEL, a decentralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional reputation --- trust at face value in a centralised authority, like a national government archive or university. ARCHANGEL proposes a shift to a technological underscoring of that trust, using distributed ledger technology (DLT) to cryptographically guarantee the provenance, immutability and thus the integrity of archived documents. We describe the ARCHANGEL architecture, and report on a prototype of that architecture built over the Ethereum infrastructure. We report early evaluation and feedback on ARCHANGEL from stakeholders in the research data archives space.
Scholarship in the humanities often requires the ability to search curated electronic corpora and to display search results in a variety of formats. Challenges that need to be addressed include transforming the texts into a suitable form, typically XML, and catering to the scholars' search and display needs. We describe our experience in creating such a search and display facility.
Although web search remains an active research area, interest in enterprise search has not kept up with the information requirements of the contemporary workforce. To address these issues, this research aims to develop, implement, and study the query expansion techniques most effective at improving relevancy in enterprise search. The case-study instrument was a custom Apache Solr-based search application deployed at a medium-sized manufacturing company. It was hypothesized that a composition of techniques tailored to enterprise content and information needs would prove effective in increasing relevancy evaluation scores. Query expansion techniques leveraging entity recognition, alphanumeric term identification, and intent classification were implemented and studied using real enterprise content and query logs. They were evaluated against a set of test queries derived from relevance survey results using standard relevancy metrics such as normalized discounted cumulative gain (nDCG). Each of these modules produced meaningful and statistically significant improvements in relevancy.
Commutative Replicated Data Types (CRDTs) are an emerging tool for real-time collaborative editing. Existing work on CRDTs mostly focuses on documents as a list of text content, but large documents (having over 7,000 pages) with complex sectional structure need higher-level organization. We introduce the Causal Graph, which extends the Causal Tree CRDT into a graph of nodes and transitions to represent ordered trees. This data structure is useful in driving document outlines for large collaborative documents, resolving structures with over 100,000 sections in less than a second.
The simple question "Was this review helpful to you?" increases an estimated $2.7B revenue to Amazon.com annually 1. In this paper, we propose a solution to the problem of electronic product review accumulation using helpfulness prediction. The popularity of e-commerce and online retailers such as Amazon, eBay, Yelp, and TripAdvisor are largely relying on the presence of product reviews to attract more customers. The major issue for the user submitted reviews is to quantify and evaluate the actual effectiveness by combining all the reviews under a particular product. With the varying size of reviews for each product, it is quite cumbersome for the customers to get hold of the overall helpfulness.Therefore, we propose a feature extraction technique that can quantify and measure helpfulness for each product based on user submitted reviews.
Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a novel and effective layout analysis algorithm for main content detection in HTML journal articles. The algorithm first segments a web page based on rendered line breaks, then based on its column structure, and finally identifies the column that contains the most paragraph text. On a test set of 359 manually labeled HTML journal articles, the proposed layout analysis algorithm was found to significantly outperform an alternative semantic markup algorithm based on HTML 5 semantic tags. The precision, recall, and F-score of the layout analysis algorithm were measured to be 0.96, 0.99, and 0.98 respectively.
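The column-selection step can be sketched roughly as follows, under the assumption that rendered text blocks with x-coordinates and text content are already available; the `Block` structure, overlap threshold and grouping heuristic are illustrative, not the paper's exact algorithm (which first segments on rendered line breaks).

```python
from dataclasses import dataclass

@dataclass
class Block:
    x0: float
    x1: float
    text: str

def group_into_columns(blocks, min_overlap=0.5):
    """Group rendered text blocks into columns by horizontal overlap."""
    columns = []                                    # each column is a list of blocks
    for b in sorted(blocks, key=lambda b: b.x0):
        for col in columns:
            c0, c1 = min(x.x0 for x in col), max(x.x1 for x in col)
            overlap = min(b.x1, c1) - max(b.x0, c0)
            if overlap > min_overlap * (b.x1 - b.x0):
                col.append(b)
                break
        else:
            columns.append([b])
    return columns

def main_content(blocks):
    """Return the text of the column containing the most paragraph text."""
    columns = group_into_columns(blocks)
    best = max(columns, key=lambda col: sum(len(b.text) for b in col))
    return " ".join(b.text for b in best)

blocks = [Block(0, 180, "Navigation About Issues"),
          Block(200, 700, "The article body contains long paragraph text. " * 5),
          Block(200, 700, "More paragraph text in the same column. " * 5)]
print(main_content(blocks)[:60])
```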
SlideDiff is a system that automatically creates an animated rendering of textual and media differences between two versions of a slide presentation. While previous work focused on either textual or image data, SlideDiff integrates both text and media changes, as well as their interactions, for example when adding an image forces nearby text boxes to shrink. Given two versions of a slide (not the full history of edits), SlideDiff detects the textual and image differences, and then animates the changes by mimicking what a user might have done, such as moving the cursor, typing text, resizing image boxes, adding images. This editing metaphor is well known to most users, helping them better understand what has changed, and fosters a sense of connection between remote workers, derived from communicating both the revision process as well as its results. After detection of text and image differences, the animations are rendered in HTML and CSS, including mouse cursor motion, text and image box selection and resizing, text deletion and insertion with its cursor. We discuss strategies for animating changes, in particular the importance of starting with large changes and finishing with smaller edits, and provide details of the implementation using modern HTML and CSS.
diffi (diff improved) is a comparison tool whose primary goal is to describe the differences between the content of two documents regardless of their formats.
diffi examines the stacks of abstraction levels of the two documents to be compared, finds which levels can be compared, selects one or more appropriate comparison algorithms and calculates the delta(s) between the two documents. Finally, the deltas are serialized using the extended unified patch format, an extension of the common unified patch format.
The produced deltas describe the differences between all the comparable levels of the input documents. Users and developers of patch visualization tools thus have the choice to focus on their preferred level of abstraction.
This work examines document clustering as a record linkage problem, focusing on named-entities and frequent terms, using several vector and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI Record Linkage toolkit is employed for most of the record linkage pipeline tasks (i.e. preprocessing, scalable feature representation, blocking and clustering) and the OpenCalais platform for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates the suitability of both the record linkage formulation and the JedAI toolkit for improving the scalability for large-scale document clustering tasks.
Automatic document segmentation is getting more and more attention in the natural language processing field. The problem is defined as the division of text into lexically coherent fragments. In fact, most realistic documents are not homogeneous, so extracting the underlying structure may increase the performance of various algorithms in problems like topic recognition, document summarization, or document categorization. At the same time, recent advances in word embedding procedures have accelerated the development of various text mining methods. Models such as word2vec or GloVe allow for efficiently learning a representation of large textual datasets and thus introduce more robust measures of word similarity. This study proposes a new document segmentation algorithm combining the idea of an embedding-based measure of the relation between words with the Helmholtz Principle for text mining. We compare two of the most common word embedding models and show the improvement of our approach on a benchmark dataset.
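A simplified, TextTiling-style sketch of the embedding half of the idea (the paper's actual method combines embedding-based similarity with the Helmholtz Principle): adjacent sentence windows are compared via averaged word vectors and boundaries are proposed at local similarity minima. The `embeddings` argument is a hypothetical {word: vector} mapping, e.g. loaded from word2vec or GloVe.

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim=100):
    """Average the word vectors of a sentence; zero vector if no word is known."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def segment(sentences, embeddings, window=3, depth=0.1):
    """Return indices of proposed segment boundaries between sentences."""
    vecs = [sentence_vector(s, embeddings) for s in sentences]
    sims = []
    for i in range(window, len(sentences) - window + 1):
        left = np.mean(vecs[i - window:i], axis=0)
        right = np.mean(vecs[i:i + window], axis=0)
        denom = np.linalg.norm(left) * np.linalg.norm(right)
        sims.append(float(left @ right / denom) if denom else 0.0)
    # boundaries at local minima that dip at least `depth` below both neighbours
    return [i + window for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] - depth and sims[i] < sims[i + 1] - depth]
```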
Finding concepts considering their meaning and semantic relations in a document corpus is an important and challenging task. In this paper, we present our contributions on how to understand unstructured data present in one or multiple documents. Generally, the current literature concentrates efforts in structuring knowledge by identifying semantic entities in the data. In this paper, we test our hypothesis that hyperknowledge specifications are capable of defining rich relations among documents and extracted facts. The main evidence supporting this hypothesis is the fact that hyperknowledge was built on top of hypermedia fundamentals, easing the specification of rich relationships between different multimodal components (i.e. multimedia content and knowledge entities). The key challenge tackled in this paper is how to structure and correlate these components considering their meaning and semantic relations.
This paper introduces the Jena Document Information System (JeDIS). The focus lies on its capability to partition annotation graphs into modules. Annotation modules are defined in terms of types from the annotation schema. Modules allow easy manipulation of their annotations (deletion or update) and the creation of alternative annotations of individual documents even for annotation formalisms that by design do not support this feature.
Mathematical Expressions (ME) and words are carefully bonded in technical writing to characterize physical concepts and their interactions quantitatively and qualitatively. This paper proposes the Qualitative-Quantitative (QuQn) map as an abstraction of scientific papers that depicts the dependency among MEs and their most related adjacent words. The QuQn map aims to offer a succinct representation of the flow of reasoning in a paper. Various filters can be applied to a QuQn map to reduce redundant or indirect links, control the display of problem settings (simple ME variables with declaration), and prune nodes with specific topological properties, such as the largest connected subgraph. We developed a visualization tool prototype to support interactive browsing of the technical contents at different levels of detail.
Citation analysis is considered one of the major and most popular branches of bibliometrics. Traditional citation analysis is based on the assumption that all citations have similar value and weights each of them equally. Specific research fields like content-based citation analysis (CCA) seek to explain the "how" and "why" of citation behavior. In this paper, we tackle the "how" using a centrality indicator based on factors which are built automatically from the authors' citation behavior. This indicator makes it possible to evaluate the importance of bibliographical references for reading the paper with which the user interacts. From objective quantitative measurements, factors are computed in order to characterize the level of granularity at which citations are used. By setting the centrality indicator's factors, we can highlight citations which tend towards a partial or a global construction of the authors' discourse. We carry out a pilot study in which we test our approach on some papers and discuss the challenges in carrying out citation analysis in this context. Our results show interesting and consistent correlations between the level of granularity and the significance of citation influences.
We propose OurDirection, an open-domain dialogue framework specialized in mimicking the Hansard (debate) materials from the Canadian House of Commons. In this framework, we employ two sets of neural network models (a Hierarchical Recurrent Encoder-Decoder (HRED) and an RNN) to generate the dialogue responses. Extensive experiments on the Hansard dataset show that the models can learn the structure of the debates and can produce reasonable responses to user entries.
Reprinting Japanese historical manuscripts is time-consuming and requires training because they are handwritten and may contain characters different from those currently used. We have proposed a framework for assisting the human process of reading Japanese historical manuscripts and implemented part of a system based on the framework as a Web service. In this paper, we present a graphical user interface (GUI) for the system and the reprint process through the GUI. We conducted a user test to evaluate the system with the GUI using a questionnaire. From the results of the experiment, we confirmed that the GUI can be used intuitively, but we also found points to be improved in the GUI.
Each day, a vast amount of data is published on the web. In addition, the rate at which content is being published is growing, which has the potential to overwhelm users, particularly those who are technically unskilled. Furthermore, users from various domains of expertise face challenges when trying to retrieve the data they require. They may rely on IT experts, but these experts have limited knowledge of individual domains, making data extraction a time-consuming and error-prone task. It would be beneficial if domain experts were able to retrieve needed data and create relatively complex queries on top of web documents. The existing query solutions either are limited to a specific domain or require beginning with a predefined knowledge base or sample ontologies. To address these limitations, we propose a goal-oriented platform that enables users to easily extract data from web documents. This platform enables users to express their goals in natural language, after which the platform elicits the corresponding result type using the algorithm proposed. The platform also applies the concept of ontology to semantically improve search results. To retrieve the most relevant results from web documents, the segments of a user's query are mapped to the entities of the ontology. Two types of ontologies are used: goal ontologies and domain-specific ones, which comprise domain concepts and the relationships among them. In addition, the platform helps domain experts to generate the domain ontologies that will be used to extract data from web documents. Placing ontologies at the center of the approach integrates a level of semantics into the platform, resulting in more-precise output. The main contributions of this research are that it provides a goal-oriented platform for extracting data from web documents and integrates ontology-based development into web-document searches.
Suppose that you write a text or find a text that looks interesting on the Web, and that you want to create an e-book from it. When creating an e-book from such a text file, you have to create a cover page for the e-book. However, existing conversion services/tools cannot automatically produce a cover page that reflects the impression of the text. In this paper, in order to support users in creating "good" cover pages for such texts, we propose a method for recommending colors and fonts for the cover pages of given texts/cover-less EPUB books. In our method, colors and fonts are selected so that they reflect the impression of the contents of the given texts/EPUB books.
Attention guides computation to focus on important parts of the input data. For pairwise input, existing attention approaches tend to bias towards trivial repetitions (e.g. punctuation and stop words) between two texts, and thus fail to contribute reasonable guidance to model predictions. As a remedy, we suggest taking into account corpus-level information via global-aware attention. In this paper, we propose an attention mechanism that makes use of intra-text, inter-text and global contextual information. We undertake an ablation study on paraphrase identification, and demonstrate that the proposed attention mechanism can obviate the downsides of trivial repetitions and provide interpretable word weightings.
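The intuition can be sketched with a toy, non-neural example in which word weights for a text pair combine an inter-text match signal with a corpus-level rarity signal, so trivial repetitions such as stop words receive little attention even when they match. This is only an illustration of the idea, not the proposed attention mechanism.

```python
import math
from collections import Counter

def global_idf(corpus):
    """Corpus-level rarity signal: inverse document frequency per word."""
    N = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc.lower().split()))
    return {w: math.log(N / df[w]) for w in df}

def pair_attention(text_a, text_b, idf):
    """Attention weights over text_a: inter-text match signal x global rarity."""
    a, b = text_a.lower().split(), text_b.lower().split()
    raw = [(w, (1.0 if w in b else 0.1) * idf.get(w, max(idf.values()))) for w in a]
    z = sum(s for _, s in raw) or 1.0
    return [(w, s / z) for w, s in raw]              # normalized weights

corpus = ["the cat sat on the mat .", "the dog sat on the rug .",
          "stock prices fell sharply .", "the market rallied today ."]
idf = global_idf(corpus)
print(pair_attention("the cat sat on the mat .", "the dog sat on the rug .", idf))
```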
Short text clustering is an important but challenging task. We investigate the impact of similarity matrix sparsification on the performance of short text clustering. We show that two sparsification methods (the proposed Similarity Distribution based method, and k-nearest neighbors), which aim to retain a prescribed number of similarity elements per text, improve the hierarchical clustering quality of short texts for various text similarities. Combined with a word-embedding-based similarity, these methods yield results competitive with state-of-the-art methods for short text clustering, especially in the general domain, and are faster than the main state-of-the-art baseline.
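A small sketch of the k-nearest-neighbour sparsification baseline described here: for each text only its k largest similarity entries are retained before hierarchical clustering. The proposed Similarity Distribution based method instead chooses the cut-off adaptively, which is not reproduced below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def knn_sparsify(sim, k=5):
    """Keep only the k largest similarity entries per row (plus self), zero the rest."""
    sparse = np.zeros_like(sim)
    for i, row in enumerate(sim):
        keep = np.argsort(row)[-(k + 1):]           # k neighbours plus self
        sparse[i, keep] = row[keep]
    return np.maximum(sparse, sparse.T)             # keep the matrix symmetric

def cluster(sim, n_clusters, k=5):
    """Hierarchical clustering on the sparsified similarity matrix."""
    sparse = knn_sparsify(sim, k)
    dist = 1.0 - sparse                             # turn similarity into distance
    np.fill_diagonal(dist, 0.0)
    condensed = dist[np.triu_indices_from(dist, 1)]
    return fcluster(linkage(condensed, method="average"), n_clusters, "maxclust")
```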
Extracting key terms from technical documents allows us to write effective documentation that is specific and clear, with minimal ambiguity and confusion caused by nearly synonymous but different terms. For instance, in order to avoid confusion, the same object should not be referred to by two different names (e.g. "hydraulic oil filter"). In the modern world of commerce, clear terminology is the hallmark of successful RFPs (Requests for Proposal) and is therefore a key to the growth of competitive organizations. While Automatic Term Extraction (ATE) is a well-developed area of study, its applications in the technical domain have been sparse and constrained to certain narrow areas such as biomedical research. We present a method for Automatic Term Extraction in the technical domain based on the use of part-of-speech features and common-word information. The method is evaluated on a C programming language reference manual as well as a manual of aircraft maintenance guidelines, and shows results comparable to or better than the reported state of the art.
Historically, people have interacted with companies and institutions through telephone-based dialogue systems and paper-based forms. Now, these interactions are rapidly moving to web- and phone-based chat systems. While converting traditional telephone dialogues to chat is relatively straightforward, converting forms to conversational interfaces can be challenging. In this work, we introduce methods and interfaces to enable the conversion of PDF and web-based documents that solicit user input into chat-based dialogues. Document data is first extracted to associate fields and their textual descriptions using metadata and lightweight visual analysis. The field labels, their spatial layout, and associated text are further analyzed to group related fields into natural conversational units. These correspond to questions presented to users in chat interfaces to solicit information needed to complete the original documents and downstream processes they support. This user supplied data can be inserted into the source documents and/or in downstream databases. User studies of our tool show that it streamlines form-to-chat conversion and produces conversational dialogues of at least the same quality as a purely manual approach.