JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries


SESSION: Invited talks

Bridging worlds: indigenous knowledge in the digital world

The digitisation of Indigenous knowledge is challenging, given epistemological differences and the lack of involvement of Indigenous people. Drawing from our community projects in Namibia, we share approaches for co-designing technologies and digital presentations of Indigenous knowledge.

D-Lib Magazine pioneered web-based scholarly communication

The web began with a vision, as stated by Tim Berners-Lee in 1991, "that much academic information should be freely available to anyone". For many years, the development of the web and the development of digital libraries and other scholarly communications infrastructure proceeded in tandem. A milestone occurred in July 1995, when the first issue of D-Lib Magazine was published as an online, HTML-only, open access magazine, serving as the focal point for the then-emerging digital library research community. In 2017 it ceased publication, in part due to the maturity of the community it served as well as the increasing availability of and competition from eprints, institutional repositories, conferences, social media, and online journals - the very ecosystem that D-Lib Magazine nurtured and enabled. As long-time members of the digital library community and frequent contributors to D-Lib Magazine, we reflect on the many innovations that D-Lib Magazine pioneered and that the web made possible, including: open access, HTML-only publication embracing the hypermedia opportunities afforded by HTML, persistent identifiers and stable URLs, rapid publication, and community engagement. Although it ceased publication after 22 years and 265 issues, it remains unchanged on the live web and still provides a benchmark for academic serials and web-based publishing.

SESSION: Natural language processing

A domain-adaptive pre-training approach for language bias detection in news

Media bias is a multi-faceted construct influencing individual behavior and collective decision-making. Slanted news reporting results from one-sided and polarized writing, which can occur in various forms. In this work, we focus on an important form of media bias, i.e., bias by word choice. Detecting biased word choices is a challenging task due to its linguistic complexity and the lack of representative gold-standard corpora. We present DA-RoBERTa, a new state-of-the-art transformer-based model adapted to the media bias domain that identifies sentence-level bias with an F1 score of 0.814. We also train DA-BERT and DA-BART, two further transformer models adapted to the bias domain. Our proposed domain-adapted models outperform prior bias detection approaches on the same data.

X-SCITLDR: cross-lingual extreme summarization of scholarly documents

The number of scientific publications nowadays is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Consequently, recent work on applying text mining technologies for scholarly publications has investigated the application of automatic text summarization technologies, including extreme summarization, for this domain. However, previous work has concentrated only on monolingual settings, primarily in English. In this paper, we fill this research gap and present an abstractive cross-lingual summarization dataset for four different languages in the scholarly domain, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage 'summarize and translate' approach and a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks and analyze performance in zero- and few-shot scenarios.

TinyGenius: intertwining natural language processing with microtask crowdsourcing for scholarly knowledge graph creation

As the number of published scholarly articles grows steadily each year, new methods are needed to organize scholarly knowledge so that it can be more efficiently discovered and used. Natural Language Processing (NLP) techniques are able to autonomously process scholarly articles at scale and to create machine-readable representations of the article content. However, autonomous NLP methods fall far short of the accuracy needed to create a high-quality knowledge graph, and quality is crucial for the graph to be useful in practice. We present TinyGenius, a methodology for validating NLP-extracted scholarly knowledge statements using crowdsourced microtasks. The scholarly context in which the crowd workers operate poses multiple challenges; in particular, the explainability of the employed NLP methods is crucial for providing the context that supports crowd workers' decisions. We employed TinyGenius to populate a paper-centric knowledge graph using five distinct NLP methods. The resulting knowledge graph serves as a digital library for scholarly articles.

Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach

The challenge of automatically extracting metadata from scientific PDF documents varies with the diversity of layouts within the PDF collection. In some disciplines, such as the German social sciences, authors are not required to follow a specific template and often create their own, yielding a high diversity of appearance across publications. Overcoming this diversity using only Natural Language Processing (NLP) approaches is not always effective, which is reflected in the unavailability of metadata for a large portion of German social science publications. We therefore propose a multimodal neural network model that combines NLP with Computer Vision (CV) for metadata extraction from scientific PDF documents, aiming to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments on around 8,800 documents demonstrate the model's effectiveness over unimodal models, with an overall F1 score of 92.3%.

SESSION: Information retrieval & access

Specialized document embeddings for aspect-based similarity of research papers

Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only one perspective on document similarity and ignores which aspects make two documents alike. To address this limitation, aspect-based similarity measures have been developed using document segmentation or pairwise multi-class document classification. While segmentation harms document coherence, the pairwise classification approach scales poorly to large corpora. In this paper, we treat aspect-based similarity as a classical vector similarity problem in aspect-specific embedding spaces. We represent a document not as a single generic embedding but as multiple specialized embeddings. Our approach avoids document segmentation and scales linearly w.r.t. the corpus size. In an empirical study, we use the Papers with Code corpus containing 157,606 research papers and consider the task, method, and dataset of the respective research papers as their aspects. We compare and analyze three generic document embeddings, six specialized document embeddings, and a pairwise classification baseline in the context of research paper recommendations. As generic document embeddings, we consider FastText, SciBERT, and SPECTER. To compute the specialized document embeddings, we compare three alternative methods inspired by retrofitting, fine-tuning, and Siamese networks. In our experiments, Siamese SciBERT achieved the highest scores. Additional analyses indicate an implicit bias of the generic document embeddings towards the dataset aspect and against the method aspect of each research paper. Our approach of aspect-based document embeddings mitigates potential risks arising from implicit biases by making them explicit. This can, for example, be used for more diverse and explainable recommendations.

Studying retrievability of publications and datasets in an integrated retrieval system

In this paper, we investigate the retrievability of datasets and publications in a real-life Digital Library (DL). The retrievability measure was originally developed to quantify the influence a retrieval system has on access to information. It can also enable DL engineers to evaluate their search engine and determine how easily the content in the collection can be accessed. Following this methodology, we propose a system-oriented approach to studying dataset and publication retrieval. A distinctive feature of this paper is its focus on measuring the accessibility bias of various types of DL items, including a metric of usefulness. Among other metrics, we use Lorenz curves and Gini coefficients to visualize the differences between the two retrievable document types (datasets and publications). Empirical results reported in the paper show a clear disparity in retrievability scores between documents of the two types.
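
The Gini coefficient mentioned above summarizes how unevenly retrievability is spread across a collection: 0 means every document is equally retrievable, while values near 1 mean access is concentrated on a few documents. As a hedged illustration (not the paper's implementation), a minimal computation over a vector of retrievability scores might look like:

```python
import numpy as np

def gini(scores):
    """Gini coefficient of a set of retrievability scores.

    0 = perfectly even access; values near 1 = access concentrated
    on a few documents.
    """
    s = np.sort(np.asarray(scores, dtype=float))
    n = s.size
    # Standard closed form: G = 2 * sum(i * s_i) / (n * sum(s)) - (n + 1) / n
    index = np.arange(1, n + 1)
    return (2.0 * np.sum(index * s) / (n * np.sum(s))) - (n + 1.0) / n

# A uniform collection has Gini 0; a highly skewed one approaches 1.
print(gini([1, 1, 1, 1]))   # 0.0
print(gini([0, 0, 0, 10]))  # 0.75
```

Plotting the cumulative share of total retrievability against the cumulative share of documents (sorted ascending) gives the corresponding Lorenz curve.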

What a publication tells you: benefits of narrative information access in digital libraries

Knowledge bases enable effective access paths in digital libraries: users can specify their information need as graph patterns for precise searches and structured overviews (by allowing variables in queries). But especially when considering textual sources that contain narrative information, i.e., short stories of interest, harvesting statements from them to construct knowledge bases may seriously threaten the statements' validity. A piece of information, originally stated within a coherent line of argument, could be used in knowledge base query processing without regard for its vital context conditions, which can lead to invalid results. That is why we argue for moving towards narrative information access by considering contexts in the query processing step. In this way, digital libraries can allow users to query for narrative information and supply them with valid answers. In this paper we define narrative information access, demonstrate its benefits for COVID-19-related questions, and argue for its generalizability to other domains such as the political sciences.

SESSION: Search and recommendation

Causal factorization machine for robust recommendation

Factorization Machines (FMs) are widely used for collaborative recommendation because of their effectiveness and flexibility in feature interaction modeling. Previous FM-based works have stressed the importance of selecting useful features, since incorporating unnecessary features introduces noise and reduces recommendation performance. However, previous feature selection algorithms for FMs rest on the i.i.d. hypothesis and select features according to their importance to predictive accuracy on training data. This i.i.d. assumption is often violated in real-world applications, where shifts between training and testing sets may exist. In this paper, we consider achieving causal feature selection in FMs so as to enhance the robustness of recommendation when the distributions of training and testing data differ. Moreover, unlike other machine learning tasks such as image classification, which usually select a global set of causal features for a predictive model, we emphasize personalized causal feature selection in recommendation scenarios, since the causal features for different users may differ. To achieve our goal, we propose a personalized feature selection method for FMs and adopt the confounder balancing approach to balance the confounders for every treatment feature. We conduct experiments on three real-world datasets and compare our method with representative shallow and deep FM-based baselines, showing its effectiveness in enhancing the robustness of recommendations and improving recommendation accuracy.
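
For readers unfamiliar with the base model this work builds on: a second-order FM scores a feature vector with a global bias, linear terms, and factorized pairwise interactions. The sketch below shows the standard FM prediction (Rendle's formulation), not the authors' causal feature selection variant, using the O(d*k) reformulation of the interaction term:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order Factorization Machine score for one feature vector x.

    w0: global bias; w: linear weights, shape (d,); V: factor matrix (d, k).
    Uses the identity
      sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f ((Vx)_f^2 - (V^2 x^2)_f),
    which avoids the naive O(d^2) pairwise loop.
    """
    linear = w0 + w @ x
    vx = V.T @ x                   # (k,) per-factor weighted sums
    v2x2 = (V ** 2).T @ (x ** 2)   # (k,) per-factor squared sums
    interactions = 0.5 * np.sum(vx ** 2 - v2x2)
    return linear + interactions

# Toy example with illustrative numbers only: 4 features, 2 latent factors.
x = np.array([1.0, 0.0, 1.0, 0.0])
V = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.0, 0.1]])
print(fm_predict(x, 0.5, np.array([0.1, 0.2, 0.3, 0.4]), V))
```

Feature selection in this setting amounts to deciding which columns of x (and rows of w and V) to keep; the paper's contribution is making that choice causal and per-user rather than based on i.i.d. training accuracy.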

Query-specific subtopic clustering

We propose a Query-Specific Siamese Similarity Metric (QS3M) for query-specific clustering of text documents. Our approach uses fine-tuned BERT embeddings to train a non-linear projection into a query-specific similarity space. We build on the idea of Siamese networks but include a third component, a representation of the query. QS3M is able to model the fine-grained similarity between text passages about the same broad topic and also generalizes to new unseen queries during evaluation. The empirical evaluation for clustering employs two TREC datasets and a set of academic abstracts from arXiv. When used to obtain query-relevant clusters, QS3M achieves a 12% performance improvement on the TREC datasets over a strong BERT-based reference method and many baselines such as TF-IDF and topic models. A similar improvement is observed for the arXiv dataset suggesting the general applicability of QS3M to different domains. Qualitative evaluation is carried out to gain insight into the strengths and limitations of the model.

On modifying evaluation measures to deal with ties in ranked lists

Evaluation metrics for search and ranking systems are generally designed for a linear list of ranked items without ties. However, ties in ranked lists arise naturally for certain systems or techniques. Evaluation protocols typically break such ties arbitrarily and compute the standard metrics. If the number of ties is non-trivial, it is more principled to use modified, tie-aware formulations of these metrics. For most commonly used metrics, McSherry and Najork [5] present tie-aware definitions that are more appropriate for assessing the quality of systems that retrieve multiple distinct results at the same rank. This paper proposes a tie-aware version of Hit@k, which we call ta-Hit@k; Hit@k is widely used for some tasks but is not covered in [5]. We also empirically compare the values of ta-Hit@k and Hit@k for a single example system on a standard benchmark task.

SESSION: Web archives

ABCDEF: the 6 key features behind scalable, multi-tenant web archive processing with ARCH: archive, big data, concurrent, distributed, efficient, flexible

Over the past quarter-century, web archive collection has become a user-friendly process thanks to cloud-hosted solutions such as the Internet Archive's Archive-It subscription service. Despite these advances in collecting web archive content, no equally user-friendly, cloud-hosted analysis system has emerged. Web archive processing and research require significant hardware resources and cumbersome tools that interdisciplinary researchers find difficult to work with. In this paper, we identify six principles - the ABCDEFs (Archive, Big data, Concurrent, Distributed, Efficient, and Flexible) - used to guide the development and design of a system that makes transforming and working with web archive data as enjoyable as the collection process. We make these objectives - largely common sense - explicit and transparent in this paper. They can be employed by any computing platform in the area of digital libraries and archives and adapted by teams seeking to implement similar infrastructures. Furthermore, we present ARCH (Archives Research Compute Hub), the first cloud-based system designed from scratch to meet all six key principles. ARCH is an interactive interface, closely connected with Archive-It, engineered to provide analytical actions, specifically generating datasets and in-browser visualizations. It streamlines research workflows while eliminating the burden of computing requirements. Building on past work by both the Internet Archive (Archive-It Research Services) and the Archives Unleashed Project (the Archives Unleashed Cloud), this merged platform achieves a scalable processing pipeline for web archive research. It is open source and can be considered a reference implementation of the ABCDEF, which we have evaluated and discussed in terms of feasibility and compliance as a benchmark for similar platforms.

Investigating bloom filters for web archives' holdings

What web archives hold is often opaque to the public and even experts in the domain struggle to provide precise assessments. Given the increasing need for and use of crawled and archived web resources, discovery of individual records as well as sharing of entire holdings are pressing use cases. We investigate Bloom Filters (BFs) and their applicability to address these use cases. We experiment with and analyze parameters for their creation, measure their performance, outline an approach for scalability, and describe various pilot implementations that showcase their potential to meet our needs. BFs come with beneficial characteristics and hence have enjoyed popularity in various domains. We highlight their suitability for web archiving use cases and how they can contribute to very fast and accurate search services.
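
A Bloom filter, the data structure investigated above, encodes set membership in a fixed-size bit array using several hash functions: lookups never yield false negatives, and the false positive rate is tunable via the array size and hash count. The following is a minimal sketch of the idea, not one of the pilot implementations described in the paper:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, j independent hash functions.

    add() sets j bit positions; a membership test reports True only if
    all j positions are set, so absent items are reported absent except
    for a tunable false positive rate.
    """
    def __init__(self, m_bits=1024, j_hashes=4):
        self.m = m_bits
        self.j = j_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive j hash positions by salting SHA-256 with the hash index.
        for i in range(self.j):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("https://example.org/page")
print("https://example.org/page" in bf)   # True (never a false negative)
print("https://example.org/other" in bf)  # False with high probability
```

For archive holdings, each archived URL would be added once; another institution can then test its own URLs against the filter without ever seeing the full holdings list, which is what makes the structure attractive for sharing.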

StreamingHub: interactive stream analysis workflows

Reusable data/code and reproducible analyses are foundational to quality research. This aspect, however, is often overlooked when designing interactive stream analysis workflows for time-series data (e.g., eye-tracking data). A mechanism to transmit informative metadata alongside data may allow such workflows to intelligently consume data, propagate metadata to downstream tasks, and thereby auto-generate reusable, reproducible analytic outputs with zero supervision. Moreover, a visual programming interface to design, develop, and execute such workflows may allow rapid prototyping for interdisciplinary research. Capitalizing on these ideas, we propose StreamingHub, a framework to build metadata propagating, interactive stream analysis workflows using visual programming. We conduct two case studies to evaluate the generalizability of our framework. Simultaneously, we use two heuristics to evaluate their computational fluidity and data growth. Results show that our framework generalizes to multiple tasks with a minimal performance overhead.

SESSION: Biblio- /altmetrics

Open-source code repository attributes predict impact of computer science research

With the increased importance of transparency and reproducibility in computer science research, it has become common to publicly release open-source repositories that contain the code, data, and documentation alongside a publication. We study the relationship between the transparency of a publication (as represented by the attributes of its open-source repository) and its scientific impact (as represented by paper citations). Using the Mann-Whitney test and Cliff's delta, we observed a statistically significant difference in citations between papers with and without an associated open-source repository. We also observed a statistically significant correlation (p < 0.01) between citations and several repository interaction features: Stars, Forks, Subscribers, and Issues. Finally, using time-series features of repository growth (Stars), we trained a classifier to predict whether a paper would be highly cited (top 10%) with cross-validated AUROC of 0.8 and AUPRC of 0.65. Our results provide evidence that those who make sustained efforts in making their works transparent also tend to have a higher scientific impact.
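
Cliff's delta, used above alongside the Mann-Whitney test, is a nonparametric effect size: the probability that a value from one group exceeds a value from the other, minus the reverse. A sketch of the pairwise definition, using hypothetical citation counts rather than the paper's data:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs.

    Ranges from -1 to 1; 0 means the two groups overlap completely.
    O(n*m) pairwise version; fine for illustration, not large samples.
    """
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Hypothetical citation counts for papers with and without a repository.
with_repo = [12, 30, 45, 8, 21]
without_repo = [5, 9, 14, 3, 7]
print(cliffs_delta(with_repo, without_repo))  # 0.76
```

Unlike a p-value, the delta stays interpretable at any sample size, which is why it is commonly paired with the Mann-Whitney test in this kind of analysis.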

Altmetrics and citation counts: an empirical analysis of the computer science domain

Background. Researchers, funding agencies, and institutions use bibliographic data to assess the impact or reputation of papers, publication venues, researchers, and institutions. Citation counts in particular, and metrics built on them (e.g., impact factor, h-index), are widely used, despite extensive and rightful criticism regarding, for instance, their meaning, value, and comparability. Moreover, such metrics require time to accumulate and do not represent scientific impact outside of academia, for instance, on industry. To overcome such limitations, researchers investigate and propose altmetrics to complement or provide a more meaningful alternative to traditional metrics. Altmetrics are based on user interactions on the internet, especially social-media platforms, promising faster accumulation and representation of scientific impact on other parts of society. Aim. In this paper, we complement current research by studying the altmetrics of 18,360 papers published at 16 publication venues in the computer science domain. Method. We conducted an empirical study to understand whether altmetrics correlate with citation counts and how they have evolved over time. Results. Our results help clarify how altmetrics can complement citation counts, and which of them serve as proxy metrics indicating the immediate impact of a paper as well as future citations. We discuss our results extensively to reflect on the limitations of and criticism of such metrics. Conclusion. Our findings suggest that altmetrics can helpfully complement citation metrics, potentially providing a better picture of overall scientific impact and reducing the potential biases of focusing solely on citations.

Pre-trained transformer-based citation context-aware citation network embeddings

Academic papers form citation networks wherein each paper is a node and citation relationships between papers are edges. The embeddings of papers obtained by projecting the citation network into a vector space are called citation network embeddings. Thus far, only a limited number of studies have incorporated information regarding the intent of one paper to cite another. We consider citation context, i.e., the text used to cite a paper, as a source of information about citation intent, and propose a new method for generating citation context-aware citation network embeddings. We trained SciBERT with our proposed masked paper prediction task, in which the model predicts the cited paper from the citing paper and the citation context. In addition, we propose a new loss function that considers not only the citation context but also the neighboring nodes in the citation network. We conducted experiments on citation-recommendation and paper-classification tasks formulated on two existing datasets: FullTextPeerRead and AASC. For both tasks, the proposed method outperformed hyperdoc2vec, an existing method for citation context-aware citation network embedding; for paper classification, it further achieved performance comparable to a state-of-the-art citation network embedding that does not utilize any citation context.

SESSION: Information extraction

Mining mathematical documents for question answering via unsupervised formula labeling

The increasing number of questions on Question Answering (QA) platforms like Math Stack Exchange (MSE) signifies a growing information need to answer math-related questions. However, there is currently very little research on open data QA systems that retrieve mathematical formulae by their concept names or query formula identifier relationships from knowledge graphs. In this paper, we aim to bridge this gap by presenting data mining methods and benchmark results that employ Mathematical Entity Linking (MathEL) and Unsupervised Formula Labeling (UFL) for semantic formula search and mathematical question answering (MathQA) on the arXiv preprint repository, Wikipedia, and Wikidata. The new methods extend our previously introduced system, which is part of the Wikimedia ecosystem of free knowledge. Based on different types of information needs, we evaluate our system in 15 information need modes, assessing over 7,000 query results. Furthermore, we compare its performance to a commercial knowledge base and computation engine (Wolfram Alpha) and a search engine (Google). The open-source system is hosted by Wikimedia at https://mathqa.wmflabs.org. A demo video is available at purl.org/mathqa.

Visual descriptor extraction from patent figure captions: a case study of data efficiency between BiLSTM and transformer

Technical drawings used for illustrating designs are ubiquitous in patent documents, especially design patents. Different from natural images, these drawings are usually made using black strokes with little color information, making it challenging for models trained on natural images to recognize objects. To facilitate indexing and searching, we propose an effective and efficient visual descriptor model that extracts object names and aspects from patent captions to annotate benchmark patent figure datasets. We compared two state-of-the-art named entity recognition (NER) models and found that with a limited number of annotated samples, the BiLSTM-CRF model outperforms the Transformer model by a significant margin, achieving an overall F1=96.60%. We further conducted a data efficiency study by varying the number of training samples and found that BiLSTM consistently beats the transformer model on our task. The proposed model is used to annotate a benchmark patent figure dataset.

SESSION: Search and recommendation II

Asking for help in community question-answering: the goal-framing effect of question expression on response networks

Community question-answering (CQA) enables both information retrieval and social interactions. CQA questions are viewed as goal expressions from the askers' perspectives. Most prior studies focused on the goals expressed in the questions, but not on how responders' expectations and responses are influenced by the goal expressions. To fill this gap, this research proposes the use of framing theory to understand how different expressions of goals influence responses. Cues within questions were used to identify goal frames in CQA questions. Social network analysis was used to construct response networks whose nodes represent postings and whose connections represent responses. Our results reveal that goal frames with high complexity, high specificity, and rewards tend to increase the centrality of questions. In contrast, low complexity and low specificity tend to generate extensive conversations. Implications for both researchers and practitioners are discussed in the final section.

Recommending research papers to chemists: a specialized interface for chemical entity exploration

Researchers and scientists increasingly rely on specialized information retrieval (IR) or recommendation systems (RS) to support them in their daily research tasks. Paper recommender systems are one such tool scientists use to stay on top of the ever-increasing number of academic publications in their field. Improving research paper recommender systems is an active research field. However, less research has focused on how the interfaces of research paper recommender systems can be tailored to suit the needs of different research domains. For example, in the field of biomedicine and chemistry, researchers are not only interested in textual relevance but may also want to discover or compare the contained chemical entity information found in a paper's full text. Existing recommender systems for academic literature do not support the discovery of this non-textual, but semantically valuable, chemical entity data. We present the first implementation of a specialized chemistry paper recommender system capable of visualizing the contained chemical structures, chemical formulae, and synonyms for chemical compounds within the document's full text. We review existing tools and related research in this field before describing the implementation of our ChemVis system. With the help of chemists, we are expanding the functionality of ChemVis, and will perform an evaluation of recommendation performance and usability in future work.

SESSION: User behavior

Leveraging user interaction signals and task state information in adaptively optimizing usefulness-oriented search sessions

Current information retrieval (IR) systems still face plenty of challenges when applied to complex search tasks (CSTs) that trigger multi-round search iterations. Existing relevance-oriented optimization algorithms and metrics are of limited help in leading users to documents that are useful for completing CSTs, rather than merely topically relevant. To address this gap, our work characterizes CSTs from a process-oriented perspective and develops a state-based adaptive approach to simulating and evaluating search path recommendations. Based on data collected from 80 journalism search sessions, we first extracted intention-based task states from participants' annotations to characterize their temporal cognitive changes during searching, and validated the state labels with expert assessments. Building on the state labels and state distribution patterns, we then developed a simulated adaptive search path recommendation approach aiming to help users find the useful documents they need more quickly. The results demonstrate that 1) different types of CSTs can be differentiated based on their distinct state distribution and transition patterns; and 2) after a small number of training iterations, our adaptive recommendation algorithm consistently outperforms the best possible performance of individual participants in terms of usefulness-based search efficiency across all CSTs. Going beyond the traditional static viewpoint of task facets and the relevance-focused evaluation approach, our work characterizes CSTs from a dynamic perspective and develops a domain-specific adaptive search algorithm that helps users find useful documents more quickly and learns from online search logs. Our findings can facilitate future exploration of adaptive search path adjustments for similar types of CSTs in other domains and work task scenarios.

Reliable editions from unreliable components: estimating ebooks from print editions using profile hidden Markov models

A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.

Information seeking within academic digital libraries: a survey of graduate student search strategies

When searching within an academic digital library, a variety of information seeking strategies may be employed. The purpose of this study is to determine whether graduate students choose information seeking strategies appropriate to the complexity of a given search scenario, and to explore other factors that could influence their decisions. We used a survey method in which participants (n=33) were asked to recall their most recent academic digital library search session matching each of two given scenarios (randomly chosen from four alternatives), and for each scenario to identify whether they employed search strategies associated with four different information seeking models. Although we expected that the information seeking strategies used would be influenced by the search scenario, this was not the case. The factors that affected whether a participant would use an advanced information seeking strategy were their graduate-level academic search training and their primary research methodology. These findings highlight that while it is important to train graduate students in conducting academic digital library searches, more work is needed to train them to match information seeking strategies to the complexity of their search tasks and to develop interfaces that guide their search process.

A reference dependence approach to enhancing early prediction of session behavior and satisfaction

There is substantial evidence from behavioral economics and the decision sciences that, in decision-making under uncertainty, the carriers of value behind actions are gains and losses defined relative to a reference point (e.g., pre-action expectations), rather than absolute final outcomes. In addition, the ability to predict session-level search decisions and user experience early is essential for developing reactive and proactive search recommendations. To address these research gaps, our study aims to 1) develop reference dependence features based on a series of simulated user expectations, or reference points, in the first query segments of sessions, and 2) examine the extent to which these features enhance the early prediction of session behavior and user satisfaction. Experimental results on three datasets of varying types show that incorporating reference dependence features developed in first query segments into prediction models achieves better performance than using baseline cost-benefit features alone in the early prediction of three key session metrics (user satisfaction score, session clicks, and session dwell time). Moreover, in simulations varying the search time expectation and the rate of user satisfaction decay, users tended to expect to complete their search within a minute, and their satisfaction decayed rapidly, in a logarithmic fashion, once the estimated expectation point was surpassed. By factoring in a user's search time expectation and measuring their behavioral response once that expectation is not met, we can further improve the performance of early prediction models and enhance our understanding of users' behavioral patterns.

SESSION: Scholarly communication I

Comparing different perspectives of characterizing interdisciplinarity of scientific publications: author vs. publication perspectives

This study compares two distinct perspectives for characterizing the interdisciplinarity of scientific publications: author-based and publication-based. The publication-based perspective calculates the interdisciplinarity of a publication from its references and citing papers. The author-based perspective characterizes the interdisciplinarity of a publication using its authors' prior publication records. We employ various extant diversity indicators, including variety, balance, disparity, Simpson's index, 2DS, and ID, to capture the characteristics of interdisciplinarity from the different perspectives. An overall analysis, a journal-level analysis, and case studies are provided. We find that the author- and publication-based perspectives are distinct from each other, and together they contribute to a more holistic picture of interdisciplinarity. Additionally, our results show that characterizing interdisciplinarity from authors' citing papers generates a ranking that better distinguishes publications.
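As a rough, hypothetical illustration of one of the diversity indicators named in the abstract above (a generic sketch, not the paper's own code or category scheme), the Gini-Simpson form of Simpson's index over the subject categories of a publication's references can be computed as:

```python
from collections import Counter

def simpson_diversity(categories):
    """Gini-Simpson diversity: probability that two references drawn at
    random (with replacement) belong to different subject categories.
    0 = all references share one category; higher = more interdisciplinary."""
    counts = Counter(categories)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical reference category lists for two publications
narrow = simpson_diversity(["biology"] * 6)                    # one field only
broad = simpson_diversity(["biology", "cs", "math", "physics"])  # evenly spread
```

Note this sketch ignores the disparity between categories, which indicators such as 2DS additionally weight by inter-category distance.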

How does author affiliation affect preprint citation count?: analyzing citation bias at the institution and country level

Citing is an important aspect of scientific discourse and central to quantifying the scientific impact of researchers. Previous works observed that citations are made not only on the basis of pure scholarly contribution but also on the basis of non-scholarly attributes, such as the affiliation or gender of authors; in this way, citation bias is produced. Existing works, however, have not analyzed preprints with respect to citation bias, although they play an increasingly important role in modern scholarly communication. In this paper, we investigate whether preprints are affected by citation bias with respect to author affiliation. We measure citation bias for bioRxiv preprints and their publisher versions at the institution level and the country level, using the Lorenz curve and the Gini coefficient. This allows us to mitigate the effects of confounding factors and to see whether citation biases related to author affiliation have an increased effect on preprint citations. We observe consistently higher Gini coefficients for preprints than for publisher versions. Thus, we confirm that citation bias exists and that it is more severe for preprints. As preprints are on the rise, affiliation-based citation bias is an important topic not only for authors (e.g., when deciding what to cite) but also for people and institutions that use citations for scientific impact quantification (e.g., funding agencies deciding about funding based on citation counts).
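The Gini coefficient mentioned above is a standard inequality measure; as a hedged illustration (a generic sketch, not the paper's implementation), it can be computed over a list of citation counts like so:

```python
def gini(values):
    """Gini coefficient of non-negative counts.
    0 = perfect equality (all items cited equally);
    values near 1 = citations concentrated on few items."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form over ascending-sorted values, 1-indexed:
    # G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    cum = sum(i * x for i, x in enumerate(xs, start=1))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

# Hypothetical citation counts per institution
equal_spread = gini([4, 4, 4])    # 0.0: no concentration
concentrated = gini([10, 1, 1])   # higher: most citations go to one institution
```

A higher Gini coefficient for preprints than for publisher versions, as the study reports, would indicate citations concentrating more heavily on a few affiliations.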

DeepASPeer: towards an aspect-level sentiment controllable framework for decision prediction from academic peer reviews

Peer review is the widely accepted mechanism for determining the quality of scientific work. Even though peer review has been an integral part of academia since the 1600s, it frequently receives criticism for a lack of transparency and consistency. Even for humans, predicting the peer review outcome is a challenging task, as many dimensions and human factors are involved. However, Artificial Intelligence (AI) techniques can assist the editor/chair in anticipating the final decision based on the reviews from the human reviewers. Peer review texts reflect the reviewers' opinions and sentiments on various aspects of the paper's research (e.g., novelty, substance, soundness), which may be valuable for predicting a manuscript's acceptance or rejection; the exact types and number of aspects can vary from one venue (conference or journal) to another. In this work, we study how aspects and their corresponding sentiment can be leveraged to build a generic, controllable system that assists the editor/chair in determining the outcome of a paper based on its reviews and thus in making better editorial decisions. Our proposed deep neural architecture considers three information channels, namely the reviews, the review aspect categories, and their sentiment, to predict the final decision. Experimental results show that our model achieves up to 76.67% accuracy on the ASAP-Review (Aspect-enhanced Peer Review) dataset, consisting of ICLR and NIPS reviews, when the sentiment of the reviews is considered. Empirical results also show an improvement of around 3.3 points when aspect information is added to the sentiment information.

SESSION: Scholarly communication II

Complexities associated with user-generated book reviews in digital libraries: temporal, cultural, and political case studies

While digital libraries (DL) have made large-scale collections of digitized books increasingly available to researchers [31, 67], there remains a dearth of similar data provisions or infrastructure for computational studies of the consumption and reception of books. In the last two decades, user-generated book reviews on social media have opened up unprecedented research possibilities for humanities and social sciences (HSS) scholars interested in book reception. However, limitations and gaps have emerged from existing DH research that utilizes social media data to answer HSS questions. To shed light on the under-investigated features of user-generated book reviews and the challenges they might pose to scholarly research, we conducted three exemplar case studies: (1) a longitudinal analysis profiling the temporal changes in the ratings and popularity of 552 books across ten years; (2) a cross-cultural comparison of the ratings of the same 538 books across two platforms; and (3) a classification experiment on 20,000 sponsored and non-sponsored book reviews. Correspondingly, our research reveals the real-world complexities and under-investigated features of user-generated book reviews in three dimensions: the transience of book ratings and popularity (temporal dimension), cross-cultural differences in reading interests and book reception (cultural dimension), and the user power dynamics behind publicly accessible reviews ("political" dimension). Our case studies also demonstrate the challenges these real-world complexities pose to the scholarly usage of user-generated book reviews and propose solutions to these challenges. We conclude that DL stakeholders and scholars working with user-generated book reviews should look into these under-investigated features and real-world challenges to evaluate and improve the scholarly usability and interpretability of their data.

The significance and impact of winning an academic award: a study of early career academics

Academic awards play an important role in an academic's career, particularly for early career academics. Previous studies have primarily focused on the impact of awards conferred on academics who have made outstanding contributions to a specific research field, such as the Nobel Prize. In contrast, this paper investigates the effect of awards conferred on academics at an earlier career stage, who have the potential to make a great impact in the future. We devise a metric named the Award Change Factor (ACF) to evaluate the change in a recipient's academic behavior after winning an academic award. Next, we propose a model to compare award recipients with academics who had similar performance before the award was won. In summary, we analyze the impact of an award on the recipients' academic impact and their teams from different perspectives. Experimental results show that most recipients do improve in both productivity and citations after winning an academic award, while there is no significant impact on publication quality. In addition, receipt of an academic award not only expands recipients' collaboration networks but also has a positive effect on their team size.

Between acceptance and rejection: challenges for an automatic peer review process

The peer review process is the main academic mechanism for ensuring that science advances and is disseminated. To contribute to this important process, classification models capable of predicting the score of a review text (RSP) and the final decision on a paper (PDP) have been created. But what challenges prevent us from having a fully efficient system responsible for these tasks? And how far are we from having an automated system to take care of them? To answer these questions, we evaluated the overall performance of existing state-of-the-art models for the RSP and PDP tasks, and investigated which types of instances these models tend to have difficulty classifying and how impactful they are. We found, for example, that the performance of a model predicting the final decision on a paper is 23.31% lower when it is exposed to difficult instances, and that the classifiers make mistakes with very high confidence. These and other results lead us to conclude that there are groups of instances that can negatively impact model performance. Thus, current state-of-the-art models have the potential to help editors decide whether to accept or reject a paper; however, we are still far from having a system that is fully responsible for scoring a paper and deciding whether it will be accepted or rejected.

SESSION: Classification

How many sources are needed?: the effects of bibliographic databases on systematic review outcomes

Systematic reviews are an established method of synthesizing the current state of research on a specific question to enable evidence-based decisions in research, politics, and practice. A key activity of a review is a systematic and comprehensive search strategy to find all potentially relevant literature. Although guidelines and handbooks address relevant methodological aspects and recommend strategies, the right choice of databases and information sources remains unclear. Specifically in educational research, an interdisciplinary field with no core database at hand and multiple potentially relevant sources available, investigators lack guidance for choosing the most appropriate ones. The presented study investigates the coverage, in terms of scope, similarity, and combination efficiency, of seven multidisciplinary, discipline-specific, and nationally focused databases. The evaluation is based on the relevant assessed literature of two extensive, recently published reviews in German educational research, which serve as a gold standard for evaluating the databases. Results indicate distinct variations among the databases, while also detecting databases with equal coverage. The paper contributes guidance for choosing databases for educational review studies, while stressing that this choice depends on a review's topical and geographical focus. Moreover, general implications of the study concern the relevance of database choice for review outcomes, the careful consideration of diverse search strategies beyond database search, and a rigorous documentation of database inclusion and exclusion criteria.

Cross-domain multi-task learning for sequential sentence classification in research papers

Sequential sentence classification deals with the categorisation of sentences based on their content and context. Applied to scientific texts, it enables the automatic structuring of research papers and the improvement of academic search engines. However, previous work has not investigated the potential of transfer learning for sentence classification across different scientific domains, nor the issue of the differing text structures of full papers and abstracts. In this paper, we derive seven related research questions and present several contributions to address them. First, we suggest a novel uniform deep learning architecture and multi-task learning for cross-domain sequential sentence classification in scientific texts. Second, we tailor two common transfer learning methods, sequential transfer learning and multi-task learning, to deal with the challenges of the given task. Since semantic relatedness of tasks is a prerequisite for successful transfer learning of neural models, our third contribution is an approach to semi-automatically identify semantically related classes from different annotation schemes, together with an analysis of four annotation schemes. Comprehensive experimental results indicate that models trained on datasets from different scientific domains benefit from one another when using the proposed multi-task learning architecture. We also report comparisons with several state-of-the-art approaches. Our approach significantly outperforms the state of the art on full-paper datasets while being on par for datasets consisting of abstracts.

A library perspective on nearly-unsupervised information extraction workflows in digital libraries

Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), pharmacy, and political science. We report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.


Integration of text and geospatial search for hydrographic datasets using the Lucene search library

We present a hybrid text and geospatial search application for hydrographic datasets built on the open-source Lucene search library. Our goal is to demonstrate that it is possible to build custom GIS applications by integrating existing open-source components and data sources, which contrasts with existing approaches based on monolithic platforms such as ArcGIS and QGIS. Lucene provides rich index structures and search capabilities for free text and geometries; the former has already been integrated and exposed via our group's Anserini and Pyserini IR toolkits. In this work, we extend these toolkits to include geospatial capabilities. Combining knowledge extracted from Wikidata with the HydroSHEDS dataset, our application enables text and geospatial search of rivers worldwide.

SchenQL: a query language for bibliographic data with aggregations and domain-specific functions

Current search interfaces of digital libraries are not suited to directly satisfying complex or convoluted information needs such as "Find authors who only recently started working on a topic". At best, they offer this information only through extensive user interaction.

We present SchenQL, a web interface of a domain-specific query language on bibliographic metadata, which offers information search and exploration by query formulation and navigation in the system. Our system focuses on supporting aggregation of data and providing specialised domain dependent functions while being suitable for domain experts as well as casual users of digital libraries.

DevOps practices in digital library development

In this demonstration, we present how we apply DevOps practices to our digital library project development. We demonstrate a digital library platform: a cloud-native, serverless, microservice-based digital library management system. All the services are open source and publicly available on GitHub. We also share our experiences and lessons learned from adopting the DevOps process.

TermoPL: a tool for extracting and clustering domain related terms

We present a new version of the terminology extraction tool TermoPL. This version not only ranks term candidates but also groups them semantically. To ensure precise results, we use the WordNet lexical database to identify semantic relations between words. The tool was designed primarily for Polish texts, but the current version is tagset-independent and can be adapted to process texts in other languages. The new semantic grouping feature has been fully implemented for Polish texts, and we plan to make it available for English texts as well.

Collaborative annotation and semantic enrichment of 3D media: a FOSS toolchain

A new FOSS (free and open source software) toolchain and associated workflow is being developed in the context of NFDI4Culture, a German consortium of research and cultural heritage institutions working towards a shared infrastructure for research data that meets the needs of 21st-century data creators, maintainers, and end users across the broad spectrum of the digital libraries and archives field and the digital humanities. This short paper and demo present how the integrated toolchain connects: 1) OpenRefine, for data reconciliation and batch upload; 2) Wikibase, for linked open data (LOD) storage; and 3) Kompakkt, for rendering and annotating 3D models. The presentation is aimed at librarians, digital curators, and data managers interested in learning how to manage research datasets containing 3D media, and how to make them available within an open data environment with 3D rendering and collaborative annotation features.

The hurdles of current data citation practices and the adding-value of providing PIDs below study level

This paper discusses the data reuse principle from the end-user standpoint, addressing data citation hurdles, particularly in the Social Sciences. Referencing research data and their inherited detailed entities by Persistent Identifiers (PIDs) supports FAIR data usage. However, in the Social Sciences, PIDs are available only at the study level, not at the level of inline data objects such as survey variables. Since citing research data is the backbone of proper data reuse, this paper proposes an infrastructure to reference specific attributes within datasets, assigning PIDs at the fine-grained level of attributes. By assigning PIDs to these attributes, individual elements of the data files can be referenced and retrieved with the required metadata for both machine-actionable and human access.

BIP! Scholar: a service to facilitate fair researcher assessment

In recent years, assessing the performance of researchers has become a burden due to the extensive volume of the existing research output. As a result, evaluators often end up relying heavily on a selection of performance indicators like the h-index. However, over-reliance on such indicators may result in reinforcing dubious research practices, while overlooking important aspects of a researcher's career, such as their exact role in the production of particular research works or their contribution to other important types of academic or research activities (e.g., production of datasets, peer reviewing). In response, a number of initiatives that attempt to provide guidelines towards fairer research assessment frameworks have been established. In this work, we present BIP! Scholar, a Web-based service that offers researchers the opportunity to set up profiles that summarise their research careers taking into consideration well-established guidelines for fair research assessment, facilitating the work of evaluators who want to be more compliant with the respective practices.

Memento validator: a toolset for memento compliance testing

Web archiving serves the task of knowledge preservation for the ever-changing state of the web. The Memento protocol provides a unified approach to accessing versions of web resources across heterogeneous archives and repositories. The discovery of archived content relies on data providers' correct implementation of the Memento protocol, which extends the HTTP protocol with content negotiation over the dimension of time. Implementation inconsistencies can impede the overall time-based content negotiation process and render the resources unusable. We introduce a novel toolset, an API, and a web interface that allow data providers to test their Memento protocol compliance and improve the time-travel experience for users. We offer all our tools as open source to encourage adoption by the web archiving community.

SESSION: Datasets

S2AMP: a high-coverage dataset of scholarly mentorship inferred from publications

Mentorship is a critical component of academia, but is not as visible as publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, there are few large representative mentorship datasets available. We contribute two datasets to the study of mentorship. The first has over 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually-curated sources, and linked to the Semantic Scholar (S2) knowledge graph. We use this dataset to train an accurate classifier for predicting mentorship relations from bibliographic features, achieving a held-out area under the ROC curve of 0.96. Our second dataset is formed by applying the classifier to the complete co-authorship graph of S2. The result is an inferred graph with 137 million weighted mentorship edges among 24 million nodes. We release this first-of-its-kind dataset to the community to help accelerate the study of scholarly mentorship: https://github.com/allenai/S2AMP-data

A prototype Gutenberg-HathiTrust sentence-level parallel corpus for OCR error analysis: pilot investigations

This exploratory study proposes a prototype sentence-level parallel corpus to support the study of optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019 [21] and GT4HistOCR [23], generally align content by artifact publishing characteristics such as documents or lines, which limits the exploration of OCR noise at natural-language granularities such as sentences and chapters. Building upon an existing volume-aligned corpus that pairs human-proofread texts from Project Gutenberg with OCR views from the HathiTrust Digital Library, we extracted and aligned 167,079 sentences from 189 sampled books in four domains published from 1793 to 1984. To support downstream research on OCR quality, we conducted an analysis of OCR errors with a specific focus on their associations with the source text metadata. We found that the sampled data in agriculture has a higher ratio of real-word errors than other domains, while sentences from social-science volumes contain more non-word errors. Moreover, data sampled from early-age volumes tends to have a high ratio of non-word errors, while samples from recently published volumes are likely to have more real-word errors. Following our findings, we suggest that scholars consider the potential influence of source data characteristics on their findings when studying OCR quality issues.
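To illustrate the non-word vs. real-word error distinction used above, here is a minimal, hypothetical sketch; the tiny `VOCAB` lexicon, the token-level alignment, and the function names are assumptions for illustration, not the study's actual pipeline:

```python
# Hypothetical toy lexicon; a real study would use a full dictionary.
VOCAB = {"the", "cat", "sat", "on", "mat", "hat"}

def classify_ocr_errors(ocr_tokens, gold_tokens, vocab=VOCAB):
    """Compare pre-aligned OCR tokens against ground-truth tokens.
    A mismatch that is still a dictionary word is a real-word error
    (harder to detect); a mismatch outside the lexicon is a non-word error."""
    non_word = real_word = 0
    for ocr, gold in zip(ocr_tokens, gold_tokens):
        if ocr == gold:
            continue  # correctly recognized token
        if ocr.lower() in vocab:
            real_word += 1  # wrong word, but a valid lexical item
        else:
            non_word += 1   # garbled string not in the lexicon
    return {"non_word": non_word, "real_word": real_word}
```

Real-word errors require context to detect (the token itself looks valid), which is one reason their prevalence in recently published volumes matters for downstream text mining.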

HedgePeer: a dataset for uncertainty detection in peer reviews

Uncertainty detection from text is essential in many applications of information retrieval (IR). Detecting textual uncertainty helps extract factual rather than uncertain or non-factual information. To avoid overprecise commitment, people use linguistic devices such as hedges (uncertain words or phrases). In peer reviews, reviewers often use hedges when they are unsure about their opinion or when facts do not back their opinions. The usage of hedges or uncertain words in writing can also indicate the reviewer's confidence, or measure of conviction, in their review. Reviewer confidence is important in the peer review process (especially to editors or chairs) for judging the quality of the evaluation of the paper under review. However, the self-annotated reviewer confidence score is often miscalibrated or biased and is not an accurate representation of the reviewer's conviction in their judgment of the paper's merit; less confident reviewers sometimes speculate in their observations. In this paper, we introduce HedgePeer, a new uncertainty detection dataset of peer review comments, which is more than five times larger than existing hedge detection datasets in other domains. We curate our dataset from the open-access reviews available on the open review platform and annotate the review comments in terms of hedge cues and hedge spans. We also provide several baseline approaches, including a multitask learning model with sentiment intensity and parts-of-speech as scaffold tasks to predict hedge cues and spans. Our dataset is motivated toward computationally estimating the reviewer's conviction from their review texts. We make our dataset and baseline code available at https://github.com/Tirthankar-Ghosal/HedgePeer-Dataset.

SESSION: Doctoral consortium

Opening scholarly documents through text analytics

Vast amounts of scholarly knowledge are buried in electronic theses and dissertations (ETDs). ETDs are valuable documents that have been developed at great cost but largely remain unknown and unused. We aim for digital libraries to open up these long documents using computerized text mining and analytics. We add value to the existing systems by providing chapter-level labels and summaries. This allows readers to easily find chapters of interest. We use ETDs to fine-tune language models like BERT and SciBERT, to help better capture the specialized vocabulary present in such documents.

Thesis plan: the effect of data science teaching for non-STEM students

In recent years, interest in Data Science has increased in both industry and academia. Historically, access to this discipline has been directed toward STEM professionals. However, the ubiquity of cloud computing and the simplicity of modern programming languages such as Python and R have enabled non-STEM students and professionals to leverage it, especially to analyze data. In the same way that computational thinking has equipped non-STEM students with technological competencies, this article presents a proposal for improving the teaching of data science specifically to non-STEM students.

Context-sensitive, personalized search at the point of care

Medical data science opportunities emerging in recent years have enabled the retrieval of similar cases, related treatments, and supportive information. However, current medical domain search engines, such as PubMed, continue to retrieve health documents in a simple similarity-based form, without well-integrated use of the users' situational and contextual features. That is, current medical information systems can neither consider the case context nor provide personalization, and thus require time that practitioners do not have.

Synthesizing digital libraries and digital humanities perspectives for illuminating under-investigated complexities associated with user-generated book reviews

The abundance of user-generated book reviews has opened up unprecedented research opportunities for digital humanities (DH) research. However, there remains a dearth of attention to the complexities and scholarly usability of such web data. Given digital libraries' (DL) expertise in web archives and scholarly datasets, my thesis explores how to synthesize DL and DH to illuminate the under-examined complexities (namely temporal changes, cultural divergence, and user power dynamics) associated with user-generated book reviews and improve the scholarly usability of such emergent data provisions.

Integration of models for linked data in cultural heritage and contributions to the FAIR principles

Incorporating linked data-based models into the process of describing cultural objects is increasingly important for cultural heritage. Communities such as libraries, archives, and museums have developed and adopted models specific to their contexts. Without a trivial solution, choosing models to support more general applications is challenging. This Ph.D. aims to analyze existing solutions and practices in these domains and propose validated solutions for the discovery, access, interoperability, and reuse of cultural objects, following the FAIR principles. Transversal to the base models used, this research intends to adopt solutions that balance the simplicity of the models with the satisfaction of the requirements.


Building digital library collections with Greenstone 3 tutorial

This tutorial is designed for those who want an introduction to building a digital library using an open source software program. The tutorial will focus on the Greenstone digital library software. In particular, participants will work with the Greenstone Librarian Interface, a flexible graphical user interface designed for developing and managing digital library collections. Attendees do not require programming expertise; however, they should be familiar with HTML and the Web, and be aware of representation standards such as Unicode, Dublin Core, and XML.

The Greenstone software has a pedigree of more than two decades, with over 1 million downloads from SourceForge. This tutorial will introduce users to Greenstone 3, a version of the software designed to take better advantage of newer standards and web technologies developed since the previous implementation of Greenstone. Written in Java, the software is more modular in design, increasing its flexibility and extensibility. Emphasis in the tutorial is placed on where Greenstone 3 goes beyond what Greenstone 2 can do.

OpenRefine to Wikibase: a new data upload pipeline

This tutorial aims to help researchers, digital collection librarians and data managers make their datasets available as linked open data. Participants have the option to either bring their own dataset or work with a sample dataset provided by the organizers. They will take part in practical exercises and learn how to use OpenRefine for data transformation and Wikibase for data storage. This pipeline workflow was developed in the context of NFDI4Culture, a German consortium of research and cultural institutions working towards a shared infrastructure for research data that meets the needs of 21st century data creators, maintainers and end users.


JCDL2022 workshop: extraction and evaluation of knowledge entities from scientific documents (EEKE2022)

The 3rd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2022) was held online at the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022. The goal of this workshop series (https://eekeworkshop.github.io/) is to engage the related communities in open problems in the extraction and evaluation of knowledge entities from scientific documents. Participants are encouraged to identify knowledge entities, explore the features of various entities, analyze the relationships between entities, and construct extraction platforms or knowledge bases.

20th European NKOS workshop: networked knowledge organization systems and services

The workshop will explore the potential of Knowledge Organization Systems (KOS), such as classification systems, taxonomies, thesauri, ontologies, and lexical databases, in the context of current developments and possibilities. These tools help model the underlying semantic structure of a domain for purposes of information retrieval, knowledge discovery, language engineering, etc. The workshop provides an opportunity to discuss projects, research and development activities, evaluation approaches, lessons learned, and research findings. The main theme of the workshop is Designing for Cultural Hospitality and Indigenous Knowledge in KOS.

Understanding literature references in academic full text (ULITE)

The goal of the Understanding Literature References in Academic Full Text (ULITE) workshop is to engage communities interested in the broad topic of literature reference understanding and automatic processing of scientific full text publications. The workshop focuses on working with open infrastructures and tools and on offering the extracted information as open data for reuse. Our aim is to expose members of each community to the work of the other and to foster fruitful interaction across communities.

Web archiving and digital libraries (WADL) 2022

The 2022 edition of the Workshop on Web Archiving and Digital Libraries (WADL) will explore the integration of web archiving and digital libraries. The workshop aims to address the entire life cycle of digital resources, including creation/authoring, uploading/publishing, crawling, indexing, exploration, and archiving. It will also explore areas such as archiving processes and tools for "non-traditional" resources such as social media, scholarly and government datasets, 3D objects, and digital online art.

2nd workshop on digital infrastructures for scholarly content objects (DISCO'22)

The goal of the Digital Infrastructures for Scholarly Content Objects (DISCO) workshop is to raise awareness of the quality, discovery, and re-use challenges in digital infrastructures for scholarly content, and to collect potential solutions from an audience of diverse expertise.

E-only theses submission and preservation workshop

This workshop will provide participants with an opportunity to discuss issues relating to the deposit and archiving of e-only theses, and will create a forum for promoting inter-institutional collaboration and knowledge exchange.

The workshop aims to invite discussion in areas such as: challenges, barriers, opportunities, current practice, and lessons learnt from existing efforts; updates needed in workflows, processes, and documentation; elements that digital preservation policy and practice should include to cater for e-only theses submission; methods to capture embargoes, supplementary research data, and non-traditional theses in institutional repository and digital preservation records; and stakeholder requirements that need to be considered to facilitate the submission and preservation of e-only theses.

Workshop proceedings and recommendations deriving from the round-table discussion will be documented in a report that will be shared with participants and the wider community.