Motivated by the Foreword of Licklider's "Libraries of the Future" (dedicated to Vannevar Bush), this keynote focuses on users, exploration, and future directions of the digital library (DL) field, which moves toward procognitive systems. Many different digital library "users," each a member of a Society, engage in a diversity of Scenarios, often involving some aspect of exploration, usually of the DL content Streams. Services -- e.g., searching, browsing, recommending, and visualizing -- help those users leverage knowledge Structures and Spatial representations. Following on the final sentence of Licklider's book, we "call for a formal base plus an overlay of experience," leading to a new way to build better DLs. Licklider said we seek "the facts, concepts, principles, and ideas that lie behind the visible and tangible aspects of documents," to help us acquire and use knowledge. Put simply: "The console of the procognitive system will have two special buttons, a silver one labeled 'Where am I?' and a gold one labeled 'What should I do next?'" How can we build and use this?
For more than 55 years, researchers have applied artificial intelligence (AI), natural language processing (NLP), representations (data, information, knowledge), question answering, databases, human-computer interaction, and other techniques described by Licklider to these challenges. We have a vast range of hardware and software services available, but without a more formal approach, these will not enable adaptive self-organization and tailored exploration.
The 5S framework can help us build, apply, and improve digital libraries to facilitate exploration, through a formal approach that will simplify such efforts, making them extensible through both human and computing agents. For example, to more easily build DLs, we propose collaboratively building knowledge graphs -- involving User eXperience (UX) designers, subject matter experts, and developers -- that specify connections to services and workflows, enabling DL operation atop a workflow engine. User exploration, additional help by UX designers, recommendations of adaptations of existing workflows, and AI-based optimizations and solutions to new problems will all expand the knowledge graph, ensuring new and more helpful assistance.
When this is accomplished, we must teach and learn about this next generation of digital libraries, further developing suitable curricula and educational modules that rest upon a solid theoretical foundation, helping spread understanding of key concepts and best practices.
The digital lifecycle contains definitive processes of data curation, management, long-term preservation, and dissemination of content, all key building blocks in the development of a digital library. It is vital to maintain a complete digital lifecycle workflow for the preservation of digital cultural heritage and digital scholarship. In this talk, I will explore a digital lifecycle program (DLP) for digital libraries. When we begin to build a complete digital lifecycle, we also refer to the print lifecycle. The two are similar, despite differences in format and physical conditions, in content creation (selection & data mining, conversion, and curation), organization (analysis, integration, aggregation, and linking), interpretation (metadata and cataloging), preservation (data storage, duplication, checksum & repair, and migration), and access and publishing (navigation, discovery, and rights management). Only through a systematic and sustainable digital lifecycle program can we build platforms for cross-disciplinary research and repositories of large aggregated digital content.
DLP can be exemplified in specific cultural heritage projects such as the Digital Dunhuang project. Digital Dunhuang enables long-term preservation of cultural heritage of inestimable value, while providing a platform for sharing all digital assets generated in the act of preservation. With the support of The Mellon Foundation, the Dunhuang Academy has been exploring building a permanent repository of all Digital Dunhuang assets. The only way to ensure that information gathered from Dunhuang's Mogao Caves is permanently preserved for future generations is to integrate all the content that has been created in the past, is being created now, and will be created in the future into one large digital repository. This digital repository will facilitate perpetual preservation, effective digital asset management operations, and easy access in a systematic way.
In facilitating the digital lifecycle development, we are ensuring that knowledge and scholarship created in the digital age will be able to survive as print-and-paper scholarship has for centuries. We are also ensuring that the vast number of users of the digital library will have effective access to aggregated content across different domains and platforms. During those transformative changes, librarians will take on shifting roles in data management, digital preservation, rights management, open access, and innovation.
Natural Language Processing (NLP) and related technologies are critical for the success of many Internet applications such as digital libraries, e-commerce and customer service.
This talk presents some recent research efforts and trends in four sets of NLP technologies for Internet applications. First, neural language models have been a very popular research direction in the last few years; they serve as the foundation of many NLP technologies and have significantly improved the performance of many applications. Second, machine translation techniques have been substantially advanced to better bridge the language barriers in many Internet applications. Third, the identification of inappropriate Internet text (e.g., pornographic content) is a challenging research topic due to its diversified text representation. Fourth, machine reading comprehension has become an important question answering technology that directly satisfies the information needs of many Internet users. These technologies will be discussed with examples from large-scale real-world applications.
New researchers are usually very curious about the recipe that could improve the chances of their paper getting accepted at a reputed forum (journal/conference). In search of such a recipe, we investigate the profiles and peer review text of authors whose papers almost always get accepted at a venue (the Journal of High Energy Physics in our current work). We find that authors with a high acceptance rate are likely to have a high number of citations, a high h-index, a higher number of collaborators, etc. We notice that they receive relatively lengthy and positive reviews for their papers. In addition, we construct three networks -- co-reviewer, co-citation, and collaboration -- and study their network-centric features and intra- and inter-category edge interactions. We find that authors with a high acceptance rate are more 'central' in these networks; the volume of intra- and inter-category interactions is also drastically different for these authors compared to the others. Finally, using the above set of features, we train standard machine learning models (random forest, XGBoost) and obtain very high class-wise precision and recall. In a follow-up discussion, we also narrate how, apart from author characteristics, the peer-review system itself might have a role in propelling the distinction among the different categories, which could lead to potential discrimination and unfairness and calls for further investigation by the system admins.
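The paper's full feature set is beyond the scope of an abstract, but the notion of an author being 'central' in a collaboration network can be illustrated with a minimal degree-centrality sketch (the toy graph and author labels below are hypothetical, not data from the study):

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality for each node of an undirected
    collaboration graph given as (u, v) author pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    # Degree divided by the maximum possible degree, n - 1.
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

# Hypothetical collaboration network: author A co-authors with everyone,
# so A is the most 'central' author.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
centrality = degree_centrality(edges)
```

Centrality scores like these, alongside profile features, could then feed a standard classifier such as a random forest.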
While how to model and measure a scholar's research impact (e.g., via citation analysis) has been extensively studied, there have been very few studies that systematically collect and quantify a scholar's service impact on scientific communities. To address this gap, we have developed a prototype digital library, named g, that crawls, extracts, and quantifies scholars' service impacts based on their roles as "gatekeepers" in Computer Science conferences. Continuing this effort, in this work we further theoretically analyze and improve the understanding of the expected behavior of the three quantification measures (i.e., G-indexes) used in g. In addition, we demonstrate that the stretched-exponential model fits significantly better than three other heavy-tail models (i.e., power-law, log-normal, and parabolic-fractal) in capturing scholars' service impacts via the three quantification measures. Finally, using the analyzed quantification measures, we present leading scholars and conferences with respect to their service impacts. Our prototype is available at: https://gatekeeper.ist.psu.edu.
The Journal Impact Factor is a popular metric for determining the quality of a journal in academia. The number of citations received by a journal is a crucial factor in determining its impact factor, which may be misused in multiple ways. Therefore, it is crucial to detect citation anomalies in order to identify manipulation and inflation of the impact factor. A citation network models the citation relationships between journals as a directed graph. Detecting anomalies in the citation network is a challenging task with several applications, including spotting citation cartels and citation stacking and understanding the intentions behind citations.
In this paper, we present a novel approach to detect anomalies in a journal-level scientific citation network and compare the results with existing graph anomaly detection algorithms. Due to the lack of proper ground truth, we introduce a journal-level citation anomaly dataset, which consists of synthetically injected citation anomalies, and use it to evaluate our methodology. Our method is able to predict the anomalous citation pairs with a precision of 100% and an F1-score of 86%. We further categorize the detected anomalies into various types and reason about possible causes. We also analyze our model on the Microsoft Academic Search dataset, a real-world citation dataset, and interpret our results through a case study wherein our results resemble the citations and SCImago Journal Rank (SJR) rating-change charts, indicating the usefulness of our method. We further design the 'Journal Citation Analysis Tool', an interactive web portal which, given a citation network as input, shows journal-level anomalous citation patterns and helps users analyze the citation patterns of a given journal over the years.
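The paper's detection method is not specified at abstract level; as a minimal, hypothetical illustration of flagging anomalous journal-pair citations, a simple z-score test over citation counts captures the basic idea (the journals and counts below are invented):

```python
import statistics

def anomalous_pairs(citation_counts, threshold=2.0):
    """Flag journal pairs whose citation volume deviates from the mean
    by more than `threshold` population standard deviations."""
    counts = list(citation_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts)
    return {pair for pair, c in citation_counts.items()
            if stdev > 0 and abs(c - mean) / stdev > threshold}

# Invented journal-level citation counts: (citing, cited) -> count.
counts = {("J1", "J2"): 10, ("J1", "J3"): 12, ("J2", "J3"): 11,
          ("J3", "J1"): 9, ("J3", "J2"): 10, ("J1", "J4"): 11,
          ("J2", "J1"): 500}  # a suspiciously inflated pair
flagged = anomalous_pairs(counts)
```

Real systems would use graph-structural signals rather than raw counts, but the outlier-flagging principle is the same.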
We propose the Personalization Finder, a search interface that enables users to be aware of and control personalization of a web search. The proposed interface is intended to improve behavioral data privacy and promote critical information seeking. A preliminary user survey indicates that many web users worry that search engines personalize their search results for political topics, which can raise concerns about opinion polarization. However, the survey also indicates that few users believe that search engines personalize search results for such topics. Based on the results of an online user study, we confirmed the following: (1) Our prototype interface resulted in users spending more time looking at the web search results at deeper ranking positions than conventional web search interfaces when querying political topics, and (2) on average, users thought our prototype was useful for objective collection of information.
One of the key questions in studies of Search as Learning is how to represent and measure the intangible and invisible learning processes that occur during search. In this study, participants are presented with two tasks and asked to represent their current relevant knowledge in a mind map. Participants then perform the search tasks and modify the mind maps as they search. In this paper we report on using the vocabulary added to or removed from the mind map as a proxy for knowledge change during the search process. Mind maps represent a user's knowledge as content in a tree structure. We examined the effect of users' prior knowledge on their query behaviors by comparing users' pre-search mind map vocabulary with their query terms, and investigated how the content pages users read during search affected their knowledge change by comparing pre- and post-search mind maps. Our results demonstrate that users' prior knowledge played an important role in their query formulation. More than 50% of query terms came from users' prior knowledge, which accounted for nearly 40% of pre-search mind map vocabulary. As the search process proceeded, users added new vocabulary to their mind maps, and about one third of the new vocabulary was present in, and copied directly from, content pages. This study shows how prior knowledge and search results contribute to users' learning during information searching, and can help us better understand how learning occurs in the information searching process.
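The vocabulary comparisons described above reduce to simple set operations; a minimal sketch, with hypothetical mind-map and query vocabularies, mirrors the style of the reported 50%/40% measurements:

```python
def vocabulary_overlap(prior_vocab, query_terms):
    """Return (fraction of query terms drawn from prior knowledge,
    fraction of prior vocabulary reused in the queries)."""
    prior = {t.lower() for t in prior_vocab}
    queries = {t.lower() for t in query_terms}
    shared = prior & queries
    return len(shared) / len(queries), len(shared) / len(prior)

# Hypothetical pre-search mind map vocabulary and issued query terms.
prior = ["solar", "energy", "panel", "cost", "grid"]
queries = ["solar", "panel", "efficiency", "subsidy"]
from_prior, prior_used = vocabulary_overlap(prior, queries)
```

The same set-difference logic extends to comparing pre- and post-search mind maps against the content pages read.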
It is a common phenomenon that many students study with background music, but the influence of background music on learning is still an open question, with inconclusive findings in the literature. Motivated by this research gap, we conducted a controlled user experiment on reading with 100 students from a comprehensive university. The participants were tasked with reading nine academic passages. Those who were randomly allocated to the experiment group listened to their self-provided music in the background during the reading task, while those in the control group did not have background music while reading. During the experiment, participants' reading logs and self-reported meta-cognition and emotion status were recorded. This paper reports the results of comparing measures of reading performance, meta-cognition, and emotion changes between the two groups. In addition, the relationships between participants' personal traits and their preferred background music types were investigated. Findings indicate that learning with background music of one's own choice can be beneficial for maintaining positive emotion, with no cost to reading performance. By providing empirical evidence on the effect of background music on reading, this study contributes to furthering our understanding of human behaviors in multi-channel learning settings and offers design implications for personalized recommendations in online music services and music digital libraries to facilitate reading and self-learning.
Understanding the topical evolution in industrial innovation is a challenging problem. With the advancement of digital repositories in the form of patent documents, it is becoming increasingly feasible to understand the innovation secrets - 'catchphrases' - of organizations. However, searching and understanding this enormous amount of textual information is a natural bottleneck. In this paper, we propose an unsupervised method for the extraction of catchphrases from the abstracts of patents granted by the U.S. Patent and Trademark Office over the years. Our proposed system achieves substantial improvement, both in terms of precision and recall, over state-of-the-art techniques. As a second objective, we conduct an extensive empirical study to understand the temporal evolution of catchphrases across various organizations. We also show how overall innovation evolution, in the form of the introduction of newer catchphrases in an organization's patents, correlates with the future citations received by the patents filed by that organization. Our code and data sets will be placed in the public domain.
Multivariate relations are common in various types of networks, such as biological networks, social networks, transportation networks, and academic networks. Due to the principle of triadic closure and the trend of group formation, the multivariate relationships in social networks are complex and rich. Therefore, in graph learning tasks on social networks, identifying and utilizing multivariate relationship information is especially important. Existing graph learning methods are based on the neighborhood information diffusion mechanism, which often leads to partial omission or even complete loss of multivariate relationship information, ultimately affecting task accuracy and execution efficiency. To address these challenges, this paper proposes the multivariate relationship aggregation learning (MORE) method, which can effectively capture the multivariate relationship information in the network environment. By aggregating node attribute features and structural features, MORE achieves higher accuracy and faster convergence. We conducted experiments on one citation network and five social networks. The experimental results show that the MORE model achieves higher accuracy than the GCN (Graph Convolutional Network) model in node classification tasks and can significantly reduce time cost.
When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by Uniform Resource Identifier (URI) lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are potentially similar to the requested web page. In this paper, we propose augmenting these binary responses with a model for selecting and ranking recommended web pages in a web archive. This enhances both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know exist. First, we check whether the URI is already classified in DMOZ or Wikipedia. If the requested URI is not found, we use machine learning to classify the URI using DMOZ as our ontology and collect candidate URIs to recommend to the user. The classification is in two parts: a first-level classification and a deep classification. Next, we filter the candidates based on whether they are present in the archive. Finally, we rank the candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity. We calculated the F1 score for different methods of classifying the requested web page at the first level. We found that using all-grams from the URI after removing numerals and the top-level domain (TLD) produced the best result, with F1 = 0.59. For the deep-level classification, we measured the accuracy at each classification level. For second-level classification, the micro-average F1 = 0.30, and for third-level classification, F1 = 0.15. We also found that 44.89% of the correctly classified URIs contained at least one word that exists in a dictionary, and 50.07% of the correctly classified URIs contained long strings in the domain.
In comparison with the URIs from our Wayback access logs, only 5.39% of those URIs contained only words from a dictionary, and 26.74% contained at least one word from a dictionary. These percentages are low and may affect the ability for the requested URI to be correctly classified.
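As a rough illustration of the all-grams feature described above, the following sketch tokenizes a URI's hostname, drops the TLD and numerals, and emits all character n-grams. This is one plausible reading of the method; the helper and its parameters are assumptions, not the authors' implementation:

```python
import re
from urllib.parse import urlparse

def uri_allgrams(uri, min_n=2, max_n=5):
    """Take a URI's hostname, drop the TLD and numerals, then emit
    all character n-grams (min_n..max_n) of the remaining text."""
    host = urlparse(uri).netloc
    labels = host.split(".")[:-1]                 # drop the TLD
    text = re.sub(r"[0-9]", "", "".join(labels))  # remove numerals
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}

grams = uri_allgrams("http://news24.example.com")
```

Such grams would then serve as features for the first-level classifier.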
Reviewing scientific literature is a cumbersome and time-consuming but crucial activity in research. Leveraging a scholarly knowledge graph, we present a methodology and a system for comparing scholarly literature, in particular research contributions describing the addressed problem, utilized materials, employed methods, and yielded results. The system can be used by researchers to quickly get familiar with existing work in a specific research domain (e.g., a concrete research question or hypothesis). Additionally, it can be used to publish literature surveys following the FAIR Data Principles. The methodology to create a research contribution comparison consists of multiple tasks, specifically: (a) finding similar contributions, (b) aligning contribution descriptions, (c) visualizing and finally (d) publishing the comparison. The methodology is implemented within the Open Research Knowledge Graph (ORKG), a scholarly infrastructure that enables researchers to collaboratively describe, find, and compare research contributions. We evaluate the implementation using data extracted from published review articles. The evaluation also addresses the FAIRness of comparisons published with the ORKG.
There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. Toward this end, in this work, we propose a novel semantic data model for modeling the contribution of scientific investigations. Our model, the Research Contribution Model (RCM), includes a schema of pertinent concepts highlighting six core information units, viz. Objective, Method, Activity, Agent, Material, and Result, on which the contribution hinges. It comprises bottom-up design considerations made from three scientific domains, viz. Medicine, Computer Science, and Agriculture, which we highlight as case studies. For its implementation in a knowledge graph application, we introduce the idea of building blocks called Knowledge Graph Cells (KGC), which provide the following characteristics: (1) they limit the expressibility of ontologies to what is relevant in a knowledge graph regarding specific concepts on the theme of research contributions; (2) they are expressible via ABox and TBox expressions; (3) they enforce a certain level of data consistency by ensuring that a uniform modeling scheme is followed through rules and input controls; (4) they organize the knowledge graph into named graphs; (5) they provide information to the front end for displaying the knowledge graph in a human-readable form such as HTML pages; and (6) they can be seamlessly integrated into any existing publishing process that supports form-based input, abstracting its semantic technicalities, including RDF semantification, from the user. Thus, RCM joins the trend of existing work toward the enhanced digitalization of scholarly publishing: enabled by RDF semantification as a knowledge graph, it fosters the evolution of scholarly publications beyond written text.
Citation recommendation systems aim to recommend citations for either a complete paper or a small portion of text called a citation context. The process of recommending citations for citation contexts is called local citation recommendation and is the focus of this paper. Firstly, we develop citation recommendation approaches based on embeddings, topic modeling, and information retrieval techniques. We combine, for the first time to the best of our knowledge, the best-performing algorithms into a semi-genetic hybrid recommender system for citation recommendation. We evaluate the single approaches and the hybrid approach offline based on several data sets, such as the Microsoft Academic Graph (MAG) and the MAG in combination with arXiv and ACL. We further conduct a user study for evaluating our approaches online. Our evaluation results show that a hybrid model containing embedding and information retrieval-based components outperforms its individual components and further algorithms by a large margin.
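The semi-genetic hybridization itself is beyond the scope of the abstract, but the basic idea of fusing component recommenders can be sketched as a weighted score combination (a simplified stand-in, with invented papers, scores, and weights):

```python
def hybrid_scores(rankings, weights):
    """Fuse per-approach relevance scores into one ranking by a
    weighted sum over the component recommenders."""
    combined = {}
    for name, scores in rankings.items():
        w = weights.get(name, 0.0)
        for doc, s in scores.items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)

# Invented component scores for three candidate papers.
ranked = hybrid_scores(
    {"embedding": {"p1": 0.9, "p2": 0.4},
     "ir": {"p2": 0.8, "p3": 0.5}},
    {"embedding": 0.5, "ir": 0.5})
```

A genetic variant would search over the component weights instead of fixing them by hand.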
Many digital libraries recommend literature to their users based on the similarity between a query document and their repository. However, they often fail to distinguish which relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT to be the best performing system, with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of a recommender system based on the evaluated techniques. The discussions in this paper serve as first steps toward the exploration of documents through SPARQL-like queries, such that one could find documents that are similar in one aspect but dissimilar in another.
In this paper, we show how selecting and combining encodings of natural and mathematical language affect the classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and to evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies of up to 82.8% and cluster purities of up to 69.4% (with the number of clusters equal to the number of classes) and 99.9% (with an unspecified number of clusters), respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
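The reported correlation between text similarity and math similarity presupposes a per-document similarity measure under each encoding; a minimal sketch using bag-of-token cosine similarity (with invented text and formula tokens, not the paper's encodings) illustrates computing the two similarities separately:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented documents split into text tokens and formula tokens.
doc1_text = Counter("the spectral radius of a matrix".split())
doc2_text = Counter("the spectral gap of a graph".split())
doc1_math = Counter(["\\rho", "(", "A", ")"])
doc2_math = Counter(["\\lambda", "_", "2", "(", "G", ")"])

text_sim = cosine(doc1_text, doc2_text)  # similarity under the text encoding
math_sim = cosine(doc1_math, doc2_math)  # similarity under the math encoding
```

Computing both similarities over a corpus and correlating them is what motivates treating text and formulae as separate features.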
Hierarchical classification schemes are an effective and natural way to organize large document collections. However, complex schemes make manual classification time-consuming and require domain experts. Current machine learning approaches for hierarchical classification do not exploit all the information contained in the hierarchical schemes. During training, they do not make full use of the inherent parent-child relation of classes. For example, they neglect to tailor document representations, such as embeddings, to each individual hierarchy level. Our model overcomes these problems by addressing hierarchical classification as a sequence generation task. To this end, our neural network transforms a sequence of input words into a sequence of labels, which represents a path through a tree-structured hierarchy scheme. The evaluation uses a patent corpus that comprises millions of documents, exhibits a complex class hierarchy scheme, and has high-quality annotations from domain experts. We re-implemented five models from related work and show that our basic model achieves competitive results in comparison with the best approach. A variation of our model that uses the recent Transformer architecture outperforms the other approaches. The error analysis reveals that the encoder of our model has the strongest influence on its classification performance.
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building---all proceeding concurrently in mutually-reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives.
Web archive data usually contains high-quality documents that are very useful for creating specialized collections, e.g., scientific digital libraries and repositories of technical reports. Consequently, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection from among the huge number of documents collected by web archiving institutions. In this paper, we explore different learning models and feature representations to determine the best performing ones for identifying the documents of interest in web archived data. Specifically, we study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. We focus our evaluation on three datasets that we created from three different web archives. Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
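The contrast between full-text and portion-restricted "bag of words" features can be sketched in a few lines (the tokenization, portion names, and sizes here are illustrative assumptions, not the paper's configuration):

```python
from collections import Counter

def portion_bow(text, portion="full", k=50):
    """Bag-of-words over a chosen portion of a document: the first k
    tokens, the last k tokens, or the full text."""
    tokens = text.lower().split()
    if portion == "first":
        tokens = tokens[:k]
    elif portion == "last":
        tokens = tokens[-k:]
    return Counter(tokens)

# Hypothetical document: the header-like front portion carries
# strong signals (e.g., 'abstract') for identifying reports.
doc = "Technical Report Abstract We study web archives in detail"
front = portion_bow(doc, portion="first", k=4)
full = portion_bow(doc)
```

Feeding `front` rather than `full` to a classifier is the portion-restricted setting that the abstract reports as the stronger approach.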
The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest to reuse these archives as big data sources for statistical and analytical research, the speed to turn these data into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workload. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gain of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement for Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
Domain knowledge map (also known as scholarly network) construction is an important method for describing the significant characteristics of a selected domain. In this research, we address three fundamental problems of scholarly network generation. First, two different methods are investigated to associate keywords in the graph: Co-occur Domain Distance and Citation Probability Distribution Distance. Second, this paper constructs domain (core journals and conference proceedings) knowledge and domain referral (domain citation) scholarly networks, and proposes a novel method to integrate those graphs by optimizing the nodes and their linkage. Finally, the paper proposes an innovative method to evaluate the accuracy and coverage of scholarly networks based on training a keyword-oriented Labeled-LDA model and validating different domain or domain referral graphs.
The objective of our paper is to present an overview of knowledge trade between LIS and other categories. Based on the related citation data of LIS from JCR ranging from 2003 to 2018, we not only focus on LIS's knowledge destinations and knowledge balance, but also propose Citation Peak Lag (CPL) as an emerging indicator to measure the speed of knowledge exchange. We show that LIS is one of the pivots that integrate knowledge from both hard science and soft science, while its ratios of export to import differ considerably across categories. By analyzing the changes in LIS's Citing CPL and Cited CPL over time, we found that the former is generally larger than the latter, and we observed different patterns of CPL in categories related to LIS. Moreover, we also discovered that hard science (e.g., Computer Science) has a lower CPL than soft science (e.g., Business, Management) and that the concrete patterns of knowledge waves vary from field to field. This study fills the research gap in analyzing the bi-directional knowledge flows of LIS and measuring the velocity of knowledge diffusion.
Scientific papers are complex, and understanding their usefulness requires prior knowledge. Peer reviews are comments on a paper provided by designated experts in that field and hold a substantial amount of information, not only for the editors and chairs to make the final decision, but also to judge the potential impact of the paper. In this paper, we propose to use aspect-based sentiment analysis of scientific reviews to extract useful information, which correlates well with the accept/reject decision.
Working with a dataset of close to 8k reviews from ICLR, one of the top conferences in the field of machine learning, we use an active learning framework to build a training dataset for aspect prediction, which is further used to obtain the aspects and sentiments for the entire dataset. We show that the distribution of aspect-based sentiments obtained from a review is significantly different for accepted and rejected papers. Using the aspect sentiments from these reviews, we make an intriguing observation: certain aspects present in a paper and discussed in the review strongly determine the final recommendation. As a second objective, we quantify the extent of disagreement among the reviewers refereeing a paper. We also investigate the extent of disagreement between the reviewers and the chair, and find that inter-reviewer disagreement may be linked to disagreement with the chair. One of the most interesting observations from this study is that reviews in which the reviewer score and the aspect sentiments extracted from the review text are consistent are also more likely to concur with the chair's decision.
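As a simplified illustration of aspect-based sentiment scoring over review text, the sketch below uses tiny hand-made lexicons; the paper's actual aspects and its actively-learned classifier are far richer, so all names and cue words here are hypothetical:

```python
# Hypothetical aspect and polarity lexicons (not the paper's taxonomy).
ASPECTS = {
    "novelty": {"novel", "originality", "new"},
    "clarity": {"writing", "clarity", "presentation"},
    "soundness": {"proofs", "experiments", "evaluation"},
}
POSITIVE = {"strong", "clear", "convincing", "good"}
NEGATIVE = {"weak", "poor", "unconvincing", "missing"}

def aspect_sentiments(review: str) -> dict:
    """Assign each sentence's polarity to every aspect it mentions."""
    scores = {aspect: 0 for aspect in ASPECTS}
    for sentence in review.lower().split("."):
        words = set(sentence.split())
        polarity = len(words & POSITIVE) - len(words & NEGATIVE)
        for aspect, cues in ASPECTS.items():
            if words & cues:
                scores[aspect] += polarity
    return scores

review = "The idea is novel and the experiments are convincing. The writing is poor."
print(aspect_sentiments(review))  # → {'novelty': 1, 'clarity': -1, 'soundness': 1}
```

A per-paper profile of such aspect scores is the kind of signal the study correlates with accept/reject decisions.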
Scientific digital libraries speed dissemination of scientific publications, but also the propagation of invalid or unreliable knowledge. Although many papers with known validity problems are highly cited, no auditing process is currently available to determine whether a citing paper's findings fundamentally depend on invalid or unreliable knowledge. To address this, we introduce a new framework, the keystone framework, designed to identify when and how citing unreliable findings impacts a paper, using argumentation theory and citation context analysis. Through two pilot case studies, we demonstrate how the keystone framework can be applied to knowledge maintenance tasks for digital libraries, including addressing citations of a non-reproducible paper and identifying statements most needing validation in a high-impact paper. We identify roles for librarians, database maintainers, knowledgebase curators, and research software engineers in applying the framework to scientific digital libraries.
We investigate the use of data annotations to increase transparency and data access in social science research. We present preliminary findings from a study that asked experts to review a research paper and judge the validity of the empirical claims made, first without data and then with author-supplied annotations that provide access to the underlying data. We demonstrate that, in domains with an emergent culture of data sharing, reviewer expectations about what material should be shared are met, but this does not necessarily influence trust in an author's claims. Reviewers describe ways that shared materials aid validation tasks, but these materials just as often sharpen their critique and introduce new lines of questioning. We believe that these findings hold important implications for scholarly communications and digital repository development, in particular as these two communities develop policies to guide authors on how to prepare data for review, support reviewers with linked repository systems, and design peer-review processes more generally.
Scientific data publications may include interactive data applications designed by scientists to explore a scientific problem. Defined as knowledge systems, their development is complex when data are aggregated from multiple sources over time. Multimodal data are created, encoded, and maintained differently, and even when reporting on identical phenomena, fields and their values may be inconsistent across datasets. To assure the validity and accuracy of the application, the data must abide by curation requirements similar to those governing digital libraries. We present a novel, inquiry-driven curation approach aimed at optimizing multimodal dataset curation and maximizing data reuse by domain researchers. We demonstrate the method through the ASTRIAGraph project, in which multiple data sources about near-Earth space objects are aggregated into a central knowledge system. The process involves multidisciplinary collaboration, resulting in the design of a data model as the backbone for both data curation and scientific inquiry. We demonstrate a) that data provenance information is needed to assess the uncertainty of the results of scientific inquiries involving multiple data sources, and b) that continuous curation of integrated datasets is facilitated when undertaken as an integral part of the research project. The approach provides flexibility to support the expansion of scientific inquiries and data in the knowledge system, and allows for transparent and explainable results.
Person names are essential and important entities in the Named Entity Recognition (NER) task. Traditional NER models have shown success in recognising well-formed person names in text with consistent and complete syntax, such as news articles. However, user-generated text such as academic homepages, academic resumes, articles in online forums, and social media may contain large amounts of free-form text with incomplete syntax, including person names in various forms. This brings significant challenges for the NER task. In this paper, we address person name recognition in this context by proposing a fine-grained annotation scheme based on anthroponymy, together with a new machine learning model to perform the task of person name recognition. Specifically, our proposed name annotation scheme labels fine-grained name forms, including first, middle, or last names, and whether the name is a full name or an initial. Such fine-grained annotations offer richer training signals for models to learn person name patterns in free-form text. We then propose a Co-guided Neural Network (CogNN) model to take full advantage of the fine-grained annotations. CogNN uses co-attention and gated fusion to co-guide two jointly trained neural networks, each focusing on different dimensions of the name forms. Experiments on academic homepages and news articles demonstrate that our annotation scheme together with the CogNN model significantly outperforms the state of the art.
Calligraphy is an important part of traditional Chinese culture. Authentic calligraphy works are usually preserved on paper, bamboo slips, and stone tablets, which are easily damaged and cannot be readily appreciated by most calligraphy lovers anytime and anywhere. Therefore, to facilitate the preservation and appreciation of calligraphy works, digital and computer technologies are used for digital storage, management, and service.
Image recognition algorithms based on traditional methods have relatively high time complexity, and algorithms based on feature matching in particular have low tolerance to the deformation of Chinese characters. Therefore, this paper uses deep learning to study the image recognition of Chinese calligraphic characters. However, some existing datasets of calligraphic images contain too few images per character to support the training of deep learning networks. To address this problem, a data augmentation method based on character glyph and stroke characteristics is proposed. This method works on two aspects of Chinese characters: glyph and stroke. The glyph is varied through constrained affine transformations, and the stroke information is extracted via superpixels and modified to expand the diversity of Chinese characters in the dataset. In the recognition stage, the mixup data augmentation algorithm is combined with the Inception v4 network, which performs excellently on ImageNet classification tasks, to improve the network's generalization ability. Experiments show that the proposed deep learning algorithm achieves higher recognition accuracy than previous algorithms, especially on cursive and running script.
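The glyph side of such augmentation can be sketched as a constrained affine transform over stroke coordinates; the parameter ranges below are illustrative assumptions, not the paper's settings:

```python
import random

def affine_augment(points, max_shear=0.15, max_scale=0.1, seed=0):
    """Apply a small, constrained affine transform to glyph coordinates,
    producing a plausible variant of the same character.

    The bounds keep the character recognisable: scale stays within
    ±max_scale of 1 and shear within ±max_shear."""
    rng = random.Random(seed)
    sx = 1 + rng.uniform(-max_scale, max_scale)  # horizontal scale
    sy = 1 + rng.uniform(-max_scale, max_scale)  # vertical scale
    shear = rng.uniform(-max_shear, max_shear)   # horizontal shear
    return [(sx * x + shear * y, sy * y) for x, y in points]

# A toy glyph given as three control points of a stroke.
glyph = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
augmented = affine_augment(glyph)
```

Varying the seed yields many distinct but plausible variants of the same character, which is the essence of expanding a small calligraphic dataset.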
A digital calligraphy knowledge service system based on a web front end and a WeChat mini program is designed and implemented, providing digital calligraphy knowledge services for calligraphy work segmentation and calligraphy character recognition.
Sentences about future work (FWS) in academic papers are important: they contain valuable information and can provide researchers with new research topics or directions. At present, researchers' analysis of academic papers mainly focuses on the content of citations, bibliographic information, etc., and little attention is paid to the FWS contained in the full text. This paper constructs a corpus of FWS based on the full-text content of academic papers and analyzes the characteristics and patterns of FWS. Taking 4,024 conference papers in Natural Language Processing (NLP) as the research object, three basic annotation specifications are formulated, and 3,067 sentences about future work are extracted from 4,509 chapters by manual annotation. All the FWS are then manually coded, and the sentences are classified into 6 main categories (method, resources, evaluation, application, problem, and other) and 17 sub-categories. Finally, we analyze the future work of the different types. The results show that sentences mentioning methods account for the highest proportion, with little difference in the number of FWS among the other categories. To the best of our knowledge, this is the first attempt at constructing a corpus of FWS, which will provide the basis for the automatic extraction and classification of FWS and facilitate large-scale research on future work in academic papers. Our own future work includes extending the scale of the FWS corpus and automatically extracting FWS.
Keyphrase extraction models are usually evaluated under different, not directly comparable, experimental setups. As a result, it remains unclear how well proposed models actually perform, and how they compare to each other. In this work, we address this issue by presenting a systematic large-scale analysis of state-of-the-art keyphrase extraction models involving multiple benchmark datasets from various sources and domains. Our main results reveal that state-of-the-art models are in fact still challenged by simple baselines on some datasets. We also present new insights about the impact of using author- or reader-assigned keyphrases as a proxy for gold standard, and give recommendations for strong baselines and reliable benchmark datasets.
Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents. Example graphs extracted from document collections are co-author networks, citation networks, or named-entity co-occurrence networks. Furthermore, social networks can be extracted from email corpora, tweets, or social media. When it comes to visualising these large corpora, either the textual content or the network graph is used.
In this paper, we propose to incorporate both text and graph, to not only visualise the semantic information encoded in the documents' content but also the relationships expressed by the inherent network structure. To this end, we introduce a novel algorithm based on multi-objective optimisation to jointly position embedded documents and graph nodes in a two-dimensional landscape. We illustrate the effectiveness of our approach with real-world datasets and show that we can capture the semantics of large document collections better than other visualisations based on either the content or the network information alone.
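A minimal sketch of the joint-objective idea, assuming a weighted sum of text-similarity attraction and graph-edge attraction plus uniform repulsion (the paper's actual multi-objective optimiser is more sophisticated; all constants here are illustrative):

```python
import random

def joint_layout(n, text_sim, edges, iters=400, lr=0.05, alpha=0.5):
    """Force-directed layout: pairs with high text similarity or a graph
    edge attract; every pair is mildly repelled to avoid collapse.
    alpha weights the text objective against the graph objective."""
    random.seed(0)
    pos = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = (dx * dx + dy * dy) ** 0.5 + 1e-9
                edge = 1.0 if (i, j) in edges or (j, i) in edges else 0.0
                attract = alpha * text_sim[i][j] + (1 - alpha) * edge
                force = attract * d - 0.05 / d  # net pull toward j
                pos[i][0] += lr * force * dx / d
                pos[i][1] += lr * force * dy / d
    return pos

# Documents 0 and 1 share content; 1 and 2 are linked in the graph only.
sim = [[0.0, 0.9, 0.0], [0.9, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos = joint_layout(3, sim, edges={(1, 2)})
```

In the resulting landscape, textually similar documents end up close together while unrelated, unlinked documents are pushed apart.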
The existence of numerous and rich traditional music collections, their importance for preserving cultural heritage, and an increasing interest in this type of music were the key factors leading to the concept of a music research support environment for ethnomusicologists. Our experience with Polish traditional music collections and archives shows that their existence is not equivalent to their availability for search, retrieval, processing, and analysis. The idea behind the environment is to provide the stable infrastructure and software solutions necessary to enable musicological research and, in a wider perspective, to open traditional music resources to a larger group of users. The paper describes our motivation for building such a music research support environment and a number of issues and challenges we have encountered in the process. We present the environment concept, the building stages already completed, and plans for its further development. The environment is founded on the dLibra digital library, adapted to the requirements of traditional music content and collections and with consideration for the current needs of ethnomusicologists. It combines the advantages of a user-centric layered digital library and Linked Open Data enrichment with system-centric music processing tools. A few such tools have already been developed, for example, to support automatic music transcription of a large number of recordings in order to make them available for research and analysis. Future development plans include content aggregation and content-based indexing and search.
State of the Art Neural Language Models (NLMs) such as Word2Vec are becoming increasingly successful for important biomedical tasks such as the literature-based prediction of complex chemical properties or for finding novel drug-disease associations (DDAs). However, NLMs have the disadvantage of being hard to interpret. Therefore, it is notoriously difficult to explain why an artificial neural network learned or predicted some specific association.
Considering that digital libraries offer well-curated contexts, the challenge is to automatically create a reasonable explanation that is intuitively understandable for a user. For a pharmaceutical use case, we present a new method that generates pharmaceutical explanations for predicted DDAs in intuitively understandable sentences. In other words, our approach enables context-aware access to embedded entities. We test the accuracy of our approach with a comprehensive retrospective analysis considering real DDA predictions. Our explanations can automatically determine the association type (drug treats or induces a disease) of a predicted DDA with an accuracy of up to 83%. For existing DDAs, we even achieve accuracies of up to 87%. We show that we outperform deep-learning approaches on this classification task by up to 9%.
Word embeddings enable state-of-the-art NLP workflows in important tasks including semantic similarity matching, NER, question answering, and document classification. Recently, the biomedical field has also started to use word embeddings to provide new access paths for a better understanding of pharmaceutical entities and their relationships, as well as to predict certain chemical properties. The central idea is to gain access to knowledge embedded, but not explicated, in biomedical literature. However, a core challenge is the interpretability of the underlying embedding model. Previous work has attempted to interpret the semantics of dimensions in word embedding models to ease model interpretation when applied to semantic similarity tasks. To do so, the original embedding space is transformed into a sparse or more condensed space, which then has to be interpreted in an exploratory (and hence time-consuming) fashion. However, little has been done to assess in real time whether specific user-provided semantics are actually reflected in the original embedding space. We solve this problem by extracting a semantic subspace from large embedding spaces that better fits the query semantics defined by a user. Our method builds on least-angle regression to properly rank dimensions according to the given semantics, i.e., to uncover a subspace that eases both interpretation and exploration of the embedding space. We compare our methodology to querying the original space as well as to several other recent approaches, and show that our method consistently outperforms all competitors.
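The dimension-ranking step can be approximated in a few lines: the first selection step of least-angle regression picks the dimension most correlated with the response, so ranking dimensions by absolute correlation with a user-given semantic property is a rough stand-in for the described method (the data below are toy values, not real embeddings):

```python
def rank_dimensions(vectors, labels):
    """Rank embedding dimensions by |Pearson correlation| with a
    user-given semantic property -- i.e. the first selection criterion
    of least-angle regression, applied once per dimension."""
    n, dims = len(vectors), len(vectors[0])
    scores = []
    for d in range(dims):
        col = [v[d] for v in vectors]
        mc, ml = sum(col) / n, sum(labels) / n
        cov = sum((c - mc) * (l - ml) for c, l in zip(col, labels))
        sc = sum((c - mc) ** 2 for c in col) ** 0.5
        sl = sum((l - ml) ** 2 for l in labels) ** 0.5
        corr = cov / (sc * sl) if sc and sl else 0.0
        scores.append((abs(corr), d))
    return [d for _, d in sorted(scores, reverse=True)]

# Toy embeddings: dimension 1 tracks the property, dimension 0 is noise.
vecs = [[0.3, 1.0], [0.9, 0.9], [0.1, 0.1], [0.7, 0.0]]
labels = [1, 1, 0, 0]
print(rank_dimensions(vecs, labels))  # → [1, 0]
```

Keeping only the top-ranked dimensions yields a small subspace aligned with the query semantics, which is easier to inspect than the full space.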
Digital libraries benefit from the use of text classification strategies, since these enable many document management tasks such as information retrieval. The effectiveness of such classification strategies depends on the amount of available data and on the classifier used. The former leads to the design of data augmentation solutions, where new samples are generated for small datasets based on the semantic similarity between existing samples and concepts defined within external linguistic resources. The latter relates to finding the best learning principle for designing an effective classification strategy suited to the problem. In this work, we propose a neural architecture designed to address the text classification problem on small datasets. Our architecture is based on BERT equipped with one further layer using the sigmoid function. The hypothesis we want to verify is that, by using embeddings learned by a BERT-based architecture, one can perform effective classification on small datasets without data augmentation strategies. We observed improvements of up to 14% in accuracy and up to 23% in F-score with respect to baseline classifiers exploiting data augmentation.
Digital libraries are online collections of digital objects that can include text, images, audio, or videos in several languages. It has long been observed that named entities (NEs) are key to access in digital library portals, as they are contained in most user queries. However, NEs can have different spellings in each language, which reduces the performance of user queries when retrieving documents across languages. Cross-lingual named entity linking (XEL) connects NEs from documents in a source language to external knowledge bases in another (target) language. The XEL task is especially challenging due to the diversity of NEs across languages and contexts. This paper describes an XEL system applied and evaluated with several language pairs including English and various low-resourced languages of different linguistic families, such as Croatian, Finnish, Estonian, and Slovenian. We tested this approach to analyze documents and NEs in low-resourced languages and link them to the English version of Wikipedia. We present the resulting study of this analysis and the challenges involved in the case of degraded documents from digital libraries. Future work will include an extensive analysis of the impact of our approach on the XEL task with OCRed documents.
The quality of OCR has a direct impact on information access, and an indirect impact on the performance of natural language processing applications, making fine-grained (e.g., semantic) information access even harder. This work proposes a novel post-OCR approach based on a contextual language model and neural machine translation, aiming to improve the quality of OCRed text by detecting and rectifying erroneous tokens. This new technique obtains results comparable to the best-performing approaches on English datasets of the competition on post-OCR text correction in ICDAR 2017/2019.
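For contrast with the neural approach described above, here is a deliberately simple baseline for the same task: lexicon lookup by edit-distance similarity. The vocabulary and cutoff below are illustrative assumptions, not the paper's model:

```python
import difflib

# Hypothetical lexicon; the described approach instead uses a contextual
# language model plus neural machine translation.
VOCAB = ["government", "parliament", "library", "archive", "newspaper"]

def correct_token(token: str, cutoff: float = 0.7) -> str:
    """Replace an OCR-damaged token with its closest lexicon entry,
    keeping the token unchanged when no candidate is similar enough."""
    if token in VOCAB:
        return token
    matches = difflib.get_close_matches(token, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else token

corrected = [correct_token(t) for t in ["govcrnment", "librarv", "zebra"]]
print(corrected)  # → ['government', 'library', 'zebra']
```

Such a baseline catches isolated character errors but, unlike a contextual model, cannot use surrounding words to decide between plausible candidates.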
The ability to understand not only that a piece of research has been cited, but why it has been cited has wide-ranging applications in the areas of research evaluation, in tracking the dissemination of new ideas and in better understanding research impact. There have been several studies that have collated datasets of citations annotated according to type using a class schema. These have favoured annotation by independent annotators and the datasets produced have been fairly small. We argue that authors themselves are in a primary position to answer the question of why something was cited. No previous study has, to our knowledge, undertaken such a large-scale survey of authors to ascertain their own personal reasons for citation. In this work, we introduce a new methodology for annotating citations and a significant new dataset of 11,233 citations annotated by 883 authors. This is the largest dataset of its type compiled to date, the first truly multi-disciplinary dataset and the only dataset annotated by authors. We also demonstrate the scalability of our data collection approach and perform a comparison between this new dataset and those gathered by two previous studies.
Plagiarism detection systems are essential tools for safeguarding academic and educational integrity. However, today's systems require disclosing the full content of the input documents and the document collection to which the input documents are compared. Moreover, the systems are centralized and under the control of individual, typically commercial providers. This situation raises procedural and legal concerns regarding the confidentiality of sensitive data, which can limit or prohibit the use of plagiarism detection services. To eliminate these weaknesses of current systems, we seek to devise a plagiarism detection approach that does not require a centralized provider nor exposing any content as cleartext. This paper presents the initial results of our research. Specifically, we employ Private Set Intersection to devise a content-protecting variant of the citation-based similarity measure Bibliographic Coupling implemented in our plagiarism detection system HyPlag. Our evaluation shows that the content-protecting method achieves the same detection effectiveness as the original method while making common attacks to disclose the protected content practically infeasible. Our future work will extend this successful proof-of-concept by devising plagiarism detection methods that can analyze the entire content of documents without disclosing it as cleartext.
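The content-protecting idea can be sketched with salted hashing as a stand-in for Private Set Intersection (real PSI protocols are cryptographically stronger and do not even require a shared salt; the reference identifiers below are made up):

```python
import hashlib

def blind(references, salt: bytes):
    """Each party hashes its reference list with a shared secret salt,
    so only blinded values ever leave the machine."""
    return {hashlib.sha256(salt + ref.encode()).hexdigest() for ref in references}

# Shared salt agreed out of band -- a simplification of PSI's machinery.
SALT = b"shared-secret"

doc_a = ["smith2019", "jones2020", "lee2018"]
doc_b = ["jones2020", "lee2018", "wang2021"]

# Bibliographic coupling strength = size of the intersection of the
# blinded reference sets; cleartext references are never exchanged.
coupling = len(blind(doc_a, SALT) & blind(doc_b, SALT))
print(coupling)  # → 2 shared references
```

Because only the intersection size is needed for the similarity score, neither party learns which references the other document contains beyond the shared ones.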
Literature-Based Discovery (LBD) refers to the process of detecting implicit, novel knowledge linkages hidden in scientific digital libraries, and its contribution to accelerating research innovation is widely recognised. Despite significant advances, almost all prior research efforts suffer from a major deficiency: lack of portability. That is, existing LBD models are highly dependent on specialised, domain-dependent knowledge resources that restrict their applicability to limited problem areas or domains. However, LBD, the process of discovering new knowledge from unstructured text that potentially leads to novel research innovations, is crucial regardless of the domain. Thus, this study proposes an interdisciplinary LBD framework that circumvents the existing impediments in the LBD workflow, promoting portable scientific problem solving. To this end, we exploit the opportunities offered by the Semantic Web, employing DBpedia for the first time in the LBD workflow. The suitability of our proposals for overcoming the prevailing limitations is evaluated by comparing them with commonly used domain-specific resources.
This article presents the results of an investigation into how safe, from a cyber-security standpoint, our Open Source Digital Library (DL) systems are. The fact that these systems use open source software presents particular challenges in terms of securely running a web-based digital repository, as a malicious user has the added advantage that they can study the source code to the system to establish new vectors of attack, in addition to the many well documented black-box forms of web hacking. To scope the work reported we focused on two widely used digital library systems: DSpace and Greenstone, undertaking both Static Application Security Testing (SAST) and Dynamic Application Security Testing (DAST), in addition to more traditional port scans. We summarize the deficiencies found and detail how to make improvements to both systems to make them more secure. We conclude by reflecting more broadly on the forms of security concerns found, to help inform future development of DL software architectures.
To understand and properly use scientific data, it is important that the metadata, which describe information related to the data, contain sufficient information. Keywords are one such metadata item; for earth science datasets, it is common to select and assign appropriate keywords from a controlled vocabulary with a hierarchical structure. Keyword information plays an important role in dataset search and classification; however, the cost of selecting appropriate keywords from a controlled vocabulary is high, and in many cases a sufficient number of keywords is not actually assigned. In this study, we focus on keyword recommendation for earth science datasets using the definition sentences given for the keywords in the controlled vocabulary, and propose content-based recommendation methods that consider the hierarchical structure of the controlled vocabulary.
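A content-based recommender of this kind can be sketched by matching the dataset description against each keyword's definition merged with its ancestors' definitions; the toy vocabulary below is a hypothetical fragment, not an actual controlled vocabulary:

```python
from collections import Counter

# Hypothetical controlled vocabulary: keyword -> (parent, definition).
VOCAB = {
    "EARTH SCIENCE": (None, "science of the planet earth"),
    "ATMOSPHERE": ("EARTH SCIENCE", "gaseous envelope surrounding the earth"),
    "PRECIPITATION": ("ATMOSPHERE", "rain snow and hail falling to the ground"),
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def recommend(description: str, top_k: int = 1):
    """Score each keyword by similarity between the dataset description
    and the keyword's definition merged with its ancestors' definitions,
    so the hierarchy contributes context to every leaf."""
    query = Counter(description.lower().split())
    scored = []
    for kw in VOCAB:
        terms, node = Counter(), kw
        while node is not None:  # walk up the hierarchy
            parent, definition = VOCAB[node]
            terms.update(definition.split())
            node = parent
        scored.append((cosine(query, terms), kw))
    return [kw for _, kw in sorted(scored, reverse=True)[:top_k]]

print(recommend("hourly rain and snow measurements over the ground"))
```

Merging ancestor definitions lets a sparse leaf definition inherit vocabulary from its broader categories, which is one way to exploit the hierarchy the abstract mentions.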
This preliminary study explored multiple information sources for music recommendation system (MRS), including users' personality traits measured by the Ten-Item Personality Inventory (TIPI) and physiological signals recorded by a wearable wristband. A dataset of 23 participants and 628 song listening records were obtained from a user experiment, with matched personality, physiological signals as well as music acoustic features. Based on the dataset, a machine learning experiment with four regression algorithms was conducted to compare recommendation performances across different combinations of feature sets. Results show that personality features contributed significantly to the improvement of recommender accuracy, while physiological features contributed less. Analysis of top features in the best performing model revealed the importance of some physiological features. Future studies are called for to further investigate multimodal MRS through exploiting user properties and context data.
The study of reading has a long history in the digital library community, but one issue that has been largely ignored is gender. Gender is known to play a significant role in the acquisition, reading, and use of print material. However, it is unknown to what degree the influence of reading norms carries over into digital reading. In this paper we examine the differences in readership of a variety of magazines between their print and electronic editions. The results reveal that digital reading is, in general, less gender-conforming than print reading. However, it also appears that consumption of digital editions on mobile phones reverts towards the gender stereotypes found in print. Together, these data serve to demonstrate that digital library services, including search engines, should consider the risk of reinforcing the gender stereotypes that occur when reading is a public performance, and of entrenching those biases when reading is done privately.
The affordances of virtual reality (VR) have made it widely adopted for presenting cultural heritage content in digital libraries. In recent years, non-specialists including university students have been involved in creating low-end VR content using low-cost equipment and software. Among the various design options, the question of which are more effective in presenting cultural heritage remains open. This study aims to evaluate and compare the effectiveness of user-created VR content of cultural heritage with different designs, by collecting and analyzing self-report and eye movement data from end users. Results show that the presence of text annotations in the VR content helped users understand the cultural heritage being presented, whereas users' visual attention was largely drawn to the text annotations and additional images when the VR content contained such visual information. This preliminary study also explores the feasibility of using the eye-tracking method to analyze user interactions with VR content of cultural heritage. The results provide empirical evidence on the effects of different designs of user-created VR content on end users' understanding of cultural heritage.
Literature-Based Discovery (LBD), a sub-discipline of text mining, aims to detect meaningful implicit knowledge linkages in digital libraries that have the potential to generate novel research hypotheses. The input can be considered one of the most critical components of the LBD process, as the entire knowledge discovery is solely dependent on the content and quality of the input. However, there is no uniform selection of the input, since different LBD studies have picked different input types (e.g., titles, abstracts, keywords). This emphasises the need to assess the information richness of inputs in order to decide the most suitable input type for the LBD workflow. Therefore, this study presents a large-scale assessment of the information richness of different variants of popular LBD input types. Our observations are consistent with all five golden test cases in the discipline.
In recent years, Digital Humanities (DH) tools have played an increasingly important role in humanities research and education. However, there is a lack of a well-developed classification system to organize these tools, which makes it difficult for humanities scholars to find suitable digital support at various stages of research work. An effective classification system for DH tools should be designed according to the process of humanities research, thus helping humanities scholars locate suitable tools accordingly. In this study, we interviewed 20 humanities scholars and analyzed 60 widely used DH tools. We derived 9 research tasks (processing, exploration, collection, collation, analysis, interpretation, presentation, reading, and communication) and 4 research stages of the humanities research process from the interviews, and 4 categories of research techniques (text analysis, network analysis, geospatial analysis, and temporal analysis) from analyzing the DH tools. On this basis, a two-dimensional classification system was developed from the humanities research process and the applied research techniques. The collected DH tools were then organized into this classification system, and a navigation website was developed to enable humanities scholars to find digital humanities resources effectively and efficiently.
Literature search and recommendation systems have traditionally focused on improving recommendation accuracy through new algorithmic approaches. Less research has focused on the crucial task of visualizing the retrieved results for the user. Today, the most common visualization for literature search and recommendation systems remains the ranked list. However, this format exhibits several shortcomings, especially for academic literature. We present an alternative visual interface for exploring the results of an academic literature retrieval system using a force-directed graph layout. The interactive information visualization techniques we describe allow for a higher-resolution search and discovery space tailored to the unique feature-based similarity present among academic literature. RecVis, the visual interface we propose, supports academics in exploring the scientific literature beyond textual similarity alone, since it enables the rapid identification of other forms of similarity, including the similarity of citations, figures, and mathematical expressions.
Each year the number of Open Access (OA) papers gradually increases. We carried out a study investigating 400 universities from 8 countries to examine: i) the total number of OA papers per country, ii) the proportion of OA papers published by representative universities in each country, classified into three tiers of research quality (high, middle, and low), iii) how universities within the same country compare to each other, and iv) the growth of OA papers per country per year. We conclude that among the analysed countries the UK and USA rank first and second respectively, while Russia and India are positioned towards the bottom of the list. We observe no link between the proportion of OA papers published by authors at a university and the university's ranking, with some universities in the middle tier having a larger proportion of OA papers than those in the high tier.
Recent advances in the area of legal information systems have led to a variety of applications that promise support in processing and accessing legal documents. Unfortunately, these applications have various limitations, e.g., regarding scope or extensibility. Furthermore, we do not observe a trend towards open access in digital libraries in the legal domain as we observe in other domains, e.g., economics or computer science. To improve open access in the legal domain, we present our approach for an open source platform to transparently process and access Legal Open Data. This enables the sustainable development of legal applications by offering a single technology stack. Moreover, the approach facilitates the development and deployment of new technologies. As proof of concept, we implemented six technologies and generated metadata for more than 250,000 German laws and court decisions. Thus, we can provide users of our platform not only with access to legal documents but also with the information they contain.
Traditional media outlets are known to report political news in a biased way, potentially affecting the political beliefs of the audience and even altering their voting behaviors. Many researchers focus on automatically detecting and identifying media bias in the news, but only very few studies exist that systematically analyze how these biases can best be visualized and communicated. We create three manually annotated datasets and test varying visualization strategies. The results show no strong effect on bias awareness in the treatment groups compared to the control group, although a visualization of hand-annotated bias communicated bias instances more effectively than a framing visualization. Showing participants an overview page, which opposes different viewpoints on the same topic, does not yield differences in respondents' bias perception. Using a multilevel model, we find that perceived journalist bias is significantly related to perceived political extremeness and impartiality of the article.
We present our work on creating a virtual reality personal library environment to enable people with severe visual disabilities to engage in reading tasks. The environment acts as a personal study or library for an individual who, under other circumstances, would not be able to access or use a public library or a physical study at home. We present tests undertaken to identify the requirements and needs of our users, which informed the design of the environment, and finally present the working prototype.
Named entity recognition has been extensively studied in the past decade. The state-of-the-art models, trained on general text such as Wikipedia articles and newsletters, have achieved F_1 > 0.90. Entity types are focused on people, locations, organizations, etc. However, entity recognition from domain-specific text, in particular research papers, is still challenging. In this paper, we perform a comparative study of sequence tagging (ST) methods on this task using a manually curated corpus from biomedical papers on Lyme disease. Each model we compare consists of an ST and a non-ST classification component. In this pilot study, we freeze the non-ST classifier to study how the ST component performs with variants of the conditional random field (CRF) and bidirectional long short-term memory (BiLSTM). The results shed light on the importance of pre-trained word embeddings such as ELMo and the residual unit. The attention mechanism and enriched features do not seem to boost the performance in recognizing entity mentions and their positions, which is likely caused by the relatively small training sample. We plan to improve the model by increasing the training corpus size and trying different combinations of features.
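The decoding step shared by CRF-based taggers is the Viterbi algorithm. The sketch below uses toy, hand-set probabilities; the states, words, and numbers are illustrative and not drawn from the Lyme-disease corpus:

```python
def viterbi(obs, states, start, trans, emit):
    """Viterbi decoding for a linear-chain model: find the most
    probable state sequence given per-state start, transition, and
    emission probabilities."""
    # V[t][s] = probability of the best path ending in state s at step t
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []                     # backpointers for path recovery
    for o in obs[1:]:
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[-2][p] * trans[p][s])
            back[-1][s] = prev
            V[-1][s] = V[-2][prev] * trans[prev][s] * emit[s][o]
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for b in reversed(back):      # walk the backpointers
        path.append(b[path[-1]])
    return list(reversed(path))
```

A trained CRF supplies learned (log-)scores in place of these hand-set probabilities, but the decoding loop is the same.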
How hard is it to systematically identify and disambiguate place names in scientific text? To address this question, we applied MapAffil, a toponymic search interface, to a random sample of 500 place name sentences from PubMed abstracts.
The algorithm correctly identified and disambiguated 39.2% of the place names in sentences. An error analysis revealed six unique challenges: Biological terms (14.2%), Method terms (11.6%), Acronyms (10%), References (6%), Other entity names (4.2%), and Other errors (2.2%). Interestingly, a large portion of the correctly identified place names appeared irrelevant to the subject matter.
Many of these errors can be fixed easily, but irrelevance is much harder to address, for it depends on semantics and purpose. To study the role of place in scientific text, it is not sufficient to disambiguate accurately, but it is also necessary to be able to assess the degree of relevance.
This paper investigates the limitations and challenges of the curated datasets provided by digital libraries in support of digital humanities (DH) research. Our presented work provides a use case utilizing an English literature dataset of 178,381 volumes curated by the HathiTrust Research Center (HTRC) for measuring change across three literature genres. These volumes were selected from over 17 million digitized items in the HathiTrust Digital Library. We demonstrate our methods and workflow for improving the representativeness and scholarly usability of the existing datasets. We analyzed and effectively overcame three common limitations: duplicate volumes, uneven distribution of data, and OCR errors. We suggest that stakeholders of digital libraries should flag and address these limitations to improve the usability of the datasets they provide in the context of digital humanities research.
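Duplicate volumes, the first limitation mentioned, can be caught with a metadata fingerprint. A minimal sketch follows; the normalization rules are illustrative, not HTRC's actual workflow:

```python
import re
from collections import defaultdict

def volume_fingerprint(title, author):
    """Normalize metadata so that reprints and lightly OCR-damaged
    duplicates of the same work collide on one key."""
    norm = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
    # keep only the first few title tokens to survive subtitle variation
    head = " ".join(norm(title).split()[:6])
    return (head, norm(author))

def deduplicate(volumes):
    """Group volumes by fingerprint and keep one representative each."""
    groups = defaultdict(list)
    for v in volumes:
        groups[volume_fingerprint(v["title"], v["author"])].append(v)
    return [g[0] for g in groups.values()]
```

Real pipelines would add year ranges, edition statements, and fuzzier string matching, but the key idea is the same: collapse records that normalize to one fingerprint.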
Searching on the Web is not only a matter of finding answers to a specific question. In fact, searching is closely related to learning, and users intend to increase their knowledge about new topics. To better understand users' knowledge construction process in exploratory search, we conducted a qualitative user study with 25 participants from a variety of domains. We collected data from audio and video recordings of think-aloud sessions and semi-structured interviews. The preliminary findings reveal a number of core processes, such as knowledge nodes, knowledge communities, and knowledge networks, each of which comprises several activities. Based on these findings, a conceptual model of the knowledge construction process in exploratory search is presented.
Online learning platforms that aim to improve reading interests and proficiency of young readers, particularly students in elementary schools, rarely have automated personalized recommendation services. This study attempts to bridge this gap by developing and evaluating two book recommenders that are integrated into an online learning platform for young readers. A preliminary user experiment was conducted to measure the effectiveness and usability of the recommender prototypes. Results of think-aloud usability testing, post-test questionnaires, and a semi-structured interview verified the feasibility of adding these book recommenders to improve personalization of the online learning platform. Further improvements of the recommenders were also suggested. The user evaluation framework provides a reference for future studies on personalized learning material recommendation.
Music information encountering within the mobile Internet environment mainly occurs in hedonic-oriented situations, which is quite different from the information encountering in utilitarian-oriented situations examined in prior work. To explore the main factors influencing users' music information encountering experiences, we recruited 30 participants and conducted semi-structured interviews about their information encountering while interacting with mobile music applications. The qualitative data were analyzed through three rounds of coding, and a model was proposed to identify the relevant influencing factors. The findings show that five main categories of factors can trigger users' music information encountering experiences, from both user-related and application-related perspectives.
This comparative study examines the effects of the subjective and objective difficulty of search tasks on users' search behaviors. Data regarding users' opinions about task difficulty were obtained via a post-search questionnaire using a 5-point Likert scale. When measuring subjective difficulty, tasks with ratings above 3 were considered difficult; when measuring objective difficulty, tasks rated higher than the average difficulty score were considered difficult. The study's findings indicate that it is better to develop task difficulty prediction models based on subjective difficulty because these models are more stable. Models based on objective difficulty could not match the performance of models based on subjective difficulty. The findings shed light on issues related to experiment design that will be valuable for future research.
Research collaborations, especially long-distance collaborations, are increasingly prevalent all over the world. Research leadership plays a key role in such collaborations. However, the collaboration relationship is homogeneously assigned to all authors of the same article, ignoring the dominance/leadership within the collaboration. Furthermore, in many cases, when evaluating research actors' performance, spatial features, which have been found to be important to academic influence, are ignored. This paper aims to fill the gap by constructing a weighted and directed spatial research leadership network and proposing SRLRank based on this network. A systematic analysis of the spatial distribution and dynamic patterns of research leadership flows is performed. Thorough evaluations are conducted with comparisons to a set of traditional indices. The results indicate the superior performance of the proposed SRLRank. Finally, comprehensive implications and further applications of SRLRank are discussed.
The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories like BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields like computer science. Explicit categories could be helpful for more granular, that is, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to construct supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-bio domains. In this paper, we address this problem using transfer learning. We define three discourse categories for an abstract -- BACKGROUND, TECHNIQUE, and OBSERVATION -- because these are the most common. We train a deep neural network on structured abstracts from PubMed and then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus of computer science papers. We also perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution for the automatic segmentation of abstracts where labeled data is sparse.
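The paper fine-tunes a deep network; for contrast, even a bag-of-words baseline can separate the three categories when cue words differ. Below is a toy multinomial naive Bayes sketch with hypothetical training sentences -- not the authors' model, features, or data:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentences:
    """Tiny multinomial naive Bayes over word counts with Laplace
    smoothing -- a baseline sketch, not the transfer-learning model."""
    def fit(self, sentences, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for s, y in zip(sentences, labels):
            self.word_counts[y].update(s.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        def log_prob(y):
            total = sum(self.word_counts[y].values())
            lp = math.log(self.label_counts[y])
            for w in sentence.lower().split():
                # add-one smoothing over the shared vocabulary
                lp += math.log((self.word_counts[y][w] + 1) /
                               (total + len(self.vocab)))
            return lp
        return max(self.label_counts, key=log_prob)
```

The gap between such a baseline and the fine-tuned network is what the paper's 75% accuracy and ablation study quantify.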
One of the most time-consuming tasks that researchers usually undertake is finding existing, relevant papers to study and cite in their articles. Manual effort that involves searching for relevant papers using keywords is not only time-consuming but also yields low recall. To mitigate these issues, many automatic citation recommendation methods have been proposed that find possible citations by representing the citation graph as a matrix and extracting features to predict citations relevant to the input article. A majority of these methods, however, are proximity-based and lack global knowledge of the entire citation graph. In this paper, we present a preliminary investigation of a novel approach to recommending citations via knowledge graph embedding. Specifically, we propose ConvCN, an extension of the ConvKB algorithm designed for citation knowledge graph embedding. We evaluate our approach against state-of-the-art baselines on the WN18RR dataset and citation datasets. The empirical results, using the link prediction protocol, show that the proposed method outperforms all baseline methods on all datasets.
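ConvCN itself is convolutional, but the underlying idea of knowledge graph embedding can be illustrated with a stripped-down TransE-style trainer that fits head + relation ≈ tail. This sketch omits the negative sampling and margin loss of real TransE, and all triples below are hypothetical:

```python
import random

def train_embeddings(triples, entities, relations, dim=8,
                     epochs=300, lr=0.05, seed=0):
    """Fit embeddings so that head + relation ≈ tail for every
    observed triple (positive examples only, for brevity)."""
    rng = random.Random(seed)
    emb = {x: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
           for x in list(entities) + list(relations)}
    for _ in range(epochs):
        for h, r, t in triples:
            for i in range(dim):
                d = emb[h][i] + emb[r][i] - emb[t][i]  # residual
                emb[h][i] -= lr * d
                emb[r][i] -= lr * d
                emb[t][i] += lr * d
    return emb

def score(emb, h, r, t):
    """Lower = more plausible: squared L2 norm of h + r - t."""
    return sum((emb[h][i] + emb[r][i] - emb[t][i]) ** 2
               for i in range(len(emb[h])))
```

Link prediction then ranks candidate tails by this score; ConvCN replaces the additive scoring function with a learned convolutional one.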
Uncertainty in text is an important linguistic phenomenon that is relevant in many areas of natural language processing. In this paper, we present a neural approach to detecting uncertainty cues in text. We explored a series of neural network architectures and evaluated the models on three data sources from domains such as biomedical texts, privacy policies, and product reviews. Our preliminary analysis showed that relation-aware attention models outperform the existing baseline systems across all domains. We also observed that, for domain-specific texts, incorporating character-level embeddings significantly improves performance.
Semantic frameworks build foundations for digital libraries and repositories to enable structured data and information representation and interoperability in today's interlinked information systems. Conceptual modeling and ontological schemas provide effective communication and powerful tools for creating shared understanding and sustainable systems in various digital libraries. This panel will present cases in which conceptual modeling and ontologies are used to enrich content representation and reach consensus among communities of practice, especially in a fast-changing digital society and emerging application domains. Four experts in knowledge organization will first give brief introductions to their research in conceptual modeling and ontology building and then engage the audience in question-and-answer interactions.
Data and information literacy (DIL) is a set of critical skills for every citizen in the twenty-first century. It is important to help both college students and other individuals become skillful data/information users, creators, and lifelong learners. As DIL has a profound impact on education, employment, and quality of life in today's data-intensive and information-rich environment, it is essential to make everyone aware of the importance of DIL and to deliver effective DIL education. Currently, DIL education confronts multiple challenges, including insufficient instruction, heterogeneous student groups, inadequate faculty, and outdated instructional materials. To overcome these challenges, effective generic and specific education methods and models need to be proposed and applied. The goal of this panel is to further discuss these challenges, methods, and models of DIL education. Innovative development strategies are put forward, and the importance of enhancing educators' teaching skills is also emphasized. This panel will include two main activities. First, the panelists will give presentations about recent efforts to improve DIL education. Second, the moderator will host a discussion among the panelists and the audience, drawing on the preceding presentations.
This study investigates the types of tactics that blind and sighted users applied in their initial exploration of a digital library (DL). Sixty participants, 30 blind and 30 sighted novice users, were recruited for the study. Multiple data collection methods were employed: questionnaires, think-aloud protocols, and transaction logs. The findings show that sighted participants focused on browsing DL content, whereas blind participants concentrated on browsing DL structure. As the first study of its kind, it has both theoretical and practical implications.
This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as coarse-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to further improve the accessibility and discoverability of mathematical knowledge in Wikimedia projects.
User experience is an important indicator of mobile library service quality. Current users hope to enjoy good sensory and functional experiences, as well as pleasant and satisfying emotional experiences, from the service. Based on the Pleasure-Arousal-Dominance (PAD) Emotional State Model and the Five Factor Model (FFM) of personality, this research constructs a model to measure mobile library users' emotional experiences. Fifty college students from Wuhan University were randomly selected for our experiment, through which the validity of our measurement model was verified. These students' emotional changes in response to the mobile library's service interface, service functions, and service environment were measured. We found that the library's mobile service has good usability, but service satisfaction and ease of use need to be improved. Based on the results of the data analysis, we provide some suggestions for optimizing the quality of the mobile library service.
Detecting dark businesses in a decentralized eCommerce ecosystem (e.g., eBay, eBid, and Taobao) is a critical research problem. In this paper, we investigate the characteristics of dark implicit products, the associated buyer seeking behaviors, and the features of the classification model. Results demonstrate that dark implicit product detection is a challenging problem, while buyer seeking behavior information could be useful as a critical complementary signal for addressing it.
To explore knowledge graphs of key research technologies related to Library and Information Science (LIS), articles published between 1998 and 2020 were collected from the Web of Science. Using the visualization software CiteSpace, the pivotal literature related to big data in the field of LIS, as well as the associated countries, institutions, and keywords, were visualized and recognized. The results show that the research hot spots in this field mainly include: the influences and challenges that big data brings to LIS, big data analysis technology, and data management and user privacy.
News is a central source of information for individuals to inform themselves on current topics. Knowing a news article's slant and authenticity is of crucial importance in times of "fake news," news bots, and centralization of media ownership. We introduce Newsalyze, a bias-aware news reader focusing on a subtle, yet powerful form of media bias, named bias by word choice and labeling (WCL). WCL bias can alter the assessment of entities reported in the news, e.g., "freedom fighters" vs. "terrorists." At the core of the analysis is a neural model that uses a news-adapted BERT language model to determine target-dependent sentiment, a high-level effect of WCL bias. While the analysis currently focuses on only this form of bias, the visualizations already reveal patterns of bias when contrasting articles (overview) and in-text instances of bias (article view).
Dataset exploration is a set of techniques crucial in many research and data science projects. For textual datasets, commonly used techniques include topic modeling, document summarization, and methods related to dimension reduction. Despite their robustness, these techniques suffer from at least one of the following drawbacks: document summarization does not explicitly set documents in relation, while the others yield summaries or topics that are often difficult to interpret and perform poorly on topics consisting of context-dependent terms. We propose a method for dataset exploration that employs cross-document near-identity resolution of mentions of semantic concepts, such as persons and other named entities, events, and actions. The method not only sets documents in relation, allowing for comparative dataset exploration, but also yields well-interpretable document representations. Additionally, due to the underlying approach for cross-document resolution of concept mentions, the method is able to relate documents through their near-identity terms, e.g., synonyms that are not universally valid but hold only in the given dataset.
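A greatly simplified flavor of grouping near-identical mentions can be conveyed by greedy lexical clustering. This is a sketch only -- the actual method resolves semantic concepts rather than mere token overlap, and all mentions below are invented:

```python
def token_overlap(a, b):
    """Jaccard overlap between two mentions' token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def group_mentions(mentions, threshold=0.3):
    """Greedy single-link grouping of concept mentions: a mention
    joins the first group containing a sufficiently similar member."""
    groups = []
    for m in mentions:
        for g in groups:
            if any(token_overlap(m, other) >= threshold for other in g):
                g.append(m)
                break
        else:
            groups.append([m])
    return groups
```

Cross-document resolution replaces the lexical similarity here with semantic similarity, which is what lets it link true near-identity mentions such as dataset-specific synonyms.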
The new pandemic caused by the COVID-19 virus was the crucial global event at the beginning of 2020. Studies on coronaviruses have, however, been carried out for several decades, with recent research papers published on a weekly basis. We demonstrate a simple approach to exploring the CORD-19 dataset that provides a high-level overview of important semantic changes occurring over time. Our method aims to support better understanding of large domain-specific collections of scholarly publications that span long time periods and can be regarded as complementary to frequency-based analysis.
Topic modeling is a technique used in a broad spectrum of use cases, such as data exploration, summarization, and classification. Despite being a crucial constituent of many use cases, established topic models, such as LDA, often produce statistically valid yet non-meaningful topics, i.e., topics that cannot easily be interpreted by humans. In turn, the usability of topic modeling approaches, e.g., in document summarization, is non-optimal. We propose a topic modeling approach that uses TCA, a method for cross-document coreference resolution that also handles near-identity mentions. TCA showed promising results when resolving mentions of not only persons and other named entities but also broad, vague, or abstract concepts. In a preliminary evaluation on news articles, we compare the approach with state-of-the-art topic modeling. We find that (1) the four baselines produce statistically valid yet hollow topics, or topics that refer only to events in the dataset but not to the events' topical composition, and (2) TCA is the only approach that extracts topics that distinctively describe meaningful parts of the dataset.
This paper explores the relationships between nations and organizations from the perspective of the British Parliament. A co-occurrence network of countries was constructed to detect the characteristics of, and interaction relationships among, these countries. The network's evolutionary trajectory was also mapped to elucidate its continuous development. Results show that the analysis methods proposed in this paper, and their application to British parliamentary debates, can foster a deep understanding of the status and development of international relations among countries (or organizations).
A taxonomy captures the human understanding and organization of domain knowledge. Automatically evolving taxonomies becomes critical in this era of knowledge explosion. In this paper, we introduce a model that automatically updates a taxonomy in a semi-supervised way. The goal is to train a graph neural network model that can efficiently classify the edge between a newly added term and an existing term into one of three types: true hyponym-hypernym relation, transductive hyponym-hypernym relation, and false hyponym-hypernym relation. We explore the ability of graph convolutional networks, hyperbolic graph convolutional networks, and graph attention networks to fulfill this task. We conduct an experiment on SemEval-2016 Task 13 data to test the quality of the taxonomy obtained through evolution, compared with the winning team's regeneration algorithms.
News articles covering policy issues are an essential source of information in the social sciences and are also frequently used for other use cases, e.g., to train NLP language models. To derive meaningful insights from the analysis of news, large datasets are required that represent real-world distributions, e.g., with respect to the contained outlets' popularity, topics, or time. Information on the political leanings of media publishers is often needed, e.g., to study differences in news reporting across the political spectrum, which is one of the prime use cases in the social sciences when studying media bias and related societal issues. Concerning these requirements, existing datasets have major flaws, resulting in redundant and cumbersome dataset-creation effort in the research community. To fill this gap, we present POLUSA, a dataset that represents the online media landscape as perceived by an average US news consumer. The dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum. Each outlet is labeled by its political leaning, which we derive using a systematic aggregation of eight data sources. The news dataset is balanced with respect to publication date and outlet popularity. POLUSA enables studying a variety of subjects, e.g., media effects and political partisanship. Due to its size, the dataset makes it possible to utilize data-intensive deep learning methods.
Author name ambiguity is a common problem in digital libraries. The problem occurs because multiple individuals may share the same name and the same individual may be represented by various names. Researchers have proposed various techniques for author name disambiguation (AND). In this paper, we study AND in the context of research publications indexed in the PubMed citation database. We perform an empirical study where we experiment with two ensemble-based classification algorithms, namely, random forest and gradient boosted decision trees, on a publicly available corpus of manually disambiguated author names from PubMed. Results show that random forest produces higher accuracy, precision, recall and F1-score, but gradient boosted trees perform competitively. We also determine which features are most discriminative given the feature set and the classifiers.
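The pairwise features feeding such ensemble classifiers can be sketched as follows. The feature set and record schema below are hypothetical illustrations, not the actual features used for the PubMed corpus:

```python
def pair_features(rec_a, rec_b):
    """Features describing how likely two publication records share
    the same author. A random forest or gradient boosted classifier
    would be trained on vectors like this one."""
    def jaccard(x, y):
        x, y = set(x), set(y)
        return len(x & y) / len(x | y) if x | y else 0.0
    return {
        "same_first_initial": float(rec_a["first"][:1].lower() ==
                                    rec_b["first"][:1].lower()),
        "affiliation_overlap": jaccard(rec_a["affiliation"].lower().split(),
                                       rec_b["affiliation"].lower().split()),
        "coauthor_overlap": jaccard(rec_a["coauthors"], rec_b["coauthors"]),
        "year_gap": abs(rec_a["year"] - rec_b["year"]),
    }
```

Feature-importance scores from the trained ensembles are then what identifies the most discriminative features, as the study reports.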
Nowadays, academic discussion of library smart services mostly remains at the stage of proposing concepts, with a variety of differing opinions. This research uses literature review and content analysis to sort out the core academic theory of library smart services across 66 academic papers, and excavates the core elements of library smart services and their relationships through framework construction. Finally, following the framework of "Smart City - Smart Library - Smart Service - User", with "information technology + a multi-service perspective" at its core, this research integrates five layers -- the macro background layer, related factors layer, technology equipment layer, service creating layer, and service providing layer -- into a theoretical framework of the core elements of library smart services.
The comments on government social media contain many netizens' opinions. To extract these opinions quickly and accurately, this poster used a BiLSTM-CRF model. To verify the effectiveness of the model, this poster selected a microblog of "China Police Online", crawled its comments, and trained the model on the basis of manual annotation. The trained model was then used to identify netizens' opinions from a large number of comments. The experiment showed that the model was effective and could accomplish the task of extracting opinions from the comments on a government microblog.
The electronic health record (EHR) is one of the most promising research fields in medical informatics. To reveal the development trajectory of EHR research, this bibliometric study analyzed the subject and topic distribution of EHR research from the perspective of information resource management. The results show that EHR-related research mainly includes population health and disease risk prediction, medical informatics technologies and research, quality improvement and acceptance research, and performance and impact assessment of information systems.
Electronic health records (EHRs) have become very popular in the last few years. To describe the current landscape of EHR-related research, this study focuses on its research areas and topic distribution, based on lda2vec and co-occurrence analysis. It is found that studies on population health and risk prediction have grown rapidly, and much attention has been paid to the application of AI in medicine. Moreover, standards for knowledge representation and information sharing have been substantially improved.
This study explores differences in the linguistic characteristics of questions from different disciplines on academic social Q&A platforms. Based on 1,968 questions collected from five disciplines on ResearchGate Q&A, a Kruskal-Wallis test showed that questions from different disciplines differ significantly across multiple linguistic characteristics. This study will help scholars better understand the expressive preferences of their disciplines on social media.
A mathematical paper contains various mathematical statements, including definitions, theorems, lemmas, and so on. The mining of mathematical literature currently focuses on formulas and disregards statements. The present study investigates the (automatic) subdiscipline classification of mathematical statements. The classification results are applied to inter-subdiscipline analysis, including proportion and dependency analyses. First, a statement learning dataset is compiled directly from mathematical textbooks, with a little human labeling, to train an effective subdiscipline classifier. Second, a relatively large corpus, the analysis dataset, is compiled from mathematical journals. The classification results on the analysis dataset are subsequently used to quantify inter-subdisciplinary relationships and conduct proportion analysis. Lastly, the dependency of different subdisciplines is analyzed, and dependency chains among subdisciplines can be obtained.
The citation relationships between scientific publications constitute a huge and complex citation network, which is of great significance for hotspot analysis and cutting-edge prediction in different fields. Nevertheless, how to evaluate the novelty and impact of a scientific publication in its early stages remains an open question. To address this issue, we apply a network representation learning approach (struc2vec) to represent the full complexity of the citation network structure, explore the extent to which an emerging publication changes the network structure of existing knowledge, and explain the relationship between this change and the paper's citation counts from both clustering and network visualization perspectives. We found that the structural features captured by struc2vec can, to some extent, predict the future citations of scientific publications. The predictive effects can be interpreted through how a new publication connects to and alters the existing structure of scientific knowledge in our visual analytics.
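A simplified version of the structural signal that struc2vec relies on is the degree sequence of a node's neighborhood at each hop distance. The sketch below illustrates the idea in plain Python; the node names are invented and this is not the full struc2vec algorithm:

```python
from collections import deque

def degree_sequence_by_hop(adj, node, max_hop=2):
    """Structural summary of `node`: the sorted degree sequence of
    nodes at each hop distance, computed via breadth-first search.
    Nodes with similar summaries play similar structural roles."""
    dist = {node: 0}
    q = deque([node])
    while q:
        u = q.popleft()
        if dist[u] >= max_hop:
            continue
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    rings = {}
    for v, d in dist.items():
        rings.setdefault(d, []).append(len(adj.get(v, []))) 
    return {d: sorted(degs) for d, degs in rings.items()}
```

struc2vec compares such per-hop sequences between nodes to build a similarity graph, then runs random walks on it to learn embeddings.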
As the ACM Digital Library (ACM DL) is currently developing its new version, this study aimed to understand how the ACM DL is used. Based on ACM DL log data, we investigated the characteristics of the queries submitted by users, the distribution of search topics in the ACM DL, and the relationship between query reformulation and search topic. In the future, we will study the complex relations between search sessions in depth and how to better classify users' search topics. This will help the digital library serve its users better.
We present DAIRE (Deep Archival Image Retrieval Engine), an image exploration tool based on latent representations derived from neural networks, which allows scholars to "query" using an image of interest to rapidly find related images within a web archive. This work represents one part of our broader effort to move away from text-centric analyses of web archives and scholarly tools that are direct reflections of methods for accessing the live web. This short piece describes the implementation of our system and a case study on a subset of the GeoCities web archive.
Linking to code repositories, such as those on GitHub, in scientific papers is becoming increasingly common in the field of computer science. The actual quality and usage of these repositories are, however, largely unknown so far. In this paper, we present for the first time a thorough analysis of all GitHub code repositories linked in scientific papers, using the Microsoft Academic Graph as a data source. We analyze the repositories and their associated papers with respect to various dimensions. We observe that the numbers of stars and forks, respectively, over all repositories follow a power-law distribution. In the majority of cases, only one of the authors contributes to the repository. The repository manuals are mostly kept rather short, with only a few sentences. The source code is mostly provided in Python. The papers containing the repository URLs, as well as the papers' authors, are typically from the AI field.
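A power-law claim about star/fork counts can be checked with the continuous maximum-likelihood estimator for the exponent. This is a sketch; the data in the test is synthetic, not the paper's repository counts:

```python
import math

def powerlaw_alpha(values, xmin=1):
    """Continuous MLE for the power-law exponent (Clauset et al.):
    alpha = 1 + n / sum(ln(x / xmin)), over all x >= xmin > 0."""
    xs = [x for x in values if x >= xmin]
    return 1 + len(xs) / sum(math.log(x / xmin) for x in xs)
```

A thorough fit would also select `xmin` by minimizing the Kolmogorov-Smirnov distance and test goodness of fit, which this one-liner omits.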
In this study, the authors are developing the Sound Wheelchair to increase the activity of wheelchair users. The Sound Wheelchair is a self-propelled wheelchair equipped with a 9-axis motion sensor module that converts the user's operations into digital media data. By building a digital library of ground surface information and activity data, we aim to create a map that promotes exercise for self-propelled wheelchair users according to their disability grade. In this paper, we propose a method for discriminating ground surface information with the Sound Wheelchair, and we built a simple collection game based on ground surface information as a means of promoting exercise for wheelchair users.
The services of national research data centers represent best practices in research data services in China. In this study, 536 cases were collected from 20 national research data centers, and content analysis was used to discuss, both quantitatively and qualitatively, the users, services, and effects according to a "3W" model. The first "W," whom to serve: the main users are institutes, universities, and their researchers. The second "W," what is served: data services are varied and individualized; large-scale data retrieval and downloads remain the main services, while data-based and value-added technical support is also included. The third "W," with what effects: the effect and impact are mainly reflected in supporting the development of science and technology and of the economy and society, whereas support for government decision-making and international issues is not yet obvious. We hope this study can help governments, data centers, and academic libraries take further action.
There is ongoing debate about whether task complexity significantly influences information seeking performance. This poster presents a meta-analysis exploring whether objective and subjective task complexity significantly affect information searching performance as measured by time. The results show a significant correlation between task complexity and time performance. Objective task complexity has a strong negative effect on time performance, while subjective task complexity has a moderate negative effect. For the effect of objective task complexity on time performance, the study samples are homogeneous; for the effect of subjective task complexity, however, they are heterogeneous.
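A pooled correlation of the kind such a meta-analysis reports is typically computed with the Fisher z transform and inverse-variance weights. The sketch below uses hypothetical (r, n) study pairs, not the poster's actual samples.

```python
import math

def fisher_z(r):
    """Fisher z transform of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def pooled_correlation(studies):
    """Fixed-effect pooled r from (r, n) pairs: average Fisher z values
    with weights n - 3, then back-transform to the r scale."""
    num = sum((n - 3) * fisher_z(r) for r, n in studies)
    den = sum(n - 3 for r, n in studies)
    return math.tanh(num / den)

# Hypothetical studies: (correlation of task complexity with search time, sample size)
studies = [(-0.45, 40), (-0.30, 60), (-0.52, 25)]
print(round(pooled_correlation(studies), 3))
```

A full meta-analysis would additionally test heterogeneity (e.g., Cochran's Q) to decide between fixed- and random-effects pooling, as the poster's homogeneity results suggest.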
Academic social networking sites (SNSs) have become an important channel for scholars' informal communication, and researchers use them in multiple ways. This poster analyzed and compared scholars' blog topics posted on ScienceNet with the keywords of their publications. Results suggest that 1) some scholars use the SNS as a platform for discussing ideas, while others use it to promote their papers; and 2) blog topics on the SNS span both personal and professional domains. These findings provide a preliminary understanding of how researchers use SNSs, allow platforms to automatically categorize the types of content scholars publish, and can help scholars make better use of academic social platforms for informal scholarly communication.
Scientific retraction helps purge the continued use of flawed research, but its practical influence still needs to be identified and quantified. In this study, we analyzed the citations of 46 retracted psychology articles from the Web of Science, combining qualitative and quantitative methods to explore the influence of retraction. Our results show that 1) the overall retraction rate in psychology was only 0.02%, and the mean time from publication to retraction was much longer than in medical fields; 2) retraction caused a significant decline in post-retraction citations and triggered a series of changes in the citation lifecycle; and 3) some negative citations of retracted articles resulted from the retraction itself, though these accounted for only a small percentage. The overall influence of retraction on citation was clear but remains limited.
This paper analyzes trends in citations and altmetrics with respect to different OA types (e.g., gold, hybrid, green). The analysis, based on Unpaywall, Altmetric, and COCI, shows that articles with a green license obtain more citations than other OA types. In patents, hybrid, green, and bronze articles receive more mentions than closed and gold articles. On social media (e.g., Twitter and Facebook), bronze articles receive the most mentions.
Media bias often affects individuals' opinions on reported topics. Many existing methods for identifying such bias employ individual, specialized techniques and focus only on English texts. We propose combining state-of-the-art approaches to further improve performance in bias identification. Our prototype consists of three analysis components for identifying media-bias words in German news articles: an IDF-based component, a component utilizing a topic-dependent bias dictionary created with word embeddings, and an extensive dictionary of German emotional terms compiled from multiple sources. We also discuss two analysis components, not yet implemented, that use machine learning and network analysis to identify media bias. All dictionary-based analysis components are experimentally extended with general word embeddings, and we report the results of a user study.
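An IDF-based component like the first one described can be sketched as follows. The background corpus and article are hypothetical (and in English rather than German, for readability); a real component would handle unseen and inflected words more carefully.

```python
import math
from collections import Counter

def idf_scores(documents):
    """Inverse document frequency over a background corpus of tokenized docs."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    n = len(documents)
    return {word: math.log(n / count) for word, count in df.items()}

def salient_words(article, idf, top_k=3):
    """Rank an article's words by IDF; unusually rare word choices score high.
    (Words unseen in the background score 0 here; a real component would smooth.)"""
    return sorted(set(article), key=lambda w: idf.get(w, 0.0), reverse=True)[:top_k]

# Hypothetical background corpus and article
background = [
    "the chancellor presented the budget".split(),
    "the parliament debated the budget".split(),
    "the minister defended the plan".split(),
]
article = "the chancellor pushed through a reckless budget".split()
print(salient_words(article, idf_scores(background))[:2])
```

Words common across the background ("the") score 0, while words rare in it ("chancellor") rank highest, which is the intuition behind flagging loaded or unusual word choices.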
Large-scale digital libraries such as the HathiTrust, with over 17 million texts, include duplicate texts and metadata inconsistencies that impede information access and retrieval. At this scale, manually evaluating each text is untenable. The SaDDL (Similarities and Duplication in Digital Libraries) project has been developing content-based methods for identifying relationships between similar texts within a digital library. This framework allows us to quantify words, themes, and concepts in order to identify similarity at a scale currently unobtainable by human effort. It also allows us to identify the most representative scan of a target work. This poster presents a way to reconstruct same-work relationships directly from content comparison, rather than from matching superficial metadata, in order to obtain the most representative copy, defined as the most complete, correct, and cleanest expression of a work. The philosophy behind this approach is to study the collection with an emphasis on the content itself, which is closer to the essence of the text.
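One common way to quantify content-based similarity at this scale is to compare sets of word "shingles" with Jaccard similarity (usually approximated with MinHash in practice). This is a generic sketch, not SaDDL's actual method; the sample scans are invented.

```python
def shingles(text, k=3):
    """Set of k-word shingles for a text (a cheap content fingerprint)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two scans of the same work (one with OCR noise) versus an unrelated text
scan_a = "Call me Ishmael. Some years ago never mind how long precisely"
scan_b = "Call me Ishmael. Some years ago - never mind how long precisely"
other = "It was the best of times it was the worst of times"

sim_dup = jaccard(shingles(scan_a), shingles(scan_b))
sim_diff = jaccard(shingles(scan_a), shingles(other))
print(sim_dup > sim_diff)  # near-duplicate scans score far higher
```

Thresholding such a score yields candidate same-work pairs; selecting the "most representative copy" would then compare the candidates on completeness and cleanliness.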
Although text mining is helpful for extracting complicated character and place relationships from text, how it can be used to enhance the reading experience has not been well studied. We propose a four-stage method for building read-aiding e-books directly from text, integrating text mining technologies with ideas from interactive aesthetics. Applying this method, we identify and present multiple complex relationships in the classical Chinese novel Romance of the Three Kingdoms, providing readers with vivid scenes and rich interactions for better comprehension.
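One typical stage of such a pipeline is deriving relationship edges from character co-occurrence. The sketch below is a minimal illustration with an invented character list and translated toy sentences, not the paper's four-stage method.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, characters):
    """Count how often pairs of known characters appear in the same sentence."""
    edges = Counter()
    for sentence in sentences:
        present = sorted({c for c in characters if c in sentence})
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

# Toy stand-ins for (translated) sentences from Romance of the Three Kingdoms
sentences = [
    "Liu Bei, Guan Yu and Zhang Fei swore brotherhood in the peach garden",
    "Cao Cao pursued Liu Bei across the river",
    "Guan Yu rode alone to rejoin Liu Bei",
]
characters = ["Liu Bei", "Guan Yu", "Zhang Fei", "Cao Cao"]
edges = cooccurrence_edges(sentences, characters)
print(edges[("Guan Yu", "Liu Bei")])  # 2
```

Edge weights of this kind can then drive an interactive relationship view in the e-book, with heavier edges rendered more prominently.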
As scholars increasingly use academic user-generated content from various social media to support their research, evaluating the quality of this content becomes challenging. This study recruited researchers to conduct retrieval experiments and participate in post-experiment interviews, aiming to identify the factors that affect their evaluation of various types of academic user-generated content. Analysis of the interview data yielded 23 factors affecting such evaluation. This pilot study lays the foundation for building a quality evaluation model for academic user-generated content.
Researchers reuse data from past studies to avoid costly re-collection of experimental data. However, large-scale data reuse is challenging due to lack of consensus on metadata representations among research groups and disciplines. Dataset File System (DFS) is a semi-structured data description format that promotes such consensus by standardizing the semantics of data description, storage, and retrieval. In this paper, we present analytic-streams - a specification for streaming data analytics with DFS, and streaming-hub - a visual programming toolkit built on DFS to simplify data analysis workflows. Analytic-streams facilitate higher-order data analysis with less computational overhead, while streaming-hub enables storage, retrieval, manipulation, and visualization of data and analytics. We discuss how they simplify data pre-processing, aggregation, and visualization, and their implications on data analysis workflows.
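As a generic illustration of the kind of streaming aggregation such a toolkit builds on (not the actual analytic-streams specification), a windowed mean over a stream can be written as a small generator:

```python
from collections import deque

def moving_average(stream, window=3):
    """Windowed mean over a numeric stream; emits one smoothed value per input.
    A building block for higher-order stream analytics with low overhead."""
    buf = deque(maxlen=window)
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# Hypothetical sensor readings flowing through the pipeline
readings = [10, 12, 11, 30, 31, 29]
smoothed = list(moving_average(readings))
print([round(v, 2) for v in smoothed])
```

Because the generator consumes one value at a time, such stages can be chained without materializing the whole stream, which is the computational-overhead argument made for analytic-streams.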
Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail on scanned documents such as Electronic Theses and Dissertations (ETDs). Here we present preliminary baseline work: a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts by converting scanned pages into images and then into text files using OCR tools. A series of carefully designed regular expressions is then applied, one per field, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground-truth dataset of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on fields in the ETD text files and provides a strong baseline for machine-learning-based methods. To the best of our knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs.
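The per-field regular-expression step can be sketched as follows. The patterns and cover-page text are illustrative assumptions, far simpler than the paper's carefully engineered, OCR-noise-tolerant expressions.

```python
import re

# Illustrative patterns for four of the seven fields; real ETD cover pages
# need far more variants and tolerance for OCR errors.
FIELD_PATTERNS = {
    "title": re.compile(r"^(?P<value>[A-Z][^\n]{10,})$", re.MULTILINE),
    "author": re.compile(r"(?:by|Author:)\s+(?P<value>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"),
    "year": re.compile(r"(?P<value>(?:19|20)\d{2})"),
    "degree": re.compile(r"(?P<value>(?:Doctor|Master) of [A-Z][a-z]+)"),
}

def extract_metadata(cover_text):
    """Apply one pattern per field; keep the first match for each."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cover_text)
        if match:
            record[field] = match.group("value")
    return record

# Hypothetical OCR output of a cover page
cover = """A Study of Metadata Extraction from Scanned Theses
by Jane Doe
Doctor of Philosophy
Virginia Polytechnic Institute, 2019"""
print(extract_metadata(cover))
```

First-match-wins is a simplification; the paper's heuristics also exploit the positional layout of cover pages to disambiguate fields.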
Scientific literature is crucial for researchers to inspire novel research ideas and find state-of-the-art solutions to various scientific problems. This paper presents a pilot study of a reading task for novice researchers using eye-tracking measures. The study focused on the scan path, fixations, and pupillary activity of the participants.
Recognizing the categories of sections in academic articles, a basic step for knowledge mining and services based on article full text, can help us understand the function of content in different parts of an article. However, no large-scale annotated corpus of section categories exists that could be used to classify article sections via machine learning. This study developed an annotation platform, CASS, that implements grouping of annotators, task assignment, online tagging, and review of doubtful corpus items. It improves tagging efficiency and provides convenience for task management, data preservation, results review, and resolving inconsistent annotations.
High pressure to publish in scientific journals has led to the emergence of a large number of open-access predatory publishers, which fail to provide a rigorous peer-review process, diluting the quality of research while charging high article processing fees. Identifying such publishers has remained a challenge due to the vast diversity of the scholarly publishing ecosystem. Earlier work utilizes only objective features such as metadata. In this work, we explore the possibility of identifying predatory behaviour through text-based features. We propose PredCheck, a four-step classification pipeline. A first classifier identifies the subject of the paper using TF-IDF vectors. Based on the subject of the paper, Doc2Vec embeddings of the text are computed. These embeddings are then fed into a Naive Bayes classifier that labels the text as predatory or non-predatory. Our pipeline achieves a macro accuracy of 95% and an F1-score of 0.89.
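The final classification stage can be illustrated with a tiny Naive Bayes classifier. To keep the sketch self-contained, bag-of-words counts stand in for the Doc2Vec embeddings the paper uses, and the training snippets are invented.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, with Laplace smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / n_docs)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / total)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training snippets; real training data would come from labeled journals
texts = [
    "rigorous peer review with detailed revisions requested",
    "the reviewers suggested major revisions before acceptance",
    "pay the fee and your paper is published within two days",
    "guaranteed fast publication low article processing charge",
]
labels = ["legitimate", "legitimate", "predatory", "predatory"]
clf = TinyNaiveBayes().fit(texts, labels)
print(clf.predict("fast publication guaranteed for a small fee"))  # predatory
```

In the actual pipeline the features are dense Doc2Vec vectors, so a Gaussian rather than multinomial Naive Bayes variant would be the closer fit.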
We present a method for source code plagiarism detection that is independent of the programming language. Our method, EsaGst, combines Explicit Semantic Analysis and Greedy String Tiling. Using 25 cases of source code plagiarism in C++, Java, JavaScript, PHP, and Python, we show that EsaGst outperforms a baseline method in identifying plagiarism across programming languages.
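Greedy String Tiling, the second ingredient of EsaGst, can be sketched in a simplified token-level form: repeatedly mark the longest common unmarked substring until no match of at least the minimum length remains. This is a generic illustration, not the paper's implementation.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Simplified Greedy String Tiling over token lists: total length of common
    tiles (non-overlapping matching runs) of at least min_match tokens."""
    marked_a, marked_b = [False] * len(a), [False] * len(b)
    tiled = 0
    while True:
        best_len, best_i, best_j = 0, 0, 0
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len < min_match:
            return tiled
        for k in range(best_len):
            marked_a[best_i + k] = marked_b[best_j + k] = True
        tiled += best_len

# Plagiarism with renamed identifiers: long token runs still match
src = "int total = 0 ; for ( int i = 0 ; i < n ; i ++ ) total += a [ i ] ;".split()
copy = "int sum = 0 ; for ( int j = 0 ; j < n ; j ++ ) sum += b [ j ] ;".split()
coverage = greedy_string_tiling(src, copy, min_match=2) / len(src)
print(coverage > 0.5)  # True: most tokens fall inside shared tiles
```

Production implementations use the Karp-Rabin speedup rather than this quadratic scan, and EsaGst additionally applies Explicit Semantic Analysis so that matching works across languages.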
Academic genealogy graphs (AGGs) identify the lineage of researchers and highlight the roles of mentors and academic institutions in the evolution of a discipline. In this poster, we study EduTree, the AGG of the discipline of education. We identify the major characteristics of EduTree, including the main areas of research, the pioneering institutions, and the researchers with a high mentorship index.
Automatic scientific keyphrase extraction is a challenging problem that facilitates several downstream scholarly tasks such as search, recommendation, and ranking. In this paper, we introduce SEAL, a scholarly tool for automatic keyphrase extraction and classification. The keyphrase extraction module comprises a two-stage neural architecture of Bidirectional Long Short-Term Memory cells augmented with Conditional Random Fields; the classification module comprises a Random Forest classifier. We experiment extensively to show the robustness of the system, evaluating against multiple state-of-the-art baselines and showing significant improvement. The current system is hosted at http://lingo.iitgn.ac.in:5000/.
Data-intensive research and decision-making continue to gain adoption across diverse organizations. As researchers and practitioners increasingly rely on analyzing large data products, both to answer scientific questions and to meet operational needs, data acquisition and pre-processing become critical tasks. For environmental science, the Canadian Surface Prediction Archive (CaSPAr) facilitates easy access to custom subsets of numerical weather predictions. We demonstrate a new open-source interface for CaSPAr that provides easy-to-use map-based querying and automates data ingestion into the CaSPAr batch processing server.
The NewsEye project demonstrator is a proof of concept of a digital platform dedicated to historical newspapers, intended to show benefits for researchers and the general public. The platform presently hosts newspapers from partner libraries in four languages (Finnish, Swedish, German, and French), providing users with various analysis tools and allowing them to manage their research interactively. It gives access to these enriched data sets and additionally interfaces with analysis tools developed in the NewsEye project, letting users experiment with tools specifically developed for investigating historical newspapers.
Nowadays, real-time video feeds are publicly available through various social network platforms; the data might be live CCTV recordings, news broadcasts, parliament sessions, and so on. As such data usually contain many interesting events, various video semantic analysis tasks can be performed, such as video summarization, motion detection, face recognition, video tracking, and style detection, to extract these events from large video datasets. In this paper, we propose a face recognition module for video content analysis of sessions of the Parliament of Malaysia. The proposed system is expected to help Malaysian citizens assess their parliamentary representatives' performance in a quantitative and more objective manner.
We introduce Publindex, a system that retrieves, classifies, and returns research publications of a given researcher according to the criteria and in the format predefined by the user.
We introduce an AI-enabled portal that visualizes Mahatma Gandhi's life events by constructing temporal and spatial social networks from Gandhian literature. Applying an ensemble of methods drawn from NLTK, Polyglot, and spaCy, we extract the key persons and places mentioned in Gandhi's written works. We visualize these entities, and the connections between them based on co-mentions within the same time frame, as networks in an interactive web portal. When a node in the network is clicked, the portal fires search queries about the entity and retrieves and presents all the information about it from the book from which the network was constructed. Overall, this system can serve as a digital, user-friendly resource for studying Gandhian literature.
Narrative Navigation connects mobile storytelling with geo-locations and Augmented Reality (AR). Narrative Navigation is designed to help users better understand a story. In this paper, we introduce our mobile app prototype that guides users to the physical locations of a story using AR flags. We indicate not just story locations and the user's distance to those story points, but also show the flow of a story through its geo-locations. The prototype thus provides users with a new way to immerse themselves in a location-based story.
Integrating advanced technologies with library services promotion and transformation is important. However, the application of artificial reality technology in library services is still at an early stage and remains to be explored. This article introduces new technology adoption and service promotion by Guangzhou Children's Library amid the ongoing information revolution; applying new technology to education will likely be more meaningful to children than games alone. With the continuous innovation of library services, we believe that children's libraries will develop further in the future.
Traditional institutional repositories (IRs) have been broadly used and improved in practice for decades. The Current Research Information System (CRIS) is one of the extended systems that broadens traditional IR functionality by expanding the data and visualization modules. Beyond the basic functions of an IR, a CRIS distributes multimedia scholarly publications, manages research data, evaluates research performance, visualizes research networks, enables research profiling, supports project-based activities, integrates citation metrics, and more. This paper introduces the first DSpace-CRIS system implemented in mainland China, at Wenzhou-Kean University (WKU), and explains the localized technology efforts and customized module development. The development team has also released the developed modules as open source on GitHub. The paper outlines the institution's future development plan.
Fudan University Library's digital collection, the Seal Stamping Catalogs Virtual Library, follows the IIIF (International Image Interoperability Framework) standard and adopts a serverless architecture on AWS. It thereby implements advanced interactive functionality for image-based resources to support academic research and development, while providing high-precision image resources with a good interactive experience in a cost-effective way.
"Resource Map" is an innovative subject resource navigation system that supports subject development in the context of China's "Double First-Class" university construction. Resources with various taxonomies are aggregated according to the discipline classification of the Ministry of Education of China through automatic mapping technology, and institutional featured data and evaluative data are harvested to describe the resources. A subject resource navigation interface and a subject guide interface have been designed to guide patrons in understanding subject resources from both a worldwide and an institutional perspective.
This paper aims to extract knowledge, including entities and relationships, from multi-source heterogeneous cultural heritage (CH) resources. The proposed crowdsourcing human-computer interaction framework utilizes museum-user-algorithm cooperation to achieve high-quality, scalable CH knowledge extraction. The paper also proposes crowdsourcing optimization mechanisms to improve participation in and the quality of crowdsourcing projects. Finally, it discusses how the extracted knowledge can support CH digital resource construction and knowledge-driven intelligent applications in museums.
The article contributes quantitatively to the ongoing discussion on the transdisciplinary nature of digital humanities (DH) research. A bibliometric analysis of published DH articles is conducted to examine the structure and patterns of transdisciplinary collaborations, as well as how the overall pattern evolves. The findings indicate that the scope of disciplines involved in DH research is broad, but the disciplinary distribution is unbalanced. Centering on a few important disciplines, the disciplines related to DH research aggregate into communities, suggesting multiple related research areas and disciplines for DH research. The evolving graph of disciplines supports the transdisciplinary nature of DH.
This paper introduces information literacy micro-lessons and their role in STEM coursework. Micro-lessons are a high-impact, low-effort means of incorporating practical skills such as information literacy and research ethics into a course to enhance student learning. The authors discuss lessons learned from implementation at the Colorado School of Mines, present examples of successful lessons, and offer guidance to faculty and librarians on enhancing their curricula with micro-lessons.
Virginia Tech Libraries has developed a cloud-native, microservices-based digital library platform to consolidate diverse access and preservation infrastructure into a set of flexible, independent microservices on Amazon Web Services. We have been an implementer of and contributor to various community digital library and repository projects, including DSpace, Fedora, and Samvera. However, the complexity and cost of maintaining disparate application stacks have reduced our capacity to build new infrastructure. Virginia Tech has a long history of participation in and contribution to community-driven open source projects and has, in that time, developed more than a dozen independent applications architected on these stacks. The cost of independently addressing vulnerabilities, which often requires work to mitigate incompatibilities; reworking each application to comply with evolving branding guidelines; and feature development and improvement has burgeoned, threatening to overwhelm our capacity. Like many of our peers, our maintenance obligations have made continued growth unsustainable and have pushed older applications to near abandonware. We designed and developed the Digital Libraries Platform to address these concerns, reducing our maintenance obligations and the costs associated with feature development across digital libraries. This approach departs from the monolithic architectures of our legacy systems and, as such, shares more infrastructure among individual digital library implementations. The shared infrastructure facilitates rapid inclusion of new and improved features into each digital library instance. New features can be developed independently of any digital library instance and integrated into an instance by including the feature in the React/Amplify template.
Changes to the template superclass, such as those necessitated by evolving branding guidelines, are immediately inherited by the template instances that subscribe to it. The platform uses Terraform deployment templates, Lambda serverless functions, and other cloud assets to form a microservices architecture on which multiple template-based sites are built. Individual sites are configured in AWS DynamoDB, Amazon's NoSQL database service, and via modification of the shared template. Additional services provide digital preservation support, including auditing, file fixity validation, replication to external cloud storage providers, file format characterization, and deposit to third-party preservation services. This presentation also discusses the cost of operating these services in AWS and strategies for mitigating those costs, including containerization, which allows high-cost, asynchronous services to be deployed to local infrastructure to take full advantage of existing hardware and advantageous utility pricing while allowing for local redeployment. In the past, developers worked in local, independent environments, and new features and fixes were submitted to a central development environment for testing and validation, which significantly slowed development. Migrating development, review, integration, and deployment processes to AWS reduced the time and resource bottlenecks in those processes. Our AWS cost accounting demonstrates an 87% savings over our traditional, on-premises Fedora/Samvera approach: for a team of four software developers, the total cost of a traditional server-based development approach (a t2.medium EC2 instance) is about $133 per month, versus an average of $17 per month for our serverless approach using AWS Amplify. As the Digital Libraries Platform project expands, we anticipate publishing a set of API documents allowing us and others to reimplement specific microservices independent of the architecture.
In July 2018, the East Asian Library (EAL) of the University of Pittsburgh Library System (ULS) initiated the Contemporary Chinese Village Data (CCVD) project to create an open-access online dataset of statistics selected from the library's collection of Chinese village gazetteers. This unique initiative has produced a dataset of significant value to the humanities and social sciences, as the gazetteers include quantitative and qualitative data critical to Chinese studies in fields such as politics, economics, sociology, environmental science, history, and public health. The first 500 villages' data were opened for access in October 2019 at http://www.chinesevillagedata.library.pitt.edu. In June 2020, the number will expand to 1,000, and a database that allows effective and efficient ingesting, querying, manipulating, and displaying of the data will become available.
Data literacy is a critical part of effective research, and a growing demand for data literacy services among college students and faculty is currently observed, especially during the COVID-19 health emergency, which has been remarkably severe since December 2019. Using network technology, PKUL has innovated different modes of data literacy service, such as recommending data resources, providing digital lectures in collaboration with departments or database suppliers, and offering online data visualization services via conferencing applications and apps. Teachers and students embraced these innovations, which widened the scope of the library's online data literacy services.
The rapid development of digital libraries and the proliferation of scholarly big data have created an unprecedented opportunity to explore scientific production and reward at scale. Fueled by data exploration and computational advances in digital libraries, the science of science is an emerging multidisciplinary field that aims to quantify patterns in scientific relationships and dependencies and to understand how scientific progress emerges from scholarly big data. In this tutorial, we provide an overview of the science of science, including major topics on scientific careers, scientific collaborations, and scientific ideas. We also discuss its historical context, state-of-the-art models and exciting discoveries, and promising future directions for participants interested in mining scholarly big data.
Computational analyses are playing an increasingly central role in research and are a feature of many advanced digital libraries. Journals, sponsors, and researchers, including in the digital library field, are calling for published research to include associated data and code. However, many involved in research have not received training in best practices and tools for building systems (e.g., using containers) and implementing methods that facilitate sharing code and data. This tutorial aims to address this gap in training while also providing those who support researchers with curated best practices guidance and tools.
This tutorial is unique compared to other reproducibility events due to its practical, step-by-step design. It comprises hands-on exercises to prepare research code and data for computationally reproducible publication. Although the tutorial begins with brief introductory material on computational reproducibility, its bulk is guided work with data and code. The basic best practices for publishing code and data are covered with curated resources, with examples drawn from the digital library and information retrieval domains. Participants move through preparing research for reuse, organization, documentation, and automation, and then submit their code and data for sharing. Tools to support reproducibility will be introduced, but all lessons will be platform agnostic.
This tutorial is a thorough and deep introduction to the Digital Libraries (DL) field, providing a firm foundation: covering key concepts and terminology, as well as services, systems, technologies, methods, standards, projects, issues, and practices. It introduces and builds upon a firm theoretical foundation (starting with the '5S' set of intuitive aspects: Streams, Structures, Spaces, Scenarios, Societies), giving careful definitions and explanations of all the key parts of a 'minimal digital library', and expanding from that basis to cover key DL issues. Illustrations come from a set of case studies drawn from multiple current projects, including ones involving webpages, tweets, and social networks. Attendees will be exposed to four Morgan & Claypool books, published 2012--2014, that elaborate on 5S. Complementing the coverage of 5S will be an overview of key aspects of the DELOS Reference Model and DL.org activities. Further, new material will be added on building digital libraries using container and cloud services, on developing a digital library for electronic theses and dissertations, and on methods to integrate UX and DL design approaches.
This tutorial provides guidance on writing about research in data science, focusing on conference papers and journal articles. The target audience is graduate students, post-doctoral fellows, and early-career faculty. The teaching methodology includes lectures and a heavy hands-on component.
This tutorial is designed for those who want an introduction to building a digital library using an open source software program. The tutorial will focus on the Greenstone digital library software. In particular, participants will work with the Greenstone Librarian Interface, a flexible graphical user interface designed for developing and managing digital library collections. Attendees do not need programming expertise; however, they should be familiar with HTML and the Web, and be aware of representation standards such as Unicode, Dublin Core, and XML. The Greenstone software has a pedigree of more than two decades, with over 1 million downloads from SourceForge. The premier version of the software has, for many years, been Greenstone 2. This tutorial will introduce users to Greenstone 3, a redesign and reimplementation of the original software that takes better advantage of standards and web technologies developed since the original implementation of Greenstone. Written in Java, the software is more modular in design, increasing its flexibility and extensibility. Emphasis in the tutorial is placed on where Greenstone 3 goes beyond what Greenstone 2 can do. Through hands-on practical exercises, participants will, for example, build collections where geo-tagged metadata embedded in photos is automatically extracted and used to provide a map-based view of the collection in the digital library.
The goal of this workshop is to engage the related communities in open problems in the extraction and evaluation of knowledge entities from scientific documents. The workshop names this cutting-edge, cross-disciplinary direction Extraction and Evaluation of Knowledge Entities (EEKE), highlighting the development of intelligent methods for identifying knowledge claims in scientific documents and promoting the application of knowledge entities. The workshop website is at https://eeke2020.github.io/.
The focus of the MEDA 2020 workshop is biomedical data in digital form, especially biomedical literature on the one hand and genomic data on the other. As in previous editions, this edition will include presentations of original work, a keynote talk, and a panel discussion.
The annual SIG-CM workshop addresses the development, use, and evolution of conceptual models in the context of digital libraries, archives, and museums. The 2020 workshop will focus on emerging themes in information science research related to information modeling, including empirical techniques for developing and evaluating conceptual models, as well as innovations in pedagogy for educating practitioners in knowledge organization and representation. Activities at the workshop will include paper presentations, round-table discussions, and a keynote from an expert in the conceptual and logical foundations of information organization systems.
This virtual workshop, organized as part of the JCDL 2020 conference, serves as a continuation of the workshop "Organizing Data, Information, and Knowledge in Big Data Environments" held at the JCDL 2019 conference. The workshop focuses on the challenges and opportunities that Big Data environments present for information and computing professionals, and explores strategies and solutions for organizing data, information, and knowledge on a large scale.
The entire body of research literature is currently estimated at 100-150 million publications, with an annual increase of around 1.5 million. Systematically reading and analysing the full body of knowledge is now beyond the capacities of any human being. Consequently, it is important to better understand how we can leverage Natural Language Processing/Text Mining techniques to aid knowledge creation and improve the process by which research is being done. This workshop aims to bring together people from different backgrounds who: (a) have experience with analysing and mining databases of scientific publications, (b) develop systems that enable such analysis and mining of scientific databases (especially those who manage publication databases), or (c) develop novel technologies that improve the way research is being done.
This workshop will explore the integration of Web archiving and digital libraries, covering all stages of the complete Web archive life cycle, including creation/authoring, uploading/publishing, crawling, indexing, exploration, and archiving. It will include particular coverage of current topics of interest, such as big data, social media archiving, and systems.
Centered on the theme "Data-driven Smart Library Services in China", this workshop aims to gather digital library (DL) experts, scholars, graduate students, and practitioners who study or are interested in DL services, technological development, theories and concepts, social and practical trends, and issues and problems in China. It is also hoped that the workshop will promote the JCDL community and its reputation in China, and enhance the participation and engagement of Chinese DL researchers and professionals who were not involved in previous JCDL events organized outside of China.
The Doctoral Consortium is a workshop for Ph.D. students from all over the world who are in the early phases of their dissertation work (i.e., the consortium is not intended for those who are finished or nearly finished with their dissertation). The goal of the Doctoral Consortium is to help students with their thesis and research plans by providing feedback and general advice in a constructive atmosphere.