HathiTrust: Introduction for text & data mining

What is HathiTrust?

HathiTrust (http://hathitrust.org/) is a not-for-profit collaborative of academic and research libraries that offers access to over 17 million digitized items through a web interface. Its website provides two primary access points to the collection of full-text items. The first is the Digital Library, a platform for searching and reading these digitized materials. The second is a sister service, the HathiTrust Research Center, which supports quantitative analyses of the same materials.

Many researchers first encountered HathiTrust between Spring 2020 and Summer 2021, through its Emergency Temporary Access Service (ETAS), which provided access to digitized collections at its many partner institutions. Outside of ETAS, HathiTrust typically provides access to 16 million volumes in 400 languages, published from 1700 to the present day, though the focus is primarily on English-language materials published between 1800 and 1999. Materials provided by HathiTrust are sourced from Google, the Internet Archive, Microsoft, and digital preservation departments from a wide network of partner institutions. HathiTrust is therefore a collection of collections available for our use.

HathiTrust's holdings cover fiction and non-fiction works, and one of their strengths is a robust government documents collection. It is worth pointing out, though, that the collection is not representative of everything ever printed. In addition, contemporary public domain laws and community standards guide access, meaning that many resources are only available to member institutions. Penn State is a member of and major contributor to this project.

What does HathiTrust provide access to?

In addition to providing access to millions of digitized materials from 170+ partner institutions, HathiTrust provides access to non-consumptive, derived data from these digitized materials for a range of quantitative analyses, identification purposes, and cross-referencing. Your needs will depend on your project, and HathiTrust has tried to be wide-ranging in the materials it offers.

HathiTrust as a collection offers a wide view of the history of printed text, primarily in English, but also in German, French, Spanish, and Russian, among more than 400 languages. HathiTrust offers both fiction and non-fiction writing, from early novels to present-day works, and a very robust collection of government documents from its partner institutions.

What do you mean by non-consumptive research?

Non-consumptive research (also sometimes called “non-consumptive analytics”) describes research in which computational analysis is performed on one or more volumes (textual or image objects) in a collection, but not research in which a researcher reads or displays substantial portions of an in-copyright or rights-restricted volume to understand the expressive content presented within that volume. A “substantial portion” means a portion of an individual volume sufficient in quality or quantity to provide a substitute for access to the volume’s expressive content. A portion that merely reveals factual information (about the work or about the world) is not thereby a substitute for access to the volume’s original expressive content. This is in line with fair use law, which requires that outputs not supersede the original work in its ordinary and likely markets.

Non-consumptive analytics can include such computational tasks as text extraction, textual analysis and information extraction, linguistic analysis, automated translation, image analysis, file manipulation, OCR correction, and indexing and search.
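
To make the idea of derived data more concrete, here is a minimal Python sketch of one kind of non-consumptive output: word counts for a single page. It is only an illustration and is not HathiTrust's own code or data format. The function name derive_token_counts and the sample sentence are made up for this example. The counts reveal which words appear and how often, but the word order is discarded, so the result cannot stand in for reading the page itself.

```python
import re
from collections import Counter
from typing import Dict


def derive_token_counts(page_text: str) -> Dict[str, int]:
    """Return a word -> frequency mapping for one page.

    Hypothetical helper for illustration only. Word order is thrown away,
    so the original sentences cannot be read back from the result.
    """
    # Lowercase the text and keep runs of ASCII letters as tokens.
    tokens = re.findall(r"[a-z]+", page_text.lower())
    # Sort alphabetically so the output carries no trace of the original order.
    return dict(sorted(Counter(tokens).items()))


# Made-up sample sentence standing in for a page of a digitized volume.
sample_page = "The whale, the whale! The sea was calm."
counts = derive_token_counts(sample_page)
print(counts)
```

A researcher could compare counts like these across many volumes (for example, to track how often a term appears over time) without ever displaying the underlying text.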