
HathiTrust: Introduction for text & data mining

Data sources


The HathiTrust Research Center (HTRC) provides specific kinds of data from your worksets for you to work with directly. If those do not meet your needs, there are additional datasets available for download to your own servers.

HTRC Derived Datasets
There are two additional datasets based on HathiTrust collections that you can work with.

1) The Extracted Features Dataset includes basic bibliographic metadata as well as counts for various elements in each book (e.g., number of pages, number of words on a specific page); see the parsing sketch below.

2) The Word Frequencies in English-Language Literature, 1700-1923 dataset provides word frequency counts by genre for English-language literature.

Read more about and download these Derived Datasets at the HathiTrust page: https://analytics.hathitrust.org/datasets
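If you download Extracted Features files, each volume arrives as a compressed JSON file. The Python sketch below tallies word counts from one such file; the field names it reads ("features", "pages", "body", "tokenPosCount") and the filename are assumptions based on our reading of the EF schema, so check the dataset documentation (HTRC also provides a feature-reader Python package) before relying on it.

import bz2
import json
from collections import Counter

# Minimal sketch: tally word counts from one Extracted Features file.
# The field names below reflect our understanding of the EF schema and
# should be checked against the dataset documentation.
def count_tokens(path):
    with bz2.open(path, "rt", encoding="utf-8") as f:
        volume = json.load(f)
    totals = Counter()
    for page in volume["features"]["pages"]:
        # Each token maps to per-part-of-speech counts; sum them per token.
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals

if __name__ == "__main__":
    counts = count_tokens("example.json.bz2")  # hypothetical filename
    print(counts.most_common(20))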

OAI
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol used in libraries and archives for the automated delivery of structured bibliographic metadata. You can use it to retrieve bibliographic records for full-view content in MARC21 or unqualified Dublin Core formats. Learn more about OAI-PMH at this page: https://www.openarchives.org/pmh/
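As a rough illustration, an OAI-PMH harvest is just an HTTP GET request with a "verb" parameter. The Python sketch below issues a ListRecords request and prints Dublin Core titles; the base URL is a placeholder, so substitute HathiTrust's actual OAI endpoint from their documentation.

import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # placeholder; use HathiTrust's real OAI base URL

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
root = ET.fromstring(response.content)
for record in root.findall(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    if title is not None:
        print(title.text)

# A resumptionToken element, if present, is passed back with the same verb
# to page through the rest of the set.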

Tab-delimited Files 
Metadata describing all works in the HathiTrust collection are available for download as tab-delimited files. These files include some bibliographic metadata as well as data elements unique to the HathiTrust collection. Learn more about what is included in these metadata files: https://www.hathitrust.org/hathifiles_description
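Because the hathifiles are plain tab-delimited text, they can be scanned with a few lines of code. The sketch below assumes a gzip-compressed file with no header row, and assumes the volume ID and rights code sit in the first and third columns; confirm the actual column order against the description page above.

import csv
import gzip

# Yield the volume IDs of public-domain items from a downloaded hathifile.
# Column positions (volume ID first, rights code third) are assumptions.
def public_domain_ids(path, rights_codes=("pd", "pdus")):
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            htid, rights = row[0], row[2]
            if rights in rights_codes:
                yield htid

if __name__ == "__main__":
    for htid in public_domain_ids("hathi_full.txt.gz"):  # hypothetical filename
        print(htid)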

Renewal ID file
A tab-delimited file is available that pairs US copyright renewal registration numbers with HathiTrust volume identifiers. Learn more about what is included in the renewal ID file and download it directly from https://www.hathitrust.org/renewal-data-files

 

HathiTrust also offers several APIs 
HathiTrust APIs do not work like typical search APIs, where you use a keyword to search across the collection. Instead, you use them to query and retrieve data when you have a known identifier, which can be more direct than a keyword or catalogue search. They offer a Bibliographic API, a Data API, and the HTRC Data API, described below.

Bibliographic API
Using a variety of common identifiers (e.g., ISBN, LCCN, OCLC number), you can retrieve brief or full bibliographic records for any works associated with those identifiers.
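As a rough sketch, a Bibliographic API lookup is a single HTTP GET request. The URL pattern (catalog.hathitrust.org/api/volumes/brief/{id type}/{id}.json) and the "records" and "items" response keys below follow our reading of the API documentation; verify them against the current documentation before building on this.

import requests

def lookup_brief(id_type, identifier):
    # Request a brief bibliographic record by identifier type and value.
    url = f"https://catalog.hathitrust.org/api/volumes/brief/{id_type}/{identifier}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = lookup_brief("oclc", "424023")  # example OCLC number
    for record_id, record in data.get("records", {}).items():
        print(record_id, record.get("titles"))
    for item in data.get("items", []):
        print(item.get("htid"), item.get("rightsCode"))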

Data API
The Data API allows you to retrieve page images, OCR text for individual pages, and METS metadata. To retrieve the OCR for more than a few volumes, we recommend that you request a dataset. Restrictions apply.
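The sketch below shows roughly what a signed Data API request might look like in Python. The endpoint path is an assumption drawn from older documentation, and the OAuth access key and secret are placeholders you would request from HathiTrust; consult the current Data API documentation for the real details.

import requests
from requests_oauthlib import OAuth1  # the Data API uses OAuth-signed requests

# Placeholders: obtain an access key and secret from HathiTrust.
auth = OAuth1("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY", signature_type="QUERY")

volume_id = "mdp.39015012345678"  # hypothetical HathiTrust volume identifier
url = f"https://babel.hathitrust.org/cgi/htd/volume/meta/{volume_id}"  # assumed path

response = requests.get(url, params={"v": 2}, auth=auth, timeout=30)
response.raise_for_status()
print(response.text)  # METS metadata for the volume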

HTRC Data API
The HTRC Data API (provided by the HathiTrust Research Center) lets you retrieve the OCR for a limited set of HathiTrust volumes that don’t have any download restrictions.

What can I do with this data?

A variety of algorithmic techniques are pre-programmed for your use. The HTRC documentation page describes all of the available algorithms, limitations on using them, and anticipated outcomes.

These are all versions of major, recognizable text analysis algorithms, written specifically for the HathiTrust infrastructure and run on HTRC servers.

On the HathiTrust wiki (login required) you can follow specific tutorials for executing each of these algorithms. 

 

I want to do my own coding and run my own analyses. 

The HathiTrust Research Center supports that, too! They offer a Data Capsule service, a secure computing environment for performing researcher-driven text analysis on the HathiTrust corpus. You will need to be able to use either a virtual network computing (VNC) client or a secure shell (SSH) to access their Data Capsules. An additional benefit of the Data Capsule service is that users at member institutions (such as Penn State) have the exclusive option to select “Full Corpus Access,” which includes copyrighted items.

Capsules come with Anaconda and R pre-installed, for writing and executing your own scripts in your preferred language.