
HathiTrust: Introduction for text & data mining

HathiTrust Research Center in Research

The HathiTrust Research Center can support teaching and learning, but it is most useful for research projects. In the classroom, the HathiTrust Digital Library (the reading interface) is probably a better bet! Here are two ways you could use the Research Center in your research:

What major themes arise in NAACP documents? 

Use the topic modeling algorithm to surface prominent concepts or themes that are latent in a collection of NAACP documents. What are the organization's major issues? How do those issues relate to each other?

I want to identify all the place-names from a set of historical documents about Philadelphia to map them for further analysis.

Use the Named Entity Recognition algorithm to identify these place-names along with other named entities. (You might have to do some cleanup to separate place-names from other entity types such as dates, times, percentages, and monetary terms.) Then, move the exported data into the GIS software of your choice to look for patterns.
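The cleanup step described above might look something like the following Python sketch. It assumes the exported results can be read as a CSV with one row per entity and a column giving its type. The column names (entity, type), the LOCATION label, and the sample rows are invented for illustration, so check them against your actual export before reusing any of this.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for an exported results file.
# ASSUMPTION: the column names and type labels are hypothetical,
# not the documented HTRC export format.
SAMPLE_EXPORT = """entity,type
Philadelphia,LOCATION
1793,DATE
Schuylkill River,LOCATION
Benjamin Rush,PERSON
Philadelphia,LOCATION
$500,MONEY
"""


def place_name_counts(export_file, place_label="LOCATION"):
    """Count how often each place-name appears, skipping other entity types."""
    reader = csv.DictReader(export_file)
    return Counter(
        row["entity"].strip()
        for row in reader
        if row["type"].strip().upper() == place_label
    )


def write_for_gis(counts, out_file):
    """Write place-names and mention counts as CSV for geocoding in GIS software."""
    writer = csv.writer(out_file)
    writer.writerow(["place_name", "mentions"])
    for name, mentions in counts.most_common():
        writer.writerow([name, mentions])


counts = place_name_counts(io.StringIO(SAMPLE_EXPORT))
output = io.StringIO()
write_for_gis(counts, output)
print(output.getvalue())
```

With a real export, you would open the downloaded file in place of the sample text and write the result to a file on disk. Most GIS packages can then geocode the place_name column.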

What about that extracted features option?

This option is great if you want to perform analyses on your own with specific features. This is the best way to get the largest amount of data at once and apply your own algorithms and data analysis to it using the software of your choice (rather than being committed to the versions implemented in the HTRC).
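As a rough illustration of what "your own analysis" could mean here, the Python sketch below adds up per-page word counts into totals for a whole volume. It assumes each volume arrives as JSON with page-level token counts broken down by part of speech. The field names used (features, pages, body, tokenPosCount) are an assumption about the file layout, and the sample volume is invented, so compare both against the current Extracted Features documentation.

```python
from collections import Counter

# Made-up volume in a simplified form of the assumed layout.
# ASSUMPTION: the keys "features", "pages", "body", and "tokenPosCount"
# reflect the file structure; verify against the real files.
SAMPLE_VOLUME = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"freedom": {"NN": 2}, "vote": {"NN": 1, "VB": 1}}}},
            {"body": {"tokenPosCount": {"Freedom": {"NN": 1}, "court": {"NN": 3}}}},
        ]
    }
}


def volume_token_counts(volume):
    """Sum token counts across all pages, ignoring case and part of speech."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals


totals = volume_token_counts(SAMPLE_VOLUME)
print(totals.most_common(3))
```

For a real volume, you would load the downloaded file with Python's json module and pass the result to volume_token_counts. The totals can then feed whatever analysis you prefer, such as comparing word frequencies across volumes.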

I have a question about this resource and it's not covered here.

Contact Heather Froehlich, Literary Informatics Librarian (hgf5@psu.edu), and have a look at the text mining guide for further suggestions.