Text mining (sometimes called data mining) is a popular way to 'read' large collections of textual data. This page is intended to support researchers in the Penn State University community who are seeking resources to mine. It assumes you already have a sense of what you want to do but need a corpus to make your research possible.
The present guide is therefore more focused on where to find resources for performing the analysis. If you are looking for advice on getting started with text mining more generally, please see the Text Analysis: Introduction Guide.
Before you begin any data mining project, you should be aware of the limitations imposed by copyright and fair use, especially if your data may be under copyright.
The Association of Research Libraries (ARL) and the International Federation of Library Associations (IFLA) both provide advice and statements on data and text mining, which you can find below.
If you wish to undertake a text or data mining project with content from the Libraries’ licensed databases, please contact your subject librarian to investigate options, which may include negotiating with the vendor or purchasing access to the data. Although many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, we regularly negotiate text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire Penn State community.
Resources that are freely available are marked accordingly in this guide. If you are unsure about issues surrounding copyright, fair use, or dissemination of compiled corpora, contact the Office of Scholarly Communications & Copyright in the first instance.
If you would like to build a more robust sense of legal literacies around text and data mining, this free, online book will help you.
You probably need a librarian's assistance to get started on a data mining project. See our staff directory for help finding your subject librarian, or reach out directly to a member of Research, Informatics, and Publishing.
It might take some time. Text and data mining rights are negotiated with publishers as needed, very often on a case-by-case basis. You may be responsible for some, all, or none of the negotiations. We'll help you navigate that, but please plan for this when developing your project.
Start small. A proof-of-concept prototype goes a long way toward showing whether an approach is viable and produces meaningful results before you commit yourself fully.
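As one illustration of starting small, the sketch below counts the most frequent terms in a handful of documents. It is a minimal example and not a recommended method: the sample texts, the stopword list, and the function names (tokenize, term_frequencies) are all hypothetical placeholders, and it assumes you are working only with text you are permitted to mine.

```python
import re
from collections import Counter

# Hypothetical sample: stand-ins for a few documents you are permitted to mine.
SAMPLE_DOCS = [
    "Text mining helps researchers read large collections of text.",
    "A small sample of text can show whether mining is viable.",
]

# Illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "of", "the", "is", "can", "to", "and"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs, stopwords=STOPWORDS):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in stopwords)
    return counts


if __name__ == "__main__":
    # Print the five most frequent terms in the sample.
    for term, n in term_frequencies(SAMPLE_DOCS).most_common(5):
        print(term, n)
```

If a quick count like this already surfaces the kind of pattern you are interested in, that is a reasonable sign the larger project is worth discussing with your subject librarian.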