Text mining: Web-based resources

Text and Data Mining at Penn State

Text mining (sometimes called data mining) is a popular way to 'read' large collections of language-based data. This page supports researchers in the Penn State University community who are seeking resources to mine. It assumes you already have a sense of what you want to do but need a corpus to make your research possible.

This guide therefore focuses on where to find resources for your analysis. If you are looking for advice on getting started with text mining more generally, please see the Text Analysis: Introduction Guide.

 

Copyright and Fair Use

Before you begin any data mining project, be aware of the limits that copyright and fair use place on your work, especially if you are dealing with data that may be under copyright.

The Association of Research Libraries (ARL) and the International Federation of Library Associations (IFLA) both provide advice and statements on data and text mining, which you can find below.

For a deeper grounding in the legal issues around text and data mining, see Building Legal Literacies for Text and Data Mining.

If you wish to undertake a text or data mining project with content from the Libraries’ licensed databases, please contact your subject librarian to investigate options, which may include negotiating with the vendor or purchasing access to the data. Although many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, we regularly negotiate text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire Penn State community.

Resources that are freely available are marked accordingly in this guide. If you are unsure about issues surrounding copyright, fair use, or dissemination of compiled corpora, contact the Office of Scholarly Communications & Copyright in the first instance.

Data Sharing

Given funder and publisher mandates for sharing data, researchers who work on text and data mining may need to share their textual data. When sharing a corpus, be sure to include a README file or other documentation that explains any modifications or data cleaning that you have done. Before depositing your data, please consider the following guidelines to understand what you can share:

  • If the texts gathered are in the public domain, then you can share your corpus in any repository.
  • If the texts are in copyright, it is sufficient to document the texts that your corpus contains using stable links (DOIs, URIs, or Perma.cc / Wayback Machine links) and appropriate citations so that other researchers are able to reassemble your corpus.
  • If the texts are in copyright and you would like to share the whole corpus, reach out to our copyright office to discuss what might be possible.
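The documentation described above can be generated with a short script. The following is a minimal sketch in Python, not an official Penn State template: the entries in CORPUS, the field names, and the example links are all hypothetical placeholders. It builds a manifest that lists each text with its citation and stable link, refuses to build one if any entry lacks a link, and drafts a README that records the cleaning applied to each text.

```python
import csv
import io

# Hypothetical corpus entries; every title, citation, and link is a placeholder.
CORPUS = [
    {
        "title": "Example Article One",
        "citation": "Author, A. (2020). Example Article One. Example Journal.",
        "stable_link": "https://doi.org/10.0000/example.1",
        "modifications": "Removed page headers and footers",
    },
    {
        "title": "Example Web Page",
        "citation": "Example Organization. (2021). Example Web Page.",
        "stable_link": "https://example.org/archived/page",
        "modifications": "Stripped HTML tags; normalized whitespace",
    },
]

# Column names are an assumption, not a required schema.
FIELDS = ["title", "citation", "stable_link", "modifications"]


def build_manifest(entries):
    """Return CSV text listing each text with its citation and stable link."""
    missing = [e["title"] for e in entries if not e.get("stable_link")]
    if missing:
        # Without a stable link, other researchers cannot reassemble the corpus.
        raise ValueError(f"Entries without a stable link: {missing}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def build_readme(entries):
    """Return README text documenting the cleaning applied to each text."""
    lines = [
        "Corpus README",
        "",
        f"Number of texts: {len(entries)}",
        "",
        "Modifications and data cleaning:",
    ]
    for entry in entries:
        lines.append(f"- {entry['title']}: {entry['modifications']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_manifest(CORPUS))
    print(build_readme(CORPUS))
```

In practice you would save the two outputs as files (for example, manifest.csv and README.txt) and deposit them in place of, or alongside, the texts, depending on their copyright status.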