Gale Digital Scholar Lab: an introduction

A brief survey of offerings from the Gale Digital Scholar Lab for Penn State researchers

What do you mean by cleaning and preprocessing?

Most text analysis projects require some preprocessing before the analysis itself. This often includes removing boilerplate language, unnecessary punctuation, or special characters. Other tasks can include changing the case (upper- to lower-case), removing stop words, and identifying and correcting optical character recognition (OCR) errors introduced during the original digitization. The Gale DSL offers pre-written scripts that perform these tasks iteratively, so users can observe how each change affects digitized primary source documents, both individually and in aggregate.
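To make these tasks concrete, here is a minimal Python sketch of what such a cleaning pass might look like outside the lab. It is not the DSL's own code: the function name `clean_text`, the tiny stop word list, and the OCR correction table are all hypothetical and chosen only for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list; real projects use longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

# Hypothetical OCR corrections (misread form -> intended word); illustrative only.
OCR_FIXES = {"tbe": "the", "aud": "and"}


def clean_text(text, stop_words=STOP_WORDS, ocr_fixes=OCR_FIXES):
    """Lower-case, strip punctuation, correct known OCR errors, drop stop words."""
    text = text.lower()
    # Replace punctuation with spaces so words joined by it stay separate.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    tokens = text.split()
    # Correct OCR errors before stop word removal so fixed words can be filtered too.
    tokens = [ocr_fixes.get(tok, tok) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in stop_words]
    return " ".join(tokens)


sample = "Tbe Ship arrived, aud the crew went ashore."
print(clean_text(sample))  # prints: ship arrived crew went ashore
```

The order of the steps matters: here OCR corrections run before stop word removal, so a misread "tbe" is first repaired to "the" and then filtered out.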

The lab includes help guides for preprocessing in both video and written formats. You can save your cleaning processes and reapply them to your curated content sets. Because you can upload your own content sets to the DSL, you can also apply these rules to your own data. This kind of cleanup work is much easier with pre-written scripts and processes than when starting from scratch.

Some tasks will require more iteration than others, so experimentation is encouraged. The lab encourages benchmarking and observing what different scripts and processes do, and running the same script more than once can sometimes improve accuracy further.
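One simple way to observe what each step does is to count distinct tokens after every stage. The Python sketch below is an assumption about how you might do this on your own machine, not a feature of the DSL; `report_steps` and the three example steps are hypothetical names.

```python
def report_steps(text, steps):
    """Apply each (name, function) step in order; record vocabulary size after each."""
    tokens = text.split()
    report = [("original", len(set(tokens)))]
    for name, step in steps:
        tokens = step(tokens)
        report.append((name, len(set(tokens))))
    return report


# Hypothetical cleaning steps; each takes and returns a list of tokens.
steps = [
    ("lowercase", lambda toks: [t.lower() for t in toks]),
    ("strip punctuation", lambda toks: [t.strip(".,;:!?") for t in toks]),
    ("remove stop words", lambda toks: [t for t in toks if t not in {"the", "a", "of"}]),
]

sample = "The ship. the Ship, a ship of the line"
for name, size in report_steps(sample, steps):
    print(f"{name}: {size} distinct tokens")
```

A large drop after one step shows where most of the variation in the text came from, which can help you decide whether that step is worth keeping or needs adjusting.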

For more information on text preprocessing and cleaning, see the Introduction to Text Analysis guide.