Text Analysis: Introduction

What is Text Analysis?

Text analysis and text mining are blanket terms for analyzing large numbers of documents (books, tweets, news reports, etc.) at scale and with the aid of computers. The process is linguistic in practice, as it deals specifically with language, context, and patterns. Text analysis is not a discipline in its own right but a methodological approach that intersects with many disciplines, including but certainly not limited to sociology, communications, history, literary study, math, logic, cognitive science, and computer science. Text analysis is performed on corpora: collections of machine-readable text that are designed to answer specific kinds of questions. This guide introduces the theoretical perspectives behind different kinds of text mining, along with the software and methods involved, with a particular emphasis on corpus methods.

What are corpora?

 
A corpus is a collection of machine-readable texts (often in plain-text or XML format). The plural of corpus is corpora, pronounced with the stress on the first syllable (COR-por-ah). Anyone selecting texts for inclusion in a corpus should be able to answer the questions "what texts are included and excluded?" and "why these texts?"
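
To make "machine-readable" concrete, here is a minimal sketch in Python of reading a folder of plain-text files as a small corpus and counting its words. The function names load_corpus and word_frequencies are hypothetical names chosen for illustration, and the tokenizer (lowercase runs of letters and apostrophes) is a deliberately crude assumption rather than a recommended method.

```python
import re
from collections import Counter
from pathlib import Path


def load_corpus(directory):
    """Read every .txt file in `directory` into a {filename: text} dict.

    Hypothetical helper: assumes one UTF-8 plain-text file per document.
    """
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }


def word_frequencies(corpus):
    """Count word tokens across all texts in the corpus.

    Assumption: a "word" is a run of ASCII letters or apostrophes,
    after lowercasing. Real projects usually need a better tokenizer.
    """
    counts = Counter()
    for text in corpus.values():
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

Calling word_frequencies(load_corpus("my_texts")), where "my_texts" stands in for a folder of your own files, returns a Counter whose most_common(10) method lists the ten most frequent words. Note that the glob pattern is itself a small answer to "what texts are included and excluded?": only files ending in .txt are read.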
 
Variables such as age, gender, location, and social background can inform how a corpus is constructed; genre, form, and style need to be considered as well. In some cases, issues of dialect can arise too. Richard Xiao's list, part of the University of Lancaster's Corpus Survey [link], sets out commonly used ways to describe corpora, with examples of each.
 
A corpus can cover: