Research and Teaching Notebook

Data sets or corpora potentially available for a licensing fee

This list is a work in progress: a running document of corpora and datasets that are available through a membership or licensing fee, along with entities I know to have licensed data on an ad hoc basis.

1. BYU Corpora: https://www.english-corpora.org/corpora.asp

Online interfaces and downloadable data if licensed. Includes:

- iWeb: The Intelligent Web-based Corpus
- News on the Web (NOW)
- Global Web-Based English (GloWbE)
- Wikipedia Corpus
- Corpus of Contemporary American English (COCA)
- Coronavirus Corpus
- Corpus of Historical American English (COHA)
- The TV Corpus
- The Movie Corpus
- Corpus of American Soap Operas
- Hansard Corpus
- Early English Books Online
- Corpus of US Supreme Court Opinions
- TIME Magazine Corpus
- British National Corpus (BNC) *
- Strathy Corpus (Canada)
- CORE Corpus
- American English from Google Books n-grams
- British English from Google Books n-grams

2. Linguistic Data Consortium: https://catalog.ldc.upenn.edu/

Extensive catalog of corpora. Membership fee and additional licensing fees for some corpora. Their top ten corpora list includes:

- OntoNotes Release 5.0
- TIMIT Acoustic-Phonetic Continuous Speech Corpus
- Web 1T 5-gram Version 1
- CELEX2
- Treebank-3
- The New York Times Annotated Corpus
- TIDIGITS
- Switchboard-1 Release 2
- ACE 2005 Multilingual Training Corpus
- English Gigaword Fifth Edition

3. LIWC: http://liwc.wpengine.com/

LIWC (Linguistic Inquiry and Word Count) is a text analysis program available for purchase. It calculates the degree to which various categories of words are used in a text, and can process texts ranging from e-mails to speeches, poems and transcribed natural language in either plain text or Word formats.
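The kind of category counting described above can be illustrated with a minimal sketch. This is not LIWC's implementation: the real dictionaries are proprietary and much larger, so the category names and word lists here (`TOY_CATEGORIES`) and the function `category_percentages` are hypothetical stand-ins, and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import Counter

# Hypothetical toy categories for illustration only; these are NOT
# the LIWC dictionaries, which are proprietary and far more extensive.
TOY_CATEGORIES = {
    "positive_emotion": {"happy", "love", "good"},
    "negative_emotion": {"sad", "hate", "bad"},
    "first_person": {"i", "me", "my"},
}


def category_percentages(text, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of tokens that belong to it."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}

    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1

    # Express each count as a share of all tokens in the text.
    return {name: 100.0 * counts[name] / total for name in categories}


if __name__ == "__main__":
    print(category_percentages("I love my good dog"))
```

A word can count toward more than one category in this sketch, which is a design choice made here for simplicity rather than a claim about how LIWC handles overlaps.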

4. New York Times:

Article metadata available through their API. Some full text for recent articles as well. Will license OCR data (XML) for a negotiated fee.
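As a rough sketch of what pulling article metadata might look like, the snippet below builds a request URL and picks a few fields out of a parsed JSON response. It assumes the Article Search endpoint and the `q`, `page`, and `api-key` parameters as I understand them from the public developer documentation, and it assumes a response shaped as `response.docs` with `headline.main`, `pub_date`, and `web_url` fields; check the current documentation before relying on any of it. The function names are my own, and no network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Article Search API; verify against the current docs.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build a search URL for a query string (hypothetical helper)."""
    params = {"q": query, "page": page, "api-key": api_key}
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


def extract_metadata(payload):
    """Pull headline, date, and URL from an already-parsed JSON response.

    The field names are assumptions about the response shape; missing
    fields come back as None rather than raising an error.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": doc.get("headline", {}).get("main"),
            "pub_date": doc.get("pub_date"),
            "web_url": doc.get("web_url"),
        }
        for doc in docs
    ]


if __name__ == "__main__":
    # "YOUR_KEY" is a placeholder; a real key comes from a developer account.
    print(build_search_url("whale oil", "YOUR_KEY"))
```

Keeping URL construction separate from response parsing means the parsing step can be tried on a saved response without spending API quota.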

5. ProQuest:

Numerous collections available via library subscription. Will license metadata, OCR data (XML), and article scans (PDF) for a negotiated fee. I did this with the American Periodicals Series and they shipped a multi-terabyte hard drive.

6. EBSCO/Gale:

I haven't licensed any data from EBSCO but, from what I can gather, their process is very similar to ProQuest's.

Last Updated:

March 03, 2021

Tags:

data Corpora