This list is a work in progress: a running document of corpora and datasets that are available through a membership or licensing fee, along with entities I know to have licensed data on an ad hoc basis.
1. BYU Corpora: https://www.english-corpora.org/corpora.asp
Online interfaces and downloadable data if licensed. Includes:
- iWeb: The Intelligent Web-based Corpus
- News on the Web (NOW)
- Global Web-Based English (GloWbE)
- Wikipedia Corpus
- Corpus of Contemporary American English (COCA)
- Coronavirus Corpus
- Corpus of Historical American English (COHA)
- The TV Corpus
- The Movie Corpus
- Corpus of American Soap Operas
- Hansard Corpus
- Early English Books Online
- Corpus of US Supreme Court Opinions
- TIME Magazine Corpus
- British National Corpus (BNC) *
- Strathy Corpus (Canada)
- CORE Corpus
- American English from Google Books n-grams
- British English from Google Books n-grams
2. Linguistic Data Consortium: https://catalog.ldc.upenn.edu/
Extensive catalog of corpora. Membership fee and additional licensing fees for some corpora. Their top ten corpora list includes:
- OntoNotes Release 5.0
- TIMIT Acoustic-Phonetic Continuous Speech Corpus
- Web 1T 5-gram Version 1
- CELEX2
- Treebank-3
- The New York Times Annotated Corpus
- TIDIGITS
- Switchboard-1 Release 2
- ACE 2005 Multilingual Training Corpus
- English Gigaword Fifth Edition
3. LIWC: http://liwc.wpengine.com/
LIWC (Linguistic Inquiry and Word Count) is a text analysis program available for purchase. It calculates the degree to which various categories of words are used in a text, and can process texts ranging from e-mails to speeches, poems and transcribed natural language in either plain text or Word formats.
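To make the "degree to which various categories of words are used" idea concrete, here is a minimal sketch of dictionary-based category counting. It is not LIWC's actual implementation: the two-category dictionary, the tokenizer, and the function name `category_percentages` are all hypothetical stand-ins, and the real LIWC dictionaries are proprietary and much larger.

```python
import re
from collections import Counter

# Hypothetical toy dictionary for illustration only; not LIWC's categories.
CATEGORIES = {
    "positive": {"good", "happy", "love"},
    "negative": {"bad", "sad", "hate"},
}


def category_percentages(text, categories=CATEGORIES):
    """Return, per category, the percentage of tokens in `text` that match it."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in categories}

    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1

    # Express each category count as a share of all tokens in the text.
    return {name: 100.0 * counts[name] / total for name in categories}


if __name__ == "__main__":
    print(category_percentages("I love good days but hate bad news"))
```

Scores are expressed as a percentage of total tokens so that texts of different lengths can be compared.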
4. New York Times:
Article metadata available through their API. Some full text for recent articles as well. Will license OCR data (XML) for a negotiated fee.
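As a rough sketch of what pulling article metadata might look like, the snippet below builds a request URL for the NYT Article Search API and extracts a few fields from a response. The endpoint and the `q`, `begin_date`, `end_date`, `page`, and `api-key` parameters reflect my understanding of the public API and should be checked against the current developer documentation. The helper names and the assumed response layout (`response.docs[*].headline.main`, `pub_date`, `web_url`) are illustrative, and `YOUR_KEY` is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Article Search API; verify against NYT's docs.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, begin_date=None, end_date=None, page=0):
    """Build a search URL. Dates are assumed to be YYYYMMDD strings."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return f"{BASE_URL}?{urlencode(params)}"


def extract_metadata(payload):
    """Pull headline, date, and URL from a decoded JSON response (assumed shape)."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "headline": (doc.get("headline") or {}).get("main"),
            "pub_date": doc.get("pub_date"),
            "url": doc.get("web_url"),
        }
        for doc in docs
    ]


if __name__ == "__main__":
    # No network call is made here; fetch the URL with any HTTP client.
    print(build_search_url("corpus linguistics", "YOUR_KEY", begin_date="20200101"))
```

Keeping URL construction separate from parsing lets both pieces be checked offline before spending any of the API's rate-limited requests.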
5. ProQuest:
Numerous collections available via library subscription. Will license metadata, OCR data (XML), and article scans (PDF) for a negotiated fee. I did this with the American Periodicals Series and they shipped a multi-terabyte hard drive.
6. EBSCO/Gale:
I haven't licensed any data from EBSCO but, from what I can gather, their processes are very similar to ProQuest's.