Centre for Digital Humanities

Corpora

There are many publicly available text corpora. The list below is not exhaustive, but is a good starting point.

A selection of corpora

If you are looking for digitized source materials, the Utrecht University Library is a good place to start. With a vast collection of licensed e-books, digital text corpora, and access to platforms facilitating research on diverse digital text and (audio) visual corpora, the library can provide valuable support.

  • Delpher has more than 130 million pages from Dutch newspapers, books and magazines.
  • The KB Lab hosts experimental tools and data built for and from the digital collection of the Koninklijke Bibliotheek.
  • Project Gutenberg is a library of over 60,000 free digitized books (eBooks) of the world’s great literature, with a focus on older works for which U.S. copyright has expired.
  • The online text exploration application I-Analyzer makes the following corpora accessible:
    • Digital Library for Dutch Literature (DBNL)
    • Financial reports of Dutch companies
    • Dutch Newspapers from the Royal Library: public dataset and full dataset (available upon request)
    • Eighteenth Century Collections Online (available for Utrecht University users)
    • Jewish Funerary Inscriptions
    • Book reviews from Goodreads
    • The Guardian-Observer newspaper archives (available for Utrecht University users)
    • 19th century UK Periodicals (available for Utrecht University users)
    • Dutch court rulings
    • Times newspaper archives (available for Utrecht University users)
    • Dutch monarchs’ speeches
    • Dutch parliamentary debates
  • Via Yoda, the research data management service of Utrecht University, the following files of raw data are available:
    • Eighteenth Century Collections Online
    • Guardian & Observer (1791-1909 and 1910-2003)
    • Nineteenth Century U.K. Periodicals, Module 1
    • Times Digital Archives (1785-2011)
    • Times Literary Supplement (1902-2014)
  • Gale Literary Sources is a platform for databases such as Dictionary of Literary Biography Complete Online, Literature Criticism Online, Contemporary Authors Online, Literature Resources Center, and LitFinder. Here you can find, among other things, full-text literary works, journal articles in the field of literary studies, reviews, and biographical articles.
  • Druid, provided by the Clariah Media Suite, hosts structured datasets from social and economic history and allows you to store, browse, query and visualize your Linked Data.
  • Our World in Data, a project of the Global Change Data Lab, is a collection of datasets that focuses on large global problems.
  • Clio Infra is a collection of interconnected datasets containing worldwide data on social, economic and institutional indicators for the past five centuries.
  • The International Institute of Social History (IISG) manages thousands of datasets in the field of social and economic history.
  • Openarchives contains data of Dutch and Belgian archives and societies.
  • The Golden Agents project is a research infrastructure that contains datasets and Linked Open Data on the long Golden Age of the Dutch Republic (ca. 1580–1750).
  • UNdata is a web-based data service that brings together international statistical databases, compiled by the United Nations statistical system and other international agencies.
  • The national data portal of the Dutch government (Het Nationale Dataportaal van de Nederlandse overheid) is a collection of datasets from Dutch government institutions.

Digitization service by Utrecht University Library

Do you require specific source materials for your research or teaching activities that are not yet available in digital format? The Utrecht University Library has its own ‘production line’ with three scanners and offers a digitization service through which an increasing number of texts are digitized upon request. To submit a digitization request, please reach out to one of the team members.

Other corpora

Do you need advice on where to find certain corpora? Feel free to reach out to the Digital Humanities Team at the Utrecht University Library. Do you want to build your own corpus for your research and need help with that? Visit our weekly DH walk-in hours to consult with a developer from the CDH Research Software Lab, or email cdh@uu.nl.