Centre for Digital Humanities

Events

CDH webinar: Beyond “Ctrl-F”: automating searches in large textual corpora

Event details

Date: 29 October 2021
Time: 15:00 - 16:15

In this presentation, Dr. Dirk van Miert (GKG) and Liliana Melgar (GKG) demonstrate, with practical examples and hands-on activities, how to search a large number of digitized texts simultaneously. Using regular expressions and tools such as Poppler and Python scripts, they show how scholars can move beyond manual, repetitive “Ctrl-F” searches in individual files.

Abstract

In the period 1400-1800, scholars and scientists often referred to the world of science, learning and scholarship as a ‘Respublica literaria’ or ‘Republic of Letters’. Erasmus popularized this concept, and it remained in use until the time of Kant. It is usually associated with the ideals of free exchange of knowledge, tolerance and egalitarianism: the Respublica literaria transcended political, geographical and religious borders in order to create a Europe-wide knowledge community. No wonder that modern historians of the period are in love with this concept, and use it eagerly to refer to the early modern world of knowledge.

But how widespread was the use of this actors' category? Was it used just as frequently in the 16th as in the 18th century? Did German, French, Dutch and Spanish scholars all inscribe themselves into this community? Did they all even use the term? These are among the questions that researchers in the ERC Consolidator project ‘Sharing Knowledge in Learned and Literary Networks – The Republic of Letters as a Pan-European Knowledge Society’ (SKILLNET) try to answer. Two challenges face them. First, the complexity of the Latin term ‘respublica literaria’, which occurs in 192 variations due to different spellings, declensions and word order. Second, the fact that the texts available for machine reading are quite ‘dirty’, containing many transcription errors.
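To give an idea of how a single regular expression can cover many such variations, here is a minimal Python sketch. It is an illustration only, not the pattern used in the SKILLNET project: the names NOUN, ADJ, PATTERN and find_variants are hypothetical, and the pattern covers only singular case forms, the noun written as one or two words, ‘literaria’ with one or two t's, and both word orders.

```python
import re

# Noun in the singular, written as one word or two:
# respublica / res publica, reipublicae, rempublicam, republica.
# The endings "æ" and "e" allow for early modern spellings of "ae".
NOUN = r"r(?:es|ei|em|e)[\s-]*public(?:ae|am|a|æ|e)"

# Adjective with one or two t's: literaria, litterariae, literariam, ...
ADJ = r"lit{1,2}erari(?:ae|am|a|æ|e)"

# Accept both word orders: noun + adjective or adjective + noun.
PATTERN = re.compile(
    rf"\b(?:{NOUN}\s+{ADJ}|{ADJ}\s+{NOUN})\b",
    re.IGNORECASE,
)


def find_variants(text):
    """Return every variant of the term found in text, in order of appearance."""
    return [match.group(0) for match in PATTERN.finditer(text)]


sample = (
    "In republica literaria et Reipublicae litterariae, "
    "litterariam rempublicam; res publica sola."
)
print(find_variants(sample))
```

The final ‘res publica’ in the sample is not reported, because the pattern requires the adjective to stand directly next to the noun.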

In this presentation, project leader Dirk van Miert and data specialist Liliana Melgar share some solutions to overcome these challenges. They demonstrate, with practical examples and hands-on activities, how to search a large number of digitized texts (captured from Google Books and other online sources) simultaneously. Using regular expressions and tools such as Poppler and Python scripts, they show how scholars can move beyond manual, repetitive “Ctrl-F” searches in individual files. They also discuss the importance of collaboration and data preparation as an essential part of any research project that involves digital sources.
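As a rough idea of what searching many files at once can look like, here is a minimal Python sketch. It is not the presenters' own script. It assumes the texts have already been converted to plain-text .txt files in one folder (for example with pdftotext, one of the Poppler utilities), and the function name search_corpus and the folder layout are hypothetical.

```python
import re
from pathlib import Path


def search_corpus(folder, pattern):
    """Return {file name: [matched strings]} for every .txt file in folder."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps the search going if a file has bad bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        # Collapse line breaks and repeated spaces so that a phrase
        # split across two lines is still found.
        text = " ".join(text.split())
        found = [match.group(0) for match in regex.finditer(text)]
        if found:
            hits[path.name] = found
    return hits
```

A call such as search_corpus("corpus", r"republica\s+lit+eraria") would then list, per file, every place where that (deliberately simple) pattern matches, replacing one “Ctrl-F” search per file with a single pass over the whole folder.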