I-Analyzer
Many academic disciplines, spearheaded by, but not limited to, the Humanities and Social Sciences, have embraced digital technologies to process large amounts of text data. Text datasets can be used to quantify trends observed in close reading, or, conversely, to pinpoint sources which might be interesting for closer, manual analysis.
While many specialized software packages exist for tasks like syntactic analysis, topic modelling, or collocation highlighting, the Research Software Lab (in collaboration with Utrecht University Library) observed a gap when it comes to searching and filtering digitized text corpora, such as newspaper archives, prior to further analysis steps. Existing software to search and filter text corpora, such as Delpher, often focuses on specific datasets, which limits its general applicability.
I-Analyzer has been developed to bridge this gap. I-Analyzer allows searching and exploring text corpora, visualizing trends, and downloading tables of text and metadata for further analysis. I-Analyzer is open-source software and freely available.
Adding your own corpus to I-Analyzer
If you are interested in having your own corpus added to I-Analyzer, please contact us at cdh@uu.nl to explore the possibilities.
Available corpora in I-Analyzer
- U-Blad (Utrecht University newspaper) print editions, 1969-2010
- Dutch newspaper collection, Royal Library, 1600-1876
- The Dutch Throne Speech, 1814-2023
- Hebrew epigraph collection, 769-849
- Goodreads reviews of translated literary texts, 2007-2022
- Judicial system Netherlands (court rulings), 1900-2022
- Digital Library for Dutch Literature (DBNL), 1200-1890
- Dutch parliamentary debates (Eerste Kamer & Tweede Kamer), 1815-2022
Also available after login with a UU employee or student account (Solis-id):
Current projects
- Update of the Delpher newspaper corpus
- Improvement of the content and functionality of I-Analyzer, in collaboration with the Utrecht University Library (UBU)
The University Library has many different text corpora, archives and its own digitized material in-house. This material can be made findable, accessible, searchable, interoperable and reusable (FAIR) via I-Analyzer. This project aims to increase the digital accessibility of the UBU collection and also to enable modern forms of data-driven research with this material. We will do this by improving the delivery of UBU corpora, improving the functionality of I-Analyzer and adding more and diverse material to I-Analyzer. Furthermore, we want to invest in the accessibility of the material and the digital literacy of students and researchers by increasing the visibility, accessibility and user-friendliness of I-Analyzer.
This project has four sub-goals:
- Improving the pipeline for adding (UBU) material to I-Analyzer
- Adding new corpora to I-Analyzer
- Improving the visibility of I-Analyzer
- Expansion of I-Analyzer functionalities
A survey was recently published to gather user feedback on the current content and functionality of I-Analyzer, and to find out which features users would like to see added.
- TextMiNER
The TextMiNER project will enable a wide audience to browse and visualize a text corpus using Named Entity Recognition (NER). NER is a well-established technique; models are widely available and, for most use contexts, sufficiently accurate. We will provide I-Analyzer as an infrastructure in which named entities can be explored in various text corpora (newspapers, books, reviews) or in user-provided documents. Because it runs in a Docker environment, the application will be easy to run and customize. Text corpora can be analyzed with spaCy, and the discovered named entities are indexed in Elasticsearch.
For many researchers, tagging documents with NER and then analyzing a corpus by its named entities is so costly that the technique often falls outside the scope of their research projects. This project lowers the threshold for researchers from various disciplines to explore named entities in text corpora.
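To give a rough idea of what the step from tagging to indexing in TextMiNER might involve, here is a minimal Python sketch. It is not I-Analyzer's actual code. The function names, the index name `demo-corpus`, and the `content` and `entities` fields are all hypothetical, and `toy_nlp` is a stand-in tagger so the example runs without a trained model. The sketch only assumes that the tagger returns an object with an `ents` sequence whose items carry `text`, `label_`, `start_char` and `end_char`, as spaCy's entity spans do.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def extract_entities(nlp: Callable[[str], Any], text: str) -> List[Dict[str, Any]]:
    """Run a spaCy-style pipeline on `text` and return its entities as plain dicts."""
    doc = nlp(text)
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


def to_bulk_actions(
    index: str,
    documents: Iterable[Tuple[str, str]],
    nlp: Callable[[str], Any],
) -> Iterator[Dict[str, Any]]:
    """Yield one index action per (id, text) pair, with the entities attached.

    The field names inside "_source" are assumptions made for this sketch.
    """
    for doc_id, text in documents:
        yield {
            "_index": index,
            "_id": doc_id,
            "_source": {"content": text, "entities": extract_entities(nlp, text)},
        }


def toy_nlp(text: str) -> SimpleNamespace:
    """Stand-in tagger (hypothetical): labels every 'Utrecht' as a place (GPE)."""
    name = "Utrecht"
    ents = []
    start = text.find(name)
    while start != -1:
        ents.append(
            SimpleNamespace(
                text=name, label_="GPE", start_char=start, end_char=start + len(name)
            )
        )
        start = text.find(name, start + 1)
    return SimpleNamespace(ents=ents)


actions = list(
    to_bulk_actions("demo-corpus", [("doc-1", "The library is in Utrecht.")], toy_nlp)
)
print(actions[0]["_source"]["entities"])
```

With spaCy installed, `toy_nlp` could be replaced by a loaded pipeline, for example the result of `spacy.load(...)` for a model of your choice. The yielded dictionaries follow the action shape accepted by `elasticsearch.helpers.bulk`, so they could be passed to that helper together with a client connected to a running Elasticsearch instance.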