Keywords

  • Linguistic example sentences
  • Database

Short description

Excalibur (Example sentences Calibrated for Use in Research) is a project to build a database of linguistic example sentences. The CDH Research Software Lab is creating the database in collaboration with Maarten Schermer from Research Data Management Support.

The database will include example sentences, their translations (Dutch-English), and interlinear glosses. The project includes a pipeline for extracting, correcting, and annotating glosses from publications. It also aims to automatically generate translations and glosses for new, user-supplied example sentences.
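As an illustration of what one stored example might look like, here is a minimal sketch of a record. The class name ExampleSentence and all field names are hypothetical, chosen only to mirror the annotations and metadata this description mentions; they are not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleSentence:
    """One linguistic example with its annotations (hypothetical schema)."""

    text: str                             # example sentence, morphemes split by hyphens
    gloss: str                            # interlinear gloss, one item per word
    translation: str                      # free translation
    language: str                         # language of the example sentence
    source: Optional[str] = None          # publication the example was taken from
    page: Optional[int] = None            # page number in that publication
    example_number: Optional[str] = None  # example number in that publication
    judgment: str = ""                    # e.g. "*" for an ungrammatical example


# Illustrative record; the source metadata is left empty rather than invented.
example = ExampleSentence(
    text="Ik zie de hond-en",
    gloss="I see the dog-PL",
    translation="I see the dogs.",
    language="Dutch",
)
```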

This database of linguistic examples will:

  • Enable linguists to store their example sentence data in the database, which is part of the
    CLARIAH infrastructure, together with some crucial annotations (glosses, translation) and
    metadata (source, reference, page number, example number, language, phenomenon
    described, judgment of the data). The stored data are accessible to every other researcher
    via persistent identifiers (PIDs), which enables enhanced publications;
  • Enable linguists to search for relevant examples in this database based on words, glosses,
    grammatical codes, metadata, language, and linguistic phenomenon;
  • Assist linguists in preparing example sentences found in the system for use in their publications, in formats including MS Word, Google Docs, OpenOffice, LibreOffice, and LaTeX;
  • Enable linguists to automatically or semi-automatically extract example sentences from existing
    literature (in PDF or MS Word format) and store them in the example database;
  • Facilitate making a glossary of the grammatical codes used, and mapping them to an explicit
    semantics;
  • Facilitate performing checks on the proper use of grammatical codes;
  • Check proper alignment of the glossing.
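The last two checks in this list can be sketched in a few lines of Python. This is a minimal sketch, not the project's implementation. It assumes glosses follow the common convention of separating words with spaces and morphemes with hyphens, and that grammatical codes are written as runs of two or more capital letters. The GLOSSARY contents and the function names check_alignment and unknown_codes are hypothetical.

```python
import re

# Hypothetical glossary mapping grammatical codes to their meaning.
GLOSSARY = {"PL": "plural", "SG": "singular", "PST": "past tense", "DEF": "definite"}


def check_alignment(text: str, gloss: str) -> list[str]:
    """Return a list of alignment problems; an empty list means aligned."""
    words = text.split()
    items = gloss.split()
    if len(words) != len(items):
        return [f"word count mismatch: {len(words)} words, {len(items)} glosses"]
    problems = []
    for position, (word, item) in enumerate(zip(words, items), start=1):
        # Each word and its gloss should contain the same number of morphemes.
        if word.count("-") != item.count("-"):
            problems.append(
                f"item {position}: '{word}' and '{item}' differ in morpheme count"
            )
    return problems


def unknown_codes(gloss: str, glossary: dict[str, str]) -> set[str]:
    """Return grammatical codes used in the gloss but missing from the glossary."""
    # Assumption: a code is any run of two or more capital letters.
    codes = re.findall(r"[A-Z]{2,}", gloss)
    return {code for code in codes if code not in glossary}
```

For example, check_alignment("Ik zie de hond-en", "I see the dog-PL") returns an empty list, while the unsegmented "Ik zie de honden" with the same gloss is reported as a morpheme count mismatch at item 4. A real checker would also need to handle other separators, such as clitic boundaries and multi-word glosses joined by periods.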

The technologies used for this project include Python, machine translation, and automated part-of-speech (POS) tagging.

Background

This project aims to accelerate linguistic research by making publishing easier and making enhanced publications possible. The database will also enable the publication of data that have significant value but do not always end up in a publication.