Excalibur
Keywords
- Linguistic example sentences
- Database
Short description
Excalibur (Example sentences Calibrated for Use in Research) is a project to build a database of linguistic example sentences. The CDH Research Software Lab is creating the database in collaboration with Maarten Schermer from Research Data Management Support.
The database will include example sentences, their translations (Dutch-English), and their interlinear glosses. The project includes a pipeline for extracting, correcting, and annotating glosses from publications. It also aims to automatically generate translations and glosses for new, user-supplied example sentences.
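As a rough illustration of what one entry could hold, the sketch below models an example sentence with its gloss, translation, and metadata. The class name, field names, and sample values are all hypothetical and chosen for illustration only; they are not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleSentence:
    """One linguistic example (hypothetical structure, not the real schema)."""
    text: str         # the example sentence in the object language
    gloss: str        # interlinear gloss, one gloss per word
    translation: str  # free translation
    language: str     # language code of the example sentence
    metadata: dict[str, str] = field(default_factory=dict)


# Invented sample record for illustration.
example = ExampleSentence(
    text="Ik zie de honden",
    gloss="1SG see.PRS DEF dog.PL",
    translation="I see the dogs",
    language="nl",
    metadata={"judgment": "grammatical"},
)
```

Keeping the metadata in a free-form dictionary is only one option; a real schema would more likely use fixed, validated fields.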
This database of linguistic examples will:
- Enable linguists to store their example sentence data in the database, which is part of the CLARIAH infrastructure, together with some crucial annotations (glosses, translation) and metadata (source, reference, page number, example number, language, phenomenon described, judgment of the data), accessible to every other researcher via persistent identifiers (PIDs). This enables enhanced publications;
- Enable linguists to search for relevant examples in this database based on words, glosses, grammatical codes, metadata, language, and linguistic phenomenon;
- Assist linguists in preparing example sentences found in the system for their publications, in formats including MS Word, Google Docs, OpenOffice, LibreOffice, and LaTeX;
- Enable linguists to automatically or semi-automatically extract example sentences from existing literature (in PDF, MS Word) to store in the example database repository;
- Facilitate making a glossary of the grammatical codes used, and their mapping to an explicit semantics;
- Facilitate performing checks on the proper use of grammatical codes;
- Check proper alignment of the glossing.
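The last two checks in this list can be sketched in a few lines of Python. This is a minimal illustration and not the project's implementation. It assumes glosses are aligned word by word, that morphemes are separated by hyphens in both the sentence and the gloss, and that grammatical codes are written in capitals. The function names and the tiny glossary are hypothetical.

```python
import re

# Hypothetical glossary mapping grammatical codes to their meaning.
GLOSSARY = {
    "1SG": "first person singular",
    "PRS": "present tense",
    "DEF": "definite",
    "PL": "plural",
}


def check_alignment(text, gloss):
    """Return a list of alignment problems (empty if none were found)."""
    words = text.split()
    glosses = gloss.split()
    if len(words) != len(glosses):
        return [f"{len(words)} words but {len(glosses)} glosses"]
    problems = []
    for word, word_gloss in zip(words, glosses):
        # Each hyphen-separated morpheme should have its own gloss.
        if word.count("-") != word_gloss.count("-"):
            problems.append(f"morpheme mismatch: {word!r} vs {word_gloss!r}")
    return problems


def unknown_codes(gloss, glossary):
    """Return the capitalised codes in a gloss that are not in the glossary."""
    parts = re.split(r"[\s.\-=;:]+", gloss)
    return [p for p in parts if p.isupper() and p not in glossary]
```

For instance, `check_alignment("Ik zie de honden", "1SG see.PRS DEF")` reports a word count mismatch, and `unknown_codes("1SG see.PRS DEF dog.PLU", GLOSSARY)` flags `PLU` as a code missing from the glossary.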
The technologies used for this project include Python, machine translation, and automated part-of-speech (POS) tagging.
Background
This project aims to accelerate linguistic research by making publishing easier and by making enhanced publications possible. The database will also enable the publication of data that have significant value but cannot always be included in a publication.