mainlp/WikiDIR

WikiDIR

A significant amount of local and culture-specific knowledge can only be found in documents written in dialects, yet cross-dialect information retrieval (CDIR) has not been extensively studied. In this work, we propose WikiDIR, a retrieval dataset for German dialects. The dataset is described in detail in our paper "Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages" by Robert Litschko, Oliver Kraus, Verena Blaschke, and Barbara Plank.

Dataset

We make the WikiDIR dataset available on Hugging Face 🤗. Each folder (e.g. de.bar/) contains German queries, documents written in dialects, and relevance labels. Query and document IDs correspond to Wikipedia CurIDs, so you can use them to browse to the Wikipedia article from which they were extracted:

Queries: queries.jsonl

{"id": "3215", "contents": "München"}

Documents: docs.jsonl

{"id": "12259", "contents": "Minga Aussproch: [ˈmɪŋ(:)ɐ] is d'Haptstod vo Bayern. In da Umgebung (20–30 km) hoaßt ma s'Minga oda oft aa oafach d'Stod. In Minga sogt ma München. Minga is mid mehra wia 1,5 Milliona Eihwohna d'gresste Stod vo Bayern und hinta Berlin und Hamburg d'drittgresste Stod vo Deitschland ..."}

Relevance labels: {train, dev, test, analysis, analysis_variants}.jsonl

{
  "src_id": "3215",
  "src_query": "München",
  "tgt_results": [
    ["12259", 6], ["3154", 5], ["12198", 4], ["3966", 4], ["20246", 3], ...
  ]
}
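
The record layout above can be turned into a query-to-document relevance mapping with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: the helper names `parse_jsonl`, `build_qrels`, and `curid_url` are hypothetical, and the sample record is cut down to the first three labels shown above. The URL helper assumes the standard MediaWiki `?curid=` form and that the folder name `de.bar/` maps to the `de` (queries) and `bar` (documents) Wikipedia subdomains.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def build_qrels(label_records):
    """Map each query ID to a {document ID: relevance grade} dict."""
    return {
        rec["src_id"]: {doc_id: grade for doc_id, grade in rec["tgt_results"]}
        for rec in label_records
    }


def curid_url(wiki, cur_id):
    """Build an article URL from a CurID.

    Assumption: `wiki` is the Wikipedia subdomain, e.g. "de" or "bar".
    """
    return f"https://{wiki}.wikipedia.org/?curid={cur_id}"


# Sample label record, truncated to three of the labels shown above.
sample = (
    '{"src_id": "3215", "src_query": "München", '
    '"tgt_results": [["12259", 6], ["3154", 5], ["12198", 4]]}'
)

qrels = build_qrels(parse_jsonl(sample))
print(qrels["3215"])               # relevance grades for the query "München"
print(curid_url("de", "3215"))     # query article on the German Wikipedia
print(curid_url("bar", "12259"))   # document article on the Bavarian Wikipedia
```

To read a real split, pass the contents of a file such as `train.jsonl` to `parse_jsonl` in place of `sample`.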

Dialect variation dictionaries

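You can find our dialect dictionaries on Hugging Face 🤗. Each entry links a German article title to its dialect counterpart and lists spelling variants of the dialect title: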

{
  "de_id": "3215",
  "de_title": "München",
  "dial_id": "12259",
  "dial_title": "Minga",
  "variants": [
    "Münch'n",
    "Minkcha",
    "Minkn",
    "Minchn",
    "Mingna",
    "Minkhn",
    "Münchn"
  ]
}
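
One possible use of such an entry is a lookup from any dialect spelling back to the German article ID. The sketch below is illustrative only and is not the method from the paper: `build_variant_index` is a hypothetical helper, and it assumes the entries have already been loaded into a list of dicts with the fields shown above.

```python
def build_variant_index(entries):
    """Map each case-folded dialect spelling to the set of German article IDs."""
    index = {}
    for entry in entries:
        # Treat the dialect title itself as one more spelling variant.
        for form in [entry["dial_title"], *entry["variants"]]:
            index.setdefault(form.casefold(), set()).add(entry["de_id"])
    return index


# The single entry shown above, used as sample input.
entries = [
    {
        "de_id": "3215",
        "de_title": "München",
        "dial_id": "12259",
        "dial_title": "Minga",
        "variants": [
            "Münch'n", "Minkcha", "Minkn", "Minchn",
            "Mingna", "Minkhn", "Münchn",
        ],
    }
]

variant_index = build_variant_index(entries)
print(variant_index.get("minchn"))          # German article IDs for "Minchn"
print(variant_index.get("berlin", set()))   # unknown spellings give an empty set
```

Values are sets because, in principle, the same spelling could appear under more than one German article.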
