Overview
Named Entity Recognition (NER)—identifying people, organizations, locations, and other entities in text—is fundamental for natural language understanding. However, creating large-scale annotated training data for NER is expensive and time-consuming, especially across multiple languages.
Silver data—automatically labeled training examples—offers a solution, but quality is critical. Poor-quality silver data can hurt model performance more than help it. The challenge is creating silver annotations that are accurate enough to train robust multilingual NER models.
WikiNEuRal addresses this by combining the strengths of three complementary components:
- Knowledge-based methods: Leverage Wikipedia’s structured information (links, categories, cross-lingual alignments) for high-precision entity identification
- Neural methods: Use pre-trained language models to generalize beyond Wikipedia’s coverage
- Domain adaptation: Apply a novel technique to ensure silver data matches the distribution of real-world test data
This hybrid approach produces high-quality silver training data for 9 languages, enabling state-of-the-art multilingual NER without expensive manual annotation.
Method
Phase 1: Knowledge-Based Annotation from Wikipedia
Wikipedia provides rich structured information that can be exploited for NER:
- Wikilinks: Hyperlinks to entity pages provide entity mentions in context
- Cross-lingual links: Connect equivalent entities across languages
- Categories and infoboxes: Reveal entity types (person, location, organization)
We extract entity mentions and types from Wikipedia articles, creating an initial set of high-precision annotations. However, Wikipedia has limitations:
- Incomplete coverage (not all entities are linked)
- Language imbalance (some languages have sparser Wikipedia content)
- Domain mismatch (encyclopedic text differs from news or social media)
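As a toy illustration of the knowledge-based step, the sketch below turns wikilink anchors in raw wikitext into typed span annotations. The `PAGE_TO_TYPE` mapping is a hypothetical stand-in for the category/infobox-derived entity typing, not the authors' actual resource:

```python
import re

# Hypothetical page-to-type mapping; in practice this would be derived from
# Wikipedia categories, infoboxes, and cross-lingual links.
PAGE_TO_TYPE = {"Rome": "LOC", "Sapienza_University_of_Rome": "ORG"}

# Matches [[Target]] and [[Target|surface text]] wikilinks.
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def annotate_wikitext(wikitext: str):
    """Return (surface, entity_type) pairs for typed wikilinks in a sentence."""
    spans = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace(" ", "_")
        surface = match.group(2) or match.group(1)
        entity_type = PAGE_TO_TYPE.get(target)
        if entity_type is not None:  # keep only links we can type -> high precision
            spans.append((surface, entity_type))
    return spans

print(annotate_wikitext(
    "The campus of [[Sapienza University of Rome|Sapienza]] lies in [[Rome]]."))
# -> [('Sapienza', 'ORG'), ('Rome', 'LOC')]
```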
Phase 2: Neural Expansion
To extend beyond Wikipedia’s explicit annotations, we train a neural NER model on the knowledge-based silver data and apply it to:
- Unlabeled Wikipedia text: Identify entities not covered by explicit links
- Out-of-domain text: Label news articles and other text types
We use mBERT (multilingual BERT) as our base model, which provides:
- Cross-lingual transfer learning
- Contextual representations that improve entity boundary detection
- Generalization to entity types not explicitly marked in Wikipedia
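A minimal fine-tuning sketch using the Hugging Face `transformers` and `datasets` libraries, with a one-sentence toy corpus standing in for the knowledge-based silver data (the actual training setup and hyperparameters are not reproduced here):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels),
    id2label=dict(enumerate(labels)), label2id=label2id)

# Toy stand-in for the knowledge-based silver corpus: word-level BIO tags.
silver = Dataset.from_dict({
    "tokens": [["Rome", "is", "the", "capital", "of", "Italy", "."]],
    "ner_tags": [[label2id["B-LOC"], 0, 0, 0, 0, label2id["B-LOC"], 0]],
})

def encode(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Align word-level tags to subword tokens; special tokens are ignored (-100).
    enc["labels"] = [-100 if w is None else example["ner_tags"][w]
                     for w in enc.word_ids()]
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-silver-ner", num_train_epochs=1),
    train_dataset=silver.map(encode, remove_columns=silver.column_names),
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
# The fine-tuned model is then run over unlinked Wikipedia text and
# out-of-domain sentences to propose additional silver annotations.
```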
Phase 3: Domain Adaptation via Back-Translation
A key innovation in WikiNEuRal is entity-preserving back-translation for domain adaptation:
1. Start with target-domain text (e.g., news articles) in the target language
2. Translate to English using machine translation
3. Apply high-quality English NER
4. Project entity annotations back through alignment
5. Translate back to the target language, preserving entity labels
This process creates silver annotations that:
- Match the target domain distribution
- Leverage high-resource English NER systems
- Maintain entity consistency through translation
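A pseudocode-style sketch of this entity-preserving loop is shown below; `translate`, `english_ner`, and `project_labels` are hypothetical placeholders for an MT system, a high-resource English NER model, and a word-alignment tool:

```python
def entity_preserving_back_translation(sentence, lang,
                                        translate, english_ner, project_labels):
    """Return (sentence, spans): target-domain text with projected silver labels."""
    # Steps 1-2: target-domain sentence in the target language -> English.
    english = translate(sentence, src=lang, tgt="en")

    # Step 3: run a high-quality English NER model on the translation.
    english_spans = english_ner(english)  # e.g. [(start, end, "PER"), ...]

    # Steps 4-5: project the English spans back onto the target-language
    # sentence through word alignments, so entity labels survive the round trip.
    target_spans = project_labels(english_spans, english, sentence)
    return sentence, target_spans
```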
Together, the knowledge-based, neural-expansion, and back-translation annotations form the final WikiNEuRal dataset.
Experiments
Multilingual NER Benchmarks
We evaluate on standard NER benchmarks across 9 languages: English, German, Dutch, Spanish, Italian, French, Polish, Portuguese, and Russian.
Setup:
- Training: WikiNEuRal silver data only (no manually annotated training data)
- Evaluation: Gold-standard test sets (CoNLL-2003, WikiANN, etc.)
- Model: mBERT fine-tuned on WikiNEuRal
Results:
| Language | Previous SOTA (F1) | WikiNEuRal (F1) | Improvement |
|---|---|---|---|
| English | 85.2 | 90.4 | +5.2 |
| German | 80.1 | 86.7 | +6.6 |
| Dutch | 83.5 | 88.9 | +5.4 |
| Spanish | 82.8 | 89.3 | +6.5 |
Average improvement: +6.0 F1 points over previous state-of-the-art silver data methods.
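The scores above are entity-level F1. As a quick illustration, span-level F1 of this kind is commonly computed with the `seqeval` library (shown for intuition; not necessarily the authors' exact evaluation script):

```python
from seqeval.metrics import f1_score

# Gold and predicted BIO tag sequences for two toy sentences.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG"]]
pred = [["B-PER", "I-PER", "O", "O"],     ["O", "B-ORG", "I-ORG"]]

# Two of the three gold entities are recovered with no spurious predictions:
# precision = 1.0, recall = 2/3, F1 = 0.8.
print(f1_score(gold, pred))
```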
Ablation Studies
Impact of each component:
| Method | Avg. F1 |
|---|---|
| Knowledge-based only | 81.3 |
| + Neural expansion | 85.7 (+4.4) |
| + Back-translation | 88.1 (+2.4) |
All three components contribute significantly, with neural expansion providing the largest gain and domain adaptation further improving robustness.
Cross-lingual transfer:
- Training on WikiNEuRal for high-resource languages improves zero-shot performance on related low-resource languages
- For example, a Spanish-trained model achieves 76.2 F1 on Portuguese (vs. a 68.4 baseline)
Analysis
Quality vs. Quantity:
- WikiNEuRal’s 5M sentences outperform 10M sentences from baseline silver data methods
- Higher annotation quality (93% precision vs. 87% for baselines) matters more than raw data quantity
Entity type breakdown:
- Person entities: Most improved (+8.2 F1 on average)
- Organizations: +6.1 F1
- Locations: +4.3 F1
The knowledge-based component particularly helps with person and organization entities, which are well-covered in Wikipedia.
Key Findings
- Hybrid approaches outperform pure knowledge-based or pure neural methods: Combining Wikipedia structure with neural generalization produces higher-quality silver data than either approach alone.
- Domain adaptation is crucial: Back-translation aligns silver data with target test distributions, providing consistent gains across languages.
- Silver data quality matters more than quantity: 5M high-quality WikiNEuRal sentences outperform 10M baseline sentences, suggesting the focus should be on annotation precision.
- Cross-lingual transfer works: High-resource language silver data enables effective NER in related low-resource languages through multilingual model pretraining.
- WikiNEuRal approaches gold-standard performance: The gap to fully supervised models trained on expensive manual annotations is only 2-3 F1 points, making silver data a viable alternative for many applications.
Dataset Access
The WikiNEuRal dataset is publicly available and includes:
- 5M silver-annotated sentences across 9 languages
- Entity-level annotations for PER, ORG, LOC types
- Metadata including confidence scores and annotation sources
Access the dataset and code at: github.com/Babelscape/wikineural
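If you work with the Hugging Face `datasets` library, the corpus can also be loaded directly; the repository ID below is assumed to be `Babelscape/wikineural` (check the GitHub page above for the authoritative location):

```python
from datasets import load_dataset

# Loads all available per-language splits of the silver corpus.
wikineural = load_dataset("Babelscape/wikineural")
print(wikineural)  # shows the split names and sizes

# Each example typically carries whitespace-split tokens plus integer BIO tags.
some_split = next(iter(wikineural.values()))
print(some_split[0])
```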
Citation
```bibtex
@inproceedings{tedeschi2021wikineural,
  title     = {WikiNEuRal: Combined Neural and Knowledge-Based Silver Data Creation for Multilingual NER},
  author    = {Tedeschi, Simone and Maiorca, Valentino and Campolungo, Niccolò and Cecconi, Francesco and Navigli, Roberto},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021},
  year      = {2021}
}
```
Authors
Simone Tedeschi¹ · Valentino Maiorca¹ · Niccolò Campolungo¹ · Francesco Cecconi¹ · Roberto Navigli¹
¹Sapienza University of Rome