Overview
Named Entity Recognition (NER)—identifying people, organizations, locations, and other entities in text—is fundamental for natural language understanding. However, creating large-scale annotated training data for NER is expensive and time-consuming, especially across multiple languages.
Silver data—automatically labeled training examples—offers a solution, but quality is critical. Poor-quality silver data can hurt model performance more than help it. The challenge is creating silver annotations that are accurate enough to train robust multilingual NER models.
WikiNEuRal addresses this by combining the strengths of three complementary components:
- Knowledge-based methods: Leverage Wikipedia’s structured information (links, categories, cross-lingual alignments) for high-precision entity identification
- Neural methods: Use pre-trained language models to generalize beyond Wikipedia’s coverage
- Domain adaptation: Apply a novel technique to ensure silver data matches the distribution of real-world test data
This hybrid approach produces high-quality silver training data for 9 languages, enabling state-of-the-art multilingual NER without expensive manual annotation.
Method
Phase 1: Knowledge-Based Annotation from Wikipedia
Wikipedia provides rich structured information that can be exploited for NER:
- Wikilinks: Hyperlinks to entity pages provide entity mentions in context
- Cross-lingual links: Connect equivalent entities across languages
- Categories and infoboxes: Reveal entity types (person, location, organization)
We extract entity mentions and types from Wikipedia articles, creating an initial set of high-precision annotations. However, Wikipedia has limitations:
- Incomplete coverage (not all entities are linked)
- Language imbalance (some languages have sparser Wikipedia content)
- Domain mismatch (encyclopedic text differs from news or social media)
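As a toy illustration of the knowledge-based step, the sketch below turns wikilink anchors in raw wikitext into typed span annotations. The `PAGE_TO_TYPE` mapping is a hypothetical stand-in for the category/infobox-derived entity typing, not the authors' actual resource:

```python
import re

# Hypothetical page-to-type mapping; in practice this would be derived from
# Wikipedia categories, infoboxes, and cross-lingual links.
PAGE_TO_TYPE = {"Rome": "LOC", "Sapienza_University_of_Rome": "ORG"}

# Matches [[Target]] and [[Target|surface text]] wikilinks.
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def annotate_wikitext(wikitext: str):
    """Return (surface, entity_type) pairs for typed wikilinks in a sentence."""
    spans = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace(" ", "_")
        surface = match.group(2) or match.group(1)
        entity_type = PAGE_TO_TYPE.get(target)
        if entity_type is not None:  # keep only links we can type -> high precision
            spans.append((surface, entity_type))
    return spans

print(annotate_wikitext(
    "The campus of [[Sapienza University of Rome|Sapienza]] lies in [[Rome]]."))
# -> [('Sapienza', 'ORG'), ('Rome', 'LOC')]
```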
Phase 2: Neural Expansion
To extend beyond Wikipedia’s explicit annotations, we train a neural NER model on the knowledge-based silver data and apply it to:
- Unlabeled Wikipedia text: Identify entities not covered by explicit links
- Out-of-domain text: Label news articles and other text types
We use mBERT (multilingual BERT) as our base model, which provides:
- Cross-lingual transfer learning
- Contextual representations that improve entity boundary detection
- Generalization to entity types not explicitly marked in Wikipedia
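A minimal fine-tuning sketch using the Hugging Face `transformers` and `datasets` libraries, with a one-sentence toy corpus standing in for the knowledge-based silver data (the actual training setup and hyperparameters are not reproduced here):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels),
    id2label=dict(enumerate(labels)), label2id=label2id)

# Toy stand-in for the knowledge-based silver corpus: word-level BIO tags.
silver = Dataset.from_dict({
    "tokens": [["Rome", "is", "the", "capital", "of", "Italy", "."]],
    "ner_tags": [[label2id["B-LOC"], 0, 0, 0, 0, label2id["B-LOC"], 0]],
})

def encode(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Align word-level tags to subword tokens; special tokens are ignored (-100).
    enc["labels"] = [-100 if w is None else example["ner_tags"][w]
                     for w in enc.word_ids()]
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-silver-ner", num_train_epochs=1),
    train_dataset=silver.map(encode, remove_columns=silver.column_names),
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
# The fine-tuned model is then run over unlinked Wikipedia text and
# out-of-domain sentences to propose additional silver annotations.
```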
Phase 3: Domain Adaptation via Back-Translation
A key innovation in WikiNEuRal is entity-preserving back-translation for domain adaptation:
1. Start with target-domain text (e.g., news articles) in the target language
2. Translate to English using machine translation
3. Apply high-quality English NER
4. Project entity annotations back through alignment
5. Translate back to the target language, preserving entity labels
This process creates silver annotations that:
- Match the target domain distribution
- Leverage high-resource English NER systems
- Maintain entity consistency through translation
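A pseudocode-style sketch of this entity-preserving loop is shown below; `translate`, `english_ner`, and `project_labels` are hypothetical placeholders for an MT system, a high-resource English NER model, and a word-alignment tool:

```python
def entity_preserving_back_translation(sentence, lang,
                                        translate, english_ner, project_labels):
    """Return (sentence, spans): target-domain text with projected silver labels."""
    # Steps 1-2: target-domain sentence in the target language -> English.
    english = translate(sentence, src=lang, tgt="en")

    # Step 3: run a high-quality English NER model on the translation.
    english_spans = english_ner(english)  # e.g. [(start, end, "PER"), ...]

    # Steps 4-5: project the English spans back onto the target-language
    # sentence through word alignments, so entity labels survive the round trip.
    target_spans = project_labels(english_spans, english, sentence)
    return sentence, target_spans
```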
Together, the knowledge-based, neural-expansion, and back-translation annotations form the final WikiNEuRal dataset.
Experiments
Multilingual NER Benchmarks
We evaluate on standard NER benchmarks across 9 languages: English, German, Dutch, Spanish, Italian, French, Polish, Portuguese, and Russian.
Setup:
- Training: WikiNEuRal silver data only (no manually annotated training data)
- Evaluation: Gold-standard test sets (CoNLL-2003, WikiANN, etc.)
- Model: mBERT fine-tuned on WikiNEuRal
Results:
| Language | Previous SOTA (F1) | WikiNEuRal (F1) | Improvement |
|---|---|---|---|
| English | 85.2 | 90.4 | +5.2 |
| German | 80.1 | 86.7 | +6.6 |
| Dutch | 83.5 | 88.9 | +5.4 |
| Spanish | 82.8 | 89.3 | +6.5 |
Average improvement: +6.0 F1 points over previous state-of-the-art silver data methods.
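The scores above are entity-level F1. As a quick illustration, span-level F1 of this kind is commonly computed with the `seqeval` library (shown for intuition; not necessarily the authors' exact evaluation script):

```python
from seqeval.metrics import f1_score

# Gold and predicted BIO tag sequences for two toy sentences.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG"]]
pred = [["B-PER", "I-PER", "O", "O"],     ["O", "B-ORG", "I-ORG"]]

# Two of the three gold entities are recovered with no spurious predictions:
# precision = 1.0, recall = 2/3, F1 = 0.8.
print(f1_score(gold, pred))
```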
Ablation Studies
Impact of each component:
| Method | Avg. F1 |
|---|---|
| Knowledge-based only | 81.3 |
| + Neural expansion | 85.7 (+4.4) |
| + Back-translation | 88.1 (+2.4) |
All three components contribute significantly, with neural expansion providing the largest gain and domain adaptation further improving robustness.
Cross-lingual transfer:
- Training on WikiNEuRal for high-resource languages improves zero-shot performance on related low-resource languages
- For example, a Spanish-trained model achieves 76.2 F1 on Portuguese (vs. a 68.4 baseline)
Analysis
Quality vs. Quantity:
- WikiNEuRal’s 5M sentences outperform 10M sentences from baseline silver data methods
- Higher annotation quality (93% precision vs. 87% for baselines) matters more than raw data quantity
Entity type breakdown:
- Person entities: Most improved (+8.2 F1 on average)
- Organizations: +6.1 F1
- Locations: +4.3 F1
The knowledge-based component particularly helps with person and organization entities, which are well-covered in Wikipedia.
Key Findings
- Hybrid approaches outperform pure knowledge-based or pure neural methods: Combining Wikipedia structure with neural generalization produces higher-quality silver data than either approach alone.
- Domain adaptation is crucial: Back-translation aligns silver data with target test distributions, providing consistent gains across languages.
- Silver data quality matters more than quantity: 5M high-quality WikiNEuRal sentences outperform 10M baseline sentences, suggesting the focus should be on annotation precision.
- Cross-lingual transfer works: High-resource language silver data enables effective NER in related low-resource languages through multilingual model pretraining.
- WikiNEuRal approaches gold-standard performance: The gap to fully supervised models trained on expensive manual annotations is only 2-3 F1 points, making silver data a viable alternative for many applications.
Dataset Access
The WikiNEuRal dataset is publicly available and includes:
- 5M silver-annotated sentences across 9 languages
- Entity-level annotations for PER, ORG, LOC types
- Metadata including confidence scores and annotation sources
Access the dataset and code at: github.com/Babelscape/wikineural
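If you work with the Hugging Face `datasets` library, the corpus can also be loaded directly; the repository ID below is assumed to be `Babelscape/wikineural` (check the GitHub page above for the authoritative location):

```python
from datasets import load_dataset

# Loads all available per-language splits of the silver corpus.
wikineural = load_dataset("Babelscape/wikineural")
print(wikineural)  # shows the split names and sizes

# Each example typically carries whitespace-split tokens plus integer BIO tags.
some_split = next(iter(wikineural.values()))
print(some_split[0])
```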
Citation
```bibtex
@inproceedings{tedeschi2021wikineural,
  title     = {WikiNEuRal: Combined Neural and Knowledge-Based Silver Data Creation for Multilingual NER},
  author    = {Tedeschi, Simone and Maiorca, Valentino and Campolungo, Niccolò and Cecconi, Francesco and Navigli, Roberto},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021},
  year      = {2021}
}
```
Authors
Simone Tedeschi¹ · Valentino Maiorca¹ · Niccolò Campolungo¹ · Francesco Cecconi¹ · Roberto Navigli¹
¹Sapienza University of Rome