Add Wiktionary dump download and translation extraction capabilities by axif0 · Pull Request #666 · scribe-org/Scribe-Data

axif0 · 2026-03-03T22:25:14Z

Contributor checklist

This pull request is on a separate branch and not the main branch
I have tested my code with the pytest command as directed in the testing section of the contributing guide

Description

The Scribe-Data CLI now supports downloading and extracting translations directly from Wiktionary XML dumps to avoid Wikidata rate limits. Users can download dumps with scribe-data download --wiktionary-dump and extract translations using scribe-data get -dt translations -wtp enwiktionary. Additionally, the interactive mode (scribe-data interactive) has been updated to include guided flows for both downloading dumps and configuring translation extraction.

scribe-data download --wiktionary-dump   # English Wiktionary dump
scribe-data download --wiktionary-dump --language de    # specific language's Wiktionary dump

scribe-data g -dt translations -lang de -wtp enwiktionary  # Extract translations for a specific language
scribe-data g -dt translations -wtp enwiktionary  # Extract translations for ALL supported languages
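
For readers unfamiliar with the dump format, here is a minimal sketch of the general technique (stream the compressed XML page by page, then pull translation templates out of each page's wikitext). It is not the code in this PR: iter_pages, extract_translations, and the demo data are hypothetical names made up for illustration, and the regex only covers the common {{t|...}} / {{t+|...}} templates (plus their check variants) used on English Wiktionary.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET

# Matches {{t|de|Buch}}, {{t+|fr|livre|m}} and the t-check / t+check variants.
TRANSLATION_TEMPLATE = re.compile(
    r"\{\{t(?:\+check|-check|\+)?\|([^|{}]+)\|([^|{}]+)"
)


def iter_pages(handle):
    """Yield (title, wikitext) pairs from an open MediaWiki XML export stream."""
    title, text = None, ""
    for _, elem in ET.iterparse(handle, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page; real dumps are very large


def extract_translations(wikitext):
    """Group translation targets by language code for one page's wikitext."""
    translations = {}
    for lang_code, word in TRANSLATION_TEMPLATE.findall(wikitext):
        words = translations.setdefault(lang_code.strip(), [])
        if word.strip() not in words:
            words.append(word.strip())
    return translations


# Tiny stand-in for a compressed dump, so the sketch runs without a download.
demo_xml = """<mediawiki><page><title>book</title><revision><text>
====Translations====
{{trans-top|collection of sheets of paper bound together}}
* French: {{t+|fr|livre|m}}
* German: {{t+|de|Buch|n}}
* Swedish: {{t+|sv|bok|c}}
{{trans-bottom}}
</text></revision></page></mediawiki>"""
demo_dump = bz2.compress(demo_xml.encode("utf-8"))

with bz2.open(io.BytesIO(demo_dump)) as stream:
    result = {title: extract_translations(text) for title, text in iter_pages(stream)}

print(result)
```

Streaming with iterparse and clearing each finished page keeps memory use flat, which matters because a full Wiktionary dump does not fit comfortably in memory.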

The final output structure I settled on, using the word "book" as an example:

Related issue

Explore using Wiktionary dumps to derive translation data #650

github-actions · 2026-03-03T22:25:34Z

Thank you for the pull request! 💙

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

Note

Scribe uses Conventional Comments in reviews to make sure that communication is as clear as possible.

github-actions · 2026-03-03T22:25:34Z

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

Tests for changes have been written and the pytest, linting and formatting workflows within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

axif0 · 2026-03-03T22:28:51Z

@andrewtavis do we really need to filter out the lists of valid tags and the mappings for tag canonicalization?

axif0 · 2026-03-03T22:29:53Z

Happy to work on the tests once the PR is merged.

andrewtavis · 2026-03-03T22:42:06Z

Amazing, @axif0! Thank you so much for the work here! I'll try to get to the review by Friday :)

andrewtavis · 2026-03-03T22:43:28Z

Does this include the SQLite conversion of the resulting JSON? If not, do you want to make an issue for that?

axif0 · 2026-03-03T23:02:19Z

Does this include the SQLite conversion of the resulting JSON?

No. I wanted to confirm the approach first: once the JSON format and the other functionality are approved, we can work on the SQLite conversion later 😄

andrewtavis · 2026-03-04T09:41:00Z

Sounds like a great plan, @axif0 :) Do you want to make an issue for the SQLite conversion so we have it logged, or would you prefer to include it in this PR?

DeleMike

Thanks @axif0. W PR!

I have added my comment on the cli/download.py file.

TL;DR
I think we need to support downloading multiple languages in one go. What do you think about this option?

DeleMike · 2026-03-04T10:49:52Z

src/scribe_data/cli/download.py

Thanks @axif0 for this!
First of all, I am trying the uv dependency management flow for the first time, and it looks really clean and simple to use.
Also, this PR is really good! I hope it removes the instability with the usual latest-lexems.bz2 files.

For this download procedure, I was thinking there should be a multiple-language option, where I can pass something like scribe-data download --wiktionary-dump --language=[en,de,fr,sv]. I hope you understand what I mean. Maybe there is already an option for this that I am not aware of. Below is our current scribe-data -h result:

The Scribe-Data CLI is a tool for extracting language data from Wikidata and other sources.

positional arguments:
  {list,l,get,g,total,t,convert,c,download,d,interactive,i,check_contracts,cc,filter_data,fd}
    list (l)              List languages, data types and combinations of each that Scribe-Data can be used for.
    get (g)               Get data from Wikidata and other sources for the given languages and data types.
    total (t)             Check Wikidata for the total available data for the given languages and data types.
    convert (c)           Convert data returned by Scribe-Data to different file types.
    download (d)          Download Wikidata lexeme dumps.
    interactive (i)       Run in interactive mode.
    check_contracts (cc)  Check the data in the following directory to see that all needed language data is included.
    filter_data (fd)      Filter data based on provided data contract values.

options:
  -h, --help     Show this help message and exit.
  -v, --version  Show the version of the Scribe-Data CLI.
  -u, --upgrade  Upgrade the Scribe-Data CLI to the latest version.

Visit the codebase at https://github.com/scribe-org/Scribe-Data and documentation at https://scribe-data.readthedocs.io to learn more!
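
For illustration only, the multi-language form suggested in this comment could be parsed roughly as follows. This is a hedged sketch with a stand-alone argparse parser, not Scribe-Data's actual CLI code: parse_language_list and the "download-sketch" program name are hypothetical, and treating --language as a comma-separated list (with optional brackets) is an assumption about how the proposal might look.

```python
import argparse


def parse_language_list(raw):
    """Turn 'en,de,fr' or '[en,de,fr]' into ['en', 'de', 'fr'] (hypothetical helper)."""
    codes = [code.strip().lower() for code in raw.strip("[] ").split(",")]
    codes = [code for code in codes if code]
    if not codes:
        raise argparse.ArgumentTypeError("expected at least one language code")
    return list(dict.fromkeys(codes))  # drop duplicates while keeping order


# Stand-alone parser for the sketch; the real CLI defines its own subcommands.
parser = argparse.ArgumentParser(prog="download-sketch")
parser.add_argument("--wiktionary-dump", action="store_true")
parser.add_argument("--language", type=parse_language_list, default=["en"])

args = parser.parse_args(["--wiktionary-dump", "--language=[en,de,fr,sv]"])
print(args.language)
```

Keeping the value a single comma-separated string avoids shell quoting surprises that a space-separated list of arguments can cause, at the cost of a small custom type function.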

Overall, I think all is great as is!! Honestly, thank you and well done!

Thanks for the positive feedback, @DeleMike! I'm in agreement on all of the above :)

andrewtavis · 2026-03-05T00:18:49Z

I agree with @DeleMike that we should have multiple output languages, and we also need the ability to pass a specific dump ID. I'm seeing here that as of now we're just using enwiktionary, but it should be the Wiktionary dump for the language we want translations for. I don't know if we need to pass the Wiktionaries as an array though.

I think that one string arg for --wiktionary-dump can be mapped to multiple output languages, and within Scribe-Server we can run the same command multiple times? I just worry about what the command would look like if we had an array for --wiktionary-dump. Really hard to read 🤔 From the passed file we can get the needed ISO for the source language of the translations. If there's a latest Wiktionary dump that's corrupted, then we can rerun the command with a different one manually :)
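
As a rough sketch of the idea above (one dump argument, an optional dump ID, and the source ISO read back from the file name), something like the following could work. It assumes the usual Wikimedia dump naming of <iso>wiktionary-<dump id>-pages-articles.xml.bz2 under dumps.wikimedia.org; wiktionary_dump_url and source_language_from_dump are hypothetical helper names, not functions from this PR.

```python
import re
from pathlib import Path

# Assumed naming convention: enwiktionary-latest-pages-articles.xml.bz2
# or enwiktionary-20260301-pages-articles.xml.bz2 for a dated dump.
DUMP_NAME = re.compile(
    r"^(?P<iso>[a-z_]+?)wiktionary-(?P<dump_id>latest|\d{8})-pages-articles\.xml\.bz2$"
)


def wiktionary_dump_url(iso="en", dump_id="latest"):
    """Build the download URL for one Wiktionary edition and dump ID."""
    wiki = f"{iso}wiktionary"
    return (
        f"https://dumps.wikimedia.org/{wiki}/{dump_id}/"
        f"{wiki}-{dump_id}-pages-articles.xml.bz2"
    )


def source_language_from_dump(file_name):
    """Read the source-language ISO code back out of a dump file name."""
    match = DUMP_NAME.match(Path(file_name).name)
    if match is None:
        raise ValueError(f"not a recognised Wiktionary dump name: {file_name}")
    return match["iso"]


url = wiktionary_dump_url("de", "20260301")
print(url)
print(source_language_from_dump(url.rsplit("/", 1)[-1]))
```

Passing a dated dump ID instead of "latest" is what would allow a manual rerun against an older dump if the newest one turns out to be corrupted.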


axif0 · 2026-03-06T13:36:07Z

Updated command for SQLite conversion:

scribe-data convert -lang english -dt wiktionary_translations -ot sqlite
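
To make the conversion step concrete, here is a minimal sketch of turning translation JSON into a SQLite table with Python's built-in sqlite3 module. The JSON layout (word, then target language, then a list of translations), the table schema, and the translations_json_to_sqlite name are all assumptions for illustration; the actual format produced by this PR is not shown in the thread.

```python
import json
import sqlite3


def translations_json_to_sqlite(json_text, connection):
    """Load {word: {lang: [translations]}} JSON into a 'translations' table."""
    data = json.loads(json_text)
    connection.execute(
        """
        CREATE TABLE IF NOT EXISTS translations (
            word TEXT NOT NULL,
            target_language TEXT NOT NULL,
            translation TEXT NOT NULL,
            PRIMARY KEY (word, target_language, translation)
        )
        """
    )
    rows = [
        (word, lang, translation)
        for word, by_language in data.items()
        for lang, translations in by_language.items()
        for translation in translations
    ]
    with connection:  # commit on success, roll back on error
        connection.executemany(
            "INSERT OR IGNORE INTO translations VALUES (?, ?, ?)", rows
        )
    return len(rows)


# Assumed sample data, not output from the real extraction.
sample = json.dumps({"book": {"de": ["Buch"], "fr": ["livre"]}})
conn = sqlite3.connect(":memory:")
inserted = translations_json_to_sqlite(sample, conn)
german = conn.execute(
    "SELECT translation FROM translations WHERE word = ? AND target_language = ?",
    ("book", "de"),
).fetchall()
print(inserted, german)
```

The composite primary key with INSERT OR IGNORE makes reruns idempotent, so converting the same JSON twice does not duplicate rows.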

Add Wiktionary dump download and translation extraction capabilities

2f76946

axif0 requested review from DeleMike and andrewtavis March 3, 2026 22:25

remove unused os import

c83edd8

axif0 requested a review from catreedle March 3, 2026 22:27

andrewtavis mentioned this pull request Mar 3, 2026

Add MariaDB tables for translations scribe-org/Scribe-Server#58

Open

2 tasks

DeleMike reviewed Mar 4, 2026


axif0 added 2 commits March 6, 2026 19:03

Merge branch 'main' of https://github.com/scribe-org/Scribe-Data into…

57299f9

… mama

JSONs to SQLite and add wiktionary_translations to valid data types.

41c6e65

axif0 added 2 commits March 6, 2026 22:50

add tests and update some bug

8471d39

fix tests

bbe38bc
