Add Wiktionary dump download and translation extraction capabilities#666

Open
axif0 wants to merge 6 commits into scribe-org:main from axif0:mama
@axif0
Member

@axif0 axif0 commented Mar 3, 2026

Contributor checklist


Description

The Scribe-Data CLI now supports downloading and extracting translations directly from Wiktionary XML dumps to avoid Wikidata rate limits. Users can download dumps with scribe-data download --wiktionary-dump and extract translations using scribe-data get -dt translations -wtp enwiktionary. Additionally, the interactive mode (scribe-data interactive) has been updated to include guided flows for both downloading dumps and configuring translation extraction.

scribe-data download --wiktionary-dump   # English Wiktionary dump
scribe-data download --wiktionary-dump --language de    # specific language's Wiktionary dump

scribe-data g -dt translations -lang de -wtp enwiktionary  # Extract translations for a specific language
scribe-data g -dt translations -wtp enwiktionary  # Extract translations for ALL supported languages
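For readers unfamiliar with the dump format, here is a minimal sketch of how translations can be streamed out of a Wiktionary XML dump. It is not the implementation in this PR: the function names (`iter_translations`, `extract_from_dump`) are hypothetical, and it assumes the English Wiktionary convention of marking translations with `{{t|lang|word}}`, `{{t+|...}}`, `{{tt|...}}` and `{{tt+|...}}` templates.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches {{t|de|Buch}}, {{t+|de|Buch|n}}, {{tt|...}} and {{tt+|...}}.
# Assumption: only these four template names are of interest.
TRANSLATION_TEMPLATE = re.compile(r"\{\{tt?\+?\|([^|{}]+)\|([^|{}]+)")


def local_name(tag):
    """Strip the XML namespace that MediaWiki exports put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_translations(xml_stream, target_lang=None):
    """Yield (page title, language code, translation) from a dump stream."""
    title = None
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            for lang, word in TRANSLATION_TEMPLATE.findall(elem.text or ""):
                if target_lang is None or lang == target_lang:
                    yield title, lang, word
        elif name == "page":
            # Drop the finished page's children so memory stays bounded.
            elem.clear()


def extract_from_dump(path, target_lang=None):
    """Hypothetical helper: read a .xml.bz2 dump without unpacking it."""
    with bz2.open(path, "rb") as stream:
        yield from iter_translations(stream, target_lang)
```

Streaming with `iterparse` matters here because the uncompressed dump is far too large to load as a single tree; passing `target_lang="de"` mirrors what `-lang de` does in the commands above.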

image

The final structure I produced, using the word "book" as an example:

image

Related issue

@axif0 axif0 requested review from DeleMike and andrewtavis March 3, 2026 22:25
@github-actions
Contributor

github-actions bot commented Mar 3, 2026

Thank you for the pull request! 💙

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

Note

Scribe uses Conventional Comments in reviews to make sure that communication is as clear as possible.

@github-actions
Contributor

github-actions bot commented Mar 3, 2026

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

  • Tests for changes have been written and the pytest, linting and formatting workflows within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@axif0 axif0 requested a review from catreedle March 3, 2026 22:27
@axif0
Member Author

axif0 commented Mar 3, 2026

@andrewtavis do we really need to filter out the lists of valid tags and the mappings for tag canonicalization?

@axif0
Member Author

axif0 commented Mar 3, 2026

Happy to work on the tests if the PR is merged.

@andrewtavis
Member

andrewtavis commented Mar 3, 2026

Amazing, @axif0! Thank you so much for the work here! I'll try to get to the review by Friday :)

@andrewtavis
Member

Does this include the SQLite conversion of the resulting JSON? If not, do you want to make an issue for that?

@axif0
Member Author

axif0 commented Mar 3, 2026

Does this include the SQLite conversion of the resulting JSON?

No. I'd like to get the JSON format and the other functionality confirmed first; once those are approved, we can work on the SQLite conversion 😄

@andrewtavis
Member

Sounds like a great plan, @axif0 :) Do you want to make an issue for the SQLite conversion so we have it logged, or would you prefer to include it in this PR?

Collaborator

@DeleMike DeleMike left a comment

Thanks @axif0. W PR!

I have added my comment on the cli/download.py file.

TL;DR
I think we need to support downloading multiple languages in one go. What do you think about this option?

Collaborator

Thanks @axif0 for this!
First of all, I am trying the uv dependency management flow for the first time, and it looks really clean and simple to use.
Also, this PR is really good! I hope it removes the instability with the usual latest-lexems.bz2 files.

For the download procedure, I was thinking there should be a way to pass multiple languages at once, for example scribe-data download --wiktionary-dump --language=[en,de,fr,sv] or something similar. I hope that makes sense, unless an option for this already exists and I'm not aware of it. Below is our current scribe-data -h result:

The Scribe-Data CLI is a tool for extracting language data from Wikidata and other sources.

positional arguments:
  {list,l,get,g,total,t,convert,c,download,d,interactive,i,check_contracts,cc,filter_data,fd}
    list (l)                                                List languages, data types and combinations of each that Scribe-Data can be used for.
    get (g)                                                 Get data from Wikidata and other sources for the given languages and data types.
    total (t)                                               Check Wikidata for the total available data for the given languages and data types.
    convert (c)                                             Convert data returned by Scribe-Data to different file types.
    download (d)                                            Download Wikidata lexeme dumps.
    interactive (i)                                         Run in interactive mode.
    check_contracts (cc)                                    Check the data in the following directory to see that all needed language data is included.
    filter_data (fd)                                        Filter data based on provided data contract values.

options:
  -h, --help                                                Show this help message and exit.
  -v, --version                                             Show the version of the Scribe-Data CLI.
  -u, --upgrade                                             Upgrade the Scribe-Data CLI to the latest version.

Visit the codebase at https://github.com/scribe-org/Scribe-Data and documentation at https://scribe-data.readthedocs.io to learn more!

Overall, I think all is great as is!! Honestly, thank you and well done!
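To make the multi-language suggestion concrete, here is a minimal `argparse` sketch of a `--language` option that accepts several codes in one argument. It is an illustration only: the parser below is a stand-alone toy (`scribe-data-sketch`), and `language_list` and `build_parser` are hypothetical names, not part of the Scribe-Data CLI.

```python
import argparse


def language_list(value):
    """Parse 'en,de,fr' or '[en,de,fr]' into ['en', 'de', 'fr']."""
    codes = [code.strip() for code in value.strip("[]").split(",")]
    codes = [code for code in codes if code]
    if not codes:
        raise argparse.ArgumentTypeError("expected at least one language code")
    return codes


def build_parser():
    """Build a toy parser that mimics only the download subcommand."""
    parser = argparse.ArgumentParser(prog="scribe-data-sketch")
    subparsers = parser.add_subparsers(dest="command")
    download = subparsers.add_parser("download", aliases=["d"])
    download.add_argument("--wiktionary-dump", action="store_true")
    # Assumption: default to English when no language is given.
    download.add_argument("--language", type=language_list, default=["en"])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for code in args.language:
        print(f"would download the {code} Wiktionary dump")
```

One design note: unquoted square brackets can be treated as glob patterns by some shells, so the plain comma form (`--language en,de,fr,sv`) is likely the safer spelling to document.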

Member

Thanks for the positive feedback, @DeleMike! I'm in agreement on all of the above :)

@andrewtavis
Member

I agree with @DeleMike that we should have multiple output languages, and we also need the ability to pass a specific dump ID. I'm seeing here that as of now we're just using enwiktionary, but it should be the Wiktionary dump for the language we want translations for. I don't know if we need to pass the Wiktionaries as an array though.

I think that one string arg for --wiktionary-dump can be mapped to multiple output languages, and within Scribe-Server we can run the same command multiple times? I just worry about what the command would look like if we had an array for --wiktionary-dump. Really hard to read 🤔 From the passed file we can get the needed ISO for the source language of the translations. If there's a latest Wiktionary dump that's corrupted, then we can rerun the command with a different one manually :)
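The idea of reading the source language from the passed file, and mapping one dump to several output languages, could be sketched as follows. This assumes dump files keep the Wikimedia naming scheme, such as `enwiktionary-latest-pages-articles.xml.bz2` or the same name with an eight-digit dump date; `parse_dump_name` and `plan_extraction` are hypothetical helpers. Note that the wiki prefix is usually, but not always, an ISO 639 code.

```python
import re
from pathlib import Path

# Assumption: file names look like '<prefix>wiktionary-<latest|YYYYMMDD>-...'.
DUMP_NAME = re.compile(r"^(?P<code>[a-z_]+)wiktionary-(?P<dump_id>latest|\d{8})-")


def parse_dump_name(path):
    """Return (source language prefix, dump ID) for a Wiktionary dump file."""
    match = DUMP_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"not a recognised Wiktionary dump name: {path}")
    return match["code"], match["dump_id"]


def plan_extraction(dump_path, output_languages):
    """Pair one source dump with each requested output language."""
    source, dump_id = parse_dump_name(dump_path)
    return [
        {"source": source, "dump_id": dump_id, "target": target}
        for target in output_languages
    ]
```

With this shape, a corrupted latest dump can be swapped for a dated one by changing only the file argument, and the list of output languages stays readable.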

@axif0
Member Author

axif0 commented Mar 6, 2026

Updated command for SQLite conversion:

scribe-data convert -lang english -dt wiktionary_translations -ot sqlite
image image
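As a rough picture of what a JSON to SQLite conversion involves, here is a minimal sketch using the standard `sqlite3` module. The JSON shape is an assumption (word, then language code, then a list of translations), since the structure approved in this PR is only shown in the screenshots; the table layout and `translations_to_sqlite` are likewise hypothetical and not what `scribe-data convert` produces.

```python
import sqlite3


def translations_to_sqlite(translations, db_path=":memory:"):
    """Write {word: {language: [translation, ...]}} into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS translations (
            word TEXT NOT NULL,
            language TEXT NOT NULL,
            translation TEXT NOT NULL,
            PRIMARY KEY (word, language, translation)
        )
        """
    )
    # Flatten the nested mapping into one row per translation.
    rows = [
        (word, language, translation)
        for word, by_language in translations.items()
        for language, items in by_language.items()
        for translation in items
    ]
    # INSERT OR IGNORE keeps reruns idempotent given the primary key.
    conn.executemany("INSERT OR IGNORE INTO translations VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn
```

In practice the input would come from `json.load` on the extracted file, and `db_path` would point at a file rather than the in-memory default.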


3 participants