Add Wiktionary dump download and translation extraction capabilities #666
axif0 wants to merge 6 commits into scribe-org:main
Conversation
Thank you for the pull request! 💙 The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app. Note: Scribe uses Conventional Comments in reviews to make sure that communication is as clear as possible.
Maintainer Checklist
The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
@andrewtavis, do we really need to filter out the lists of valid tags and the mappings for tag canonicalization?
Happy to work on the tests once the PR is merged.
Amazing, @axif0! Thank you so much for the work here! I'll try to get to the review by Friday :)
Does this include the SQLite conversion of the resulting JSON? If not, do you want to make an issue for that?
No. I'd actually like to get the JSON format and the other functionality confirmed first; once those are approved, we can work on the SQLite conversion later 😄
Sounds like a great plan, @axif0 :) Do you want to make an issue for the SQLite conversion so we have it logged, or would you prefer to include it in this PR?
Thanks @axif0 for this!
First of all, I am trying the uv dependency management flow for the first time, and it looks really clean and simple to use.
Also, this PR is really good! I hope it removes the instability with the usual latest-lexems.bz2 files.
For this download procedure, I was thinking there should be a way to pass multiple languages at once, for example scribe-data download --wiktionary-dump --language=[en,de,fr,sv] or something similar. I hope that makes sense, unless there's already an option for this that I'm not aware of. Below is our current scribe-data -h result:
The Scribe-Data CLI is a tool for extracting language data from Wikidata and other sources.
positional arguments:
{list,l,get,g,total,t,convert,c,download,d,interactive,i,check_contracts,cc,filter_data,fd}
list (l) List languages, data types and combinations of each that Scribe-Data can be used for.
get (g) Get data from Wikidata and other sources for the given languages and data types.
total (t) Check Wikidata for the total available data for the given languages and data types.
convert (c) Convert data returned by Scribe-Data to different file types.
download (d) Download Wikidata lexeme dumps.
interactive (i) Run in interactive mode.
check_contracts (cc) Check the data in the following directory to see that all needed language data is included.
filter_data (fd) Filter data based on provided data contract values.
options:
-h, --help Show this help message and exit.
-v, --version Show the version of the Scribe-Data CLI.
-u, --upgrade Upgrade the Scribe-Data CLI to the latest version.
Visit the codebase at https://github.com/scribe-org/Scribe-Data and documentation at https://scribe-data.readthedocs.io to learn more!
Overall, I think all is great as is!! Honestly, thank you and well done!
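For the multi-language idea above, here's a rough sketch of how a comma-separated language list could be parsed with argparse. The `--language` flag, the `parse_language_list` helper and the `scribe-data-sketch` parser are all hypothetical and only for illustration; they aren't the current Scribe-Data CLI.

```python
import argparse


def parse_language_list(value: str) -> list[str]:
    """Split 'en,de,fr' or '[en,de,fr]' into ['en', 'de', 'fr']."""
    # Tolerate the bracketed form used in the suggestion above.
    cleaned = value.strip().strip("[]")
    languages = [part.strip().lower() for part in cleaned.split(",") if part.strip()]
    if not languages:
        raise argparse.ArgumentTypeError("expected at least one language code")
    return languages


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the proposed flags, not the real CLI.
    parser = argparse.ArgumentParser(prog="scribe-data-sketch")
    subparsers = parser.add_subparsers(dest="command")
    download = subparsers.add_parser("download", aliases=["d"])
    download.add_argument("--wiktionary-dump", action="store_true")
    download.add_argument("--language", type=parse_language_list, default=["en"])
    return parser


args = build_parser().parse_args(
    ["download", "--wiktionary-dump", "--language=[en,de,fr,sv]"]
)
print(args.language)  # ['en', 'de', 'fr', 'sv']
```

Accepting a single string and splitting it keeps the flag easy to type in a shell, whether or not the brackets are included.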
Thanks for the positive feedback, @DeleMike! I'm in agreement on all of the above :)
I agree with @DeleMike that we should have multiple output languages, and we also need the ability to pass a specific dump ID. I'm seeing here that as of now we're just using enwiktionary, but it should be the Wiktionary dump for the language we want translations for. I don't know if we need to pass the Wiktionaries as an array though. I think that one string arg for


Contributor checklist
pytest command as directed in the testing section of the contributing guide
Description
The Scribe-Data CLI now supports downloading and extracting translations directly from Wiktionary XML dumps to avoid Wikidata rate limits. Users can download dumps with scribe-data download --wiktionary-dump and extract translations using scribe-data get -dt translations -wtp enwiktionary. Additionally, the interactive mode (scribe-data interactive) has been updated to include guided flows for both downloading dumps and configuring translation extraction.
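As a minimal sketch of how a project name such as enwiktionary could map to a dump file: this assumes the standard dumps.wikimedia.org layout and the pages-articles XML dump. The `wiktionary_dump_url` helper and its `dump_id` parameter are hypothetical names for illustration, and the file this PR actually downloads may differ.

```python
DUMP_BASE_URL = "https://dumps.wikimedia.org"


def wiktionary_dump_url(project: str, dump_id: str = "latest") -> str:
    """Build the URL of a Wiktionary pages-articles dump.

    project: a Wikimedia project name such as 'enwiktionary' or 'dewiktionary'.
    dump_id: 'latest' or a dated dump directory such as '20240601'.
    """
    if not project.endswith("wiktionary"):
        raise ValueError(f"not a Wiktionary project name: {project!r}")
    # Assumed file naming: <project>-<dump_id>-pages-articles.xml.bz2
    file_name = f"{project}-{dump_id}-pages-articles.xml.bz2"
    return f"{DUMP_BASE_URL}/{project}/{dump_id}/{file_name}"


print(wiktionary_dump_url("enwiktionary"))
print(wiktionary_dump_url("dewiktionary", dump_id="20240601"))
```

Keeping the project name as a single string argument means the same helper works for any language's Wiktionary without a lookup table.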
The final structure I came up with looks like this, using the word "book" as an example:
Related issue