Rule-first Russian proofreading engine in Rust.
This is not a regex toy. The project is already building a transparent language-model layer for Russian: tokenization, morphology, agreement, government, quantity frames, punctuation safe zones, suppressions, debug snapshots, dataset importers, and regression fixtures.
Run the strict profile to see the current research-grade surface:
cargo run -- check examples/bad.txt --rules rules --profile strictRepresentative checks that already work:
Согласно новому приказа, встреча перенесена.
→ preposition + nominal-group government conflict
Два новых дом стояли рядом.
→ numeral + short nominal-group agreement conflict
Сто двадцать два новых дом стояли рядом.
→ typed compound-numeral component selects the governing numeral part
Новый важному приказу присвоили номер.
→ non-adjacent modifier/head agreement conflict inside a nominal group
Девочка пришёл.
→ subject/predicate agreement conflict
Он говорит о задачу.
→ verb valency frame: prepositional object must use the modeled case
Он помогает проектом.
→ verb valency frame: direct-object seed expects another case
Он хочет разсказать и потом расбить лёд.
→ word-formation-backed prefix-final з/с assimilation
Из за дождя кое кто опоздал. Я то понял. Ну ка проверь.
→ model-backed hyphenation for compound prepositions, кое-, -то, -ка
Он ушёл потому что устал. Конечно это риск.
→ punctuation with clause-marker and introductory-phrase safe-zone logic
Мaма сказала: я незнаю, но надо учится.
→ mixed alphabet, common не-confusable, and -тся/-ться heuristic
Пол лимона осталось.
→ conservative пол- hyphenation before л/vowel/proper-name contexts
The engine also keeps negative fixtures for false-positive risk. For example,
corrected targets from LORuGEC, RLC, and RuSpellGold are stored as regression
fixtures so model-backed grammar rules do not keep firing on normal sentences
like В один из летних вечеров..., От него целый год..., or
к потере слуха и зрения.
The long-term product direction is a serious Russian grammar/proofreading checker with minimal ML and maximal explicit linguistic structure. Regex is allowed only for surface phenomena. Grammar rules should be backed by reusable models:
- typed morphology and feature unification;
- preposition and verb government frames;
- nominal-group and quantity models;
- clause boundaries and syntactic safe zones;
- debug snapshots with machine-readable proofs;
- dataset-based regression tests.
sudo pacman -S --needed rust cargo jq parallel python-pytestcargo run -- check examples/bad.txt --rules rules
cargo run -- check examples/bad.txt --rules rules --profile strict
cargo run -- check examples/bad.txt --rules rules --format json | jqUse a custom morphology TSV:
cargo run -- check input.txt --rules rules --morph-lexicon data/lexicon/demo_morph.tsvInspect the debug layer:
cargo run -- debug input.txt --rules rules --profile strict | jqValidate the rule corpus:
cargo run -- validate --rules rulesRun embedded rule examples as executable specs:
cargo run -- test-examples --rules rulesBatch check with GNU Parallel:
find texts -type f -name '*.txt' -print0 | \
parallel -0 'cargo run --quiet -- check {} --rules rules --format json > {.}.lint.json'Default checks run conservative implemented rules only. strict enables all
implemented rules, including style, heuristic, and advanced demo/model-backed
checks.
cargo run -- check examples/bad.txt --rules rules --profile default
cargo run -- check examples/bad.txt --rules rules --profile strict
cargo run -- check examples/bad.txt --rules rules --profile typography-only
cargo run -- list-rules --rules rules --all --status researchTargeted filters:
cargo run -- check examples/bad.txt --rules rules --domain punctuation --severity error
cargo run -- check examples/bad.txt --rules rules --rule-id ru.punctuation.no_space_before_mark
cargo run -- check examples/bad.txt --rules rules --exclude-rule ru.typography.hyphen_instead_of_dashSuppressions are off by default. Enable inline directives explicitly:
cargo run -- check input.txt --rules rules --allow-inline-suppressions
cargo run -- check input.txt --rules rules --suppress-rule ru.punctuation.no_space_before_markSee docs/CLI_USAGE.md for the JSON contract, suppression syntax, timing mode,
and batch examples.
Local-only importers currently support:
- RLC annotated / RLC-GEC;
- LORuGEC;
- RuSpellGold.
The repository does not vendor full raw corpora by default. Use local archives from your machine:
python scripts/eval/run_benchmark.py \
--dataset lorugec \
--archive "$HOME/Downloads/LORuGEC-main.zip" \
--checker-bin target/debug/orthos \
--rules rules \
--profile strict \
--limit 100The checked-in fixture testdata/fixtures/eval/gec_dataset_regressions.jsonl
keeps a tiny, stable slice of real dataset-derived false-positive cases for
normal development.
The repository includes data/lexicon/demo_morph.tsv, a small built-in lexicon
used by tests and examples. For broader morphology-backed checks, download the
OpenCorpora-derived cache from the GitHub release:
mkdir -p data/lexicon
curl -L -o data/lexicon/opencorpora.bincache.zst \
https://github.com/xa9e/orthos/releases/download/morph-cache-v1/opencorpora.bincache.zst
curl -L -o data/lexicon/opencorpora.bincache.idx.zst \
https://github.com/xa9e/orthos/releases/download/morph-cache-v1/opencorpora.bincache.idx.zst
zstd -d data/lexicon/opencorpora.bincache.zst -o data/lexicon/opencorpora.bincache
zstd -d data/lexicon/opencorpora.bincache.idx.zst -o data/lexicon/opencorpora.bincache.idxAfter that, orthos automatically uses data/lexicon/opencorpora.bincache
when no --morph-lexicon is passed.
The cache is derived from the OpenCorpora morphological dictionary. OpenCorpora data is distributed under CC BY-SA 3.0; the project code is licensed under AGPL-3.0-or-later.
cargo fmt --check
cargo check
cargo test
cargo run -- validate --rules rules
cargo run -- test-examples --rules rules
python3 scripts/dev/project_health.py
python3 -m pytest -q tests/test_importers.py tests/dev_tools/test_project_health.py
git diff --check- whitespace, punctuation spacing, sentence-final punctuation, lowercase starts;
- repeated words and repeated punctuation;
- mixed Cyrillic/Latin words;
- common spelling and confusable-token seeds;
- clitic and compound-preposition hyphenation;
не + verb/gerund/participlespacing;-тся/-тьсяcontext heuristic;пол-hyphenation in conservative contexts;- prefix-final
з/сassimilation through a seed word-formation model; - missing comma before frequent subordinators with safe-zone suppression;
- introductory phrase comma at sentence start;
- coordination comma slots for homogeneous modifier series;
- adjective/noun and nominal-group modifier agreement;
- subject/predicate agreement;
- preposition case government;
- preposition + nominal-group government;
- numeral/noun and numeral + nominal-group quantity agreement;
- compound and typed-component compound numeral agreement;
- verb government from data-backed valency seeds;
- phrase-map style, pleonasm, and government seeds;
- debug proof generation for model-backed diagnostics.
Read IDEA.md first. The project direction is rule-first and debug-first:
every serious grammar rule should become a small inspectable language model, not
a one-off detector.
Useful docs:
docs/DATASETS.mddocs/CLI_USAGE.mddocs/ARCHITECTURE.mddocs/language-model/docs/debug-layer.md
The project is licensed under the GNU Affero General Public License,
version 3 or any later version (AGPL-3.0-or-later). See LICENSE for the full
text. External datasets keep their own licenses and are never vendored into
this repository; see docs/DATASETS.md.
Create a portable repository bundle:
git bundle create orthos.bundle --all
git bundle verify orthos.bundleRestore it:
git clone orthos.bundle orthos
cd orthosFor a source-only archive without Git history:
git archive --format=tar HEAD | xz -T0 > orthos-source.tar.xz