EncoDB is a lightweight artifact for studying encoding-related behavioral divergences across database systems and conversion tools.
- PostgreSQL 18.3
- MySQL 8.4.0
- MariaDB 11.8.6
- DuckDB 1.4.3, with the `encodings` extension (version b5a547e)
- (GNU libc) iconv 2.43
- RQ1: Measure how DBMSs behave after malformed bytes are admitted on the write path. The basic method is to enumerate candidate byte sequences for a target encoding, insert them into DBMS tables, read them back immediately, and record admission, readback success, and emitted character mappings. See `RQ1/`.
- RQ2: Compare query-side validity, mapping, and failure semantics on common encodings. The basic method is to probe DuckDB and `iconv`, normalize the outputs into `*-CHARS.txt` files, and run pairwise differential analysis with `RQ2/compare_chars.py`. See `RQ2/`.
- RQ3: Check whether observed divergences become compatibility bugs under claimed equivalence. The basic method is to use TiDB GBK probing, compare the results against MySQL outputs, and further compare `CONVERT` behavior with the same differential analysis workflow. See `RQ3/`.
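The enumerate, probe, and compare steps described above can be sketched in a few lines of Python. This is only an illustration, not the code in `RQ1/` or `RQ2/compare_chars.py`: the function names `enumerate_two_byte`, `probe`, and `diff_mappings` are hypothetical, and Python's built-in codecs stand in for the DBMSs and `iconv` that the artifact actually probes.

```python
from itertools import product


def enumerate_two_byte(lead_range, trail_range):
    """Yield every two-byte candidate whose lead/trail bytes fall in the given ranges."""
    for lead, trail in product(lead_range, trail_range):
        yield bytes([lead, trail])


def probe(candidates, codec):
    """Map each candidate (hex string) to its decoded text, or None if the codec rejects it.

    Assumption: a Python codec is used here as a stand-in for a real system under test.
    """
    result = {}
    for raw in candidates:
        try:
            result[raw.hex()] = raw.decode(codec)
        except UnicodeDecodeError:
            result[raw.hex()] = None
    return result


def diff_mappings(left, right):
    """Return {key: (left_value, right_value)} for every key on which the two sides disagree.

    Both mappings are assumed to cover the same candidate set, so None always
    means "rejected" rather than "not probed".
    """
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


def demo():
    """Probe the 0xA1-0xFE two-byte plane with two codecs and count the divergences."""
    candidates = list(enumerate_two_byte(range(0xA1, 0xFF), range(0xA1, 0xFF)))
    gb2312 = probe(candidates, "gb2312")
    gbk = probe(candidates, "gbk")
    return len(candidates), len(diff_mappings(gb2312, gbk))
```

Calling `demo()` returns the number of candidates and the number of byte sequences on which the two stand-in codecs disagree. Each entry of the diff records both sides, so a validity divergence (one side is `None`) can be told apart from a mapping divergence (both sides decode, but to different characters).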
In addition, the efficiency results are based on the simple wall-clock timings that the probe and comparison scripts already print. We did not apply any dedicated optimization, yet the scripts remain efficient in practice.
We also implemented a PostgreSQL patch to tighten the validation for EUC-CN (https://github.com/SWUFE-DB-Group/postgresql-encoding-validation), and conducted end-to-end testing with the patched PostgreSQL 18.3. See GB2312-PG-benchmark.
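To illustrate what tighter validation can mean, the sketch below separates two questions for an EUC-CN byte string: whether it is structurally well-formed, and whether every character in it actually has a Unicode mapping. It is not the patch itself. The helper names are hypothetical, the range-only check is an assumed simplification of a loose validator, and Python's `gb2312` codec is used as a stand-in mapping table.

```python
def is_structurally_valid_euc_cn(raw: bytes) -> bool:
    """Range-only check (assumed model of loose validation).

    ASCII bytes stand alone; a byte in 0xA1-0xFE must be followed by
    another byte in 0xA1-0xFE. No mapping table is consulted.
    """
    i = 0
    while i < len(raw):
        if raw[i] < 0x80:
            i += 1
        elif 0xA1 <= raw[i] <= 0xFE and i + 1 < len(raw) and 0xA1 <= raw[i + 1] <= 0xFE:
            i += 2
        else:
            return False
    return True


def python_gb2312_has_mapping(raw: bytes) -> bool:
    """Stand-in mapping check: does Python's gb2312 codec decode the whole input?"""
    try:
        raw.decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False


def classify(raw: bytes, has_mapping=python_gb2312_has_mapping) -> str:
    """Label input as malformed, convertible, or well-formed but unmapped."""
    if not is_structurally_valid_euc_cn(raw):
        return "malformed"
    return "convertible" if has_mapping(raw) else "well-formed but unmapped"
```

Under this model, a validator that stops at the structural check admits inputs in the "well-formed but unmapped" class, which then fail only later, when a conversion to UTF8 is attempted. The `has_mapping` parameter is injectable so that a different mapping table can be substituted for the Python codec.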
- GBK in TiDB is incompatible with that in MySQL
- Unexpected Error for Encoding \xD7\xFA
- Character with byte sequence 0xa2 0xa3 in encoding "EUC_CN" has no equivalent in encoding "UTF8"
- Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
- The wrong byte/char column of EUC-CN and EUC-KR
- Inconsistent GBK-to-utf8mb4 conversion between implicit INSERT ... SELECT assignment and explicit CONVERT()