Issue with Malayalam and Kannada normalization; extra space is inserted before punctuation #2

@raccoon641

I tried normalizing Malayalam and Kannada text, but the phone number in the input is left unnormalized in the output.
Input:
{"text":"എനിക്ക് 29 വയസുണ്ട്, എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ്."}
Output:
{
"normalized_text": "എനിക്ക് ഇരുപത്തൊമ്പത് വയസുണ്ട് , എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ് .",
"detected_lang": "ml",
"lang_name": "Malayalam",
"processing_time": 0.05313992500305176
}
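
For regression testing, a check along these lines could flag the failure above automatically. This is only a sketch: it assumes the response is the JSON object shown above and that a fully normalized string should contain no ASCII digits. The helper name `leftover_digit_runs` is hypothetical and not part of the project.

```python
import json
import re

# Response copied from the report above.
response = json.loads("""
{
"normalized_text": "എനിക്ക് ഇരുപത്തൊമ്പത് വയസുണ്ട് , എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ് .",
"detected_lang": "ml",
"lang_name": "Malayalam",
"processing_time": 0.05313992500305176
}
""")


def leftover_digit_runs(normalized_text):
    """Return every run of ASCII digits that survived normalization."""
    # [0-9] is used on purpose: \\d would also match native-script digits.
    return re.findall(r"[0-9]+", normalized_text)


leftover = leftover_digit_runs(response["normalized_text"])
print(leftover)
```

Here the age (29) was expanded, so only the phone number should be reported as left over.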

I am also collecting here all the issues observed across languages.

Observed Behavior

  1. Phone number normalization (critical)
    Phone numbers are handled differently across languages:
    Hindi (hi): Digit-wise normalization, but an extra leading digit is introduced
    Tamil (ta): Digit-wise normalization, but a non-Tamil lexeme for zero (பூஜ்யம்) is used
    Telugu (te): Phone number treated as a cardinal quantity, expanded into crores/lakhs
    Kannada (kn): Phone number not normalized at all
    Malayalam (ml): Phone number not normalized at all
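
To make the expected behavior concrete, here is a minimal sketch of digit-wise phone number expansion that could run before cardinal expansion, so that a number such as 9123456789 is never read as a quantity. Everything in it is an assumption for illustration: the function name `expand_phone_numbers`, the heuristic that a phone number is a standalone run of 10 digits starting with 6-9, and the Malayalam digit words (which should be verified by a native speaker). It is not the library's implementation.

```python
import re

# Assumed Malayalam digit names; verify before relying on them.
ML_DIGITS = {
    "0": "പൂജ്യം",
    "1": "ഒന്ന്",
    "2": "രണ്ട്",
    "3": "മൂന്ന്",
    "4": "നാല്",
    "5": "അഞ്ച്",
    "6": "ആറ്",
    "7": "ഏഴ്",
    "8": "എട്ട്",
    "9": "ഒമ്പത്",
}

# Hypothetical heuristic: exactly 10 ASCII digits, first digit 6-9,
# not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<![0-9])[6-9][0-9]{9}(?![0-9])")


def expand_phone_numbers(text, digit_words):
    """Replace each phone-like digit run with its digits spelled one by one."""

    def spell(match):
        return " ".join(digit_words[d] for d in match.group(0))

    return PHONE_RE.sub(spell, text)


result = expand_phone_numbers("എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ്.", ML_DIGITS)
print(result)
```

Passing the digit table as a parameter keeps the matching logic shared across languages, which would also avoid the per-language differences listed above. Shorter numbers such as the age 29 do not match the pattern and are left for the cardinal expander.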

  2. Punctuation normalization (global bug)
    Across all tested languages, an extra space is inserted before punctuation:
    Examples:
    है ।
    ஆகும் .
    ಇದೆ .
    ആണ് .
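
As a possible stopgap for the punctuation issue, a post-processing step could strip the whitespace that appears before punctuation. This is a sketch only: the function name `tighten_punctuation` and the set of punctuation marks are my assumptions, and the proper fix is probably in whichever step tokenizes and rejoins the text.

```python
import re

# Assumed set of marks that should attach to the preceding word,
# including the Devanagari danda and double danda.
SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([,.!?;:।॥])")


def tighten_punctuation(text):
    """Remove spaces or tabs that directly precede a punctuation mark."""
    return SPACE_BEFORE_PUNCT.sub(r"\1", text)


print(tighten_punctuation("എനിക്ക് ഇരുപത്തൊമ്പത് വയസുണ്ട് , എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ് ."))
print(tighten_punctuation("है ।"))
```

Note that a blanket rule like this would also alter inputs where the space is intentional, for example a decimal written as " .5", so it may need to be restricted to the spaces the normalizer itself inserts.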
