I tried normalizing Malayalam and Kannada text, but the phone number in the output was not normalized.
Input:
{"text":"എനിക്ക് 29 വയസുണ്ട്, എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ്."}
Output:
{
"normalized_text": "എനിക്ക് ഇരുപത്തൊമ്പത് വയസുണ്ട് , എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ് .",
"detected_lang": "ml",
"lang_name": "Malayalam",
"processing_time": 0.05313992500305176
}
I am also putting together all the issues observed across languages.
Observed Behavior
- Phone number normalization (critical)
Phone numbers are handled differently across languages:
Hindi (hi): Digit-wise normalization but extra leading digit introduced
Tamil (ta): Digit-wise normalization, but non-Tamil zero lexeme (பூஜ்யம்) used
Telugu (te): Phone number treated as a cardinal quantity, expanded into crores/lakhs
Kannada (kn): Phone number not normalized at all
Malayalam (ml): Phone number not normalized at all
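For reference, here is a minimal sketch of the digit-wise behavior I would expect for Malayalam, which could be applied consistently to every language. It is not the project's code: `expand_phone_numbers`, `PHONE_RE`, and the `ML_DIGITS` table are hypothetical names, the digit words should be checked against the project's own lexicon, and the rule "exactly 10 ASCII digits starting with 6-9" is only my assumption about what counts as a phone number.

```python
import re

# Assumed Malayalam digit words (illustrative; verify against the project's lexicon).
ML_DIGITS = {
    "0": "പൂജ്യം", "1": "ഒന്ന്", "2": "രണ്ട്", "3": "മൂന്ന്", "4": "നാല്",
    "5": "അഞ്ച്", "6": "ആറ്", "7": "ഏഴ്", "8": "എട്ട്", "9": "ഒമ്പത്",
}

# Assumption: a phone number is a standalone run of exactly 10 ASCII digits
# that starts with 6-9. Shorter or longer digit runs are left untouched.
PHONE_RE = re.compile(r"(?<![0-9])[6-9][0-9]{9}(?![0-9])")


def expand_phone_numbers(text, digit_words):
    """Replace each detected phone number with its digits spelled out one by one."""
    def spell(match):
        return " ".join(digit_words[d] for d in match.group(0))
    return PHONE_RE.sub(spell, text)


if __name__ == "__main__":
    sample = "എനിക്ക് 29 വയസുണ്ട്, എന്റെ ഫോൺ നമ്പർ 9123456789 ആണ്."
    print(expand_phone_numbers(sample, ML_DIGITS))
```

Running this step before cardinal expansion would leave "29" for the cardinal rule while keeping the phone number from being read as a quantity, which is the Telugu failure described above.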
- Punctuation normalization (global bug)
Across all tested languages, an extra space is inserted before punctuation. Examples:
है ।
ஆகும் .
ಇದೆ .
ആണ് .
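As a possible workaround until this is fixed, a post-processing step along these lines removes the stray space. This is only a sketch: `tighten_punctuation` is a hypothetical helper, not part of the normalizer, and the punctuation set (including the danda `।`) is my assumption about which marks are affected. It would also join cases like `" .5"`, so it may need narrowing.

```python
import re

# Assumed set of affected punctuation marks, including the Devanagari danda.
SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([,.;:!?।])")


def tighten_punctuation(text):
    """Drop spaces or tabs that appear directly before a punctuation mark."""
    return SPACE_BEFORE_PUNCT.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ["है ।", "ஆகும் .", "ಇದೆ .", "ആണ് ."]:
        print(tighten_punctuation(sample))
```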