IptvRepository.kt series title matching strips non-ASCII letters instead of transliterating (breaks Turkish/diacritic titles) #398

@beOne61

Bug Description

Xtream-Codes series search/matching fails to find a series whose name contains
non-ASCII Latin letters (e.g. Turkish "ğ", "ı", "ş", "ç") when the provider's
catalog and TMDB spell the title differently, one with the diacritics and the
other without.

Root Cause

IptvRepository.kt defines:

val NON_ALPHA_NUM_REGEX = Regex("[^a-z0-9]+")

private fun normalizeLookupText(value: String): String {
    ...
    .lowercase(Locale.US)
    .replace(NON_ALPHA_NUM_REGEX, " ")
    ...
}

This regex only recognizes ASCII a-z0-9. Non-ASCII letters like Turkish "ğ"
are not transliterated — they're treated as punctuation and replaced with a
space, splitting the word in two:

"Doğu" → normalizeLookupText() → "do u"

If the provider's catalog stores the title in plain ASCII (e.g. "Dogu", common
for many IPTV backends), it normalizes to "dogu" — a single token — which no
longer matches "do u" either exactly or via the word-overlap fuzzy scoring
(scoreNameMatch / looseSeriesTitleScore, which both filter out words
shorter than 3 characters, dropping "do" and "u" entirely).

Comparison with the existing Jellyfin/Emby/Plex matcher

HomeServerRepository.kt's HomeServerMatcher.normalizeTitle() already
handles this correctly via Unicode NFD decomposition before stripping:

fun normalizeTitle(title: String): String {
    val ascii = Normalizer.normalize(title, Normalizer.Form.NFD)
        .replace(DIACRITICS_REGEX, "")
    return ascii.lowercase(Locale.US)...
}

NFD splits "ğ" into the base letter "g" plus a combining breve (U+0306), so
stripping the combining mark turns "Doğu" into "dogu" and leaves the base
letter intact.

Suggested Fix

Apply the same NFD-normalize + diacritics-strip step in
IptvRepository.kt::normalizeLookupText() before the NON_ALPHA_NUM_REGEX
replacement, so it behaves consistently with HomeServerMatcher.normalizeTitle().
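
A rough sketch of what the suggested change could look like, not the actual patch. The function name normalizeLookupTextSketch and both regex names are hypothetical. The combining-mark pattern is an assumption, because the real DIACRITICS_REGEX in HomeServerRepository.kt is not quoted in this issue. The elided steps of normalizeLookupText() are again replaced by trim().

```kotlin
import java.text.Normalizer
import java.util.Locale

// Assumption: matches Unicode combining marks (general category M).
// The real DIACRITICS_REGEX may be defined differently.
private val COMBINING_MARKS = Regex("\\p{M}+")

// Same character class as NON_ALPHA_NUM_REGEX in IptvRepository.kt.
private val NON_ALNUM = Regex("[^a-z0-9]+")

fun normalizeLookupTextSketch(value: String): String =
    // 1. Decompose accented letters into base letter + combining mark(s).
    Normalizer.normalize(value, Normalizer.Form.NFD)
        // 2. Drop the combining marks, keeping the base letters.
        .replace(COMBINING_MARKS, "")
        // 3. Existing steps: lowercase, then collapse everything else to spaces.
        .lowercase(Locale.US)
        .replace(NON_ALNUM, " ")
        .trim()
```

Under these assumptions "Doğu" and "Dogu" both normalize to "dogu", and "Über" normalizes to "uber". One caveat: Turkish dotless "ı" (U+0131) has no canonical decomposition, so NFD leaves it unchanged and the final regex still turns it into a space ("Kızıl" becomes "k z l"). Covering it would likely need an explicit "ı" to "i" replacement in addition to the NFD step.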

Impact

Any Xtream-Codes series/VOD title containing non-ASCII Latin characters
(Turkish, German umlauts, etc.) is at risk of silently failing to match
against the user's own portal content, even though the content exists and is
correctly listed via get_series/get_series_info.
