Can an obscure Unicode character be a sentence breaker? #42

@helmadik

The Greek ano teleia (mid dot) is a sentence boundary. We successfully added the semicolon (which serves as the question mark in Greek) as a sentence breaker in the load config, but we are not sure whether that setting can hold Unicode code points the way the word breakers category does.
The character is Unicode U+00B7 (UTF-8 bytes C2 B7), named "MIDDLE DOT".
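
For anyone checking the code point, here is a small Python sketch that uses only the standard `unicodedata` module. It does not touch the load config, and the helper name `describe` is made up for illustration. It prints the code point, name, and UTF-8 bytes of MIDDLE DOT, and also shows that the dedicated GREEK ANO TELEIA character (U+0387) normalizes to U+00B7 under NFC, so source text may contain either one depending on whether it was normalized.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"        # the character named in this issue
GREEK_ANO_TELEIA = "\u0387"  # dedicated Greek code point, canonically equivalent to U+00B7


def describe(ch):
    """Return code point, Unicode name, UTF-8 bytes (hex), and NFC form of one character.

    Hypothetical helper for illustration; not part of any config or library.
    """
    nfc = unicodedata.normalize("NFC", ch)
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "utf8": ch.encode("utf-8").hex().upper(),
        "nfc": f"U+{ord(nfc):04X}",
    }


if __name__ == "__main__":
    for ch in (MIDDLE_DOT, GREEK_ANO_TELEIA):
        print(describe(ch))
```

If the input text is not NFC-normalized before segmentation, both U+00B7 and U+0387 may need to be listed as sentence breakers; whether the config accepts either is still the open question here.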

I have vague memories that there is some interaction with "word breakers" elsewhere in the config, but don't remember the details.
