fix utf8 issues in jpostal#1
Conversation
|
Any chance we can close the loop and contribute this back to the main |
Victor Schappert (@vcschapp) we definitely should. I'm off the Overture project now, so if you'd like to take this and put it up in the jpostal repo go ahead and do it. We've been using this for places for quite some time with no issues. I was talking with James Willis (can't tag him for some reason) from wherobots about this, he's taken my changes and made a few additional ones to get this working on their platform. So hopefully y'all can get this merged back in one way or the other. Relevant chat about this with whererobots folks in https://github.com/OvertureMaps/rfd/discussions/28#discussioncomment-13686845 |
Full context in https://github.com/OvertureMaps/overture-matchers/issues/56.
This is the modified version of jpostal we're currently using in our cluster for places. It is a minimally modified version of an open PR in the original jpostal repo addressing the same issue. The main modification is fixing the byte array size to include the null termination character at the end.
The issue originates from the fact that libpostal expects standard UTF-8, but Java uses a modified UTF8 version, so by working with jstrings by default at the JNI layer, you end up passing unexpected/malformed UTF-8 characters to libpostal, which eventually makes the library crash and/or hang depending on the environment it's running in. In the case of Spark clusters, this was hanging nodes indefinitely after an issue in the transliteration module of libpostal.
This issue was particularly reproducible with Meta's feed, given the presence of international places in all sorts of languages and character sets.
This requires JDK8 to build locally.
Important
While switching from jstring to jbytes allows us to work around the issue, it introduces additional risk of memory leaks happening. Keep an eye out if you observe any workers dying with OOM or similar errors. I was able to successfully compute addresses for all of Meta's dataset without seeing any issues, but we should make sure that is the case for running the full setup flow in the matcher as well.