sparklemotion/nokogiri

JRuby: UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character

Open

#2410 opened on Jan 5, 2022

Labels: blocked, help wanted, platform/jruby, topic/encoding

Description

https://github.com/sparklemotion/nokogiri/blob/55029bfba481338825c99e78af2b182b1cc49e04/ext/java/nokogiri/internals/c14n/UtfHelpper.java#L51

The Canonicalizer processes the input String one char at a time. Java uses 16 bits to represent a char, so a Unicode character whose code point is greater than 0xFFFF (65535) is split into two chars (a surrogate pair). Neither char is recognized on its own, so the character is written out as two ? bytes (0x3F) instead.

For example, if I canonicalize input that contains 𡏅 via c14n, 𡏅 is replaced with ?? in the output.
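
To make the difference concrete, here is a minimal standalone sketch. It is not the nokogiri code: the class and method names are hypothetical, `encodePerChar` only models the behavior described above, and `encodePerCodePoint` shows one possible way to handle surrogate pairs, assuming the writer can look at the whole String rather than a single char. U+20000 is used as a stand-in for any supplementary character.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; none of these names exist in nokogiri.
final class SupplementaryUtf8Sketch {

    // Models the reported behavior: every 16-bit char is encoded on its own,
    // so each half of a surrogate pair degrades to '?' (0x3F).
    static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isSurrogate(c)) {
                out.write(0x3F);
            } else {
                writeCodePoint(c, out);
            }
        }
        return out.toByteArray();
    }

    // Possible fix: walk the String by code point so that a surrogate pair
    // is combined into one value before it is encoded.
    static byte[] encodePerCodePoint(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                out.write(0x3F); // unpaired surrogate: no valid UTF-8 form
            } else {
                writeCodePoint(cp, out);
            }
        }
        return out.toByteArray();
    }

    // Plain UTF-8 encoding of a single non-surrogate code point (1 to 4 bytes).
    private static void writeCodePoint(int cp, ByteArrayOutputStream out) {
        if (cp < 0x80) {
            out.write(cp);
        } else if (cp < 0x800) {
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
    }

    public static void main(String[] args) {
        // "a" followed by U+20000, which Java stores as two chars.
        String input = "a" + new String(Character.toChars(0x20000));
        byte[] reference = input.getBytes(StandardCharsets.UTF_8);
        System.out.println("per char matches JDK encoder:       "
                + Arrays.equals(encodePerChar(input), reference));
        System.out.println("per code point matches JDK encoder: "
                + Arrays.equals(encodePerCodePoint(input), reference));
    }
}
```

The per-code-point variant needs access to the following char, so a real fix would probably have to change how the caller feeds characters to the writer, not only the single-char method itself.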
