elastic/elasticsearch

Use charset from Content-Type header

Open

#22769 opened on Jan 24, 2017

Labels: :Core/Infra/REST API, >enhancement, Team:Core/Infra, help wanted, team-discuss, triaged

Description

In https://github.com/elastic/elasticsearch/pull/22691#discussion_r96935452, I left a comment pointing out that our code currently ignores the charset parameter of the Content-Type header, and that this is something we should look into. The Javadoc for JsonFactory describes how different charsets are handled:

Encoding is auto-detected from contents according to JSON
specification recommended mechanism. Json specification
supports only UTF-8, UTF-16 and UTF-32 as valid encodings,
so auto-detection implemented only for this charsets.
For other charsets use {@link #createParser(java.io.Reader)}.

Unfortunately, not all clients stick to the Unicode encodings; I have seen some send data as ISO-8859-1. I think we should consider parsing the charset from the Content-Type header when it is present and handling it appropriately: fail if we cannot support the charset, convert the content, or create the parser differently (for example, from a Reader).
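
To make the idea concrete, here is a minimal sketch of what parsing the charset could look like. It uses only the JDK; the class and method names (`ContentTypeCharset`, `charsetOf`, `readerFor`) are hypothetical and not existing Elasticsearch code. Falling back to UTF-8 when no charset is declared, and rejecting unknown charsets with an `IllegalArgumentException`, are assumptions of the sketch rather than decided behavior.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class ContentTypeCharset {

    private ContentTypeCharset() {}

    /**
     * Returns the charset named by the "charset" parameter of a Content-Type
     * value, or UTF-8 if the header is null or has no such parameter
     * (the UTF-8 default is an assumption of this sketch).
     */
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        // parts[0] is the media type; the rest are "name=value" parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Parameter values may be quoted: charset="utf-8".
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Charset.forName(value);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Fail loudly instead of silently mis-decoding the body.
                throw new IllegalArgumentException(
                    "unsupported charset in Content-Type: " + value, e);
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Wraps the request body in a Reader that decodes with the declared charset. */
    static Reader readerFor(InputStream body, String contentType) {
        return new InputStreamReader(body, charsetOf(contentType));
    }
}
```

The resulting Reader could then be handed to `JsonFactory#createParser(java.io.Reader)`, as the Javadoc above suggests for non-Unicode charsets. For UTF-8, UTF-16 and UTF-32 we could probably keep the existing byte-based path and rely on auto-detection.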
