Use charset from Content-Type header · elastic/elasticsearch#22769

(7 comments) (3 reactions) (1 assignee) Java (25,882 forks) batch import

:Core/Infra/REST API, >enhancement, Team:Core/Infra, help wanted, team-discuss, triaged

Repository metrics

Stars: (76,700 stars)
PR merge metrics: (Avg merge 2d) (1,000 merged PRs in 30d)

Description

In https://github.com/elastic/elasticsearch/pull/22691#discussion_r96935452, I added a comment pointing out that our code currently ignores the charset parameter of the Content-Type header and that we should look into this. The javadocs of JsonFactory describe how different charsets are handled:

Encoding is auto-detected from contents according to JSON
specification recommended mechanism. Json specification
supports only UTF-8, UTF-16 and UTF-32 as valid encodings,
so auto-detection implemented only for this charsets.
For other charsets use {@link #createParser(java.io.Reader)}.

Unfortunately, not all clients stick to Unicode-only encodings; I have seen some send data as ISO-8859-1. I think we should consider parsing the charset from the Content-Type header when available and handling it appropriately (failing if we cannot support it, converting, creating the parser differently, etc.).
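To make the first step concrete, here is a minimal sketch of extracting the charset parameter from a Content-Type header value using only the JDK. The class name `ContentTypeCharset` and its `parse` method are hypothetical and are not part of Elasticsearch or Jackson. The parsing is deliberately simple (it splits on `;`) and assumes parameter values do not themselves contain semicolons.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not an Elasticsearch or Jackson API.
final class ContentTypeCharset {

    private ContentTypeCharset() {}

    /**
     * Returns the charset declared in a Content-Type header value,
     * or an empty Optional if the header declares none.
     * Throws IllegalArgumentException if the declared charset name is
     * malformed or not supported by this JVM.
     */
    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type and are separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are case-insensitive.
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // The value may be a quoted string, e.g. charset="utf-8".
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                throw new IllegalArgumentException(
                        "Unsupported charset in Content-Type: " + value, e);
            }
        }
        return Optional.empty();
    }
}
```

A caller could treat an empty result as "let the JSON parser auto-detect" and turn the `IllegalArgumentException` into a client error response; which status code to use is a design decision this sketch leaves open.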

Contributor guide

Research direction: Investigate parsing the charset from the Content-Type header in Elasticsearch's HTTP handling code. Look at how JsonFactory is used and consider adding charset-aware parser creation. Handle unsupported charsets by failing gracefully or converting.
Tech stack: Java
Domain: backend
Issue type: Feature
Difficulty: 2
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Java, HTTP
Newbie friendliness: 60
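As one possible reading of the "converting" option in the research direction above, the following sketch transcodes a request body from its declared charset to UTF-8 before it reaches the JSON parser, and fails loudly when the bytes do not match the declared charset. The class `RequestBodyTranscoder` and its method are hypothetical. The sketch assumes the declared charset has already been resolved to a `java.nio.charset.Charset`, and it passes UTF-8, UTF-16, and UTF-32 bodies through untouched because the quoted JsonFactory javadoc says those are auto-detected.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not an Elasticsearch or Jackson API.
final class RequestBodyTranscoder {

    // Canonical names of encodings the JSON parser is expected to auto-detect.
    private static final Set<String> AUTO_DETECTED = new HashSet<>(Arrays.asList(
            "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
            "UTF-32", "UTF-32BE", "UTF-32LE"));

    private RequestBodyTranscoder() {}

    /**
     * Returns the body unchanged when no charset is declared or the charset
     * is one the parser auto-detects; otherwise returns a UTF-8 copy.
     * Throws IllegalArgumentException if the bytes are not valid in the
     * declared charset.
     */
    static byte[] toUtf8(byte[] body, Charset declared) {
        if (declared == null || AUTO_DETECTED.contains(declared.name())) {
            return body;
        }
        try {
            // REPORT makes invalid input an error instead of silently
            // substituting replacement characters.
            CharBuffer chars = declared.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(body));
            ByteBuffer utf8 = StandardCharsets.UTF_8.encode(chars);
            byte[] out = new byte[utf8.remaining()];
            utf8.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "Request body is not valid " + declared.name(), e);
        }
    }
}
```

Transcoding copies the whole body, so for large requests wrapping the stream in an `InputStreamReader` and using the `createParser(java.io.Reader)` overload mentioned in the javadoc may be the better fit; the sketch only illustrates the conversion and failure behavior.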
