Allow configurable handling of invalid UTF-8 for better backward compatibility (C#) · protocolbuffers/protobuf#22288

1 comment · 0 reactions · 0 assignees · C++ · 16,128 forks

help wanted

Repository metrics

Stars: 71,223
PR merge metrics: average merge time 2d 11h, 185 PRs merged in the last 30 days

Description

What language does this apply to?

C#

Describe the problem you are trying to solve.

With the recent change enforcing strict UTF-8 in Protocol Buffers (see commit db9b2c8e9f70dc8c53a31b52a740dfc3fad718d7), parsing a string field that contains an invalid UTF-8 byte sequence now throws an InvalidProtocolBufferException in C#.
This breaks backward compatibility in scenarios where legacy clients (intentionally or by mistake) send string data that is not valid UTF-8. This is common with older systems that use single-byte encodings such as ISO-8859-1 (Latin-1) or Windows-1252.
A concrete example is the German umlaut "ö", which may be sent as the single byte 0xF6 (ISO-8859-1, Latin-1) instead of its proper two-byte UTF-8 encoding (0xC3 0xB6). For production use cases where we cannot control all clients, this change is quite drastic: it results in hard exceptions for data that was previously parsed successfully (albeit with replacement characters, which is acceptable).
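The byte-level difference in the "ö" example can be reproduced in a few lines. The following is only an illustrative sketch in Python, not the C# parser: Python's built-in codecs stand in for the decoder, and the helper names `decode_strict` and `decode_lenient` are made up for this example.

```python
# Illustration only: Python's built-in codecs stand in for the C# decoder.
latin1_bytes = "\u00f6".encode("latin-1")  # "ö" as the single byte 0xF6
utf8_bytes = "\u00f6".encode("utf-8")      # "ö" as the two bytes 0xC3 0xB6


def decode_strict(raw: bytes) -> str:
    """Reject invalid UTF-8 (analogous to the new strict behavior)."""
    return raw.decode("utf-8", errors="strict")


def decode_lenient(raw: bytes) -> str:
    """Replace invalid bytes with U+FFFD (analogous to the previous behavior)."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(latin1_bytes.hex(), utf8_bytes.hex())
    print(repr(decode_lenient(latin1_bytes)))
    try:
        decode_strict(latin1_bytes)
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc.reason)
```

The lone byte 0xF6 is never valid in UTF-8, so the strict path fails while the lenient path yields a single replacement character.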

Describe the solution you'd like

I would like to propose an opt-in configuration for the protobuf parser in C# (maybe other languages can benefit as well) to relax this strict UTF-8 enforcement for string fields.
For example, a SuppressUtf8Exception or AllowInvalidUtf8 flag could restore the previous behavior: accept invalid UTF-8, substitute replacement characters, and perhaps log a warning instead of throwing an exception.
This would allow a smoother transition period for projects with legacy clients, and help maintain backward compatibility during migrations.
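To make the requested semantics concrete, here is a rough sketch of how such an opt-in switch might behave. It is written in Python purely for illustration and does not reflect the Google.Protobuf API; `ParserOptions`, `allow_invalid_utf8`, `InvalidStringFieldError`, and `decode_string_field` are all hypothetical names.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("utf8_compat_sketch")


@dataclass(frozen=True)
class ParserOptions:
    """Hypothetical parser options; strict validation stays the default."""
    allow_invalid_utf8: bool = False


class InvalidStringFieldError(ValueError):
    """Hypothetical stand-in for InvalidProtocolBufferException."""


def decode_string_field(raw: bytes, options: ParserOptions = ParserOptions()) -> str:
    """Decode the payload of a string field according to the options."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        if not options.allow_invalid_utf8:
            # Default path: fail hard, as the strict parser does today.
            raise InvalidStringFieldError("string field is not valid UTF-8") from exc
        # Opt-in path: log a warning and substitute U+FFFD for invalid bytes.
        logger.warning("invalid UTF-8 in string field at byte offset %d", exc.start)
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    legacy = b"Sch\xf6n"  # "Schön" encoded as Latin-1
    print(repr(decode_string_field(legacy, ParserOptions(allow_invalid_utf8=True))))
```

Keeping the flag off by default means existing strict behavior is unchanged unless a project explicitly opts in during its migration.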

Describe alternatives you've considered

Adding an extra data conversion layer before parsing, to pre-process and re-encode Latin-1 or other legacy-encoded data to UTF-8. This adds complexity and a risk of data corruption, especially because field boundaries are not known outside protobuf parsing.
Forcing all clients to update and guarantee UTF-8 conformance, which is not always feasible in environments with many legacy systems and uncontrolled data sources.
Using bytes fields instead of string, which is not practical for existing proto definitions and would require substantial refactoring.
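The corruption risk of the first alternative can be shown with a tiny hand-built message. This is a sketch under stated assumptions: the message has a single string field with field number 1 (tag byte 0x0A, i.e. field 1 with the length-delimited wire type) whose payload is "ö" in Latin-1, and the "conversion layer" naively transcodes the whole buffer.

```python
# Hand-built wire bytes: tag 0x0A (field 1, length-delimited), length 1, payload 0xF6.
message = bytes([0x0A, 0x01, 0xF6])

# A naive pre-processing layer that transcodes the entire buffer from
# Latin-1 to UTF-8 without knowing where the field boundaries are.
transcoded = message.decode("latin-1").encode("utf-8")

declared_length = transcoded[1]  # the length prefix is still 1
payload = transcoded[2:]         # but the payload has grown to 2 bytes

if __name__ == "__main__":
    print(transcoded.hex())
    print("declared:", declared_length, "actual:", len(payload))
```

The length prefix no longer matches the payload, so a parser would read only the first byte of the re-encoded string and misinterpret the rest. Any tag or varint byte at or above 0x80 would be rewritten in the same way, which is why re-encoding safely requires knowledge of the message structure.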

Additional context

I highly value the backward compatibility policies as described in your contribution guidelines:
https://github.com/protocolbuffers/protobuf/blob/main/CONTRIBUTING.md

A configuration switch or opt-in flag for legacy compatibility would help many users avoid breaking changes while still encouraging migration to strict UTF-8 over time.

Contributor guide

Research direction: Examine the C# parser implementation, specifically commit db9b2c8e9f70dc8c53a31b52a740dfc3fad718d7, which introduced strict UTF-8 validation. Identify where the InvalidProtocolBufferException is thrown for invalid UTF-8 in string fields. Propose adding a configuration option (e.g., SuppressUtf8Exception) that lets the parser substitute replacement characters instead of throwing. Ensure the change maintains backward compatibility and can be tested.
Tech stack: csharp
Domain: backend
Issue type: Feature
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: C#, Protocol Buffers
Newbie friendliness: 70
