Allow configurable handling of invalid UTF-8 for better backward compatibility (C#)
#22288 opened on Jun 17, 2025
Description
What language does this apply to?
C#
Describe the problem you are trying to solve.
With the recent change enforcing strict UTF-8 in Protocol Buffers (see commit db9b2c8e9f70dc8c53a31b52a740dfc3fad718d7), parsing a string field containing any non-UTF-8 byte sequences now throws an InvalidProtocolBufferException in C#.
This breaks backward compatibility for scenarios where legacy clients (intentionally or by mistake) send string data that is not valid UTF-8. This is common in older systems that use single-byte encodings such as ISO-8859-1 (Latin-1) or Windows-1252.
A concrete example is the German umlaut "ö", which may be sent as the single byte 0xF6 (ISO-8859-1, Latin-1) instead of its proper two-byte UTF-8 encoding (0xC3 0xB6). For production use cases where we cannot control all clients, this change is quite drastic: it results in hard exceptions for data that was previously parsed (albeit with replacement characters, which is acceptable for us).
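To make the difference concrete, here is a minimal sketch that uses only the .NET `UTF8Encoding` class and no protobuf code. It assumes the old parser behavior is comparable to a lenient decoder (invalid bytes become U+FFFD) and the new behavior to a throwing decoder; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8DecodingDemo
{
    // Lenient: invalid byte sequences are replaced with U+FFFD.
    private static readonly Encoding Lenient =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: false);

    // Strict: invalid byte sequences raise DecoderFallbackException.
    private static readonly Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static string DecodeLenient(byte[] bytes) => Lenient.GetString(bytes);

    public static bool TryDecodeStrict(byte[] bytes, out string text)
    {
        try
        {
            text = Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        byte[] latin1 = { 0x48, 0xF6, 0x72 };     // "Hör" as ISO-8859-1
        byte[] utf8 = { 0x48, 0xC3, 0xB6, 0x72 }; // "Hör" as UTF-8

        Console.WriteLine(TryDecodeStrict(utf8, out string good)
            ? $"strict, UTF-8 input: {good}"
            : "strict, UTF-8 input: rejected");

        Console.WriteLine(TryDecodeStrict(latin1, out _)
            ? "strict, Latin-1 input: accepted"
            : "strict, Latin-1 input: rejected");

        // The lone 0xF6 byte cannot start a valid UTF-8 sequence,
        // so the lenient decoder substitutes a replacement character.
        string replaced = DecodeLenient(latin1);
        Console.WriteLine($"lenient, Latin-1 input contains U+FFFD: {replaced.Contains('\uFFFD')}");
    }
}
```

The lenient result loses the original character but keeps the rest of the message usable, which is the behavior we relied on before.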
Describe the solution you'd like
I would like to propose an opt-in configuration for the protobuf parser in C# (maybe other languages can benefit as well) to relax this strict UTF-8 enforcement for string fields.
For example, a SuppressUtf8Exception or AllowInvalidUtf8 flag could restore the previous behavior: accepting invalid UTF-8, substituting replacement characters, and possibly logging a warning instead of throwing an exception.
This would allow a smoother transition period for projects with legacy clients, and help maintain backward compatibility during migrations.
Describe alternatives you've considered
- Adding an extra data conversion layer before parsing, to pre-process and re-encode Latin-1/legacy encoding data to UTF-8. This adds complexity and risk of data corruption, especially when field boundaries are not known outside protobuf parsing
- Forcing all clients to update and guarantee UTF-8 conformance, which is not always feasible in environments with many legacy systems and uncontrolled data sources
- Using `bytes` fields instead of `string`, but this is not practical for existing proto definitions and would require substantial refactoring
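
For reference, the `bytes` alternative would push decoding into application code, roughly as in the sketch below. It assumes the raw field content is already available as a `byte[]` and that Latin-1 is the only legacy encoding in play; `LegacyStringDecoder` is a hypothetical helper, not part of any library.

```csharp
using System;
using System.Text;

public static class LegacyStringDecoder
{
    // Strict UTF-8: throws DecoderFallbackException on invalid input.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // ISO-8859-1 maps every byte to a character, so this fallback never fails.
    // Windows-1252 would additionally need the System.Text.Encoding.CodePages
    // provider to be registered on .NET Core and later.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static string Decode(byte[] raw)
    {
        try
        {
            return StrictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: anything that is not valid UTF-8 is Latin-1.
            return Latin1.GetString(raw);
        }
    }

    public static void Main()
    {
        Console.WriteLine(Decode(new byte[] { 0x48, 0xC3, 0xB6, 0x72 })); // UTF-8 input
        Console.WriteLine(Decode(new byte[] { 0x48, 0xF6, 0x72 }));       // Latin-1 input
    }
}
```

Every consumer of every affected field would need a call like this, which is why it does not scale to existing schemas.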
Additional context
I highly value the backward compatibility policies as described in your contribution guidelines:
https://github.com/protocolbuffers/protobuf/blob/main/CONTRIBUTING.md
A configuration switch or opt-in flag for legacy compatibility would help many users avoid breaking changes while still encouraging migration to strict UTF-8 over time.