Compiler does not correctly interpret surrogate pairs when used in an identifier · dotnet/roslyn#9731

(13 comments) (7 reactions) (1 assignee) C# (4,257 forks)

Area-Compilers, Bug, Language-C#, Tenet-Localization, help wanted

Repository metrics

Stars: (20,414 stars)
PR merge metrics: (Avg merge 6d 17h) (256 merged PRs in 30d)

Description

The C# specification states that an identifier can start with or contain anything matching letter-character, which is defined as:

letter-character::
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl

However, the compiler does not appear to correctly interpret some characters in these categories when they lie outside the Basic Multilingual Plane and are therefore encoded as a surrogate pair.

For example, the Sumerian cuneiform character 𒅴 (U+12174) is categorized as 'OtherLetter' (matching 'Lo' above) when processed through char.GetUnicodeCategory("𒅴", 0).
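The classification above can be checked with a short sketch. It assumes only standard .NET APIs (`char.IsSurrogatePair`, `char.ConvertToUtf32`, `CharUnicodeInfo.GetUnicodeCategory`); the class and member names are hypothetical, and the character is written as the escape `\U00012174` so the example does not depend on the source file's encoding.

```csharp
using System;
using System.Globalization;

static class SurrogateInspection
{
    // U+12174, the character from the report, written as an escape sequence.
    public const string Sign = "\U00012174";

    // Summarizes how the string is stored and how it is classified
    // as a whole code point versus as its first UTF-16 code unit alone.
    public static string Describe(string s)
    {
        bool isPair = char.IsSurrogatePair(s, 0);
        int codePoint = char.ConvertToUtf32(s, 0);

        // Surrogate-aware lookup: classifies the code point starting at index 0.
        UnicodeCategory whole = CharUnicodeInfo.GetUnicodeCategory(s, 0);

        // Single-char lookup: sees only the first code unit.
        UnicodeCategory firstUnit = char.GetUnicodeCategory(s[0]);

        return $"length={s.Length} pair={isPair} U+{codePoint:X} " +
               $"whole={whole} firstUnit={firstUnit}";
    }
}
```

Calling `SurrogateInspection.Describe(SurrogateInspection.Sign)` should show a string of length 2 forming one surrogate pair, with the first code unit on its own reported as `Surrogate` rather than as a letter. The whole-code-point category depends on the runtime's Unicode data; the report above gives it as `OtherLetter`.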

However, the compiler interprets this character as two separate characters (its two UTF-16 surrogate code units) and reports CS1056 for each. It is likely checking each char individually, rather than checking whether the first char begins a surrogate pair and, if so, classifying the pair as a single code point.
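The suspected difference can be illustrated with a minimal sketch. This is not Roslyn's lexer code; the class and method names are hypothetical, and only the letter-character categories quoted above are checked (digits, underscores, and the other identifier-part rules are left out).

```csharp
using System;
using System.Globalization;

static class IdentifierScanSketch
{
    // Hypothetical helper: the six categories listed for letter-character.
    static bool IsLetterCategory(UnicodeCategory c) =>
        c == UnicodeCategory.UppercaseLetter ||   // Lu
        c == UnicodeCategory.LowercaseLetter ||   // Ll
        c == UnicodeCategory.TitlecaseLetter ||   // Lt
        c == UnicodeCategory.ModifierLetter ||    // Lm
        c == UnicodeCategory.OtherLetter ||       // Lo
        c == UnicodeCategory.LetterNumber;        // Nl

    // Classifies every UTF-16 code unit on its own. Each half of a
    // surrogate pair has category Surrogate, so the pair is rejected.
    public static bool AllLettersPerCodeUnit(string text)
    {
        foreach (char unit in text)
        {
            if (!IsLetterCategory(char.GetUnicodeCategory(unit)))
                return false;
        }
        return true;
    }

    // Classifies whole code points: a valid surrogate pair is looked up
    // as one character and the index advances past both code units.
    public static bool AllLettersPerCodePoint(string text)
    {
        int i = 0;
        while (i < text.Length)
        {
            if (!IsLetterCategory(CharUnicodeInfo.GetUnicodeCategory(text, i)))
                return false;
            i += char.IsSurrogatePair(text, i) ? 2 : 1;
        }
        return true;
    }
}
```

For `"\U00012174"`, `AllLettersPerCodeUnit` returns false because both halves are classified as `Surrogate`, which mirrors the pair of CS1056 errors described above. `AllLettersPerCodePoint` should return true on a runtime whose Unicode data covers the cuneiform block.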

Contributor guide

Research direction: Examine the compiler's Unicode character classification in the lexer for surrogate pair handling. Focus on how the parser determines identifier characters and whether it correctly processes surrogate pairs as single code points.
Tech stack: C#
Domain: backend, compilers
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: C#, Unicode
Newbie friendliness: 60
