Compiler does not correctly interpret surrogate pairs when used in an identifier · dotnet/roslyn#9731

13 comments · 7 reactions · 1 assignee · C# · 20,414 stars · 4,257 forks

Labels: Area-Compilers, Bug, Language-C#, Tenet-Localization, help wanted

Description

The C# specification states that an identifier can start with or contain anything matching letter-character, which is defined as:

letter-character::
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl
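The category list in this production maps directly onto members of `System.Globalization.UnicodeCategory`. A minimal sketch of that mapping follows; the class and method names are hypothetical and are not taken from Roslyn.

```csharp
using System.Globalization;

// Hypothetical helper for illustration only; not part of Roslyn.
public static class LetterCharacterSketch
{
    // Returns true when the category is one of those the grammar
    // lists for letter-character: Lu, Ll, Lt, Lm, Lo, or Nl.
    public static bool IsLetterCategory(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.UppercaseLetter: // Lu
            case UnicodeCategory.LowercaseLetter: // Ll
            case UnicodeCategory.TitlecaseLetter: // Lt
            case UnicodeCategory.ModifierLetter:  // Lm
            case UnicodeCategory.OtherLetter:     // Lo
            case UnicodeCategory.LetterNumber:    // Nl
                return true;
            default:
                return false;
        }
    }
}
```

Note that `UnicodeCategory.Surrogate` is not in the list, so a check that looks at one half of a surrogate pair on its own can never succeed.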

However, the compiler does not appear to correctly interpret some characters which match the above categories if they are part of a surrogate pair.

For example, the Sumerian cuneiform character 𒅴 is categorized as 'OtherLetter' (matching 'Lo' above) when processed through char.GetUnicodeCategory("𒅴", 0).
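The difference between looking at one UTF-16 code unit and looking at the whole code point can be observed directly. The sketch below is illustrative: the class and member names are hypothetical, and it relies on the documented behavior of `CharUnicodeInfo.GetUnicodeCategory(string, int)`, which returns the category of the surrogate pair when the index points at the first half of a valid pair.

```csharp
using System.Globalization;

// Hypothetical demo class; the names are not from the issue or from Roslyn.
public static class SurrogateDemo
{
    // The character from the issue. It lies outside the Basic Multilingual
    // Plane, so a .NET string stores it as two UTF-16 code units.
    public const string Sign = "𒅴";

    // Category of a single UTF-16 code unit, which is what a
    // char-by-char check sees.
    public static UnicodeCategory PerCodeUnit(string s, int index)
    {
        return char.GetUnicodeCategory(s[index]);
    }

    // Category of the code point that starts at index, combining a
    // valid surrogate pair into one character.
    public static UnicodeCategory PerCodePoint(string s, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(s, index);
    }
}
```

Here `SurrogateDemo.Sign.Length` is 2 and `PerCodeUnit(Sign, 0)` yields `Surrogate`, whereas `PerCodePoint(Sign, 0)` should yield `OtherLetter`, as the issue reports.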

However, the compiler interprets this character as two separate characters and reports CS1056 for each. It is likely checking each UTF-16 code unit (char) individually, rather than checking whether the first one begins a surrogate pair and, if so, interpreting the pair as a single character.
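A surrogate-aware scan would read a whole code point before classifying it, then advance by one or two code units. The following is a minimal sketch under stated assumptions, not Roslyn's lexer: `IdentifierScanSketch` and its methods are hypothetical, and the check covers only letter-character, ignoring the underscore, digits, and the other character classes that real identifiers allow.

```csharp
using System.Globalization;

// Hypothetical sketch; this is not how Roslyn's lexer is structured.
public static class IdentifierScanSketch
{
    // Returns true when every code point in text is a letter-character.
    // Simplification: real identifiers also allow '_', digits, and other
    // classes after the first character.
    public static bool AllLetterCharacters(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        int i = 0;
        while (i < text.Length)
        {
            // Classifies the full code point when text[i] begins a valid
            // surrogate pair; a lone surrogate is reported as Surrogate.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, i);
            if (!IsLetterCategory(category))
            {
                return false;
            }

            // Step over both halves of a pair, or one ordinary code unit.
            i += char.IsSurrogatePair(text, i) ? 2 : 1;
        }

        return true;
    }

    private static bool IsLetterCategory(UnicodeCategory category)
    {
        return category == UnicodeCategory.UppercaseLetter   // Lu
            || category == UnicodeCategory.LowercaseLetter   // Ll
            || category == UnicodeCategory.TitlecaseLetter   // Lt
            || category == UnicodeCategory.ModifierLetter    // Lm
            || category == UnicodeCategory.OtherLetter       // Lo
            || category == UnicodeCategory.LetterNumber;     // Nl
    }
}
```

With this approach `AllLetterCharacters("𒅴")` accepts the character as a single letter, while an unpaired surrogate such as `"\uD808"` is still rejected.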

Contributor guide

Tech stack: csharp
Area: tooling
Issue type: bug
Difficulty: 3
Estimated time: half day
Activity status: stale
Clarity: clear
Prerequisites: C# language specification, Unicode surrogate pairs, compiler internals
Beginner friendliness: 40
Research direction: Search the Roslyn lexer source code (e.g., Lexer.cs) for the identifier parsing logic. The bug is that the compiler checks each char individually instead of handling surrogate pairs. Look for calls to CharUnicodeInfo or similar and modify them to recognize surrogate pairs. See the comments on the issue for discussion of potential fixes.