[syntax-errors] Remaining syntax errors raised by the compiler · astral-sh/ruff#17412


Labels: help wanted, tracking

Description

Summary

A number of semantic or compile-time syntax errors have already been implemented in #11934, but there are still many remaining ones. The purpose of this issue is to track progress on this remainder. The initial source for most of these is this comment, which reports fuzzer results comparing Python's compiler and ruff, but others may be included over time.

The next section lists the syntax errors that still need to be implemented, and the sections after that include some more information on getting started with an implementation.

Syntax Errors

Errors mapping to current rules

These might be the easiest to start with because the existing rules have working implementations and corresponding tests. However, they could possibly require new context methods (see below), which could complicate things.

return with value in async generator (B901)
- https://github.com/astral-sh/ruff/pull/21200
- partial overlap with B901, which also applies to sync generators. This is only a syntax error in the async case
no binding for nonlocal (PLE0117)
- https://github.com/astral-sh/ruff/pull/21032
break outside loop (F701)
- https://github.com/astral-sh/ruff/pull/20556
- https://github.com/astral-sh/ruff/pull/20944
continue not properly in loop (F702)
- https://github.com/astral-sh/ruff/pull/20869
- https://github.com/astral-sh/ruff/pull/20944
yield from in async function (PLE1700)
- https://github.com/astral-sh/ruff/pull/20051
future feature is not defined (F407)
- https://github.com/astral-sh/ruff/pull/20554
import * only allowed at module level (F406)
- https://github.com/astral-sh/ruff/pull/20166
multiple starred expressions in assignment (F622)
- https://github.com/astral-sh/ruff/pull/20243
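
As a quick way to confirm that CPython really rejects each of the cases above at compile time, the following minimal sketch feeds one small snippet per rule to the built-in `compile()`. The snippets and the `CASES` mapping are illustrative assumptions, not taken from the linked PRs, and the exact message text can vary between Python versions.

```python
# One illustrative snippet per rule code listed above (snippets are assumptions).
CASES = {
    "B901": "async def f():\n    yield 1\n    return 1\n",
    "PLE0117": "def f():\n    nonlocal x\n",
    "F701": "break\n",
    "F702": "continue\n",
    "PLE1700": "async def f():\n    yield from g()\n",
    "F407": "from __future__ import not_a_real_feature\n",
    "F406": "def f():\n    from os import *\n",
    "F622": "*a, *b = 1, 2\n",
}


def cpython_syntax_error(source):
    """Return CPython's SyntaxError message for `source`, or None if it compiles."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:
        return err.msg
    return None


if __name__ == "__main__":
    for rule, source in CASES.items():
        print(f"{rule}: {cpython_syntax_error(source)}")
```

Running this against the Python version you target shows the message ruff would need to match for each rule.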

Errors mapping to current parse errors

These are also straightforward and will be similar to #17131.

duplicate keyword argument (https://github.com/astral-sh/ruff/pull/17804)

```
def foo(x): ...

foo(x=1, x=2)  # SyntaxError: keyword argument repeated: x
```

New errors

name is parameter and global (https://github.com/astral-sh/ruff/pull/20426)
```
def f(a): global a
```
name is parameter and nonlocal
```
def f(a): nonlocal a
```

annotated name can't be global (https://github.com/astral-sh/ruff/pull/20868)

```
x: int = 1

def f():
    global x
    x: str = "foo"  # SyntaxError: annotated name 'x' can't be global
```

alternative patterns bind different names (https://github.com/astral-sh/ruff/pull/20682)
```
match 42:
    case [x] | [y]: ...
```
This can probably be handled in the MatchPatternVisitor along with the other match-related errors.
can't use starred expression here (https://github.com/astral-sh/ruff/pull/17624)
```
x = *value
```
nonlocal declaration not allowed at module level (https://github.com/astral-sh/ruff/pull/17559)
```
nonlocal x
```
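
To illustrate why these count as compile-time rather than parse-time errors, here is a small sketch that checks which CPython stage rejects each snippet: `ast.parse` (parser only) or `compile` on the resulting tree (symbol table and compiler). The helper name `stage_that_rejects` is hypothetical. The `match` example is left out because it only parses on Python 3.10 and later.

```python
import ast

# Snippets copied from the "New errors" list above.
NEW_ERRORS = [
    "def f(a): global a\n",
    "def f(a): nonlocal a\n",
    "x: int = 1\ndef f():\n    global x\n    x: str = 'foo'\n",
    "x = *value\n",
    "nonlocal x\n",
]


def stage_that_rejects(source):
    """Return 'parser', 'compiler', or None depending on where CPython fails."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "parser"
    try:
        compile(tree, "<example>", "exec")
    except SyntaxError:
        return "compiler"
    return None


if __name__ == "__main__":
    for source in NEW_ERRORS:
        print(f"{source.splitlines()[0]!r}: {stage_that_rejects(source)}")
```

A result of `compiler` means the snippet is syntactically valid for the parser, so a check for it belongs with the semantic syntax errors rather than in the parser's grammar.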

Extensions to existing syntax errors

Assigning to and deleting attributes named __debug__ is also a syntax error

```
__debug__ = False  # Currently caught by ruff
x.__debug__ = False  # Not currently caught, but also a syntax error
```
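
The attribute case can be checked against CPython with a short sketch like the one below. The helper name `compile_error` is an assumption for illustration, and the deletion case is printed rather than asserted because its behavior may depend on the Python version.

```python
def compile_error(source):
    """Return CPython's SyntaxError message for `source`, or None if it compiles."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:
        return err.msg
    return None


if __name__ == "__main__":
    # Plain name, attribute assignment, and attribute deletion.
    for snippet in ("__debug__ = False", "x.__debug__ = False", "del x.__debug__"):
        print(f"{snippet!r}: {compile_error(snippet)}")
```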

Encoding issues

These syntax errors are all related to #6791 and should only be implemented if we decide to respect these alternate encodings in general.

These can also be a bit tricky to put into a file, so most of these have shell code snippets for generating the input rather than Python code directly.

ascii codec can't decode byte ... in position ...: ordinal not in range(128)
```
printf '# -*- coding: ascii -*-\n\xc2' > example.py
```

charmap codec can't decode byte ...

```
printf '# -*- coding: cp1252 -*-\n\x8d' > example.py
```

utf7 codec can't decode byte 0xc3
```
printf '# -*- coding: utf-7 -*-\n\xc3' > example.py
```
This one reports an E902 error, but only if it's enabled. Otherwise we only print a warning about invalid UTF-8.
encoding problem: ... with BOM
```
printf '\ufeff# -*- coding: ascii -*-' > example.py
```
I think any encoding other than UTF-8 with a BOM (\uFEFF) will cause this issue.
invalid character
invalid non-printable character
- both this and invalid character mostly seem to happen with the iso-8859-1 encoding and a unicode character
unknown encoding
```
# coding: not-a-real-encoding
```
source code string cannot contain null bytes
```
printf '# \x00' > example.py
```
This mostly seems like an issue in comments. This also isn't exactly an encoding error, but it's more closely related to these. It's also raised by the CPython parser, not the compiler.
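
Because `printf` escapes such as `\x` and `\u` are not handled identically by every shell, a Python equivalent that writes the exact bytes may be easier to reproduce. This is a minimal sketch under a few assumptions: the file names are hypothetical, and the BOM is written as its UTF-8 byte sequence (`EF BB BF`). `compile_outcome` only reports what the running interpreter does, since the exception type for some of these inputs differs between Python versions.

```python
import tempfile
from pathlib import Path

# Byte payloads mirroring the shell snippets above (file names are hypothetical).
ENCODING_CASES = {
    "ascii.py": b"# -*- coding: ascii -*-\n\xc2",
    "cp1252.py": b"# -*- coding: cp1252 -*-\n\x8d",
    "utf7.py": b"# -*- coding: utf-7 -*-\n\xc3",
    "bom_ascii.py": b"\xef\xbb\xbf# -*- coding: ascii -*-",
    "unknown.py": b"# coding: not-a-real-encoding\n",
    "null_byte.py": b"# \x00",
}


def write_cases(directory):
    """Write every payload into `directory` and return the created paths."""
    paths = []
    for name, payload in ENCODING_CASES.items():
        path = Path(directory) / name
        path.write_bytes(payload)
        paths.append(path)
    return paths


def compile_outcome(path):
    """Describe how the running interpreter handles the file at `path`."""
    try:
        compile(path.read_bytes(), str(path), "exec")
    except (SyntaxError, ValueError, LookupError) as err:
        return f"{type(err).__name__}: {err}"
    return "compiled"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_cases(tmp):
            print(f"{path.name}: {compile_outcome(path)}")
```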

Implementation

Issue #11934 contains links to each PR implementing a new semantic syntax error. These can be used as examples of what a new implementation looks like, but this section has a few more general notes.

All of the semantic syntax errors currently live in semantic_errors.rs and correspond to a variant of the SemanticSyntaxErrorKind enum. The SemanticSyntaxChecker is responsible for emitting these errors, but unlike other AST visitors, it does not generally traverse the AST on its own¹. Instead, it relies on a parent visitor to call its visit_stmt and visit_expr methods in the parent visitor's visit_stmt or visit_expr methods.

The SemanticSyntaxChecker also tracks very little of its own state. Instead, it again defers to the parent visitor via the SemanticSyntaxContext trait. Examples of methods on this trait include:

- python_version -- returns the target Python version for linting
- in_async_context -- returns whether or not async code like await or async for loops are allowed in the current scope

Typically the parent visitor, like the Checker from the ruff_linter crate, will implement SemanticSyntaxContext and pass itself to any SemanticSyntaxChecker method requiring a context.

Another important context method is report_semantic_error, which may require special handling of new SemanticSyntaxErrorKind variants. For example, some syntax errors emitted by CPython overlap with existing ruff rules (e.g. await-outside-async (PLE1142)). These need to be transformed into normal ruff Diagnostics instead of being reported directly as syntax errors. This happens within the Checker::report_semantic_error method.
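
The delegation described in this section can be sketched in Python as an analogy. This is not ruff's Rust code: the names `SemanticSyntaxChecker`, `SemanticSyntaxContext`, `in_async_context`, and `report_semantic_error` mirror the text, but their signatures here are assumptions, and `ParentVisitor` is a hypothetical stand-in for a visitor such as `Checker`. For simplicity, the sketch treats a function's decorators and defaults as part of the function's own scope.

```python
import ast
from typing import Protocol


class SemanticSyntaxContext(Protocol):
    """State the checker needs but does not track itself (assumed signatures)."""

    def in_async_context(self) -> bool: ...

    def report_semantic_error(self, message: str, node: ast.AST) -> None: ...


class SemanticSyntaxChecker:
    """Checks one node at a time; it never walks the tree on its own."""

    def visit_expr(self, expr: ast.expr, ctx: SemanticSyntaxContext) -> None:
        if isinstance(expr, ast.YieldFrom) and ctx.in_async_context():
            ctx.report_semantic_error("'yield from' inside async function", expr)


class ParentVisitor(ast.NodeVisitor):
    """Hypothetical parent visitor: owns the traversal and the scope state."""

    def __init__(self):
        self.checker = SemanticSyntaxChecker()
        self.async_stack = [False]  # module scope is not async
        self.errors = []

    # Context methods: the visitor passes itself to the checker as the context.
    def in_async_context(self):
        return self.async_stack[-1]

    def report_semantic_error(self, message, node):
        self.errors.append((node.lineno, message))

    def visit(self, node):
        if isinstance(node, ast.expr):
            self.checker.visit_expr(node, self)
        is_scope = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda))
        if is_scope:
            self.async_stack.append(isinstance(node, ast.AsyncFunctionDef))
        self.generic_visit(node)
        if is_scope:
            self.async_stack.pop()


if __name__ == "__main__":
    source = "async def f():\n    yield from g()\n\ndef h():\n    yield from g()\n"
    visitor = ParentVisitor()
    visitor.visit(ast.parse(source))
    print(visitor.errors)
```

The point of this shape is that only one traversal happens: the checker piggybacks on the parent's walk and asks the parent for any scope information it needs.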

Testing

There are two main ways of testing new semantic errors. The first, and easier, method is to write inline parser tests using test_ok and test_err comments, for example:

https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_python_parser/src/semantic_errors.rs#L80-L95

These are automatically extracted from the source code by the parser's generate_inline_tests function and then run like normal snapshot tests. These work very well for simple errors that don't require detailed information from the SemanticSyntaxContext.

For errors that do require such contextual information tracked by the parent visitor, a better alternative is to use integration tests, such as those found in linter.rs in the ruff_linter crate:

https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_linter/src/linter.rs#L1015-L1025

https://github.com/astral-sh/ruff/blob/014bb526f45afd654e5f42db92310e98b5068a54/crates/ruff_linter/src/linter.rs#L1064-L1066

The second of these can be reused for other semantic errors that correspond to ruff rules, while something like the first example can be used for new syntax-error-specific tests. In both cases, these integration tests will take advantage of the existing context tracking done by the real Checker instead of trying to duplicate that tracking in the parser's test SemanticSyntaxCheckerVisitor.

Comparing to the fuzzer results

Again, the initial errors in this issue were extracted from this comment in #7633 by following this procedure:

1. Download and unzip the fuzzer results from here
2. Run the Python script below to get a partially-deduplicated list of errors
3. Double check these errors against ruff and CPython's parser and compiler

```
import argparse
import compileall
import contextlib
import dataclasses
import io
import os
import re
import subprocess
from collections import Counter

FILE = re.compile(r'File "([^"]+)", line (\d+)')
MESSAGE = re.compile(r"SyntaxError: (.*)")


@dataclasses.dataclass
class Error:
    file: str
    line_number: int
    message: str


# Ruff rules already re-implemented as semantic syntax errors, not just rules
# that match CPython syntax errors.
RULES = [
    "F404",
    "F704",
    "F706",
    "PLE0118",
    "PLE1142",
]


def run_ruff(ruff_bin, path):
    # An exit code of 0 means ruff reported nothing for this file.
    return subprocess.run(
        [
            ruff_bin,
            "check",
            "--no-cache",
            "--silent",
            "--preview",
            "--config",
            f"lint.select={RULES}",
            path,
        ]
    ).returncode


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--ruff-bin", default="ruff")

    args = parser.parse_args()

    # compileall prints compile errors to stdout: capture them, discard stderr.
    # redirect_stderr needs a file object, so open os.devnull instead of
    # passing the path string.
    with (
        contextlib.redirect_stdout(io.StringIO()) as out,
        open(os.devnull, "w") as devnull,
        contextlib.redirect_stderr(devnull),
    ):
        compileall.compile_dir(".", force=True, quiet=1)

    seen_messages = Counter()
    errors = list()

    file, line_number = None, None
    for line in out.getvalue().splitlines():
        if m := FILE.search(line):
            file, line_number = m[1], int(m[2])
        if m := MESSAGE.search(line):
            message = m[1]
            # combine repetitive messages into single entries
            for prefix in [
                "unknown encoding",
                "invalid escape sequence",
                "'ascii' codec can't decode",
                "future feature",
                "invalid character",
                "invalid non-printable character",
                "no binding for nonlocal",
            ]:
                if message.startswith(prefix):
                    message = prefix
                    break
            # keep only the first occurrence of each message that ruff misses
            if (
                message not in seen_messages
                and run_ruff(args.ruff_bin, file) == 0
            ):
                errors.append(Error(file, line_number, message))

            seen_messages[message] += 1

    for error in sorted(errors, key=lambda e: e.message):
        count = seen_messages[error.message]
        print(f"{error.file}:{error.line_number}: {error.message} ({count})")


if __name__ == "__main__":
    main()
```
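
To make the parsing step of the script easier to follow, here is a small sketch that applies the same two regular expressions to a hand-written sample. The `SAMPLE` text is an illustrative assumption about what `compileall` prints for a failing file, not real fuzzer output, and `parse_errors` is a hypothetical helper that isolates the loop from the script.

```python
import re

# Same patterns as in the script above.
FILE = re.compile(r'File "([^"]+)", line (\d+)')
MESSAGE = re.compile(r"SyntaxError: (.*)")

# Illustrative sample of compileall output (assumed shape, not real results).
SAMPLE = """\
*** Error compiling './case_0001.py'...
  File "./case_0001.py", line 1
    break
    ^^^^^
SyntaxError: 'break' outside loop
"""


def parse_errors(output):
    """Pair each SyntaxError message with the most recent File/line header."""
    file, line_number = None, None
    found = []
    for line in output.splitlines():
        if m := FILE.search(line):
            file, line_number = m[1], int(m[2])
        if m := MESSAGE.search(line):
            found.append((file, line_number, m[1]))
    return found


if __name__ == "__main__":
    for entry in parse_errors(SAMPLE):
        print(entry)
```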

¹ There are some local violations of this principle. For example, the ReboundComprehensionVisitor recursively visits the expressions in a comprehension looking for any expressions rebinding one of the comprehension's iteration variables. However, this is a highly-localized AST traversal and won't have the same kind of performance implications as an additional full AST traversal.

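Contributor guide

Research approach: First, study the existing semantic syntax error implementations in `crates/ruff_python_parser/src/semantic_errors.rs`. Refer to the example PRs linked in this issue (e.g. #20426, #20868). Pick an unfinished error such as 'name is parameter and nonlocal' and implement a new `SemanticSyntaxErrorKind` variant. Follow the existing pattern for adding checks to `SemanticSyntaxChecker`, and add tests using inline `test_err` comments or integration tests in `crates/ruff_linter/src/linter.rs`. See the `Checker::report_semantic_error` method to make sure errors that overlap with existing rules are reported as proper diagnostics.
Tech stack: Python, Rust
Area: developer experience, tooling
Issue type: Feature
Difficulty: 4
Estimated time: 1-2 days
Activity: Active
Clarity: Clear
Prerequisites: Rust, Python, Git
Beginner-friendliness: 30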