keycloak/keycloak

Refresh token reuse counter not reverted on transaction rollback, causing permanent session revocation on transient DB failures

Open

#49,213 created on May 21, 2026

Labels: area/core, help wanted, kind/bug, priority/normal, status/auto-bump, status/auto-expire, team/core-protocols, team/core-shared

Description

Before reporting an issue

  • I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

Area

core

Describe the bug

When revokeRefreshToken=true is enabled on a realm, the per-session refresh-token reuse counter is incremented in-memory before the encompassing transaction commits. If the transaction subsequently rolls back for any reason, including transient database write failures, the in-memory counter mutation is not reverted. Subsequent refresh attempts re-read the advanced counter, exceed refreshTokenMaxReuse, and permanently revoke the session.

This behavior is acknowledged in the source itself with a TODO comment that has been present since 2017. It was tolerable when sessions were Infinispan-only and rarely faced partial-failure scenarios. With persistent-user-sessions (default since 26.0) the failure surface has grown to include every transient DB outage. The TODO has not been resolved.
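The failure mode described above can be illustrated with a deliberately simplified model. This is not Keycloak source: `ReuseCounterModel`, `ClientSession`, and `ReuseExceededException` are hypothetical stand-ins, and the check-then-increment order is an assumption based on the behavior reported here. It only shows how an in-memory counter that is advanced before commit stays advanced when the surrounding work fails.

```java
/** Simplified, hypothetical model of the reported behavior; not Keycloak source. */
final class ReuseCounterModel {

    /** Stand-in for the in-memory client session state. */
    static final class ClientSession {
        int refreshTokenUseCount = 0;
    }

    /** Stand-in for the error surfaced to the client as invalid_grant. */
    static final class ReuseExceededException extends RuntimeException {
        ReuseExceededException() {
            super("Maximum allowed refresh token reuse exceeded");
        }
    }

    private final int maxReuse;

    ReuseCounterModel(int maxReuse) {
        this.maxReuse = maxReuse;
    }

    /**
     * Models one refresh request. The counter is checked and advanced first,
     * then the transactional work runs. If that work throws, nothing in this
     * method puts the counter back.
     */
    void refresh(ClientSession session, Runnable transactionalWork) {
        if (session.refreshTokenUseCount > maxReuse) {
            throw new ReuseExceededException();
        }
        session.refreshTokenUseCount++;  // in-memory mutation before commit
        transactionalWork.run();         // may throw, modelling a rollback
    }

    public static void main(String[] args) {
        ClientSession session = new ClientSession();
        ReuseCounterModel model = new ReuseCounterModel(0);

        try {
            // First attempt: the "transaction" fails after the counter moved.
            model.refresh(session, () -> {
                throw new IllegalStateException("simulated read-only transaction");
            });
        } catch (IllegalStateException e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }
        System.out.println("use count after failed attempt: " + session.refreshTokenUseCount);

        try {
            // Second attempt with the same token: the budget is already spent.
            model.refresh(session, () -> { });
        } catch (ReuseExceededException e) {
            System.out.println("second attempt rejected: " + e.getMessage());
        }
    }
}
```

In this model the client never received a rotated token, yet the second attempt with the original token is rejected because the first, failed attempt consumed the only allowed use.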

Version

26.6.2

Regression

  • The issue is a regression

Expected behavior

If the encompassing transaction rolls back, the refresh-token reuse counter should be restored to its pre-request value. A failed refresh attempt must not consume a reuse credit.
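One way to picture the expected behavior is a compensation action registered before the counter is touched and executed only on rollback. The sketch below is a standalone illustration under that assumption; `Tx`, `registerRollbackAction`, and `RollbackAwareCounter` are hypothetical names, not Keycloak APIs, and a real fix would have to hook into whatever transaction mechanism the server actually uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a rollback-aware reuse counter; not Keycloak source. */
final class RollbackAwareCounter {

    /** Stand-in for the in-memory client session state. */
    static final class ClientSession {
        int refreshTokenUseCount = 0;
    }

    /** Minimal transaction stand-in that collects compensations for rollback. */
    static final class Tx {
        private final List<Runnable> onRollback = new ArrayList<>();

        void registerRollbackAction(Runnable action) {
            onRollback.add(action);
        }

        void commit() {
            onRollback.clear();  // success: compensations are discarded
        }

        void rollback() {
            // Undo in reverse registration order.
            for (int i = onRollback.size() - 1; i >= 0; i--) {
                onRollback.get(i).run();
            }
            onRollback.clear();
        }
    }

    /** Advances the counter, but restores the pre-request value if the work fails. */
    static void refresh(ClientSession session, int maxReuse, Runnable transactionalWork) {
        if (session.refreshTokenUseCount > maxReuse) {
            throw new IllegalStateException("Maximum allowed refresh token reuse exceeded");
        }
        Tx tx = new Tx();
        int before = session.refreshTokenUseCount;
        tx.registerRollbackAction(() -> session.refreshTokenUseCount = before);
        session.refreshTokenUseCount = before + 1;
        try {
            transactionalWork.run();
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();  // a failed attempt must not consume a reuse credit
            throw e;
        }
    }
}
```

Restoring a snapshot is the simplest form of compensation. With concurrent refresh requests on the same session, a decrement or a counter that is only written at commit time may be safer, since restoring a snapshot could overwrite another request's increment.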

Actual behavior

The reuse counter advances regardless of whether the transaction commits or rolls back. Multiple failed refresh attempts in succession (e.g., during a 1–2 minute Postgres failover) consume the full reuse budget and trigger session revocation. Sessions revoked this way cannot be recovered programmatically: from then on, every refresh attempt returns invalid_grant: Maximum allowed refresh token reuse exceeded, and the token must be reissued out of band.

How to Reproduce?

Conceptual reproduction:

  1. Configure a realm with:
    • revokeRefreshToken = true
    • refreshTokenMaxReuse = 0 (default)
  2. Enable persistent-user-sessions (default since 26.0).
  3. Obtain an offline refresh token RT0 for some client/user.
  4. Submit a refresh request using RT0. Inject a transient failure that forces the request transaction to roll back after validateTokenReuseForRefresh runs but before commitImpl/asyncCommit completes. Practical injection points:
    • Set Postgres default_transaction_read_only = on on the primary so the next Hibernate flush in the request transaction throws.
    • Register a custom EventListenerProviderFactory whose onEvent(REFRESH_TOKEN, ...) throws after the reuse counter has been incremented.
  5. Observe that the client receives an error (HTTP 500 in 26.4+ with batching disabled).
  6. Submit a second refresh request with RT0 (the original token; the client has no other token because step 5 did not return a usable rotation).
  7. Observe invalid_grant: Maximum allowed refresh token reuse exceeded. The session is now permanently revoked.
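The refresh requests in steps 4 and 6 can be issued with a small client like the sketch below. It assumes a public client (no client_secret), the standard OAuth 2.0 refresh_token grant, and Keycloak's token endpoint under /realms/{realm}/protocol/openid-connect/token; the base URL, realm, and client ID in the usage note are placeholders, and `RefreshRequestSketch` is a hypothetical helper name.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the refresh request used in steps 4 and 6. */
final class RefreshRequestSketch {

    /** Form body for the refresh_token grant; values are URL-encoded. */
    static String formBody(String clientId, String refreshToken) {
        return "grant_type=refresh_token"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&refresh_token=" + URLEncoder.encode(refreshToken, StandardCharsets.UTF_8);
    }

    /** Builds (but does not send) a POST to the realm's token endpoint. */
    static HttpRequest build(String baseUrl, String realm, String clientId, String refreshToken) {
        URI tokenEndpoint = URI.create(
                baseUrl + "/realms/" + realm + "/protocol/openid-connect/token");
        return HttpRequest.newBuilder(tokenEndpoint)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(clientId, refreshToken)))
                .build();
    }
}
```

To follow the steps, build the request once with RT0, for example `RefreshRequestSketch.build("https://keycloak.example.com", "myrealm", "partner-client", rt0)`, and send it twice with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`: once while the failure is injected, and once afterwards. Comparing the two response bodies shows whether the second attempt returns the invalid_grant error from step 7.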

Anything else?

Real-world impact

On a production OpenShift cluster, a routine kube-apiserver blip during a cluster upgrade triggered a brief Patroni failover. The old Postgres primary stayed alive but set default_transaction_read_only = on. Linux conntrack pinned existing JDBC connections to the now-read-only pod, so Keycloak writes failed for ~4 hours until the connection pool was forcibly cycled.

During that window:

  • 20,077 internal client_credentials logins succeeded (no DB writes on this path).
  • 684 external-partner offline-refresh-token sessions (738 distinct client IDs) crossed refreshTokenMaxReuse=1 within 1–2 retries and were permanently revoked. 678 of 684 (99.1%) were revoked through this counter-advancement path.
  • Recovery required manual token reissuance via an out-of-band partner-management workflow over several days.

Patroni failovers triggered by kube-apiserver blips during control-plane upgrades are an expected event class on OpenShift; this failure mode is therefore expected to recur on every cluster upgrade until fixed.
