Refresh token reuse counter not reverted on transaction rollback, causing permanent session revocation on transient DB failures · keycloak/keycloak#49213

1 comment · 2 reactions · 0 assignees · Java · 34,398 stars · 8,346 forks

Labels: area/core, help wanted, kind/bug, priority/normal, status/auto-bump, status/auto-expire, team/core-protocols, team/core-shared

Description

Before reporting an issue

I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.

Area

core

Describe the bug

When revokeRefreshToken=true is enabled on a realm, the per-session refresh-token reuse counter is incremented in-memory before the encompassing transaction commits. If the transaction subsequently rolls back for any reason, including transient database write failures, the in-memory counter mutation is not reverted. Subsequent refresh attempts re-read the advanced counter, exceed refreshTokenMaxReuse, and permanently revoke the session.

This behavior is acknowledged in the source itself with a TODO comment that has been present since 2017. It was tolerable when sessions were Infinispan-only and rarely faced partial-failure scenarios. With persistent-user-sessions (default since 26.0) the failure surface has grown to include every transient DB outage. The TODO has not been resolved.

Version

26.6.2

Regression

The issue is a regression

Expected behavior

If the encompassing transaction rolls back, the refresh-token reuse counter should be restored to its pre-request value. A failed refresh attempt must not consume a reuse credit.
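One way to picture the expected behavior is a rollback hook that restores the previous counter value. The sketch below is a self-contained toy model, not Keycloak code: `ToyTransaction`, `ToyClientSession`, and their methods are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a request-scoped transaction; not a Keycloak class.
class ToyTransaction {
    private final List<Runnable> rollbackHooks = new ArrayList<>();

    void onRollback(Runnable hook) {
        rollbackHooks.add(hook);
    }

    void commit() {
        // Changes are kept; the undo hooks are no longer needed.
        rollbackHooks.clear();
    }

    void rollback() {
        // Undo in reverse order so the oldest snapshot is restored last.
        for (int i = rollbackHooks.size() - 1; i >= 0; i--) {
            rollbackHooks.get(i).run();
        }
        rollbackHooks.clear();
    }
}

// Hypothetical in-memory client session holding per-token reuse counters.
class ToyClientSession {
    private final Map<String, Integer> useCounts = new HashMap<>();

    int getUseCount(String tokenId) {
        return useCounts.getOrDefault(tokenId, 0);
    }

    void incrementUseCount(String tokenId, ToyTransaction tx) {
        final Integer previous = useCounts.get(tokenId); // null if no entry yet
        useCounts.merge(tokenId, 1, Integer::sum);
        // Register the compensating action alongside the mutation.
        tx.onRollback(() -> {
            if (previous == null) {
                useCounts.remove(tokenId);
            } else {
                useCounts.put(tokenId, previous);
            }
        });
    }
}

class RollbackHookDemo {
    public static void main(String[] args) {
        ToyClientSession session = new ToyClientSession();
        ToyTransaction tx = new ToyTransaction();
        session.incrementUseCount("RT0", tx);
        tx.rollback(); // simulated transient DB failure
        System.out.println(session.getUseCount("RT0")); // prints 0
    }
}
```

In this model a failed refresh leaves the counter at its pre-request value, so it does not consume a reuse credit.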

Actual behavior

The reuse counter advances regardless of whether the transaction commits or rolls back. Multiple failed refresh attempts in succession (e.g., during a 1–2 minute Postgres failover) consume the full reuse budget and trigger session revocation. Sessions revoked this way cannot be recovered programmatically: the client receives "invalid_grant: Maximum allowed refresh token reuse exceeded" indefinitely, and the token must be reissued out of band.

How to Reproduce?

Conceptual reproduction:

1. Configure a realm with:
   - revokeRefreshToken = true
   - refreshTokenMaxReuse = 0 (default)
2. Enable persistent-user-sessions (default since 26.0).
3. Obtain an offline refresh token RT0 for some client/user.
4. Submit a refresh request using RT0. Inject a transient failure that forces the request transaction to roll back after validateTokenReuseForRefresh runs but before commitImpl/asyncCommit completes. Practical injection points:
   - Set Postgres default_transaction_read_only = on on the primary so the next Hibernate flush in the request transaction throws.
   - Register a custom EventListenerProviderFactory whose onEvent(REFRESH_TOKEN, ...) throws after the reuse counter has been incremented.
5. Observe that the client receives an error (HTTP 500 in 26.4+ with batching disabled).
6. Submit a second refresh request with RT0 (the original token; the client has no other token because step 5 did not return a usable rotation).
7. Observe invalid_grant: Maximum allowed refresh token reuse exceeded. The session is now permanently revoked.
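The sequence above can be simulated without a Keycloak instance. The following is a minimal toy model of the reported behavior, assuming the check is "count greater than max reuse" followed by an in-memory increment. `ToyTokenEndpoint` and its `refresh` method are hypothetical and only loosely mirror what validateTokenReuseForRefresh is described as doing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy endpoint; it models only the reuse counter, not token rotation.
class ToyTokenEndpoint {
    private final int maxReuse;
    private final Map<String, Integer> useCounts = new HashMap<>();

    ToyTokenEndpoint(int maxReuse) {
        this.maxReuse = maxReuse;
    }

    // Returns "ok", "server_error", or "invalid_grant".
    String refresh(String tokenId, boolean dbWritable) {
        int count = useCounts.getOrDefault(tokenId, 0);
        if (count > maxReuse) {
            return "invalid_grant"; // reuse budget exceeded
        }
        // In-memory mutation happens before the transaction outcome is known.
        useCounts.put(tokenId, count + 1);
        if (!dbWritable) {
            // The transaction rolls back, but nothing reverts the counter (the bug).
            return "server_error";
        }
        return "ok";
    }
}

class ReuseReproDemo {
    public static void main(String[] args) {
        ToyTokenEndpoint endpoint = new ToyTokenEndpoint(0); // refreshTokenMaxReuse = 0
        System.out.println(endpoint.refresh("RT0", false)); // prints server_error
        System.out.println(endpoint.refresh("RT0", true));  // prints invalid_grant
    }
}
```

The second call fails even though the database is writable again, which matches the last two steps of the reproduction.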

Anything else?

Real-world impact

On a production OpenShift cluster, a routine kube-apiserver blip during a cluster upgrade triggered a brief Patroni failover. The old Postgres primary stayed alive but set default_transaction_read_only = on. Linux conntrack pinned existing JDBC connections to the now-read-only pod, so Keycloak writes failed for ~4 hours until the connection pool was forcibly cycled.

During that window:

- 20,077 internal client_credentials logins succeeded (no DB writes on this path).
- 684 external-partner offline-refresh-token sessions (738 distinct client IDs) crossed refreshTokenMaxReuse=1 within 1–2 retries and were permanently revoked. 678 of 684 (99.1%) were revoked through this counter-advancement path.
- Recovery required manual token reissuance via an out-of-band partner-management workflow over several days.

Patroni failovers triggered by kube-apiserver blips during control-plane upgrades are an expected event class on OpenShift; this failure mode is therefore expected to recur on every cluster upgrade until fixed.

Contributor guide

Tech stack: java, sql, postgresql
Domain: backend, security, authentication, database
Issue type: bug
Difficulty: 4
Estimated time: over 1 week
Activity status: fresh
Clarity: mostly clear
Prerequisites: Java, Keycloak internals, transaction management
Beginner friendliness: 15
Research direction: Investigate the TODO at `PersistentSessionsChangelogBasedTransaction.java` line 206. The bug is that the refresh token reuse counter is incremented in memory before transaction commit and not reverted on rollback. The fix likely involves either deferring the counter increment until after commit or adding a rollback hook to revert the counter. Evaluate the `validateTokenReuseForRefresh` method and the transaction commit/rollback flow in the Infinispan session model. Also consider the `persistent user sessions` feature, which increases the failure surface. Existing comments and the reproduction steps provide guidance; no linked PR or assignee was found.
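The first option mentioned here, deferring the counter increment until after commit, could look roughly like the sketch below. It is a toy model under stated assumptions: `DeferringTransaction` and `DeferredCounterStore` are hypothetical names, and the real change would have to fit Keycloak's own transaction and session classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical transaction that buffers changes and applies them only on commit.
class DeferringTransaction {
    private final List<Runnable> pending = new ArrayList<>();

    void enlistAfterCommit(Runnable change) {
        pending.add(change);
    }

    void commit() {
        // Apply buffered changes in the order they were enlisted.
        for (Runnable change : pending) {
            change.run();
        }
        pending.clear();
    }

    void rollback() {
        // Buffered changes are dropped, so shared state is never touched.
        pending.clear();
    }
}

// Hypothetical store of committed reuse counters keyed by refresh token id.
class DeferredCounterStore {
    private final Map<String, Integer> committed = new HashMap<>();

    int committedCount(String tokenId) {
        return committed.getOrDefault(tokenId, 0);
    }

    void recordUse(String tokenId, DeferringTransaction tx) {
        // The increment is only scheduled here; it runs if the transaction commits.
        tx.enlistAfterCommit(() -> committed.merge(tokenId, 1, Integer::sum));
    }
}

class DeferredIncrementDemo {
    public static void main(String[] args) {
        DeferredCounterStore store = new DeferredCounterStore();

        DeferringTransaction failed = new DeferringTransaction();
        store.recordUse("RT0", failed);
        failed.rollback();
        System.out.println(store.committedCount("RT0")); // prints 0

        DeferringTransaction ok = new DeferringTransaction();
        store.recordUse("RT0", ok);
        ok.commit();
        System.out.println(store.committedCount("RT0")); // prints 1
    }
}
```

One trade-off of this design is that concurrent requests do not see each other's pending increments until commit, which may matter for replay detection and would need to be evaluated against the rollback-hook alternative.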