Refresh token reuse counter not reverted on transaction rollback, causing permanent session revocation on transient DB failures
#49,213 created on May 21, 2026
Description
Before reporting an issue
- I have read and understood the above terms for submitting issues, and I understand that my issue may be closed without action if I do not follow them.
Area
core
Describe the bug
When revokeRefreshToken=true is enabled on a realm, the per-session refresh-token reuse counter is incremented in-memory before the encompassing transaction commits. If the transaction subsequently rolls back for any reason, including transient database write failures, the in-memory counter mutation is not reverted. Subsequent refresh attempts re-read the advanced counter, exceed refreshTokenMaxReuse, and permanently revoke the session.
This behavior is acknowledged in the source itself with a TODO comment that has been present since 2017. It was tolerable when sessions were Infinispan-only and rarely faced partial-failure scenarios. With persistent-user-sessions (default since 26.0) the failure surface has grown to include every transient DB outage. The TODO has not been resolved.
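The ordering problem can be illustrated with a small, self-contained model. This is only a sketch, assuming the check-then-increment-then-commit ordering described above; `ReuseCounterLeak`, `Session`, `Result`, and `refresh` are hypothetical names for illustration and are not Keycloak classes or methods.

```java
// Hypothetical model of the failure mode; none of these types exist in Keycloak.
public class ReuseCounterLeak {

    /** Stands in for the per-session state that holds the reuse counter. */
    static final class Session {
        int reuseCount = 0;
        boolean revoked = false;
    }

    /** Outcome of one simulated refresh request. */
    enum Result { OK, SERVER_ERROR, INVALID_GRANT }

    /**
     * Assumed ordering: the counter is checked and incremented in memory first,
     * and the (possibly failing) transaction commit happens afterwards.
     */
    static Result refresh(Session s, int maxReuse, boolean commitFails) {
        if (s.revoked || s.reuseCount > maxReuse) {
            s.revoked = true;               // reuse budget exceeded: session is revoked
            return Result.INVALID_GRANT;
        }
        s.reuseCount++;                     // in-memory mutation before commit
        if (commitFails) {
            return Result.SERVER_ERROR;     // rollback: nothing restores reuseCount
        }
        return Result.OK;
    }

    public static void main(String[] args) {
        Session s = new Session();
        System.out.println(refresh(s, 0, true));   // first attempt, commit fails
        System.out.println(refresh(s, 0, false));  // retry with the same token
    }
}
```

In this model, with a maximum reuse of 0, a single rolled-back attempt is enough for the retry to be rejected, even though the first attempt never returned a token to the client.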
Version
26.6.2
Regression
- The issue is a regression
Expected behavior
If the encompassing transaction rolls back, the refresh-token reuse counter should be restored to its pre-request value. A failed refresh attempt must not consume a reuse credit.
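One way to express this expectation is a compensating action registered with the transaction. The following is a minimal sketch of that idea, not a proposed patch: `Tx`, `onRollback`, and `incrementReuse` are hypothetical and do not correspond to any existing Keycloak transaction API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: restore the counter when the surrounding transaction rolls back.
public class ReuseCounterRollback {

    /** Stands in for the per-session state that holds the reuse counter. */
    static final class Session {
        int reuseCount = 0;
    }

    /** Hypothetical transaction that runs compensating actions on rollback. */
    static final class Tx {
        private final Deque<Runnable> undo = new ArrayDeque<>();

        void onRollback(Runnable action) { undo.push(action); }

        /** A successful commit discards the compensating actions. */
        void commit() { undo.clear(); }

        /** Rollback runs compensating actions in reverse registration order. */
        void rollback() {
            while (!undo.isEmpty()) {
                undo.pop().run();
            }
        }
    }

    /** Increments the counter and registers an action that restores the old value. */
    static void incrementReuse(Session s, Tx tx) {
        final int before = s.reuseCount;
        tx.onRollback(() -> s.reuseCount = before);
        s.reuseCount = before + 1;
    }

    public static void main(String[] args) {
        Session s = new Session();
        Tx tx = new Tx();
        incrementReuse(s, tx);
        tx.rollback();
        System.out.println("reuseCount after rollback: " + s.reuseCount);
    }
}
```

With this shape, a failed refresh attempt leaves the counter at its pre-request value, so it does not consume a reuse credit.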
Actual behavior
The reuse counter advances regardless of whether the transaction commits or rolls back. Multiple failed refresh attempts in succession (e.g., during a 1–2 minute Postgres failover) consume the full reuse budget and trigger session revocation. Sessions revoked this way cannot be recovered programmatically: the client receives invalid_grant: Maximum allowed refresh token reuse exceeded indefinitely, and the token must be reissued out of band.
How to Reproduce?
Conceptual reproduction:
1. Configure a realm with:
   - revokeRefreshToken = true
   - refreshTokenMaxReuse = 0 (default)
2. Enable persistent-user-sessions (default since 26.0).
3. Obtain an offline refresh token RT0 for some client/user.
4. Submit a refresh request using RT0. Inject a transient failure that forces the request transaction to roll back after validateTokenReuseForRefresh runs but before commitImpl/asyncCommit completes. Practical injection points:
   - Set Postgres default_transaction_read_only = on on the primary so the next Hibernate flush in the request transaction throws.
   - Register a custom EventListenerProviderFactory whose onEvent(REFRESH_TOKEN, ...) throws after the reuse counter has been incremented.
5. Observe that the client receives an error (HTTP 500 in 26.4+ with batching disabled).
6. Submit a second refresh request with RT0 (the original token; the client has no other token because step 5 did not return a usable rotation).
7. Observe invalid_grant: Maximum allowed refresh token reuse exceeded. The session is now permanently revoked.
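The two refresh submissions in the reproduction above could be driven by a small client like the one below. This is a sketch under assumptions: it targets a public client (no client secret), uses the standard OAuth 2.0 refresh_token grant against the realm's OpenID Connect token endpoint, and takes the base URL, realm, client ID, and RT0 as command-line arguments. The class and method names are made up for illustration, and the failure injection itself has to be arranged separately between or before the calls.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that sends the same refresh token twice and prints both responses.
public class RefreshTwice {

    /** Builds the form body for a refresh_token grant (public client assumed). */
    static String formBody(String clientId, String refreshToken) {
        return "grant_type=refresh_token"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&refresh_token=" + URLEncoder.encode(refreshToken, StandardCharsets.UTF_8);
    }

    /** Builds the POST request against the realm's token endpoint. */
    static HttpRequest refreshRequest(String baseUrl, String realm, String clientId, String refreshToken) {
        URI tokenEndpoint = URI.create(baseUrl + "/realms/" + realm + "/protocol/openid-connect/token");
        return HttpRequest.newBuilder(tokenEndpoint)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(clientId, refreshToken)))
                .build();
    }

    // Usage: java RefreshTwice <baseUrl> <realm> <clientId> <RT0>
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = refreshRequest(args[0], args[1], args[2], args[3]);
        for (int attempt = 1; attempt <= 2; attempt++) {
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("attempt " + attempt + ": HTTP " + response.statusCode() + " " + response.body());
        }
    }
}
```

If the bug is present and the first attempt hits the injected failure, the second attempt would be expected to return the invalid_grant error described above rather than a fresh token.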
Anything else?
Real-world impact
On a production OpenShift cluster, a routine kube-apiserver blip during a cluster upgrade triggered a brief Patroni failover. The old Postgres primary stayed alive but set default_transaction_read_only = on. Linux conntrack pinned existing JDBC connections to the now-read-only pod, so Keycloak writes failed for ~4 hours until the connection pool was forcibly cycled.
During that window:
- 20,077 internal client_credentials logins succeeded (no DB writes on this path).
- 684 external-partner offline-refresh-token sessions (738 distinct client IDs) crossed refreshTokenMaxReuse=1 within 1–2 retries and were permanently revoked. 678 of 684 (99.1%) were revoked through this counter-advancement path.
- Recovery required manual token reissuance via an out-of-band partner-management workflow over several days.
Patroni failovers triggered by kube-apiserver blips during control-plane upgrades are an expected event class on OpenShift; this failure mode is therefore expected to recur on every cluster upgrade until fixed.