1.9.0 added server.timeout = 300s to reap dead mobile connections (B3). But
Node's socket timeout fires on INACTIVITY, and a paused audio stream is
inactive (no bytes flow while backpressured) -- so a pause longer than the
timeout had the server destroy the stream's connection, forcing a reconnect
on resume. On both web and iOS that surfaced as 'I pause, then have to focus
the app for it to play again' after a multi-minute pause; pre-1.9.0 had no
such timeout, so paused streams survived (the exact D1 risk the spec flagged).
Reap genuinely dead/half-open peers (mobile network gone without FIN/RST) via
TCP keepalive instead: server.timeout = 0, and each connection gets
setKeepAlive(true, 30s) so the OS drops a socket once probes fail while a
paused-but-alive stream keeps answering and stays connected.
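
A minimal sketch of the keepalive setup described above. The interfaces are
structural stand-ins for Node's http.Server and net.Socket (which expose
`timeout`, `on("connection", ...)` and `setKeepAlive(enable, initialDelay)`);
the helper name is hypothetical and not taken from the codebase.

```typescript
// Stand-ins so the sketch does not depend on Node typings.
interface SocketLike {
  setKeepAlive(enable: boolean, initialDelayMs: number): unknown;
}
interface ServerLike {
  timeout: number;
  on(event: "connection", listener: (socket: SocketLike) => void): unknown;
}

const KEEPALIVE_INITIAL_DELAY_MS = 30000;

// Hypothetical helper name.
function reapDeadPeersViaKeepAlive(server: ServerLike): void {
  // 0 disables the inactivity timeout, so a paused (backpressured) stream
  // is never destroyed merely for being idle.
  server.timeout = 0;
  // Let the OS probe idle sockets instead: a live peer answers the probes,
  // a vanished one fails them and the socket is dropped.
  server.on("connection", (socket) => {
    socket.setKeepAlive(true, KEEPALIVE_INITIAL_DELAY_MS);
  });
}
```

With a real server this would be called once, right after the server object
is created and before it starts listening.
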
Production showed 24 unique-constraint violations on
DiscoveryAlbum(userId, weekStartDate, rgMbid) in 18h: the scan-completion and
reconciliation paths can both create Discovery records for the same album in
the same week, so the second create threw, rolled back the transaction, and
dropped that album's DiscoveryTrack records. Upsert makes it idempotent --
an existing record is left untouched and the track loop fills any gaps.
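
A sketch of the idempotent create, assuming a Prisma-style upsert with an
empty update. The delegate interface, the compound-key name
(userId_weekStartDate_rgMbid, Prisma's default naming convention) and the
helper name are assumptions for illustration only.

```typescript
interface AlbumKey {
  userId: string;
  weekStartDate: string; // ISO date; the real column type is an assumption
  rgMbid: string;
}

// Structural stand-in for the model delegate.
interface DiscoveryAlbumDelegate {
  upsert(args: {
    where: { userId_weekStartDate_rgMbid: AlbumKey };
    create: AlbumKey;
    update: Record<string, never>;
  }): Promise<AlbumKey & { id: string }>;
}

// Hypothetical helper: both creation paths call this instead of create().
async function ensureDiscoveryAlbum(
  db: DiscoveryAlbumDelegate,
  key: AlbumKey,
): Promise<AlbumKey & { id: string }> {
  return db.upsert({
    where: { userId_weekStartDate_rgMbid: key },
    create: key,
    update: {}, // empty update: an existing row is returned untouched
  });
}
```

Whichever path runs second gets the existing row back instead of throwing, so
the transaction (and the track loop after it) can continue.
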
Audio engine rewrite, audiobook session model, podcast auto-refresh recovery,
functional settings, and the stream/QoL hardening from this cycle. Full notes
in CHANGELOG.md.
- Library auto-sync cron skips enqueuing when a scan is already active/waiting,
so it can't stack a redundant full rescan behind a manual or webhook scan.
- Subsonic star.view is now best-effort: it attempts every id, skips missing
tracks (P2003), logs genuine failures, and never early-returns mid-loop
(which left some tracks starred while reporting failure). It reports an error
only when a real failure occurred and nothing got starred.
- refreshPodcastFeed upserts episodes on (podcastId, guid) instead of
find-then-create, closing a TOCTOU race between the manual refresh route and
the auto-refresh job that could throw on the unique constraint.
- Onboarding: rename the shadowing 'user' var in the recovery path for clarity.
Review found the allowlist (9 prefixes) missed ~15 real cache namespaces
(homepage:, mixes:, search:, discovery:, colors:, preview:, fanart:, songlink:,
genres:, radio:, album:, artist:, playlists:, ...), so 'Clear Caches' was a
partial clear that would leave most caches stale. Inverted to a denylist that
spares only the operational namespaces (bull:, sess:, audio:, clap:,
enrichment, lock:, sse:) and clears everything else -- complete and drift-proof
as new caches are added. Verified read-only against production: clears 5210
cache keys, protects all 190 operational keys (queues, control plane).
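
The denylist check can be sketched as a pure predicate. The prefixes are
copied from the list above; the function names are hypothetical.

```typescript
// Operational namespaces that must survive a cache clear.
const PROTECTED_PREFIXES: readonly string[] = [
  "bull:", "sess:", "audio:", "clap:", "enrichment", "lock:", "sse:",
];

// Anything not explicitly protected is treated as a rebuildable cache,
// so a newly added cache namespace is cleared without touching this list.
function isClearable(key: string): boolean {
  return !PROTECTED_PREFIXES.some((prefix) => key.startsWith(prefix));
}

function partitionKeys(keys: readonly string[]): { clear: string[]; keep: string[] } {
  const clear: string[] = [];
  const keep: string[] = [];
  for (const key of keys) {
    (isClearable(key) ? clear : keep).push(key);
  }
  return { clear, keep };
}
```
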
Review found the 15-min grace was effectively inert: the phase parked an
entity as 'enriching'/'_queued' even when the add() no-op'd against a failed
jobId still held within the grace window -- removing it from selection until a
process restart, so the advertised auto-retry never happened. Each phase now
checks queue.getJob(jobId) and only enqueues + parks when the slot is actually
free; a held slot is skipped, leaving the entity selectable so it backs off
and genuinely retries once the grace clean frees the slot. Adds a test
asserting a held slot is skipped (no re-add, no park).
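
A sketch of the slot check, written against a structural stand-in for the
BullMQ queue (getJob and add with a jobId option are real BullMQ calls; the
helper, its parameters and the park callback are hypothetical).

```typescript
interface QueueLike<T> {
  getJob(jobId: string): Promise<unknown>;
  add(jobName: string, data: T, opts: { jobId: string }): Promise<unknown>;
}

// Returns true when the entity was enqueued and parked, false when the
// jobId slot is still held and the entity was left selectable.
async function enqueueIfSlotFree<T>(
  queue: QueueLike<T>,
  jobName: string,
  data: T,
  jobId: string,
  park: () => Promise<void>,
): Promise<boolean> {
  const held = await queue.getJob(jobId);
  if (held) {
    // A failed job inside the grace window still owns this jobId, so add()
    // would silently no-op. Skip without parking.
    return false;
  }
  await queue.add(jobName, data, { jobId });
  await park(); // only mark 'enriching'/'_queued' once a job really exists
  return true;
}
```
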
- Soulseek search relied solely on an SSE 'complete' event to clear its
spinner; if that event was dropped (connection blip, backend never emits it)
the search UI spun forever. Add a 45s fallback that force-completes the
search so the user sees whatever results arrived; late results still stream
in via the store subscription.
- Onboarding's 'username already taken' path told the user to refresh, which
can't recover the half-created account (the token never persisted). Instead
attempt a login with the same credentials and continue: resume at step 2 if
onboarding is unfinished, route home if already complete, or send to the
normal sign-in for a 2FA account. A genuine password mismatch now gets a
clear 'sign in instead' message rather than a dead end.
- Subsonic star.view swallowed every error and returned success, so a
third-party app could star a track that never saved. Now only a P2003 FK
violation (track legitimately missing) is absorbed; any other error is
logged and returns a Subsonic error. Scrobble play-log failures are logged
instead of silently discarded.
- The podcasts page sorted by author/title with a raw localeCompare on an
optional field, so one feed with no author crashed the whole page via the
error boundary. Comparators are now null-guarded.
- The audio analyzer re-logged the same 'N tracks permanently failed' warning
every idle cycle (~50s) forever; it now logs only when the count changes.
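
For the podcasts-page fix in the list above, a null-guarded comparator might
look like the sketch below. The field names and the empty-string fallback are
assumptions, not the actual page code.

```typescript
interface PodcastLike {
  title?: string | null;
  author?: string | null;
}

// Treat a missing value as an empty string so localeCompare never runs on
// undefined/null.
function compareOptional(a: string | null | undefined, b: string | null | undefined): number {
  return (a ?? "").localeCompare(b ?? "");
}

// Sort by author, then by title as a tie-breaker; the input is not mutated.
function sortByAuthor<T extends PodcastLike>(feeds: readonly T[]): T[] {
  return [...feeds].sort(
    (x, y) => compareOptional(x.author, y.author) || compareOptional(x.title, y.title),
  );
}
```
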
Two settings the UI presented as working did nothing. The transcode cache
size slider saved its value to the DB, but the limit was only ever read from
the TRANSCODE_CACHE_MAX_GB env var, which the save path never wrote -- so the
slider was inert even across the restart its own hint told the user to
perform. The value is now written to .env on save, matching the
restart-required contract.
The 'Auto sync library' toggle had zero readers because no periodic library
scan existed at all (scans were webhook/manual only). Adds a library-sync cron
(every 6h, gated on the autoSync setting) that enqueues a full scan so music
added outside the download pipeline is picked up automatically.
The podcast dedup-on-failure trap was live on three more queues. The artist
and mood-tags phases never cleaned their queues at all, so a failed job's
jobId marker blocked re-queue until BullMQ's 24h removeOnFail age expired --
far slower than the worker's documented intent to re-pick-up a failed track.
The admin vibe start/retry routes cleaned only completed jobs, so 'Retry
failed embeddings' silently dropped tracks with a lingering failed job.
Automatic phases now clean completed (grace 0, immediately reusable on
success) and failed (15-min grace, so a permanently-failing entity retries
on a backoff instead of every 5s cycle). The manual admin retry routes clean
failed immediately -- the user asked to retry now. Adds a 3-test regression
suite asserting the grace-0-completed / grace-positive-failed split.
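
The grace split can be sketched against a stand-in for BullMQ's
queue.clean(grace, limit, type). The helper names and the batch limit are
assumptions.

```typescript
interface CleanableQueue {
  clean(graceMs: number, limit: number, type: "completed" | "failed"): Promise<unknown>;
}

const FAILED_GRACE_MS = 15 * 60 * 1000;
const CLEAN_LIMIT = 1000; // assumed batch size, not from the codebase

// Automatic phases: free completed jobIds at once, but hold failed ones for
// the grace period so a permanently failing entity backs off.
async function cleanForAutomaticPhase(queue: CleanableQueue): Promise<void> {
  await queue.clean(0, CLEAN_LIMIT, "completed");
  await queue.clean(FAILED_GRACE_MS, CLEAN_LIMIT, "failed");
}

// Manual admin retry: the user asked to retry now, so no grace on failed.
async function cleanForManualRetry(queue: CleanableQueue): Promise<void> {
  await queue.clean(0, CLEAN_LIMIT, "completed");
  await queue.clean(0, CLEAN_LIMIT, "failed");
}
```
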
The Clear Caches button never did anything: the handler used the node-redis
v4 scan signature (options object + { cursor, keys } result) against our
ioredis client, whose scan takes positional args and returns [cursor, keys].
Every call threw and cleared nothing -- which is why clearing the cache did
not dislodge the wedged podcast jobs.
Even had it run, "delete every key except sess:" would have wiped live
BullMQ queue state (bull:*, 200+ keys) and the enrichment/audio/clap control
plane. Replace that with an allowlist of genuine rebuildable caches
(MusicBrainz, cover art, Last.fm, Wikidata, Deezer, iTunes, hero images) and
delete in chunks. Verified read-only against production: clears ~5130 cache
keys, preserves all bull:/audio:/enrichment:/clap:/sess: keys.
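
A sketch of the corrected scan loop. The interface mirrors the ioredis call
shape described above (positional MATCH/COUNT arguments, a [cursor, keys]
tuple result); the helper names and the COUNT value are assumptions.

```typescript
interface ScanClient {
  scan(
    cursor: string,
    matchToken: "MATCH",
    pattern: string,
    countToken: "COUNT",
    count: number,
  ): Promise<[string, string[]]>;
}

// Walk the keyspace until the cursor wraps back to "0". SCAN may return a
// key more than once, so the result is de-duplicated.
async function collectKeys(redis: ScanClient, pattern: string): Promise<string[]> {
  const seen = new Set<string>();
  let cursor = "0";
  do {
    const [next, batch] = await redis.scan(cursor, "MATCH", pattern, "COUNT", 500);
    batch.forEach((key) => seen.add(key));
    cursor = next;
  } while (cursor !== "0");
  return Array.from(seen);
}

// Split keys into fixed-size chunks so each delete call stays small.
function chunkKeys<T>(items: readonly T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```
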
BullMQ keeps the jobId dedup marker for failed jobs, not just completed
ones. The podcast and vibe refresh phases cleaned only "completed", so a
single failed (or Redis-corrupted, data-less) job kept its jobId marker
forever -- every later add() with that jobId silently no-op'd and the entity
never refreshed again. In production all 4 podcasts were frozen since a job
corruption event; the worker was throwing findUnique({ id: undefined }) on
data-less jobs.
Fix:
- podcast + vibe phases clean BOTH "completed" and "failed" so a failed
job's jobId is reusable.
- podcast phase optimistically advances lastRefreshed for the selected feeds
before queuing -- refreshPodcastFeed only advances it on success/304, so
this gives a failing feed a real backoff window instead of being re-queued
every cycle.
- podcast worker guards against corrupt/data-less jobs (clear error instead
of a confusing Prisma undefined-id throw).
Adds a 5-test regression suite asserting the failed-set clean and the
claim-before-queue ordering. Production Redis cleared of the poisoned jobs.
Phase B gave AudioControlsProvider a useToast() dependency, but ToastProvider
was nested inside ConditionalAudioProvider -- the hook threw on the first
authenticated render and AudioErrorBoundary blanked every page's content.
Caught by the nightly E2E suite (19 failures), invisible to tsc/lint/build/
prerender because none execute the authenticated client tree.
A 24h-old session no longer kills playback: stream URL builders refresh the
token proactively when it expires within the hour, and a terminal code-4
media error is classified via a bounded /api/auth/me probe -- a stale token
refreshes and the stream reloads at position with no teardown, no skip, and
no logout. Token refresh now distinguishes a transient network failure
(tokens kept) from a server-rejected refresh (session cleared), so an
offline moment can never force a re-login (FE11/B9). The iOS trace logger
becomes opt-in (?ios_debug=1 flag only) with batched persistence instead of
a synchronous storage write per event (FE17).
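
The two token rules above can be sketched as pure functions. All names and
the outcome shape are hypothetical; only the one-hour window and the
kept-versus-cleared distinction come from the text.

```typescript
const REFRESH_WINDOW_MS = 60 * 60 * 1000; // "expires within the hour"

// Stream URL builders would check this before embedding a token.
function shouldRefreshProactively(expiresAtMs: number, nowMs: number): boolean {
  return expiresAtMs - nowMs <= REFRESH_WINDOW_MS;
}

type RefreshOutcome =
  | { kind: "refreshed" }
  | { kind: "network-failure" }
  | { kind: "server-rejected" };

type SessionAction = "store-new-tokens" | "keep-tokens" | "clear-session";

// Only a refresh the server actually rejected ends the session; an offline
// moment keeps the existing tokens for a later retry.
function sessionActionFor(outcome: RefreshOutcome): SessionAction {
  switch (outcome.kind) {
    case "refreshed":
      return "store-new-tokens";
    case "network-failure":
      return "keep-tokens";
    case "server-rejected":
      return "clear-session";
  }
}
```
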
The AudioController is rebuilt as a thin DOM shell around a pure policy
module (transition(snapshot, event) -> { snapshot, effects }, 199 unit
tests). One status drives all UI; the four overlapping recovery mechanisms
(3s watchdog, 10s stalled-grace, code-2 retry loop, AbortError reload) are
replaced by a single deadline-bounded ladder that distinguishes buffering
from stalling, parks instead of auto-playing while backgrounded, and resets
its attempt budget only after sustained progress. The transport is never
disabled: players accept taps in every state, and a wedged spinner is
structurally impossible (FE2). Native stalled events are ignored entirely
(1094/1094 were noise in the production trace). Truncated deliveries are
recovered at-position instead of advancing the queue (FE10). Lock-screen
pause now routes as a user pause (FE5); terminal network errors surface
uniformly with a working retry (FE7); ended->next keeps the synchronous
event-tail play for the iOS autoplay grant (FE13); mute uses audio.muted
(FE14); the dead prefetch hint and needs-resume plumbing are gone
(FE15/FE18). AudioContext bridge preserved verbatim.
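
To illustrate the transition(snapshot, event) -> { snapshot, effects } shape,
here is a deliberately tiny sketch. The statuses, events, effects and the
attempt budget of 3 are invented for the example and are far smaller than the
real policy module.

```typescript
type PlayerStatus = "loading" | "playing" | "paused" | "recovering" | "error";
interface PlayerSnapshot { status: PlayerStatus; attempts: number }
type PlayerEvent =
  | { type: "play-tapped" }
  | { type: "pause-tapped" }
  | { type: "native-stalled" }
  | { type: "deadline-exceeded" }
  | { type: "sustained-progress" };
type PlayerEffect =
  | { type: "call-play" }
  | { type: "call-pause" }
  | { type: "reload-at-position" }
  | { type: "surface-error" };

const MAX_ATTEMPTS = 3; // assumed budget

// Pure: no DOM access, no timers. The shell applies the returned effects.
function transition(
  s: PlayerSnapshot,
  e: PlayerEvent,
): { snapshot: PlayerSnapshot; effects: PlayerEffect[] } {
  switch (e.type) {
    case "play-tapped": // taps are accepted in every state
      return { snapshot: { ...s, status: "loading" }, effects: [{ type: "call-play" }] };
    case "pause-tapped":
      return { snapshot: { ...s, status: "paused" }, effects: [{ type: "call-pause" }] };
    case "native-stalled": // treated as noise
      return { snapshot: s, effects: [] };
    case "sustained-progress": // the only place the budget resets
      return { snapshot: { status: "playing", attempts: 0 }, effects: [] };
    case "deadline-exceeded": {
      const attempts = s.attempts + 1;
      return attempts >= MAX_ATTEMPTS
        ? { snapshot: { status: "error", attempts }, effects: [{ type: "surface-error" }] }
        : { snapshot: { status: "recovering", attempts }, effects: [{ type: "reload-at-position" }] };
    }
  }
}
```

Because the function is pure, each rule can be unit-tested without an audio
element.
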
Audiobooks now play through a BookSession with a required, verified track
map: every surface (play, chapter tap, seek, ended-advance, restore) does
its book-time math through one tested translator, and a book can never be
marked finished without an affirmed last file -- killing the multi-file
progress-wipe (FE1). Chapter taps start the book at the chapter via
playAudiobookAt riding load(seekTo) instead of racing React state (FE6).
The controller gains a generation-checked load(seekTo) + isTransitioning
shim (D5): start offsets ride their own load and can never leak across
media switches (FE8), and progress saves are suppressed during transitions
(FE9). Errors on audiobooks/podcasts save progress before teardown (FE12).
Same-src loads in flight are no longer restarted (FE16). The unsafe
kima_was_playing foreground auto-resume is removed (FE5 partial; intent
routing completes in Phase D). Adds vitest with a 33-test BookSession
suite (test:unit).
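
A sketch of the book-time translation a track map makes possible. The types
and function names are hypothetical and the real BookSession API is not shown
in this message; the point is that book time maps to (file, offset) in one
place.

```typescript
interface BookTrack { durationSec: number }
interface FilePosition { trackIndex: number; offsetSec: number }

// Book time -> file and offset within that file. Times past the end clamp
// to the end of the last file.
function toFilePosition(tracks: readonly BookTrack[], bookTimeSec: number): FilePosition {
  if (tracks.length === 0) throw new Error("a track map is required");
  let remaining = Math.max(0, bookTimeSec);
  for (let i = 0; i < tracks.length - 1; i++) {
    const duration = tracks[i].durationSec;
    if (remaining < duration) return { trackIndex: i, offsetSec: remaining };
    remaining -= duration;
  }
  const last = tracks.length - 1;
  return { trackIndex: last, offsetSec: Math.min(remaining, tracks[last].durationSec) };
}

// File and offset -> book time: the sum of all earlier files plus the offset.
function toBookTime(tracks: readonly BookTrack[], pos: FilePosition): number {
  let total = 0;
  for (let i = 0; i < pos.trackIndex; i++) total += tracks[i].durationSec;
  return total + pos.offsetSec;
}
```
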
Delete the per-user stream eviction that truncated actively-playing streams
(B1/B10); add server socket timeouts so dead peers cannot accumulate (B3);
run transcodes through the real queue with a 120s watchdog kill (B6); bound
the ABS proxy at 15s and cache track resolution for seeks (B2); replace the
1-year cache header with private/1h/must-revalidate plus conditional 304s
(B4); key the transcode cache on mtime equality + source size (B5); align
all range-serving surfaces on 416-or-ignore semantics per RFC 9110 (B8/B11);
fix the podcast stream rate-limit exemption (B7); release the play-log claim
on failed inserts (B12); cache audiobook track maps at sync time and expose
tracks/trackCount on list+series endpoints with an explicit tracksUnavailable
signal (FE1 backend half); fix the play-adjacent writer that left numTracks
NULL. Drop the never-read musicPath from AudioStreamingService.
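
One way to read "416-or-ignore" for a single byte range is sketched below.
The function and type names are hypothetical, and multi-range requests are
simply ignored (served in full), which is a simplification chosen for this
sketch.

```typescript
type RangeDecision =
  | { kind: "full" } // ignore the header, serve the whole resource
  | { kind: "partial"; start: number; end: number } // 206
  | { kind: "unsatisfiable" }; // 416 with Content-Range: bytes */size

function decideRange(header: string | undefined, size: number): RangeDecision {
  if (!header) return { kind: "full" };
  const m = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  // Malformed, multi-range, or "bytes=-": ignore rather than fail.
  if (!m || (m[1] === "" && m[2] === "")) return { kind: "full" };

  if (m[1] === "") {
    // Suffix range: the last N bytes.
    const suffix = Number(m[2]);
    if (suffix === 0 || size === 0) return { kind: "unsatisfiable" };
    return { kind: "partial", start: Math.max(0, size - suffix), end: size - 1 };
  }

  const start = Number(m[1]);
  // A last position before the first is an invalid spec: ignore it.
  if (m[2] !== "" && Number(m[2]) < start) return { kind: "full" };
  // A first position at or past the end cannot be satisfied.
  if (start >= size) return { kind: "unsatisfiable" };
  const end = m[2] === "" ? size - 1 : Math.min(Number(m[2]), size - 1);
  return { kind: "partial", start, end };
}
```
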
The backend preview can take ~18s worst case on the Deezer path (Deezer fetch +
iTunes resolve + the 8s RSS bound). A 20s client abort left only a 2s margin, so
a slow but answerable feed could be aborted and shown as a false timeout. 25s
keeps a clear gap over the server while still bounding the #168 hang.
- A seek past a file whose stored size is wrong made Audiobookshelf return 416,
which axios surfaced as a 500. The service now lets 416 through and the route
sends a clean 416 (Content-Range forwarded, upstream stream destroyed) instead
of piping the upstream error body into the audio element.
- Sync now skips items with more than 1000 audio files: those are mis-cataloged
libraries imported as one book (the source of multi-thousand-hour, tens-of-GB
records that broke seeking). The cutoff is track count, not duration --
legitimate omnibus editions run 50-65h.
Two follow-ups from review of the critical-path trim:
- A synchronous in-process claim gates the now-background play-logging so two
concurrent stream requests for the same track can't both insert a Play row
inside the 30s window (the fire-and-forget change had widened that race).
- The no-settings-row quality fallback is now "original", matching the schema
default, instead of "medium" -- a user without a settings row no longer gets a
pointless first-play transcode.
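
The in-process claim from the first item can be sketched as a synchronous
check-and-set on a Map. The class and method names are hypothetical; the 30s
window is the one mentioned above.

```typescript
class PlayClaims {
  private readonly claimedAt = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number = 30000) {
    this.windowMs = windowMs;
  }

  // Synchronous on purpose: there is no await between the check and the set,
  // so two concurrent requests in one process cannot both win.
  tryClaim(userId: string, trackId: string, nowMs: number): boolean {
    const key = `${userId}:${trackId}`;
    const previous = this.claimedAt.get(key);
    if (previous !== undefined && nowMs - previous < this.windowMs) return false;
    this.claimedAt.set(key, nowMs);
    return true;
  }

  // Give the claim back if the background insert fails, so a later request
  // can still log the play.
  release(userId: string, trackId: string): void {
    this.claimedAt.delete(`${userId}:${trackId}`);
  }
}
```

This sketch never prunes old entries; a long-running process would need some
eviction on top of it.
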
Measured from real device traces: fresh track start was ~2.3s vs ~25ms to
resume an already-loaded track. Part of that was the stream route doing
sequential DB work before the first byte -- a recent-play lookup, a play insert,
and a settings read, all awaited up front.
Fetch the track row and the quality setting in parallel (one round-trip of
latency, not two), and fire the play-history logging in the background instead
of awaiting it.
Neither needs to gate playback. The bulk of the remaining latency is client-side
buffering of multi-hour audiobook files seeking to a saved offset, tracked
separately.
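
The critical-path change can be sketched as follows. The dependency names are
hypothetical stand-ins for the real DB calls.

```typescript
interface StreamDeps<TTrack, TQuality> {
  loadTrack: () => Promise<TTrack>;
  loadQuality: () => Promise<TQuality>;
  logPlay: () => Promise<void>;
  onLogError: (err: unknown) => void;
}

async function prepareStream<TTrack, TQuality>(
  deps: StreamDeps<TTrack, TQuality>,
): Promise<{ track: TTrack; quality: TQuality }> {
  // Both reads are issued together; the wait is the slower of the two
  // rather than their sum.
  const [track, quality] = await Promise.all([deps.loadTrack(), deps.loadQuality()]);

  // Fire-and-forget: play logging must not delay the first byte. The catch
  // keeps a failed insert from becoming an unhandled rejection.
  void deps.logPlay().catch(deps.onLogError);

  return { track, quality };
}
```
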
The device trace proved it unsafe: navigator.mediaDevices "devicechange" never
fires on the user's iPhone (0 events across the whole capture), so the
route-change guard that was supposed to stop an earbud unplug from auto-resuming
to the speaker is permanently inert -- sinceRouteMs is always ~Date.now(). That
is the v1.7.12 regression with no working brake, and the user reproduced audio
restarting after pulling earbuds.
The AudioContext statechange handler goes back to re-claiming the playback
session category only (safe), never calling audio.play(). Removed the now-dead
intendsToPlay flag, route-change tracking, and devicechange listener. The trace
auto-upload auth fix and the audiobook position-save fix are unaffected and stay.
Progress was saved every 30s off the "timeupdate" event, but iOS throttles and
suspends that event when the PWA is backgrounded (screen off) -- the normal way
people listen to audiobooks. So a long screen-off session was never
checkpointed, and an app update (or crash) reverted to the moment the screen was
locked. The saved data was never lost; it just stopped advancing in the
background.
Replace the timeupdate-driven save with a 15s wall-clock setInterval that runs
while playing (started on "play", stopped on "pause"/"ended"), independent of
the media event iOS throttles. saveAudiobookProgress already de-dupes an
unchanged position and the tick is gated on isPlaying(), so paused/stalled ticks
are no-ops. Applies to podcasts too.
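
A sketch of the wall-clock checkpoint loop. The class is hypothetical and
takes its timer functions as a parameter so it can be tested without real
time; in the app they would be the global setInterval/clearInterval.

```typescript
interface IntervalTimers {
  setInterval(fn: () => void, ms: number): unknown;
  clearInterval(handle: unknown): void;
}

class ProgressCheckpointer {
  private handle: unknown = null;
  private readonly save: () => void;
  private readonly isPlaying: () => boolean;
  private readonly timers: IntervalTimers;
  private readonly periodMs: number;

  constructor(save: () => void, isPlaying: () => boolean, timers: IntervalTimers, periodMs: number = 15000) {
    this.save = save;
    this.isPlaying = isPlaying;
    this.timers = timers;
    this.periodMs = periodMs;
  }

  // Call from the "play" handler. Idempotent: never stacks two intervals.
  start(): void {
    if (this.handle !== null) return;
    this.handle = this.timers.setInterval(() => {
      if (this.isPlaying()) this.save(); // paused/stalled ticks are no-ops
    }, this.periodMs);
  }

  // Call from the "pause" and "ended" handlers.
  stop(): void {
    if (this.handle === null) return;
    this.timers.clearInterval(this.handle);
    this.handle = null;
  }
}
```
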
Playback that an iOS interruption (call/notification) pauses now resumes when
the interruption ends, matching the behaviour of other apps.
- Track play intent separately from audio.paused: set on play/tryResume/
swapAndPlay, cleared only by explicit pause/stop/cleanup. The native "pause"
event an interruption fires does NOT clear it.
- The AudioContext statechange listener resumes on an interrupted -> running
transition when intent is set and the element is paused. Gated hard: only that
transition (not the initial bridge resume or a background suspend), never
within 1.5s of an audio-route change (the v1.7.12 unplug-to-speaker
regression), and never while a stall reload owns the resume.
- Repair the trace auto-upload: it POSTed to a requireAuth route without the
Bearer token and swallowed the 401, so no iOS trace was ever captured. It now
sends the token, so device testing finally yields real event data.
Reviewed by Opus/Sonnet passes. Known limits to confirm on-device: the resume
only fires if WebKit returns the context to "running" (a context stuck
"interrupted" -- the force-quit symptom -- is not addressed here).
The preview hook only stops spinning when the request resolves or rejects. The
RSS parse had a 30s timeout and the client had none, so a slow/dead feed left
the spinner up 30s+ with no error -- the "infinite loading" in #168 (the v1.7.13
fix only handled the error path, not the hang).
- Frontend: previewPodcast aborts after 20s, surfacing the existing error UI.
- Backend: the two preview RSS parses are bounded to 8s (non-critical, already
falls through to partial data), so a slow feed returns the podcast quickly.
Adds a soulseekMode (p2p|slskd) setting to route Soulseek through an external slskd REST instance, so slskd mode needs no Kima-side Soulseek credentials. Includes the review fixes: https transport, reconnect on backend change, slskdUrl validation, mode-aware connection test, queue position, bounded size cache. Closes #164. By gossip31.
Complements #204 (gossip31's pre-decode ffmpeg gate). The pre-decode gate
catches corrupt files that SIGSEGV the decoder, but a worker that dies on any
other native fault (e.g. an Essentia analysis crash after a clean decode) still
left the track in 'processing' and got re-queued by the stale-cleanup sweep
WITHOUT incrementing analysisRetryCount -- so it could loop forever and never
reach the mark-failed/quarantine path.
_cleanup_stale_processing now increments analysisRetryCount when it resets a
crashed track, and marks tracks that have passed MAX_RETRIES as 'failed' (with a
reason) so they quarantine and surface in the permanently-failed accounting
instead of sitting in 'processing' limbo. Defense in depth behind the gate.
Adds an ffmpeg integrity probe before MonoLoader so corrupt files that SIGSEGV Essentia become a normal load failure (and flow into the existing retry/quarantine) instead of crash-looping the worker. By gossip31.
The model-download layer failed three recent builds (06-04 x2, 06-05) with curl
exit 28: --max-time caps the whole operation including retries, so a slow
GitHub-runner transfer trips it, and --retry does not retry a timeout. Switched
all 12 downloads to --retry-all-errors (retries timeouts/transient HTTP),
stall-based abort (--speed-limit 1024 --speed-time 60) instead of a hard total
cap, 5 retries, and -f so a bad HTTP response fails fast instead of saving a
corrupt model. The transformers==5.8.1 pin is unaffected and confirmed building.
The reporter's redis INFO shows a healthy instance (33MB used, no maxmemory
limit, noeviction, zero evictions/rejected connections), ruling out the
memory-pressure hypothesis. The connection-readiness race the fix addresses is
the actual cause, so the hedge is removed.
The first #197 fix only hardened the pub/sub subscriber; a 3-model review panel
found it incomplete. This closes the rest:
- publish() now runs on a dedicated soft-options connection (enableOfflineQueue,
infinite retries) instead of the strict shared client -- that strict publish
was still throwing the same "Stream isn't writeable" error under load.
- subscriber lifecycle: terminal "end" drops the cache, a failed psubscribe
disconnects the half-open socket instead of leaking it; transient drops
self-heal via auto-reconnect.
- both subscribe and publish are time-bounded so an unreachable Redis fails the
request instead of hanging indefinitely.
- analyzer failures ({success:false, embedding:null}, no error field) are now
rejected cleanly instead of passing null into the pgvector cast (500).
- the analyzer publishes a failure response on internal exceptions so the caller
fails fast instead of waiting out the full 15s timeout.
Reviewed by Opus/Sonnet/Haiku panels twice (original confirmed INCOMPLETE,
rewrite SHIP-WITH-CHANGES); surviving findings applied, two rejected with reason
(no publisher churn on transient error; keep setMaxListeners(0) to not re-trigger
the warning flood).
The reporter's 200k-track failure may also involve Redis memory pressure or
Python-analyzer saturation, which this makes tolerable but does not itself
resolve -- pending their redis INFO.
The scheduled nightly build off main failed (2026-06-04, and 2026-06-02 the
same way): a transformers release newer than 5.8.1 references
torch.float8_e8m0fnu
(a dtype added in torch 2.7) at import time, so `from transformers import
BertModel` crashes against the pinned torch==2.5.1 and the Dockerfile
fail-fast check exits 1. The unpinned `transformers>=4.30.0` let pip resolve
to that bad release. Recent branch builds only passed because BuildKit reused
a cached pip layer from before that release was published.
Pinned to 5.8.1 -- the exact version running in prod against torch 2.5.1+cpu.
Bump only alongside a torch bump.
On the clean 439fa68 bridge baseline (band-aids reverted), add the two
high-confidence stability fixes the resume bug actually needs:
- setAudioSessionPlayback gains a `force` arg; play() now re-claims the
iOS "playback" session category on every explicit resume, not just the
first. The one-time latch was why iOS, after an earbud/Control-Center
interruption, left the session with whatever app grabbed it (a
sleep-sounds app started playing through it).
- A statechange listener on the bridge AudioContext re-claims the session
when the OS ends an interruption and the context returns to running. It
never calls play() -- auto-resume on a route change is the v1.7.12
earbud-unplug-to-speaker regression.
Reviewed by two independent passes; their findings fixed here: play() now
actually passes force=true (the reclaim was a no-op without it); the
statechange listener + AudioContext are torn down in destroy() (no leak);
em-dash normalized.
Deliberately NOT re-adding the silent-playback watchdog (part of the
reverted band-aid stack) -- the debug instrumentation will show whether an
interrupted-context resume is still silent, and any further recovery will
be a minimal targeted fix on evidence, not another speculative layer.
Reverts the daf6210 -> 7be3322 -> 1a9f6f4 cascade that piled onto the
bridge. Root regression was daf6210: it awaited setupAudioContextBridge
and bailed play()/tryResume with needs-resume whenever the context was
not "running" -- which forfeited the iOS user-gesture token AND returned
before audio.play() ever ran. So earbud/lock-screen resume went silent
or dead-ended on a Tap-to-resume prompt the lock screen cannot show, and
iOS eventually handed the audio session to another app. 7be3322 and
1a9f6f4 were band-aids on that regression.
Keeps 439fa68 (the bridge) so backgrounded/screen-off playback still
survives, and keeps the debug ring-buffer instrumentation. play() and
tryResume return to the baseline: fire the context resume in parallel,
always attempt audio.play(), preserve the gesture.
Temporary diagnostic for the earbud-resume bug: the installed iOS PWA has no URL
bar to set ?ios_debug=1 or reach /debug/ios-log, so capture is enabled
unconditionally on iOS standalone and the buffer auto-POSTs (debounced 3s) to
/api/debug/ios-log after each event burst. Revert once the resume bug is fixed.
ensureSubscriber duplicated the parent Redis client, inheriting
enableOfflineQueue:false + maxRetriesPerRequest:0, so psubscribe threw 'Stream
isn't writeable' when the subscriber socket wasn't connected yet -- and the
rejected promise was cached, breaking vibe text search permanently until restart
(worsens with library size). The subscriber now gets its own offline queue +
retries, resets the cached promise on rejection, and drops it on 'end' so the
next request reconnects.
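
The cached-promise part of the fix can be sketched independently of Redis.
The helper is hypothetical; `connect` stands in for creating the subscriber
and running psubscribe.

```typescript
function memoizeConnection<T>(connect: () => Promise<T>): {
  get(): Promise<T>;
  drop(): void;
} {
  let cached: Promise<T> | null = null;
  return {
    get(): Promise<T> {
      if (cached) return cached;
      const attempt = connect();
      cached = attempt;
      // A rejected attempt must not stay cached, otherwise every later
      // caller receives the same failure until the process restarts.
      attempt.catch(() => {
        if (cached === attempt) cached = null;
      });
      return attempt;
    },
    // Call from the connection's 'end' handler so the next get() reconnects.
    drop(): void {
      cached = null;
    },
  };
}
```
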
The smoke spec asserted the play/pause button state immediately after Play all,
racing the first audio load on a cold container (player stayed 'Not Playing').
Poll audio currentTime > 0 first. Surfaced while running the suite pre-v1.7.16.
The MediaSession 'play' action called controller.play(), which awaits the
AudioContext bridge BEFORE audio.play(). That await forfeits the iOS
user-activation token from the earbud click, so an interrupted/suspended
AudioContext never resumes -- and play() then returns (not throws) on
ctx-not-running, so the handler's reloadAndPlay() fallback never fired. Result:
earbud resume produced no audio, no native 'playing' event, no playbackState
update, and after repeated no-audio play actions iOS reassigned the audio
session to the next app.
Adds resumeFromGesture(): fires the context resume without awaiting it, calls
audio.play() synchronously in the gesture tail (mirrors swapAndPlay), and on any
rejection reloads the source to re-grab the hardware session instead of a silent
needs-resume. Wired only into the explicit MediaSession 'play' action, so it
cannot auto-resume on an ambiguous pause/route-change (the v1.7.12 earbud-unplug
-> speaker regression stays fixed). play()/tryResume()/pause/silent-watchdog
untouched. Diagnosed via 4-lens + adversary review (SHIP-AS-IS).
Requires on-device confirmation (?ios_debug=1); cannot be unit-verified.
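
The ordering that matters can be sketched with structural stand-ins for the
audio element and the AudioContext. This is not the actual
resumeFromGesture() body; the parameter list and the reloadAndPlay callback
shape are assumptions.

```typescript
interface AudioLike {
  play(): Promise<void>;
}
interface ContextLike {
  state: string; // iOS also reports a non-standard "interrupted" state
  resume(): Promise<void>;
}

function resumeFromGestureSketch(
  audio: AudioLike,
  ctx: ContextLike | null,
  reloadAndPlay: () => void,
): void {
  if (ctx && ctx.state !== "running") {
    // Fire the context resume but do NOT await it: an await here would end
    // the synchronous gesture tail before audio.play() is reached.
    void ctx.resume().catch(() => undefined);
  }
  // Still in the same synchronous call stack as the MediaSession action.
  audio.play().catch(() => {
    // Any rejection: reload the source rather than park in a silent state.
    reloadAndPlay();
  });
}
```
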
The one-time swipe hint crowded the bottom edge above the mini-player and read
as clutter; the swipe behavior is intentional and discoverable enough without it.
Removes the hint state, markup, the markHintSeen calls (swipe behavior unchanged),
the now-unused useCallback import, and the hint-in keyframe.
The desktop sidebar was rebuilt as UnifiedPanel but the externally-registered
settings content (discover settings gear, lyrics) was never ported -- clicking
the discover gear opened the panel to the activity feed instead of the settings,
because UnifiedPanel never read settingsContent or handled set-activity-panel-tab.
It now listens for that event, renders the registered settingsContent (which
carries its own header + back button), and resets to the feed on collapse. Fixes
discover settings and lyrics on desktop. Pre-existing since the sidebar rewrite.
The scrollMargin useLayoutEffect only ran on [rows.length], so the offset went
stale when layout above the list reflowed (e.g. the responsive hero crossing the
md breakpoint, ~52px). Masked today by the 12-row overscan, but wrong and fragile
if overscan is tuned. Added a ResizeObserver re-measure. (ultrareview bug_003)