CoreAIPipelinedEngine producer remains active after EOS, causing the next response to crash in drain() #41

@timokoethe

Model name

gemma3

Command run

// Minimal reproduction:
let model = try await CoreAILanguageModel(resourcesAt: modelUrl!)
let session = LanguageModelSession(model: model)

// The first call completes normally:
let response = try await session.respond(to: "What is the capital city of America?")
print(response)

// The second call on the same session crashes:
let response2 = try await session.respond(to: "What is the capital city of Canada?")
print(response2)

macOS / iOS target

macOS 27.0 beta 1

Xcode version

Xcode 27.0 beta 1

Python / uv version

uv 0.11.19, python 3.14.5

Full error output

== RUNNABLE ERROR:

    CrashReportError: Fatal Error in CoreAIPipelinedEngine.swift
    
    Application crashed due to fatalError in CoreAIPipelinedEngine.swift at line 151.
    
    Engine not returned after drain() — tokenSequence Task stuck?
    
    Process:             CoreAIChat [38979]
    Path:                <none>
    
    Date/Time:           2026-06-14 17:20:32 +0000

Anything else?

Steps to reproduce:

  1. In a SwiftUI app or Xcode project, load a dynamically shaped Core AI language model (e.g. Gemma-3-4b-it-4bit-dynamic).
  2. Create one LanguageModelSession with the model.
  3. Call session.respond() and wait for the first response to complete.
  4. Immediately call session.respond() again on the same session.
  5. The first response completes normally after emitting EOS.
  6. The second call waits in CoreAIPipelinedEngine.reset() / drain().
  7. After approximately five seconds, the process crashes with: Fatal error: Engine not returned after drain() — tokenSequence Task stuck?

Why does this fail?

I looked into the repository and found the following:
CoreAIPipelinedEngine.generate() starts an independent producer task that keeps generating tokens up to maxTokens while holding exclusive ownership of the engine.
When respondVanilla() detects EOS, it records .eos and stops consuming the stream. However, it neither cancels nor awaits the producer task, so the producer keeps running in the background while engineInUse remains true.

The second respond() call invokes reset(), which waits in drain() for the previous producer to release the engine. If the producer does not finish within approximately five seconds, drain() terminates the process with fatalError.
The single Task.yield() at the end of respondVanilla() only gives other tasks a chance to run; it does not guarantee that the producer has completed or released the engine.
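To make the lifecycle described above concrete, here is a small self-contained model of it. This is only a sketch under assumptions: ToyEngine, startProducer, respondToy, and the "<eos>" marker are hypothetical names made up for illustration, and the timings are arbitrary. It does not reproduce the real CoreAIPipelinedEngine code, only the shape of the problem: a producer that runs to maxTokens regardless of the consumer, and a drain() that gives up after a deadline.

```swift
import Foundation

// Toy stand-in for the engine's ownership flag. Hypothetical; not the real
// CoreAIPipelinedEngine.
actor ToyEngine {
    private(set) var inUse = false

    func acquire() { inUse = true }
    func release() { inUse = false }

    // Polls until the engine is released or `timeout` seconds pass.
    // Returns false on timeout, where the reported code calls fatalError.
    func drain(timeout: TimeInterval) async -> Bool {
        let deadline = Date().addingTimeInterval(timeout)
        while inUse {
            if Date() >= deadline { return false }
            try? await Task.sleep(nanoseconds: 5_000_000)
        }
        return true
    }
}

// Starts an unstructured producer that always runs to maxTokens,
// whether or not anyone is still reading the stream.
func startProducer(engine: ToyEngine, maxTokens: Int, eosAt: Int) async -> AsyncStream<String> {
    let (stream, continuation) = AsyncStream.makeStream(of: String.self)
    await engine.acquire()
    Task {
        for i in 0..<maxTokens {
            try? await Task.sleep(nanoseconds: 2_000_000) // simulated decode step
            continuation.yield(i == eosAt ? "<eos>" : "tok\(i)")
        }
        continuation.finish()
        await engine.release()
    }
    return stream
}

// Consumer that stops at EOS, yields once, and reports whether the
// engine is still owned by the producer at that point.
func respondToy(engine: ToyEngine, maxTokens: Int, eosAt: Int) async -> Bool {
    let stream = await startProducer(engine: engine, maxTokens: maxTokens, eosAt: eosAt)
    for await token in stream {
        if token == "<eos>" { break }
    }
    await Task.yield()
    return await engine.inUse
}
```

In this model, calling respondToy with an early eosAt and a large maxTokens returns true, because the producer is still working through its remaining iterations. A following drain(timeout:) with a timeout shorter than the remaining work returns false, which corresponds to the point where the real drain() hits fatalError.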

Intended Behavior

I am not 100% sure whether consumers are intentionally expected to drain the remaining token stream before reusing the engine, or whether early termination should automatically cancel and await the producer task.

If applications are expected to handle this themselves, that lifecycle requirement should probably be documented. However, draining the stream would still let the engine generate and discard all remaining tokens up to maxTokens, wasting GPU time and the resources that go with it.
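For comparison, the second option (cancel and await the producer on early termination) could look roughly like the sketch below. Everything here is an assumption for illustration: Generation, startGeneration, respondCancelling, and the "<eos>" marker are hypothetical and not part of the Core AI API. The point is only that Task.cancel() followed by awaiting the task's value gives a definite moment at which the producer has stopped, without generating the remaining tokens.

```swift
import Foundation

// Hypothetical handle pairing a token stream with the task that feeds it.
struct Generation {
    let tokens: AsyncStream<String>
    let producer: Task<Int, Never> // resolves to the number of tokens produced
}

func startGeneration(maxTokens: Int, eosAt: Int) -> Generation {
    let (stream, continuation) = AsyncStream.makeStream(of: String.self)
    let producer = Task<Int, Never> {
        var produced = 0
        for i in 0..<maxTokens {
            if Task.isCancelled { break }
            do {
                // Simulated decode step; throws as soon as the task is cancelled.
                try await Task.sleep(nanoseconds: 2_000_000)
            } catch {
                break
            }
            continuation.yield(i == eosAt ? "<eos>" : "tok\(i)")
            produced += 1
        }
        continuation.finish()
        return produced
    }
    return Generation(tokens: stream, producer: producer)
}

// Stops reading at EOS, then cancels the producer and waits for it to exit
// before returning, so the caller knows the producer is no longer running.
func respondCancelling(maxTokens: Int, eosAt: Int) async -> (text: [String], produced: Int) {
    let generation = startGeneration(maxTokens: maxTokens, eosAt: eosAt)
    var text: [String] = []
    for await token in generation.tokens {
        if token == "<eos>" { break }
        text.append(token)
    }
    generation.producer.cancel()
    let produced = await generation.producer.value
    return (text, produced)
}
```

With this shape, a later reset() would not need to wait on a timeout, because the previous response only returns once its producer has exited. Whether the real engine can be interrupted between decode steps this cheaply is something the maintainers would have to confirm.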

Thanks in advance👍

Labels

bug (Something isn't working)
