Release v0.13.1 - Voice mode, Super speedy streaming, and a lot more (#255)

## Thanks for contributions

- PR [#249](https://github.com/NeuralNomadsAI/CodeNomad/pull/249) "feat(speech): add prompt voice input" by [@shantur](https://github.com/shantur)
- PR [#243](https://github.com/NeuralNomadsAI/CodeNomad/pull/243) "feat(i18n): Hebrew locale + full RTL support" by [@MusiCode1](https://github.com/MusiCode1)
- PR [#241](https://github.com/NeuralNomadsAI/CodeNomad/pull/241) "feat(lazy loading): Implement virtual list with virtua" by [@pixellos](https://github.com/pixellos)
- PR [#240](https://github.com/NeuralNomadsAI/CodeNomad/pull/240) "fix(tauri): force Windows process tree shutdown" by [@pascalandr](https://github.com/pascalandr)
- PR [#239](https://github.com/NeuralNomadsAI/CodeNomad/pull/239) "perf(ui): split right panel and secondary viewer chunks" by [@pascalandr](https://github.com/pascalandr)
- PR [#238](https://github.com/NeuralNomadsAI/CodeNomad/pull/238) "perf(ui): defer locale and overlay bundles" by [@pascalandr](https://github.com/pascalandr)
- PR [#236](https://github.com/NeuralNomadsAI/CodeNomad/pull/236) "Suppress OS notifications for subagent (child) sessions" by `@app/codenomadbot`
- PR [#235](https://github.com/NeuralNomadsAI/CodeNomad/pull/235) "fix(ui): unwrap pasted placeholders in slash commands" by `@app/codenomadbot`
- PR [#232](https://github.com/NeuralNomadsAI/CodeNomad/pull/232) "fix(tauri): stop CLI process group on exit" by `@app/codenomadbot`
- PR [#229](https://github.com/NeuralNomadsAI/CodeNomad/pull/229) "feat(ui): add RTL support for Hebrew/Arabic text" by [@MusiCode1](https://github.com/MusiCode1)
- PR [#227](https://github.com/NeuralNomadsAI/CodeNomad/pull/227) "fix(tauri): improve Windows desktop runtime behavior" by [@pascalandr](https://github.com/pascalandr)
- PR [#226](https://github.com/NeuralNomadsAI/CodeNomad/pull/226) "fix(tauri): restore desktop menu controls and fullscreen shortcut" by [@pascalandr](https://github.com/pascalandr)
- PR [#225](https://github.com/NeuralNomadsAI/CodeNomad/pull/225) "fix(tauri): restore external links in the folder picker" by [@pascalandr](https://github.com/pascalandr)
- PR [#224](https://github.com/NeuralNomadsAI/CodeNomad/pull/224) "fix(tauri): sync server UI bundle during prebuild" by [@pascalandr](https://github.com/pascalandr)
- PR [#215](https://github.com/NeuralNomadsAI/CodeNomad/pull/215) "perf(ui): lazy-load markdown and defer diff rendering" by [@pascalandr](https://github.com/pascalandr)

## Highlights

- **Voice-first conversations**: Start prompts with voice input, configure speech behavior from settings, and listen back to assistant responses with message playback and conversation playback controls.
- **A complete Hebrew + RTL experience**: CodeNomad now ships with a full Hebrew locale and much broader right-to-left support, making the app feel natural for Hebrew users while improving Arabic text rendering too.
- **A much faster experience in long chats**: The new virtualized message list, deferred markdown and diff rendering, and more selective loading for heavy UI surfaces make large sessions feel noticeably smoother.

## What's Improved

- **More flexible speech controls**: Speech settings and playback modes now adapt better to different browsers and platform capabilities.
- **Cleaner prompt workflow**: The prompt includes a quick clear action, a simpler recording indicator, and a more polished mic control layout.
- **Faster startup and lighter heavy views**: Locale bundles, overlays, right-panel viewers, picker flows, markdown, and diff surfaces all load more lazily to reduce upfront UI work.
- **Less notification spam**: Subagent sessions no longer fire OS notifications, so important interruptions are easier to notice.
- **Better RTL behavior across the whole interface**: Session names, tool outputs, markdown blocks, file views, selectors, and layout controls behave more consistently in right-to-left contexts.

## Fixes

- **More reliable Windows desktop behavior**: Process cleanup is stronger during app shutdown, background CLI process trees are terminated more reliably, desktop identity/metadata is aligned more cleanly, and stray console windows are hidden during startup and exit.
- **Cleaner shutdown on macOS and Linux**: Desktop quit/close now stops the spawned CLI process group more reliably, reducing leftover background processes after exit.
- **Restored desktop actions**: External links in the folder picker work again, and the desktop View/Window controls plus the fullscreen shortcut are back.
- **More stable streaming and scrolling**: Reasoning streams stay pinned more consistently, follow behavior is less jumpy, spacing is cleaner in virtualized conversations, and session switching retains position more smoothly.
- **Safer slash command pasting**: Pasted placeholders are resolved correctly before slash commands run, so long pasted inputs behave like normal prompts.
- **More dependable desktop packaging**: Tauri prebuild now refreshes the server UI bundle correctly, which avoids packaged desktop builds picking up stale UI assets.
- **Clearer speech compatibility handling**: Streaming playback limitations are surfaced more cleanly instead of failing in a confusing way.

### Contributors

- [@pascalandr](https://github.com/pascalandr)
- [@MusiCode1](https://github.com/MusiCode1)
- [@pixellos](https://github.com/pixellos)
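
The speech notes above say that playback modes adapt to browser and platform capabilities. As a rough illustration only, a client could pick a mode from the fields returned by `GET /api/speech/capabilities`. This is a sketch, not the UI code shipped in this release: `choosePlaybackMode`, `PlaybackMode`, and the `canStreamInBrowser` flag are hypothetical names, and the interface below copies only the `SpeechCapabilitiesResponse` fields the sketch needs.

```typescript
// Subset of the SpeechCapabilitiesResponse fields added to api-types.ts in this commit.
interface SpeechCapabilities {
  available: boolean
  configured: boolean
  supportsTts: boolean
  supportsStreamingTts: boolean
  ttsFormats: string[]
  streamingTtsFormats: string[]
}

// Hypothetical: the playback modes a client might distinguish.
type PlaybackMode = "streaming" | "buffered" | "unavailable"

// Hypothetical helper. `canStreamInBrowser` stands in for whatever
// browser capability check the UI actually performs.
function choosePlaybackMode(
  caps: SpeechCapabilities,
  format: string,
  canStreamInBrowser: boolean,
): PlaybackMode {
  if (!caps.available || !caps.configured || !caps.supportsTts) {
    return "unavailable"
  }
  const streamable = caps.streamingTtsFormats.some((f) => f === format)
  if (caps.supportsStreamingTts && canStreamInBrowser && streamable) {
    return "streaming"
  }
  // Fall back to a full download of the synthesized audio when streaming is not possible.
  return caps.ttsFormats.some((f) => f === format) ? "buffered" : "unavailable"
}

// Example values shaped like what the OpenAI-compatible provider reports when an API key is set.
const exampleCaps: SpeechCapabilities = {
  available: true,
  configured: true,
  supportsTts: true,
  supportsStreamingTts: true,
  ttsFormats: ["mp3", "wav", "opus", "aac"],
  streamingTtsFormats: ["mp3", "wav", "opus", "aac"],
}

const exampleMode = choosePlaybackMode(exampleCaps, "mp3", false)
```

With these example values, a browser that cannot stream falls back to buffered playback rather than failing, which is the kind of degradation the "Clearer speech compatibility handling" fix describes.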
2026-03-27 19:58:35 +00:00
parent 153065d025 1b4eff9419
commit 27bccb8d6b
151 changed files with 7706 additions and 3516 deletions
--- a/packages/server/src/api-types.ts
+++ b/packages/server/src/api-types.ts
@@ -207,6 +207,39 @@ export interface BinaryValidationResult {
  error?: string
 }

+export interface SpeechSegment {
+  startMs: number
+  endMs: number
+  text: string
+}
+
+export interface SpeechCapabilitiesResponse {
+  available: boolean
+  configured: boolean
+  provider: string
+  supportsStt: boolean
+  supportsTts: boolean
+  supportsStreamingTts: boolean
+  baseUrl?: string
+  sttModel: string
+  ttsModel: string
+  ttsVoice: string
+  ttsFormats: string[]
+  streamingTtsFormats: string[]
+}
+
+export interface SpeechTranscriptionResponse {
+  text: string
+  language?: string
+  durationMs?: number
+  segments?: SpeechSegment[]
+}
+
+export interface SpeechSynthesisResponse {
+  audioBase64: string
+  mimeType: string
+}
+
 export type WorkspaceEventType =
  | "workspace.created"
  | "workspace.started"
--- a/packages/server/src/index.ts
+++ b/packages/server/src/index.ts
@@ -23,6 +23,7 @@ import { AuthManager, BOOTSTRAP_TOKEN_STDOUT_PREFIX, DEFAULT_AUTH_COOKIE_NAME, D
 import { resolveHttpsOptions } from "./server/tls"
 import { resolveNetworkAddresses } from "./server/network-addresses"
 import { startDevReleaseMonitor } from "./releases/dev-release-monitor"
+import { SpeechService } from "./speech/service"

 const require = createRequire(import.meta.url)

@@ -313,6 +314,7 @@ async function main() {
  })
  const fileSystemBrowser = new FileSystemBrowser({ rootDir: options.rootDir, unrestricted: options.unrestrictedRoot })
  const instanceStore = new InstanceStore(configLocation.instancesDir)
+  const speechService = new SpeechService(settings, logger.child({ component: "speech" }))
  const instanceEventBridge = new InstanceEventBridge({
    workspaceManager,
    eventBus,
@@ -397,6 +399,7 @@ async function main() {
        eventBus,
        serverMeta,
        instanceStore,
+        speechService,
        authManager,
        uiStaticDir: uiResolution.uiStaticDir ?? DEFAULT_UI_STATIC_DIR,
        uiDevServerUrl: uiResolution.uiDevServerUrl,
@@ -417,6 +420,7 @@ async function main() {
        eventBus,
        serverMeta,
        instanceStore,
+        speechService,
        authManager,
        uiStaticDir: uiResolution.uiStaticDir ?? DEFAULT_UI_STATIC_DIR,
        uiDevServerUrl: undefined,
--- a/packages/server/src/server/http-server.ts
+++ b/packages/server/src/server/http-server.ts
@@ -21,12 +21,14 @@ import { registerStorageRoutes } from "./routes/storage"
 import { registerPluginRoutes } from "./routes/plugin"
 import { registerBackgroundProcessRoutes } from "./routes/background-processes"
 import { registerWorktreeRoutes } from "./routes/worktrees"
+import { registerSpeechRoutes } from "./routes/speech"
 import { ServerMeta } from "../api-types"
 import { InstanceStore } from "../storage/instance-store"
 import { BackgroundProcessManager } from "../background-processes/manager"
 import type { AuthManager } from "../auth/manager"
 import { registerAuthRoutes } from "./routes/auth"
 import { sendUnauthorized, wantsHtml } from "../auth/http-auth"
+import type { SpeechService } from "../speech/service"

 interface HttpServerDeps {
  bindHost: string
@@ -41,6 +43,7 @@ interface HttpServerDeps {
  eventBus: EventBus
  serverMeta: ServerMeta
  instanceStore: InstanceStore
+  speechService: SpeechService
  authManager: AuthManager
  uiStaticDir: string
  uiDevServerUrl?: string
@@ -252,6 +255,7 @@ export function createHttpServer(deps: HttpServerDeps) {
    eventBus: deps.eventBus,
    workspaceManager: deps.workspaceManager,
  })
+  registerSpeechRoutes(app, { speechService: deps.speechService })
  registerPluginRoutes(app, { workspaceManager: deps.workspaceManager, eventBus: deps.eventBus, logger: proxyLogger })
  registerBackgroundProcessRoutes(app, { backgroundProcessManager })
  registerInstanceProxyRoutes(app, { workspaceManager: deps.workspaceManager, logger: proxyLogger })
--- a/packages/server/src/server/routes/settings.ts
+++ b/packages/server/src/server/routes/settings.ts
@@ -3,6 +3,7 @@ import { z } from "zod"
 import { probeBinaryVersion } from "../../workspaces/runtime"
 import type { SettingsService } from "../../settings/service"
 import type { Logger } from "../../logger"
+import { sanitizeConfigDoc, sanitizeConfigOwner } from "../../settings/public-config"

 interface RouteDeps {
  settings: SettingsService
@@ -20,10 +21,10 @@ function validateBinaryPath(binaryPath: string): { valid: boolean; version?: str

 export function registerSettingsRoutes(app: FastifyInstance, deps: RouteDeps) {
  // Full-document access
-  app.get("/api/storage/config", async () => deps.settings.getDoc("config"))
+  app.get("/api/storage/config", async () => sanitizeConfigDoc(deps.settings.getDoc("config")))
  app.patch("/api/storage/config", async (request, reply) => {
    try {
-      return deps.settings.mergePatchDoc("config", request.body ?? {})
+      return sanitizeConfigDoc(deps.settings.mergePatchDoc("config", request.body ?? {}))
    } catch (error) {
      reply.code(400)
      return { error: error instanceof Error ? error.message : "Invalid patch" }
@@ -31,12 +32,15 @@ export function registerSettingsRoutes(app: FastifyInstance, deps: RouteDeps) {
  })

  app.get<{ Params: { owner: string } }>("/api/storage/config/:owner", async (request) => {
-    return deps.settings.getOwner("config", request.params.owner)
+    return sanitizeConfigOwner(request.params.owner, deps.settings.getOwner("config", request.params.owner))
  })

  app.patch<{ Params: { owner: string } }>("/api/storage/config/:owner", async (request, reply) => {
    try {
-      return deps.settings.mergePatchOwner("config", request.params.owner, request.body ?? {})
+      return sanitizeConfigOwner(
+        request.params.owner,
+        deps.settings.mergePatchOwner("config", request.params.owner, request.body ?? {}),
+      )
    } catch (error) {
      reply.code(400)
      return { error: error instanceof Error ? error.message : "Invalid patch" }
--- a/packages/server/src/server/routes/speech.ts
+++ b/packages/server/src/server/routes/speech.ts
@@ -0,0 +1,74 @@
+import type { FastifyInstance } from "fastify"
+import { z } from "zod"
+import type { SpeechService } from "../../speech/service"
+
+interface RouteDeps {
+  speechService: SpeechService
+}
+
+const TranscribeBodySchema = z.object({
+  audioBase64: z.string().min(1, "Audio payload is required"),
+  mimeType: z.string().min(1, "Audio MIME type is required"),
+  filename: z.string().optional(),
+  language: z.string().optional(),
+  prompt: z.string().optional(),
+})
+
+const SynthesizeBodySchema = z.object({
+  text: z.string().trim().min(1, "Text is required"),
+  format: z.enum(["mp3", "wav", "opus", "aac"]).optional(),
+})
+
+function getSpeechErrorStatus(error: unknown): number {
+  if (error instanceof z.ZodError) {
+    return 400
+  }
+  if (error instanceof Error && /not configured/i.test(error.message)) {
+    return 503
+  }
+  return 502
+}
+
+function getSpeechErrorMessage(error: unknown, fallback: string): string {
+  return error instanceof Error ? error.message : fallback
+}
+
+export function registerSpeechRoutes(app: FastifyInstance, deps: RouteDeps) {
+  app.get("/api/speech/capabilities", async () => deps.speechService.getCapabilities())
+
+  app.post("/api/speech/transcribe", async (request, reply) => {
+    try {
+      const body = TranscribeBodySchema.parse(request.body ?? {})
+      return await deps.speechService.transcribe(body)
+    } catch (error) {
+      request.log.error({ err: error }, "Failed to transcribe audio")
+      reply.code(getSpeechErrorStatus(error))
+      return { error: getSpeechErrorMessage(error, "Failed to transcribe audio") }
+    }
+  })
+
+  app.post("/api/speech/synthesize", async (request, reply) => {
+    try {
+      const body = SynthesizeBodySchema.parse(request.body ?? {})
+      return await deps.speechService.synthesize(body)
+    } catch (error) {
+      request.log.error({ err: error }, "Failed to synthesize audio")
+      reply.code(getSpeechErrorStatus(error))
+      return { error: getSpeechErrorMessage(error, "Failed to synthesize audio") }
+    }
+  })
+
+  app.post("/api/speech/synthesize/stream", async (request, reply) => {
+    try {
+      const body = SynthesizeBodySchema.parse(request.body ?? {})
+      const result = await deps.speechService.synthesizeStream(body)
+      reply.header("Content-Type", result.mimeType)
+      reply.header("Cache-Control", "no-store")
+      return reply.send(result.stream)
+    } catch (error) {
+      request.log.error({ err: error }, "Failed to stream synthesized audio")
+      reply.code(getSpeechErrorStatus(error))
+      return { error: getSpeechErrorMessage(error, "Failed to stream synthesized audio") }
+    }
+  })
+}
--- a/packages/server/src/settings/public-config.ts
+++ b/packages/server/src/settings/public-config.ts
@@ -0,0 +1,40 @@
+import type { SettingsDoc } from "./yaml-doc-store"
+
+function isPlainObject(value: unknown): value is Record<string, unknown> {
+  return typeof value === "object" && value !== null && !Array.isArray(value)
+}
+
+function sanitizeServerOwner(value: SettingsDoc): SettingsDoc {
+  const next: SettingsDoc = { ...value }
+  const speech = isPlainObject(next.speech) ? { ...next.speech } : null
+
+  if (!speech) {
+    return next
+  }
+
+  const rawApiKey = typeof speech.apiKey === "string" ? speech.apiKey.trim() : ""
+  if (rawApiKey) {
+    delete speech.apiKey
+    speech.hasApiKey = true
+  } else if (!("hasApiKey" in speech)) {
+    speech.hasApiKey = false
+  }
+
+  next.speech = speech
+  return next
+}
+
+export function sanitizeConfigOwner(owner: string, value: SettingsDoc): SettingsDoc {
+  if (owner !== "server") {
+    return value
+  }
+  return sanitizeServerOwner(value)
+}
+
+export function sanitizeConfigDoc(value: SettingsDoc): SettingsDoc {
+  const next: SettingsDoc = { ...value }
+  if (isPlainObject(next.server)) {
+    next.server = sanitizeServerOwner(next.server)
+  }
+  return next
+}
--- a/packages/server/src/settings/service.ts
+++ b/packages/server/src/settings/service.ts
@@ -4,6 +4,7 @@ import type { ConfigLocation } from "../config/location"
 import { YamlDocStore, type SettingsDoc } from "./yaml-doc-store"
 import { migrateSettingsLayout } from "./migrate"
 import type { WorkspaceEventPayload } from "../api-types"
+import { sanitizeConfigOwner } from "./public-config"

 export type DocKind = "config" | "state"

@@ -45,10 +46,11 @@ export class SettingsService {
  private publish(kind: DocKind, owner: string, value?: SettingsDoc) {
    if (!this.eventBus) return
    const type = kind === "config" ? "storage.configChanged" : "storage.stateChanged"
+    const nextValue = value ?? this.getOwner(kind, owner)
    const payload: WorkspaceEventPayload = {
      type,
      owner,
-      value: value ?? this.getOwner(kind, owner),
+      value: kind === "config" ? sanitizeConfigOwner(owner, nextValue) : nextValue,
    } as any
    this.eventBus.publish(payload)
  }
--- a/packages/server/src/speech/providers/openai-compatible.ts
+++ b/packages/server/src/speech/providers/openai-compatible.ts
@@ -0,0 +1,204 @@
+import { Readable } from "node:stream"
+import OpenAI from "openai"
+import { toFile } from "openai/uploads"
+import type { SpeechSynthesisResponse, SpeechTranscriptionResponse } from "../../api-types"
+import type { Logger } from "../../logger"
+import type { NormalizedSpeechSettings, SpeechSynthesisStreamResponse, SynthesizeSpeechInput, TranscribeAudioInput } from "../service"
+
+interface OpenAICompatibleSpeechProviderOptions {
+  settings: NormalizedSpeechSettings
+  logger: Logger
+}
+
+export class OpenAICompatibleSpeechProvider {
+  constructor(private readonly options: OpenAICompatibleSpeechProviderOptions) {}
+
+  getCapabilities() {
+    const { settings } = this.options
+    return {
+      available: true,
+      configured: Boolean(settings.apiKey),
+      provider: settings.provider,
+      supportsStt: true,
+      supportsTts: true,
+      supportsStreamingTts: true,
+      baseUrl: settings.baseUrl,
+      sttModel: settings.sttModel,
+      ttsModel: settings.ttsModel,
+      ttsVoice: settings.ttsVoice,
+      ttsFormats: ["mp3", "wav", "opus", "aac"],
+      streamingTtsFormats: ["mp3", "wav", "opus", "aac"],
+    }
+  }
+
+  async transcribe(input: TranscribeAudioInput): Promise<SpeechTranscriptionResponse> {
+    const client = this.createClient()
+    const startedAt = Date.now()
+    const extension = extensionForMime(input.mimeType)
+    const buffer = Buffer.from(input.audioBase64, "base64")
+    const filename = input.filename?.trim() || `prompt-input.${extension}`
+
+    this.options.logger.info(
+      {
+        mimeType: input.mimeType,
+        bytes: buffer.byteLength,
+        language: input.language,
+        model: this.options.settings.sttModel,
+      },
+      "speech.transcribe",
+    )
+
+    const response = await this.requestTranscription(client, buffer, filename, input)
+
+    return {
+      text: typeof response?.text === "string" ? response.text : "",
+      language: typeof response?.language === "string" ? response.language : input.language,
+      durationMs: Number.isFinite(response?.duration) ? Math.round(Number(response.duration) * 1000) : Date.now() - startedAt,
+      segments: Array.isArray(response?.segments)
+        ? response.segments
+            .filter((segment: any) => typeof segment?.text === "string")
+            .map((segment: any) => ({
+              startMs: Math.max(0, Math.round(Number(segment.start ?? 0) * 1000)),
+              endMs: Math.max(0, Math.round(Number(segment.end ?? 0) * 1000)),
+              text: String(segment.text),
+            }))
+        : undefined,
+    }
+  }
+
+  private async requestTranscription(
+    client: OpenAI,
+    buffer: Buffer,
+    filename: string,
+    input: TranscribeAudioInput,
+  ): Promise<any> {
+    const baseRequest = {
+      model: this.options.settings.sttModel,
+      ...(input.language ? { language: input.language } : {}),
+      ...(input.prompt ? { prompt: input.prompt } : {}),
+    }
+
+    try {
+      const file = await toFile(buffer, filename, { type: input.mimeType })
+      return (await client.audio.transcriptions.create({
+        ...baseRequest,
+        file,
+        response_format: "verbose_json" as any,
+      } as any)) as any
+    } catch (error) {
+      this.options.logger.warn({ err: error }, "speech.transcribe verbose_json failed; retrying default format")
+      const retryFile = await toFile(buffer, filename, { type: input.mimeType })
+      return (await client.audio.transcriptions.create({
+        ...baseRequest,
+        file: retryFile,
+      } as any)) as any
+    }
+  }
+
+  async synthesize(input: SynthesizeSpeechInput): Promise<SpeechSynthesisResponse> {
+    const format = input.format ?? this.options.settings.ttsFormat
+
+    this.options.logger.info(
+      {
+        model: this.options.settings.ttsModel,
+        voice: this.options.settings.ttsVoice,
+        format,
+      },
+      "speech.synthesize",
+    )
+
+    const response = await this.requestSpeechAudio(input.text, format)
+    const mimeType = response.headers.get("content-type") || mimeTypeForFormat(format)
+
+    const audioBuffer = Buffer.from(await response.arrayBuffer())
+    return {
+      audioBase64: audioBuffer.toString("base64"),
+      mimeType,
+    }
+  }
+
+  async synthesizeStream(input: SynthesizeSpeechInput): Promise<SpeechSynthesisStreamResponse> {
+    const format = input.format ?? this.options.settings.ttsFormat
+
+    this.options.logger.info(
+      {
+        model: this.options.settings.ttsModel,
+        voice: this.options.settings.ttsVoice,
+        format,
+      },
+      "speech.synthesize.stream",
+    )
+
+    const response = await this.requestSpeechAudio(input.text, format)
+    if (!response.body) {
+      throw new Error("Speech provider did not return a stream.")
+    }
+
+    return {
+      stream: Readable.fromWeb(response.body as any),
+      mimeType: response.headers.get("content-type") || mimeTypeForFormat(format),
+    }
+  }
+
+  private async requestSpeechAudio(text: string, format: "mp3" | "wav" | "opus" | "aac"): Promise<Response> {
+    const { settings } = this.options
+    if (!settings.apiKey) {
+      throw new Error("Speech provider is not configured. Add an API key in Speech settings.")
+    }
+
+    const endpoint = new URL("audio/speech", ensureTrailingSlash(settings.baseUrl ?? "https://api.openai.com/v1"))
+    const response = await fetch(endpoint, {
+      method: "POST",
+      headers: {
+        Authorization: `Bearer ${settings.apiKey}`,
+        "Content-Type": "application/json",
+      },
+      body: JSON.stringify({
+        model: settings.ttsModel,
+        voice: settings.ttsVoice,
+        input: text,
+        response_format: format,
+      }),
+    })
+
+    if (!response.ok) {
+      const detail = await response.text()
+      throw new Error(detail || `Speech synthesis failed with ${response.status}`)
+    }
+
+    return response
+  }
+
+  private createClient(): OpenAI {
+    const { settings } = this.options
+    if (!settings.apiKey) {
+      throw new Error("Speech provider is not configured. Add an API key in Speech settings.")
+    }
+
+    return new OpenAI({
+      apiKey: settings.apiKey,
+      baseURL: settings.baseUrl,
+    })
+  }
+}
+
+function extensionForMime(mimeType: string): string {
+  const normalized = mimeType.toLowerCase()
+  if (normalized.includes("webm")) return "webm"
+  if (normalized.includes("ogg")) return "ogg"
+  if (normalized.includes("wav")) return "wav"
+  if (normalized.includes("mpeg") || normalized.includes("mp3")) return "mp3"
+  if (normalized.includes("mp4") || normalized.includes("aac")) return "m4a"
+  return "webm"
+}
+
+function mimeTypeForFormat(format: "mp3" | "wav" | "opus" | "aac"): string {
+  if (format === "wav") return "audio/wav"
+  if (format === "opus") return 'audio/ogg; codecs="opus"'
+  if (format === "aac") return "audio/aac"
+  return "audio/mpeg"
+}
+
+function ensureTrailingSlash(value: string): string {
+  return value.endsWith("/") ? value : `${value}/`
+}
--- a/packages/server/src/speech/service.ts
+++ b/packages/server/src/speech/service.ts
@@ -0,0 +1,106 @@
+import { z } from "zod"
+import type { Readable } from "node:stream"
+import type { Logger } from "../logger"
+import type { SettingsService } from "../settings/service"
+import type { SpeechCapabilitiesResponse, SpeechSynthesisResponse, SpeechTranscriptionResponse } from "../api-types"
+import { OpenAICompatibleSpeechProvider } from "./providers/openai-compatible"
+
+const ServerSpeechSettingsSchema = z.object({
+  speech: z
+    .object({
+      provider: z.string().optional(),
+      apiKey: z.string().optional(),
+      baseUrl: z.string().optional(),
+      sttModel: z.string().optional(),
+      ttsModel: z.string().optional(),
+      ttsVoice: z.string().optional(),
+      ttsFormat: z.enum(["mp3", "wav", "opus", "aac"]).optional(),
+    })
+    .optional(),
+})
+
+export interface TranscribeAudioInput {
+  audioBase64: string
+  mimeType: string
+  filename?: string
+  language?: string
+  prompt?: string
+}
+
+export interface SynthesizeSpeechInput {
+  text: string
+  format?: "mp3" | "wav" | "opus" | "aac"
+}
+
+export interface SpeechSynthesisStreamResponse {
+  stream: Readable
+  mimeType: string
+}
+
+export interface SpeechProvider {
+  getCapabilities(): SpeechCapabilitiesResponse
+  transcribe(input: TranscribeAudioInput): Promise<SpeechTranscriptionResponse>
+  synthesize(input: SynthesizeSpeechInput): Promise<SpeechSynthesisResponse>
+  synthesizeStream(input: SynthesizeSpeechInput): Promise<SpeechSynthesisStreamResponse>
+}
+
+export interface NormalizedSpeechSettings {
+  provider: string
+  apiKey?: string
+  baseUrl?: string
+  sttModel: string
+  ttsModel: string
+  ttsVoice: string
+  ttsFormat: "mp3" | "wav" | "opus" | "aac"
+}
+
+const DEFAULT_PROVIDER = "openai-compatible"
+const DEFAULT_STT_MODEL = "gpt-4o-mini-transcribe"
+const DEFAULT_TTS_MODEL = "gpt-4o-mini-tts"
+const DEFAULT_TTS_VOICE = "alloy"
+const DEFAULT_TTS_FORMAT = "mp3"
+export class SpeechService {
+  constructor(
+    private readonly settings: SettingsService,
+    private readonly logger: Logger,
+  ) {}
+
+  getCapabilities(): SpeechCapabilitiesResponse {
+    return this.createProvider().getCapabilities()
+  }
+
+  async transcribe(input: TranscribeAudioInput): Promise<SpeechTranscriptionResponse> {
+    return this.createProvider().transcribe(input)
+  }
+
+  async synthesize(input: SynthesizeSpeechInput): Promise<SpeechSynthesisResponse> {
+    return this.createProvider().synthesize(input)
+  }
+
+  async synthesizeStream(input: SynthesizeSpeechInput): Promise<SpeechSynthesisStreamResponse> {
+    return this.createProvider().synthesizeStream(input)
+  }
+
+  private createProvider(): SpeechProvider {
+    const settings = this.resolveSettings()
+    return new OpenAICompatibleSpeechProvider({
+      settings,
+      logger: this.logger.child({ provider: settings.provider }),
+    })
+  }
+
+  private resolveSettings(): NormalizedSpeechSettings {
+    const parsed = ServerSpeechSettingsSchema.parse(this.settings.getOwner("config", "server") ?? {})
+    const speech = parsed.speech ?? {}
+
+    return {
+      provider: speech.provider?.trim() || DEFAULT_PROVIDER,
+      apiKey: speech.apiKey?.trim() || process.env.OPENAI_API_KEY,
+      baseUrl: speech.baseUrl?.trim() || process.env.OPENAI_BASE_URL || undefined,
+      sttModel: speech.sttModel?.trim() || DEFAULT_STT_MODEL,
+      ttsModel: speech.ttsModel?.trim() || DEFAULT_TTS_MODEL,
+      ttsVoice: speech.ttsVoice?.trim() || DEFAULT_TTS_VOICE,
+      ttsFormat: speech.ttsFormat ?? DEFAULT_TTS_FORMAT,
+    }
+  }
+}
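
The fallback order in `resolveSettings()` above can be illustrated in isolation. This is a minimal sketch under stated assumptions, not part of the commit: `resolveSpeech` and `SpeechOverrides` are hypothetical names, only two of the settings are modelled, and the environment is passed in as a plain object instead of reading `process.env` so the snippet has no Node dependency.

```typescript
// Hypothetical subset of the `speech` block in the server config.
interface SpeechOverrides {
  apiKey?: string
  sttModel?: string
}

// Same default as DEFAULT_STT_MODEL in the diff above.
const DEFAULT_STT_MODEL = "gpt-4o-mini-transcribe"

// Hypothetical helper mirroring the `value?.trim() || fallback` pattern:
// a missing, empty, or whitespace-only config value falls through to the next source.
function resolveSpeech(
  overrides: SpeechOverrides,
  env: Record<string, string | undefined>,
): { apiKey: string | undefined; sttModel: string } {
  return {
    apiKey: overrides.apiKey?.trim() || env.OPENAI_API_KEY,
    sttModel: overrides.sttModel?.trim() || DEFAULT_STT_MODEL,
  }
}

// A whitespace-only key in config does not shadow the environment variable.
const fromEnv = resolveSpeech({ apiKey: "   " }, { OPENAI_API_KEY: "env-key" })

// A configured key wins over the environment and is trimmed.
const fromConfig = resolveSpeech({ apiKey: " cfg-key ", sttModel: "custom-stt" }, { OPENAI_API_KEY: "env-key" })
```

Using `||` rather than `??` is what lets blank strings fall through. In the diff, `ttsFormat` uses `??` instead, which is safe there because the zod enum already rejects an empty string.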