
feat: Transcription - Refactor making transcription much more robust / capable.#382

Open
neurocis wants to merge 9 commits into spacedriveapp:main from neurocis:feat/transcription

Conversation

@neurocis neurocis commented Mar 10, 2026

Voice Transcription
Spacebot converts audio attachments (Telegram voice messages, Discord audio clips, etc.) to text using Whisper-compatible speech-to-text APIs before passing them to the channel LLM.

How It Works
User sends a voice message
↓
Channel receives the audio attachment (audio/* MIME type)
↓
transcribe_audio_attachment()
├─ Downloads audio bytes from the attachment URL
├─ Reads routing config: voice, voice_language, voice_translate, stt_provider
├─ Resolves the STT provider (stt_provider override → voice model prefix → error)
├─ Checks that the provider supports the Whisper API
├─ Sends a multipart POST to /v1/audio/transcriptions (or /translations)
└─ Returns the transcript as a <voice_transcript> XML tag in the conversation
The transcript is injected into conversation history as structured content:

<voice_transcript name="voice_message.ogg" mime="audio/ogg">
Hello, this is what the user said.
</voice_transcript>
When translation mode is enabled, the tag changes:

<voice_translation name="voice_message.ogg" mime="audio/ogg">
Hello, this is the English translation of what the user said.
</voice_translation>
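The wrapper construction above can be sketched as simple string formatting. This is a minimal illustration; `wrap_transcript` is a hypothetical helper name, not the PR's actual function:

```rust
// Sketch of building the transcript wrapper tag shown above.
// `wrap_transcript` is an illustrative name, not the PR's real API.
fn wrap_transcript(translated: bool, name: &str, mime: &str, text: &str) -> String {
    // Translation mode swaps the tag name, per the examples above.
    let tag = if translated { "voice_translation" } else { "voice_transcript" };
    format!("<{tag} name=\"{name}\" mime=\"{mime}\">\n{text}\n</{tag}>")
}

fn main() {
    let wrapped = wrap_transcript(false, "voice_message.ogg", "audio/ogg", "Hello");
    assert!(wrapped.starts_with("<voice_transcript name=\"voice_message.ogg\""));
    println!("{wrapped}");
}
```

Note that a production version should XML-escape the filename, MIME type, and transcript text, a point one of the review comments on this PR raises.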
Configuration
All voice settings live under [defaults.routing] (or per-agent [[agents]].routing) in config.toml.

Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| voice | String | Provider-dependent (see below) | STT model in "provider/model" format. Empty string disables voice transcription. |
| voice_language | Option<String> | None | ISO 639-1 language hint for transcription accuracy (e.g., "en", "es", "fr", "ja"). Ignored in translation mode. |
| voice_translate | bool | false | When true, uses the /v1/audio/translations endpoint to translate audio to English instead of transcribing in the source language. |
| stt_provider | Option<String> | None | Override which provider handles STT. When absent, the provider is extracted from the voice model string prefix. |
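The stt_provider override and the "provider/model" prefix rule can be sketched as follows. The helper name and signature are illustrative; the actual resolution lives in src/agent/channel_attachments.rs:

```rust
// Sketch of STT provider resolution: explicit override first, then the
// "provider/model" prefix of `voice`, otherwise an error. The helper name
// is illustrative, not the PR's actual function.
fn resolve_stt<'a>(
    stt_provider: Option<&'a str>,
    voice: &'a str,
) -> Result<(&'a str, &'a str), String> {
    match (stt_provider, voice.split_once('/')) {
        // Override set: use it, stripping any provider prefix from the model.
        (Some(p), Some((_, model))) => Ok((p, model)),
        (Some(p), None) => Ok((p, voice)),
        // No override: take the provider from the model prefix.
        (None, Some((p, model))) => Ok((p, model)),
        (None, None) => Err(format!(
            "voice model '{voice}' must be 'provider/model' when stt_provider is unset"
        )),
    }
}

fn main() {
    assert_eq!(
        resolve_stt(None, "groq/whisper-large-v3-turbo"),
        Ok(("groq", "whisper-large-v3-turbo"))
    );
    assert_eq!(resolve_stt(Some("openai"), "whisper-1"), Ok(("openai", "whisper-1")));
    assert!(resolve_stt(None, "whisper-1").is_err());
}
```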
Provider Defaults
When no explicit voice is configured, Spacebot sets a default based on the primary provider:

| Provider | Default voice | Notes |
| --- | --- | --- |
| openai | openai/whisper-1 | Native OpenAI Whisper API |
| groq | groq/whisper-large-v3-turbo | Fast, cheap Whisper endpoint |
| gemini | gemini/gemini-2.5-flash | Via Gemini's OpenAI-compatible endpoint |
| openrouter | (empty) | No native STT — must configure stt_provider separately |
| anthropic | (empty) | No STT support — must configure stt_provider separately |
| All others | (empty) | Must configure voice and optionally stt_provider |
Available STT Models
| Provider | Models | Endpoint |
| --- | --- | --- |
| OpenAI | whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe | POST /v1/audio/transcriptions |
| Groq | whisper-large-v3, whisper-large-v3-turbo | POST /openai/v1/audio/transcriptions |
| Gemini | gemini-2.5-flash (and other Gemini models) | POST /v1/audio/transcriptions (OpenAI-compatible) |
Supported Audio Formats
The Whisper API accepts: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm.

Telegram voice messages are OGG/Opus, which is natively supported.
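An extension check against that allowlist could look like the sketch below. This is illustrative only; the PR does not necessarily validate formats client-side before uploading:

```rust
// The Whisper API's accepted container formats, per the list above.
const WHISPER_FORMATS: &[&str] = &[
    "flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "oga", "ogg", "wav", "webm",
];

// Illustrative helper: checks a filename extension against the allowlist.
fn whisper_supported(filename: &str) -> bool {
    filename
        .rsplit_once('.')
        .map(|(_, ext)| WHISPER_FORMATS.contains(&ext.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}

fn main() {
    assert!(whisper_supported("voice_message.ogg")); // Telegram OGG/Opus
    assert!(!whisper_supported("clip.aac"));
}
```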

Environment Variables
Environment variables override config file values:

| Variable | Maps to | Example |
| --- | --- | --- |
| SPACEBOT_VOICE_MODEL | routing.voice | groq/whisper-large-v3-turbo |
| SPACEBOT_VOICE_LANGUAGE | routing.voice_language | en |
| SPACEBOT_VOICE_TRANSLATE | routing.voice_translate | true |
| SPACEBOT_STT_PROVIDER | routing.stt_provider | groq |
Resolution order: environment variable → config file → provider default.
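That precedence can be sketched as a plain Option chain. Function names here are illustrative stand-ins for the logic in src/config/load.rs:

```rust
// Resolution order for the voice model: env var → config file → provider default.
fn resolve_voice(
    env_value: Option<&str>,
    config_value: Option<&str>,
    provider_default: Option<&str>,
) -> Option<String> {
    env_value.or(config_value).or(provider_default).map(str::to_owned)
}

// SPACEBOT_VOICE_TRANSLATE should be parsed whenever it is present so that
// "false" can override a `true` in config (a fix one review comment on this
// PR suggests).
fn resolve_translate(env_value: Option<&str>, config_value: bool) -> bool {
    match env_value {
        Some(v) => v.eq_ignore_ascii_case("true"),
        None => config_value,
    }
}

fn main() {
    assert_eq!(
        resolve_voice(Some("groq/whisper-large-v3-turbo"), Some("openai/whisper-1"), None)
            .as_deref(),
        Some("groq/whisper-large-v3-turbo")
    );
    assert!(!resolve_translate(Some("false"), true));
}
```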

API
GET /api/config
Returns current routing configuration including voice fields:

{
  "routing": {
    "voice": "groq/whisper-large-v3-turbo",
    "voice_language": "en",
    "voice_translate": false,
    "stt_provider": "groq"
  }
}
PATCH /api/config
Update voice settings at runtime:

{
  "agent_id": "main",
  "routing": {
    "voice": "openai/whisper-1",
    "voice_language": "es",
    "voice_translate": false,
    "stt_provider": "openai"
  }
}
GET /api/models?capability=voice_transcription
Returns models from providers that support Whisper-compatible transcription (currently: openai, groq, gemini).
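The filter can be sketched with the constant and helper named in the PR's src/api/models.rs changes; the signature here is an assumption:

```rust
// Providers exposing Whisper-compatible transcription endpoints, per the PR.
const WHISPER_CAPABLE_PROVIDERS: &[&str] = &["openai", "groq", "gemini"];

// Capability check used by the ?capability=voice_transcription filter
// (signature assumed; the PR names this helper in src/api/models.rs).
fn supports_voice_transcription(provider_id: &str) -> bool {
    WHISPER_CAPABLE_PROVIDERS.contains(&provider_id)
}

fn main() {
    assert!(supports_voice_transcription("groq"));
    assert!(!supports_voice_transcription("anthropic"));
}
```

One review comment below argues this should eventually be a model-level check rather than a provider-level one, since not every model from a capable provider can transcribe audio.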

Example Configurations
Groq for chat and transcription
[llm]
groq_key = "gsk_xxx"

[defaults.routing]
channel = "groq/llama-3.3-70b-versatile"
voice = "groq/whisper-large-v3-turbo"
OpenRouter for chat, Groq for transcription
[llm]
openrouter_key = "sk-or-xxx"
groq_key = "gsk_xxx"

[defaults.routing]
channel = "openrouter/anthropic/claude-sonnet-4"
voice = "groq/whisper-large-v3-turbo"
voice_language = "en"
Anthropic for chat, OpenAI for transcription with translation
[llm]
anthropic_key = "sk-ant-xxx"
openai_key = "sk-xxx"

[defaults.routing]
channel = "anthropic/claude-sonnet-4"
voice = "openai/whisper-1"
voice_translate = true
stt_provider = "openai"
Multilingual transcription with language hint
[llm]
openai_key = "sk-xxx"

[defaults.routing]
channel = "openai/gpt-4.1"
voice = "openai/whisper-1"
voice_language = "ja"
Error Handling
Errors are returned as inline text messages in conversation context (not exceptions), so the channel LLM sees the failure and can inform the user:

| Condition | Message |
| --- | --- |
| No voice model configured | [Audio attachment received but no voice model is configured. Add voice = "provider/model" to [defaults.routing] in config.] |
| STT provider not configured | [Audio transcription failed: provider 'xxx' is not configured] |
| Provider doesn't support Whisper | [Audio transcription not supported by provider 'xxx'. Configure a Whisper-compatible STT provider (openai, groq, gemini).] |
| Transcription API error | [Audio transcription failed for filename.ogg: Whisper API error (400): ...message...] |
| Download failure | [Failed to download audio: filename.ogg] |
There is no fallback to multimodal chat. If transcription fails, the error is returned directly.

Architecture
Module Layout
src/llm/transcription.rs — Whisper API client (multipart form, endpoint routing, response parsing)
src/llm/routing.rs — RoutingConfig with voice, voice_language, voice_translate, stt_provider
src/agent/channel_attachments.rs — transcribe_audio_attachment() orchestration
src/config/toml_schema.rs — TOML deserialization for voice config fields
src/config/providers.rs — resolve_routing() merges TOML → base config
src/config/load.rs — Environment variable resolution
src/api/config.rs — API GET/PATCH for voice settings
src/api/models.rs — voice_transcription capability filter
Request Flow
1. channel_attachments::download_attachments() detects the audio/* MIME type.
2. Calls transcribe_audio_attachment(), which:
   - Downloads raw bytes from the attachment URL
   - Reads routing.voice, routing.voice_language, routing.voice_translate, routing.stt_provider from RuntimeConfig
   - Resolves the provider via the stt_provider override or the voice model prefix
   - Validates that the provider supports Whisper via supports_whisper_transcription()
3. Calls llm::transcribe_audio(), which:
   - Builds the correct endpoint URL based on the provider (build_whisper_endpoint())
   - Constructs a multipart form with: file (audio bytes), model, response_format: json, optional language
   - Sends a POST with an Authorization: Bearer header and any extra_headers
   - Parses the {"text": "...", "duration": ...} response
4. The transcript is injected as a <voice_transcript> or <voice_translation> XML tag into the channel conversation.
Provider Endpoint Mapping
| Provider | Base URL | Transcription Path | Translation Path |
| --- | --- | --- | --- |
| OpenAI | https://api.openai.com | /v1/audio/transcriptions | /v1/audio/translations |
| Groq | https://api.groq.com/openai | /openai/v1/audio/transcriptions | /openai/v1/audio/translations |
| Gemini | https://generativelanguage.googleapis.com/v1beta/openai | /v1/audio/transcriptions | /v1/audio/translations |
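A simplified sketch of the endpoint builder, assuming a plain concatenation of a trimmed base URL and a /v1/audio/... path. The PR's real build_whisper_endpoint() special-cases Groq, and one review comment flags a duplicate "/openai" segment there, so the actual mapping may differ:

```rust
// Simplified endpoint builder: append /v1/audio/{transcriptions,translations}
// to the provider base URL. This sketch assumes plain concatenation; the
// PR's build_whisper_endpoint() also has provider-specific branches.
fn build_endpoint(base_url: &str, translate: bool) -> String {
    let path = if translate { "audio/translations" } else { "audio/transcriptions" };
    format!("{}/v1/{}", base_url.trim_end_matches('/'), path)
}

fn main() {
    assert_eq!(
        build_endpoint("https://api.openai.com/", false),
        "https://api.openai.com/v1/audio/transcriptions"
    );
    assert_eq!(
        build_endpoint("https://api.groq.com/openai", true),
        "https://api.groq.com/openai/v1/audio/translations"
    );
}
```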
Key Design Decisions
Multipart form data — The Whisper API requires multipart uploads with the audio file, not JSON with base64-encoded audio.
No fallback — On failure, an error message is returned. The previous approach of falling back to multimodal chat with input_audio content type is removed.
Separate STT routing — stt_provider allows using a different provider for transcription than for chat (e.g., Anthropic for chat, Groq for STT).
Language hint ignored for translations — Per the Whisper API spec, language is only applicable to transcriptions.
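The multipart form fields and the language-hint rule can be sketched as below. The struct is a simplified stand-in for the PR's TranscriptionRequest, and the field list follows the request-flow description above (the audio bytes go in a separate file part):

```rust
// Simplified stand-in for the PR's TranscriptionRequest.
struct TranscriptionRequest<'a> {
    model: &'a str,
    language: Option<&'a str>,
    translate: bool,
}

// Text fields of the multipart form. `language` is only attached for
// transcriptions, since the Whisper API spec ignores it for translations.
fn form_fields(req: &TranscriptionRequest) -> Vec<(&'static str, String)> {
    let mut fields = vec![
        ("model", req.model.to_owned()),
        ("response_format", "json".to_owned()),
    ];
    if let (Some(lang), false) = (req.language, req.translate) {
        fields.push(("language", lang.to_owned()));
    }
    fields
}

fn main() {
    let req = TranscriptionRequest { model: "whisper-1", language: Some("ja"), translate: false };
    assert!(form_fields(&req).iter().any(|(k, v)| *k == "language" && v == "ja"));
    let req = TranscriptionRequest { model: "whisper-1", language: Some("ja"), translate: true };
    assert!(!form_fields(&req).iter().any(|(k, _)| *k == "language"));
}
```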

Note

Implementation Summary

Refactored voice transcription from Anthropic's multimodal input_audio API to standard Whisper-compatible /v1/audio/transcriptions endpoints, enabling support for OpenAI, Groq, and Gemini providers.

Key changes: New src/llm/transcription.rs module encapsulates Whisper API client with multipart form handling, provider-specific endpoint routing, and comprehensive test coverage. Updated channel_attachments.rs to use the new transcription module, removing 120+ lines of inline logic. Extended RoutingConfig with voice_language, voice_translate, and stt_provider fields to support language hints and provider override. Updated config loading, API endpoints, and models filtering to expose new voice settings. Simplified model filtering in src/api/models.rs by checking provider capability rather than maintaining a hardcoded model list.

Written by Tembo for commit 96759b2. This will update automatically on new commits.

neurocis added 4 commits March 9, 2026 17:15
Replace the incorrect multimodal chat approach with proper Whisper-compatible
speech-to-text APIs using multipart form data.

Changes:
- Add voice_language, voice_translate, stt_provider config fields
- Create new transcription module with Whisper-compatible implementation
- Support OpenAI, Groq, and Gemini OpenAI-compatible endpoints
- Add environment variables: SPACEBOT_VOICE_LANGUAGE, SPACEBOT_VOICE_TRANSLATE, SPACEBOT_STT_PROVIDER
- Set sensible voice defaults for OpenAI (whisper-1), Groq (whisper-large-v3-turbo), Gemini (gemini-2.5-flash)
- Update API config response with new STT fields
- Add comprehensive unit tests for transcription module

The previous implementation incorrectly used /v1/chat/completions with input_audio
content type. Now uses proper /v1/audio/transcriptions endpoint with multipart
form data for actual speech-to-text transcription.
The send() method returns reqwest::Error which doesn't have a From
implementation for our Error type. Map it to LlmError::ProviderRequest.

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d5ec799-cdfc-47b9-8218-c16b42311b96

📥 Commits

Reviewing files that changed from the base of the PR and between 01e50d8 and 46616fd.

📒 Files selected for processing (2)
  • src/config/load.rs
  • src/config/toml_schema.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/config/load.rs

Walkthrough

Adds voice transcription: new routing/config fields and env overrides for STT, a Whisper-compatible transcription module and tests, provider-capability checks, documentation for voice transcription, API and routing updates, and agent attachment changes to call the new transcription pathway.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Repository ignores: /.gitignore | Expanded ignore patterns (Rust target/.Cargo.lock, Python __pycache__/envs, Node node_modules/builds) and generalized OpenCode ignore to .opencode*/. |
| Documentation: docs/content/docs/(configuration)/config.mdx, docs/content/docs/(core)/routing.mdx, docs/content/docs/(features)/voice-transcription.mdx | Added voice/STT docs: env vars, routing defaults, full voice-transcription feature page, examples, and reference entries. |
| Config loading & schema: src/config/load.rs, src/config/toml_schema.rs, src/config/providers.rs | Added env overrides SPACEBOT_VOICE_LANGUAGE, SPACEBOT_VOICE_TRANSLATE, SPACEBOT_STT_PROVIDER; TOML schema and merge logic now include voice_language, voice_translate, stt_provider. |
| Routing & LLM defaults: src/llm/routing.rs, src/llm.rs | Extended RoutingConfig with voice_language, voice_translate, stt_provider; initialized defaults and re-exported the new transcription module. |
| Transcription implementation: src/llm/transcription.rs | New Whisper-compatible transcription client, public TranscriptionRequest/Response, transcribe_audio(), endpoint/path builders, multipart form handling, response parsing, provider capability helper, and tests. |
| API surface: src/api/config.rs, src/api/models.rs | API structs now expose new routing fields and persistence; replaced model-list checks with provider-based WHISPER_CAPABLE_PROVIDERS and supports_voice_transcription() logic. |
| Agent integration: src/agent/channel_attachments.rs | Removed inline HTTP transcription handling; now derives provider/model and calls transcribe_audio(); validates the provider via supports_whisper_transcription(); updated error messages and response wrapping. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: Transcription - Refactor making transcription much more robust / capable.' clearly describes the main change: a refactor of transcription functionality to make it more robust and capable. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing the voice transcription feature, how it works, configuration options, supported providers, API endpoints, and error handling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Co-authored-by: KiloCodium <KiloCoder@neurocis.ai>
@neurocis neurocis force-pushed the feat/transcription branch from e37dc26 to 3f76e29 Compare March 10, 2026 04:13
@neurocis neurocis marked this pull request as ready for review March 10, 2026 04:14
if let Ok(voice_language) = std::env::var("SPACEBOT_VOICE_LANGUAGE") {
routing.voice_language = Some(voice_language);
}
if let Ok(voice_translate) = std::env::var("SPACEBOT_VOICE_TRANSLATE") {
SPACEBOT_VOICE_TRANSLATE currently only ever flips this on, so SPACEBOT_VOICE_TRANSLATE=false won’t override a true from config. Probably want to set the bool whenever the env var is present.

Suggested change
    if let Ok(voice_translate) = std::env::var("SPACEBOT_VOICE_TRANSLATE") {
        routing.voice_translate = voice_translate.eq_ignore_ascii_case("true");
    }

));
}
};
let provider_id = routing.stt_provider.as_deref().unwrap_or_else(|| {
Per the doc comment in the PR description, this routing should be: stt_provider override → voice prefix → error. Defaulting to anthropic when voice has no provider/model prefix seems like it’ll produce confusing failures.

Suggested change
    let model_name = voice_model
        .split_once('/')
        .map(|(_, m)| m)
        .unwrap_or(voice_model);
    let provider_id = if let Some(stt_provider) = routing.stt_provider.as_deref() {
        stt_provider
    } else if let Some((p, _)) = voice_model.split_once('/') {
        p
    } else {
        tracing::warn!(model = %voice_model, "invalid voice model route");
        return UserContent::text(format!(
            "[Audio transcription failed for {}: invalid voice model '{}'; expected provider/model]",
            attachment.filename, voice_model
        ));
    };

} else {
"voice_transcript"
};
UserContent::text(format!(
response.text is ultimately user-controlled; as-is it can contain < / & / " and break the <voice_transcript ...> wrapper (or inject additional tags into the prompt). Escaping keeps the wrapper well-formed.

Suggested change
    let escape_attr = |s: &str| {
        s.replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;")
            .replace('"', "&quot;")
    };
    let escape_text = |s: &str| {
        s.replace('&', "&amp;")
            .replace('<', "&lt;")
            .replace('>', "&gt;")
    };
    let filename = escape_attr(&attachment.filename);
    let mime_type = escape_attr(&attachment.mime_type);
    let text = escape_text(&response.text);
    UserContent::text(format!(
        "<{} name=\"{}\" mime=\"{}\">\n{}\n</{}>",
        tag, filename, mime_type, text, tag
    ))

translated: bool,
) -> Result<TranscriptionResponse> {
let status = response.status();
let body: serde_json::Value = response
Minor robustness nit: on non-2xx responses some providers return non-JSON bodies (HTML/proxy text), and response.json() will fail and hide the underlying status/body. Consider reading response.text() first, then serde_json::from_str opportunistically, and include a truncated raw body in the error message when JSON parsing fails.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Around line 13-15: The .gitignore currently lists ".Cargo.lock" (a
dot-prefixed filename) which won't match the actual Rust lockfile "Cargo.lock";
update the ignore entry to "Cargo.lock" (or add both "Cargo.lock" and
".Cargo.lock" if you want to cover both) so the standard Rust lockfile is
properly ignored.

In `@src/agent/channel_attachments.rs`:
- Around line 272-275: Escape the values injected into the XML wrapper to
prevent tag/attribute injection: when constructing the UserContent::text in the
block that formats "<{} name=\"{}\" mime=\"{}\">...\n</{}>" (the code
referencing tag, attachment.filename, attachment.mime_type, response.text), run
attachment.filename and attachment.mime_type through an attribute-escaping
helper and run response.text through an element/inner-text escaping helper (or
add a small local xml_escape function if none exists) before formatting; ensure
the same escaped values are used for both the start and end tag context where
applicable to avoid breaking the wrapper.
- Around line 225-235: The current logic silently defaults provider_id to
"anthropic" when routing.stt_provider is None, which masks missing
configuration; change the handling in the block that computes provider_id and
model_name (using routing.stt_provider, voice_model, provider_id, and
model_name) to: if voice_model contains a '/' use the split provider and model
as now; else if routing.stt_provider is Some use that provider and voice_model
as model; otherwise return/raise an error (or propagate a Config/InvalidInput
error) indicating that stt_provider is unset and voice must be set as
"provider/model". Ensure the error message clearly references the missing
stt_provider and the expected "provider/model" format so operators can fix the
config.
- Around line 248-265: The provider list wrongly treats Gemini as
Whisper-compatible; update supports_whisper_transcription to exclude Gemini (do
not rely on googleapis.com or other URL checks that include Gemini) and either
remove Gemini from the branch that builds a TranscriptionRequest or add a
separate Gemini-specific audio path; specifically, change
supports_whisper_transcription(...) so it returns false for provider identifiers
or configs that indicate Gemini, and ensure transcribe_audio(...) is only called
for true results (leave TranscriptionRequest creation and the match on
transcribe_audio(...) unchanged for non-Gemini providers).

In `@src/api/config.rs`:
- Around line 172-174: The patch struct fields voice_language, voice_translate,
and stt_provider must be made tri-state so we can distinguish omitted vs
explicit null: change their types from Option<String>/Option<bool> to
Option<Option<String>> and Option<Option<bool>> in the PATCH struct (the struct
declared around the shown fields, e.g., the config patch type), then update
update_routing_table() to handle three cases: None => do nothing (field
omitted), Some(Some(value)) => write the TOML key with value, and Some(None) =>
delete the corresponding TOML key (remove it so routing falls back to
inherited/default). Apply the same tri-state change and handling for the other
occurrence referenced (around lines 595-603).

In `@src/api/models.rs`:
- Around line 108-110: The current check (supports_voice_transcription using
WHISPER_CAPABLE_PROVIDERS) treats provider support as model support and exposes
whole provider catalogs; replace it with a model-level allowlist or capability
flag. Add a new predicate (e.g., supports_voice_transcription_model(model_id:
&str) -> bool) that checks either a WHISPER_COMPATIBLE_MODELS set or a per-model
metadata flag, update all usages of supports_voice_transcription and
WHISPER_CAPABLE_PROVIDERS to call supports_voice_transcription_model (including
the other occurrences mentioned), and ensure the models endpoint filters by
model id/capability rather than by provider so only actual Whisper-compatible
models are returned for voice_transcription.

In `@src/config/load.rs`:
- Around line 790-800: The TOML load path (from_toml()/load_from_path()) doesn't
apply the SPACEBOT_VOICE_LANGUAGE, SPACEBOT_VOICE_TRANSLATE, and
SPACEBOT_STT_PROVIDER environment overrides so TOML values can shadow env vars
and SPACEBOT_VOICE_TRANSLATE cannot force false; after routing is built in
from_toml() (i.e., after routing resolution) apply the same env-override logic
used in load_from_env(): if SPACEBOT_VOICE_LANGUAGE is set, assign
routing.voice_language = Some(value); if SPACEBOT_STT_PROVIDER is set, assign
routing.stt_provider = Some(value); and for SPACEBOT_VOICE_TRANSLATE explicitly
parse the env var for "true"/"false" (case-insensitive) and set
routing.voice_translate accordingly so env can force both true and false. Ensure
you use the same symbol names (routing.voice_language, routing.voice_translate,
routing.stt_provider) and add this override block in from_toml() after routing
is finalized.

In `@src/llm/routing.rs`:
- Around line 36-41: The stt_provider override (stt_provider) currently can
mismatch the model name extracted from voice (e.g., voice = "openai/whisper-1"
with stt_provider = "groq"), so update the code to either validate and reject
mismatched provider/model pairs or to resolve a provider-specific STT model when
an override is provided: add a resolver function (e.g.,
resolve_stt_model(provider: &str, voice: &str) -> Result<String, Error>) and
call it where channel attachments are built (the code that extracts the model
from voice in channel_attachments), or add a validator (e.g.,
validate_stt_provider_model_pair(stt_provider: Option<String>, voice: &str) ->
Result<(), Error>) in the routing/struct initialization so that passing
stt_provider will produce a matching model name for that provider instead of
blindly stripping voice.

In `@src/llm/transcription.rs`:
- Around line 76-88: The build_whisper_endpoint function is appending an extra
"/openai" for Groq hosts, producing duplicate segments; update
build_whisper_endpoint to append only "/v1/{audio/...}" when base contains
"groq.com" (i.e., remove the "/openai" from the Groq branch) so the path becomes
"{base}/v1/{path}", and update the corresponding unit tests that assert Groq
URLs (tests referencing the Groq endpoint expectations) to expect no duplicate
"/openai" in the resulting URL.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ec65f3f-c2f8-41b9-ba61-b1023371d101

📥 Commits

Reviewing files that changed from the base of the PR and between 81c7855 and 3f76e29.

⛔ Files ignored due to path filters (1)
  • docs/content/docs/(features)/meta.json is excluded by !**/*.json
📒 Files selected for processing (13)
  • .gitignore
  • docs/content/docs/(configuration)/config.mdx
  • docs/content/docs/(core)/routing.mdx
  • docs/content/docs/(features)/voice-transcription.mdx
  • src/agent/channel_attachments.rs
  • src/api/config.rs
  • src/api/models.rs
  • src/config/load.rs
  • src/config/providers.rs
  • src/config/toml_schema.rs
  • src/llm.rs
  • src/llm/routing.rs
  • src/llm/transcription.rs

Comment on lines +13 to +15
# Rust
/target
.Cargo.lock

⚠️ Potential issue | 🟡 Minor

.Cargo.lock won't match Cargo.lock.

Cargo.lock is not a dotfile, so this pattern currently has no effect on the standard Rust lockfile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 13 - 15, The .gitignore currently lists
".Cargo.lock" (a dot-prefixed filename) which won't match the actual Rust
lockfile "Cargo.lock"; update the ignore entry to "Cargo.lock" (or add both
"Cargo.lock" and ".Cargo.lock" if you want to cover both) so the standard Rust
lockfile is properly ignored.

Comment on lines +225 to +235
let provider_id = routing.stt_provider.as_deref().unwrap_or_else(|| {
voice_model
.split_once('/')
.map(|(p, _)| p)
.unwrap_or("anthropic")
});

let model_name = voice_model
.split_once('/')
.map(|(_, m)| m)
.unwrap_or(voice_model);

⚠️ Potential issue | 🟠 Major

Reject providerless voice models when stt_provider is unset.

Falling back to "anthropic" here turns voice = "whisper-1" into a misleading "not configured / not supported" path instead of telling the operator the config is incomplete. This should error unless either stt_provider is set or voice is already provider/model.

💡 Suggested fix
-    let provider_id = routing.stt_provider.as_deref().unwrap_or_else(|| {
-        voice_model
-            .split_once('/')
-            .map(|(p, _)| p)
-            .unwrap_or("anthropic")
-    });
-
-    let model_name = voice_model
-        .split_once('/')
-        .map(|(_, m)| m)
-        .unwrap_or(voice_model);
+    let (provider_id, model_name) = match routing.stt_provider.as_deref() {
+        Some(provider_id) => (provider_id, voice_model),
+        None => match voice_model.split_once('/') {
+            Some((provider_id, model_name)) => (provider_id, model_name),
+            None => {
+                return UserContent::text(
+                    "[Audio transcription failed: `voice` must be `provider/model` when `stt_provider` is unset]",
+                );
+            }
+        },
+    };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_attachments.rs` around lines 225 - 235, The current logic
silently defaults provider_id to "anthropic" when routing.stt_provider is None,
which masks missing configuration; change the handling in the block that
computes provider_id and model_name (using routing.stt_provider, voice_model,
provider_id, and model_name) to: if voice_model contains a '/' use the split
provider and model as now; else if routing.stt_provider is Some use that
provider and voice_model as model; otherwise return/raise an error (or propagate
a Config/InvalidInput error) indicating that stt_provider is unset and voice
must be set as "provider/model". Ensure the error message clearly references the
missing stt_provider and the expected "provider/model" format so operators can
fix the config.

Comment on lines +248 to +265
     if !supports_whisper_transcription(&provider) {
         return UserContent::text(format!(
-            "[Audio transcription failed for {}: provider '{}' does not support input_audio on this endpoint]",
-            attachment.filename, provider_id
+            "[Audio transcription not supported by provider '{}'. \
+            Configure a Whisper-compatible STT provider (openai, groq, gemini).]",
+            provider_id
         ));
     }

-    let format = audio_format_for_attachment(attachment);
-    use base64::Engine as _;
-    let base64_audio = base64::engine::general_purpose::STANDARD.encode(&bytes);
-
-    let endpoint = format!(
-        "{}/v1/chat/completions",
-        provider.base_url.trim_end_matches('/')
-    );
-    let body = serde_json::json!({
-        "model": model_name,
-        "messages": [{
-            "role": "user",
-            "content": [
-                {
-                    "type": "text",
-                    "text": "Transcribe this audio verbatim. Return only the transcription text."
-                },
-                {
-                    "type": "input_audio",
-                    "input_audio": {
-                        "data": base64_audio,
-                        "format": format,
-                    }
-                }
-            ]
-        }],
-        "temperature": 0
-    });
-
-    let response = match deps
-        .llm_manager
-        .http_client()
-        .post(&endpoint)
-        .header("authorization", format!("Bearer {}", provider.api_key))
-        .header("content-type", "application/json")
-        .json(&body)
-        .send()
-        .await
-    {
-        Ok(response) => response,
-        Err(error) => {
-            tracing::warn!(%error, model = %voice_model, "voice transcription request failed");
-            return UserContent::text(format!(
-                "[Audio transcription failed for {}]",
-                attachment.filename
-            ));
-        }
+    let request = TranscriptionRequest {
+        audio_bytes: &bytes,
+        filename: &attachment.filename,
+        mime_type: &attachment.mime_type,
+        model: model_name,
+        language: routing.voice_language.as_deref(),
+        translate: routing.voice_translate,
+    };

-    let status = response.status();
-    let response_body = match response.json::<serde_json::Value>().await {
-        Ok(body) => body,
+    match transcribe_audio(deps.llm_manager.http_client(), &provider, request).await {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does the Gemini OpenAI-compatible API support POST /v1/audio/translations, or only POST /v1/audio/transcriptions?

💡 Result:

Neither. Google’s Gemini OpenAI-compatible REST base URL (.../v1beta/openai/) does not provide OpenAI’s Whisper-style endpoints like POST /v1/audio/transcriptions or POST /v1/audio/translations. Instead, audio transcription/translation is done by sending audio as input_audio to POST /v1/chat/completions and prompting the model to transcribe/translate. [1]

A third-party compatibility matrix likewise marks Gemini as unsupported for both /audio/transcriptions and /audio/translations. [2]

Sources:
[1] Google AI for Developers — “OpenAI compatibility” (audio example uses /openai/chat/completions) (ai.google.dev)
[2] TrueFoundry AI Gateway docs — supported APIs table (truefoundry.com)


🏁 Script executed:

#!/bin/bash
# Find the definition of supports_whisper_transcription function
rg -n "fn supports_whisper_transcription" --type rs

Repository: spacedriveapp/spacebot

Length of output: 92


🏁 Script executed:

#!/bin/bash
# Search for provider support checks related to transcription
rg -n "supports_whisper_transcription|gemini.*transcrib|transcrib.*gemini" --type rs -i

Repository: spacedriveapp/spacebot

Length of output: 92


🏁 Script executed:

#!/bin/bash
# Look for the TranscriptionRequest and how translation is handled
rg -n "voice_translate|TranscriptionRequest" --type rs -A 5

Repository: spacedriveapp/spacebot

Length of output: 92


🏁 Script executed:

#!/bin/bash
# Find the definition of supports_whisper_transcription function
rg -n "fn supports_whisper_transcription" src/

Repository: spacedriveapp/spacebot

Length of output: 169


🏁 Script executed:

#!/bin/bash
# Search for provider support checks
rg -n "supports_whisper_transcription|voice_translate" src/

Repository: spacedriveapp/spacebot

Length of output: 2394


🏁 Script executed:

#!/bin/bash
# Look at the channel_attachments.rs file to understand context
wc -l src/agent/channel_attachments.rs

Repository: spacedriveapp/spacebot

Length of output: 102


🏁 Script executed:

#!/bin/bash
# Read the supports_whisper_transcription implementation
sed -n '152,170p' src/llm/transcription.rs

Repository: spacedriveapp/spacebot

Length of output: 633


🏁 Script executed:

#!/bin/bash
# Look at the Gemini test to understand what's expected
sed -n '299,310p' src/llm/transcription.rs

Repository: spacedriveapp/spacebot

Length of output: 546


🏁 Script executed:

#!/bin/bash
# Search for transcribe_audio function definition
rg -n "fn transcribe_audio" src/ -A 20

Repository: spacedriveapp/spacebot

Length of output: 2857


🏁 Script executed:

#!/bin/bash
# Look at the build_whisper_endpoint function
rg -n "fn build_whisper_endpoint" src/ -A 15

Repository: spacedriveapp/spacebot

Length of output: 942


🏁 Script executed:

#!/bin/bash
# Get more context on TranscriptionRequest and the full transcribe function
sed -n '39,90p' src/llm/transcription.rs

Repository: spacedriveapp/spacebot

Length of output: 1680


🏁 Script executed:

#!/bin/bash
# Check if there are any other tests or configurations that show Gemini transcription is actually supported
rg -n "gemini.*transcrib|Gemini.*audio" src/ -i -B 2 -A 2

Repository: spacedriveapp/spacebot

Length of output: 472


🏁 Script executed:

#!/bin/bash
# Look at the full test file to understand the intent
sed -n '273,350p' src/llm/transcription.rs

Repository: spacedriveapp/spacebot

Length of output: 2969


Gemini should not be listed as supporting Whisper-compatible audio endpoints.

The code incorrectly includes Gemini in supports_whisper_transcription() via the URL check for googleapis.com. Gemini's OpenAI-compatible API does not support the Whisper endpoints (POST /v1/audio/transcriptions or POST /v1/audio/translations). Instead, Gemini performs audio transcription via POST /v1/chat/completions with audio as input_audio. Remove Gemini from the supported providers list, or implement proper Gemini-specific audio handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_attachments.rs` around lines 248 - 265, The provider list
wrongly treats Gemini as Whisper-compatible; update
supports_whisper_transcription to exclude Gemini (do not rely on googleapis.com
or other URL checks that include Gemini) and either remove Gemini from the
branch that builds a TranscriptionRequest or add a separate Gemini-specific
audio path; specifically, change supports_whisper_transcription(...) so it
returns false for provider identifiers or configs that indicate Gemini, and
ensure transcribe_audio(...) is only called for true results (leave
TranscriptionRequest creation and the match on transcribe_audio(...) unchanged
for non-Gemini providers).

Comment on lines +272 to +275
UserContent::text(format!(
"<{} name=\"{}\" mime=\"{}\">\n{}\n</{}>",
tag, attachment.filename, attachment.mime_type, response.text, tag
))

⚠️ Potential issue | 🟠 Major

Escape transcript text before injecting XML tags.

response.text, attachment.filename, and attachment.mime_type are all unescaped here. A transcript containing </voice_transcript> or a filename containing quotes will break the wrapper and can inject extra pseudo-tags into history.

🔒 Suggested fix
+            let escaped_filename = escape_xml(&attachment.filename);
+            let escaped_mime_type = escape_xml(&attachment.mime_type);
+            let escaped_text = escape_xml(&response.text);
             UserContent::text(format!(
                 "<{} name=\"{}\" mime=\"{}\">\n{}\n</{}>",
-                tag, attachment.filename, attachment.mime_type, response.text, tag
+                tag, escaped_filename, escaped_mime_type, escaped_text, tag
             ))

Use any existing XML/HTML escaping helper in the codebase, or add a tiny local helper if there isn't one yet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel_attachments.rs` around lines 272 - 275, Escape the values
injected into the XML wrapper to prevent tag/attribute injection: when
constructing the UserContent::text in the block that formats "<{} name=\"{}\"
mime=\"{}\">...\n</{}>" (the code referencing tag, attachment.filename,
attachment.mime_type, response.text), run attachment.filename and
attachment.mime_type through an attribute-escaping helper and run response.text
through an element/inner-text escaping helper (or add a small local xml_escape
function if none exists) before formatting; ensure the same escaped values are
used for both the start and end tag context where applicable to avoid breaking
the wrapper.
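
A small stdlib-only escaper along the lines the fix assumes (hypothetical helper; prefer any escaping utility the codebase already provides):

```rust
/// Escapes the five XML special characters, making a string safe to embed
/// in both element text and double-quoted attribute values.
fn escape_xml(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            other => out.push(other),
        }
    }
    out
}
```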

Comment on lines +172 to +174
voice_language: Option<String>,
voice_translate: Option<bool>,
stt_provider: Option<String>,

⚠️ Potential issue | 🟠 Major

PATCH can't unset voice_language or stt_provider.

GET /api/config exposes these as nullable fields, but this PATCH shape can't round-trip null: Option<String> makes omitted and explicit null identical, and update_routing_table() only writes on Some(...). After a client sets either field once, it has no way to remove the key and fall back to inherited/default routing. Please use a tri-state patch shape and delete the TOML key on explicit null.

Also applies to: 595-603

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/config.rs` around lines 172 - 174, The patch struct fields
voice_language, voice_translate, and stt_provider must be made tri-state so we
can distinguish omitted vs explicit null: change their types from
Option<String>/Option<bool> to Option<Option<String>> and Option<Option<bool>>
in the PATCH struct (the struct declared around the shown fields, e.g., the
config patch type), then update update_routing_table() to handle three cases:
None => do nothing (field omitted), Some(Some(value)) => write the TOML key with
value, and Some(None) => delete the corresponding TOML key (remove it so routing
falls back to inherited/default). Apply the same tri-state change and handling
for the other occurrence referenced (around lines 595-603).
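
The three cases can also be modeled as an explicit tri-state enum (sketch; names are illustrative, and the serde wiring is a separate concern — a plain `Option<Option<T>>` needs an adapter such as serde_with's `double_option`, because serde otherwise collapses an explicit JSON `null` into the outer `None`):

```rust
/// Tri-state patch value for a nullable config field.
#[derive(Clone, Debug, PartialEq)]
enum Patch<T> {
    Missing, // field omitted: keep the existing value
    Clear,   // explicit null: delete the TOML key
    Set(T),  // explicit value: write the TOML key
}

/// Applies one patch case to an existing optional field value.
fn apply_patch(existing: Option<String>, patch: Patch<String>) -> Option<String> {
    match patch {
        Patch::Missing => existing,
        Patch::Clear => None,
        Patch::Set(value) => Some(value),
    }
}
```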

Comment on lines +108 to +110
/// Returns true if the provider supports Whisper-compatible voice transcription.
fn supports_voice_transcription(provider: &str) -> bool {
WHISPER_CAPABLE_PROVIDERS.contains(&provider)

⚠️ Potential issue | 🟠 Major

Don't expose whole providers as voice_transcription models.

This turns provider support into model support, so /api/models?capability=voice_transcription now returns the whole OpenAI/Groq/Gemini catalog even though the transcription path still sends the chosen model id straight to /audio/transcriptions. Normal chat models like openai/gpt-4.1 or groq/llama-3.3-70b-versatile become selectable here and then fail at runtime. If this endpoint is meant to drive voice-model selection, it needs a model-level allowlist/signal rather than a provider-level check.

Also applies to: 360-365, 407-410

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/models.rs` around lines 108 - 110, The current check
(supports_voice_transcription using WHISPER_CAPABLE_PROVIDERS) treats provider
support as model support and exposes whole provider catalogs; replace it with a
model-level allowlist or capability flag. Add a new predicate (e.g.,
supports_voice_transcription_model(model_id: &str) -> bool) that checks either a
WHISPER_COMPATIBLE_MODELS set or a per-model metadata flag, update all usages of
supports_voice_transcription and WHISPER_CAPABLE_PROVIDERS to call
supports_voice_transcription_model (including the other occurrences mentioned),
and ensure the models endpoint filters by model id/capability rather than by
provider so only actual Whisper-compatible models are returned for
voice_transcription.
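
One way to express the model-level check (sketch; the model ids listed are examples of Whisper-compatible entries, not an exhaustive or verified catalog):

```rust
/// Models known to work against the /v1/audio/* endpoints
/// (illustrative allowlist; extend per provider docs).
const WHISPER_COMPATIBLE_MODELS: &[&str] = &[
    "openai/whisper-1",
    "groq/whisper-large-v3",
    "groq/whisper-large-v3-turbo",
];

/// Model-level capability check, replacing the provider-level one so
/// ordinary chat models are never offered for voice transcription.
fn supports_voice_transcription_model(model_id: &str) -> bool {
    WHISPER_COMPATIBLE_MODELS.contains(&model_id)
}
```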

Comment on lines +790 to +800
if let Ok(voice_language) = std::env::var("SPACEBOT_VOICE_LANGUAGE") {
routing.voice_language = Some(voice_language);
}
if let Ok(voice_translate) = std::env::var("SPACEBOT_VOICE_TRANSLATE") {
if voice_translate.eq_ignore_ascii_case("true") {
routing.voice_translate = true;
}
}
if let Ok(stt_provider) = std::env::var("SPACEBOT_STT_PROVIDER") {
routing.stt_provider = Some(stt_provider);
}

⚠️ Potential issue | 🟠 Major

Apply these voice/STT env overrides in the TOML load path too.

These vars are only handled in load_from_env(). As soon as config.toml exists, load_from_path() goes through from_toml() and SPACEBOT_VOICE_LANGUAGE, SPACEBOT_VOICE_TRANSLATE, and SPACEBOT_STT_PROVIDER stop overriding config. SPACEBOT_VOICE_TRANSLATE is also one-way here, so env cannot force false over a TOML true.

💡 Suggested direction
+fn apply_voice_env_overrides(routing: &mut RoutingConfig) {
+    if let Ok(voice_language) = std::env::var("SPACEBOT_VOICE_LANGUAGE") {
+        routing.voice_language = (!voice_language.trim().is_empty()).then_some(voice_language);
+    }
+    if let Ok(voice_translate) = std::env::var("SPACEBOT_VOICE_TRANSLATE") {
+        match voice_translate.trim().to_ascii_lowercase().as_str() {
+            "true" => routing.voice_translate = true,
+            "false" => routing.voice_translate = false,
+            _ => tracing::warn!(value = %voice_translate, "invalid SPACEBOT_VOICE_TRANSLATE"),
+        }
+    }
+    if let Ok(stt_provider) = std::env::var("SPACEBOT_STT_PROVIDER") {
+        routing.stt_provider = (!stt_provider.trim().is_empty()).then_some(stt_provider);
+    }
+}
...
-        if let Ok(voice_language) = std::env::var("SPACEBOT_VOICE_LANGUAGE") {
-            routing.voice_language = Some(voice_language);
-        }
-        if let Ok(voice_translate) = std::env::var("SPACEBOT_VOICE_TRANSLATE") {
-            if voice_translate.eq_ignore_ascii_case("true") {
-                routing.voice_translate = true;
-            }
-        }
-        if let Ok(stt_provider) = std::env::var("SPACEBOT_STT_PROVIDER") {
-            routing.stt_provider = Some(stt_provider);
-        }
+        apply_voice_env_overrides(&mut routing);

Apply the same helper after routing resolution in from_toml() as well, otherwise per-agent TOML values will still win over the env var.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/load.rs` around lines 790 - 800, The TOML load path
(from_toml()/load_from_path()) doesn't apply the SPACEBOT_VOICE_LANGUAGE,
SPACEBOT_VOICE_TRANSLATE, and SPACEBOT_STT_PROVIDER environment overrides so
TOML values can shadow env vars and SPACEBOT_VOICE_TRANSLATE cannot force false;
after routing is built in from_toml() (i.e., after routing resolution) apply the
same env-override logic used in load_from_env(): if SPACEBOT_VOICE_LANGUAGE is
set, assign routing.voice_language = Some(value); if SPACEBOT_STT_PROVIDER is
set, assign routing.stt_provider = Some(value); and for SPACEBOT_VOICE_TRANSLATE
explicitly parse the env var for "true"/"false" (case-insensitive) and set
routing.voice_translate accordingly so env can force both true and false. Ensure
you use the same symbol names (routing.voice_language, routing.voice_translate,
routing.stt_provider) and add this override block in from_toml() after routing
is finalized.

Comment on lines +36 to +41
/// Language hint for voice transcription accuracy (e.g., "en", "es").
pub voice_language: Option<String>,
/// If true, use translations endpoint (translates to English).
pub voice_translate: bool,
/// Optional STT provider override (defaults to voice model provider).
pub stt_provider: Option<String>,

⚠️ Potential issue | 🟠 Major

stt_provider needs a matching model contract.

The attachment path in src/agent/channel_attachments.rs (Lines 225-235) takes the provider from stt_provider but still strips the model name out of voice. That means voice = "openai/whisper-1" plus stt_provider = "groq" will send whisper-1 to Groq, and the reverse pairing breaks the same way. Either reject mismatched pairs or resolve a provider-specific STT model when the override is used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llm/routing.rs` around lines 36 - 41, The stt_provider override
(stt_provider) currently can mismatch the model name extracted from voice (e.g.,
voice = "openai/whisper-1" with stt_provider = "groq"), so update the code to
either validate and reject mismatched provider/model pairs or to resolve a
provider-specific STT model when an override is provided: add a resolver
function (e.g., resolve_stt_model(provider: &str, voice: &str) -> Result<String,
Error>) and call it where channel attachments are built (the code that extracts
the model from voice in channel_attachments), or add a validator (e.g.,
validate_stt_provider_model_pair(stt_provider: Option<String>, voice: &str) ->
Result<(), Error>) in the routing/struct initialization so that passing
stt_provider will produce a matching model name for that provider instead of
blindly stripping voice.
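
A sketch of the validating resolver the prompt suggests, assuming `voice` keeps the "provider/model" format described in the PR body (function name and error type are illustrative):

```rust
/// Resolves the STT model name for a provider override, rejecting
/// mismatched pairs like voice = "openai/whisper-1" with
/// stt_provider = "groq".
fn resolve_stt_model(stt_provider: &str, voice: &str) -> Result<String, String> {
    match voice.split_once('/') {
        Some((provider, model)) if provider.eq_ignore_ascii_case(stt_provider) => {
            Ok(model.to_string())
        }
        Some((provider, _)) => Err(format!(
            "stt_provider '{stt_provider}' does not match voice model provider '{provider}'"
        )),
        None => Err(format!("voice '{voice}' is not in provider/model form")),
    }
}
```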

Comment on lines +76 to +88
fn build_whisper_endpoint(base_url: &str, translate: bool) -> String {
let base = base_url.trim_end_matches('/');
let path = if translate {
"audio/translations"
} else {
"audio/transcriptions"
};

if base.contains("groq.com") {
format!("{}/openai/v1/{}", base, path)
} else {
format!("{}/v1/{}", base, path)
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

For Groq's OpenAI-compatible API, if the base URL is `https://api.groq.com/openai`, what is the full audio transcription endpoint path? Should client code append `/v1/audio/transcriptions` or `/openai/v1/audio/transcriptions`?

💡 Result:

Groq’s OpenAI-compatible audio transcription endpoint is:

  • Full URL: https://api.groq.com/openai/v1/audio/transcriptions [1][2]

So if your base URL is https://api.groq.com/openai, your client should append:

  • /v1/audio/transcriptions (resulting in https://api.groq.com/openai/v1/audio/transcriptions) [1][2]

Do not append /openai/v1/audio/transcriptions (that would duplicate the /openai segment). [1]

Sources:
[1] GroqDocs – OpenAI Compatibility (base_url uses /openai/v1)
[2] GroqDocs – Speech to Text (transcriptions endpoint shown as /openai/v1/audio/transcriptions)


Remove the duplicate /openai segment in the Groq endpoint builder.

Since src/config/providers.rs already sets Groq's base URL to https://api.groq.com/openai, appending /openai/v1/audio/transcriptions creates an invalid URL: https://api.groq.com/openai/openai/v1/audio/.... The correct endpoint requires only /v1/audio/transcriptions to be appended. Update lines 76-88 and the corresponding tests on lines 192-207 to remove the /openai from the path construction when handling Groq.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llm/transcription.rs` around lines 76 - 88, The build_whisper_endpoint
function is appending an extra "/openai" for Groq hosts, producing duplicate
segments; update build_whisper_endpoint to append only "/v1/{audio/...}" when
base contains "groq.com" (i.e., remove the "/openai" from the Groq branch) so
the path becomes "{base}/v1/{path}", and update the corresponding unit tests
that assert Groq URLs (tests referencing the Groq endpoint expectations) to
expect no duplicate "/openai" in the resulting URL.
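
With the duplicate segment removed, the builder collapses to a single branch (sketch; it assumes every configured base URL already carries its provider prefix, as the review says src/config/providers.rs does for Groq):

```rust
/// Builds the Whisper-compatible endpoint from a provider base URL.
fn build_whisper_endpoint(base_url: &str, translate: bool) -> String {
    let base = base_url.trim_end_matches('/');
    let path = if translate {
        "audio/translations"
    } else {
        "audio/transcriptions"
    };
    // No Groq special case: the configured base URL already ends in
    // /openai, so "{base}/v1/{path}" is correct for every provider.
    format!("{base}/v1/{path}")
}
```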
