Versic Audio Analysis Pipeline
Upload → Analysis → Search → RAG
Pipeline Flow (Step Functions State Machine)
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#7c3aed', 'primaryTextColor': '#e4e4e7', 'primaryBorderColor': '#7c3aed', 'lineColor': '#6366f1', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#18181b', 'fontSize': '13px'}}}%%
flowchart LR
subgraph trigger["Trigger"]
direction TB
Upload["User Upload\n(Track / Bounce / Stem)"]
SQS["SQS\nAnalysis Queue"]
Start["start-pipeline\nLambda"]
end
subgraph lambdas["CPU Lambdas"]
direction TB
Load["LoadUserSettings\nDynamoDB read"]
Set["SetAnalyzing\nstatus = analyzing"]
FF["FFprobe\nduration, codec, channels"]
ES["Essentia (WASM)\nBPM, musical key"]
end
subgraph modal["Modal GPU (T4)"]
direction TB
DM["Demucs\nStem separation\nvocals / drums / bass / other"]
WH["faster-whisper\nLyrics from vocals stem"]
CL["CLAP\nAudio embeddings"]
YN["YAMNet (ONNX)\nInstrument detection"]
OP["ffmpeg\nOpus compression"]
end
subgraph enrich["AI Enrichment"]
direction TB
BR["Bedrock LLM\ngenre, mood, energy,\ntags, description"]
LF["Lyrics Formatting\nLine & paragraph breaks"]
end
subgraph index["Indexing"]
direction TB
AV["AudioVectorIndex\nCLAP → OpenSearch"]
FN["Finalize\nstatus = ready"]
end
Upload --> SQS --> Start
Start --> Load --> Set --> FF --> ES
ES --> DM --> WH --> CL --> YN --> OP
OP --> BR --> LF
LF --> AV --> FN
style trigger fill:#1e1b4b,stroke:#6366f1,color:#e4e4e7
style lambdas fill:#0f2922,stroke:#22c55e,color:#e4e4e7
style modal fill:#2a1a0e,stroke:#f59e0b,color:#e4e4e7
style enrich fill:#1a0e2e,stroke:#a78bfa,color:#e4e4e7
style index fill:#0e1a2e,stroke:#3b82f6,color:#e4e4e7
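The stage order in the diagram can be sketched as a simple sequential runner. This is only an illustration of the ordering and the two status transitions shown above; the real pipeline is a Step Functions state machine, and the `stage` helper, the `completed` list, and the initial `"uploaded"` status are assumptions made for the example.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def stage(name: str, **updates: object) -> Callable[[State], State]:
    """Build a placeholder stage that records its name and applies updates."""
    def run(state: State) -> State:
        done = list(state.get("completed", [])) + [name]
        return {**state, **updates, "completed": done}
    return run


# Order follows the flowchart edges; the stage bodies are stand-ins.
PIPELINE: List[Callable[[State], State]] = [
    stage("LoadUserSettings"),
    stage("SetAnalyzing", status="analyzing"),
    stage("FFprobe"),
    stage("Essentia"),
    stage("Demucs"),
    stage("faster-whisper"),
    stage("CLAP"),
    stage("YAMNet"),
    stage("Opus"),
    stage("BedrockEnrichment"),
    stage("LyricsFormatting"),
    stage("AudioVectorIndex"),
    stage("Finalize", status="ready"),
]


def run_pipeline(track_id: str) -> State:
    # "uploaded" is a hypothetical starting status, not taken from the source.
    state: State = {"trackId": track_id, "status": "uploaded", "completed": []}
    for run in PIPELINE:
        state = run(state)
    return state
```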
Search Indexing & RAG Flow
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#7c3aed', 'primaryTextColor': '#e4e4e7', 'primaryBorderColor': '#7c3aed', 'lineColor': '#6366f1', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#18181b', 'fontSize': '13px'}}}%%
flowchart TB
subgraph writes["Pipeline DynamoDB Writes"]
DDB["DynamoDB\nStorageTable"]
end
subgraph sync["Search Sync (DynamoDB Streams)"]
Stream["DynamoDB\nStream"]
SyncLambda["Sync Lambda\nGenerates .md search docs"]
S3Search["S3 Search Bucket\ntracks/xxx.md"]
end
subgraph bedrock_kb["Bedrock Knowledge Base"]
IdxQueue["SQS\nIndex Queue"]
IdxWorker["Index Worker\nLambda"]
KB["Bedrock KB\nTitan Embed v2\nS3 Vectors storage"]
end
subgraph aoss_idx["Audio Semantic Index"]
AOSS["OpenSearch Serverless\nCLAP audio vectors"]
end
subgraph search["User Search"]
Query["User types\nsearch query"]
Phase0["Phase 0: DynamoDB\ntext matching"]
Phase2["Phase 2: Bedrock KB\nsemantic retrieval"]
AudioSem["Audio Semantic\nCLAP vector search"]
RAG["RAG Answer\nBedrock RetrieveAndGenerate"]
Results["Search Results\n+ AI Answer"]
end
DDB --> Stream --> SyncLambda --> S3Search
SyncLambda --> IdxQueue --> IdxWorker --> KB
Query --> Phase0
Query --> Phase2
Query --> AudioSem
Query --> RAG
Phase2 -.->|retrieves from| KB
AudioSem -.->|queries| AOSS
RAG -.->|retrieves + generates from| KB
Phase0 --> Results
Phase2 --> Results
AudioSem --> Results
RAG --> Results
style writes fill:#1a0e2e,stroke:#a78bfa,color:#e4e4e7
style sync fill:#0f2922,stroke:#22c55e,color:#e4e4e7
style bedrock_kb fill:#2a1a0e,stroke:#f59e0b,color:#e4e4e7
style aoss_idx fill:#0e1a2e,stroke:#3b82f6,color:#e4e4e7
style search fill:#1e1b4b,stroke:#6366f1,color:#e4e4e7
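The diagram shows one query fanning out to four backends whose outputs all feed the results. A minimal sketch of that fan-out follows, assuming the backends run concurrently and that a failing backend is skipped rather than failing the whole search; neither behaviour is confirmed by the diagram. All function names and return shapes are hypothetical stubs.

```python
import asyncio
from typing import Any, Dict, List


async def phase0_text_match(query: str) -> List[Dict[str, str]]:
    return [{"id": "track-1", "source": "phase0"}]  # stub for DynamoDB text match


async def phase2_kb_retrieve(query: str) -> List[Dict[str, str]]:
    return [{"id": "track-2", "source": "phase2"}]  # stub for Bedrock KB retrieval


async def audio_semantic(query: str) -> List[Dict[str, str]]:
    raise RuntimeError("backend unavailable")  # simulate one backend failing


async def rag_answer(query: str) -> str:
    return f"Stub answer for: {query}"  # stub for RetrieveAndGenerate


async def search(query: str) -> Dict[str, Any]:
    hit_sources = [phase0_text_match, phase2_kb_retrieve, audio_semantic]
    # return_exceptions=True turns a backend failure into a value we can skip.
    outcomes = await asyncio.gather(
        *(fn(query) for fn in hit_sources),
        rag_answer(query),
        return_exceptions=True,
    )
    *hit_lists, answer = outcomes
    hits: List[Dict[str, str]] = []
    for hit_list in hit_lists:
        if isinstance(hit_list, Exception):
            continue
        hits.extend(hit_list)
    return {
        "hits": hits,
        "answer": None if isinstance(answer, Exception) else answer,
    }
```

Calling `asyncio.run(search("dark heavy beats"))` returns the hits from the two working stubs plus the stub answer.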
AWS Resources Map
%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#7c3aed', 'primaryTextColor': '#e4e4e7', 'primaryBorderColor': '#7c3aed', 'lineColor': '#6366f1', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#18181b', 'fontSize': '13px'}}}%%
flowchart TB
subgraph compute["Compute"]
SFN["Step Functions\nAudioAnalysisPipeline"]
L1["Lambda: LoadUserSettings"]
L2["Lambda: SetAnalyzing"]
L3["Lambda: FFprobe"]
L4["Lambda: Essentia"]
L5["Lambda: InvokeModalGpu"]
L6["Lambda: Enrichment"]
L7["Lambda: AudioVectorIndex"]
L8["Lambda: Finalize"]
L9["Lambda: SyncLambda"]
L10["Lambda: IndexWorker"]
end
subgraph external["External (Non-AWS)"]
Modal["Modal.com\nT4 GPU\nversic-gpu-worker"]
end
subgraph storage["Storage"]
S3A["S3: StorageBucket\nAudio files + stems"]
S3B["S3: SearchBucket\n.md search documents"]
DDB2["DynamoDB: StorageTable\nAll entity metadata"]
end
subgraph ai["AI Services"]
BRT["Bedrock Runtime\nNova Lite / Claude\nEnrichment + Lyrics fmt"]
BKB["Bedrock Knowledge Base\nTitan Embed v2\nRAG retrieval"]
end
subgraph search_infra["Search Infrastructure"]
AOSS2["OpenSearch Serverless\nCLAP audio vectors"]
S3V["S3 Vectors\nKB vector storage"]
end
subgraph queues["Queues"]
Q1["SQS: AudioAnalysisQueue"]
Q2["SQS: SearchIndexQueue"]
Q3["SQS: SearchCascadeQueue"]
end
SFN --> L1 & L2 & L3 & L4 & L5 & L6 & L7 & L8
L5 -->|HTTPS| Modal
Modal -->|results| L5
L3 & L4 & Modal -->|read/write| S3A
L6 -->|invoke| BRT
L7 -->|upsert| AOSS2
L9 -->|write| S3B
L10 -->|ingest| BKB
BKB -.-> S3V
style compute fill:#0f2922,stroke:#22c55e,color:#e4e4e7
style external fill:#2a1a0e,stroke:#f59e0b,color:#e4e4e7
style storage fill:#1a0e2e,stroke:#a78bfa,color:#e4e4e7
style ai fill:#1e1b4b,stroke:#6366f1,color:#e4e4e7
style search_infra fill:#0e1a2e,stroke:#3b82f6,color:#e4e4e7
style queues fill:#1e0e1a,stroke:#f472b6,color:#e4e4e7
CPU Lambdas: Node.js 20, serverless
Modal GPU: T4, pay-per-second, ~$0.015/track
AI Services: Bedrock (LLM + KB)
Search: OpenSearch Serverless + S3 Vectors
Queues: SQS (analysis, index, cascade)
Cost Per Track (~4 minute song)
| Component | Resource | Cost |
| --- | --- | --- |
| Lambdas (8 invocations) | AWS Lambda | ~$0.001 |
| Modal GPU (~90 s: Demucs + Whisper + CLAP + YAMNet) | Modal T4 | ~$0.015 |
| Enrichment (2 LLM calls: metadata + lyrics formatting) | Bedrock Runtime | ~$0.002 |
| Search indexing (KB ingest + AOSS upsert) | Bedrock KB + AOSS | ~$0.001 |
| Storage (Opus stems, ~6 MB, + search doc) | S3 | ~$0.0001/mo |
| Total per track | | ~$0.02 |
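The total can be checked by summing the one-time rows of the table. The sketch below uses only the figures listed above; the dictionary keys are illustrative names, and the implied GPU rate is derived from the table's ~90 s and ~$0.015 rather than from any published price.

```python
# One-time processing costs per track, in USD, copied from the table.
ONE_TIME_COSTS_USD = {
    "lambdas": 0.001,
    "modal_gpu": 0.015,
    "enrichment": 0.002,
    "search_indexing": 0.001,
}

# Storage recurs monthly, so it is kept separate from the one-time total.
STORAGE_USD_PER_MONTH = 0.0001

GPU_SECONDS_PER_TRACK = 90

one_time_total = sum(ONE_TIME_COSTS_USD.values())

# GPU rate implied by the table's own numbers.
gpu_usd_per_second = ONE_TIME_COSTS_USD["modal_gpu"] / GPU_SECONDS_PER_TRACK
gpu_usd_per_hour = gpu_usd_per_second * 3600

print(f"one-time total: ${one_time_total:.3f}")
print(f"first month incl. storage: ${one_time_total + STORAGE_USD_PER_MONTH:.4f}")
print(f"implied GPU rate: ${gpu_usd_per_hour:.2f}/hour")
```

The one-time rows sum to about $0.019, which rounds to the ~$0.02 shown in the table.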
What is CLAP?
CLAP = Contrastive Language-Audio Pretraining (model: laion/clap-htsat-fused)
CLAP is the audio counterpart of CLIP, which does the same thing for images. It was trained on a large corpus of audio-text pairs to learn a shared embedding space in which audio clips and text descriptions can be compared directly.
How it works: CLAP has two encoders — an audio encoder and a text encoder. Both produce 512-dimensional vectors in the same space. If you encode the audio of a guitar riff and the text "acoustic guitar strumming", they'll be close together in vector space.
In our pipeline (indexing): During analysis, CLAP encodes 20-second audio windows into embeddings. These are stored in OpenSearch Serverless (AOSS) as vectors with metadata (entityId, userId, timestamps).
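As a rough sketch of the indexing step, the function below splits a track into 20-second windows and builds one vector document per window. It assumes non-overlapping windows with a shorter final window, which the text does not specify. The `startSec`, `endSec`, and `embedding` field names and the `embed_window` callable are hypothetical; only `entityId` and `userId` come from the description above.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def window_bounds(
    duration_s: float, window_s: float = 20.0, hop_s: float = 20.0
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs covering the track; the last may be shorter."""
    if duration_s <= 0 or window_s <= 0 or hop_s <= 0:
        raise ValueError("duration, window and hop must be positive")
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + window_s, duration_s)))
        start += hop_s
    return bounds


def build_vector_docs(
    entity_id: str,
    user_id: str,
    duration_s: float,
    embed_window: Callable[[float, float], Sequence[float]],
) -> List[Dict[str, object]]:
    """Pair each window's embedding with the metadata stored alongside it."""
    return [
        {
            "entityId": entity_id,
            "userId": user_id,
            "startSec": start,
            "endSec": end,
            "embedding": list(embed_window(start, end)),
        }
        for start, end in window_bounds(duration_s)
    ]
```

Under these assumptions, a 4-minute (240 s) track yields 12 windows and therefore 12 vectors.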
In our pipeline (retrieval): When a user searches for something like "find something that sounds like dark heavy beats", CLAP's text encoder turns the query into a vector, and we run a cosine-similarity search against the audio vectors stored in AOSS. This finds audio that sounds like the description even when no keywords match.
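The retrieval step boils down to ranking stored vectors by cosine similarity to the query vector. The sketch below does this by brute force in plain Python with toy 3-dimensional vectors; in the pipeline the search runs inside AOSS over 512-dimensional CLAP vectors. Filtering by `userId` and the `top_k` helper are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    docs: List[Dict],
    user_id: str,
    k: int = 3,
) -> List[Tuple[float, Dict]]:
    """Score one user's documents against the query and keep the best k."""
    scored = [
        (cosine_similarity(query_vec, doc["embedding"]), doc)
        for doc in docs
        if doc["userId"] == user_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: vectors are stand-ins for CLAP audio embeddings.
DOCS = [
    {"entityId": "a", "userId": "u1", "embedding": [1.0, 0.0, 0.0]},
    {"entityId": "b", "userId": "u1", "embedding": [0.6, 0.8, 0.0]},
    {"entityId": "c", "userId": "u2", "embedding": [1.0, 0.0, 0.0]},
]

# Stand-in for the CLAP text embedding of the user's query.
QUERY_VEC = [0.9, 0.1, 0.0]
ranked = top_k(QUERY_VEC, DOCS, user_id="u1")
```

Here `ranked` lists document `a` ahead of `b`, and `c` is excluded because it belongs to another user.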
Current limitation: The CLAP text encoder requires PyTorch, so it can't run in the Next.js dev server; this is what produces the "CLAP text-query runtime unavailable" error. It works in a deployed Lambda environment. In local dev, audio search falls back to descriptor-based search.
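The fallback behaviour can be sketched as a try/except around the text encoder. Everything here is hypothetical: the exception class, the function names, and in particular the keyword-overlap `descriptor_search`, which is only a guess at what a descriptor-based search might look like.

```python
from typing import Callable, Dict, List, Sequence


class ClapRuntimeUnavailable(RuntimeError):
    """Raised when the CLAP text encoder cannot be loaded (hypothetical)."""


def unavailable_encoder(query: str) -> Sequence[float]:
    # Mimics the local dev server, where PyTorch is not available.
    raise ClapRuntimeUnavailable("CLAP text-query runtime unavailable")


def descriptor_search(query: str, tracks: List[Dict]) -> List[str]:
    """Stand-in fallback: match query words against stored text descriptors."""
    words = set(query.lower().split())
    return [t["id"] for t in tracks if words & set(t["descriptors"])]


def audio_semantic_search(
    query: str,
    tracks: List[Dict],
    encode_text: Callable[[str], Sequence[float]],
    vector_search: Callable[[Sequence[float]], List[str]],
) -> Dict[str, object]:
    try:
        query_vec = encode_text(query)
    except ClapRuntimeUnavailable:
        # No text encoder: degrade to descriptor matching.
        return {"mode": "descriptor", "hits": descriptor_search(query, tracks)}
    return {"mode": "clap", "hits": vector_search(query_vec)}


TRACKS = [
    {"id": "track-1", "descriptors": ["dark", "heavy", "trap"]},
    {"id": "track-2", "descriptors": ["bright", "acoustic"]},
]

local = audio_semantic_search(
    "dark heavy beats", TRACKS, unavailable_encoder, lambda vec: []
)
deployed = audio_semantic_search(
    "dark heavy beats", TRACKS, lambda q: [0.1, 0.2], lambda vec: ["track-1"]
)
```

With the failing encoder, `local` reports the `descriptor` mode; with a working encoder, `deployed` reports the `clap` mode.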