Versic Audio Analysis Pipeline

Upload → Analysis → Search → RAG

Pipeline Flow (Step Functions State Machine)

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#7c3aed', 'primaryTextColor': '#e4e4e7', 'primaryBorderColor': '#7c3aed', 'lineColor': '#6366f1', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#18181b', 'fontSize': '13px'}}}%%
flowchart LR
    subgraph trigger["Trigger"]
        direction TB
        Upload["User Upload\n(Track / Bounce / Stem)"]
        SQS["SQS\nAnalysis Queue"]
        Start["start-pipeline\nLambda"]
    end
    subgraph lambdas["CPU Lambdas"]
        direction TB
        Load["LoadUserSettings\nDynamoDB read"]
        Set["SetAnalyzing\nstatus = analyzing"]
        FF["FFprobe\nduration, codec, channels"]
        ES["Essentia (WASM)\nBPM, musical key"]
    end
    subgraph modal["Modal GPU (T4)"]
        direction TB
        DM["Demucs\nStem separation\nvocals / drums / bass / other"]
        WH["faster-whisper\nLyrics from vocals stem"]
        CL["CLAP\nAudio embeddings"]
        YN["YAMNet (ONNX)\nInstrument detection"]
        OP["ffmpeg\nOpus compression"]
    end
    subgraph enrich["AI Enrichment"]
        direction TB
        BR["Bedrock LLM\ngenre, mood, energy,\ntags, description"]
        LF["Lyrics Formatting\nLine & paragraph breaks"]
    end
    subgraph index["Indexing"]
        direction TB
        AV["AudioVectorIndex\nCLAP → OpenSearch"]
        FN["Finalize\nstatus = ready"]
    end
    Upload --> SQS --> Start
    Start --> Load --> Set --> FF --> ES
    ES --> DM --> WH --> CL --> YN --> OP
    OP --> BR --> LF
    LF --> AV --> FN
    style trigger fill:#1e1b4b,stroke:#6366f1,color:#e4e4e7
    style lambdas fill:#0f2922,stroke:#22c55e,color:#e4e4e7
    style modal fill:#2a1a0e,stroke:#f59e0b,color:#e4e4e7
    style enrich fill:#1a0e2e,stroke:#a78bfa,color:#e4e4e7
    style index fill:#0e1a2e,stroke:#3b82f6,color:#e4e4e7

Search Indexing & RAG Flow

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#7c3aed', 'primaryTextColor': '#e4e4e7', 'primaryBorderColor': '#7c3aed', 'lineColor': '#6366f1', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#18181b', 'fontSize': '13px'}}}%%
flowchart TB
    subgraph writes["Pipeline DynamoDB Writes"]
        DDB["DynamoDB\nStorageTable"]
    end
    subgraph sync["Search Sync (DynamoDB Streams)"]
        Stream["DynamoDB\nStream"]
        SyncLambda["Sync Lambda\nGenerates .md search docs"]
        S3Search["S3 Search Bucket\ntracks/xxx.md"]
    end
    subgraph bedrock_kb["Bedrock Knowledge Base"]
        IdxQueue["SQS\nIndex Queue"]
        IdxWorker["Index Worker\nLambda"]
        KB["Bedrock KB\nTitan Embed v2\nS3 Vectors storage"]
    end
    subgraph aoss_idx["Audio Semantic Index"]
        AOSS["OpenSearch Serverless\nCLAP audio vectors"]
    end
    subgraph search["User Search"]
        Query["User types\nsearch query"]
        Phase0["Phase 0: DynamoDB\ntext matching"]
        Phase2["Phase 2: Bedrock KB\nsemantic retrieval"]
        AudioSem["Audio Semantic\nCLAP vector search"]
        RAG["RAG Answer\nBedrock RetrieveAndGenerate"]
        Results["Search Results\n+ AI Answer"]
    end
    DDB --> Stream --> SyncLambda --> S3Search
    SyncLambda --> IdxQueue --> IdxWorker --> KB
    Query --> Phase0
    Query --> Phase2
    Query --> AudioSem
    Query --> RAG
    Phase2 -.->|retrieves from| KB
    AudioSem -.->|queries| AOSS
    RAG -.->|retrieves + generates from| KB
    Phase0 --> Results
    Phase2 --> Results
    AudioSem --> Results
    RAG --> Results
    style writes fill:#1a0e2e,stroke:#a78bfa,color:#e4e4e7
    style sync fill:#0f2922,stroke:#22c55e,color:#e4e4e7
    style bedrock_kb fill:#2a1a0e,stroke:#f59e0b,color:#e4e4e7
    style aoss_idx fill:#0e1a2e,stroke:#3b82f6,color:#e4e4e7
    style search fill:#1e1b4b,stroke:#6366f1,color:#e4e4e7

AWS Resources Map

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#7c3aed', 'primaryTextColor': '#e4e4e7', 'primaryBorderColor': '#7c3aed', 'lineColor': '#6366f1', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#18181b', 'fontSize': '13px'}}}%%
flowchart TB
    subgraph compute["Compute"]
        SFN["Step Functions\nAudioAnalysisPipeline"]
        L1["Lambda: LoadUserSettings"]
        L2["Lambda: SetAnalyzing"]
        L3["Lambda: FFprobe"]
        L4["Lambda: Essentia"]
        L5["Lambda: InvokeModalGpu"]
        L6["Lambda: Enrichment"]
        L7["Lambda: AudioVectorIndex"]
        L8["Lambda: Finalize"]
        L9["Lambda: SyncLambda"]
        L10["Lambda: IndexWorker"]
    end
    subgraph external["External (Non-AWS)"]
        Modal["Modal.com\nT4 GPU\nversic-gpu-worker"]
    end
    subgraph storage["Storage"]
        S3A["S3: StorageBucket\nAudio files + stems"]
        S3B["S3: SearchBucket\n.md search documents"]
        DDB2["DynamoDB: StorageTable\nAll entity metadata"]
    end
    subgraph ai["AI Services"]
        BRT["Bedrock Runtime\nNova Lite / Claude\nEnrichment + Lyrics fmt"]
        BKB["Bedrock Knowledge Base\nTitan Embed v2\nRAG retrieval"]
    end
    subgraph search_infra["Search Infrastructure"]
        AOSS2["OpenSearch Serverless\nCLAP audio vectors"]
        S3V["S3 Vectors\nKB vector storage"]
    end
    subgraph queues["Queues"]
        Q1["SQS: AudioAnalysisQueue"]
        Q2["SQS: SearchIndexQueue"]
        Q3["SQS: SearchCascadeQueue"]
    end
    SFN --> L1 & L2 & L3 & L4 & L5 & L6 & L7 & L8
    L5 -->|HTTPS| Modal
    Modal -->|results| L5
    L3 & L4 & Modal -->|read/write| S3A
    L6 -->|invoke| BRT
    L7 -->|upsert| AOSS2
    L9 -->|write| S3B
    L10 -->|ingest| BKB
    BKB -.-> S3V
    style compute fill:#0f2922,stroke:#22c55e,color:#e4e4e7
    style external fill:#2a1a0e,stroke:#f59e0b,color:#e4e4e7
    style storage fill:#1a0e2e,stroke:#a78bfa,color:#e4e4e7
    style ai fill:#1e1b4b,stroke:#6366f1,color:#e4e4e7
    style search_infra fill:#0e1a2e,stroke:#3b82f6,color:#e4e4e7
    style queues fill:#1e0e1a,stroke:#f472b6,color:#e4e4e7
CPU Lambdas: Node.js 20, serverless
Modal GPU: T4, pay-per-second, ~$0.015/track
Storage: S3 + DynamoDB
AI Services: Bedrock (LLM + KB)
Search: OpenSearch Serverless + S3 Vectors
Queues: SQS (analysis, index, cascade)

Cost Per Track (~4 minute song)

| Component | Resource | Cost |
|---|---|---|
| Lambdas (8 invocations) | AWS Lambda | ~$0.001 |
| Modal GPU (~90s: Demucs + Whisper + CLAP + YAMNet) | Modal T4 | ~$0.015 |
| Enrichment (2 LLM calls: metadata + lyrics formatting) | Bedrock Runtime | ~$0.002 |
| Search indexing (KB ingest + AOSS upsert) | Bedrock KB + AOSS | ~$0.001 |
| Storage (opus stems ~6MB + search doc) | S3 | ~$0.0001/mo |
| Total per track | | ~$0.02 |
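
As a quick check on the total, the per-component figures above can be summed directly. The sketch below uses only the numbers from the table; the dictionary keys are illustrative names, not identifiers from the codebase. Note that the storage figure is a monthly charge, so adding it to the one-time processing costs is an approximation.

```python
# Per-track cost figures (USD) copied from the cost table.
# Key names are illustrative only.
COSTS_USD = {
    "lambdas": 0.001,            # 8 Lambda invocations
    "modal_gpu": 0.015,          # ~90 s on a Modal T4
    "enrichment": 0.002,         # 2 Bedrock LLM calls
    "search_indexing": 0.001,    # KB ingest + AOSS upsert
    "storage_per_month": 0.0001, # recurring, unlike the rows above
}


def total_per_track(costs: dict[str, float]) -> float:
    """Sum all components, treating one month of storage as part of the total."""
    return sum(costs.values())


total = total_per_track(COSTS_USD)
gpu_share = COSTS_USD["modal_gpu"] / total

print(f"total: ~${total:.4f} (rounds to ~${total:.2f})")
print(f"GPU share of total: {gpu_share:.0%}")
```

The unrounded sum is about $0.0191, which matches the ~$0.02 total in the table, and the GPU step accounts for most of it.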

What is CLAP?

CLAP = Contrastive Language-Audio Pretraining (model: laion/clap-htsat-fused)

CLAP is the audio counterpart of CLIP, which does the same thing for images. It was trained on a large corpus of audio-text pairs to learn a shared embedding space in which audio clips and text descriptions can be compared directly.

How it works: CLAP has two encoders, an audio encoder and a text encoder. Both produce 512-dimensional vectors in the same space. If you encode the audio of a guitar riff and the text "acoustic guitar strumming", the two vectors will land close together.

In our pipeline (indexing): During analysis, CLAP encodes 20-second audio windows into embeddings. These are stored in OpenSearch Serverless (AOSS) as vectors with metadata (entityId, userId, timestamps).
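
The windowing step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes back-to-back, non-overlapping 20-second windows with a shorter final window, and the record fields other than `entityId` and `userId` (`startTime`, `endTime`, `embedding`) are hypothetical names. The embedding itself is passed in, since producing it requires the CLAP audio encoder.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 20.0  # window length stated in the text


@dataclass(frozen=True)
class Window:
    start: float  # seconds from the beginning of the track
    end: float


def split_into_windows(duration: float, window: float = WINDOW_SECONDS) -> list[Window]:
    """Split a track into consecutive windows.

    Assumption: windows do not overlap and the last one may be shorter
    than `window`. The real pipeline may use overlap or padding instead.
    """
    if duration <= 0 or window <= 0:
        return []
    windows: list[Window] = []
    start = 0.0
    while start < duration:
        windows.append(Window(start, min(start + window, duration)))
        start += window
    return windows


def to_index_record(entity_id: str, user_id: str, w: Window, embedding: list[float]) -> dict:
    """Build one vector record; field names besides entityId/userId are hypothetical."""
    return {
        "entityId": entity_id,
        "userId": user_id,
        "startTime": w.start,
        "endTime": w.end,
        "embedding": embedding,
    }


windows = split_into_windows(240.0)  # a ~4 minute track
print(len(windows), windows[0], windows[-1])
```

Under these assumptions, a 4-minute track yields 12 windows and therefore 12 vectors in the index.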

In our pipeline (retrieval): When a user searches "find something that sounds like dark heavy beats", the text query is encoded with CLAP's text encoder into a vector, and we run a cosine-similarity search against the audio vectors stored in AOSS. This finds audio that sounds like the description even when no keywords match.
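
To make the ranking step concrete, here is a small brute-force sketch of cosine-similarity search. It is only an illustration of the scoring: AOSS performs its own k-NN vector search rather than a linear scan, the toy vectors are 3-dimensional instead of 512, and the function and ID names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: list[tuple[str, list[float]]], k: int = 3) -> list[tuple[str, float]]:
    """Rank stored (id, vector) pairs by similarity to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins: in the pipeline the query vector would come from the CLAP
# text encoder and the stored vectors from the CLAP audio encoder.
query_vec = [1.0, 0.0, 0.0]
audio_vecs = [
    ("window-a", [0.9, 0.1, 0.0]),   # points nearly the same way as the query
    ("window-b", [0.0, 1.0, 0.0]),   # orthogonal to the query
    ("window-c", [-1.0, 0.0, 0.0]),  # opposite direction
]

for doc_id, score in top_k(query_vec, audio_vecs):
    print(doc_id, round(score, 3))
```

Because both encoders write into the same space, the same scoring function works whether the query vector came from text or from audio.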

Current limitation: The CLAP text encoder requires PyTorch, so it cannot run inside the Next.js dev server; this is the cause of the "CLAP text-query runtime unavailable" error. It works in a deployed Lambda environment. In local dev, audio semantic search falls back to descriptor-based search.