I Built a Voice-Controlled Digital Twin of Myself (And You Can Talk to It!)

Hey awesome reader, I'm Sobit.
I'm a frontend developer who likes to write sometimes and is always hunting for better solutions to our day-to-day problems.
How It All Started
So here's the thing - I got tired of the same old portfolio websites. You know the ones: scroll, read, click, repeat. Boring.
I wanted something different. Something that would make people go "wait, what?" when they landed on my site. That's when I had this wild idea: what if people could just... talk to my portfolio? Like, actually have a conversation with it?
Fast forward a few weeks of research, and boom - I built a voice-interactive digital twin. Visitors can now chat with an AI version of me that knows everything about my projects, experience, and skills. No typing, no clicking buttons to record - just pure conversation.
In this article, I'll show you exactly how I built this thing, from the voice detection magic to the trippy visual effects that react to your voice. Spoiler alert: it involves machine learning, AI APIs, and some shader wizardry that would make your GPU sweat.
What you're about to learn:
How Voice Activity Detection works (and why it's cooler than it sounds)
Building a complete voice conversation pipeline without losing your sanity
Making WebGL shaders dance to your voice
Real problems I ran into and how I fixed them
Why my API bills went up (worth it though)
Wait, What's Voice Activity Detection?
Before we dive in, let me explain VAD (Voice Activity Detection) because it's the secret sauce that makes everything work.
Imagine you're on a Zoom call and the app knows exactly when you're talking. It's not just listening for "loud noises" - it's actually smart enough to tell the difference between you saying "hello" and your dog barking in the background. That's VAD in action.
VAD uses machine learning models trained on tons of audio data to detect actual human speech vs everything else:
Your mechanical keyboard clicking away
The ambulance siren outside
That Spotify playlist you forgot to pause
Your roommate yelling at their video game
Why This Matters
Here's the problem with traditional voice interfaces: they're annoying. You have to:
Click a button to start recording
Say your thing
Click another button to stop
Hope you didn't forget to click
It's like having to knock on someone's door, wait for them to answer, then knock again when you're done talking. Weird, right?
With VAD, the app just... knows. It hears you start talking, listens to what you say, and processes it when you're done. Hands-free. Magical. Just like talking to Alexa, but on a website.
The Tech Stack (AKA What I Glued Together)
Alright, let me show you the pipeline I built. It's basically a Rube Goldberg machine for voice, but it works beautifully:

The Tools I Used (And Why)
| What | Why I Picked It | The Real Reason |
| --- | --- | --- |
| @ricky0123/vad-web | Runs VAD in the browser | No backend needed = less infrastructure to babysit |
| ONNX Runtime | Runs ML models in browsers | Makes the VAD work its magic |
| Google Gemini | Speech-to-text + AI chat | One API for two jobs = I'm lazy (efficiently) |
| ElevenLabs | Voice synthesis | Makes my digital twin sound less robotic |
| Three.js | Fancy graphics | Because static backgrounds are so 2010 |
| Next.js 15 | Backend stuff | Server functions are chef's kiss |
| React 19 | UI framework | Hooks make everything easier |
Building The Thing (The Fun Part)
Step 1: Getting VAD to Work
Setting up VAD was easier than I expected. Here's the basic setup:
import { useMicVAD } from '@ricky0123/vad-react';

const vad = useMicVAD({
  startOnLoad: false, // Don't be creepy, wait for permission

  // These CDN links load the ML models
  baseAssetPath: "https://cdn.jsdelivr.net/npm/@ricky0123/vad-web@0.0.27/dist/",
  onnxWASMBasePath: "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.22.0/dist/",

  // Fine-tuning the detection
  positiveSpeechThreshold: 0.5, // "Yep, that's definitely speech"
  negativeSpeechThreshold: 0.35, // "Nope, not speech anymore"

  // The cool part - event callbacks
  onSpeechStart: () => {
    setConversationState('listening');
    // Time to pay attention!
  },
  onSpeechEnd: (audio: Float32Array) => {
    setConversationState('thinking');
    // Got your audio, now let me process it
    const audioBlob = float32ArrayToWavBlob(audio, 16000);
    processSpeechAudio(audioBlob, 'vad');
  },
  onVADMisfire: () => {
    setConversationState('idle');
    // False alarm, my bad
  },
});
The threshold thing: the speech probability has to climb above 0.5 before I treat speech as started, and it has to drop below 0.35 before I treat it as stopped. That gap prevents the annoying flicker between "is talking" and "not talking" when you pause mid-sentence. Without it, every breath would trigger a stop. Trust me, I learned this the hard way.
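If you want to picture what that gap does, here's a tiny sketch of the hysteresis idea (simplified on purpose; this is not @ricky0123/vad-web's actual implementation):

// Simplified hysteresis sketch - not the library's real code.
let speaking = false;

function onSpeechProbability(probability: number) {
  if (!speaking && probability > 0.5) {
    speaking = true;   // confident enough to call it speech
  } else if (speaking && probability < 0.35) {
    speaking = false;  // has to fall well below 0.5 before we call it silence
  }
  // Anything between 0.35 and 0.5 keeps the current state,
  // so a breath or a short pause doesn't flip us back and forth.
}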
Step 2: Making Visual Feedback That Actually Responds
One of the coolest parts was making the UI react to your voice in real-time. I built a custom hook that calculates how loud you're talking:
import { useState, useEffect } from 'react';
import { useMicVAD } from '@ricky0123/vad-react';

export function useVadEnergy() {
  const [vadEnergy, setVadEnergy] = useState(0);
  const [smoothedEnergy, setSmoothedEnergy] = useState(0);

  const vad = useMicVAD({
    startOnLoad: false,
    onSpeechEnd: (audio: Float32Array) => {
      // Math time! Calculate how loud the audio is
      let sum = 0;
      for (let i = 0; i < audio.length; i++) {
        sum += audio[i] * audio[i];
      }
      const rms = Math.sqrt(sum / audio.length);
      setVadEnergy(rms);
    },
    onVADMisfire: () => setVadEnergy(0),
  });

  // Smooth out the values so animations don't look jerky
  useEffect(() => {
    const interval = setInterval(() => {
      setSmoothedEnergy(prev => prev + (vadEnergy - prev) * 0.2);
    }, 50); // Update 20 times per second
    return () => clearInterval(interval);
  }, [vadEnergy]);

  return { vadEnergy: smoothedEnergy, vad };
}
This calculates something called RMS (Root Mean Square) energy, which is a fancy way of saying "how loud is this audio?" The smoothing part makes sure the animations don't spaz out when your voice volume changes suddenly.
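For the formula-curious: RMS = sqrt((x₁² + x₂² + ... + xₙ²) / n). So for three samples [0.1, -0.3, 0.2] you'd get sqrt((0.01 + 0.09 + 0.04) / 3) ≈ 0.22.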
Step 3: Audio Format Hell (Fun Times)
Okay, this part wasn't actually fun. It was terrible. But I figured it out!
The VAD library gives you audio as a Float32Array (basically an array of decimal numbers), but Google Gemini wants a WAV file. So I had to write a converter:
function float32ArrayToWavBlob(float32Array: Float32Array, sampleRate: number): Blob {
  // Create a buffer for the WAV file (44 bytes for header + audio data)
  const buffer = new ArrayBuffer(44 + float32Array.length * 2);
  const view = new DataView(buffer);

  // Helper to write strings into the buffer
  const writeString = (offset: number, string: string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };

  // WAV header stuff (this is standard WAV format)
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + float32Array.length * 2, true); // File size minus first 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // PCM format
  view.setUint16(22, 1, true); // Mono
  view.setUint32(24, sampleRate, true); // Sample rate
  view.setUint32(28, sampleRate * 2, true); // Byte rate (mono, 16-bit)
  view.setUint16(32, 2, true); // Block align
  view.setUint16(34, 16, true); // 16-bit
  writeString(36, 'data');
  view.setUint32(40, float32Array.length * 2, true); // Audio data size

  // Convert the float samples to 16-bit integers
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }

  return new Blob([buffer], { type: 'audio/wav' });
}
Yeah, I know. It looks painful. Basically, I'm manually creating a WAV file header and converting floating-point audio to 16-bit integers. Took me way too long to get this right, but now it works perfectly.
Step 4: The Three-Stage Conversation Pipeline
Once I have the audio, it goes through three APIs:
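Quick aside before we walk through the stages: this is roughly what the processSpeechAudio helper from Step 1 does on the client. Treat it as a sketch - the endpoint paths, the sessionId, and the playBase64Audio helper are illustrative names, not necessarily what's in my repo:

// Sketch of the client-side glue between the three stages (names are illustrative).
async function processSpeechAudio(audioBlob: Blob, source: string) {
  // Stage 1: speech -> text
  const formData = new FormData();
  formData.append('audio', audioBlob, 'speech.wav');
  const { transcript } = await fetch('/api/transcribe', { method: 'POST', body: formData })
    .then(res => res.json());

  // Stage 2: text -> AI response (sessionId keeps conversation context, see Problem 3)
  const { response } = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: transcript, sessionId }),
  }).then(res => res.json());

  // Stage 3: response -> speech
  const { audio, mimeType } = await fetch('/api/speak', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: response }),
  }).then(res => res.json());

  setConversationState('speaking');
  await playBase64Audio(audio, mimeType); // decode the base64 clip and play it
  setConversationState('idle');
}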
Stage 1: Speech → Text
import { NextResponse } from 'next/server';
import { GoogleGenerativeAI } from '@google/generative-ai';
import OpenAI from 'openai';

export async function POST(req: Request) {
  const formData = await req.formData();
  const audioFile = formData.get('audio') as File;

  try {
    // Try Gemini first (it's faster)
    const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
    const model = genAI.getGenerativeModel({ model: "gemini-2.5-flash" });
    const audioData = await audioFile.arrayBuffer();
    const base64Audio = Buffer.from(audioData).toString('base64');

    const result = await model.generateContent([
      {
        inlineData: {
          mimeType: 'audio/wav',
          data: base64Audio
        }
      },
      'Transcribe this audio to text. Return only the transcription.'
    ]);

    return NextResponse.json({ transcript: result.response.text().trim() });
  } catch (error) {
    // Gemini failed? No problem, Whisper to the rescue
    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    const transcription = await openai.audio.transcriptions.create({
      file: audioFile,
      model: 'whisper-1',
    });
    return NextResponse.json({ transcript: transcription.text });
  }
}
I use two providers because sometimes APIs have bad days. Gemini is faster, but if it fails, Whisper is super reliable.
Stage 2: AI Chat Response
This is where the "digital twin" part comes in. I built a knowledge base with all my info:
import { NextResponse } from 'next/server';
import { GoogleGenerativeAI } from '@google/generative-ai';
import { searchKnowledge } from './portfolio-knowledge'; // path depends on where the knowledge base lives

export async function POST(req: Request) {
  const { message, sessionId } = await req.json();

  // Search my knowledge base for relevant info
  const relevantInfo = searchKnowledge(message);

  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({
    model: "gemini-2.5-flash",
    systemInstruction: `You are Sobit, a full-stack developer...

VOICE MODE RULES (because nobody wants to hear a 5-minute response):
- Keep it short - 2-4 sentences max
- Aim for 15-30 seconds when spoken out loud
- Be conversational, like you're chatting with someone
- No bullet points (they sound weird when spoken)

HERE'S WHAT'S RELEVANT:
${relevantInfo}`
  });

  const result = await model.generateContent(message);
  return NextResponse.json({ response: result.response.text() });
}
The knowledge base (portfolio-knowledge.ts) is just a big JSON object with everything about me - projects, skills, experience, etc. When someone asks about my projects, the AI gets that context and can answer accurately.
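The exact contents are personal, but the shape of portfolio-knowledge.ts and searchKnowledge looks roughly like this (a simplified sketch, not my actual file):

// Simplified sketch of portfolio-knowledge.ts - the real file just has many more entries.
type KnowledgeEntry = {
  topic: string;       // "projects", "experience", "skills", ...
  keywords: string[];  // words that should pull this entry into the prompt
  content: string;     // the text the AI actually gets as context
};

const knowledge: KnowledgeEntry[] = [
  {
    topic: 'projects',
    keywords: ['project', 'portfolio', 'built', 'app'],
    content: 'Sobit built a voice-interactive portfolio using VAD, Gemini, and ElevenLabs...',
  },
  // ...more entries
];

// Naive keyword match: return every entry that shares a word with the question.
export function searchKnowledge(message: string): string {
  const words = message.toLowerCase().split(/\s+/);
  return knowledge
    .filter(entry => entry.keywords.some(k => words.includes(k)))
    .map(entry => entry.content)
    .join('\n\n');
}

A keyword match like this is crude, but for a portfolio-sized knowledge base it does the job; anything fancier (embeddings, vector search) would be overkill here.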
Stage 3: Making It Sound Human
Finally, I turn the AI response into speech using ElevenLabs:
import { NextResponse } from 'next/server';
import { ElevenLabsClient } from 'elevenlabs';

export async function POST(req: Request) {
  const { text } = await req.json();

  const client = new ElevenLabsClient({
    apiKey: process.env.ELEVENLABS_API_KEY
  });

  const audio = await client.textToSpeech.convert(
    process.env.ELEVENLABS_VOICE_ID!,
    {
      text,
      model_id: "eleven_turbo_v2_5",
      voice_settings: {
        stability: 0.5, // Not too robotic, not too wild
        similarity_boost: 0.5, // Keep it consistent
        style: 0.0, // Neutral tone
        use_speaker_boost: true // Make it clear
      }
    }
  );

  // Convert the audio stream to base64
  const chunks: Uint8Array[] = [];
  for await (const chunk of audio) {
    chunks.push(chunk);
  }
  const audioBuffer = Buffer.concat(chunks);
  const base64Audio = audioBuffer.toString('base64');

  return NextResponse.json({
    audio: base64Audio,
    mimeType: 'audio/mpeg'
  });
}
The voice settings took some tweaking. Too much stability and it sounds like a robot. Too little and it sounds drunk. 0.5 is the sweet spot.
Step 5: State Management (Keeping Track of What's Happening)
I use a simple state machine to track the conversation:
type ConversationState = 'idle' | 'listening' | 'thinking' | 'speaking';
The flow looks like this:
idle ──[you start talking]──► listening
listening ──[you stop talking]──► thinking
thinking ──[AI responds]──► speaking
speaking ──[audio ends]──► idle (and back to listening)
Each state changes the UI colors, animations, and that glowing effect around the avatar. Makes it feel alive.
The Visual Magic (AKA The Fun Stuff)
The boring version would've been a static page with a microphone icon. But I wanted something that felt reactive - like the page was actually listening to you.
1. The Trippy Background Shader
I built a WebGL shader that creates this cloud-like effect that literally moves with your voice:
const fragmentShader = `
uniform float u_time;
uniform float u_voice_intensity; // How loud you're talking (0-1)
uniform int u_conversation_state; // What state we're in
// Colors for each state, set from JavaScript
uniform vec3 u_color_idle;
uniform vec3 u_color_listening;
uniform vec3 u_color_thinking;
uniform vec3 u_color_speaking;
varying vec2 vUv; // UV coordinates handed over from the vertex shader

// This creates organic, cloud-like patterns
float noise(vec2 p) {
  return fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453);
}

// Layer multiple noise patterns for complexity
float fbm(vec2 p) {
  float value = 0.0;
  float amplitude = 0.5;
  for(int i = 0; i < 5; i++) {
    value += amplitude * noise(p);
    p *= 2.0;
    amplitude *= 0.5;
  }
  return value;
}

void main() {
  vec2 uv = vUv;
  vec2 p = uv * 3.0 + u_time * 0.1;

  // HERE'S THE COOL PART - your voice distorts the pattern
  float distortion = u_voice_intensity * 0.3;
  p.x += sin(p.y * 10.0 + u_time) * distortion;
  p.y += cos(p.x * 10.0 + u_time) * distortion;

  float clouds = fbm(p);
  float voiceReaction = smoothstep(0.3, 0.7, clouds + u_voice_intensity * 0.5);

  // Pick colors based on what's happening
  vec3 baseColor;
  if(u_conversation_state == 0) baseColor = u_color_idle;
  else if(u_conversation_state == 1) baseColor = u_color_listening;
  else if(u_conversation_state == 2) baseColor = u_color_thinking;
  else baseColor = u_color_speaking;

  // Add a glow in the center that gets brighter when you talk
  float centerGlow = 1.0 - length(uv - 0.5) * 2.0;
  centerGlow = pow(centerGlow, 2.0) * u_voice_intensity;

  vec3 finalColor = mix(baseColor, baseColor * 1.5, voiceReaction);
  finalColor += vec3(centerGlow * 0.3);
  gl_FragColor = vec4(finalColor, 1.0);
}`;
Every frame, I update the u_voice_intensity value based on your voice volume. The shader uses that to distort the cloud pattern. Louder voice = more distortion. It's mesmerizing to watch.
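Concretely, that per-frame update is just a write into the shader's uniforms. Here's a sketch using react-three-fiber's useFrame; materialRef, smoothedEnergy, and stateToInt are my placeholder names, and in vanilla Three.js you'd do the same thing inside your render loop:

import { useFrame } from '@react-three/fiber';

// Per-frame uniform updates (sketch; ref and helper names are placeholders).
useFrame(({ clock }) => {
  const uniforms = materialRef.current?.uniforms;
  if (!uniforms) return;
  uniforms.u_time.value = clock.getElapsedTime();
  uniforms.u_voice_intensity.value = smoothedEnergy;                   // louder voice = more distortion
  uniforms.u_conversation_state.value = stateToInt(conversationState); // idle=0, listening=1, thinking=2, speaking=3
});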
2. Pulsing Circles Around the Avatar
Three concentric circles that pulse and rotate:
const circles = [
  { delay: 0, scale: 1 },
  { delay: 3, scale: 1.1 },
  { delay: 6, scale: 1.2 },
];

return circles.map((circle, i) => (
  <motion.div
    key={i}
    className="absolute rounded-full border-2"
    animate={{
      scale: [1, 1.05 + (micVolume * 0.4), 1], // Breathes with your voice
      opacity: [0.25, 0.5 + (micVolume * 0.5), 0.25], // Pulses
      rotate: 360, // Spins continuously
    }}
    transition={{
      duration: 8 + i * 3, // Each circle rotates at a different speed
      repeat: Infinity,
      ease: 'easeInOut',
      delay: circle.delay,
    }}
  />
));
When you talk, they scale up. When you're quiet, they shrink. The colors change based on the conversation state:
Green when listening
Yellow when thinking
Purple when speaking
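The color switching itself is just a lookup keyed on the conversation state. Something like this (the hex values here are placeholders, not my exact theme):

// Map each conversation state to a border color for the circles.
const stateColors: Record<ConversationState, string> = {
  idle: '#64748b',       // neutral gray
  listening: '#22c55e',  // green
  thinking: '#eab308',   // yellow
  speaking: '#a855f7',   // purple
};

const borderColor = stateColors[conversationState];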
3. The Avatar Itself
The profile picture scales and glows:
<motion.div
  animate={{
    scale: conversationState === 'listening' ? 1.05 :
           conversationState === 'speaking' ? 1.02 : 1,
    boxShadow: conversationState === 'listening'
      ? '0 0 50px rgba(59, 130, 246, 0.3)' // Blue glow
      : conversationState === 'speaking'
        ? '0 0 50px rgba(168, 85, 247, 0.3)' // Purple glow
        : conversationState === 'thinking'
          ? '0 0 50px rgba(234, 179, 8, 0.3)' // Yellow glow
          : 'none',
  }}
>
  <Image src="/sobit.jpg" alt="Avatar" />
</motion.div>
Simple but effective. The glow color tells you exactly what's happening without reading any text.
Problems I Ran Into (And Fixed)
Problem 1: "Can I Use Your Mic?" Popup Hell
Modern browsers won't let you access the microphone without a user action (good for privacy, annoying for UX). I had to add a big "Start Conversation" button that triggers the mic request:
const handleStart = async () => {
  try {
    await vad.start();
    setIsListening(true);
  } catch (error) {
    // User said no or browser blocked it
    console.error('Microphone access denied:', error);
  }
};
Problem 2: "Why Is There A 5 Second Delay?"
At first, the response time was terrible. Like 7-10 seconds bad. I fixed it by:
Using Gemini 2.5 Flash (way faster than the regular model)
Telling the AI to keep responses short (2-4 sentences)
Using ElevenLabs Turbo v2.5 for faster voice generation
Now it's 2-4 seconds from when you stop talking to when you hear the response. Much better.
Problem 3: Conversation Context
APIs are stateless, which means each request is treated like a brand new conversation. The AI would forget what we just talked about. I fixed this with session-based caching:
import { NextResponse } from 'next/server';

type Message = { role: 'user' | 'assistant'; content: string };

const conversationCache = new Map<string, Message[]>();

export async function POST(req: Request) {
  const { message, sessionId } = await req.json();

  // Load previous messages
  const history = conversationCache.get(sessionId) || [];
  history.push({ role: 'user', content: message });

  const response = await generateResponse(history); // wraps the Gemini call from Stage 2
  history.push({ role: 'assistant', content: response });

  // Save for next time
  conversationCache.set(sessionId, history);
  return NextResponse.json({ response });
}
Now the AI remembers the whole conversation. In production, I'd use Redis or a database, but for now this works.
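If I ever move it off the in-memory Map, the swap is mostly mechanical. A sketch with ioredis (the key names and the one-hour TTL are arbitrary choices, not what's deployed):

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);

// Same idea as the Map, but survives restarts and works across server instances.
// Message is the same { role, content } shape as above.
async function loadHistory(sessionId: string): Promise<Message[]> {
  const raw = await redis.get(`conversation:${sessionId}`);
  return raw ? JSON.parse(raw) : [];
}

async function saveHistory(sessionId: string, history: Message[]) {
  // Expire idle conversations after an hour so the store doesn't grow forever.
  await redis.set(`conversation:${sessionId}`, JSON.stringify(history), 'EX', 3600);
}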
Problem 4: Mobile Phones Hate Me
Running ONNX models and WebGL shaders on a phone made it laggy. Solution: detect mobile devices and simplify the effects:
const isMobile = /iPhone|iPad|iPod|Android/i.test(navigator.userAgent);
const shouldUseShaders = !isMobile && window.WebGL2RenderingContext;
Mobile users get simpler animations, desktop users get the full visual experience.
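In the JSX, that flag just decides which background gets rendered. The component names below are illustrative, not my actual file names:

// Heavy shader background only where it can run smoothly.
{shouldUseShaders ? (
  <ShaderBackground voiceIntensity={smoothedEnergy} conversationState={conversationState} />
) : (
  <StaticGradientBackground conversationState={conversationState} />
)}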
How It Performs in the Real World
After launching this thing:
VAD Accuracy:
95%+ accuracy detecting speech
Less than 5% false positives (usually from very loud noises)
Speed:
Speech end → Transcript: 1-2 seconds
Transcript → AI response: 0.5-1 second
Response → Audio playback: 1-2 seconds
Total: 2.5-5 seconds (feels pretty instant)
User Engagement:
People stay 3x longer than on my old static portfolio
70% of visitors try the voice feature
Average 4-5 questions per session
Browser Support:
Works perfectly on Chrome, Firefox, Safari, Edge (desktop)
Works well on mobile Chrome and Safari (with reduced visuals)
What I'd Add Next
I've got some ideas for v2:
Actual 3D Avatar with Lip Sync
Use Ready Player Me for a 3D model
Make it actually mouth the words
Add facial expressions
Emotion Detection
Analyze the tone of your voice
Adjust the AI's personality (sympathetic, excited, etc.)
Change the visual vibe to match
Multi-Language Support
Auto-detect what language you're speaking
Respond in the same language
ElevenLabs supports 29+ languages
Conversation Analytics
Track what people ask about most
See which projects get the most interest
Use it to improve the knowledge base
Voice Options
Let users pick from different voice styles
Adjust speed and personality
Make it more personalized
Final Thoughts
Building this was one of the most fun projects I've done. It combines so many cool technologies - machine learning for VAD, AI for conversation, voice synthesis for speech, and WebGL for visuals.
A few years ago, this would've been impossible (or at least required a massive backend infrastructure). Now? Everything runs in the browser using modern web APIs and cloud services. The entire pipeline - from detecting your voice to playing a response - happens in seconds.
The craziest part? Most visitors don't even realize how complex it is. They just click "Start," ask a question, and get an answer. That's the dream - complex tech that feels simple and magical.
If you're thinking about adding voice features to your projects, I hope this gives you a starting point. VAD libraries like @ricky0123/vad-web make it way easier than building from scratch, and modern AI APIs are fast enough for real-time conversation.
The future of web interaction is conversational. The tools are ready. Go build something cool.
Resources & Links
Try it yourself: sobitprasad.com/chat
Tech I used:
@ricky0123/vad-web - VAD library
Google Gemini - AI brain
ElevenLabs - Voice synthesis
Three.js - Graphics
Next.js - Framework
Framer Motion - Animations