I Built a DIY Voice-Controlled Car That LISTENS to Me... NO PHONE or App Needed! (Cute AI Face)

In this project, we will build a voice-controlled robot car that can respond to simple spoken commands like forward, backward, left, right, and stop. Instead of using a remote control, you can simply talk to the robot and it will move accordingly.

The system is powered by an ESP32-S3 microcontroller, a digital I2S microphone for capturing voice commands, and an AI speech-to-text service that converts the recorded audio into text. Once the command is recognized, the ESP32 processes it and controls the motors through a motor driver module.

To make the robot more interactive, a small OLED display is added to the front of the car to show animated robot eyes that react while the robot listens or moves.

By the end of this tutorial, you will have a working AI-powered voice-controlled robot car that combines robotics, embedded programming, and speech recognition.

Before beginning, a huge shoutout to JLCMC for sponsoring.

Let's start building.

Supplies

Electronic Components Required:

Seeed Studio XIAO ESP32-S3
INMP441 I2S Digital Microphone Module
0.96" SSD1306 OLED Display (I2C, 128×64)
L298N Dual H-Bridge Motor Driver Module
4× DC Gear Motors
4× Robot Wheels
Robot Chassis (see step 1)
Lithium Batteries (3.7V 300 mAh Li-Po battery for the XIAO ESP32-S3 and 2 × 18650 cells for the motor driver)
Jumper Wires (Male-to-Female / Female-to-Female)
Breadboard

Step 1: CAD & 3D Printing

The body of the robot car is made using a few simple 3D printed parts. These parts hold the motors, wheels, and electronics securely while keeping the robot compact and clean looking.

The main printed components used in this project are:

Car Chassis – This is the main frame of the robot where all components such as the motors, ESP32 board, and battery are mounted.
Motor Clips – Used to hold the motors firmly in place on the chassis.
Voice Control Car Body – The top enclosure that houses the electronics and gives the robot its final shape.

All parts were designed in Tinkercad and printed using PLA filament on a standard FDM 3D printer. After printing, lightly clean the parts if needed and make sure the motors and wheels fit properly before moving to the assembly step.

Once all the printed parts are ready, we can begin assembling the robot.

Step 2: Elevate Your Electronic Projects - JLCMC

JLCMC is your one-stop shop for all electronic manufacturing needs, offering an extensive catalog of nearly 600,000 SKUs that cover hardware, mechanical, electronic, and automation components. Their commitment to guaranteeing genuine products, rapid shipping (with most in-stock items dispatched within 24 hours), and competitive pricing truly sets them apart. In addition, their exceptional customer service ensures you always get exactly what you need to bring your projects to life.

For my next project, I’m planning to buy a timing belt from their Transmission Components section.

What I really like is how easy it is to customize the part. On the left side, you can select all the required options, and just below that, you get the complete specification and documentation, so you know exactly what you’re ordering.

JLCMC has recently upgraded their new-user registration benefits, increasing the value of the welcome coupon package to $123 in discount coupons + $10 Shipping Coupon. Whether you’re building DIY electronics, robotics, or mechanical projects, JLCMC has you covered with quality parts and fast delivery. Don’t miss out: visit https://jlcmc.com/?from=RL2 to explore their amazing range of products and grab your discount coupon today!

Step 3: Chassis Assembly

Take the DC gear motors and solder two wires to the terminals of each motor. Make sure the connections are solid and the wires are long enough to reach the motor driver.
Insert the motors into the 3D-printed motor clips so that the motor body fits tightly inside the holder.
Place the motor clips (with the motors inside) into the motor slots on the chassis. Ensure the motor shafts are facing outward where the wheels will be mounted.
Apply a small amount of hot glue around the motor clips to secure them firmly to the chassis.
Attach the wheels onto the motor shafts and press them firmly so they sit tightly on the motors.
Mount the breadboard onto the chassis using double-sided tape. Press it down properly so it stays fixed.
Place the motor driver module on one end of the breadboard and secure it using a small amount of hot glue.
Finally, insert the motor wires into the output terminals of the motor driver module and tighten the terminal screws to hold the wires in place.

Step 4: Circuit Connection

Now let’s connect all the electronic components according to the circuit diagram.

1. OLED Display Connection (I2C)

Connect the 0.96" SSD1306 OLED display to the XIAO ESP32-S3 using the I2C interface.

OLED → XIAO ESP32-S3
VCC → 3.3V
GND → GND
SCL → D5
SDA → D4

Make sure the OLED is powered with 3.3V, since the XIAO ESP32-S3 operates on 3.3V logic.

2. I2S Microphone Connection

Next, connect the I2S microphone module to the ESP32-S3. This microphone will capture the voice commands.

I2S MIC → XIAO ESP32-S3
GND → GND
VDD → 3.3V
SD → D3
SCK → D2
WS → D1
L/R → 3.3V

Setting L/R to 3.3V selects the right audio channel.
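
The I2S bus carries left and right slots interleaved, so the software has to keep only the slot the microphone actually drives. Here is a minimal host-side sketch of that selection in plain C++, not the Arduino code itself. The helper name `extract_channel` is hypothetical, and the sample layout (L0, R0, L1, R1, ...) is assumed to match a standard Philips-format stereo read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pick one channel out of interleaved stereo samples laid out as L0, R0, L1, R1, ...
// use_right = true corresponds to the L/R pin tied to 3.3V in this build.
std::vector<int16_t> extract_channel(const std::vector<int16_t>& stereo, bool use_right) {
  assert(stereo.size() % 2 == 0);  // a whole number of stereo frames is expected
  std::vector<int16_t> mono;
  mono.reserve(stereo.size() / 2);
  for (std::size_t frame = 0; frame < stereo.size() / 2; ++frame) {
    mono.push_back(stereo[frame * 2 + (use_right ? 1 : 0)]);
  }
  return mono;
}
```

If the L/R pin were tied to GND instead, the same function would be called with `use_right = false`. Reading the wrong slot typically yields silence (all zeros).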

3. Motor Driver Connection

Now connect the L298N motor driver to the ESP32-S3. These pins will control the speed and direction of the motors.

L298N → XIAO ESP32-S3
ENA → D6
IN1 → D7
IN2 → D8
ENB → D9
IN3 → D10
IN4 → D0

The motor wires that were soldered earlier should already be connected to the output terminals of the L298N motor driver.
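
To make the role of these pins concrete, here is a small host-side sketch in plain C++ of the direction logic and the PWM duty calculation. The IN1 to IN4 levels mirror what the project code writes for each command. The type and function names (`DirectionPins`, `Move`, `pins_for`, `duty_percent`) are hypothetical, and an 8-bit PWM range (0 to 255) is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Logic levels for the four L298N direction inputs.
struct DirectionPins {
  bool in1, in2, in3, in4;
};

enum class Move { Forward, Backward, Left, Right, Stop };

// Mirrors the IN1..IN4 levels the project sketch writes for each command.
DirectionPins pins_for(Move move) {
  switch (move) {
    case Move::Forward:  return {true,  false, true,  false};
    case Move::Backward: return {false, true,  false, true};
    case Move::Left:     return {true,  false, false, true};   // the two sides spin in opposite directions
    case Move::Right:    return {false, true,  true,  false};
    case Move::Stop:     return {false, false, false, false};
  }
  return {false, false, false, false};
}

// Convert an 8-bit PWM value (0-255) applied to ENA/ENB into a duty cycle in percent.
double duty_percent(int pwm_value) {
  assert(pwm_value >= 0 && pwm_value <= 255);
  return 100.0 * pwm_value / 255.0;
}
```

With the default speed value of 180 used in the code, the enable pins are driven at roughly a 70% duty cycle. If the car moves the wrong way for a command, swapping the two wires of the affected motor at the L298N output terminal is usually easier than changing the logic.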

4. Power Connections

The XIAO ESP32-S3 is powered using a 3.7V LiPo battery connected directly to the battery connector on the board.

Since the XIAO ESP32-S3 includes built-in battery management, it can safely power the system and also be charged through the USB-C port when needed.

The motor driver is powered separately by the 2 × 18650 battery pack, and the motors are driven through the L298N module.

Important: Make sure the GND of the ESP32-S3, motor driver, and all other modules are connected together so the entire circuit shares a common ground.

Once all the connections are complete, double-check the wiring before powering the system. After verifying everything, we can move on to uploading the code in the next step.
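
The wiring tables use the D labels printed on the board, while the code refers to GPIO numbers. When double-checking, a small lookup like the one below can help translate between the two. This is a sketch in plain C++: the helper name `gpio_for_d` is hypothetical, and the table follows the pin definitions in the project code together with the published XIAO ESP32-S3 pinout.

```cpp
#include <array>
#include <cassert>

// GPIO number behind each D-label on the XIAO ESP32-S3 (index = D number).
constexpr std::array<int, 11> kXiaoGpio = {1, 2, 3, 4, 5, 6, 43, 44, 7, 8, 9};

// Translate a silkscreen label such as D6 into the GPIO number used in code.
int gpio_for_d(int d_label) {
  assert(d_label >= 0 && d_label <= 10);
  return kXiaoGpio[d_label];
}
```

For example, ENA is wired to D6, which is why the code defines it as GPIO 43.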

Step 5: Code & Setup

Now we will upload the program to the Seeed Studio XIAO ESP32-S3 so the robot can listen to voice commands and control the motors.

1. Install Arduino IDE and ESP32 Board Package

Install the latest version of Arduino IDE on your computer.
Open Arduino IDE and install the ESP32 board package:
Go to File → Preferences
In Additional Board Manager URLs, add:

https://raw.githubusercontent.com/espressif/arduino-esp32/gh-pages/package_esp32_index.json

Then go to:
Tools → Board → Boards Manager
Search for ESP32 and install the ESP32 board package.

2. Select the Board and Port

Open Arduino IDE and configure the board.
Go to:
Tools → Board → ESP32 Arduino → XIAO ESP32S3
Then select the correct COM port:
Tools → Port → Select the port connected to your XIAO ESP32-S3
The code stores the recorded audio in PSRAM, so also enable it:
Tools → PSRAM → OPI PSRAM

3. Install Required Libraries

Install the following libraries from the Arduino Library Manager.
Go to Sketch → Include Library → Manage Libraries and install:
Adafruit GFX
Adafruit SSD1306
ArduinoJson
FluxGarage RoboEyes
These libraries are used for the OLED display, JSON parsing, and animated robot eyes.

4. Copy the Project Code

/*
 * ============================================================
 *  Voice-Controlled Robot with Animated OLED Eyes
 *  Author    : Shahbaz Hashmi Ansari
 *  Copyright : Roboattic Lab — All Rights Reserved
 * ============================================================
 *
 */

#include <WiFi.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>

#include <Wire.h>
#include <Adafruit_GFX.h>
#include <Adafruit_SSD1306.h>
#include <FluxGarage_RoboEyes.h>

#define SCREEN_WIDTH 128
#define SCREEN_HEIGHT 64
#define OLED_RESET -1
Adafruit_SSD1306 display(SCREEN_WIDTH, SCREEN_HEIGHT, &Wire, OLED_RESET);

RoboEyes<Adafruit_SSD1306> roboEyes(display);


// --- WiFi Credentials ---
const char* ssid = "YOUR_WIFI_NAME";
const char* password = "YOUR_WIFI_PASSWORD";


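// --- Pin Assignments ---
#define LED 21  // on-board user LED (active LOW)

#define LED_ON()  digitalWrite(LED, LOW)
#define LED_OFF() digitalWrite(LED, HIGH)


// --- L298N Motor Driver Pins (GPIO numbers on the Seeed XIAO ESP32-S3) ---
#define ENA       43  // D6
#define MOTOR_IN1 44  // D7
#define MOTOR_IN2 7   // D8
#define ENB       8   // D9
#define MOTOR_IN3 9   // D10
#define MOTOR_IN4 1   // D0

#define MOTOR_PWM_FREQ 1000  // Hz; kept for reference, analogWrite() below uses the core defaults
#define MOTOR_PWM_RES  8     // bits, so duty values range from 0 to 255

int motor_speed = 180;  // default PWM duty (0-255)

#define MOTOR_MOVE_DURATION 2000  // ms each movement runs before the motors stop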


// --- PSRAM Audio Buffer ---
uint8_t* psram_audio_buffer   = NULL;
size_t   psram_audio_size     = 0;
size_t   psram_audio_capacity = 0;

#define PSRAM_AUDIO_MAX_SIZE (512 * 1024)


// --- Continuous Listening Settings ---
#define RECORD_DURATION_MS 5000


// --- ElevenLabs API ---
const char* elevenlabs_api_key = "YOUR_API_KEY";
const char* elevenlabs_stt_url = "https://api.elevenlabs.io/v1/speech-to-text";


// --- Function Declarations ---
bool   I2S_Record_Init();
bool   Record_Start(const char* filename);
bool   Record_Available(const char* filename, float* audiolength_sec);
String SpeechToText_ElevenLabs(const char* filename);
void   motor_init();
void   motor_forward();
void   motor_backward();
void   motor_left();
void   motor_right();
void   motor_stop();
void   process_voice_command(String transcription);



// ==============================================================
//                     Motor Control
// ==============================================================

void motor_init() {
  pinMode(MOTOR_IN1, OUTPUT);
  pinMode(MOTOR_IN2, OUTPUT);
  pinMode(MOTOR_IN3, OUTPUT);
  pinMode(MOTOR_IN4, OUTPUT);
  pinMode(ENA, OUTPUT);
  pinMode(ENB, OUTPUT);
  digitalWrite(MOTOR_IN1, LOW);
  digitalWrite(MOTOR_IN2, LOW);
  digitalWrite(MOTOR_IN3, LOW);
  digitalWrite(MOTOR_IN4, LOW);
  analogWrite(ENA, 0);
  analogWrite(ENB, 0);

  Serial.println("> Motors initialized and STOPPED. Default speed: " + String(motor_speed));
}

void motor_forward() {
  Serial.println("> MOTOR: Moving FORWARD (speed=" + String(motor_speed) + ")");

  roboEyes.setMood(HAPPY);
  roboEyes.setPosition(DEFAULT);

  digitalWrite(MOTOR_IN1, HIGH);
  digitalWrite(MOTOR_IN2, LOW);
  digitalWrite(MOTOR_IN3, HIGH);
  digitalWrite(MOTOR_IN4, LOW);
  analogWrite(ENA, motor_speed);
  analogWrite(ENB, motor_speed);

  uint32_t move_start = millis();
  while (millis() - move_start < MOTOR_MOVE_DURATION) {
    roboEyes.update();
    delay(5);
  }

  motor_stop();
}

void motor_backward() {
  Serial.println("> MOTOR: Moving BACKWARD (speed=" + String(motor_speed) + ")");

  roboEyes.setMood(TIRED);
  roboEyes.setPosition(DEFAULT);

  digitalWrite(MOTOR_IN1, LOW);
  digitalWrite(MOTOR_IN2, HIGH);
  digitalWrite(MOTOR_IN3, LOW);
  digitalWrite(MOTOR_IN4, HIGH);
  analogWrite(ENA, motor_speed);
  analogWrite(ENB, motor_speed);

  uint32_t move_start = millis();
  while (millis() - move_start < MOTOR_MOVE_DURATION) {
    roboEyes.update();
    delay(5);
  }

  motor_stop();
}

void motor_left() {
  Serial.println("> MOTOR: Turning LEFT (speed=" + String(motor_speed) + ")");

  roboEyes.setMood(ANGRY);
  roboEyes.setPosition(W);

  digitalWrite(MOTOR_IN1, HIGH);
  digitalWrite(MOTOR_IN2, LOW);
  digitalWrite(MOTOR_IN3, LOW);
  digitalWrite(MOTOR_IN4, HIGH);
  analogWrite(ENA, motor_speed);
  analogWrite(ENB, motor_speed);

  uint32_t move_start = millis();
  while (millis() - move_start < MOTOR_MOVE_DURATION) {
    roboEyes.update();
    delay(5);
  }

  motor_stop();
}

void motor_right() {
  Serial.println("> MOTOR: Turning RIGHT (speed=" + String(motor_speed) + ")");

  roboEyes.setMood(ANGRY);
  roboEyes.setPosition(E);

  digitalWrite(MOTOR_IN1, LOW);
  digitalWrite(MOTOR_IN2, HIGH);
  digitalWrite(MOTOR_IN3, HIGH);
  digitalWrite(MOTOR_IN4, LOW);
  analogWrite(ENA, motor_speed);
  analogWrite(ENB, motor_speed);

  uint32_t move_start = millis();
  while (millis() - move_start < MOTOR_MOVE_DURATION) {
    roboEyes.update();
    delay(5);
  }

  motor_stop();
}

void motor_stop() {
  digitalWrite(MOTOR_IN1, LOW);
  digitalWrite(MOTOR_IN2, LOW);
  digitalWrite(MOTOR_IN3, LOW);
  digitalWrite(MOTOR_IN4, LOW);
  analogWrite(ENA, 0);
  analogWrite(ENB, 0);

  roboEyes.setMood(DEFAULT);
  roboEyes.setPosition(DEFAULT);
  roboEyes.update();
}


// --- Strip punctuation and convert to lowercase for reliable command matching ---
String clean_text(String text) {
  String cleaned = "";
  for (int i = 0; i < (int)text.length(); i++) {
    char c = text.charAt(i);
    if (isAlpha(c) || isDigit(c) || c == ' ') {
      cleaned += (char)tolower(c);
    }
  }
  String result = "";
  bool last_was_space = false;
  for (int i = 0; i < (int)cleaned.length(); i++) {
    if (cleaned.charAt(i) == ' ') {
      if (!last_was_space && result.length() > 0) {
        result += ' ';
        last_was_space = true;
      }
    } else {
      result += cleaned.charAt(i);
      last_was_space = false;
    }
  }
  result.trim();
  return result;
}

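// --- Extract a number from text: supports digits and English words ---
// Note: the homophones "to"/"too" and "for" count as 2 and 4, so a spoken
// "speed to fifty" transcribed in words parses as 52. Say "speed fifty" instead.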
int extract_number(String text) {
  String num_str = "";
  bool found_digit = false;
  for (int i = 0; i < (int)text.length(); i++) {
    if (isDigit(text.charAt(i))) {
      num_str += text.charAt(i);
      found_digit = true;
    } else if (found_digit) {
      break;
    }
  }
  if (num_str.length() > 0) {
    return num_str.toInt();
  }

  int current = 0;
  bool found_word_num = false;

  int start = 0;
  while (start < text.length()) {
    int space_idx = text.indexOf(' ', start);
    if (space_idx == -1) space_idx = text.length();
    String word = text.substring(start, space_idx);

    int val = -1;
    if (word == "zero") val = 0;
    else if (word == "one") val = 1;
    else if (word == "two" || word == "to" || word == "too") val = 2;
    else if (word == "three") val = 3;
    else if (word == "four" || word == "for") val = 4;
    else if (word == "five") val = 5;
    else if (word == "six") val = 6;
    else if (word == "seven") val = 7;
    else if (word == "eight") val = 8;
    else if (word == "nine") val = 9;
    else if (word == "ten") val = 10;
    else if (word == "eleven") val = 11;
    else if (word == "twelve") val = 12;
    else if (word == "thirteen") val = 13;
    else if (word == "fourteen") val = 14;
    else if (word == "fifteen") val = 15;
    else if (word == "sixteen") val = 16;
    else if (word == "seventeen") val = 17;
    else if (word == "eighteen") val = 18;
    else if (word == "nineteen") val = 19;
    else if (word == "twenty") val = 20;
    else if (word == "thirty") val = 30;
    else if (word == "forty" || word == "fourty") val = 40;
    else if (word == "fifty") val = 50;
    else if (word == "sixty") val = 60;
    else if (word == "seventy") val = 70;
    else if (word == "eighty") val = 80;
    else if (word == "ninety") val = 90;
    else if (word == "hundred") val = 100;

    if (val != -1) {
      found_word_num = true;
      if (val == 100) {
        if (current == 0) current = 1;
        current *= 100;
      } else {
        current += val;
      }
    } else {
      if (found_word_num) break;
    }
    start = space_idx + 1;
  }

  if (found_word_num) {
    return current;
  }

  return -1;
}

// --- Parse transcription and execute the appropriate motor command ---
void process_voice_command(String transcription) {
  if (transcription.length() == 0) return;

  String cmd = clean_text(transcription);
  Serial.println("> Command (cleaned): [" + cmd + "]");

  if (cmd.indexOf("forward") >= 0) {
    motor_forward();
    return;
  }
  if (cmd.indexOf("backward") >= 0 || cmd.indexOf("back") >= 0 || cmd.indexOf("reverse") >= 0) {
    motor_backward();
    return;
  }
  if (cmd.indexOf("left") >= 0) {
    motor_left();
    return;
  }
  if (cmd.indexOf("right") >= 0) {
    motor_right();
    return;
  }
  if (cmd.indexOf("stop") >= 0) {
    motor_stop();
    Serial.println("> MOTOR: STOPPED");
    return;
  }

  if (cmd.indexOf("speed") >= 0) {
    roboEyes.setMood(HAPPY);
    roboEyes.setPosition(DEFAULT);

    uint32_t move_start = millis();
    while (millis() - move_start < 1000) {
      roboEyes.update();
      delay(5);
    }

    int new_speed = extract_number(cmd);
    if (new_speed >= 0 && new_speed <= 255) {
      motor_speed = new_speed;
      Serial.println("> MOTOR: Speed set to " + String(motor_speed));
    } else if (new_speed > 255) {
      motor_speed = 255;
      Serial.println("> MOTOR: Speed clamped to max 255");
    } else {
      Serial.println("> MOTOR: Could not parse speed value from command");
    }
    return;
  }

  Serial.println("> No motor command detected.");
}



// ==============================================================
//                     I2S Audio Recording
// ==============================================================

#include "driver/i2s_std.h"

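#ifndef DEBUG
#define DEBUG true
// Wrapped in do/while(0) so each macro behaves as a single statement
#define DebugPrint(x)   do { if (DEBUG) { Serial.print(x); } } while (0)
#define DebugPrintln(x) do { if (DEBUG) { Serial.println(x); } } while (0)
#endif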

// --- I2S Pin Assignments (GPIO numbers) ---
#define I2S_WS  2  // D1
#define I2S_SD  4  // D3
#define I2S_SCK 3  // D2

#define SAMPLE_RATE      16000
#define BITS_PER_SAMPLE  16
#define GAIN_BOOSTER_I2S 4

// Set to true if L/R pin is tied to 3.3V, false if tied to GND
#define MIC_CHANNEL_RIGHT true

i2s_chan_handle_t rx_handle;

struct WAV_HEADER {
  char  riff[4]        = { 'R', 'I', 'F', 'F' };
  long  flength        = 0;
  char  wave[4]        = { 'W', 'A', 'V', 'E' };
  char  fmt[4]         = { 'f', 'm', 't', ' ' };
  long  chunk_size     = 16;
  short format_tag     = 1;
  short num_chans      = 1;
  long  srate          = SAMPLE_RATE;
  long  bytes_per_sec  = SAMPLE_RATE * (BITS_PER_SAMPLE / 8);
  short bytes_per_samp = (BITS_PER_SAMPLE / 8);
  short bits_per_samp  = BITS_PER_SAMPLE;
  char  dat[4]         = { 'd', 'a', 't', 'a' };
  long  dlength        = 0;
} myWAV_Header;

bool flg_is_recording     = false;
bool flg_I2S_initialized  = false;


bool I2S_Record_Init() {
  i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(I2S_NUM_AUTO, I2S_ROLE_MASTER);
  if (i2s_new_channel(&chan_cfg, NULL, &rx_handle) != ESP_OK) {
    Serial.println("ERROR - Failed to create I2S RX channel");
    return false;
  }

  i2s_std_config_t std_cfg = {
    .clk_cfg  = I2S_STD_CLK_DEFAULT_CONFIG(SAMPLE_RATE),
    .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_STEREO),
    .gpio_cfg = {
      .mclk  = I2S_GPIO_UNUSED,
      .bclk  = (gpio_num_t)I2S_SCK,
      .ws    = (gpio_num_t)I2S_WS,
      .dout  = I2S_GPIO_UNUSED,
      .din   = (gpio_num_t)I2S_SD,
      .invert_flags = {
        .mclk_inv = false,
        .bclk_inv = false,
        .ws_inv   = false,
      },
    },
  };

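  if (i2s_channel_init_std_mode(rx_handle, &std_cfg) != ESP_OK ||
      i2s_channel_enable(rx_handle) != ESP_OK) {
    Serial.println("ERROR - I2S channel init/enable failed");
    return false;
  }

  flg_I2S_initialized = true;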

  // Flush initial garbage samples
  int16_t dummy_buf[1024];
  size_t  dummy_read = 0;
  for (int i = 0; i < 10; i++) {
    i2s_channel_read(rx_handle, dummy_buf, sizeof(dummy_buf), &dummy_read, portMAX_DELAY);
  }

  // Microphone diagnostic test
  int16_t test_buf[1024];
  size_t  test_read = 0;
  i2s_channel_read(rx_handle, test_buf, sizeof(test_buf), &test_read, portMAX_DELAY);

  int total_stereo_samples = test_read / 2;
  int frames               = total_stereo_samples / 2;

  int   left_nonzero = 0;
  int16_t left_max   = 0;
  int   left_clipped = 0;
  for (int i = 0; i < frames; i++) {
    int16_t s = test_buf[i * 2];
    if (s != 0) left_nonzero++;
    if (abs(s) > abs(left_max)) left_max = s;
    if (s == 32767 || s == -32768) left_clipped++;
  }

  int   right_nonzero = 0;
  int16_t right_max   = 0;
  int   right_clipped = 0;
  for (int i = 0; i < frames; i++) {
    int16_t s = test_buf[i * 2 + 1];
    if (s != 0) right_nonzero++;
    if (abs(s) > abs(right_max)) right_max = s;
    if (s == 32767 || s == -32768) right_clipped++;
  }

  Serial.println("> I2S initialized and flushed.");
  Serial.println("> MIC TEST (" + String(frames) + " stereo frames):");
  Serial.println(">   LEFT  ch: " + String(left_nonzero) + " non-zero, max=" + String(left_max) + ", clipped=" + String(left_clipped));
  Serial.println(">   RIGHT ch: " + String(right_nonzero) + " non-zero, max=" + String(right_max) + ", clipped=" + String(right_clipped));
  Serial.println(">   Using: " + String(MIC_CHANNEL_RIGHT ? "RIGHT" : "LEFT") + " channel");

  Serial.print(">   First 8 stereo frames (L,R): ");
  for (int i = 0; i < 8 && i < frames; i++) {
    Serial.print("[" + String(test_buf[i * 2]) + "," + String(test_buf[i * 2 + 1]) + "] ");
  }
  Serial.println();

  return flg_I2S_initialized;
}


bool Record_Start(const char* audio_filename) {
  if (!flg_I2S_initialized) {
    Serial.println("ERROR in Record_Start() - I2S not initialized, call 'I2S_Record_Init()' missed");
    return false;
  }

  if (!flg_is_recording) {
    flg_is_recording  = true;
    psram_audio_size  = 0;

    memcpy(psram_audio_buffer, (uint8_t*)&myWAV_Header, 44);
    psram_audio_size = 44;

    DebugPrintln("\n> WAV Header generated in PSRAM, Audio Recording started ... ");
    return true;
  }

  if (flg_is_recording) {
    int16_t stereo_buffer[2048];
    size_t  bytes_read = 0;
    i2s_channel_read(rx_handle, stereo_buffer, sizeof(stereo_buffer), &bytes_read, portMAX_DELAY);

    int total_samples  = bytes_read / 2;
    int stereo_frames  = total_samples / 2;

    int16_t mono_buffer[1024];
    for (int i = 0; i < stereo_frames && i < 1024; i++) {
      if (MIC_CHANNEL_RIGHT)
        mono_buffer[i] = stereo_buffer[i * 2 + 1];
      else
        mono_buffer[i] = stereo_buffer[i * 2];
    }

    if (GAIN_BOOSTER_I2S > 1 && GAIN_BOOSTER_I2S <= 64) {
      for (int i = 0; i < stereo_frames && i < 1024; ++i) {
        int32_t amplified = (int32_t)mono_buffer[i] * GAIN_BOOSTER_I2S;
        if (amplified > 32767)  amplified = 32767;
        if (amplified < -32768) amplified = -32768;
        mono_buffer[i] = (int16_t)amplified;
      }
    }

    size_t write_bytes = stereo_frames * 2;
    if (write_bytes > 2048) write_bytes = 2048;

    if ((psram_audio_size + write_bytes) <= psram_audio_capacity) {
      memcpy(psram_audio_buffer + psram_audio_size, (uint8_t*)mono_buffer, write_bytes);
      psram_audio_size += write_bytes;
      return true;
    } else {
      Serial.println("ERROR in Record_Start() - PSRAM buffer full!");
      return false;
    }
  }
  return false;
}


bool Record_Available(const char* audio_filename, float* audiolength_sec) {
  if (!flg_is_recording)    return false;
  if (!flg_I2S_initialized) return false;

  if (flg_is_recording) {
    long filesize          = psram_audio_size;
    myWAV_Header.flength   = filesize - 8;  // RIFF chunk size excludes the 8-byte "RIFF" + size fields
    myWAV_Header.dlength   = (filesize - 44);
    memcpy(psram_audio_buffer, (uint8_t*)&myWAV_Header, 44);

    flg_is_recording = false;

    *audiolength_sec = (float)(filesize - 44) / (SAMPLE_RATE * BITS_PER_SAMPLE / 8);

    DebugPrintln("> ... Done. Audio Recording finished.");
    DebugPrint("> AUDIO in PSRAM, size [bytes]: " + (String)filesize);
    DebugPrintln(", length [sec]: " + (String)*audiolength_sec);

    if (filesize > 76) {
      DebugPrint("> First 16 mono samples (16bit signed): ");
      int16_t* samples = (int16_t*)(psram_audio_buffer + 44);
      for (int i = 0; i < 16; i++) {
        DebugPrint((String)samples[i] + " ");
      }
      DebugPrintln("");
    }

    return true;
  }
  return false;
}



// ==============================================================
//                  ElevenLabs Speech-to-Text
// ==============================================================

String SpeechToText_ElevenLabs(const char* audio_filename) {
  uint32_t t_start = millis();

  if (WiFi.status() != WL_CONNECTED) {
    Serial.println("ERROR - WiFi not connected, cannot send to ElevenLabs STT");
    return ("");
  }

  size_t audio_size = psram_audio_size;
  if (audio_size == 0) {
    Serial.println("ERROR - No audio data in PSRAM");
    return ("");
  }
  if (audio_size > 500000) {
    Serial.println("ERROR - Audio data too large for STT request (>500KB)");
    return ("");
  }
  DebugPrintln("> Audio data in PSRAM, size: " + (String)audio_size);

  uint32_t t_data_ready = millis();

  HTTPClient http;
  if (!http.begin(elevenlabs_stt_url)) {
    Serial.println("ERROR - Failed to initialize HTTP connection to ElevenLabs");
    return ("");
  }

  http.setTimeout(30000);
  http.setConnectTimeout(10000);
  http.addHeader("xi-api-key", elevenlabs_api_key);

  String boundary = "----WebKitFormBoundary7MA4YWxkTrZu0gW";
  http.addHeader("Content-Type", "multipart/form-data; boundary=" + boundary);

  String body_start = "--" + boundary + "\r\n";
  body_start += "Content-Disposition: form-data; name=\"model_id\"\r\n\r\n";
  body_start += "scribe_v2\r\n";
  body_start += "--" + boundary + "\r\n";
  body_start += "Content-Disposition: form-data; name=\"language_code\"\r\n\r\n";
  body_start += "eng\r\n";
  body_start += "--" + boundary + "\r\n";
  body_start += "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n";
  body_start += "Content-Type: audio/wav\r\n\r\n";

  String body_end = "\r\n--" + boundary + "--\r\n";

  size_t total_size = body_start.length() + audio_size + body_end.length();

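  // A 5 s clip is about 160 KB, so allocate the request body from PSRAM rather than the internal heap
  uint8_t* complete_body = (uint8_t*)ps_malloc(total_size);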
  if (!complete_body) {
    Serial.println("ERROR - Failed to allocate memory for HTTP body!");
    http.end();
    return ("");
  }

  memcpy(complete_body, body_start.c_str(), body_start.length());
  memcpy(complete_body + body_start.length(), psram_audio_buffer, audio_size);
  memcpy(complete_body + body_start.length() + audio_size, body_end.c_str(), body_end.length());

  uint32_t t_request_prepared = millis();

  DebugPrintln("> POST Request to ElevenLabs STT, sending " + String(total_size) + " bytes ...");

  uint32_t t_request_sent      = millis();
  int httpResponseCode          = http.POST(complete_body, total_size);
  uint32_t t_response_received = millis();

  free(complete_body);

  String transcription = "";
  String response      = http.getString();

  uint32_t t_response_parsed = millis();

  if (httpResponseCode == 200) {
    DebugPrintln("> HTTP 200 OK");
    DynamicJsonDocument doc(2048);
    if (deserializeJson(doc, response) == DeserializationError::Ok) {
      if (doc.containsKey("text")) {
        transcription = doc["text"].as<String>();
      }
    } else {
      Serial.println("ERROR - Failed to parse ElevenLabs JSON response");
    }
  } else {
    Serial.printf("ERROR - HTTP Response Code: %d\n", httpResponseCode);
    Serial.println("Response: " + response);
  }

  http.end();

  DebugPrintln("---------------------------------------------------");
  DebugPrintln("-> Audio data size [bytes]: " + (String)audio_size);
  DebugPrintln("-> Latency Data Ready [t_data_ready]:          " + (String)((float)((t_data_ready - t_start)) / 1000) + " sec");
  DebugPrintln("-> Latency Request Preparation:                " + (String)((float)((t_request_prepared - t_data_ready)) / 1000) + " sec");
  DebugPrintln("-> Latency ElevenLabs STT Response:            " + (String)((float)((t_response_received - t_request_sent)) / 1000) + " sec");
  DebugPrintln("-> Latency Response Parsing:                   " + (String)((float)((t_response_parsed - t_response_received)) / 1000) + " sec");
  DebugPrintln("=> TOTAL Duration [sec]: ..................... " + (String)((float)((t_response_parsed - t_start)) / 1000));
  DebugPrintln("=> Server response length [bytes]: " + (String)response.length());
  DebugPrintln("=> Transcription: [" + transcription + "]");
  DebugPrintln("---------------------------------------------------\n");

  return transcription;
}



// ==============================================================
//                        Setup & Loop
// ==============================================================

void setup() {
  Serial.begin(115200);
  Serial.setTimeout(100);
  motor_init();  // called after Serial.begin() so its status message is printed

  pinMode(LED, OUTPUT);
  LED_OFF();

  delay(500);
  Wire.setPins(5, 6);
  Wire.begin();

  bool display_found = false;
  if (display.begin(SSD1306_SWITCHCAPVCC, 0x3C)) {
    display_found = true;
  } else if (display.begin(SSD1306_SWITCHCAPVCC, 0x3D)) {
    display_found = true;
  }

  if (!display_found) {
    Serial.println(F("\nCRITICAL ERROR: OLED Display Not Found!"));
    Serial.println(F("Check I2C wiring: SDA to D4(GPIO5) & SCL to D5(GPIO6)"));
    while (true) { delay(100); }
  }

  display.clearDisplay();
  display.display();

  roboEyes.begin(SCREEN_WIDTH, SCREEN_HEIGHT, 100);
  roboEyes.setAutoblinker(true, 3, 2);
  roboEyes.setIdleMode(true, 2, 2);
  roboEyes.setMood(DEFAULT);
  roboEyes.setPosition(DEFAULT);

  Serial.println("Voice-Controlled Robot with Animated OLED Eyes");  // VERSION was never defined

  if (psramFound()) {
    psram_audio_buffer = (uint8_t*)ps_malloc(PSRAM_AUDIO_MAX_SIZE);
    if (psram_audio_buffer) {
      psram_audio_capacity = PSRAM_AUDIO_MAX_SIZE;
      Serial.println("PSRAM initialized. Allocated " + String(PSRAM_AUDIO_MAX_SIZE) + " bytes for audio buffer.");
    } else {
      Serial.println("ERROR - PSRAM allocation failed!");
      while (true) { delay(100); }  // halt: loop() would otherwise write to a null buffer
    }
  } else {
    Serial.println("ERROR - No PSRAM found! XIAO ESP32-S3 PSRAM required.");
    while (true) { delay(100); }  // halt: recording cannot work without the PSRAM buffer
  }

  WiFi.mode(WIFI_STA);
  WiFi.begin(ssid, password);
  Serial.print("Connecting WLAN ");
  while (WiFi.status() != WL_CONNECTED) {
    Serial.print(".");
    roboEyes.update();
    delay(500);
  }
  Serial.println(". Done, device connected.");
  LED_OFF();

  I2S_Record_Init();

  Serial.println("> Continuous listening mode: recording " + String(RECORD_DURATION_MS / 1000) + " sec chunks, transcribing each ...");
  Serial.println("> Voice commands: 'go forward/backward/left/right/stop'");
  Serial.println("> Speed command:  'make speed 150' (0-255)");
}


void loop() {

  roboEyes.setMood(DEFAULT);
  roboEyes.setPosition(N);
  roboEyes.update();

  // --- Step 1: Record audio for configured duration ---
  LED_ON();
  motor_stop();

  uint32_t record_start = millis();

  Record_Start("PSRAM_AUDIO");

  while (millis() - record_start < RECORD_DURATION_MS) {
    Record_Start("PSRAM_AUDIO");
    roboEyes.update();

    display.setTextSize(1);
    display.setTextColor(SSD1306_WHITE);
    int16_t  x1, y1;
    uint16_t tw, th;
    display.getTextBounds("Listening...", 0, 0, &x1, &y1, &tw, &th);
    int textX = (SCREEN_WIDTH - tw) / 2;
    int textY = SCREEN_HEIGHT - th - 2;
    display.fillRect(textX - 2, textY - 1, tw + 4, th + 2, SSD1306_BLACK);
    display.setCursor(textX, textY);
    display.print("Listening...");
    display.display();

    delay(5);
  }

  float recorded_seconds = 0;
  Record_Available("PSRAM_AUDIO", &recorded_seconds);

  LED_OFF();

  // --- Step 2: Transcribe audio via ElevenLabs ---
  if (recorded_seconds > 0.4) {
    roboEyes.setMood(TIRED);
    roboEyes.setPosition(DEFAULT);
    roboEyes.update();

    String transcription = SpeechToText_ElevenLabs("PSRAM_AUDIO");

    if (transcription.length() > 0) {
      Serial.println(">> " + transcription);

      // --- Step 3: Execute the detected voice command ---
      process_voice_command(transcription);

    } else {
      Serial.println("> (silence or no speech detected)");
    }
  }

  // --- Step 4: Begin next recording cycle ---
}

5. Configure Your WiFi Credentials

Inside the code, locate the WiFi configuration section:

const char* ssid = "YOUR_WIFI_NAME";
const char* password = "YOUR_WIFI_PASSWORD";

Replace these values with your WiFi network name and password.

The ESP32 will connect to this network to send audio to the speech recognition service.
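
For reference, the audio is sent as a multipart/form-data POST: two text fields (model_id and language_code), then the WAV file, then a closing delimiter. The sketch below rebuilds that layout in plain host-side C++ so the byte counts are easy to reason about. The helper names (`text_part`, `file_part_header`, `closing_delimiter`, `wav_size_bytes`) are hypothetical and are not part of the project code or of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One plain-text field of a multipart/form-data body.
std::string text_part(const std::string& boundary, const std::string& name,
                      const std::string& value) {
  std::string part = "--" + boundary + "\r\n";
  part += "Content-Disposition: form-data; name=\"" + name + "\"\r\n\r\n";
  part += value + "\r\n";
  return part;
}

// Header lines that precede the raw file bytes.
std::string file_part_header(const std::string& boundary, const std::string& name,
                             const std::string& filename, const std::string& content_type) {
  std::string part = "--" + boundary + "\r\n";
  part += "Content-Disposition: form-data; name=\"" + name + "\"; filename=\"" + filename + "\"\r\n";
  part += "Content-Type: " + content_type + "\r\n\r\n";
  return part;
}

// Final delimiter that terminates the body.
std::string closing_delimiter(const std::string& boundary) {
  return "\r\n--" + boundary + "--\r\n";
}

// Size of a mono 16-bit PCM WAV file: 44-byte header plus 2 bytes per sample.
std::size_t wav_size_bytes(unsigned seconds, unsigned sample_rate) {
  return 44 + static_cast<std::size_t>(seconds) * sample_rate * 2;
}
```

With the settings used in the code (5 seconds at 16 kHz), the WAV data comes to 160,044 bytes. That fits comfortably inside both the 512 KB PSRAM buffer and the 500,000-byte request limit that the code checks before uploading.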

6. Get Your ElevenLabs API Key

This project uses the ElevenLabs Speech-to-Text API to convert the recorded voice commands into text.

Follow these steps to create your API key.

Go to the ElevenLabs Dashboard and log in to your account.
Click Developers in the bottom-left sidebar.
Open the API Keys tab.
Click Create Key.
Enable Restrict Key and allow only Speech-to-Text access.
Click Create and copy the generated API key.

Important: Copy and save the key immediately. For security reasons, the full key is shown only once.

7. Add the API Key to the Code

Find this line in the code:

const char* elevenlabs_api_key = "YOUR_API_KEY";

Replace it with your own API key.

Make sure you keep this key private and do not share it publicly.

8. Upload the Code

Connect the XIAO ESP32-S3 to your computer using a USB-C cable.
Then upload the code:
Click Upload in Arduino IDE.
The program will compile and transfer to the board.
Once uploaded, open the Serial Monitor (115200 baud) to see debugging messages and voice recognition results.

After the code is uploaded successfully, the robot will start listening for voice commands and controlling the motors accordingly.
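
To get a feel for how a transcription turns into a movement, here is a simplified host-side sketch in plain C++ of the keyword matching. It checks keywords in the same priority order as the project code but leaves out the speed command. The function names `normalize` and `match_command` are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Lowercase the text and drop everything except letters, digits, and spaces.
std::string normalize(const std::string& text) {
  std::string out;
  for (unsigned char c : text) {
    if (std::isalnum(c) || c == ' ') out += static_cast<char>(std::tolower(c));
  }
  return out;
}

// Return the first matching command keyword, checked in the same priority
// order as the project sketch; an empty string means no command was found.
std::string match_command(const std::string& transcription) {
  const std::string cmd = normalize(transcription);
  auto has = [&cmd](const char* word) { return cmd.find(word) != std::string::npos; };
  if (has("forward")) return "forward";
  if (has("back") || has("reverse")) return "backward";
  if (has("left")) return "left";
  if (has("right")) return "right";
  if (has("stop")) return "stop";
  return "";
}
```

Because the match is substring-based and ordered, a phrase such as "stop going forward" resolves to forward, not stop. Short, single-command phrases like "go left" or "stop" give the most predictable results.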

Step 6: Working Video and Tutorial

Congratulations! You’ve successfully built your DIY Voice Controlled Car for Beginners (ESP32 + VOICE RECOGNITION). A demonstration video of this project can be viewed here: Watch Now

Thank you for your interest in this project. If you have any questions or suggestions for future projects, please leave a comment, and I will do my best to assist you.

For business or promotional inquiries, please contact me via email at Email.

I will continue to update this article with new information. Don’t forget to follow me for updates on new projects and subscribe to my YouTube channel (YouTube: roboattic Lab) for more content. Thank you for your support.