Code Structure & Explanation

This section explains the structure of the ESP32-CAM project code and describes how the main components work together to implement video streaming, image capture, camera control, and face detection.

Project File Structure

main.cpp

Entry point of the application

Contains setup() and loop() functions; initializes the camera, Wi-Fi, and web server

camera_pins.h

Defines GPIO pin mappings for ESP32-CAM modules

Provides hardware-specific configurations for different board variants

app_httpd.cpp

Implements the HTTP server

Handles live video streaming, image capture, camera control requests, status reporting, and face detection processing

camera_index.h

Contains the web interface (HTML, CSS, JavaScript)

Embedded directly in firmware as a C string; served to the browser at the root (/) endpoint

Overview of Main Files

The ESP32-CAM project is organized into modular components to separate hardware control, networking, and user interaction:

  • main.cpp: Responsible for system initialization and starting background services
  • camera_pins.h: Abstracts board-specific GPIO mappings to ensure compatibility with the selected ESP32-CAM module
  • app_httpd.cpp: Core implementation of the camera web server and streaming logic
  • camera_index.h: Provides a self-contained web UI without requiring external files or servers
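As an illustration, serving the embedded page is typically a one-line response from a handler like the following device-side sketch (the symbol names index_ov2640_html_gz and index_ov2640_html_gz_len follow the stock CameraWebServer example, where the page is stored gzip-compressed; the names in this project may differ):

```cpp
// Device-side sketch: serve the gzip-compressed HTML embedded in camera_index.h
static esp_err_t index_handler(httpd_req_t *req){
    httpd_resp_set_type(req, "text/html");
    httpd_resp_set_hdr(req, "Content-Encoding", "gzip");
    return httpd_resp_send(req, (const char *)index_ov2640_html_gz,
                           index_ov2640_html_gz_len);
}
```

Storing the page gzip-compressed keeps firmware size down, and browsers decompress it transparently thanks to the Content-Encoding header.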

Camera Initialization

The camera is configured using the camera_config_t structure provided by the ESP32 camera driver. Key responsibilities of camera initialization:

  • Assign correct GPIO pins for the camera module
  • Configure clock frequency and pixel format
  • Select frame size and JPEG quality
  • Allocate frame buffers in PSRAM when available

Important behaviors:

  • When PSRAM is detected, higher resolutions and multiple frame buffers are enabled
  • Without PSRAM, the system automatically limits resolution to avoid memory exhaustion
  • JPEG format is used for streaming efficiency

After configuration, the camera is initialized using:

// Configuration structure for camera
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
config.pin_d1 = Y3_GPIO_NUM;
config.pin_d2 = Y4_GPIO_NUM;
config.pin_d3 = Y5_GPIO_NUM;
config.pin_d4 = Y6_GPIO_NUM;
config.pin_d5 = Y7_GPIO_NUM;
config.pin_d6 = Y8_GPIO_NUM;
config.pin_d7 = Y9_GPIO_NUM;
config.pin_xclk = XCLK_GPIO_NUM;
config.pin_pclk = PCLK_GPIO_NUM;
config.pin_vsync = VSYNC_GPIO_NUM;
config.pin_href = HREF_GPIO_NUM;
config.pin_sscb_sda = SIOD_GPIO_NUM;
config.pin_sscb_scl = SIOC_GPIO_NUM;
config.pin_pwdn = PWDN_GPIO_NUM;
config.pin_reset = RESET_GPIO_NUM;
config.xclk_freq_hz = 20000000;
config.pixel_format = PIXFORMAT_JPEG;

// Frame size settings
if(psramFound()){
    config.frame_size = FRAMESIZE_UXGA;
    config.jpeg_quality = 10;
    config.fb_count = 2;
} else {
    config.frame_size = FRAMESIZE_SVGA;
    config.jpeg_quality = 12;
    config.fb_count = 1;
}

// Initialize the camera
esp_err_t err = esp_camera_init(&config);
if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x", err);
    return;
}
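Once esp_camera_init() succeeds, runtime settings can still be adjusted through the driver's sensor interface without reinitializing. A brief device-side sketch (the specific values here are illustrative):

```cpp
// Runtime tuning via the camera driver's sensor interface (device-side sketch)
sensor_t *s = esp_camera_sensor_get();
if (s) {
    s->set_framesize(s, FRAMESIZE_SVGA);  // lower resolution for smoother streaming
    s->set_quality(s, 12);                // JPEG quality: lower number = higher quality
    s->set_vflip(s, 1);                   // flip the image if the module is mounted upside down
}
```

This same mechanism is what the /control endpoint uses to apply settings requested from the web UI.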

HTTP Server Setup

The project uses Espressif's built-in HTTP server to expose camera functionality over Wi-Fi. Server responsibilities:

  • Serve the embedded web interface
  • Stream live video
  • Capture still images
  • Accept camera control commands
  • Report system status

Key characteristics:

  • Runs as a background task
  • Uses multiple URI handlers
  • Designed for low memory overhead

// Initialize HTTP server
httpd_config_t config = HTTPD_DEFAULT_CONFIG();
config.max_uri_handlers = 16;

esp_err_t err = httpd_start(&camera_httpd, &config);
if (err == ESP_OK) {
    httpd_register_uri_handler(camera_httpd, &index_uri);
    httpd_register_uri_handler(camera_httpd, &status_uri);
    httpd_register_uri_handler(camera_httpd, &capture_uri);
    httpd_register_uri_handler(camera_httpd, &stream_uri);
    httpd_register_uri_handler(camera_httpd, &control_uri);
    httpd_register_uri_handler(camera_httpd, &settings_uri);
}
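Each registered handler is described by an httpd_uri_t structure. For example, the index_uri descriptor registered above would be defined along these lines (index_handler is whatever function sends the embedded HTML):

```cpp
// Device-side sketch: URI descriptor binding "/" to the index handler
httpd_uri_t index_uri = {
    .uri      = "/",
    .method   = HTTP_GET,
    .handler  = index_handler,
    .user_ctx = NULL
};
```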

MJPEG Stream Handling

The /stream endpoint continuously sends image frames to connected clients. Streaming process:

  • Capture a frame using esp_camera_fb_get()
  • Optionally process the frame for face detection
  • Encode or forward the frame as JPEG
  • Send the frame as part of a multipart HTTP response
  • Return the frame buffer using esp_camera_fb_return()

This loop runs continuously until the client disconnects.

MJPEG is used to ensure:

  • Maximum browser compatibility
  • Simple implementation
  • Low processing overhead

static esp_err_t stream_handler(httpd_req_t *req){
    camera_fb_t * fb = NULL;
    esp_err_t res = ESP_OK;
    size_t _jpg_buf_len = 0;
    uint8_t * _jpg_buf = NULL;
    char part_buf[64];

    res = httpd_resp_set_type(req, _STREAM_CONTENT_TYPE);
    if(res != ESP_OK){
        return res;
    }

    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");

    while(true){
        fb = esp_camera_fb_get();
        if (!fb) {
            Serial.println("Camera capture failed");
            res = ESP_FAIL;
            break;
        }
        // Optional on-device face detection hook (only at higher resolutions)
        if(fb->width > 400){
            if(facenet.run_face_recognize(fb->buf, fb->len)){
                // Face detection processing
            }
        }
        // Frames arrive pre-encoded as JPEG (PIXFORMAT_JPEG)
        _jpg_buf = fb->buf;
        _jpg_buf_len = fb->len;

        // Multipart framing: boundary, per-part header, then the JPEG payload
        res = httpd_resp_send_chunk(req, _STREAM_BOUNDARY, strlen(_STREAM_BOUNDARY));
        if(res == ESP_OK){
            size_t hlen = snprintf(part_buf, sizeof(part_buf), _STREAM_PART, _jpg_buf_len);
            res = httpd_resp_send_chunk(req, part_buf, hlen);
        }
        if(res == ESP_OK){
            res = httpd_resp_send_chunk(req, (const char *)_jpg_buf, _jpg_buf_len);
        }
        esp_camera_fb_return(fb);
        if(res != ESP_OK){
            break; // client disconnected or send failed
        }
    }
    return res;
}

HTTP Endpoints

The system exposes several HTTP endpoints:

  • / - Serves the embedded web interface
  • /stream - Provides a continuous MJPEG video stream
  • /capture - Captures and returns a single JPEG image
  • /control - Adjusts camera parameters using query parameters
  • /status - Returns current camera and system configuration in JSON format

All endpoints are accessed using standard HTTP GET requests.
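For example, a request such as /control?var=framesize&val=8 changes the resolution. The following host-side sketch illustrates the kind of query parsing a /control handler performs; on the device this would instead use httpd_req_get_url_query_str() and httpd_query_key_value(), and parse_control_query is a hypothetical helper for illustration:

```cpp
#include <cstdio>
#include <cstring>

// Parse a /control-style query string of the form "var=<name>&val=<int>",
// e.g. "var=framesize&val=8". Returns true on success.
bool parse_control_query(const char *query, char *var, size_t var_len, int *val) {
    char fmt[32];
    // Build a scanf format with a bounded field width, e.g. "var=%31[^&]&val=%d"
    snprintf(fmt, sizeof(fmt), "var=%%%zu[^&]&val=%%d", var_len - 1);
    return sscanf(query, fmt, var, val) == 2;
}
```

The parsed name is then matched against the sensor setters (set_framesize, set_quality, and so on) to apply the change.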

Frame Buffer Management

Efficient frame buffer handling is critical for stability and performance. Key strategies used:

  • PSRAM is used for frame buffers whenever available, allowing higher-resolution images
  • Multiple frame buffers enable smooth streaming: one frame can be processed while another is being captured
  • Frame buffers are returned to the camera driver immediately after use to prevent memory leaks
  • Resolution and frame buffer count are reduced when memory is constrained, preventing crashes
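The borrow/use/return discipline at the heart of these strategies looks like this device-side sketch:

```cpp
// Device-side sketch: borrow a frame buffer, use it, return it promptly
camera_fb_t *fb = esp_camera_fb_get();   // borrow a frame buffer from the driver
if (fb) {
    // ... use fb->buf / fb->len here (e.g. send the JPEG to a client) ...
    esp_camera_fb_return(fb);            // return it so the driver can reuse it
}
```

Every esp_camera_fb_get() must be paired with exactly one esp_camera_fb_return(); holding a buffer across long operations stalls capture when fb_count is low.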

Face Detection Pipeline

Face detection is implemented as part of the frame processing workflow. Processing steps:

  • Capture a frame from the camera sensor
  • Prepare the image for detection (resize or convert the pixel format if necessary)
  • Run the face detection algorithm to locate faces in the frame
  • Identify face regions and, optionally, facial landmarks (eyes, nose, mouth)
  • Draw bounding boxes and landmarks on the image
  • Encode the processed image as JPEG
  • Send the processed image to the client as part of the video stream

Key notes:

  • Detection runs entirely on the ESP32-CAM; no external services or cloud processing are required
  • Performance depends on resolution and frame rate, since processing time grows with image size
  • Detection is most reliable at lower resolutions, and accuracy varies with image quality
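The steps above map onto the older esp-face API used by the classic CameraWebServer example. A simplified device-side sketch follows; the type and function names (dl_matrix3du_t, face_detect, mtmn_config) come from that legacy API, which newer ESP-IDF releases have since replaced, and draw_face_boxes is a helper defined in the example rather than a library call:

```cpp
// Device-side sketch of the detection pipeline (legacy esp-face API)
dl_matrix3du_t *image_matrix = dl_matrix3du_alloc(1, fb->width, fb->height, 3);
if (image_matrix) {
    // 1. Convert the JPEG frame to RGB888 for the detector
    fmt2rgb888(fb->buf, fb->len, fb->format, image_matrix->item);
    // 2. Run detection; mtmn_config holds the detector thresholds
    box_array_t *boxes = face_detect(image_matrix, &mtmn_config);
    if (boxes) {
        // 3. Draw bounding boxes and landmarks onto the RGB image
        draw_face_boxes(image_matrix, boxes, 0);
    }
    // 4. Re-encode the processed image as JPEG for streaming
    fmt2jpg(image_matrix->item, fb->width * fb->height * 3, fb->width, fb->height,
            PIXFORMAT_RGB888, 90, &_jpg_buf, &_jpg_buf_len);
    dl_matrix3du_free(image_matrix);
}
```

The RGB888 working buffer is width × height × 3 bytes, which is why detection is normally restricted to smaller frame sizes.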

Code Design Considerations

All long-running tasks are handled asynchronously:

  • The loop() function remains idle after setup
  • Network and camera operations run in separate tasks
  • The system prioritizes stability over high frame rates

This design ensures predictable behavior on resource-constrained hardware.

About the Author

Mayank Kulkarni - Founder of MKTechs & Zervista

Embedded Systems | Full-Stack | IoT | AI

https://mayank.wiki

Expert in embedded systems, IoT, and edge AI technologies. Specializing in full-stack development and innovative technology solutions.