Three Prompts: MediaPipe Computer Vision Demo

AI Assistant

MediaPipe face and body tracking demo

Introduction

Computer vision has revolutionized how machines perceive and interpret the visual world. Google's MediaPipe is an open-source framework that simplifies building multimodal (video, audio, time series) applied machine learning pipelines. In this project, we built a real-time face and body tracking application that demonstrates MediaPipe's capabilities in a browser environment.

The application allows users to toggle between face mesh tracking, body pose tracking, or both simultaneously. It visualizes the tracking results with color-coded overlays on a live camera feed, all running in real-time directly in the browser without requiring any server-side processing.

The Three Prompts

Here are the exact prompts that were used to create this project:

Prompt 1

@projects/media-pipe-test 

Make a sample application of media pipe, allowing for multiple models, body, and face

The test should demostrate the abilities of camera tracking

Prompt 2

This error pops up when switching to the body pose. Also make it possible to turn both on at the same time



VM1419 pose_solution_simd_wasm_bin.js:9 Uncaught (in promise) RuntimeError: Aborted(Module.arguments has been replaced with plain arguments_ (the initial value can be provided on Module, but after startup the value is only looked for on a local variable of that name))
    at abort (VM1419 pose_solution…wasm_bin.js:9:17640)
    at Object.get (VM1419 pose_solution…_wasm_bin.js:9:7759)
    at pose_solution_simd_wasm_bin.js:9:5881

Prompt 3

Now update the UI to be more structured, keep the midia pipe logic identical to how it works now. Make sure to handle the colors properly

Technical Implementation

MediaPipe Integration

The application integrates two MediaPipe models:

  1. Face Mesh: Detects and tracks 468 facial landmarks in real-time, providing precise mapping of facial features.
  2. Pose: Identifies and tracks 33 body landmarks, enabling full-body pose estimation.

Both models run entirely in the browser using WebAssembly (WASM), which provides near-native performance for computationally intensive tasks like real-time computer vision.
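
For orientation, here is a minimal initialization sketch, assuming the @mediapipe/face_mesh and @mediapipe/pose npm packages with their WASM assets loaded from the jsDelivr CDN via locateFile; the option values are illustrative rather than copied from the project.

import { FaceMesh } from '@mediapipe/face_mesh';
import { Pose } from '@mediapipe/pose';

// Fetch each model's WASM and data files from a CDN at runtime
const faceMesh = new FaceMesh({
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`,
});
faceMesh.setOptions({
  maxNumFaces: 1,
  refineLandmarks: true,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
});

const pose = new Pose({
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/pose/${file}`,
});
pose.setOptions({
  modelComplexity: 1,
  smoothLandmarks: true,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
});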

Key Technical Challenges

Challenge 1: WASM Module Conflicts

The first version of the application encountered runtime errors when switching between models. This occurred because MediaPipe's WASM modules don't always clean up gracefully when destroyed and recreated.

Solution: Instead of destroying and recreating models on each toggle, we initialized both models once and kept them alive throughout the application lifecycle. This approach eliminated the WASM conflicts and allowed for simultaneous model usage.

// Initialize models once and keep them alive
useEffect(() => {
  if (!faceMeshRef.current) {
    const faceMesh = new FaceMesh({/*...*/});
    // Configuration...
    faceMeshRef.current = faceMesh;
  }
  if (!poseRef.current) {
    const pose = new Pose({/*...*/});
    // Configuration...
    poseRef.current = pose;
  }
  // Camera setup...
}, []);

Challenge 2: Simultaneous Model Processing

Enabling both models simultaneously required careful management of the camera feed and processing pipeline.

Solution: We implemented a system where each frame from the camera is conditionally sent to active models based on user toggles. Results from each model are stored separately and then combined during rendering.

onFrame: async () => {
  const img = videoRef.current;
  if (faceEnabled) await faceMeshRef.current.send({ image: img });
  else lastResultsRef.current.face = null;
  if (poseEnabled) await poseRef.current.send({ image: img });
  else lastResultsRef.current.pose = null;
  drawResults(
    img,
    faceEnabled ? lastResultsRef.current.face : null,
    poseEnabled ? lastResultsRef.current.pose : null
  );
}
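
The caching and drawing side of this is sketched below. It assumes onResults callbacks that store each model's latest output in lastResultsRef, and a possible drawResults implementation built on the helpers from @mediapipe/drawing_utils and the connection constants each model package exports; the canvas ref, colors, and line widths are illustrative, not taken from the project.

import { drawConnectors, drawLandmarks } from '@mediapipe/drawing_utils';
import { FACEMESH_TESSELATION } from '@mediapipe/face_mesh';
import { POSE_CONNECTIONS } from '@mediapipe/pose';

// Cache the latest results from each model as they arrive
faceMeshRef.current.onResults((results) => { lastResultsRef.current.face = results; });
poseRef.current.onResults((results) => { lastResultsRef.current.pose = results; });

// Draw the current frame, then overlay whichever results are active
function drawResults(img, faceResults, poseResults) {
  const canvas = canvasRef.current;
  const ctx = canvas.getContext('2d');
  ctx.save();
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.drawImage(img, 0, 0, canvas.width, canvas.height);

  if (faceResults?.multiFaceLandmarks) {
    for (const landmarks of faceResults.multiFaceLandmarks) {
      drawConnectors(ctx, landmarks, FACEMESH_TESSELATION, { color: '#00FFFF', lineWidth: 1 });
    }
  }
  if (poseResults?.poseLandmarks) {
    drawConnectors(ctx, poseResults.poseLandmarks, POSE_CONNECTIONS, { color: '#98FF98', lineWidth: 2 });
    drawLandmarks(ctx, poseResults.poseLandmarks, { color: '#98FF98', radius: 3 });
  }
  ctx.restore();
}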

UI Design

The user interface was designed to be clean, intuitive, and visually appealing (a simplified sketch of the controls follows this list):

  • Color-coded controls: Each tracking model has a distinct color (aqua for Face Mesh, mint for Pose)
  • Toggle switches: Simple checkboxes allow enabling/disabling each model independently
  • Visual feedback: The canvas displays real-time tracking results with color-matched overlays
  • Responsive container: The video feed is displayed in a contained card with appropriate styling
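
A simplified version of the control markup might look like the following; faceEnabled and poseEnabled come from the earlier snippets, while the setters, layout, and exact color values are illustrative placeholders rather than the project's actual styles.

// Hypothetical toggle controls; faceEnabled/poseEnabled are React state
<div className="controls">
  <label style={{ color: '#00FFFF' }}>
    <input
      type="checkbox"
      checked={faceEnabled}
      onChange={(e) => setFaceEnabled(e.target.checked)}
    />
    Face Mesh
  </label>
  <label style={{ color: '#98FF98' }}>
    <input
      type="checkbox"
      checked={poseEnabled}
      onChange={(e) => setPoseEnabled(e.target.checked)}
    />
    Pose
  </label>
</div>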

Key Takeaways

1. WebAssembly Performance

MediaPipe's use of WebAssembly enables complex computer vision tasks to run efficiently in the browser. This eliminates the need for server-side processing and reduces latency, making real-time applications possible.

2. React Component Architecture

The application demonstrates how to structure React components that integrate with complex third-party libraries. By using refs and careful effect management, we maintain clean component lifecycles while working with imperative APIs.
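
The earlier snippets assume a small set of refs and state hooks declared inside the component; a minimal sketch of those declarations is shown below. The names match the snippets above, while canvasRef and the initial toggle values are assumptions.

import { useEffect, useRef, useState } from 'react';

// Refs hold the imperative MediaPipe objects and DOM elements across renders
const videoRef = useRef(null);
const canvasRef = useRef(null);
const faceMeshRef = useRef(null);
const poseRef = useRef(null);
const lastResultsRef = useRef({ face: null, pose: null });

// Plain state drives the toggles and re-renders the controls
const [faceEnabled, setFaceEnabled] = useState(true);
const [poseEnabled, setPoseEnabled] = useState(false);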

3. Error Handling in ML Applications

Machine learning libraries like MediaPipe can introduce unique runtime challenges. The project shows how to identify and resolve WASM-related issues through careful initialization and state management.

Potential Extensions

This project could be extended in several interesting ways:

  • Additional MediaPipe models: Integrating hand tracking, object detection, or segmentation
  • Recording capabilities: Saving video clips with overlay visualizations
  • Gesture recognition: Using the tracked landmarks to recognize specific gestures or poses (see the sketch after this list)
  • Interactive elements: Adding virtual objects that respond to detected body movements
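
As a taste of the last two ideas, a "hands raised" check could be built directly on the Pose landmarks already being tracked. The sketch below uses MediaPipe Pose's landmark indices (11/12 for the shoulders, 15/16 for the wrists); the helper name and usage are hypothetical.

// Hypothetical helper: landmark y grows downward, so a raised wrist has a smaller y than the shoulder
function areHandsRaised(poseLandmarks) {
  if (!poseLandmarks) return false;
  const leftShoulder = poseLandmarks[11];
  const rightShoulder = poseLandmarks[12];
  const leftWrist = poseLandmarks[15];
  const rightWrist = poseLandmarks[16];
  return leftWrist.y < leftShoulder.y && rightWrist.y < rightShoulder.y;
}

// Example usage inside drawResults:
// if (areHandsRaised(poseResults?.poseLandmarks)) console.log('Hands raised!');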

Conclusion

This MediaPipe demo showcases how modern web applications can leverage advanced computer vision capabilities directly in the browser. With just three prompts, we built a functional application that demonstrates real-time face and body tracking with an intuitive user interface.

The project serves as an excellent starting point for exploring MediaPipe's capabilities and building more complex computer vision applications for the web.