Status: Design spec complete. Prototype pending.

The Tiny Hermes Robot is a deliberately simple robotic platform. All reasoning, vision processing, navigation, and personality live on the Hermes backend. The robot itself is just sensors, motors, and a display — an I/O shell with zero onboard intelligence.

Architecture

  ┌──────────────────────┐
  │    HERMES BACKEND    │
  │  (Hermes Agent VM)   │
  │                      │
  │  • Vision (YOLO/CLIP)│
  │  • Path planning     │
  │  • Speech (STT/TTS)  │
  │  • Decision-making   │
  │  • Personality/LLM   │
  └──────┬───────────────┘
         │ WiFi / Tailscale
         │ WebSocket / MJPEG
  ┌──────┴───────────────┐
  │   ROBOT CLIENT       │  ← "Dumb" (no local AI)
  │  (ESP32-S3)          │
  │                      │
  │  • Camera stream     │
  │  • Motor drive I²C   │
  │  • OLED display      │
  │  • I²S mic + amp     │
  │  • Battery monitor   │
  └──────────────────────┘

Hardware BOM

Component    Part                                              Cost
MCU          ESP32-S3-DevKitC-1 (16MB flash, 8MB PSRAM)        ~$15
Camera       OV2640 (2MP) on FPC ribbon                        ~$8
Display      0.96″ OLED 128×64, SSD1306 I²C                    ~$4
Motors       N20 micro metal gearmotor (100:1, ~150 RPM) ×2    ~$12
Driver       DRV8833 dual H-bridge (I²C via PCA9685)           ~$5
Chassis      3D-printed PLA (2 pieces)                         ~$3
Battery      18650 Li-ion (3.7V 2600mAh) + TP4056 charger      ~$8
Audio        INMP441 I²S mic + MAX98357A amp + speaker         ~$8
LoRa (opt)   SX1262 + SMA antenna                              ~$15
Total                                                          ~$97 ($82 without LoRa)

Communication

Primary (WiFi): The robot streams MJPEG frames and JSON telemetry (battery, temp, RSSI, encoder position) over a WebSocket. Hermes sends motor commands, display updates, and TTS audio back down the same connection.
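The JSON side of this link could look roughly like the sketch below. It is only an illustration: the field names, the "type" tag, and the set of message types are assumptions, not a schema defined by this spec. It also assumes JSON travels as WebSocket text frames while video frames travel separately as binary frames.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical message types; the spec does not define a schema.
KNOWN_TYPES = {"telemetry", "motor", "display", "tts"}


@dataclass
class Telemetry:
    """Robot -> Hermes status message (field names are illustrative)."""
    battery_v: float     # battery voltage in volts
    temp_c: float        # board temperature in degrees Celsius
    rssi_dbm: int        # WiFi signal strength in dBm
    encoder_left: int    # cumulative encoder ticks, left wheel
    encoder_right: int   # cumulative encoder ticks, right wheel

    def to_json(self) -> str:
        return json.dumps({"type": "telemetry", **asdict(self)})


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def motor_command(left: float, right: float) -> str:
    """Hermes -> robot drive command; speeds are clamped to [-1.0, 1.0]."""
    return json.dumps({"type": "motor", "left": clamp(left), "right": clamp(right)})


def parse_message(raw: str) -> dict:
    """Decode a text frame and reject unknown message types."""
    msg = json.loads(raw)
    if msg.get("type") not in KNOWN_TYPES:
        raise ValueError(f"unknown message type: {msg.get('type')!r}")
    return msg
```

Tagging every message with a type lets one connection carry telemetry upstream and commands downstream without separate endpoints.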

Fallback (LoRa, optional): When WiFi is unavailable, the robot sends short status packets every 30s and receives waypoint commands. No video or audio is carried: the link is limited to status and commands at roughly 30 bytes/sec.
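At this data rate a fixed binary layout is more practical than JSON. The sketch below packs a status packet and a waypoint command with Python's struct module. The field selection, the type codes, and the byte layout are all hypothetical; the spec only says the packets are short.

```python
import struct

# Hypothetical little-endian layouts, not defined by the spec.
# Status:   uint8 type, uint16 battery mV, int8 temp C, int32 x mm, int32 y mm
# Waypoint: uint8 type, int32 x mm, int32 y mm
STATUS_FORMAT = "<BHbii"
WAYPOINT_FORMAT = "<Bii"
STATUS_TYPE = 0x01
WAYPOINT_TYPE = 0x02


def pack_status(battery_mv: int, temp_c: int, x_mm: int, y_mm: int) -> bytes:
    """Robot side: encode one status report."""
    return struct.pack(STATUS_FORMAT, STATUS_TYPE, battery_mv, temp_c, x_mm, y_mm)


def unpack_status(payload: bytes) -> dict:
    """Backend side: decode a status report, checking the type byte."""
    msg_type, battery_mv, temp_c, x_mm, y_mm = struct.unpack(STATUS_FORMAT, payload)
    if msg_type != STATUS_TYPE:
        raise ValueError(f"not a status packet: type {msg_type:#04x}")
    return {"battery_mv": battery_mv, "temp_c": temp_c, "x_mm": x_mm, "y_mm": y_mm}


def pack_waypoint(x_mm: int, y_mm: int) -> bytes:
    """Backend side: encode a waypoint command."""
    return struct.pack(WAYPOINT_FORMAT, WAYPOINT_TYPE, x_mm, y_mm)


def unpack_waypoint(payload: bytes) -> tuple:
    """Robot side: decode a waypoint command into (x_mm, y_mm)."""
    msg_type, x_mm, y_mm = struct.unpack(WAYPOINT_FORMAT, payload)
    if msg_type != WAYPOINT_TYPE:
        raise ValueError(f"not a waypoint packet: type {msg_type:#04x}")
    return (x_mm, y_mm)
```

With this layout a status packet is 12 bytes and a waypoint is 9 bytes, which leaves room for framing or a checksum within a small per-packet budget.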

Key Design Decisions

Why ESP32-S3, not RPi? It is instant-on and draws about 100mA when active, versus 500mA or more for a Raspberry Pi. The robot should be usable for hours, not minutes. There is also no Linux boot time and no SD card corruption risk.
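A back-of-the-envelope comparison using the figures above and the 2600mAh cell from the BOM is sketched below. It is an idealized estimate: it ignores motor current, regulator losses, and the boost conversion a 5V board would need, and the 80% usable-capacity factor is an assumption.

```python
def runtime_hours(capacity_mah: float, draw_ma: float, usable_fraction: float = 0.8) -> float:
    """Idealized runtime: usable capacity divided by average current draw.

    usable_fraction is an assumed derating for cutoff voltage and cell aging.
    """
    if draw_ma <= 0:
        raise ValueError("draw_ma must be positive")
    return capacity_mah * usable_fraction / draw_ma


CELL_MAH = 2600  # 18650 cell from the BOM

esp32_hours = runtime_hours(CELL_MAH, 100)  # ~100mA active draw
pi_hours = runtime_hours(CELL_MAH, 500)     # 500mA+ draw, so this is an upper bound

print(f"ESP32-S3 electronics: about {esp32_hours:.1f} h")
print(f"Raspberry Pi electronics: at most about {pi_hours:.1f} h")
```

Real runtime will be lower once the motors run, but the fivefold gap in electronics draw carries straight through to the estimate.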

Why I²C motor driver? It frees up GPIO pins. The PCA9685 PWM driver controls 16 channels over just two pins (SDA/SCL), leaving the rest for camera, display, audio, and sensors.
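The arithmetic behind driving a motor through the PCA9685 can be sketched as follows. The register base (LED0_ON_L at 0x06, four bytes per channel), the 12-bit resolution, and the full-on/full-off bits come from the PCA9685 register map. The sign-magnitude mapping onto the two H-bridge inputs is one common scheme and is an assumption here, as is the idea that the firmware would port this logic; the actual I²C write is left out.

```python
PCA9685_LED0_ON_L = 0x06  # first channel register; each channel uses 4 bytes
PWM_STEPS = 4096          # 12-bit counter
FULL_BIT = 0x10           # bit 4 of ON_H (full on) or OFF_H (full off)


def channel_register(channel: int) -> int:
    """Address of the ON_L register for a channel (0-15)."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be in 0..15")
    return PCA9685_LED0_ON_L + 4 * channel


def duty_to_bytes(duty: float) -> bytes:
    """Register bytes (ON_L, ON_H, OFF_L, OFF_H) for a duty cycle in [0, 1]."""
    duty = max(0.0, min(1.0, duty))
    off_count = int(duty * PWM_STEPS)
    if off_count >= PWM_STEPS:
        return bytes([0x00, FULL_BIT, 0x00, 0x00])  # always on
    if off_count == 0:
        return bytes([0x00, 0x00, 0x00, FULL_BIT])  # always off
    # Output goes high at count 0 and low at off_count.
    return bytes([0x00, 0x00, off_count & 0xFF, off_count >> 8])


def motor_duties(speed: float) -> tuple:
    """Map a signed speed in [-1, 1] to (IN1 duty, IN2 duty) for one H-bridge."""
    speed = max(-1.0, min(1.0, speed))
    return (speed, 0.0) if speed >= 0 else (0.0, -speed)
```

Two motors then need four of the sixteen channels: for each motor, write `duty_to_bytes(in1)` and `duty_to_bytes(in2)` starting at the matching `channel_register(...)` address.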

Why MJPEG, not H.264? The ESP32-S3 lacks a hardware H.264 encoder, and software encoding would consume most of the CPU. MJPEG at 640×480 is simple: the OV2640 compresses each frame to JPEG on-sensor, so the S3 only has to forward the bytes, and the backend handles decoding and analysis.
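Part of what makes MJPEG simple is that every frame is a standalone JPEG, delimited by the start-of-image (FF D8) and end-of-image (FF D9) markers. If the backend ever receives the video as a raw byte stream rather than one frame per message, a splitter along the lines below would recover the frames. The class name is hypothetical, and the sketch assumes frames carry no embedded thumbnails, which would contain markers of their own.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


class FrameSplitter:
    """Accumulates stream bytes and returns complete JPEG frames."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return every frame completed so far."""
        self._buf.extend(chunk)
        frames = []
        while True:
            start = self._buf.find(SOI)
            if start < 0:
                # No frame start yet. Keep only the last byte, which may be
                # the first half of a marker split across two chunks.
                del self._buf[:-1]
                break
            end = self._buf.find(EOI, start + 2)
            if end < 0:
                # Frame is still incomplete; drop any junk before its start.
                del self._buf[:start]
                break
            frames.append(bytes(self._buf[start:end + 2]))
            del self._buf[:end + 2]
        return frames
```

Each returned frame can be handed directly to a JPEG decoder, with no inter-frame state to track as there would be with H.264.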

Source Files

Next Steps