How We Achieved 100% Accuracy in 97 KB on an ESP32
A deep dive into deploying a predictive maintenance model on a $4 microcontroller — no cloud, no GPU, no compromise.
The Challenge
Predictive maintenance is one of the most impactful applications of machine learning in industry. Detect a failing motor before it fails, and you save thousands in downtime. But the standard approach — streaming sensor data to the cloud for inference — introduces latency, connectivity dependencies, and recurring costs that make it impractical for most deployments.
We asked ourselves: can we run the entire inference pipeline on a $4 ESP32, using less than 100 KB of flash?
The Dataset
We used vibration sensor data from industrial motors — accelerometer readings sampled at 1 kHz across three axes. The task: classify the motor state into one of four categories:
- Normal — healthy operation
- Imbalance — rotor weight distribution issue
- Misalignment — shaft coupling problem
- Bearing fault — early-stage bearing degradation
The dataset contained 12,000 labeled windows of 256 samples each.
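To make the input format concrete, here is one way a labeled window could be laid out on-device. The struct, the field names, and the int16_t sample type are illustrative assumptions rather than the dataset's exact format:

```cpp
#include <cstdint>

// Illustrative only: one labeled vibration window as it might be
// represented on-device. Field names and types are assumptions.
enum class MotorState : uint8_t {
    Normal = 0,      // healthy operation
    Imbalance,       // rotor weight distribution issue
    Misalignment,    // shaft coupling problem
    BearingFault     // early-stage bearing degradation
};

struct VibrationWindow {
    // 256 consecutive samples per axis at 1 kHz -> 256 ms of signal
    int16_t samples[256][3];   // x, y, z accelerometer readings
    MotorState label;          // ground-truth class (training data only)
};
```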
Why Not Just Use TensorFlow Lite?
TensorFlow Lite for Microcontrollers (TFLite Micro) is a solid framework, but it comes with trade-offs. The runtime alone occupies 50-100 KB of flash depending on which operators you include. For a model that needs to fit in 97 KB total — runtime included — we needed a different approach.
Luviner uses a proprietary inference engine optimized for quantized arithmetic on ARM Cortex-M and Xtensa (ESP32) architectures. The runtime overhead is under 8 KB, leaving the remaining space entirely for model weights and application logic.
The Model Architecture
Instead of a conventional CNN or LSTM, we used an ultra-compact, proprietary neural network. This architecture offers three advantages on microcontrollers:
- Fixed memory footprint — no hidden states that grow with sequence length
- Temporal dynamics — the network adapts to input timing naturally, ideal for sensor data
- Quantization-friendly — the weight distributions are well-suited for integer arithmetic
The final model combines this compact backbone with a single classification head and a softmax output over the four motor states.
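Only the classification head is generic enough to sketch here: it turns four logits into class probabilities with a softmax and picks the most likely motor state. Below is a minimal float version for illustration; the deployed model runs an int8 equivalent inside the proprietary runtime.

```cpp
#include <cmath>
#include <cstddef>

constexpr size_t kNumClasses = 4;  // Normal, Imbalance, Misalignment, BearingFault

// Plain float softmax over the four class logits, followed by argmax.
// Illustrative only; not the quantized kernel used on-device.
size_t classify(const float logits[kNumClasses], float probs[kNumClasses]) {
    // Subtract the max logit for numerical stability before exponentiating.
    float max_logit = logits[0];
    for (size_t i = 1; i < kNumClasses; ++i) {
        if (logits[i] > max_logit) max_logit = logits[i];
    }
    float sum = 0.0f;
    for (size_t i = 0; i < kNumClasses; ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    // Normalize and track the most probable class.
    size_t best = 0;
    for (size_t i = 0; i < kNumClasses; ++i) {
        probs[i] /= sum;
        if (probs[i] > probs[best]) best = i;
    }
    return best;
}
```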
Quantization: From Float32 to Int8
The float32 model achieved 100% accuracy on the test set. The question was: would quantization destroy that accuracy?
We applied post-training quantization with calibration on 500 representative samples. The key insight: we quantize not just the weights, but also the activations, using per-channel scale factors that minimize the quantization error at each layer.
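As an illustration of per-channel scaling, the sketch below quantizes a weight matrix symmetrically to int8 with one scale per output channel. It is a simplification: the real calibration step also derives activation scales from the 500 representative samples, which is omitted here.

```cpp
#include <cmath>
#include <cstdint>

// Symmetric per-channel int8 quantization of a weight matrix.
// Each output channel gets its own scale so that its largest-magnitude
// weight maps to +/-127, keeping per-channel rounding error small.
void quantize_weights_per_channel(const float* weights,   // [channels * per_channel]
                                  int channels, int per_channel,
                                  int8_t* q_weights, float* scales) {
    for (int c = 0; c < channels; ++c) {
        const float* row = weights + c * per_channel;

        // Find the largest absolute weight in this channel.
        float max_abs = 1e-8f;  // avoid division by zero for all-zero channels
        for (int i = 0; i < per_channel; ++i) {
            max_abs = std::fmax(max_abs, std::fabs(row[i]));
        }
        scales[c] = max_abs / 127.0f;

        // Round each weight to the nearest int8 step and clamp.
        for (int i = 0; i < per_channel; ++i) {
            int q = static_cast<int>(std::lround(row[i] / scales[c]));
            if (q > 127) q = 127;
            if (q < -127) q = -127;
            q_weights[c * per_channel + i] = static_cast<int8_t>(q);
        }
    }
}
```

At inference time, each channel's output is rescaled by its stored scale factor, so the extra cost of per-channel quantization is a small table of floats rather than any additional arithmetic per weight.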
Result: 100% accuracy preserved after quantization. The model size dropped from 380 KB (float32) to 89 KB (int8) — a 4.3x reduction with zero accuracy loss.
Memory Layout on ESP32
Here is how the 97 KB breaks down on the ESP32:
- Inference runtime: 7.8 KB
- Model weights (int8): 89.2 KB
- Total flash: 97 KB
- RAM at inference: 4.1 KB (activations buffer)
The ESP32 has 4 MB of flash and 520 KB of SRAM. Our model uses 2.4% of the available flash and 0.8% of the RAM. This leaves plenty of room for the application firmware, connectivity stack, and sensor drivers.
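In practice this split falls out of normal ESP-IDF linking: const data is memory-mapped from flash, and a single statically allocated scratch buffer covers the inference-time RAM. The sketch below uses placeholder sizes and symbol names; the real weight blob would be emitted by the model conversion step.

```cpp
#include <cstdint>
#include <cstddef>

// Placeholder sizes matching the budget above (assumptions, not exact values).
constexpr size_t kWeightBytes     = 89 * 1024;        // int8 model weights
constexpr size_t kActivationBytes = 4 * 1024 + 128;   // ~4.1 KB scratch

// On ESP32, const data is placed in the flash-mapped .rodata segment,
// so the weights never occupy SRAM. In a real build the initializer
// comes from the model converter instead of being zero-filled.
const int8_t model_weights[kWeightBytes] = {};

// One statically allocated activation buffer, reused by every layer,
// accounts for the entire inference-time RAM footprint.
static int8_t activation_buffer[kActivationBytes];
```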
Inference Performance
On an ESP32 running at 240 MHz:
- Inference time: 1.2 ms per window
- Throughput: 833 predictions per second
- Power consumption: 12 mW during inference
For comparison, sending the same data to a cloud endpoint would take 50-200 ms (depending on connectivity), consume 100-500 mW for the radio transmission, and require a persistent internet connection.
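The on-device latency is straightforward to reproduce with ESP-IDF's microsecond timer (esp_timer_get_time()). In the sketch below, run_inference() is a stub standing in for whatever entry point your inference runtime exposes:

```cpp
#include <cstdint>
#include <cstdio>
#include "esp_timer.h"   // ESP-IDF microsecond timer

// Placeholder for the runtime's actual entry point; replace with the real call.
static int run_inference(const int16_t* /*window*/) { return 0; }

void benchmark_window(const int16_t* window) {
    int64_t start_us = esp_timer_get_time();
    int predicted_class = run_inference(window);
    int64_t elapsed_us = esp_timer_get_time() - start_us;

    // At 1.2 ms per window this prints roughly "class=... in 1200 us".
    printf("class=%d in %lld us\n", predicted_class,
           static_cast<long long>(elapsed_us));
}
```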
Try It Yourself
We have published an interactive simulation of this exact model running on a virtual ESP32. You can see the sensor readings, the inference results, and the classification output in real time.
Key Takeaways
- You do not need a GPU or cloud connection to run ML inference. A $4 microcontroller is enough.
- Quantization does not have to mean accuracy loss — with proper calibration, int8 models can match float32.
- The inference runtime matters as much as the model itself: a lean runtime leaves more flash for weights.
- Edge AI is not a future technology. It is production-ready today.
If you are building a product that needs on-device intelligence, Luviner can help. We handle the hard parts — model optimization, quantization, and deployment — so you can focus on your application.