Comment by sagebird
5 hours ago
I am not an expert, but it seems like model distillation could get the behavior you need running on a cheap end-user processor (Raspberry Pi 4/5 class). I chatted with Claude Opus about your project and got the following advice:
For the compute problem, you don't need a Jetson. The approach you want is knowledge distillation: train a large, expensive teacher model offline on a beefy GPU (cloud instance, your laptop's GPU, whatever), then distill it down into a tiny student network like a MobileNetV3-Small or EfficientNet-Lite. Quantize that student to int8 and export it to TFLite. The resulting model is 2-3 MB and runs at 10-20 FPS on a Raspberry Pi 4/5 with just the CPU, no ML accelerator needed. If you want to go even cheaper, an ESP32-S3 with a camera module can run sub-500KB models for simpler tasks.

The preprocessing is trivial: resize the camera frame to 224x224, normalize pixel values, feed the tensor to the TFLite interpreter. The CNN learns its own feature extraction internally, so you don't need any classical CV preprocessing.

Looking at your observations, though, I think the deeper issue is what you identified: there's not enough signal in single frames. Your validation loss failing to converge even after augmentation and ImageNet pretraining confirms this. The fix is exactly what you listed in your future work: feed stacked temporal frames instead of single images. A simple approach is to concatenate 3-4 consecutive grayscale frames into a multi-channel input (e.g., 224x224x4). This gives the network implicit motion, velocity, and approach-rate information without needing to compute optical flow explicitly. It's the same trick DeepMind used in the original Atari DQN paper: a single frame of Pong doesn't tell you which direction the ball is moving either.

On the action space: your intuition about STOP being problematic is right. It creates a degenerate attractor, because once the model predicts STOP there's no recovery mechanism. The paper you referenced that only uses STOP at goal-reached is the better design.
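To make the preprocessing and frame-stacking concrete, here's a minimal NumPy sketch. The resize is naive nearest-neighbor purely for illustration (in practice you'd use cv2.resize), and the `FrameStacker` class and `STACK`/`SIZE` constants are hypothetical names, not from your project:

```python
import numpy as np
from collections import deque

STACK = 4   # number of consecutive frames per input (assumed)
SIZE = 224  # model input resolution (assumed)

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Resize a grayscale frame to SIZE x SIZE (nearest-neighbor)
    and normalize pixel values to [0, 1]."""
    h, w = frame.shape
    rows = np.arange(SIZE) * h // SIZE
    cols = np.arange(SIZE) * w // SIZE
    resized = frame[rows[:, None], cols[None, :]]
    return resized.astype(np.float32) / 255.0

class FrameStacker:
    """Keeps the last STACK preprocessed frames and exposes them
    as one (SIZE, SIZE, STACK) tensor for the network."""
    def __init__(self):
        self.buf = deque(maxlen=STACK)

    def push(self, frame: np.ndarray) -> np.ndarray:
        f = preprocess(frame)
        while len(self.buf) < STACK - 1:  # pad buffer at startup
            self.buf.append(f)
        self.buf.append(f)
        return np.stack(self.buf, axis=-1)  # shape (224, 224, 4)

stacker = FrameStacker()
for _ in range(5):
    fake_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    x = stacker.push(fake_frame)
print(x.shape)  # (224, 224, 4)
```

The multi-channel output plugs straight into a TFLite interpreter in place of the single-frame input; the rest of the inference loop stays the same.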
Also consider that TURN_CW and TURN_CCW have no obvious visual signal in a single frame (which way to turn is a function of where you've been and where you're going, not just what you see right now), which is another reason temporal stacking or a small recurrent/memory component would help. Even a simple LSTM, or a state tuple fed alongside the image, could encode "I've been turning left for 3 steps, maybe try something else."

For the longer term, consider a hybrid architecture: use the distilled neural net for obstacle detection and free-space classification, but pair it with classical SLAM or even simple odometry-based mapping for path planning and coverage. Pure end-to-end behavior cloning for the full navigation stack is a hard problem; even the commercial robots use learned perception with algorithmic planning. Your data collection would get easier too, because you'd only need to label "what's in front of me" rather than "what should I do," which decouples perception from decision-making and makes each piece easier to train and debug independently.
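The perception/planning split can be sketched as two decoupled pieces: a perception function you'd train (stubbed out here with a fake brightness heuristic standing in for the distilled model) and a plain-Python planner on top. `free_space_probs`, `plan`, and the left/center/right framing are all hypothetical illustrations, not your actual interface:

```python
import numpy as np

ACTIONS = ["FORWARD", "TURN_CCW", "TURN_CW"]  # note: no STOP in the loop

def free_space_probs(frame_stack: np.ndarray) -> np.ndarray:
    """Stand-in for the learned perception model: returns P(free space)
    for the left / center / right thirds of the view. The real version
    would be a TFLite model call; this fake uses mean brightness."""
    w = frame_stack.shape[1]
    thirds = np.split(frame_stack[:, : w - w % 3], 3, axis=1)
    return np.array([t.mean() for t in thirds])

def plan(probs: np.ndarray, threshold: float = 0.5) -> str:
    """Algorithmic layer: go forward if the center looks free,
    otherwise turn toward the freer side."""
    left, center, right = probs
    if center >= threshold:
        return "FORWARD"
    return "TURN_CCW" if left > right else "TURN_CW"

stack = np.random.rand(224, 224, 4).astype(np.float32)
action = plan(free_space_probs(stack))
assert action in ACTIONS
```

The point is the seam: you can unit-test `plan` with hand-written probability vectors and evaluate the perception model against labeled frames, independently, which is exactly the debugging advantage of not training end-to-end.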