On-Device AI: Running Llama on a Phone
Edge AI cuts inference costs by 80-90% for high-volume use cases and removes cloud round-trip latency. As models shrink and hardware improves, it becomes viable for an expanding set of enterprise applications.
Edge AI is the push to run models directly on phones, edge devices, and local hardware rather than in the cloud, driven by latency requirements, privacy concerns, and cost optimisation for high-volume inference.
| Source | Type | Items |
|---|---|---|
| The AI Podcast (NVIDIA) | Podcast | 1 |
| Exponential View (Azeem Azhar) | Newsletter | 1 |
We have quantised Llama 3 down to 4-bit precision and it runs at 30 tokens per second on a flagship Android device. The quality loss is surprisingly small -- maybe 2-3% on standard benchmarks. For most enterprise use cases like classification, summarisation, and extraction, this is more than adequate.
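As a rough sketch of how such a setup can be reproduced, assuming llama-cpp-python and a 4-bit (Q4_K_M) GGUF build of Llama 3 are available, the snippet below loads the quantised model and measures decode throughput. The file path, thread count, and prompt are illustrative placeholders, not details from the piece.

```python
# Minimal sketch: load a 4-bit GGUF quantisation of Llama 3 and measure
# decode throughput. Assumes llama-cpp-python is installed; the model
# path below is a hypothetical local file, not one from the source.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,      # context window
    n_threads=8,     # tune to the device's CPU core count
    verbose=False,
)

prompt = "Summarise: edge inference moves model execution onto the device."

start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s "
      f"({generated / elapsed:.1f} tokens/sec)")
```

The same load-and-decode loop applies on a phone, where llama.cpp is typically compiled for Android and driven through JNI or Termux; wall-clock tokens per second measured this way is how figures like the 30 tokens per second above are usually reported.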
My prediction: by 2028, more AI inference will run on edge devices than in the cloud. The economics are compelling -- once you amortise the device cost, edge inference is essentially free.
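To make the amortisation argument concrete, here is a back-of-the-envelope sketch. Every number in it is an illustrative assumption, not a figure from the source.

```python
# Back-of-the-envelope amortisation sketch. All values below are
# illustrative assumptions, not figures from the piece.
cloud_price_per_mtok = 5.00      # assumed $ per 1M tokens, hosted model
tokens_per_day = 1_000_000       # assumed high-volume workload
device_cost = 800.00             # assumed flagship device price, $
device_lifetime_days = 3 * 365   # assumed 3-year useful life

cloud_per_day = tokens_per_day / 1e6 * cloud_price_per_mtok
edge_per_day = device_cost / device_lifetime_days  # energy cost ignored

savings = 1 - edge_per_day / cloud_per_day
break_even_days = device_cost / cloud_per_day

print(f"cloud : ${cloud_per_day:.2f}/day")
print(f"edge  : ${edge_per_day:.2f}/day amortised")
print(f"saving: {savings:.0%}, break-even after ~{break_even_days:.0f} days")
```

Under these assumed numbers the saving lands in the 80-90% range cited above, and the case only strengthens when inference runs on hardware the user already owns.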