We are raising our maiden funding round
by Sirsho Chakraborty on February 06, 2025
Local LLM deployment on mid-range phones depends on model optimization (quantization, pruning, distillation) and resource management and scheduling (capping CPU usage, suspending LLM tasks during gaming or low-power states, and resuming on demand). Larger queries can be offloaded to the cloud for better performance. Region-specific LLMs can be updated via OTA updates that push only changed layers or embeddings to adapt to local culture or language, keeping model sizes manageable and improving personalization without requiring an entire new download.
Local models in the 1B–3B parameter range are generally feasible with 4-bit or 8-bit quantization, though RAM and speed constraints remain critical challenges. Larger models (7B+) often push device costs above $1000 and need specialized hardware (e.g., Snapdragon Elite, NPUs). Adapting models to local culture or language can be done via OTA updates that ship only the changed (differential) weights.
The main challenge is balancing memory footprint, battery/thermal constraints, and token generation speed while keeping the user experience seamless.
(A) Model Optimization Techniques
Converting the model weights from floating point (FP16/FP32) down to 8-bit or even 4-bit precision can significantly reduce memory footprint and inference latency. Some layers, especially embeddings and layer norms, can be kept in FP16, while the rest of the network is quantized to 8-bit or 4-bit to further reduce size and speed up computation.
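To make the mixed-precision idea concrete, below is a minimal sketch using PyTorch's dynamic quantization: only the Linear (attention/MLP) layers are converted to int8, while embeddings and layer norms keep their original precision. The TinyLlama checkpoint name is just an illustrative choice; for actual on-device deployment the quantized model would be exported to a mobile runtime (e.g., GGUF/llama.cpp or ExecuTorch).

```python
import torch
from transformers import AutoModelForCausalLM

# Load a small causal LM in full precision (the model name is illustrative).
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model.eval()

# Dynamically quantize only the Linear layers to int8; embeddings and
# layer norms are not in the spec, so they stay in higher precision.
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
```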
Pruning removes less critical weights/neuron connections from the model. A moderate pruning strategy (e.g., 10–30%) can reduce the overall parameter count, and hence memory usage, without severely impacting model quality. Alternatively, knowledge distillation can be used instead of direct pruning: a smaller "student" model is trained to mimic the performance of a larger "teacher" model.
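As a rough illustration, the sketch below uses PyTorch's built-in pruning utilities to zero out the smallest 20% of weights in every Linear layer; the 20% figure is an assumption within the 10–30% range mentioned above. Note that the pruned weights are still stored densely, so this needs to be combined with quantization or sparse kernels to actually shrink the on-device footprint.

```python
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # Zero the 20% of weights with the smallest magnitude (L1 criterion).
        prune.l1_unstructured(module, name="weight", amount=0.2)
        # Make the pruning permanent by removing the re-parametrization mask.
        prune.remove(module, "weight")
```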
(B) Resource Management & Scheduling
A starting point is capping AI usage at 50% of the total CPU cores. This cap can later be lowered or raised dynamically based on battery level, thermal headroom, or user activity.
For instance, if the phone is plugged in and relatively cool, allow higher usage. If the battery is below 20% or the device is overheating, cut LLM CPU usage drastically.
Having multiple run queues for system processes vs. AI tasks vs. user apps can help isolate and control resource usage. Modern Linux/Android schedulers let us set cgroup CPU limits or use taskset/priority-based approaches to keep resources balanced.
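A minimal sketch of such a policy is below, assuming the LLM runs in its own cgroup v2 group (the /sys/fs/cgroup/llm path is hypothetical) and that a privileged system service, not a regular app, does the writing; the exact thresholds are placeholders.

```python
CGROUP = "/sys/fs/cgroup/llm"   # hypothetical dedicated cgroup for the LLM process
PERIOD_US = 100_000             # standard 100 ms scheduling period


def llm_cpu_fraction(battery_pct: int, plugged_in: bool, overheating: bool) -> float:
    """Fraction of total CPU the LLM may use, following the policy above."""
    if overheating or battery_pct < 20:
        return 0.10   # cut LLM CPU usage drastically
    if plugged_in:
        return 0.80   # plugged in and cool: allow higher usage
    return 0.50       # default starting point: ~50% of CPU


def apply_cpu_cap(fraction: float, num_cores: int = 8) -> None:
    quota_us = int(PERIOD_US * fraction * num_cores)
    with open(f"{CGROUP}/cpu.max", "w") as f:
        f.write(f"{quota_us} {PERIOD_US}")   # cgroup v2 "quota period" format


apply_cpu_cap(llm_cpu_fraction(battery_pct=65, plugged_in=False, overheating=False))
```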
If user behavior or contextual cues indicate that a resource-intensive process like gaming is about to start, the system can proactively suspend or partially freeze the local LLM to free up CPU/GPU resources. Once the gaming session ends, the LLM can quickly resume to answer queries or perform background tasks.
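One way to sketch this, assuming the LLM runtime lives in its own process whose PID we track, is to freeze and thaw it with stop/continue signals; a production Android build would more likely use the kernel's cgroup freezer or the platform's own process lifecycle hooks.

```python
import os
import signal


def suspend_llm(pid: int) -> None:
    """Freeze the LLM process so its CPU/GPU share goes to the game."""
    os.kill(pid, signal.SIGSTOP)


def resume_llm(pid: int) -> None:
    """Thaw the LLM process once the gaming session ends."""
    os.kill(pid, signal.SIGCONT)
```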
When the phone is idle or the user is not actively engaging with any AI-related feature, the LLM stays in low-power or suspended mode to conserve battery. If an email arrives during this idle state and requires immediate action (e.g., a quick reply suggestion), the system can decide whether a smaller 1B model is sufficient for a short classification or text generation, or whether a larger 3B model should be activated for more complex reasoning.
When the user explicitly requests an AI function—such as drafting a detailed email reply—the LLM ramps up to the required performance level.
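This "smallest sufficient model" decision can be sketched as a simple routing function; the task categories and the 256-token threshold below are assumptions for illustration only.

```python
def pick_local_model(task: str, prompt_tokens: int) -> str:
    """Choose which on-device model to wake for a background or user request."""
    simple_tasks = {"classify", "quick_reply", "summarize_notification"}
    if task in simple_tasks and prompt_tokens < 256:
        return "local-1b"   # small model is enough for short classification/replies
    return "local-3b"       # wake the larger model for more complex reasoning
```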
(C) Connectivity and Cloud Offloading
Use the local (1B–3B parameter) LLM to handle common questions, immediate logic, or short queries. If the question or prompt is large and requires more context or advanced reasoning, the on-device LLM can detect this and call a larger cloud-based model (such as GPT-4 or DeepSeek R1).
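A minimal sketch of that routing decision is below. The word-count threshold and keyword check are crude stand-ins for a real complexity estimate, and the cloud endpoint URL is a placeholder; a real router would also weigh connectivity, latency, and privacy.

```python
import requests


def answer(prompt: str, local_llm, max_local_words: int = 512) -> str:
    # Crude heuristic: long prompts or explicit deep-analysis requests go to the cloud.
    needs_cloud = len(prompt.split()) > max_local_words or "analyze" in prompt.lower()
    if not needs_cloud:
        return local_llm(prompt)   # handle short/common queries on-device

    resp = requests.post(
        "https://api.example.com/v1/chat",   # placeholder cloud LLM endpoint
        json={"prompt": prompt},
        timeout=30,
    )
    return resp.json()["text"]
```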
Adapting the model to local culture or language is also a challenge. Region-specific fine-tuned weights, or simply new tokenizer/embedding layers for regional languages, can be pushed via OTA updates. Instead of shipping an entire new model each time, differential updates (shipping only the changed layers/weights) can be used.
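A differential update can be sketched as a simple diff between the old and new checkpoints, shipping only the tensors that changed (new tokenizer/embedding rows, region-specific fine-tuned layers) and overlaying them on the device's base model; the file names below are illustrative.

```python
import torch

# Server side: compute the delta between the base and the regionally fine-tuned model.
old_sd = torch.load("model_v1.pt", map_location="cpu")
new_sd = torch.load("model_v2_regional.pt", map_location="cpu")

delta = {
    name: tensor
    for name, tensor in new_sd.items()
    if name not in old_sd or not torch.equal(old_sd[name], tensor)
}
torch.save(delta, "ota_delta.pt")   # only the changed layers go over the air

# Device side: load the existing base model and overlay the delta.
device_sd = torch.load("model_v1.pt", map_location="cpu")
device_sd.update(torch.load("ota_delta.pt", map_location="cpu"))
```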
Below are the performance reports for a few models that we're considering and experimenting with. We can fine-tune the models based on the above strategies. The community is actively researching the same topic and how best to optimize these deployments.
(1) TinyLlama 1.1B
(2) Falcon-RW-1B
CPU and RAM Requirements (Android):
A 4-bit quantization of a 1B model typically requires ~1 GB of RAM.
An 8-bit quantization can require ~2 GB of RAM.
Speeds can vary from <1 token/sec up to a few tokens/sec depending on CPU/GPU acceleration.
Lightweight and easier to run on a wider range of devices.
(Tested on Android devices, Raspberry Pi 4)
(1) StableLM 3B
CPU and RAM Requirements (Android):
An 8-bit quantization takes ~3 GB of RAM.
A 4-bit quantization: ~1.5–2 GB.
Speeds can vary from <1 token/sec up to a few tokens/sec depending on CPU/GPU acceleration.
On an 8 GB RAM device with a decent CPU/GPU (Snapdragon® 888 or newer): 1–3 tokens/sec with a 4-bit quant model.
On Raspberry Pi 5 8GB, ARM Cortex A76: ~0.5–1 token/sec with a 4-bit quant model.
CPU and RAM Requirements (Android):
An 8-bit quantization requires ~3.5 GB of RAM.
A 4-bit quantization can require ~2 GB of RAM.
On an 8 GB RAM device with a decent CPU/GPU (Snapdragon® 888): 1–3 tokens/sec with a 4-bit quant model. Can be optimized with specialized GPU/DSP acceleration.
On Raspberry Pi 5 8GB, ARM Cortex A76: ~0.5–1 token/sec with a 4-bit quant model.
8-bit quant testing is still in progress.
(3) Other Models to be tested: GPT-3.5, Mixtral 8x7B
Llama v2 7B Chat Quantized
CPU and RAM Requirements (Android):
A 4-bit quantization of a 7B model typically requires ~4 GB of RAM.
An 8-bit quantization can require ~6–7 GB of RAM.
Supported Chipsets: Snapdragon® 8 Elite Mobile, Snapdragon® 8 Gen 3 Mobile, Snapdragon® X Elite
Adreno GPUs and Hexagon NPUs can significantly improve performance.
On a flagship Snapdragon device with Qualcomm’s AI SDK and hardware acceleration, real-time or near-real-time token generation (1–3 tokens/sec) is achievable.
Device cost will exceed $1000 to meet these requirements.
(We have yet to test and tinker with this specific model; this report is based on Qualcomm's published claims.)
Reference: https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized