Taking a look at Gemma 4 E2B (QAT)
A continuation of the Gemma 3n E2B benchmarks, now with faster hardware and models.
LiteRT-LM model: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/blob/main/gemma-4-E2B-it.litertlm
llama.cpp Q4_0 model: https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B_q4_0-it.gguf
We'll denote the hardware as follows:
| ID | Info |
|---|---|
| CPU 1 | Dimensity 9000+ (A710 = 1.8 GHz, X2 = 2.55 GHz) |
| CPU 1.1 | Dimensity 9000+ (max freq) |
| CPU 2 | i7-10750H (2.6 GHz) |
| CPU 2.1 | i7-10750H (max freq, ~3.4 GHz) |
| GPU 1 | Mali-G710 MP10 (max freq) |
| GPU 2 | GTX 1660 Ti (max freq) |
| GPU 2.1 | GTX 1660 Ti (1.455 GHz) |
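Throughput is reported in tokens per second (TPS), averaged over 5 runs. In the tables, ppN is prompt processing of N tokens, tgN is generation of N tokens, and @ dN means the test starts at a context depth of N tokens.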
Configuration:
LiteRT-LM: enable-speculative-decoding = false
llama.cpp: ub = 1024, threads = n_cores/2
For CPU 1, llama.cpp is configured as follows: -DGGML_NATIVE=off -DGGML_CPU_ARM_ARCH=native+dotprod+i8mm+nosve
For GPU 2, llama.cpp is built from https://github.com/pt13762104/llama.cpp/commit/c5914bbd918022518bb5c0645bcdb24e3bc404f2 with GGML_CUDA_NO_TURING_MMA enabled.
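The llama.cpp settings above map naturally onto a llama-bench invocation. The sketch below assumes llama-bench was the tool used; the post does not say so explicitly, although labels such as "pp1024 @ d4096" match its output format. The helper name, model path and core count are placeholders for illustration.

```python
import shlex


def llama_bench_command(model_path, n_cores, depths=(0, 4096)):
    """Build a llama-bench argument list mirroring the settings listed above.

    Assumption: llama-bench is the benchmark tool. model_path and n_cores
    are supplied by the caller and are not taken from the post.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-p", "1024",                            # pp1024: 1024-token prompt
        "-n", "256",                             # tg256: 256 generated tokens
        "-d", ",".join(str(d) for d in depths),  # context depths to test
        "-ub", "1024",                           # ub = 1024
        "-t", str(max(1, n_cores // 2)),         # threads = n_cores/2
        "-r", "5",                               # 5 repetitions per test
    ]


if __name__ == "__main__":
    # Hypothetical example: an 8-core device and a local copy of the GGUF file.
    print(shlex.join(llama_bench_command("gemma-4-E2B_q4_0-it.gguf", n_cores=8)))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues.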
Benchmarks:
LiteRT-LM
| Benchmark | CPU 1 | CPU 1.1 | CPU 2 | CPU 2.1 |
|---|---|---|---|---|
| pp1024 | 151.98 | 184.57 | 153.96 | 211.37 |
| tg256 @ d1024 | 15.13 | 16.07 | 17.23 | 20.45 |
Comment: The power efficiency of CPU 1 is poor. TG is inherently memory-bottlenecked. CPU 2 scales much better with clock frequency.
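To put numbers on the scaling remark, here is a small sketch that computes the relative gain from the base-clock to the max-frequency configuration of each CPU. The throughput values are copied from the LiteRT-LM CPU table above; the function and variable names are my own.

```python
# LiteRT-LM CPU results copied from the table above (tokens per second).
RESULTS = {
    "pp1024": {"CPU 1": 151.98, "CPU 1.1": 184.57, "CPU 2": 153.96, "CPU 2.1": 211.37},
    "tg256 @ d1024": {"CPU 1": 15.13, "CPU 1.1": 16.07, "CPU 2": 17.23, "CPU 2.1": 20.45},
}


def gain_percent(base, boosted):
    """Relative throughput gain, in percent, when going from base to boosted."""
    return (boosted / base - 1.0) * 100.0


def scaling_table(results):
    """Per-benchmark gain for each CPU when moving to its max-frequency variant."""
    rows = {}
    for bench, tps in results.items():
        rows[bench] = {
            "CPU 1 -> CPU 1.1": gain_percent(tps["CPU 1"], tps["CPU 1.1"]),
            "CPU 2 -> CPU 2.1": gain_percent(tps["CPU 2"], tps["CPU 2.1"]),
        }
    return rows


if __name__ == "__main__":
    for bench, gains in scaling_table(RESULTS).items():
        for step, pct in gains.items():
            print(f"{bench:14s} {step}: +{pct:.1f}%")
```

On these numbers, CPU 1 gains roughly 21% in prompt processing and 6% in decode, against roughly 37% and 19% for CPU 2.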
| Benchmark | GPU 1 | GPU 2 | GPU 2.1 |
|---|---|---|---|
| pp1024 | 874.90 | 4877.41 | 4411.52 |
| tg256 @ d1024 | 19.16 | 98.96 | 79.00 |
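Comment: The GTX 1660 Ti is clearly ALU-bottlenecked in the decode phase: at 1.455 GHz (GPU 2.1), TG drops by about 20%, versus about 10% for PP. The Mali-G710 MP10 is much faster than any of the CPUs in prompt processing (over 4x), though its decode speed is in the same range.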
llama.cpp Q4_0
| Benchmark | CPU 1 | CPU 1.1 | CPU 2 | CPU 2.1 |
|---|---|---|---|---|
| pp1024 | 80.67 | 103.53 | 71.26 | 84.24 |
| tg256 | 15.10 | 17.87 | 17.82 | 20.48 |
| pp1024 @ d4096 | 42.99 | 53.76 | 54.26 | 64.25 |
| tg256 @ d4096 | 12.68 | 14.01 | 14.88 | 17.29 |
Comment: Slightly better frequency scaling than LiteRT-LM on CPU 1, but prompt processing is far slower on both CPUs. TG is roughly on par, though it is measured at a different depth.
| Benchmark | GPU 2 | GPU 2.1 |
|---|---|---|
| pp1024 | 2165.55 | 1860.03 |
| tg256 | 118.57 | 102.95 |
| pp1024 @ d4096 | 1496.00 | 1311.85 |
| tg256 @ d4096 | 112.58 | 99.38 |
Comment: Decode is faster and degrades less at the lower clock than with LiteRT-LM, but prompt processing falls well short of LiteRT-LM.
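The prompt-processing gap between the two runtimes can be summarized per device with a short sketch. The pp1024 values are copied from the tables above (both measured without a depth offset); the names are my own. Keep in mind that the two model files use different quantization, so this compares end-to-end setups rather than the runtimes alone.

```python
# pp1024 throughput (tokens per second) copied from the tables above.
LITERT_PP1024 = {
    "CPU 1": 151.98, "CPU 1.1": 184.57, "CPU 2": 153.96, "CPU 2.1": 211.37,
    "GPU 1": 874.90, "GPU 2": 4877.41, "GPU 2.1": 4411.52,
}
LLAMA_CPP_PP1024 = {
    "CPU 1": 80.67, "CPU 1.1": 103.53, "CPU 2": 71.26, "CPU 2.1": 84.24,
    "GPU 2": 2165.55, "GPU 2.1": 1860.03,
}


def speedup(litert, llama_cpp):
    """LiteRT-LM throughput divided by llama.cpp throughput, per shared device.

    Devices missing from either table (here GPU 1, which has no llama.cpp
    result) are skipped.
    """
    return {hw: litert[hw] / llama_cpp[hw] for hw in litert if hw in llama_cpp}


if __name__ == "__main__":
    for hw, ratio in speedup(LITERT_PP1024, LLAMA_CPP_PP1024).items():
        print(f"{hw:8s} LiteRT-LM is {ratio:.2f}x llama.cpp in pp1024")
```

Across the six shared configurations the ratio falls between roughly 1.8x and 2.5x in LiteRT-LM's favor.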
Conclusion: LiteRT-LM is the first viable option for phone inference. If you want to burn your hand while enjoying (much) slower inference, llama.cpp (on the phone) is probably for you.
It should be noted that the LiteRT-LM model is optimized for phone inference*, and should not be considered a replacement for the llama.cpp Q4_0 model.
*Note that this is a 2-bit, not a 4-bit, model. Accuracy may be evaluated another time, e.g. on an RTX 3080 when I have time.