Taking a look at Gemma 4 E2B (QAT)

A continuation of the Gemma 3n E2B benchmarks, now with faster hardware and models.

LiteRT-LM model: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/blob/main/gemma-4-E2B-it.litertlm

llama.cpp Q4_0 model: https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B_q4_0-it.gguf

We'll denote the hardware as follows:

ID	Info
CPU 1	Dimensity 9000+ (A710 = 1.8 GHz, X2 = 2.55 GHz)
CPU 1.1	Dimensity 9000+ (max freq)
CPU 2	i7-10750H (2.6 GHz)
CPU 2.1	i7-10750H (max freq, ~3.4 GHz)
GPU 1	Mali-G710 MP10 (max freq)
GPU 2	GTX 1660 Ti (max freq)
GPU 2.1	GTX 1660 Ti (1.455 GHz)

All results are in tokens per second (TPS), averaged over 5 runs.

Configuration:

LiteRT-LM: enable-speculative-decoding = false

llama.cpp: ubatch size (ub) = 1024, threads = n_cores/2

For CPU 1, llama.cpp is built with the following CMake flags: -DGGML_NATIVE=off -DGGML_CPU_ARM_ARCH=native+dotprod+i8mm+nosve

For GPU 2, llama.cpp is built from https://github.com/pt13762104/llama.cpp/commit/c5914bbd918022518bb5c0645bcdb24e3bc404f2 with GGML_CUDA_NO_TURING_MMA enabled.
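The llama.cpp settings above can be sketched as a llama-bench invocation. This is only an illustration: it assumes the numbers were collected with llama-bench (the test names pp1024 and tg256 @ d4096 follow its naming), the helper name llama_bench_args is hypothetical, and n_cores is taken to mean the total core count, which may not match the author's setup.

```python
import os


def llama_bench_args(model_path, n_cores=None, depths=(0, 4096)):
    """Build an argv list for llama-bench (hypothetical helper, not from the post)."""
    if n_cores is None:
        # os.cpu_count() reports logical CPUs and may return None.
        n_cores = os.cpu_count() or 2
    threads = max(1, n_cores // 2)  # threads = n_cores/2, as stated above
    return [
        "llama-bench",
        "-m", model_path,
        "-p", "1024",                             # prompt processing test (pp1024)
        "-n", "256",                              # token generation test (tg256)
        "-d", ",".join(str(d) for d in depths),   # context depths (d0 and d4096)
        "-ub", "1024",                            # ubatch size
        "-t", str(threads),
        "-r", "5",                                # 5 repetitions per test
    ]


# Example: the Dimensity 9000+ has 8 cores, so this yields 4 threads.
args = llama_bench_args("gemma-4-E2B_q4_0-it.gguf", n_cores=8)
print(" ".join(args))
```

The list form can be passed directly to subprocess.run on a machine where llama-bench is on the PATH.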

Benchmarks:

LiteRT-LM

Benchmark	CPU 1	CPU 1.1	CPU 2	CPU 2.1
pp1024	151.98	184.57	153.96	211.37
tg256 @ d1024	15.13	16.07	17.23	20.45

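Comment: Power efficiency of the CPU is poor. TG is inherently memory bottlenecked, which limits how much it gains from higher clocks. CPU 2 scales much better with frequency: going from CPU 2 to CPU 2.1 raises pp1024 by about 37% and tg256 by about 19%, versus about 21% and 6% going from CPU 1 to CPU 1.1.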

Benchmark	GPU 1	GPU 2	GPU 2.1
pp1024	874.90	4877.41	4411.52
tg256 @ d1024	19.16	98.96	79.00

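Comment: The GTX 1660 Ti is clearly ALU bottlenecked in the decode phase: capping the clock at 1.455 GHz (GPU 2.1) costs about 20% of decode throughput but only about 10% of prompt processing throughput. The Mali-G710 MP10 is really fast compared to the CPUs, particularly in prompt processing (874.90 vs. at most 211.37 TPS).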
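The scaling remarks on the LiteRT-LM tables can be checked with a few lines of arithmetic. The sketch below copies the TPS values from the tables and prints the relative gain from the lower-clocked to the higher-clocked configuration; the helper name gain_percent is only illustrative.

```python
# TPS values copied from the LiteRT-LM tables: (lower clock, higher clock).
results = {
    "CPU 1 -> CPU 1.1": {"pp1024": (151.98, 184.57), "tg256 @ d1024": (15.13, 16.07)},
    "CPU 2 -> CPU 2.1": {"pp1024": (153.96, 211.37), "tg256 @ d1024": (17.23, 20.45)},
    "GPU 2.1 -> GPU 2": {"pp1024": (4411.52, 4877.41), "tg256 @ d1024": (79.00, 98.96)},
}


def gain_percent(low, high):
    """Relative throughput gain, in percent, when moving from low to high."""
    return (high / low - 1.0) * 100.0


for pair, tests in results.items():
    for test, (low, high) in tests.items():
        print(f"{pair:18s} {test:14s} {gain_percent(low, high):+6.1f}%")
```

On the GTX 1660 Ti the decode gain is larger than the prompt processing gain, which fits the observation that decode there tracks the core clock rather than memory bandwidth.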

llama.cpp Q4_0

Benchmark	CPU 1	CPU 1.1	CPU 2	CPU 2.1
pp1024	80.67	103.53	71.26	84.24
tg256	15.10	17.87	17.82	20.48
pp1024 @ d4096	42.99	53.76	54.26	64.25
tg256 @ d4096	12.68	14.01	14.88	17.29

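Comment: Slightly better frequency scaling than LiteRT-LM on CPU 1, but clearly worse performance: pp1024 reaches only between about 40% and 56% of the LiteRT-LM throughput across the four CPU configurations.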

Benchmark	GPU 2	GPU 2.1
pp1024	2165.55	1860.03
tg256	118.57	102.95
pp1024 @ d4096	1496.00	1311.85
tg256 @ d4096	112.58	99.38

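Comment: Better decode performance than LiteRT-LM (118.57 vs. 98.96 TPS on GPU 2, though measured at different context depths), but it falls well short in prompt processing (2165.55 vs. 4877.41 TPS for pp1024).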
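To put the two runtimes side by side, the TPS figures can be turned into a rough wall-clock estimate for one request. The sketch below assumes a 1024-token prompt followed by 256 generated tokens and simply adds the two phases. It ignores model load time and the fact that the LiteRT-LM decode figure was measured at d1024 while the llama.cpp tg256 figure was measured at depth 0, so treat the result as approximate. The function name request_seconds is illustrative.

```python
def request_seconds(pp_tps, tg_tps, prompt_tokens=1024, new_tokens=256):
    """Estimated seconds to process the prompt and then generate new tokens."""
    return prompt_tokens / pp_tps + new_tokens / tg_tps


# (pp1024 TPS, tg256 TPS) copied from the tables above.
runs = {
    ("LiteRT-LM", "CPU 1"): (151.98, 15.13),
    ("llama.cpp", "CPU 1"): (80.67, 15.10),
    ("LiteRT-LM", "CPU 2.1"): (211.37, 20.45),
    ("llama.cpp", "CPU 2.1"): (84.24, 20.48),
    ("LiteRT-LM", "GPU 2"): (4877.41, 98.96),
    ("llama.cpp", "GPU 2"): (2165.55, 118.57),
}

for (runtime, hardware), (pp, tg) in runs.items():
    print(f"{hardware:8s} {runtime:10s} {request_seconds(pp, tg):6.2f} s")
```

Changing prompt_tokens and new_tokens shows how the balance shifts: long prompts favor the runtime with faster prompt processing, while long generations favor the one with faster decode.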

Conclusion: LiteRT-LM is the first viable option for phone inference. If you want to burn your hand while enjoying (much) slower inference, llama.cpp (on the phone) is probably for you.

It should be noted that the LiteRT-LM model is optimized for phone inference*, and should not be considered a replacement for the llama.cpp Q4_0 model.

*Note that this is a 2-bit and not a 4-bit model. Accuracy may be evaluated another time, e.g. on an RTX 3080, when I have time to do so.