
2026

Taking a look at Gemma 4 E2B (QAT)

A continuation of the Gemma 3n E2B benchmarks, now with faster hardware and models.

LiteRT-LM model: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/blob/main/gemma-4-E2B-it.litertlm

llama.cpp Q4_0 model: https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B_q4_0-it.gguf

We'll denote the hardware as follows:

| ID | Info |
| --- | --- |
| CPU 1 | Dimensity 9000+ (A710 = 1.8 GHz, X2 = 2.55 GHz) |
| CPU 1.1 | Dimensity 9000+ (max freq) |
| CPU 2 | i7-10750H (2.6 GHz) |
| CPU 2.1 | i7-10750H (max freq, ~3.4 GHz) |
| GPU 1 | Mali-G710 MP10 (max freq) |
| GPU 2 | GTX 1660 Ti (max freq) |
| GPU 2.1 | GTX 1660 Ti (1.455 GHz) |

TPS (tokens per second) is taken as an average of 5 runs.

Configuration:

LiteRT-LM: enable-speculative-decoding = false

llama.cpp: ub = 1024, threads=n_cores/2
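The llama.cpp settings above could be reproduced with something like the sketch below. It assumes the numbers came from `llama-bench` (the post does not say which tool was used), that `n_cores` means the core count reported by the OS, and that the model file sits in the current directory. The helper name `build_llama_bench_args` is made up for illustration.

```python
import os
import shlex


def build_llama_bench_args(model_path, n_cores=None, depth=0):
    """Build a llama-bench argument list matching the configuration above."""
    if n_cores is None:
        # Assumption: "n_cores" is the core count as seen by the OS.
        n_cores = os.cpu_count() or 2
    threads = max(1, n_cores // 2)  # threads = n_cores/2
    return [
        "llama-bench",
        "-m", model_path,
        "-p", "1024",       # pp1024: prompt processing over 1024 tokens
        "-n", "256",        # tg256: text generation of 256 tokens
        "-d", str(depth),   # context depth, e.g. 4096 for the "@ d4096" rows
        "-ub", "1024",      # micro-batch size (ub = 1024)
        "-t", str(threads),
        "-r", "5",          # 5 repetitions, matching the 5-run average
    ]


if __name__ == "__main__":
    # Example: an 8-core phone SoC, measured at depth 4096.
    args = build_llama_bench_args("gemma-4-E2B_q4_0-it.gguf", n_cores=8, depth=4096)
    print(shlex.join(args))
```

The script only prints the command; running it is left to the reader, since the binary location and build options differ per machine.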

For CPU 1, llama.cpp is configured as follows: -DGGML_NATIVE=off -DGGML_CPU_ARM_ARCH=native+dotprod+i8mm+nosve

For GPU 2, llama.cpp is https://github.com/pt13762104/llama.cpp/commit/c5914bbd918022518bb5c0645bcdb24e3bc404f2 with GGML_CUDA_NO_TURING_MMA enabled.

Benchmarks:

LiteRT-LM

| Benchmark | CPU 1 | CPU 1.1 | CPU 2 | CPU 2.1 |
| --- | --- | --- | --- | --- |
| pp1024 | 151.98 | 184.57 | 153.96 | 211.37 |
| tg256 @ d1024 | 15.13 | 16.07 | 17.23 | 20.45 |

Comment: CPU power efficiency is poor. TG is inherently memory-bottlenecked. CPU 2 scales much better with clock frequency than CPU 1.

| Benchmark | GPU 1 | GPU 2 | GPU 2.1 |
| --- | --- | --- | --- |
| pp1024 | 874.90 | 4877.41 | 4411.52 |
| tg256 @ d1024 | 19.16 | 98.96 | 79.00 |

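Comment: The GTX 1660 Ti is clearly ALU-bottlenecked in the decode phase, since TG drops from 98.96 to 79.00 TPS at the lower core clock. The Mali-G710 MP10 is really fast compared to the CPUs: about 4.7x the prompt processing speed of CPU 1.1 and about 19% faster in decode.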
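The scaling remarks in this section can be checked with a little arithmetic. The sketch below computes the percentage gain from the lower-clocked to the higher-clocked configuration for each LiteRT-LM result; the numbers are copied from the tables above, and the names `LITERT`, `gain_percent` and `scaling_table` are just illustrative.

```python
# Throughput in tokens/s, copied from the LiteRT-LM tables above.
# Each pair is (lower-clock result, higher-clock result).
LITERT = {
    "CPU 1 -> CPU 1.1": {"pp1024": (151.98, 184.57), "tg256 @ d1024": (15.13, 16.07)},
    "CPU 2 -> CPU 2.1": {"pp1024": (153.96, 211.37), "tg256 @ d1024": (17.23, 20.45)},
    "GPU 2.1 -> GPU 2": {"pp1024": (4411.52, 4877.41), "tg256 @ d1024": (79.00, 98.96)},
}


def gain_percent(low, high):
    """Relative throughput gain, in percent, when going from low to high clock."""
    return (high / low - 1.0) * 100.0


def scaling_table(results):
    """Map each hardware pair and benchmark to its percentage gain."""
    return {
        hw: {bench: gain_percent(*pair) for bench, pair in benches.items()}
        for hw, benches in results.items()
    }


if __name__ == "__main__":
    for hw, gains in scaling_table(LITERT).items():
        for bench, gain in gains.items():
            print(f"{hw:18s} {bench:14s} {gain:+.1f}%")
```

On the GTX 1660 Ti, decode gains roughly 25% from the higher core clock while prompt processing gains roughly 11%. Assuming the memory clock was the same in both GPU configurations (the post does not state this), decode speed tracking the core clock is consistent with the ALU-bottleneck comment.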

llama.cpp Q4_0

| Benchmark | CPU 1 | CPU 1.1 | CPU 2 | CPU 2.1 |
| --- | --- | --- | --- | --- |
| pp1024 | 80.67 | 103.53 | 71.26 | 84.24 |
| tg256 | 15.10 | 17.87 | 17.82 | 20.48 |
| pp1024 @ d4096 | 42.99 | 53.76 | 54.26 | 64.25 |
| tg256 @ d4096 | 12.68 | 14.01 | 14.88 | 17.29 |

Comment: Slightly better clock scaling than LiteRT-LM on CPU 1, but objectively worse performance, most visibly in prompt processing.

| Benchmark | GPU 2 | GPU 2.1 |
| --- | --- | --- |
| pp1024 | 2165.55 | 1860.03 |
| tg256 | 118.57 | 102.95 |
| pp1024 @ d4096 | 1496.00 | 1311.85 |
| tg256 @ d4096 | 112.58 | 99.38 |

Comment: Better decode scaling, but falls short of LiteRT-LM in prompt processing.
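To put a number on the prompt processing gap, the sketch below divides the LiteRT-LM pp1024 result by the llama.cpp Q4_0 pp1024 result for every hardware ID that appears in both sets of tables. The values are copied from above; the dictionary and function names are illustrative only.

```python
# pp1024 throughput in tokens/s, copied from the tables above.
PP1024_LITERT = {
    "CPU 1": 151.98, "CPU 1.1": 184.57, "CPU 2": 153.96, "CPU 2.1": 211.37,
    "GPU 1": 874.90, "GPU 2": 4877.41, "GPU 2.1": 4411.52,
}
PP1024_LLAMA_CPP = {
    "CPU 1": 80.67, "CPU 1.1": 103.53, "CPU 2": 71.26, "CPU 2.1": 84.24,
    "GPU 2": 2165.55, "GPU 2.1": 1860.03,
}


def speedup(numerator, denominator):
    """Ratio numerator/denominator for hardware IDs present in both mappings."""
    return {hw: numerator[hw] / denominator[hw] for hw in numerator if hw in denominator}


if __name__ == "__main__":
    for hw, ratio in speedup(PP1024_LITERT, PP1024_LLAMA_CPP).items():
        print(f"{hw:8s} LiteRT-LM is {ratio:.2f}x llama.cpp at pp1024")
```

The ratios land between roughly 1.8x and 2.5x in favor of LiteRT-LM. GPU 1 is skipped because there is no llama.cpp result for it. The two model files use different quantization, so this compares the runtimes as shipped rather than like for like.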

Conclusion: LiteRT-LM is the first viable option for phone inference. If you want to burn your hand while enjoying (much) slower inference, llama.cpp (on the phone) is probably for you.

It should be noted that the LiteRT-LM model is optimized for phone inference*, and should not be considered a replacement for the llama.cpp Q4_0 model.

*Note that this is a 2-bit and not a 4-bit model. Accuracy may be evaluated another time, e.g. on an RTX 3080 when I have time to do so.

A useless operating system

It started when I saw these cheap retro gaming consoles with only 4GB of storage and 512MB of RAM. I thought: Could I make a complete retro gaming distribution in that same space?

The answer: Not really. I could squeeze things into 4GB of space, but not into 512MB of RAM. Not even close.

But anyway, here's the download link: https://drive.google.com/file/d/1-LzuryJ2MBLoBvcXmFuBgm9OQ3V_TUy0/view?usp=sharing. Use it at your own risk. (You can also drop it directly onto a Ventoy USB.)

There's no sound (I'm too lazy to install it). Wireless drivers and other things took up too much space, so they're also excluded.

This is a simple Debian installation with ES-DE and RetroArch (only cores less than 20MB in size were kept).

When you start the OS, an ES-DE instance will start on VT8. Stopping ES-DE makes it restart. You can log in as emustation (the password is the same).

It's too resource-hungry, and Batocera exists, so it's probably useless. But at least it's fun to try in the end :)

Do the Dimensity 9000 and 10750H hold up well in the benchmarks? (again)

I've taken a look at https://www.phoronix.com/review/16-armlinux-sep2018/, and decided to test the Dimensity 9000 on these benchmarks.

The result: https://openbenchmarking.org/result/2602284-YOSH-260227012. The Dimensity wins most of the benchmarks by a landslide, except pgbench (it still leads, but the Socionext Developerbox is brute-forcing it) and Perl Interpreter (I blame proot for this).

The X2 core completely destroys everything else from 2018 (obviously), and the total run time is so short that nothing else even comes close (thanks to the X2 core again). On the desktop leaderboard, the 9000 Plus pales in comparison. I have not tested that, but it should rank at the bottom.

I've also run a few benchmarks on my 10750H, and it got about 4960X-5960X performance: https://openbenchmarking.org/result/2602287-YOSH-YOSHI9552.

A summary: https://docs.google.com/spreadsheets/d/1MC92otAyJLy6xrpeCMe5lM960kpfgVo6Gvjeg3wx6aE/edit?usp=sharing.