
Fun

Qwen3 0.6B benchmarks

Here are a few benchmarks of Qwen3 0.6B (Q4_0) on a Dimensity 9000+:

Specs: 64-bit LPDDR5X-7500 (60.0 GB/s), 1xX2 (3350 MHz), 3xA710 (3200 MHz), 4xA510 (1800 MHz)

All benchmarks were done using llama.cpp build 6602 (72b24d96), compiled with clang version 20.1.8 (Fedora 20.1.8-4.fc42) for aarch64-redhat-linux-gnu, with ubatch = 64. Tests on the A510 were done with mmap enabled.

Compilation options: -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DGGML_OPENMP=off
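Since batch-1 token generation is mostly memory-bandwidth-bound, a quick back-of-the-envelope upper bound (a rough sketch that ignores compute and KV-cache traffic) is bandwidth divided by weight size:

```python
# Upper bound on tg: every byte of the Q4_0 weights is read once per token.
model_bytes = 358.78 * 1024**2   # model size, from the tables below
bandwidth = 60.0e9               # LPDDR5X-7500 peak, bytes/s
upper_bound = bandwidth / model_bytes
print(round(upper_bound, 1))     # ~159.5 t/s; the best measured tg128 below is 74 t/s
```

So even the fastest run below reaches roughly half of the theoretical bandwidth limit.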

1st run: One A510 core vs. one A710 core vs. one X2 core

One A510 core

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | pp512 | 14.83 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | tg128 | 4.34 ± 0.00 |

One A710 core

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | pp512 | 96.77 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | tg128 | 27.20 ± 0.00 |

One X2 core

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | pp512 | 143.94 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | tg128 | 39.32 ± 0.00 |

2nd run: Two A510 cores vs. two A710 cores vs. A710+X2

Two A510 cores

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | pp512 | 25.97 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | tg128 | 6.92 ± 0.00 |

Two A710 cores

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | pp512 | 184.00 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | tg128 | 48.63 ± 0.00 |

A710+X2

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | pp512 | 196.54 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | tg128 | 52.45 ± 0.00 |

3rd run: 3 A510 cores vs. 3 A710 cores vs. 2xA710+X2

3 A510 cores

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | pp512 | 39.05 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | tg128 | 10.40 ± 0.00 |

3 A710 cores

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | pp512 | 267.38 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | tg128 | 64.33 ± 0.00 |

2xA710+X2

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | pp512 | 284.89 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | tg128 | 65.91 ± 0.00 |

4th run: 4 A510 cores vs. 3xA710+X2

4 A510 cores

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | pp512 | 43.76 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | tg128 | 10.51 ± 0.00 |

3xA710+X2

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | 0 | pp512 | 359.16 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | 0 | tg128 | 74.01 ± 0.00 |

5th run: All cores (the slow A510 cores drag throughput well below the 4-thread 3xA710+X2 run)

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 64 | 0 | pp512 | 86.80 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 64 | 0 | tg128 | 22.08 ± 0.00 |

Gemma 3N E2B benchmarks

Here are a few benchmarks of Gemma 3N E2B (Q4_0) on a Snapdragon 730G:

Specs: 32-bit LPDDR4X-3733 (14.9 GB/s), 2xA76 (2208 MHz, downclocks to 2169 MHz), 6xA55 (1804 MHz)

All benchmarks were done using llama.cpp build 5891 (0d922676), with mmap disabled.

Compilation options: -DGGML_NATIVE=off -DGGML_OPENMP=off -DGGML_CPU_ARM_ARCH=armv8.2-a+fp16+dotprod

1st run: One A55 core vs. one A76 core

One A55 core

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | pp512 | 3.21 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | tg128 | 1.05 ± 0.00 |

One A76 core

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | pp512 | 13.65 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | tg128 | 5.80 ± 0.00 |

2nd run: Two A55 cores vs. two A76 cores

Two A55 cores

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | pp512 | 6.46 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | tg128 | 2.13 ± 0.00 |

Two A76 cores (best configuration for TG; in real-world usage, 2-3 t/s faster than using all cores)

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | pp512 | 23.06 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | tg128 | 6.81 ± 0.00 |

3rd run: 6 A55 cores vs all cores

6 A55 cores

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 6 | 0 | pp512 | 18.18 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 6 | 0 | tg128 | 4.41 ± 0.00 |

All cores (best configuration for PP, but the difference from 2xA76 is negligible)

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 8 | 0 | pp512 | 27.51 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 8 | 0 | tg128 | 5.26 ± 0.00 |
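A curious back-of-the-envelope observation (a rough check, not a measurement): the best tg128 above would require more memory traffic than the bus can deliver if every byte of the 3.34 GiB file were read per token, which is consistent with Gemma 3N's design activating only a subset of its 4.46 B parameters per token:

```python
# Bandwidth that would be needed if all weights were read on every token.
file_bytes = 3.34 * 1024**3   # Q4_0 file size, from the tables above
tg = 6.81                     # best tg128 (two A76 cores), tokens/s
implied_gbps = file_bytes * tg / 1e9
print(round(implied_gbps, 1))  # ~24.4 GB/s, above the 14.9 GB/s the bus provides
```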

Run Codeforces 1952J code 20000x faster

I have created a transpiler from the Codeforces 1952J language to C++; the resulting compiled code is up to 20000x* faster than the reference implementation.

Code
#include <bits/stdc++.h>
using namespace std;
#define int long long
#ifndef yoshi_likes_e4
#define endl '\n'
#endif
#define problem ""
#define multitest 0
#define debug(x) cerr << #x << " = " << x << endl;
void init()
{
}
map<string, bool> var_type;
void Add_variable(string k)
{
    // Numeric literals are not variables; for real names, record whether
    // the name is a scalar or an array (name[...]).
    bool ok = 0;
    try
    {
        std::stoi(k);
        ok = true;
    }
    catch (const std::invalid_argument &e)
    {
    }
    catch (const std::out_of_range &e)
    {
    }
    if (!ok)
    {
        int pos = k.find('[');
        if (pos == string::npos)
            var_type[k] = 0;
        else
            var_type[string(k.begin(), k.begin() + pos)] = 1;
    }
}
// Keyword mapping: simp = goto, vibe = if (>), bruh = assignment,
// *slaps = +=, rip = -=, yoink = input, yeet = output, go = return.
void Yoshi()
{
    vector<vector<string>> code;
    string s;
    while (getline(cin, s))
    {
        stringstream t(s);
        code.push_back({});
        string x;
        while (t >> x)
            code.back().push_back(x);
    }
    // First pass: collect goto targets and variable declarations.
    map<int, int> label_id;
    int lid = 0;
    for (auto &lines : code)
    {
        if (lines[0] == "simp")
        {
            int v = stoi(lines[2]) - 1;
            if (label_id.find(v) == label_id.end())
                label_id[v] = lid++;
        }
        if (lines[0] == "vibe")
            Add_variable(lines[2]), Add_variable(lines[4]);
        if (lines[0] == "bruh")
            Add_variable(lines[1]), Add_variable(lines[5]);
        if (lines[0] == "*slaps")
            Add_variable(lines[1]), Add_variable(lines[5].substr(0, lines[5].size() - 1));
        if (lines[0] == "rip")
            Add_variable(lines[2]), Add_variable(lines[6]);
        if (lines[0] == "yoink")
            Add_variable(lines[1]);
        if (lines[0] == "yeet")
            Add_variable(lines[1]);
    }
    vector<string> var0, var1;
    for (auto &[u, v] : var_type)
        if (v)
            var1.push_back(u);
        else
            var0.push_back(u);
    cout << R""""(#include <bits/stdc++.h>
using namespace std;
void input(int &x)
{
    string s;
    getline(cin, s);
    x = stoi(s);
}
void input(vector<int> &x)
{
    string s;
    getline(cin, s);
    stringstream t(s);
    while (t >> s)
        x.push_back(stoi(s));
})"""";
    cout << "\nint main(){\ncin.tie(0)->sync_with_stdio(0);\n";
    if (var0.size())
    {
        cout << "int ";
        for (auto &i : var0)
            cout << i << (&i != &var0.back() ? ", " : ";\n");
    }
    if (var1.size())
    {
        cout << "vector<int> ";
        for (auto &i : var1)
            cout << i << (&i != &var1.back() ? ", " : ";\n");
    }
    // Second pass: emit one C++ statement per source line.
    for (auto &lines : code)
    {
        if (label_id.find(&lines - &code[0]) != label_id.end())
            cout << "L" << label_id[&lines - &code[0]] << ":\n";
        if (lines[0] == "simp")
            cout << "goto L" << label_id[stoi(lines[2]) - 1] << ";\n";
        if (lines[0] == "vibe")
            cout << "if (" << lines[2] << " > " << lines[4] << ')' << "\n";
        if (lines[0] == "bruh")
            cout << lines[1] << " = " << lines[5] << ";\n";
        if (lines[0] == "*slaps")
            cout << lines[5].substr(0, lines[5].size() - 1) << " += " << lines[1] << ";\n";
        if (lines[0] == "rip")
            cout << lines[2] << " -= " << lines[6] << ";\n";
        if (lines[0] == "yoink")
            cout << "input(" << lines[1] << ");\n";
        if (lines[0] == "yeet")
            cout << "cout << " << lines[1] << " << \"\\n\";\n";
        if (lines[0] == "go")
            cout << "return 0;\n";
    }
    cout << "}" << endl;
}
signed main()
{
#ifndef yoshi_likes_e4
    ios::sync_with_stdio(0);
    cin.tie(0);
    if (fopen(problem ".inp", "r"))
    {
        freopen(problem ".inp", "r", stdin);
        freopen(problem ".out", "w", stdout);
    }
#endif
    init();
    int t = 1;
#if multitest
    cin >> t;
#endif
    while (t--)
        Yoshi();
}
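As a usage sketch, here is a tiny input program for the transpiler. The parser only inspects fixed token positions (e.g. bruh reads tokens 1 and 5, *slaps reads tokens 1 and 5 with the trailing * stripped), so the filler words below are my own guesses at the 1952J syntax; only the keywords and operand positions matter:

```
yoink a
yoink b
bruh c is just like a
*slaps b on top of c*
yeet c
go touch some grass
```

For this input the transpiler emits the input() helpers, declares int a, b, c;, and then produces input(a); input(b); c = a; c += b; cout << c << "\n"; return 0; inside main().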

*: Selection sort performance:

\(n=3000\):

| Implementation | Runtime |
| --- | --- |
| Reference | 41.0 s |
| Compiled | 2 ms |

\(n=5000\):

| Implementation | Runtime |
| --- | --- |
| Reference | 123.3 s |
| Compiled | 6 ms |
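The 20000x figure follows directly from the timings above, and the reference's n=3000 to n=5000 growth also matches selection sort's O(n^2) behaviour (a quick sanity check on the reported numbers):

```python
# Speedup of the transpiled binary over the reference implementation.
ref = {3000: 41.0, 5000: 123.3}     # reference runtimes, seconds
comp = {3000: 0.002, 5000: 0.006}   # compiled runtimes, seconds
speedup = {n: round(ref[n] / comp[n]) for n in ref}
print(speedup)  # {3000: 20500, 5000: 20550}

# O(n^2) scaling check: predicted vs. observed reference ratio.
print(round((5000 / 3000) ** 2, 2), round(ref[5000] / ref[3000], 2))  # 2.78 3.01
```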

SymPy

Just for fun (to see how small it can be), I have created a SymPy package that runs on ARM64, based on Alpine. Note: it requires Termux with proot/chroot, or any rooted ARM64 computer (e.g. an RPi).

Download link

  • Usage:
    • Extract the archive into your home directory (e.g. by tar -xzf ../sympy.cmax.tgz -C /home/...).
    • Run chroot /home/.../alpine /bin/bash -l.
    • You should see a Python shell with SymPy loaded. (Note: exiting the shell will stop the chroot. To prevent this, e.g. for customization, remove the exit 0 line in the .../alpine/etc/profile file.)