
Fun

A useless operating system

This started when I saw those cheap retro gaming consoles with only 4GB of storage and 512MB of RAM, and thought: could I make a complete retro gaming distribution in that same space?

The answer: not really. I could squeeze things into 4GB of storage, but not into 512MB of RAM. Not even close.

But anyway, here's the download link: https://drive.google.com/file/d/1-LzuryJ2MBLoBvcXmFuBgm9OQ3V_TUy0/view?usp=sharing. Use it at your own risk. (You can also drop it directly onto a Ventoy USB.)

There's no sound (I was too lazy to install it). Wireless drivers and other extras took up too much space, so they're excluded as well.

This is a simple Debian installation with ES-DE and RetroArch (only cores under 20MB in size were kept).

When you start the OS, an ES-DE instance starts on VT8. Stopping ES-DE makes it restart. You can log in as emustation (the password is the same).

It's too much of a resource hog, and Batocera exists, so it's probably useless. But it's at least fun to try :)

Do the Dimensity 9000 and the 10750H hold up well in the benchmarks? (again)

I took a look at https://www.phoronix.com/review/16-armlinux-sep2018/ and decided to run the Dimensity 9000 through those benchmarks.

The result: https://openbenchmarking.org/result/2602284-YOSH-260227012. The Dimensity wins most of the benchmarks by a landslide, except pgbench (it still has a lead, but the Socionext Developerbox brute-forces its way close) and the Perl interpreter test (I blame proot for that).

The X2 core completely destroys anything from 2018 (obviously), and the total run time is so low that nothing comes close (thanks to the X2 core again). Against the desktop leaderboard, though, the 9000 Plus pales in comparison. I have not tested that, but it should rank near the bottom.

I've also run a few of the benchmarks on my 10750H, and it lands at roughly 4960X-5960X performance: https://openbenchmarking.org/result/2602287-YOSH-YOSHI9552.

A summary: https://docs.google.com/spreadsheets/d/1MC92otAyJLy6xrpeCMe5lM960kpfgVo6Gvjeg3wx6aE/edit?usp=sharing.

Qwen3 0.6B benchmarks

Here are a few benchmarks of Qwen3 0.6B (Q4_0) on a Dimensity 9000+:

Specs: 64-bit LPDDR5X-7500 (60.0 GB/s), 1x X2 (3350 MHz), 3x A710 (3200 MHz), 4x A510 (1800 MHz)

All benchmarks are done using llama.cpp build 6602 (72b24d96), compiled with clang 20.1.8 (Fedora 20.1.8-4.fc42) for aarch64-redhat-linux-gnu, with ubatch = 64. Tests on the A510 cores are done with mmap enabled.

Compilation options: -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DGGML_OPENMP=off
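For reference, the whole setup can be sketched as the commands below. The model filename and build directory are placeholders, not the exact ones I used; core numbering for taskset is SoC-specific, so check your device first.

```shell
# Build llama.cpp with the options above (clang, no OpenMP).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DGGML_OPENMP=off
cmake --build build --config Release -j

# Run the benchmark with ubatch = 64 on a single core, pinning the
# process with taskset (cpu7 as the X2 core is an assumption here);
# -mmp 0 disables mmap, matching the A710/X2 runs.
taskset -c 7 ./build/bin/llama-bench -m qwen3-0.6b-q4_0.gguf -t 1 -ub 64 -mmp 0
```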

1st run: One A510 core vs. one A710 core vs. one X2 core

One A510 core

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | pp512 | 14.83 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | tg128 | 4.34 ± 0.00 |

One A710 core

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | pp512 | 96.77 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | tg128 | 27.20 ± 0.00 |

One X2 core

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | pp512 | 143.94 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 1 | 64 | 0 | tg128 | 39.32 ± 0.00 |

2nd run: Two A510 cores vs. two A710 cores vs. A710+X2

Two A510 cores

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | pp512 | 25.97 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | tg128 | 6.92 ± 0.00 |

Two A710 cores

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | pp512 | 184.00 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | tg128 | 48.63 ± 0.00 |

A710+X2

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | pp512 | 196.54 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 2 | 64 | 0 | tg128 | 52.45 ± 0.00 |

3rd run: 3 A510 cores vs. 3 A710 cores vs. 2xA710+X2

3 A510 cores

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | pp512 | 39.05 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | tg128 | 10.40 ± 0.00 |

3 A710 cores

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | pp512 | 267.38 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | tg128 | 64.33 ± 0.00 |

2xA710+X2

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | pp512 | 284.89 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 3 | 64 | 0 | tg128 | 65.91 ± 0.00 |

4th run: 4 A510 cores vs. 3xA710+X2

4 A510 cores

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | pp512 | 43.76 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | tg128 | 10.51 ± 0.00 |

3xA710+X2

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | 0 | pp512 | 359.16 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 64 | 0 | tg128 | 74.01 ± 0.00 |

5th run: All cores

| model | size | params | backend | threads | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 64 | 0 | pp512 | 86.80 ± 0.00 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 64 | 0 | tg128 | 22.08 ± 0.00 |

Gemma 3N E2B benchmarks

Here are a few benchmarks of Gemma 3N E2B (Q4_0) on a Snapdragon 730G:

Specs: 32-bit LPDDR4X-3733 (14.9 GB/s), 2x A76 (2208 MHz, downclocks to 2169 MHz), 6x A55 (1804 MHz)

All benchmarks are done using llama.cpp build 5891 (0d922676), with mmap disabled.

Compilation options: -DGGML_NATIVE=off -DGGML_OPENMP=off -DGGML_CPU_ARM_ARCH=armv8.2-a+fp16+dotprod
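The per-cluster numbers below rely on pinning llama-bench to specific cores. A sketch of how that can be done with taskset (the core numbering is an assumption; on many Snapdragon 730G devices cpu6-cpu7 are the A76 cores, but verify against sysfs first, and the model filename is a placeholder):

```shell
# Identify the big (A76) cores: on big.LITTLE SoCs they report a
# higher cpuinfo_max_freq than the little (A55) cores.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$c: $(cat $c/cpufreq/cpuinfo_max_freq 2>/dev/null)"
done

# Pin llama-bench to the two assumed A76 cores with 2 threads and
# mmap disabled, matching the "Two A76 cores" run below.
taskset -c 6,7 ./build/bin/llama-bench -m gemma-3n-e2b-q4_0.gguf -t 2 -mmp 0
```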

1st run: One A55 core vs. one A76 core

One A55 core

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | pp512 | 3.21 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | tg128 | 1.05 ± 0.00 |

One A76 core

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | pp512 | 13.65 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 1 | 0 | tg128 | 5.80 ± 0.00 |

2nd run: Two A55 cores vs. two A76 cores

Two A55 cores

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | pp512 | 6.46 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | tg128 | 2.13 ± 0.00 |

Two A76 cores (best configuration for TG; 2-3 t/s more in real-world usage compared to all cores)

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | pp512 | 23.06 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 2 | 0 | tg128 | 6.81 ± 0.00 |

3rd run: 6 A55 cores vs. all cores

6 A55 cores

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 6 | 0 | pp512 | 18.18 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 6 | 0 | tg128 | 4.41 ± 0.00 |

All cores (best configuration for PP, but the difference from two A76 cores is negligible)

| model | size | params | backend | threads | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 8 | 0 | pp512 | 27.51 ± 0.00 |
| gemma3n E2B Q4_0 | 3.34 GiB | 4.46 B | CPU | 8 | 0 | tg128 | 5.26 ± 0.00 |

Run Codeforces 1952J codes 20000x faster

I have created a transpiler from the Codeforces 1952J language to C++; the resulting compiled code is 20000x* faster than the reference implementation.

Code
#include <bits/stdc++.h>
using namespace std;
#define int long long
#ifndef yoshi_likes_e4
#define endl '\n'
#endif
#define problem ""
#define multitest 0
#define debug(x) cerr << #x << " = " << x << endl;
void init()
{
}
map<string, bool> var_type;
void Add_variable(string k)
{
    bool ok = 0;
    try
    {
        std::stoi(k);
        ok = true;
    }
    catch (const std::invalid_argument &e)
    {
    }
    catch (const std::out_of_range &e)
    {
    }
    if (!ok)
    {
        int pos = k.find('[');
        if (pos == string::npos)
            var_type[k] = 0;
        else
            var_type[string(k.begin(), k.begin() + pos)] = 1;
    }
}
void Yoshi()
{
    vector<vector<string>> code;
    string s;
    while (getline(cin, s))
    {
        stringstream t(s);
        code.push_back({});
        string x;
        while (t >> x)
            code.back().push_back(x);
    }
    map<int, int> label_id;
    int lid = 0;
    for (auto &lines : code)
    {
        if (lines[0] == "simp")
        {
            int v = stoi(lines[2]) - 1;
            if (label_id.find(v) == label_id.end())
                label_id[v] = lid++;
        }
        if (lines[0] == "vibe")
            Add_variable(lines[2]), Add_variable(lines[4]);
        if (lines[0] == "bruh")
            Add_variable(lines[1]), Add_variable(lines[5]);
        if (lines[0] == "*slaps")
            Add_variable(lines[1]), Add_variable(lines[5].substr(0, lines[5].size() - 1));
        if (lines[0] == "rip")
            Add_variable(lines[2]), Add_variable(lines[6]);
        if (lines[0] == "yoink")
            Add_variable(lines[1]);
        if (lines[0] == "yeet")
            Add_variable(lines[1]);
    }
    vector<string> var0, var1;
    for (auto &[u, v] : var_type)
        if (v)
            var1.push_back(u);
        else
            var0.push_back(u);
    cout << R""""(#include <bits/stdc++.h>
using namespace std;
void input(int &x)
{
    string s;
    getline(cin, s);
    x = stoi(s);
}
void input(vector<int> &x)
{
    string s;
    getline(cin, s);
    stringstream t(s);
    while (t >> s)
        x.push_back(stoi(s));
})"""";
    cout << "\nint main(){\ncin.tie(0)->sync_with_stdio(0);\n";
    if (var0.size())
    {
        cout << "int ";
        for (auto &i : var0)
            cout << i << (&i != &var0.back() ? ", " : ";\n");
    }
    if (var1.size())
    {
        cout << "vector<int> ";
        for (auto &i : var1)
            cout << i << (&i != &var1.back() ? ", " : ";\n");
    }
    for (auto &lines : code)
    {
        if (label_id.find(&lines - &code[0]) != label_id.end())
            cout << "L" << label_id[&lines - &code[0]] << ":\n";
        if (lines[0] == "simp")
            cout << "goto L" << label_id[stoi(lines[2]) - 1] << ";\n";
        if (lines[0] == "vibe")
            cout << "if (" << lines[2] << " > " << lines[4] << ')' << "\n";
        if (lines[0] == "bruh")
            cout << lines[1] << " = " << lines[5] << ";\n";
        if (lines[0] == "*slaps")
            cout << lines[5].substr(0, lines[5].size() - 1) << " += " << lines[1] << ";\n";
        if (lines[0] == "rip")
            cout << lines[2] << " -= " << lines[6] << ";\n";
        if (lines[0] == "yoink")
            cout << "input(" << lines[1] << ");\n";
        if (lines[0] == "yeet")
            cout << "cout << " << lines[1] << " << \"\\n\";\n";
        if (lines[0] == "go")
            cout << "return 0;\n";
    }
    cout << "}" << endl;
}
signed main()
{
#ifndef yoshi_likes_e4
    ios::sync_with_stdio(0);
    cin.tie(0);
    if (fopen(problem ".inp", "r"))
    {
        freopen(problem ".inp", "r", stdin);
        freopen(problem ".out", "w", stdout);
    }
#endif
    init();
    int t = 1;
#if multitest
    cin >> t;
#endif
    while (t--)
        Yoshi();
}
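Using the transpiler is a three-step pipeline: build it, feed it a 1952J program on stdin, then compile the generated C++. A sketch (the file names are placeholders):

```shell
# 1. Build the transpiler itself (source saved as transpile.cpp).
g++ -O2 -o transpile transpile.cpp

# 2. Transpile: the 1952J program goes in on stdin,
#    the generated C++ comes out on stdout.
./transpile < program.1952j > generated.cpp

# 3. Compile and run the generated program.
g++ -O2 -o generated generated.cpp
./generated
```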

*: Selection sort performance:

\(n=3000\):

| Implementation | Runtime |
| --- | --- |
| Reference | 41.0s |
| Compiled | 2ms |

\(n=5000\):

| Implementation | Runtime |
| --- | --- |
| Reference | 123.3s |
| Compiled | 6ms |

SymPy

I have created a SymPy package that runs on ARM64, based on Alpine, just for fun (to see how small it can be). Note: it requires Termux with proot/chroot, or any rooted ARM64 computer (e.g. an RPi).

Download link

  • Usage:
    • Extract the file into your home directory (e.g. by tar -xzf ../sympy.cmax.tgz -C /home/...).
    • Run chroot /home/.../alpine /bin/bash -l
    • You should see a Python shell with SymPy loaded. (Note: exiting the shell stops the chroot. To prevent this (e.g. for customization), remove the exit 0 line in the .../alpine/etc/profile file.)
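The steps above as one sequence (the home path is a placeholder, as in the post; both commands assume the archive is in the current directory's parent and that you have root or are inside proot):

```shell
# Extract the Alpine+SymPy rootfs into the home directory.
tar -xzf ../sympy.cmax.tgz -C /home/user

# Enter the chroot; its /etc/profile drops straight into a
# Python shell with SymPy already imported.
chroot /home/user/alpine /bin/bash -l
```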