Yoshi's Island

June 6, 2026
in Fun
2 min read

Taking a look at Gemma 4 E2B (QAT)

A continuation of the 3N E2B benchmarks, now with faster hardware and models.

LiteRT-LM model: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/blob/main/gemma-4-E2B-it.litertlm

llama.cpp Q4_0 model: https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B_q4_0-it.gguf

We'll denote the hardware as follows:

ID	Info
CPU 1	Dimensity 9000+ (A710 = 1.8 GHz, X2 = 2.55 GHz)
CPU 1.1	Dimensity 9000+ (max freq)
CPU 2	i7-10750H (2.6 GHz)
CPU 2.1	i7-10750H (max freq, ~3.4 GHz)
GPU 1	Mali-G710 MP10 (max freq)
GPU 2	GTX 1660 Ti (max freq)
GPU 2.1	GTX 1660 Ti (1.455 GHz)

TPS is taken as an average of 5 runs.

Configuration:

LiteRT-LM: enable-speculative-decoding = false

llama.cpp: ub = 1024, threads=n_cores/2

For CPU 1, llama.cpp is configured as follows: -DGGML_NATIVE=off -DGGML_CPU_ARM_ARCH=native+dotprod+i8mm+nosve

For GPU 2, llama.cpp is https://github.com/pt13762104/llama.cpp/commit/c5914bbd918022518bb5c0645bcdb24e3bc404f2 with GGML_CUDA_NO_TURING_MMA enabled.

Benchmarks:

LiteRT-LM

Benchmark	CPU 1	CPU 1.1	CPU 2	CPU 2.1
pp1024	151.98	184.57	153.96	211.37
tg256 @ d1024	15.13	16.07	17.23	20.45

Comment: Power efficiency of the CPU is poor. TG is inherently memory bottlenecked. CPU 2 demonstrates much better scaling.

Benchmark	GPU 1	GPU 2	GPU 2.1
pp1024	874.90	4877.41	4411.52
tg256 @ d1024	19.16	98.96	79.00

Comment: Clearly the GTX 1660 Ti is ALU bottlenecked in the decode phase. The Mali-G710 MP10 was really fast compared to the CPUs.

llama.cpp Q4_0

Benchmark	CPU 1	CPU 1.1	CPU 2	CPU 2.1
pp1024	80.67	103.53	71.26	84.24
tg256	15.10	17.87	17.82	20.48
pp1024 @ d4096	42.99	53.76	54.26	64.25
tg256 @ d4096	12.68	14.01	14.88	17.29

Comment: Slightly better scaling than LiteRT-LM for CPU 1, but objectively worse performance.

Benchmark	GPU 2	GPU 2.1
pp1024	2165.55	1860.03
tg256	118.57	102.95
pp1024 @ d4096	1496.00	1311.85
tg256 @ d4096	112.58	99.38

Comment: Better decode scaling, but falls short of LiteRT-LM in prompt processing.

Conclusion: LiteRT-LM is the first viable option for phone inference. If you want to burn your hand while enjoying (much) slower inference, llama.cpp (on the phone) is probably for you.

It should be noted that the LiteRT-LM model is optimized for phone inference*, and should not be considered a replacement to the llama.cpp Q4_0 model.

*Note that this is a 2-bit and not a 4-bit model. Accuracy might be evaluated on another time, e.g. on a RTX 3080 when I have time to do so.

February 14, 2026
in Fun
1 min read

An useless operating system

It was at the time when I saw these cheap retro gaming consoles with only 4GB of storage and 512MB of RAM. I've thought: Could I make a complete retro gaming distribution in that same space?

The answer: Not really. I could squeeze things in 4GB of space, but not 512MB of RAM. Not even close.

But anyways, here's the download link: https://drive.google.com/file/d/1-LzuryJ2MBLoBvcXmFuBgm9OQ3V_TUy0/view?usp=sharing. Use it at your own risk. (You can also drop it directly to a Ventoy USB.)

There's no sound (I'm too lazy to install it). Wireless drivers and other things took up too much space, so it's also excluded.

This is a simple Debian installation with ES-DE and RetroArch (cores less than 20MB in size were kept).

When you start the OS, an ES-DE instance will start at VT8. Stopping ES-DE makes it restart. You can login as emustation (password is the same.)

It's too resource hogging, and Batocera exists, so it's probably useless. But it's at least fun to try at the end :)

February 8, 2026
in Fun
1 min read

Does the Dimensity 9000 and 10750H hold well in the benchmarks? (again)

I've taken a look at https://www.phoronix.com/review/16-armlinux-sep2018/, and decided to test the Dimensity 9000 on these benchmarks.

The result: https://openbenchmarking.org/result/2602284-YOSH-260227012. Most of the benchmarks the Dimensity wins by a landslide excluding pgbench (it got a lead but the Socionext Developerbox is brute forcing it), or Perl Interpreter (I blame proot for this).

The X2-core was completely destroying anything else in 2018 (obviously), and the total run-time is so fast nothing even comes close (thanks to the X2-core again.). On the desktop leaderboard, the 9000 Plus pales in comparison. I have not tested that out but it should rank at the bottom.

I've also tested a few benchmarks out of my 10750H and it got about 4960X-5960X performance: https://openbenchmarking.org/result/2602287-YOSH-YOSHI9552.

A summary: https://docs.google.com/spreadsheets/d/1MC92otAyJLy6xrpeCMe5lM960kpfgVo6Gvjeg3wx6aE/edit?usp=sharing.

September 27, 2025
in Fun
3 min read

Qwen3 0.6B benchmarks

Here are a few benchmarks of Qwen3 0.6B (Q4_0) on a Dimensity 9000+:

Specs: 64 bit LPDDR5X-7500 (60.0 GB/s), 1xX2 (3350MHz), 3xA710 (3200 MHz), 4xA510 (1800MHz)

All benchmarks are done using llama.cpp build: 6602 (72b24d96) with clang version 20.1.8 (Fedora 20.1.8-4.fc42) for aarch64-redhat-linux-gnu with ubatch = 64. Tests on A510 are done with mmap enabled.

Compilation options: -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DGGML_OPENMP=off

1st run: One A510 core vs. one A710 core vs. one X2 core

One A510 core

model	size	params	backend	threads	n_ubatch	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	1	64	pp512	14.83 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	1	64	tg128	4.34 ± 0.00

One A710 core

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	1	64	0	pp512	96.77 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	1	64	0	tg128	27.20 ± 0.00

One X2 core

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	1	64	0	pp512	143.94 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	1	64	0	tg128	39.32 ± 0.00

2nd run: Two A510 cores vs. two A710 cores vs. A710+X2

Two A510 cores

model	size	params	backend	threads	n_ubatch	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	2	64	pp512	25.97 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	2	64	tg128	6.92 ± 0.00

Two A710 cores

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	2	64	0	pp512	184.00 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	2	64	0	tg128	48.63 ± 0.00

A710+X2

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	2	64	0	pp512	196.54 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	2	64	0	tg128	52.45 ± 0.00

3rd run: 3 A510 cores vs. 3 A710 cores vs. 2xA710+X2

3 A510 cores

model	size	params	backend	threads	n_ubatch	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	3	64	pp512	39.05 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	3	64	tg128	10.40 ± 0.00

3 A710 cores

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	3	64	0	pp512	267.38 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	3	64	0	tg128	64.33 ± 0.00

2xA710+X2

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	3	64	0	pp512	284.89 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	3	64	0	tg128	65.91 ± 0.00

4th run: 4 A510 cores vs. 3xA710+X2

4 A510 cores

model	size	params	backend	threads	n_ubatch	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	4	64	pp512	43.76 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	4	64	tg128	10.51 ± 0.00

3xA710+X2

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	4	64	0	pp512	359.16 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	4	64	0	tg128	74.01 ± 0.00

5th run: All cores

model	size	params	backend	threads	n_ubatch	mmap	test	t/s
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	8	64	0	pp512	86.80 ± 0.00
qwen3 0.6B Q4_0	358.78 MiB	596.05 M	CPU	8	64	0	tg128	22.08 ± 0.00

July 14, 2025
in Fun
2 min read

Gemma 3N E2B benchmarks

Here are a few benchmarks of Gemma 3N E2B (Q4_0) on a Snapdragon 730G:

Specs: 32 bit LPDDR4X-3733 (14.9 GB/s), 2xA76 (2208MHz, downclocks to 2169MHz), 6xA55 (1804MHz)

All benchmarks are done using llama.cpp build: 5891 (0d922676) with mmap disabled.

Compilation options: -DGGML_NATIVE=off -DGGML_OPENMP=off -DGGML_CPU_ARM_ARCH=armv8.2-a+fp16+dotprod

1st run: One A55 core vs. one A76 core

One A55 core

model	size	params	backend	threads	mmap	test	t/s
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	1	0	pp512	3.21 ± 0.00
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	1	0	tg128	1.05 ± 0.00

One A76 core

model	size	params	backend	threads	mmap	test	t/s
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	1	0	pp512	13.65 ± 0.00
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	1	0	tg128	5.80 ± 0.00

2nd run: Two A55 cores vs. two A76 cores

Two A55 cores

model	size	params	backend	threads	mmap	test	t/s
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	2	0	pp512	6.46 ± 0.00
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	2	0	tg128	2.13 ± 0.00

Two A76 cores (best configuration for TG, 2-3t/s more in real world usage compared to all cores)

model	size	params	backend	threads	mmap	test	t/s
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	2	0	pp512	23.06 ± 0.00
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	2	0	tg128	6.81 ± 0.00

3rd run: 6 A55 cores vs all cores

6 A55 cores

model	size	params	backend	threads	mmap	test	t/s
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	6	0	pp512	18.18 ± 0.00
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	6	0	tg128	4.41 ± 0.00

All cores (best configuration for PP, but 2xA76 has negligible difference)

model	size	params	backend	threads	mmap	test	t/s
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	8	0	pp512	27.51 ± 0.00
gemma3n E2B Q4_0	3.34 GiB	4.46 B	CPU	8	0	tg128	5.26 ± 0.00

May 14, 2025
in LLM
2 min read

Qwen3 FAQ

There are other Qwen3 FAQs, with this one made me do my FAQ. I just put random commonly asked questions here for fun.

1. HOW TO DISABLE THINKING 🔥🔥🔥🔥

Use /no_think in your prompts. Or just add it to system prompt.

2. What are the system requirements?

In progress

3. What can I run?

FOR MOE MODELS, REMEMBER TO USE TENSOR OVERRIDES!!!

Any units of information here refers to VRAM size if not explicitly stated.

This list is from best to worst.

If you have a GPU server with \(\geq 150\) GB of free VRAM: go for the 235B model! You won't be disappointed.

Else, if you have CPU platforms with fast RAM (\(\geq 200\) GB/s and \(\geq 96\) GB), you can try running the 235B model. NOTE: You should add GPUs (\(24\) GB will work fine) then offload to them as much as possible (this will boost your speed by a huge margin).

Else, If you have a GPU gaming rig or server with \(\leq 32\) GB of VRAM, or is RAM-limited (\(\leq 96\) GB), the 32B model is best for you. It performs well, maybe a slight bit less than the 235B.

If you want fast inference on big GPU rigs, you can also try the 32B model.

For platforms with \(12-16\) GB of VRAM, running 14B or 30BA3B is advised. You can get very high performance with 30BA3B using an ordinary computer with 16GB VRAM.

30BA3B have relatively high performance even on DDR4 RAM and VRAM-limited machines. It can reach 20t/s on an ordinary machine with a 8GB GPU and DDR5 RAM. Think of it as like a "flash" model.

30BA3B doesn't require expensive GPUs, anything \(\geq 8\) GB and relatively recent is already an usable experience.

8b is also a viable option for 8GB platforms. You can choose between 30BA3B (slower) or 8b (faster, but somewhat less intelligent).

4B is a viable option for 4-6 GB platforms and ordinary DDR4/5 CPU inference. You can use this model on 6GB GPUs while keeping a bit of memory for the system.

1.7B is suitable for fast, ordinary CPU inference. You can even do it on high end phones!!!

0.6B is suitable for anything except single-channel RAM DDR \(\leq 4\) devices.

4. What are tensor overrides?

Think of it as a way to force some tensors, not layers onto the GPU. You would to like to load everything (except the FFN*) on to the GPU. Then, load the FFN accordingly if you have enough VRAM.

*: The FFN tensors are really big, but they can be processed efficiently with MoE models. Other parts (such as shared experts, etc...) needs to be put on the GPU because they are activated every time.

May 1, 2025
in Fun
3 min read

Run Codeforces 1952J codes 20000x faster

I have created a Codeforces 1952J language to C++ transpiler, the resulting compiled code is 20000x* faster compared to the reference implementation.

Code

#include <bits/stdc++.h>
using namespace std;
#define int long long
#ifndef yoshi_likes_e4
#define endl '\n'
#endif
#define problem ""
#define multitest 0
#define debug(x) cerr << #x << " = " << x << endl;
void init()
{
}
map<string, bool> var_type;
void Add_variable(string k)
{
    bool ok = 0;
    try
    {
        std::stoi(k);
        ok = true;
    }
    catch (const std::invalid_argument &e)
    {
    }
    catch (const std::out_of_range &e)
    {
    }
    if (!ok)
    {
        int pos = k.find('[');
        if (pos == string::npos)
            var_type[k] = 0;
        else
            var_type[string(k.begin(), k.begin() + pos)] = 1;
    }
}
void Yoshi()
{
    vector<vector<string>> code;
    string s;
    while (getline(cin, s))
    {
        stringstream t(s);
        code.push_back({});
        string x;
        while (t >> x)
            code.back().push_back(x);
    }
    map<int, int> label_id;
    int lid = 0;
    for (auto &lines : code)
    {
        if (lines[0] == "simp")
        {
            int v = stoi(lines[2]) - 1;
            if (label_id.find(v) == label_id.end())
                label_id[v] = lid++;
        }
        if (lines[0] == "vibe")
            Add_variable(lines[2]), Add_variable(lines[4]);
        if (lines[0] == "bruh")
            Add_variable(lines[1]), Add_variable(lines[5]);
        if (lines[0] == "*slaps")
            Add_variable(lines[1]), Add_variable(lines[5].substr(0, lines[5].size() - 1));
        if (lines[0] == "rip")
            Add_variable(lines[2]), Add_variable(lines[6]);
        if (lines[0] == "yoink")
            Add_variable(lines[1]);
        if (lines[0] == "yeet")
            Add_variable(lines[1]);
    }
    vector<string> var0, var1;
    for (auto &[u, v] : var_type)
        if (v)
            var1.push_back(u);
        else
            var0.push_back(u);
    cout << R""""(#include <bits/stdc++.h>
using namespace std;
void input(int &x)
{
    string s;
    getline(cin, s);
    x = stoi(s);
}
void input(vector<int> &x)
{
    string s;
    getline(cin, s);
    stringstream t(s);
    while (t >> s)
        x.push_back(stoi(s));
})"""";
    cout << "\nint main(){\ncin.tie(0)->sync_with_stdio(0);\n";
    if (var0.size())
    {
        cout << "int ";
        for (auto &i : var0)
            cout << i << (&i != &var0.back() ? ", " : ";\n");
    }
    if (var1.size())
    {
        cout << "vector<int> ";
        for (auto &i : var1)
            cout << i << (&i != &var1.back() ? ", " : ";\n");
    }
    for (auto &lines : code)
    {
        if (label_id.find(&lines - &code[0]) != label_id.end())
            cout << "L" << label_id[&lines - &code[0]] << ":\n";
        if (lines[0] == "simp")
            cout << "goto L" << label_id[stoi(lines[2]) - 1] << ";\n";
        if (lines[0] == "vibe")
            cout << "if (" << lines[2] << " > " << lines[4] << ')' << "\n";
        if (lines[0] == "bruh")
            cout << lines[1] << " = " << lines[5] << ";\n";
        if (lines[0] == "*slaps")
            cout << lines[5].substr(0, lines[5].size() - 1) << " += " << lines[1] << ";\n";
        if (lines[0] == "rip")
            cout << lines[2] << " -= " << lines[6] << ";\n";
        if (lines[0] == "yoink")
            cout << "input(" << lines[1] << ");\n";
        if (lines[0] == "yeet")
            cout << "cout << " << lines[1] << " << \"\\n\";\n";
        if (lines[0] == "go")
            cout << "return 0;\n";
    }
    cout << "}" << endl;
}
signed main()
{
#ifndef yoshi_likes_e4
    ios::sync_with_stdio(0);
    cin.tie(0);
    if (fopen(problem ".inp", "r"))
    {
        freopen(problem ".inp", "r", stdin);
        freopen(problem ".out", "w", stdout);
    }
#endif
    init();
    int t = 1;
#if multitest
    cin >> t;
#endif
    while (t--)
        Yoshi();
}

*: Selection sort performance:

\(n=3000\):

Implementation	Runtime
Reference	41.0s
Compiled	2ms

\(n=5000\):

Implementation	Runtime
Reference	123.3s
Compiled	6ms

April 30, 2025
in CP
1 min read

Shortest path between 2 special nodes

This problem is one of our practice problems, and I have been struggling a bit to solve it. There's a "solution" on GFG that's basically useless.

The solution is plainly \(O(VE)\) and basically have no relations to the actual solution.

I'll explain 2 real solutions:

Solution 1. BFS with multiple sources

Do normal BFS but you push all the sources to the queue. Create a \(source_v\) array which will be the optimal source for the \(v\)-th node. Then, for each edge \((u,v)\), let \(Result=min(Result,d[u]+d[v])\) if \(source[u] \neq source[v]\).

Complexity: \(O(V+E)\).

Solution 2. SPFA with multiple sources

You do normal SPFA relaxation (start from node 1, then add node 2 as a source, etc...) but update the \(Result\) variable when there's a node that were NOT a source and is special. The relaxation is only performed if \(d[u]+1<Result\).

Complexity: \(O(VE)\) (I don't know if it can be hacked, it runs relatively fast (~2x slower than BFS)).

August 24, 2024
in CP
4 min read

CHUNGKHOAN

Recently I revisited this problem which was intended to solve in \(O(n)\) (the problem asks to find the largest subarray which has \(max-min<=T\)). But I can only think of a \(O(n*log(n))\) solution, so I coded the fastest \(O(n*log(n))\) solution yet using vectorized binary search and linear sparse table:

Code

#include <algorithm>
#include <functional>
#include <iostream>
using namespace std;
typedef int vec __attribute__((vector_size(32)));
int lsb(int x)
{
    return x & -x;
}
int smin[4000005], arr[4000005], smax[4000005], mmin[4000005], mmax[4000005];
int b = 16;
int n, t;
int smmin(int r, int size = b)
{
    int dist_from_r = __lg(mmin[r] & ((1 << size) - 1));
    return r - dist_from_r;
}
int smmax(int r, int size = b)
{
    int dist_from_r = __lg(mmax[r] & ((1 << size) - 1));
    return r - dist_from_r;
}
int pmax(int x, int y)
{
    return arr[x] > arr[y] ? x : y;
}
int pmin(int x, int y)
{
    return arr[x] < arr[y] ? x : y;
}
// RMQ constant time insanity that I found online
void build()
{
    int curr_mask = 0;
    for (int i = 0; i < n; i++)
    {
        curr_mask = (curr_mask << 1) & ((1LL << b) - 1);
        while (curr_mask > 0 && pmin(i, i - __lg(lsb(curr_mask))) == i)
        {
            curr_mask ^= lsb(curr_mask);
        }
        curr_mask |= 1;
        mmin[i] = curr_mask;
    }
    curr_mask = 0;
    for (int i = 0; i < n; i++)
    {
        curr_mask = (curr_mask << 1) & ((1LL << b) - 1);
        while (curr_mask > 0 && pmax(i, i - __lg(lsb(curr_mask))) == i)
        {
            curr_mask ^= lsb(curr_mask);
        }
        curr_mask |= 1;
        mmax[i] = curr_mask;
    }
    for (int i = 0; i < (n >> 4); i++)
    {
        smin[i] = smmin(b * i + b - 1);
        smax[i] = smmax(b * i + b - 1);
    }
    for (int j = 1; (1 << j) <= (n >> 4); j++)
        for (int i = 0; i + (1 << j) <= (n >> 4); i++)
        {
            smin[(n >> 4) * j + i] = pmin(smin[(n >> 4) * (j - 1) + i], smin[(n >> 4) * (j - 1) + i + (1 << (j - 1))]);
            smax[(n >> 4) * j + i] = pmax(smax[(n >> 4) * (j - 1) + i], smax[(n >> 4) * (j - 1) + i + (1 << (j - 1))]);
        }
}
vec query8_0(vec &s, vec &t)
{
    vec res;
    for (int i = 0; i < 8; i++)
    {
        int l = s[i], r = t[i];
        if (r - l + 1 <= b)
        {
            res[i] = arr[smmax(r, r - l + 1)] - arr[smmin(r, r - l + 1)];
            continue;
        }
        int ansmax = pmax(smmax(l + b - 1), smmax(r)), ansmin = pmin(smmin(l + b - 1), smmin(r));
        int x = (l >> 4) + 1, y = (r >> 4) - 1;
        if (x <= y)
        {
            int j = __lg(y - x + 1);
            ansmax = pmax(ansmax, pmax(smax[(n >> 4) * j + x], smax[(n >> 4) * j + y - (1 << j) + 1]));
            ansmin = pmin(ansmin, pmin(smin[(n >> 4) * j + x], smin[(n >> 4) * j + y - (1 << j) + 1]));
        }
        res[i] = arr[ansmax] - arr[ansmin];
    }
    return res;
}
bool cmp_l(vec &a, vec &b)
{
    vec c = a <= b;
    return c[0] + c[1] + c[2] + c[3] + c[4] + c[5] + c[6] + c[7];
}
int maxvec(vec a)
{
    return max(max(max(a[0], a[1]), max(a[2], a[3])), max(max(a[4], a[5]), max(a[6], a[7])));
}
vec nm;
vec bs_min(vec &l)
{
    // SIMD binary search
    vec s = l, r = nm;
    while (cmp_l(l, r))
    {
        vec mid = (l + r) >> 1, q4 = query8_0(s, mid);
        l = l - (mid - l + 1) * (q4 <= t);
        r = r - (mid - r - 1) * (q4 > t);
    }
    return r - s + 1;
}
const int BUF_SZ = 33550336;
inline namespace Input
{
char buf[BUF_SZ];
int pos;
int len;
char next_char()
{
    if (pos == len)
    {
        pos = 0;
        len = (int)fread(buf, 1, BUF_SZ, stdin);
        if (!len)
        {
            return EOF;
        }
    }
    return buf[pos++];
}
int read_int()
{
    int x;
    char ch;
    int sgn = 1;
    while (!isdigit(ch = next_char()))
    {
        if (ch == '-')
        {
            sgn *= -1;
        }
    }
    x = ch - '0';
    while (isdigit(ch = next_char()))
    {
        x = x * 10 + (ch - '0');
    }
    return x * sgn;
}
}; // namespace Input
inline namespace Output
{
char buf[BUF_SZ];
int pos;
void flush_out()
{
    fwrite(buf, 1, pos, stdout);
    pos = 0;
}
void write_char(char c)
{
    if (pos == BUF_SZ)
    {
        flush_out();
    }
    buf[pos++] = c;
}
void write_int(int x)
{
    static char num_buf[100];
    if (x < 0)
    {
        write_char('-');
        x *= -1;
    }
    int len = 0;
    for (; x >= 10; x /= 10)
    {
        num_buf[len++] = (char)('0' + (x % 10));
    }
    write_char((char)('0' + x));
    while (len)
    {
        write_char(num_buf[--len]);
    }
    write_char('\n');
} // auto-flush output when program exits
#include <cassert>
void init_output()
{
    assert(atexit(flush_out) == 0);
}
}; // namespace Output
signed main()
{
    init_output();
    t = read_int();
    n = read_int();
    for (int i = 0; i < n; i++)
        arr[i] = read_int();
    build();
    int res = 0;
    nm = vec{n - 1, n - 1, n - 1, n - 1, n - 1, n - 1, n - 1, n - 1};
    vec shift{0, 1, 2, 3, 4, 5, 6, 7};
    for (int i = 0; i < n / 8; i++)
    {
        vec vi = shift + 8 * i;
        res = max(res, maxvec(bs_min(vi)));
        // Fast cut
        if (n - res < 8 * (i + 1))
            break;
    }
    if (n % 8)
    {
        int n4 = (n >> 3) << 3;
        vec c{n4, n4, n4, n4, n4, n4, n4, n4};
        for (int i = 0; i < n % 8; i++)
            c[i]++;
        res = max(res, maxvec(bs_min(c)));
    }
    write_int(res);
}

June 17, 2024
in CP
7 min read

Hàng cây

1. Subtask 1

Ta lặp qua từng phần tử trong mảng, chọn ra những phần tử thỏa mãn rồi tính giá trị. Đpt: \(O(nq)\).

2. Subtask 2

Ta sẽ xử lý trước các truy vấn \([l,r]\) với l,r thuộc mảng. Ta đặt các mảng next, prev, p để xử lý thao tác xóa phần tử (tăng \(l\) lên -> xóa dần phần tử đi). Ta lặp r, sau đó chèn các phần tử nằm trong khoảng \([0,r]\) vào mảng p. Mảng next sẽ lưu \(i+1\) với phần tử thứ \(i\), và mảng prev sẽ lưu phần tử thứ \(i-1\). Khi chúng ta muốn xóa phần tử thứ \(i\) trong mảng p thì ta sẽ cập nhật next và prev để bỏ qua phần tử đấy và cập nhật chênh lệch. Khi chúng ta xử lý được hết \([l,r]\) trong mảng thì chỉ cần cnp để tìm đáp án.

Code

void update(int *p, int *prev, int *next, int i, long long &sum, int size)
{
    if (next[i] != size && prev[i] != -1)
    {
        sum += abs(p[next[i]] - p[prev[i]]);
    }
    if (next[i] != size)
    {
        sum -= abs(p[next[i]] - p[i]);
        prev[next[i]] = prev[i];
    }
    if (prev[i] != -1)
    {
        sum -= abs(p[prev[i]] - p[i]);
        next[prev[i]] = next[i];
    }
}
void sub2(vector<int> &p, vector<query *> &queries)
{
    int distinct = 0;
    bool cnt[401];
    fill(cnt, cnt + 401, 0);
    for (int i = 0; i < n; i++)
    {
        if (!cnt[p[i]])
        {
            cnt[p[i]] = 1;
            distinct++;
        }
    }
    long long mp[401][401];
    vector<int> p2 = p;
    sort(p.begin(), p.end());
    unique(p.begin(), p.end());
    while (p.size() != distinct)
        p.pop_back();
    int p3[n], prev[n], next[n];
    for (int i = 0; i < p.size(); i++)
    {
        vector<int> idx[401];
        int counter = 0;
        long long sum = 0;
        for (int j = 0; j < n; j++)
        {
            if (p2[j] <= p[i])
            {
                p3[counter] = p2[j];
                prev[counter] = counter - 1;
                next[counter] = counter + 1;
                idx[p2[j]].push_back(counter);
                if (counter >= 1)
                    sum += abs(p3[counter - 1] - p2[j]);
                counter++;
            }
        }
        mp[p[0]][p[i]] = sum;
        for (int j = 1; j <= i; j++)
        {
            for (auto &k : idx[p[j - 1]])
            {
                update(p3, prev, next, k, sum, counter);
            }
            mp[p[j]][p[i]] = sum;
        }
    }
    for (int i = 0; i < q; i++)
    {
        int l = queries[i]->l, r = queries[i]->r;
        int ll = lower_bound(p.begin(), p.end(), l) - p.begin();
        int rr = upper_bound(p.begin(), p.end(), r) - p.begin() - 1;
        if (ll > rr)
        {
            cout << 0 << '\n';
            continue;
        }
        cout << mp[p[ll]][p[rr]] << '\n';
    }
}

Đpt: \(O((n+h)*h+q\log(h))\).

3. Subtask 3

Với subtask 3, ta chỉ cần phải chèn và xóa các phần tử 1 cách trâu. Do giới hạn của subtask này khiến cho chỉ có \(O(n)\) phần tử bị chèn thêm / xóa đi. Ta vẫn sử dụng chặt nhị phân để tìm kết quả (chuyển \([l,r]\) thành \([l',r']\) gần nhất mà \(l',r'\) thuộc mảng).

Code

void add(map<int, int> &s, int i, int x, long long &sum)
{
    auto it = s.lower_bound(i), it2 = it;
    if (it != s.begin())
        it2--;
    if (it != s.end() && it != s.begin())
    {
        sum -= abs((*it).second - (*it2).second);
    }
    if (it != s.end())
    {
        sum += abs((*it).second - x);
    }
    if (it != s.begin())
    {
        sum += abs((*it2).second - x);
    }
    s[i] = x;
}
void rmv(map<int, int> &s, int i, long long &sum)
{
    auto it = s.lower_bound(i), it2 = it, it3 = it;
    if (it != s.begin())
        it3--;
    it2++;
    if (it2 != s.end() && it != s.begin())
    {
        sum += abs((*it2).second - (*it3).second);
    }
    if (it2 != s.end())
    {
        sum -= abs((*it2).second - (*it).second);
    }
    if (it != s.begin())
    {
        sum -= abs((*it3).second - (*it).second);
    }
    s.erase(it);
}
void sub3(vector<int> &p, vector<query *> &queries)
{
    map<int, vector<int>> idx;
    for (auto &i : p)
        idx[i].push_back(&i - &p.front());
    long long sum = 0;
    map<int, int> s;
    sort(p.begin(), p.end());
    unique(p.begin(), p.end());
    while (p.size() != idx.size())
        p.pop_back();
    int oldl = queries.front()->l, oldr = queries.front()->r, oldll, oldrr;
    oldll = lower_bound(p.begin(), p.end(), oldl) - p.begin();
    oldrr = upper_bound(p.begin(), p.end(), oldr) - p.begin() - 1;
    for (int i = oldll; i <= oldrr; i++)
    {
        for (auto &k : idx[p[i]])
        {
            add(s, k, p[i], sum);
        }
    }
    cout << sum << '\n';
    for (int i = 1; i < q; i++)
    {
        int l = queries[i]->l, r = queries[i]->r;
        int ll = lower_bound(p.begin(), p.end(), l) - p.begin();
        int rr = upper_bound(p.begin(), p.end(), r) - p.begin() - 1;
        while (ll > oldll)
        {
            for (auto &k : idx[p[oldll]])
            {
                rmv(s, k, sum);
            }
            oldll++;
        }
        while (rr > oldrr)
        {
            oldrr++;
            for (auto &k : idx[p[oldrr]])
            {
                add(s, k, p[oldrr], sum);
            }
        }
        if (ll > rr)
        {
            cout << 0 << '\n';
            continue;
        }
        cout << sum << '\n';
    }
}

Đpt: \(O((n+q)\log(n))\).

4. Subtask 4

Ta cắt mảng ra thành các phần, mỗi phần có kích thước \(BLOCK\). Sử dụng thuật toán gần giống với sub2 trên mỗi phần tuy nhiên ta sẽ lưu vị trí của \(l',r'\) thay vì lưu \(l',r'\), ta sẽ lấy được kết quả cho mỗi phần. Ta sử dụng RMQ để tìm phần tử gần nhất với 2 đầu mút nằm trong khoảng \([l,r]\). (Vì việc gộp các phần lại cần phải có 2 phần tử đầu mút để tìm chênh lệch đúng). Ta sẽ sắp xếp lại các truy vấn theo \(l\) rồi theo \(r\) để tính vị trí của \(l',r'\) nhanh hơn (mất \(O(q*n/BLOCK)\) thay vì \(O(q*n/BLOCK*\log(BLOCK))\)). Việc xử lý như subtask 2 sẽ mất \(O(BLOCK*n)\).

Code

struct result
{
    long long query;
    int first, last;
};
struct block
{
    // blk^2
    long long mp[BLOCK][BLOCK];
    // blk*log(blk)
    int rmq[lgblk][BLOCK], rmq2[lgblk][BLOCK];
    // blk
    vector<int> p, p2;
    result query(int l, int r, int ll, int rr)
    {
        if (ll > rr)
            return {0, -1, -1};
        int lg = _lg[rr - ll + 1];
        return {mp[ll][rr], p2[min(rmq[lg][ll], rmq[lg][rr - (1 << lg) + 1])],
                p2[max(rmq2[lg][ll], rmq2[lg][rr - (1 << lg) + 1])]};
    }
};
void process(vector<int> p, block &blk)
{
    map<int, vector<int>> idx;
    for (int i = 0; i < BLOCK; i++)
    {
        idx[p[i]].push_back(i);
    }
    blk.p2 = p;
    sort(p.begin(), p.end());
    unique(p.begin(), p.end());
    vector<int> orders(BLOCK);
    while (p.size() != idx.size())
        p.pop_back();
    for (int i = 0; i < BLOCK; i++)
        orders[i] = lower_bound(p.begin(), p.end(), blk.p2[i]) - p.begin();
    blk.p = p;
    for (int i = 0; i < p.size(); i++)
    {
        blk.rmq[0][i] = idx[p[i]].front();
        blk.rmq2[0][i] = idx[p[i]].back();
    }
    for (int i = 1; i <= lgblk; i++)
    {
        for (int j = 0; j + (1 << i) <= p.size(); j++)
        {
            blk.rmq[i][j] =
                min(blk.rmq[i - 1][j], blk.rmq[i - 1][j + (1 << (i - 1))]);
            blk.rmq2[i][j] =
                max(blk.rmq2[i - 1][j], blk.rmq2[i - 1][j + (1 << (i - 1))]);
        }
    }
    int p3[n], prev[n], next[n];
    for (int j = 0; j < p.size(); j++)
    {

        vector<int> idx[BLOCK];
        int counter = 0;
        long long sum = 0;
        for (int i = 0; i < BLOCK; i++)
        {
            if (orders[i] <= j)
            {
                p3[counter] = blk.p2[i];
                prev[counter] = counter - 1;
                next[counter] = counter + 1;
                idx[orders[i]].push_back(counter);
                if (counter >= 1)
                    sum += abs(p3[counter - 1] - blk.p2[i]);
                counter++;
            }
        }
        blk.mp[0][j] = sum;
        for (int i = 1; i <= j; i++)
        {
            for (auto k : idx[i - 1])
            {
                update(p3, prev, next, k, sum, counter);
            }
            blk.mp[i][j] = sum;
        }
    }
}
void sub4(vector<int> &p, vector<query *> &queries)
{
    block blks[n / BLOCK];
    for (int i = 0; i < n / BLOCK; i++)
    {
        process(vector<int>(p.begin() + i * BLOCK, p.begin() + (i + 1) * BLOCK),
                blks[i]);
    }
    vector<int> ll(n / BLOCK), rr(n / BLOCK);
    sort(queries.begin(), queries.end(),
        [](query *a, query *b)
        { return a->l < b->l; });
    for (auto &i : queries)
    {
        for (int j = 0; j < n / BLOCK; j++)
        {
            int psz = blks[j].p.size();
            while (ll[j] < psz && blks[j].p[ll[j]] < i->l)
                ll[j]++;
        }
        i->ll = ll;
    }
    sort(queries.begin(), queries.end(),
        [](query *a, query *b)
        { return a->r < b->r; });
    vector<long long> qres(q);
    for (auto &i : queries)
    {
        long long res = 0;
        vector<int> acc;
        for (int j = 0; j < n / BLOCK; j++)
        {
            int psz = blks[j].p.size();
            while (rr[j] < psz && blks[j].p[rr[j]] <= i->r)
                rr[j]++;
            result x = blks[j].query(i->l, i->r, i->ll[j], rr[j] - 1);
            res += x.query;
            if (x.first != -1)
            {
                acc.push_back(x.first);
                acc.push_back(x.last);
            }
        }
        for (int i = 1; i + 1 < acc.size(); i += 2)
            res += abs(acc[i] - acc[i + 1]);
        qres[i->id] = res;
    }
    for (auto &i : qres)
        cout << i << '\n';
}

Chọn \(BLOCK=\sqrt{n}\) ta có đpt \(O((n+q)\sqrt{n})\).