I tried looking at all other compilers, the only one that fit my size constraints was tcc... that would mean dropping std::string and only using ASCII char*... not a big deal for my game because my .ttf fonts don't have anything else, but for other projects throwing UTF-8 out of the window might be a show stopper!?
Also considered Rust, but that is so bloated/slow! Can't stop anyone from using Rust for the game in the future though, because any .so/.dll will be able to hot-deploy.
There is nothing it does regarding support for non-ASCII characters over what you get from buf and len. And for UTF-8, you don't even need len, plain old strlen from the 1980s works fine on valid UTF-8, so plain char* from C works just as well.
And the union is just a performance optimization for small strings (is_big is probably not an additional field in good impls, but I separated it here), it's logically identical to just the buf and len.
The union has nothing to do with UTF-8, so let's ignore it. If you want more details about it, search for "c++ small string optimization", but the one-sentence version is that it's just a way to avoid a heap allocation for strings <= 16 bytes long (including any NUL terminator).
So ignoring that irrelevant optimization, std::string is basically:
struct std_string {
    size_t len;
    char* buf;
};
For storing valid UTF-8, the len is unnecessary, since a 0 byte never appears inside the encoding of any character other than U+0000 itself. As long as your text doesn't contain an embedded U+0000 (and real text basically never does), you can still tell how many bytes of UTF-8 you have by using strlen: when you find a 0 byte, it is never part of a multi-byte sequence, it's always the NUL terminator. So the len is not strictly necessary for valid UTF-8 -- leaving us with just char*.
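To make that concrete, here is a minimal sketch in plain C. The helper name utf8_copy is made up for illustration, and it assumes the input is NUL-terminated with no embedded U+0000. It copies a UTF-8 string using nothing but strlen, with no separate length field:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a NUL-terminated UTF-8 string.
 * Assumes the text has no embedded U+0000.
 * Returns NULL if allocation fails; the caller frees the result. */
char *utf8_copy(const char *s)
{
    assert(s != NULL);

    size_t n = strlen(s);        /* number of bytes, not characters */
    char *out = malloc(n + 1);   /* +1 for the NUL terminator */
    if (out != NULL)
        memcpy(out, s, n + 1);   /* copy the bytes plus the terminator */
    return out;
}
```

For "h\xC3\xA9llo" (that's "héllo" with the é spelled out as its two UTF-8 bytes), strlen reports 6 bytes for 5 characters, and the copy comes out byte-for-byte identical.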
And not trying to dodge your question about UTF-8 logic, but my point was you can dodge that whole question, because std::string provides the same amount of UTF-8 logic as char* -- that is, none at all. If you've been getting by on std::string, then you can get by on char*. If you only need to support UTF-8 input and output, and you don't need to manipulate strings (replace characters, truncate them, normalize them for use as keys in a data structure, etc.) or only need to do simple substring searches for ASCII characters, then you can just use char* or std::string. UTF-8 has a great design, which was consciously chosen to make all of that possible.
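A rough sketch of what that design buys you, in plain C. Both function names are invented for this example and both assume valid, NUL-terminated UTF-8. Searching for an ASCII byte is just strchr, because bytes below 0x80 never show up inside a multi-byte sequence. Counting characters (code points) only needs you to skip continuation bytes, which always look like 10xxxxxx:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the first occurrence of an ASCII byte.
 * Plain strchr is safe here because ASCII bytes (< 0x80) never
 * appear as part of a multi-byte UTF-8 sequence. */
const char *utf8_find_ascii(const char *s, char ascii)
{
    assert(s != NULL);
    assert((unsigned char)ascii < 0x80);
    return strchr(s, ascii);
}

/* Hypothetical helper: count code points in valid UTF-8 by counting
 * every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With "caf\xC3\xA9/men\xC3\xBC.txt" (i.e. "café/menü.txt"), the '/' is found at byte offset 5, strlen gives 15 bytes, and the code point count is 13. Anything beyond this (case folding, normalization, grapheme clusters) is where you'd want a real Unicode library, with either char* or std::string.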
Shipping cl.exe is illegal, I tried looking at clang but it was a lot bigger than 50MB, and it was unclear how to untangle it from installers and other dependencies:
I'm shipping it as part of https://ossia.io (statically linked to the app); my installers are between 50 and 100 MB for something that ships clang+llvm, boost, qt, ffmpeg and a lot of other things (in addition to its own code):
you just need to build llvm/clang statically and target_link_libraries(<couple stuff>) in cmake (and ship the headers if you want to do useful things, this actually takes much more space uncompressed but it'd be the same whatever the compiler)
it's only "mingw" because it uses the mingw headers. It uses the microsoft modern C runtime (ucrt) and runs just fine under normal cmd.exe shell with c:/windows/formatted/paths, and does not require e.g. MSYS.