Stop posting AI slop, especially slop pull requests like the one you made to OpenClaw. Learn the first thing about a project you want to monetize and make fake contributions to. For example, OpenClaw is overwhelmed with slop PRs and the author has talked about this a lot.
In theory the same way people are making those claims about "stolen" art, such as models that produced watermarks from Getty images or Shutterstock. Similar "watermarks" have existed in some LLM output.
Trying to set up a prompt injection attack for someone accessing a repo with a coding agent is juvenile and pointless. And it doesn't deal with the training part.
It's been around for almost 15 years and has been stable enough for several providers to roll it out in production over the past 10 years (GCP and Azure in 2017).
AWS is just late to the game because they've rolled so much of their own stack instead of adapting open source solutions and contributing back to them.
> AWS is just late to the game because they've rolled so much of their own stack instead of adapting open source solutions and contributing back to them.
This is emphatically not true. Contributing to KVM and the kernel (which AWS does anyway) would not have accelerated the availability.
EC2 is not just a data center with commodity equipment. They have customer demands for security and performance that far exceed what one can build with a pile of OSS, to the extent that they build their own compute and networking hardware. They even have CPU and other hardware SKUs not available to the general public.
If my sources are correct, GCP did not launch on dedicated hardware like EC2 did, which raised customer concerns about isolation guarantees. (Not sure if that’s still the case.) And Azure didn’t have hardware-assisted I/O virtualization ("Azure Boost") until just a few years ago and it's not as mature as Nitro.
Even today, Azure doesn’t support nested virtualization the way one might ordinarily expect them to. It's only supported with Hyper-V on the guest, i.e., Windows.
> While nested virtualization is technically possible while using runners, it is not officially supported. Any use of nested VMs is experimental and done at your own risk, we offer no guarantees regarding stability, performance, or compatibility.
We operate a postgres service on Firecracker. You can create as many databases as you want, and we memory-snapshot them after 5 seconds of inactivity, and spin them up again in 50ms when a query arrives.
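A rough sketch of the idle-snapshot policy described above, as a toy state machine. It does not call Firecracker at all: the class name, the method names, and the explicit `now` timestamps are hypothetical, and only the 5-second idle threshold and 50ms resume figure come from the comment.

```python
from dataclasses import dataclass

IDLE_SECONDS = 5.0      # snapshot after this much inactivity (from the comment)
RESUME_SECONDS = 0.050  # resume cost when a query hits a snapshotted VM (from the comment)


@dataclass
class DatabaseVM:
    """Hypothetical model of one database microVM; not a real Firecracker API."""
    state: str = "running"
    last_query_at: float = 0.0

    def tick(self, now: float) -> None:
        # Periodic check: snapshot the VM once it has been idle long enough.
        if self.state == "running" and now - self.last_query_at >= IDLE_SECONDS:
            self.state = "snapshotted"

    def query(self, now: float) -> float:
        # Handle a query and return the extra latency (seconds) spent resuming.
        added = 0.0
        if self.state == "snapshotted":
            self.state = "running"
            added = RESUME_SECONDS
        self.last_query_at = now + added
        return added


vm = DatabaseVM()
vm.query(0.0)          # VM is already running, so no resume cost
vm.tick(5.0)           # 5 seconds idle, so the VM gets snapshotted
print(vm.state)
print(vm.query(6.0))   # a query arrives and pays the resume cost
```

The point of the sketch is only that idle databases cost nothing but snapshot storage, while the first query after an idle period pays a small fixed resume latency.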
Not sure what you mean. It IS the same model, just a smaller version of it. And gpt-5.3-codex is a smaller version of gpt-5.3 trained more on code and agentic tasks.
Their naming has been pretty consistent since gpt-5. For example, gpt-5.1-codex-max > gpt-5.1-codex > gpt-5.1-codex-mini.
What do you mean by "the same model, just a smaller version"? Codex should be a fine-tune of the "normal" version; where did you get that it's smaller? It's not as simple as taking some weights from the model and creating a new model. Normally the mini or flash models are trained separately, based on data from the larger model.
Yea it's been butchering relatively easy to moderate tasks for me even with reasoning set to high. I am hoping it's just tuning that needs to be done since they've had to port it to a novel architecture.
If instead the model is performing worse due to how much they had to shrink it just so it will fit on Cerebras hardware, then we might be in for a long wait for the next gen of ginormous chips.
Agree w/ you on the model's tendency to butcher things. Performance-wise, this almost feels like the GPT-OSS model.
I need to incorporate "risk of major failure" into bluey bench. Spark is a dangerous model. It doesn't strongly internalize the consequences of the commands that it runs, even on xhigh. As a result I'm observing a high tendency to run destructive commands.
For instance, I asked it to assign random numbers to the filenames of the videos in my folder to run the benchmark. It accidentally deleted the files on most of the runs. The funniest part about it is that it comes back to you within a few seconds and says something like "Whoops, I have to keep it real, I just deleted the files in your folder."
Ouch, at least it fesses up. I ran into problems with it first refusing to use git "because of system-level rules in the session". Then later it randomly amended a commit and force pushed it because it made a dumb mistake. I guess it was embarrassed.
Not if you're suggesting that "(served by Cerebras)" should be part of the name. They're partnering with Cerebras and providing a layer of value. Also, OpenAI is "serving" you the model.
We don't know how they integrate with Cerebras hardware, but typically you'd pay a few million dollars to get the hardware in your own datacenter. So no, "served by Cerebras" is confusing and misleading.
Also "mini" is confusing because it's not analogous to gpt-5.1-codex vs gpt-5.1-codex-mini. Gpt-5.3-codex-spark is a unique, _experimental_ offering that doesn't fit the existing naming suffixes.
I don't understand what's wrong with "spark". It's friendly and evokes a sense of something novel, which is perfect.
If you want to know more about the model, read the first paragraph of the article. That information doesn't need to be hardcoded into the model name indefinitely. I don't see any "gpt-5.3-codex-nvidia" models.
Uh, that paragraph translated from "marketing bullshit" into "engineer" would be "we distilled the big gpt-5.3-codex model into a smaller size that fits on the 44GB of SRAM of a Cerebras WSE-3 multiplied by whatever tensor parallel or layer parallel grouping they're doing".
(Cerebras runs llama-3.3 70b on 4 WSE-3 units with layer parallelism, for example).
That's basically exactly what gpt-5.3-codex-mini would be.
> Also "mini" is confusing because it's not analogous to gpt-5.1-codex vs gpt-5.1-codex-mini.
So perhaps OpenAI intentionally picked the model's layer param count, MoE expert size, etc. to fit onto the Cerebras machines. That's like saying "the DVD producer optimized this movie for you" (they just cropped and compressed it down to 4.7GB so it would fit on a DVD). Maybe the typical mini model is 100GB, and they made it 99GB instead or something like that. It's still analogous to gpt-5.3-codex-mini.
I'm underselling it a little bit, because it takes a bit more work than that to get models to run on Cerebras hardware (because they're so weird and un-GPU-like), but honestly if Cerebras can get Llama 3.1 405b or GLM 4.7 running on their own chips, it's not that much harder to have Cerebras get gpt-5.3-codex-mini running.
Uh, the combined offering (smaller model + ~800 tps on cerebras) is nothing like the previous mini offerings, and you're hallucinating details about their process of creating it.
Read more about how Cerebras hardware handles clustering. The limit is not 44 GB or 500GB. Each CS-3 has 1,200 TB of MemoryX, supporting up to ~24T parameter models. And up to 2,048 can be clustered.
Yeah, it's pretty clear you're loudmouthed and don't know anything about distilling ML models or anything Cerebras. Distilling ML models into smaller mini versions is basic stuff. How do you think Qwen 3 235b and Qwen 3 30b were made? Or GLM 4.5 355b vs GLM 4.5 Air 105b? Or Meta Llama 4 Maverick and Scout? And everyone knows that the reason Cerebras never served DeepSeek R1 or Kimi K2 or any other model bigger than ~500B is because their chips don't have enough memory. People have been begging Cerebras to serve DeepSeek forever now, and they never actually managed to do it.
Cerebras doesn't run inference from MemoryX, the same way no other serious inference provider runs inference off of system RAM. MemoryX is connected to the CS-3 over Ethernet! It's too slow. MemoryX is only 150GB/sec for the CS-3![1] If you're running inference at 800 tokens/sec, 150GB/sec means each token can only load about 0.19GB of params. For obvious reasons, I don't think OpenAI is using a 0.19B-parameter model.
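The arithmetic above can be checked with a quick sketch. It assumes, as the comment does, that every weight would have to cross the 1.2 Tb/s link once per generated token; the one-byte-per-parameter figure is an assumption used to turn gigabytes into a parameter count, and the function name is hypothetical.

```python
def params_budget_per_token(link_gbit_per_s: float,
                            tokens_per_s: float,
                            bytes_per_param: float = 1.0):
    """Return (GB/s, GB per token, billions of params per token)."""
    gb_per_s = link_gbit_per_s / 8             # bits to bytes
    gb_per_token = gb_per_s / tokens_per_s     # bandwidth split across tokens
    b_params = gb_per_token / bytes_per_param  # assumes 1 byte per weight by default
    return gb_per_s, gb_per_token, b_params


# 1.2 Tb/s system I/O (12 x 100 GbE) at 800 tokens/sec
gb_per_s, gb_per_token, b_params = params_budget_per_token(1200, 800)
print(f"{gb_per_s:.0f} GB/s, {gb_per_token:.4f} GB per token, ~{b_params:.2f}B params")
```

That works out to 150 GB/s and just under 0.19 GB of weights per token, which is far too little to stream a model of any useful size from off-chip memory at that token rate.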
The limit is 44GB for each WSE-3. [2] That's how much SRAM a single WSE-3 unit has. For comparison, an Nvidia H100 GPU has 80GB, and a DGX H100 server with 8 GPUs has 640GB of VRAM. Each WSE-3 has 44GB to play around with, and then if you have each one handling a few layers, you can load larger models. That's explicitly what Cerebras says they do: "20B models fit on a single CS-3 while 70B models fit on as few as four systems." [3]
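A back-of-the-envelope sketch of why those numbers line up with 44GB per wafer. It assumes 16-bit weights (2 bytes per parameter) and counts only the weights, ignoring KV cache and activations; both are simplifying assumptions, and the function name is hypothetical.

```python
import math

WSE3_SRAM_GB = 44  # on-chip SRAM per WSE-3, as cited above


def systems_needed(params_billion: float,
                   bytes_per_param: float = 2.0,
                   sram_gb: float = WSE3_SRAM_GB) -> int:
    """Minimum number of wafers whose combined SRAM holds the weights."""
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte is about 1 GB
    return math.ceil(weights_gb / sram_gb)


for size in (20, 70):
    print(f"{size}B params -> {systems_needed(size)} system(s)")
```

Under those assumptions a 20B model needs 40GB and fits on one system, while a 70B model needs 140GB and takes four, which matches the Cerebras quote.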
You're reading marketing material drivel about training models that NOBODY uses Cerebras for. Basically nobody uses Cerebras for training, only inference.
[1] https://www.kisacoresearch.com/sites/default/files/documents... "The WSE-2’s 1.2Tb/s of I/O bandwidth is used for [...] transmitting gradients back to the MemoryX service." That quote is about WSE-2/CS-2, but the CS-3 spec lists the same System I/O: 1.2 Tb/s (12×100 GbE).
[2] https://cdn.sanity.io/images/e4qjo92p/production/50dcd45de5a... This really makes it obvious why Cerebras couldn't serve Deepseek R1. Deepseek is 10x larger than a 70b model. Since they don't do tensor parallelism, that means each chip has to wait for the previous one to finish before it can start. So not only is it 10x more memory consumption, it has to load all that sequentially to boot. Cerebras' entire market demands 1000 tokens per second for the much higher price that they charge, so there's no profit in them serving a model which they can only do 500 tokens/sec or something slow like that.
Yes. In order to serve 1k tokens/s, they must be fitting the entire model in SRAM and not reaching out to off-chip RAM. This means they’re likely chaining multiple wafer chips together to serve this model, or they shrank the model to fit one wafer chip. It’s uneconomical for many use cases, but for highly valuable tasks it could be worth it.
This is one area Nvidia chips have not been able to serve: ultra-fast, ultra-high-value tasks. Hence the Groq acquisition.
Yea, it's pretty clear you're loudmouthed and an aggressively arrogant know-it-all (at least you think). You keep moving the goalposts too. First you're acting like they can't run models that don't fit in 44GB or 4x44GB. Then you say they can "only" run a larger model at 500 tps but that wouldn't be profitable. Lol
Give Pi[1] a try. Comes pretty barebones out of the box, yet still provides a decent default experience. Extension points are all TypeScript if you want. There are a lot of examples[2] and some 3rd party extensions[3].
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.