It is easy to defend the motte (protection of children, protection against abuse and heinous crimes), and easy to expand and farm on the bailey (universal surveillance, mass data collection, and the erosion of privacy).
Agree that the choices are strange: Sonnet 4.6 was tested, but not Opus 4.6.
Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026), so it's strange that they are missing while Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.
Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Typically those models tend to be in the low-to-mid cost range, with no or low reasoning.
The setup was kept consistent across all models for the benchmark, and Opus and 3.1 Pro would typically be overkill and expensive even with reasoning off.
Good point tho, will add this point to the blog too :)
Also, the benchmark is open source, so anyone can run a model on it and submit a PR; the leaderboard is dynamic and will automatically pick that up.
The value of such a benchmark, to me, would be "what is peak performance?", not just "what is mid-tier performance?" Also, possibly, "what is the per-dollar performance?" Time and money permitting, I'd really like to see your benchmark extended to the large reasoning models.
If you want to avoid using Opus 4.7, then why GPT-5.4 (unless with a disclaimer that it is at a low reasoning setting, or after checking that on medium its price is comparable to Haiku/Flash)?
This slogan started as a reaction against rising antisemitism after October 7, a problem which many here simply deny ("Germany is again on the wrong side").
Well, "There Will Be a Scientific Theory of Deep Learning" looks like flag planting: an academic variant of "I told you so!", but one that is a citation magnet.
It's actually really fascinating that there isn't a scientific theory of deep learning, especially as it's a product of human engineering as opposed to e.g. biology or particle physics.
There are very good reasons why it took this long, but they can be summed up as: everyone was looking in the wrong place. Deep learning breaks a hundred years of statistical intuition, and you don't turn a ship that large quickly.
Calling it “a product of human engineering” is misleading. Deep learning exploits principles we don’t fully understand. We didn’t engineer those principles. It’s not fundamentally any different than particle physics or biology, which are both similarly consequences of rules that we didn’t invent and can’t control.
As much as the website looks nice, the design looks AI-generated: image loading animations, quotation marks around species names. (Both are needless decorations.)
Not at all! Decorations are needed for lots of things. For example, obviously decorations are needed for decorating. Successful sexual posturing in some birds requires large, decorative body parts like feathers or crests.
Hmm, if Mythos can secure all models that can't secure themselves, then it's likely that Mythos can secure itself. But since it can secure itself, it can't secure itself, because it can...