
Online age verification is an example of the Motte-and-bailey fallacy (https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy, https://slatestarcodex.com/2014/11/03/all-in-all-another-bri...).

It is easy to defend on the motte hill (protection of children, protection against abuse and heinous crimes), and easy to expand and farm on the bailey (universal surveillance, mass data collection, and the erosion of privacy).


Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within a certain time frame or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.


Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.

Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.


Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Typically those models are in the low-to-mid cost range, with no or low reasoning.

For the benchmark, the setup was kept consistent across all models, and Opus and 3.1 Pro would typically be overkill and expensive even with reasoning off.

Good point though, will add it to the blog too :)

Also, the benchmark is open source, so anyone can run a model on it and create a PR too; the leaderboard is dynamic and will automatically add that in.


The value of such a benchmark, to me, would be "what is peak performance", not just "what is mid-tier performance" - and, possibly, "what's the per-dollar performance". Time and money permitting, I'd really want to see your benchmark extended to the large reasoning models.

Then the way to go is to use a Pareto frontier, e.g. https://quesma.com/benchmarks/binaryaudit/#cost
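
For anyone who hasn't built one: the Pareto frontier keeps exactly the models that no other model beats on both cost and accuracy at once. A minimal sketch in Python - the model names and (cost, accuracy) numbers are made up purely for illustration, not real benchmark data:

    # Hypothetical (cost per 1M tokens in $, accuracy) pairs - illustrative only.
    models = {
        "flash-ish": (0.30, 0.81),
        "haiku-ish": (0.80, 0.84),
        "sonnet-ish": (3.00, 0.90),
        "opus-ish": (15.00, 0.91),
        "dated-model": (1.00, 0.78),
    }

    def pareto_frontier(models):
        # Keep a model unless some other model is no more expensive
        # and at least as accurate (i.e. it is dominated).
        frontier = []
        for name, (cost, acc) in models.items():
            dominated = any(
                other != name and c <= cost and a >= acc
                for other, (c, a) in models.items()
            )
            if not dominated:
                frontier.append(name)
        return frontier

    print(pareto_frontier(models))
    # ['flash-ish', 'haiku-ish', 'sonnet-ish', 'opus-ish'] - only 'dated-model' drops out

Plotting cost on a log x-axis against accuracy and highlighting these points gives the usual frontier chart: every non-frontier model is strictly worse than some frontier model on both axes.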

If you want to avoid using Opus 4.7, then why GPT-5.4 (unless with a disclaimer that it is on a low reasoning setting, or after checking that on medium its price is comparable with Haiku/Flash)?

Also, it is usually good to look at the newest models. Gemini 2.5 Flash is quite dated; Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).


Well, there is "never again" (https://en.wikipedia.org/wiki/Never_again).

It is up to us to decide if "never again" is a universal rule or "oh, but this time it is different".


“Nie wieder ist jetzt” is common graffiti in Berlin right now: “Never again is now.”

This slogan started as a reaction against rising antisemitism after October 7, a problem which many here just deny ("Germany is again on the wrong side").

Well, "There Will Be a Scientific Theory of Deep Learning" looks like flag planting - an academic variant of "I told you so!", but one that is a citation magnet.

It's actually really fascinating that there isn't a scientific theory of deep learning, especially as it's a product of human engineering as opposed to e.g. biology or particle physics.

There are very good reasons why it took this long, but they can be summed up as: everyone was looking in the wrong place. Deep learning breaks a hundred years of statistical intuition, and you don't move a ship that large quickly.

Well, there are some fundamentals, such as universal approximation.

And we may find that biology also exploits structures like deep nets.
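
For reference, the classic universal approximation statement (Cybenko 1989; Hornik et al.; Leshno et al. 1993 for the non-polynomial condition) says, roughly: a single hidden layer with a fixed non-polynomial activation can approximate any continuous function on a compact set. A rough sketch in symbols:

    \forall f \in C(K),\ \forall \varepsilon > 0\ \exists N, v_i, w_i, b_i :
    \quad \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

with K a compact subset of R^n and sigma a fixed non-polynomial activation. Note it says nothing about how large N must be or whether gradient descent finds the weights - which is part of why this alone is far from a theory of deep learning.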


There is, but it is fractured. I would describe this effort as more of a standardization of terms and language.

Calling it “a product of human engineering” is misleading. Deep learning exploits principles we don’t fully understand. We didn’t engineer those principles. It’s not fundamentally any different than particle physics or biology, which are both similarly consequences of rules that we didn’t invent and can’t control.

SWE-bench Verified is, at this point, contaminated: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

So it is hard to tell how much of a model's gain is due to skill, and how much to overfitting.


As much as the website looks nice, the design looks AI-generated - image loading animations, quotation marks for species names. (Both are needless decorations.)

I don’t care if someone uses AI to get a simple website up. I’m here for their content, the photos.

The photos are all either stolen or AI generated.

Aren’t all decorations needless?

Not at all! Decorations are needed for lots of things. For example, obviously decorations are needed for decorating. Successful sexual posturing in some birds requires large, decorative body parts like feathers or crests.

"Your audience is good at recognizing problems and bad at solving them" - Mark Rosewater (Lead Designer of Magic: The Gathering, from his famous "20 Years, 20 Lessons" GDC talk, http://magic.wizards.com/en/news/making-magic/twenty-years-t...)

I really recommend using Agent Safehouse (https://news.ycombinator.com/item?id=47301085).

Don’t give your agent access to content it should not edit, and don’t give it keys it shouldn’t use.
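
To be clear, this is not Agent Safehouse's API (see the link above for that) - just a hedged sketch of the least-privilege idea in plain Python: run the agent's process with a scrubbed environment (no ambient API keys) and a scratch directory as the only place it works in. The `my-agent-cli` command is hypothetical:

    import os
    import subprocess

    WORKDIR = "/tmp/agent-scratch"  # the only directory the agent should touch
    os.makedirs(WORKDIR, exist_ok=True)

    # Minimal environment: keep PATH so binaries resolve, drop everything else
    # (API keys, SSH agent sockets, cloud credentials do not leak in).
    env = {"PATH": os.environ["PATH"], "HOME": WORKDIR}

    subprocess.run(
        ["my-agent-cli", "--workdir", WORKDIR],  # hypothetical agent command
        cwd=WORKDIR,
        env=env,
        check=True,
    )

This alone doesn't stop a malicious process from reading the rest of the filesystem, so treat it as a first layer; real isolation needs a container or VM.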


Mythos can secure all models that cannot secure themselves. Can Mythos secure itself?

Hmm, if Mythos can secure all models that can't secure themselves, that by itself just forces it to be able to secure itself (otherwise it would be a model that can't secure itself, which Mythos must then secure... i.e. itself). The fun starts if it secures all and only such models: then if it can secure itself, it can't, and if it can't, it can.......
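
This is the barber paradox (Russell's paradox in disguise), and the quantifier matters. A small sketch, writing S(x, y) for "x secures y" and M for Mythos:

    % Plain "all" is harmless: instantiating y := M merely forces S(M, M).
    \forall y\, \big( \neg S(y, y) \to S(M, y) \big) \;\Rightarrow\; S(M, M)

    % "All and only" is where the contradiction lives:
    \forall y\, \big( S(M, y) \leftrightarrow \neg S(y, y) \big)
        \;\Rightarrow\; \big( S(M, M) \leftrightarrow \neg S(M, M) \big)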

Memorandum: please do not use the word "periodic" for things that are not periodic.

Other suitable choices: chart, classification, taxonomy, visualization, table, map, etc., etc.
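
For reference, "periodic" has a precise meaning - a minimal statement:

    f \text{ is periodic} \iff \exists T > 0 :\ f(x + T) = f(x)\ \text{for all } x

The periodic table earns the name because chemical properties recur at regular intervals as atomic number increases; most "periodic tables of X" have no such recurrence.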

