
Online age verification is an example of the Motte-and-bailey fallacy (https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy, https://slatestarcodex.com/2014/11/03/all-in-all-another-bri...).

It is easy to defend on the motte hill (protection of children, protection against abuse and heinous crimes), and easy to expand and farm on the bailey (universal surveillance, mass data collection, and the erosion of privacy).


Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within a certain time frame or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.


Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.

Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.


Yeah, we selected the models that are most commonly integrated into developer workflows and used for structured output. Typically those models are in the low-to-mid cost range, with no or low reasoning.

For the benchmark, the setup was kept consistent across all models, and Opus and 3.1 Pro would typically be overkill and expensive even with reasoning off.

Good point though, will add it to the blog too :)

Also, the benchmark is open source, so anyone can run a model on it and create a PR too; the leaderboard is dynamic and will automatically add that in.


The value of such a benchmark, to me, would be "what is peak performance", not just "what is mid-tier performance" - and, possibly, "what's the per-dollar performance". Time and money permitting, I'd really want to see your benchmark extended to the large reasoning models.

Then the way to go is to use a Pareto frontier, e.g. https://quesma.com/benchmarks/binaryaudit/#cost
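
For anyone who hasn't built one: the Pareto frontier keeps exactly the models that no other model beats on both cost and accuracy at once. A minimal sketch in Python - the model names and (cost, accuracy) numbers are made up purely for illustration, not real benchmark data:

    # Hypothetical (cost per 1M tokens in $, accuracy) pairs - illustrative only.
    models = {
        "flash-ish": (0.30, 0.81),
        "haiku-ish": (0.80, 0.84),
        "sonnet-ish": (3.00, 0.90),
        "opus-ish": (15.00, 0.91),
        "dated-model": (1.00, 0.78),
    }

    def pareto_frontier(models):
        # Keep a model unless some other model is no more expensive
        # and at least as accurate (i.e. it is dominated).
        frontier = []
        for name, (cost, acc) in models.items():
            dominated = any(
                other != name and c <= cost and a >= acc
                for other, (c, a) in models.items()
            )
            if not dominated:
                frontier.append(name)
        return frontier

    print(pareto_frontier(models))
    # ['flash-ish', 'haiku-ish', 'sonnet-ish', 'opus-ish'] - only 'dated-model' drops out

Plotting cost on a log x-axis against accuracy and highlighting these points gives the usual frontier chart: every non-frontier model is strictly worse than some frontier model on both axes.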

If you want to avoid using Opus 4.7, then why GPT-5.4 (unless with a disclaimer that it is on a low reasoning setting, or after checking that on medium its price is comparable with Haiku/Flash)?

Also, it is usually good to look at the newest models. Gemini 2.5 Flash is quite dated; Gemini 3.1 Flash Lite is the new one (https://openrouter.ai/google/gemini-3.1-flash-lite-preview).


Well, there is "never again" (https://en.wikipedia.org/wiki/Never_again).

It is up to us to decide if "never again" is a universal rule or "oh, but this time it is different".


“Nie wieder ist jetzt” is common graffiti in Berlin right now: “Never again is now.”

This slogan started as a reaction against rising antisemitism after October 7, a problem which many here just deny ("Germany is again on the wrong side").

Well, "There Will Be a Scientific Theory of Deep Learning" looks like flag planting - an academic variant of "I told you so!", but one that is a citation magnet.

It's actually really fascinating that there isn't a scientific theory of deep learning, especially as it's a product of human engineering as opposed to e.g. biology or particle physics.

There are very good reasons why it took this long, but they can be summed up as: everyone was looking in the wrong place. Deep learning breaks a hundred years of statistical intuition, and you don't move a ship that large quickly.

Well, there are some fundamentals, such as universal approximation.

And we may find that biology also exploits structures like deep nets.
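
For reference, the classic universal approximation statement (Cybenko 1989; Hornik et al.; Leshno et al. 1993 for the non-polynomial condition) says, roughly: a single hidden layer with a fixed non-polynomial activation can approximate any continuous function on a compact set. A rough sketch in symbols:

    \forall f \in C(K),\ \forall \varepsilon > 0\ \exists N, v_i, w_i, b_i :
    \quad \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

with K a compact subset of R^n and sigma a fixed non-polynomial activation. Note it says nothing about how large N must be or whether gradient descent finds the weights - which is part of why this alone is far from a theory of deep learning.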


There is, but it is fractured. I would describe this effort as more of a standardization of terms and language.

Calling it “a product of human engineering” is misleading. Deep learning exploits principles we don’t fully understand. We didn’t engineer those principles. It’s not fundamentally any different than particle physics or biology, which are both similarly consequences of rules that we didn’t invent and can’t control.

SWE-bench Verified is, at this point, contaminated: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

So it is hard to tell how much of a model's gain is due to skill, and how much to overfitting.


As much as the website looks nice, the design looks AI-generated - image loading animations, quotation marks for species names. (Both are needless decorations.)

I don’t care if someone uses AI to get a simple website up. I’m here for their content, the photos.

The photos are all either stolen or AI generated.

Aren’t all decorations needless?

Not at all! Decorations are needed for lots of things. For example, obviously decorations are needed for decorating. Successful sexual posturing in some birds requires large, decorative body parts like feathers or crests.

"Your audience is good at recognizing problems and bad at solving them" - Mark Rosewater (Lead Designer of Magic: The Gathering, from his famous "20 Years, 20 Lessons" GDC talk, http://magic.wizards.com/en/news/making-magic/twenty-years-t...)

I really recommend using Agent Safehouse (https://news.ycombinator.com/item?id=47301085).

Don’t give your agent access to content it should not edit, and don’t give it keys it shouldn’t use.
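
To be clear, this is not Agent Safehouse's API (see the link above for that) - just a hedged sketch of the least-privilege idea in plain Python: run the agent's process with a scrubbed environment (no ambient API keys) and a scratch directory as the only place it works in. The `my-agent-cli` command is hypothetical:

    import os
    import subprocess

    WORKDIR = "/tmp/agent-scratch"  # the only directory the agent should touch
    os.makedirs(WORKDIR, exist_ok=True)

    # Minimal environment: keep PATH so binaries resolve, drop everything else
    # (API keys, SSH agent sockets, cloud credentials do not leak in).
    env = {"PATH": os.environ["PATH"], "HOME": WORKDIR}

    subprocess.run(
        ["my-agent-cli", "--workdir", WORKDIR],  # hypothetical agent command
        cwd=WORKDIR,
        env=env,
        check=True,
    )

This alone doesn't stop a malicious process from reading the rest of the filesystem, so treat it as a first layer; real isolation needs a container or VM.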


Mythos can secure all models that cannot secure themselves. Can Mythos secure itself?

Hmm, if Mythos can secure all models that can't secure themselves, that by itself just forces it to be able to secure itself (otherwise it would be a model that can't secure itself, which Mythos must then secure... i.e. itself). The fun starts if it secures all and only such models: then if it can secure itself, it can't, and if it can't, it can.......
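
This is the barber paradox (Russell's paradox in disguise), and the quantifier matters. A small sketch, writing S(x, y) for "x secures y" and M for Mythos:

    % Plain "all" is harmless: instantiating y := M merely forces S(M, M).
    \forall y\, \big( \neg S(y, y) \to S(M, y) \big) \;\Rightarrow\; S(M, M)

    % "All and only" is where the contradiction lives:
    \forall y\, \big( S(M, y) \leftrightarrow \neg S(y, y) \big)
        \;\Rightarrow\; \big( S(M, M) \leftrightarrow \neg S(M, M) \big)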

Memorandum: please do not use the word "periodic" for things that are not periodic.

Other suitable choices: chart, classification, taxonomy, visualization, table, map, etc., etc.
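
For reference, "periodic" has a precise meaning - a minimal statement:

    f \text{ is periodic} \iff \exists T > 0 :\ f(x + T) = f(x)\ \text{for all } x

The periodic table earns the name because chemical properties recur at regular intervals as atomic number increases; most "periodic tables of X" have no such recurrence.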

