9 Comments
Lydia Nottingham

Speculation on proprietary secrets is one of my favorite blog post genres, and I'm optimistic we'll get to see how accurate this was at some point.

Re:Courses

Thanks, DeepMind nerd. I hope you have a good day.

Tim Dingman

NIAH really isn't that strong a long-context benchmark. It's fine, but pretty limited; RULER, for example, is more complete.

The ideal long-context benchmark also tests holistic understanding, like "what are the overall themes of this novel I've attached?"
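For context, a NIAH (needle-in-a-haystack) check boils down to hiding one fact in a long stretch of filler text and asking the model to retrieve it. A minimal sketch of the prompt construction, with a made-up needle and filler sentence for illustration:

```python
def build_niah_prompt(needle: str, filler: str, n_sentences: int = 1000,
                      depth: float = 0.5) -> str:
    """Build a needle-in-a-haystack prompt: repeated filler text with one
    'needle' fact inserted at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

# Hypothetical needle and filler, for illustration only.
needle = "The magic number is 7481."
prompt = build_niah_prompt(needle, "The sky is blue today.",
                           n_sentences=200, depth=0.3)

# A real harness would append a question ("What is the magic number?"),
# send the prompt to the model, and score whether "7481" appears in the answer.
assert needle in prompt
```

This illustrates the criticism above: the task only probes exact retrieval at varying depths, not holistic understanding of the whole context.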

Everett

I think we'll eventually find out what was going on in the frontier labs; it just may take a couple of decades.

Garloid 64

Well, at least we'll get to know how it was done once one of the Chinese labs manages to replicate this kind of performance...

0k

I've heard there's no 3.5 being tested internally, unless you mean 'trained' rather than 'tested'; in that case, I don't know.

Regarding costs, Google is the only one in the race with money in the bank. Perhaps they consider usage right now strategically important enough to give it away?

Giampiero Campa

Well, we humans do it almost entirely with (continual, model-based) RL, don't we?

Substack Enjoyer

God, I wish I were smart enough to fully understand stuff like this.

swiley

They might have just used fewer attention heads and done some kind of needle/haystack RLVR (RL with verifiable rewards).
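For what it's worth, the "needle/haystack RLVR" idea reduces to a binary verifiable reward: the hidden fact either appears in the model's answer or it doesn't. A minimal sketch, with hypothetical function and strings (real setups normalize answers far more carefully):

```python
def niah_reward(response: str, needle_answer: str) -> float:
    """Verifiable reward for needle-in-a-haystack retrieval:
    1.0 if the model's response contains the hidden answer, else 0.0.
    Normalization here is just case-folding; real harnesses vary."""
    return 1.0 if needle_answer.casefold() in response.casefold() else 0.0

print(niah_reward("The magic number is 7481.", "7481"))  # 1.0
print(niah_reward("I couldn't find it.", "7481"))        # 0.0
```

Because the reward is checkable without a judge model, it's cheap to run at scale over long synthetic contexts, which is what makes it a plausible long-context training recipe.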
