I have heard there’s no 3.5 being tested internally, unless you mean ‘trained’ not ‘tested’. In that case IDK.
Regarding costs, Google is the only one in the race with money in the bank. Perhaps they consider usage right now strategically important enough to give away?
speculation on proprietary secrets is one of my fav blog post genres, & i'm optimistic about us getting to see how accurate this was at some point
Ty deep mind nerd. I hope you have a good day
NIAH is really not that good of a long-context benchmark. Like, it's fine but pretty limited. RULER, for example, I think is more complete.
The ideal long context benchmark also tests holistic understanding like "what are the overall themes of this novel I've attached"
I think we’ll eventually find out about what was going on in the frontier labs, it just may take a couple decades
Well, at least we'll get to know how it was done once one of the Chinese labs manages to replicate this kind of performance...
Well, we humans do it almost entirely with (continual, model based) RL, don't we?
god i wish i was smart enough to fully understand stuff like this
They might have just used fewer attention heads and done some kind of needle/haystack RLVR.