Plus: Google Maps' biggest upgrade in a decade, Travis Kalanick's secret robotics company, and Anthropic's Dispatch. Everything you need to know about AI this week.
I wonder if I could run the research loop locally with LM Studio and VS Code.
It would be very cool if you could do this with electricity as the only cost.
You can get surprisingly far with LM Studio + VS Code. The main bottleneck is context window and reasoning quality on local models. For a basic research loop (search → summarize → iterate), something like Llama 3 70B quantized works. The gap vs. Claude/GPT-4 is biggest on the "knowing when to stop and synthesize" step. But for electricity-only cost? Absolutely worth experimenting with.
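To make the idea concrete, here's a minimal sketch of such a loop, assuming LM Studio's OpenAI-compatible server is running at its default `http://localhost:1234/v1` address; the model name and the `should_stop` heuristic are illustrative assumptions, not a tested recipe. The stop heuristic is exactly where local models struggle, so it's made explicit here:

```python
# Sketch of a local research loop (search -> summarize -> iterate)
# against LM Studio's OpenAI-compatible server. The model name,
# endpoint, and stop heuristic are illustrative assumptions.
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def ask_local_model(prompt: str, model: str = "llama-3-70b-instruct") -> str:
    """Send one chat request to the local LM Studio server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()
    req = urllib.request.Request(
        LM_STUDIO_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def should_stop(notes: list[str], max_rounds: int = 5) -> bool:
    """Crude 'knowing when to stop' heuristic: stop after max_rounds,
    or when the newest finding is much shorter than the previous one
    (a rough proxy for diminishing returns)."""
    if len(notes) >= max_rounds:
        return True
    if len(notes) >= 2 and len(notes[-1]) < 0.1 * len(notes[-2]):
        return True
    return False

def research_loop(question: str) -> str:
    notes: list[str] = []
    while not should_stop(notes):
        prompt = (f"Question: {question}\nNotes so far:\n" + "\n".join(notes)
                  + "\n\nAdd one new finding, or reply DONE if nothing is left.")
        finding = ask_local_model(prompt)
        if finding.strip() == "DONE":
            break
        notes.append(finding)
    return ask_local_model("Synthesize these notes:\n" + "\n".join(notes))

if __name__ == "__main__":
    print(research_loop("What limits local-model research agents?"))
```

In a frontier-model loop, "stop and synthesize" is a judgment the model makes itself; here it has to be hard-coded, which is a fair summary of the quality gap.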
The final mRNA vaccine version for the dog was actually created with Grok.
Good clarification! Fascinating.
True, Grok 4.20 heavy seems to be really good, but hardly anyone tries what it can do.
It’s more expensive than the others and has fewer integrations.
The challenge with this and the evals is that you're mainly testing structure, not resonance, which, admittedly, is really hard to do. Scoring marketing page copy doesn't test whether it actually resonates with the audience. A lot of judgment and discernment remains that can't be captured by the eval criteria alone.
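A toy illustration of the structure-vs-resonance gap (the rubric criteria here are made up, not from the original post): a purely structural eval gives identical scores to copy with the same shape, even when the actual wording would land very differently with readers.

```python
# Toy structural eval for marketing page copy (criteria are invented
# for illustration). It can check shape, but two pieces of copy that
# score identically can resonate very differently with an audience.

def structural_score(copy: str) -> int:
    """Score 0-3 on purely structural criteria."""
    lines = [l for l in copy.splitlines() if l.strip()]
    score = 0
    if lines and len(lines[0]) <= 60:               # short headline
        score += 1
    if any("you" in l.lower() for l in lines):      # addresses the reader
        score += 1
    if lines and lines[-1].rstrip().endswith("!"):  # ends in a call to action
        score += 1
    return score

flat = "Buy our product\nYou can order it now\nClick here!"
vivid = "Ship in minutes\nYou write code; we run it\nStart free today!"

# Both pass the rubric equally -- the eval cannot tell them apart.
assert structural_score(flat) == structural_score(vivid) == 3
```

Everything the rubric can see is satisfied by both versions, which is the point: resonance lives outside the criteria.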
The "locked eval harness" is the crucial design choice here. Everyone focuses on the loop, but the real insight is that constraining what the agent CAN'T touch is what makes the whole thing work.
If the agent could edit both the code and the scoring criteria, it would just Goodhart's Law itself into oblivion.
By making the eval immutable, Karpathy basically sidestepped the reward-hacking problem for this kind of optimization loop.
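A toy simulation of why locking the eval matters (all names and the target string are invented for illustration): a hill-climbing "agent" may mutate the candidate but never the scoring function, so improvements in score are real; the moment the agent can also rewrite the eval, "progress" becomes meaningless.

```python
# Toy demonstration of a locked eval harness vs. Goodharting.
# The agent mutates a candidate string; the immutable eval scores it.
import random

TARGET = "helloworld"

def locked_eval(candidate: str) -> int:
    """Immutable scoring: count of characters matching the target."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def optimize(steps: int = 2000, seed: int = 0) -> str:
    """Hill-climb the candidate. The agent can only change the
    candidate, never locked_eval, so score gains reflect real progress."""
    rng = random.Random(seed)
    best = "x" * len(TARGET)
    for _ in range(steps):
        i = rng.randrange(len(TARGET))
        cand = best[:i] + chr(rng.randrange(97, 123)) + best[i + 1:]
        if locked_eval(cand) >= locked_eval(best):
            best = cand
    return best

# If the agent could edit the eval too, it would just award itself
# a perfect score instead of improving the candidate:
def self_graded_eval(candidate: str) -> int:
    return 10**9  # Goodharted: maximum score, zero actual progress
```

The immutable `locked_eval` is what makes the loop's score a trustworthy signal; `self_graded_eval` is the one-line version of Goodharting itself into oblivion.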