Specialist AI Beats Frontier | AI Field Notes #60

Specialist AI is starting to beat the frontier on real work at a fraction of the cost. A fine-tuned model from Bridgewater and Mira Murati's Thinking Machines topped every leading model on six finance tasks at a fourteenth of the price per call, and Cognition's Devin Security Swarm out-found rival security agents on live bugs while undercutting Anthropic's tool on cost per run. Together AI raised $800 million on the same bet: that developers will keep swapping closed models for cheaper open ones. Google shipped Gemini Omni Flash for conversational video, GitHub Copilot's browser tools became generally available, and the latest Remote Labor Index still caps the best agent at 16.1% of real freelance jobs. ByteDance's $39 billion Brazil data center and Meta's plan to rent out spare compute show where the money is flowing.

AI Models · Thinking Machines Lab

Bridgewater's fine-tuned model beats frontier LLMs on finance at 1/14th the cost

Analysis: A custom model trained on one hedge fund's own expert judgment scored 84.7% on six real finance tasks, beating the best carefully prompted frontier model at 78.2% and running at about a fourteenth of the cost per call. Thinking Machines, the lab Mira Murati started after leaving OpenAI, ran it with Bridgewater's AIA Labs and published the results July 2. The jobs were unglamorous: scoring whether an article is relevant, reading a central-bank statement. Off-the-shelf models sat near 50% on a plain prompt. For a narrow task with good labeled data, teaching a small model your judgment now beats renting a genius by the token.
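To see how the accuracy gap and the price gap compound, here is a rough Python sketch. It assumes each score can be read as the share of calls that come back correct and prices a frontier call at one arbitrary unit; the function name and that unit are illustrative, not from the published results.

```python
# Scores and cost ratio as reported above; everything else is an assumption.
FINE_TUNED_SCORE = 0.847   # custom model, six finance tasks
FRONTIER_SCORE = 0.782     # best carefully prompted frontier model
COST_RATIO = 1 / 14        # fine-tuned cost per call, relative to frontier


def cost_per_correct(cost_per_call: float, score: float) -> float:
    """Expected spend per correct answer, treating the score as a hit rate."""
    return cost_per_call / score


frontier = cost_per_correct(1.0, FRONTIER_SCORE)  # one frontier call = 1 unit
fine_tuned = cost_per_correct(COST_RATIO, FINE_TUNED_SCORE)
advantage = frontier / fine_tuned

print(f"Frontier:   {frontier:.3f} units per correct answer")
print(f"Fine-tuned: {fine_tuned:.3f} units per correct answer")
print(f"Advantage:  {advantage:.1f}x")
```

Under those assumptions the fine-tuned model comes out roughly 15 times cheaper per correct answer, a little better than the headline fourteenth, because it is also right more often.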

Source: Thinking Machines Lab · thinkingmachines.ai

AI Agents · Cognition

Devin Security Swarm: Cognition's agents out-find Claude and Codex on live bugs

Analysis: Point a swarm of AI agents at a codebase, let each one prove which bugs are real, and ship the fix as a pull request. That is what Cognition's Devin Security Swarm, released July 1, did against 50 known vulnerabilities: it caught 36 at roughly $90 a run. Anthropic's Claude security tool found 34 and cost half again as much per scan; OpenAI's Codex caught 24. Three critical holes the rivals missed, including a PHP sandbox bypass, surfaced only here. A security review that once ate a specialist's weekend now runs for roughly $90.
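The per-bug economics behind those figures can be worked out directly. The Python sketch below assumes "half again as much per scan" means 1.5 times Devin's roughly $90 run, and that a scan and a run are the same unit. Codex has no cost entry because none is reported here. The names are illustrative.

```python
# Counts from the write-up above; Claude's cost is derived, not quoted.
TOTAL_VULNS = 50
DEVIN_COST = 90.0                # "roughly $90 a run"
CLAUDE_COST = DEVIN_COST * 1.5   # assumption: "half again as much" = 1.5x

tools = {
    "Devin Security Swarm": {"found": 36, "cost": DEVIN_COST},
    "Claude security tool": {"found": 34, "cost": CLAUDE_COST},
    "Codex": {"found": 24, "cost": None},  # no price given
}


def summarize(found, cost):
    """Return (detection rate, cost per bug found or None if cost is unknown)."""
    rate = found / TOTAL_VULNS
    per_bug = cost / found if cost is not None else None
    return rate, per_bug


results = {name: summarize(**t) for name, t in tools.items()}

for name, (rate, per_bug) in results.items():
    price = f"${per_bug:.2f} per bug found" if per_bug is not None else "cost not reported"
    print(f"{name}: {rate:.0%} detected, {price}")
```

On these assumptions Devin lands at about $2.50 per bug found against roughly $3.97 for Claude's tool, so the gap in cost per finding is wider than the two-bug gap in detections suggests.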

Source: Cognition · cognition.com

AI Industry · TechCrunch

Together AI raises $800M as developers keep swapping closed models for open

Analysis: $800 million at an $8.3 billion valuation, up from $3.3 billion sixteen months ago: Together AI, which rents Nvidia chips and hosts open-weight models, more than doubled its valuation in under a year and a half. The July 1 round, led by Saudi Aramco's venture arm, rides one bet: teams building on AI would rather run a model they can inspect and scale cheaply than pay a frontier lab per token. Together says annual bookings passed $1.15 billion, with Cursor and Cognition on the customer list. Open-weight models, whose internals are public and free to self-host, have quietly become a billion-dollar business.
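The growth figures in this item can be checked with a few lines of Python. The sketch assumes smooth compounding over the sixteen months to get an annualized rate, and it treats the $1.15 billion bookings figure as a floor, since Together says only that bookings passed it. Variable names are illustrative.

```python
# Figures as reported above, in billions of dollars.
PRIOR_VALUATION = 3.3
NEW_VALUATION = 8.3
MONTHS_BETWEEN = 16
ANNUAL_BOOKINGS = 1.15   # reported as "passed", so treat as a lower bound

# Overall step-up between the two rounds.
multiple = NEW_VALUATION / PRIOR_VALUATION

# Annualized growth factor, assuming smooth compounding over the period.
annualized = multiple ** (12 / MONTHS_BETWEEN)

# Valuation relative to bookings; an upper bound if bookings exceed 1.15.
valuation_to_bookings = NEW_VALUATION / ANNUAL_BOOKINGS

print(f"Step-up: {multiple:.2f}x over {MONTHS_BETWEEN} months")
print(f"Annualized: {annualized:.2f}x per year")
print(f"Valuation / bookings: at most {valuation_to_bookings:.1f}x")
```

The step-up works out to about 2.5x overall, which annualizes to very nearly a doubling per year, and the new valuation sits at a little over seven times reported bookings.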

Source: TechCrunch · techcrunch.com

AI Agents · GitHub Changelog

GitHub Copilot's browser tools go general, and its CLI now picks the model for you

Analysis: Copilot can now open a browser inside VS Code and click through a running app while it writes the code, a feature GitHub took out of preview on July 1. The same day, its command-line tool began choosing which model to run on its own, routing each request by how hard the task looks and what a dependable answer costs. That second change is the quiet one. Developers have spent a year hand-picking a model per task, which usually means doing the work first just to judge its difficulty. Handing that to a router trades a sliver of control for a lot less fiddling.

Source: GitHub Changelog · github.blog

AI Agents · Z.ai ZCode

Z.ai ships ZCode, a cross-platform coding harness for its GLM-5.2 model

Analysis: China's open-model camp now has a coding agent aimed straight at Claude Code and Cursor. Z.ai, the lab behind the GLM model family, released ZCode on July 2: a cross-platform harness that runs GLM-5.2 with your own API keys and keeps long-running goals alive across more than 20 tools and several agents at once. Paid plans start near $16 a month, well under what Western coding assistants charge, and it reaches users through WeChat, Feishu, and Telegram. For anyone weighing where their agent workflow should live, the list of serious options just stopped being American-only.

Source: Z.ai ZCode · zcode.z.ai

AI Models · Google AI for Developers

Google's Gemini Omni Flash turns video generation into a conversation in the API

Analysis: Ask for a video, watch it, then ask for changes in plain language and get a new cut back. Google put that loop into its API on July 2 with Gemini Omni Flash, a model that generates and edits video from a text prompt or an uploaded clip and holds the thread across several rounds of revision. The move is from one-shot generation toward something closer to directing. A freelance video editor who used to rough out three concept versions by hand can now describe the change and watch it land. Google has not posted per-second pricing, the figure that decides whether this is a tool or a toy.

Source: Google AI for Developers · ai.google.dev

AI Models · Hugging Face

Nvidia's Nemotron diffusion model claims near-top quality at 2.4x the speed

Analysis: Most language models write one token at a time, left to right. Nvidia's new Nemotron TwoTower, a 30-billion-parameter model released July 2, uses diffusion instead, refining a whole block of text at once the way image models sharpen a picture out of noise. Nvidia reports 98.7% of the quality of a standard model at 2.42 times the generation speed, with the work split into one tower that reads context and another that denoises. Faster output at similar quality is the whole ballgame for anyone paying by the token or waiting on an agent. If the numbers hold outside Nvidia's own benchmark, the left-to-right default gets its first real challenger.

Source: Hugging Face · huggingface.co

AI Industry · Venice

Venice hits $1B on a bet that some users will pay to keep AI private

Analysis: A privacy-first AI startup just became a unicorn. Venice raised $65 million on July 1 at a $1 billion valuation, led by the crypto investor Dragonfly, on the back of more than $70 million in yearly revenue and 3.5 million registered users. Venice runs open models without storing prompts or chat logs, selling the one thing the big assistants cannot: a conversation nobody keeps a copy of. Its buyers are people who treat their search history as their own business. Whether that instinct scales past a privacy-minded niche is the question a billion-dollar tag now puts on the table.

Source: Venice · venice.ai

LLM Evals · Center for AI Safety on X

Remote Labor Index: the best AI agent still finishes 16% of real freelance jobs

Analysis: Set loose on 240 real remote-work projects across 23 fields, the strongest public AI model finished 16.1% of them to a standard a paying client would accept. That is the latest read from the Remote Labor Index, a benchmark run by the Center for AI Safety and Scale AI that buys actual freelance deliverables instead of scoring toy tasks, updated July 2. Anthropic's Fable 5 led the field. The number cuts against the louder story of the week. Agents that ace coding demos still flub most of the messy, underspecified work that fills a real job, where the brief is vague and nobody hands you a rubric.

Source: Center for AI Safety on X · x.com

AI Industry · Bloomberg

ByteDance picks Brazil for a $39B data center, its biggest outside China

Analysis: $39 billion is the reported price on ByteDance's plan to build its largest data center outside China, and the site is Brazil. Bloomberg reported the project on July 1, part of a push by Chinese tech giants to plant AI compute in markets that Washington's export controls do not reach. Brazil offers cheap hydropower and a government courting the investment. For TikTok's owner, it is a hedge: infrastructure on friendly ground, beyond the reach of US chip rules, close to hundreds of millions of users in Latin America. The map of where AI actually runs is being redrawn by which doors stay open.

Source: Bloomberg · bloomberg.com

AI Industry · UN Web TV

UN's first global AI science report warns the window to coordinate is closing

Analysis: The first UN-backed scientific assessment of AI landed July 2 with a blunt message: the time to agree on shared guardrails is short and getting shorter. Written under a panel that includes Yoshua Bengio, a founding figure in deep learning, and the journalist Maria Ressa, it flags concentrated corporate power and safety controls that are failing to keep pace. The report arrives days before governments meet in Geneva on July 6 for the first Global Dialogue on AI Governance. It carries no binding force. Its real use is as a scorecard those same governments will be held against later.

Source: UN Web TV · webtv.un.org

AI Industry · CNBC

Meta explores selling its spare AI compute, and Wall Street likes it

Analysis: Meta is weighing a cloud business that would rent out its surplus AI compute, and possibly hosted access to its models, and the stock jumped 9% on the news July 1. The logic is plain: Meta has bought staggering amounts of chips and power to train its own models, and idle capacity earns nothing. Renting it, the way Amazon turned spare servers into AWS two decades ago, turns a cost center into revenue. For a company that has spent the year defending its AI spend to investors, a way to show cash coming back is worth a 9% pop. The open question is whether Meta wants to become a landlord.

Source: CNBC · cnbc.com