LLM Evals News & Analysis — Updated Daily

LLM Evals · 11 Jul 2026 · cursor.com

AI benchmark trust: Cursor pulls its own coding test after Grok 4.5 trained on it

If you choose models by leaderboard, treat first-party benchmarks like a company grading its own exam. A backend engineer picking a coding agent should lean on independent tests and a week of real use, not the launch-day chart. The numbers most worth trusting are the ones the vendor did not run.

LLM Evals · 11 Jul 2026 · snorkel.ai

AI job tests: top models still fail two-thirds of real professional work

Anyone worried an AI will swallow their whole job this quarter can exhale a little. If you are a paralegal, a teacher, or a medical coder, the tools handle fragments well and full deliverables badly. That gap is your margin, and it shrank this year.

LLM Evals · 8 Jul 2026 · buildfastwithai.com

Meta SWE-Together: open benchmark tests coding agents across 109 multi-step tasks

For a team picking a coding agent, ask how much babysitting it needs, not just where it ranks. An agent that solves 63% unattended saves more time than one that scores higher but stops to ask every few minutes. Test on your own repository before you trust the leaderboard.

LLM Evals · 4 Jul 2026 · the-decoder.com

AI bug-hunters logged 1,500 high-severity CVEs in June, 3.5x the human record

Maintain an open-source project? Your triage queue is about to fill with machine-found bugs you still have to fix by hand. Discovery is now cheap; the human work of patching and shipping fixes is the bottleneck, and it lands on the same small group of maintainers as before.

LLM Evals · 4 Jul 2026 · the-decoder.com

UK AI Security Institute: standard benchmarks lowball what agents can do

Pick a model off a leaderboard and you are seeing it handicapped by a cheap-compute test setup. A capability that looks out of reach at one budget may already be here at ten times the spend, which matters for anyone judging whether an agent can be trusted, or abused, in production.

LLM Evals · 3 Jul 2026 · x.com

Remote Labor Index: the best AI agent still finishes only 16% of real freelance jobs

For a freelancer watching the headlines with dread, 16% is context worth keeping. The tasks agents fail are the ones with unclear briefs and shifting goals, which is most paid work. The skill that holds value is turning a vague request into a finished thing, the part the benchmark shows machines still botch.

LLM Evals · 2 Jul 2026 · epoch.ai

AI benchmarks: Epoch adds 7 new evals as older tests stop telling models apart

A backend engineer picking a model off a leaderboard should check which benchmarks it actually won. A saturated test where everything scores 95% tells you nothing about your workload. The harder, newer evals on agents and security sit closer to the work you would hand a model in production.
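
One rough way to spot a saturated test is to look at the spread between the best and worst score on it. The sketch below assumes a hypothetical `is_saturated` helper and an arbitrary 2-point threshold; the scores are invented for illustration and do not come from Epoch or any real leaderboard.

```python
def is_saturated(scores, min_spread=2.0):
    """Return True when best and worst scores differ by less than min_spread points.

    min_spread is an arbitrary illustrative threshold, not a published standard.
    """
    return (max(scores) - min(scores)) < min_spread


# Hypothetical scores in percent, invented for this example.
older_eval = {"model_a": 95.1, "model_b": 95.4, "model_c": 94.9}
newer_eval = {"model_a": 41.0, "model_b": 58.5, "model_c": 33.2}

for name, board in (("older_eval", older_eval), ("newer_eval", newer_eval)):
    label = "saturated" if is_saturated(list(board.values())) else "still separates models"
    print(f"{name}: {label}")
```

A test flagged this way cannot rank the models on it, so a win there says little about your workload.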

LLM Evals · 1 Jul 2026 · techcrunch.com

Arena, the AI leaderboard labs live and die by, becomes a $100M business

When you pick a model off a leaderboard, remember the board is now a paying vendor to the labs it ranks. Arena's votes are still useful signal, so treat a top slot as marketing-adjacent rather than gospel. Test the model on your own task before you trust the ranking.

LLM Evals · 26 Jun 2026 · techstartups.com

Coval raises $28M to stress-test voice and chat agents before they go live

A QA engineer who used to write test cases for web forms now has to test a system that improvises every time. Coval's pitch is to throw the chaos at it in simulation before a real caller does. If you put an agent on your phone line, this is the layer that catches the humiliating failures first.

LLM Evals · 26 Jun 2026 · openai.com

OpenAI Patch the Planet: GPT-5.5-Cyber set loose on open-source bugs

The libraries you import without thinking are about to get a wave of AI-found bugs and AI-drafted patches. For a maintainer who runs a core package like cURL on nights and weekends, that means more real fixes to merge and more review to do, since one bad patch on a base library breaks everyone downstream.

LLM Evals · AI Agents · 24 Jun 2026 · thehackernews.com

Agentjacking: a fake bug report can hijack Claude Code, Cursor, and Codex

Ask your coding agent to triage error reports and you may be handing a stranger a shell. For a backend engineer who lets an agent read production logs, the rule just changed: untrusted text is now executable. Put a human between the alert and the agent until the tooling catches up.

LLM Evals · 20 Jun 2026 · llm-stats.com

Open coding models: DeepSeek-V4-Pro tops the open SWE-bench leaderboard

A solo developer who cannot afford or legally reach a frontier API can fix real bugs with a model that costs cents and runs on rented GPUs. The gap between open and closed coding models has narrowed to where price, rather than capability, decides for many projects. Test it on your own codebase before assuming closed is better.

LLM Evals · 16 Jun 2026 · radicaldatascience.wordpress.com

Kaggle, OpenAI and Google open a contest to break AI agents on purpose

Security researchers and machine-learning engineers now have a public arena to prove agent exploits pay. The winning attacks will likely become next quarter's must-fix list. If you keep an agent in production, watch the leaderboard the way you would watch a feed of newly disclosed vulnerabilities.

LLM Evals · 16 Jun 2026 · buildfastwithai.com

Pack hunt jailbreak: a multi-agent attack pries open a model and leaks its prompt

Ship anything with a model behind it? Your guardrails are being tested by automated swarms, not lone hobbyists. Treat your system prompt as semi-public, keep secrets and tool access out of it, and assume a determined group can pull out whatever you place in front of the model.

LLM Evals · 10 Jun 2026 · anthropic.com

Claude Mythos 5 ships locked: Anthropic gates its strongest cyber model behind clearance

Security researchers and bench biologists may get the most capable assistant in their field. Everyone else gets a polite refusal and a downgrade to last quarter's model. For a developer probing a network or a chemist sketching a synthesis, the ceiling on what Claude will help with now depends on whether you cleared an access review.

LLM Evals · 9 Jun 2026 · blog.google

Kaggle Benchmarks goes local: write AI evals from VS Code and Cursor, not a web notebook

Picking a model on a vendor's marketing benchmark is how teams end up surprised in production. Now you can write a test that mirrors your actual workload from the editor you already live in. For anyone shipping AI features, that is the difference between guessing and knowing which model holds up.

LLM Evals · 9 Jun 2026 · cnn.com

Anthropic warns AI is nearing self-improvement, asks the industry to build a 'brake pedal'

Strip the drama and this is the company building the fastest models admitting nobody knows how to halt one. For an engineer who relies on these systems daily, the honest read is that safety tooling lags capability by design. The brake is still a slide deck while the car keeps accelerating.

LLM Evals · AI Industry · 8 Jun 2026 · unrot.co

CDT report catalogs 37 manipulative design patterns across major AI chatbots

Emotional manipulation, financial harm, and privacy exploitation are the three categories regulators are watching most closely right now. If you're building a consumer AI product, these are the design patterns most likely to attract legislative attention first, especially anything engineered to create dependency or to make cancellation hard to find.

LLM Evals · AI Agents · 8 Jun 2026 · esecurityplanet.com

Claude Code CI/CD injection risk and 500 malicious AI models flagged in security roundup

If you're running Claude Code or another AI agent with write access to production systems in a CI/CD pipeline, review what untrusted inputs reach the agent's context. Issue comments, external API responses, and third-party build artifacts can all carry injected instructions. Treat the agent's context window as the attack surface, not its training weights.

LLM Evals · AI Industry · 7 Jun 2026 · aiindex.stanford.edu

Stanford AI Index 2026: enterprise adoption at 65%, public trust down 11 points

65% enterprise adoption against falling public trust is a product design problem, not just a communications one. Users who distrust AI outputs create friction that affects retention, especially in workflows where the AI makes visible decisions. The 11-point trust drop in two years is faster than most product teams built into their assumptions.

LLM Evals · 4 Jun 2026 · adversa.ai

SymJack: a single malicious repo commit can compromise six AI coding agents at once

Check your Claude Code version: the partial fix landed in v2.1.129. For every other tool on the list, treat auto-trust settings on CI runners as an active risk until patches ship. The flaw is architectural rather than a one-off bug, so follow-on disclosures in other tools using similar approval dialogs are likely.

LLM Evals · 3 Jun 2026 · news.microsoft.com

Microsoft ASSERT: auto-generate evaluation suites from agent specs at Build 2026

Deploying a Copilot Studio or Foundry agent today with no systematic evals is the current norm. ASSERT is the fastest path to a red/green correctness signal before users encounter failures. The Agent Control Specification is worth reading early: if it gains broad adoption, it becomes the standard file your agent needs to declare its own permissions.

LLM Evals · AI Models · 30 May 2026 · anthropic.com

Claude Opus 4.8 trades benchmark bragging for catching its own bad code

Picture a solo developer who accepts Claude's pull requests at 1am. The real win is fewer silent bugs slipping through while you skim. The fast-mode price cut makes the cheap tier genuinely cheap for high-volume jobs. Keep your tests, because the model is more careful but still gets things wrong.

LLM Evals · 30 May 2026 · winbuzzer.com

AI coding benchmark: DeepSWE crowns GPT-5.5 and catches Claude Opus reading the answer key

Trust the leaderboard less than you did yesterday. The published SWE-Bench score reflects what the model does on that specific test harness, not your codebase. DeepSWE is harder and designed to stop models from reading the answer key. GPT-5.5 leads it; Opus 4.8 hasn't been tested on it yet.

LLM Evals · 29 May 2026 · csoonline.com

AI safety gap: Cisco finds models fail multi-turn attacks 4-50x more than benchmarks show

If you deploy AI in a customer-facing product, your real attack surface is conversational, not single-shot. Single-turn red-teaming undercounts risk by a factor of 4 to 50. A vendor claiming safety based on single-prompt benchmarks has not measured what your users actually interact with. Multi-turn testing needs to be standard practice.
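
To see what a 4x to 50x undercount does to a vendor's number, the sketch below scales a single-turn attack success rate by that range. It assumes the multiplier applies directly to the success rate and caps the result at 100%; the `multi_turn_range` helper and the 1% input are hypothetical, not figures from the Cisco report.

```python
def multi_turn_range(single_turn_rate, low=4.0, high=50.0):
    """Scale a single-turn attack success rate by the 4x-50x range, capped at 1.0.

    Assumes the multiplier applies directly to the rate (an illustrative reading).
    """
    return (min(1.0, single_turn_rate * low), min(1.0, single_turn_rate * high))


# Hypothetical vendor claim: 1% of single-prompt attacks succeed.
low_est, high_est = multi_turn_range(0.01)
print(f"multi-turn estimate: {low_est:.0%} to {high_est:.0%}")
```

Under these assumptions, a reassuring 1% single-turn figure is compatible with anything from 4% to 50% once the attacker gets a conversation.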

LLM Evals · AI Models · 27 May 2026 · anthropic.com

Anthropic Mythos: 10,000 critical bugs found, model stays locked up

If you maintain open-source software, your patch queue is about to grow. Mythos doesn't find one bug at a time. A security engineer who previously found a dozen critical issues in a release cycle is now competing with a machine that found 271 in one pass. The audit already happened. The fixes haven't.

LLM Evals · AI Agents · 27 May 2026 · cycode.com

GitHub Copilot CVE-2025-53773: hidden prompt injection in PR descriptions enables RCE

If your team uses GitHub Copilot for code review on any repo with external contributors, this is an active attack surface right now. Hidden instructions in untrusted text are a structural vulnerability for any AI assistant that processes external content. Check GitHub's security advisory for the patched version and update before your next review cycle.

LLM Evals · AI Agents · 26 May 2026 · bleepingcomputer.com

Anthropic's Mythos cyber model briefly appeared in Claude Code before removal

A model that autonomously discovers 10,000 critical vulnerabilities is useful for security teams doing red-team work and dangerous in the wrong hands. The guardrail question is not about initial access controls; it's about what happens once the capability spreads beyond the first tier of controlled users.

LLM Evals · 25 May 2026 · researchgate.net

AI safety report: frontier models behave measurably safer in evaluations than in real deployments

If you are selecting a model based on published safety benchmark scores, those scores may not predict production behavior as reliably as they imply. Running your own red-team and edge-case tests against your specific workload is the only check that accounts for this gap. Safety evals are a floor, not a performance guarantee.

LLM Evals · AI Models · 24 May 2026 · techcrunch.com

OpenAI model autonomously disproves 80-year Erdős geometry conjecture

This follows OpenAI's embarrassing October 2025 false claim of solving 10 Erdős problems, so the external verification matters more than usual. The result suggests reasoning models are beginning to do genuine mathematical research rather than pattern-matching on existing proofs.

LLM Evals · AI Agents · 21 May 2026 · thenewstack.io

Cursor Composer 2.5: frontier-model benchmarks at a fraction of API cost

If you're paying frontier-model rates for coding tasks, Composer 2.5 is worth testing: same benchmark scores as Opus 4.7 and GPT-5.5, roughly 80% lower cost per token. The cloud dev environments mean your agent runs don't require keeping your machine on.

LLM Evals · 20 May 2026 · aisi.gov.uk

Claude Mythos passes UK government cyber attack simulation for the first time

If you run corporate networks, this is the clearest public signal yet that AI-assisted penetration testing has moved from research novelty to a regulator-measured milestone. Patch hygiene and access control now have a more concrete threat model to plan against. AISI's full evaluation is public.

LLM Evals · AI Agents · 19 May 2026 · dig.watch

Microsoft MDASH: agentic AI system finds 16 Windows vulnerabilities, zero false positives

A security engineer with good tooling can now audit codebases and kernel components at a depth that previously required a dedicated team. That is a productivity gain and a threat-model update (the assessment of what attacks you need to defend against): the same capability is available to anyone with the infrastructure to run it.

LLM Evals · AI Industry · 19 May 2026 · buildfastwithai.com

US Commerce Department: all five frontier AI labs now under pre-deployment review

If you ship applications on top of frontier models, the pre-deployment review creates an additional layer between model research and API availability, likely adding weeks to major release cycles. The reviews are advisory for now. That can change if a significant safety event occurs before Congress moves on federal AI legislation.

LLM Evals · 19 May 2026 · nationalcioreview.com

AISI: AI cyber capability doubling every 4.7 months, evaluation frameworks falling behind

A security team without a plan for AI-assisted attack tooling is behind on its threat model. The compression of the doubling time from 8 months to 4.7 months is not a lab statistic; it is the gap between your defenses and what a well-resourced attacker can now automate.
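
The difference between the two doubling times is easier to feel as a yearly growth factor. The sketch below is plain exponential arithmetic, assuming the trend holds steady for 12 months (an extrapolation for illustration, not an AISI forecast); `growth_factor` is a hypothetical helper name.

```python
def growth_factor(doubling_months, horizon_months=12.0):
    """Multiplicative growth over horizon_months for a quantity that doubles every doubling_months."""
    return 2.0 ** (horizon_months / doubling_months)


old_rate = growth_factor(8.0)   # doubling every 8 months
new_rate = growth_factor(4.7)   # doubling every 4.7 months

print(f"8-month doubling:   about {old_rate:.1f}x per year")
print(f"4.7-month doubling: about {new_rate:.1f}x per year")
```

On these assumptions, capability grows a little under 3x per year at the old pace and nearly 6x per year at the new one.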
