AI Field Notes by Michael Nemtsev

Issue #6

A corporate executive weighs stacks of money against tiny workers on a scale, with glowing server towers rising behind, illustrating how tech companies are prioritizing infrastructure spending over payroll as AI costs mount.

Today's pattern was the bill arriving. Google committed up to $40 billion to Anthropic, OpenAI doubled the price of its newest model, Meta cut 8,000 jobs to fund $115 billion of AI spending, and DeepSeek shipped a frontier model that runs without American chips. The story underneath is that the people writing the checks are choosing chips and electricity over payroll, and saying so out loud. If you are mid-career in a corporate role, update your resume this week. If you do creative work that an agent can imitate at draft quality, this is the year to lean into the parts that require judgment about people rather than data.

Models · Evals · OpenAI

OpenAI's new model scores 88.7% on the coding bar

Analysis: OpenAI shipped GPT-5.5 on April 23, rolling it immediately to paid subscribers in ChatGPT and its Codex coding assistant, with API access following on April 24. The model scores 88.7% on SWE-bench Verified (a test measuring whether an AI can resolve real software bugs from open-source repositories) and 92.4% on MMLU (a broad knowledge benchmark), and per OpenAI's system card, cuts hallucination rates by 60% compared to GPT-5.4. Artificial Analysis, an independent benchmarking firm, placed GPT-5.5 at the top of its Intelligence Index by a 3-point margin, breaking a three-way tie with Anthropic's Claude Opus 4.7 and Google's Gemini 3.1 Pro. API pricing is $5 input and $30 output per million tokens, above GPT-5.4's rates. Released the same week as DeepSeek's near-equivalent V4 at a fraction of the price, GPT-5.5 makes the premium-tier cost case harder for anyone revisiting AI vendor contracts this quarter.

Agents · Industry · OpenAI

ChatGPT gets a longer leash in the office

Analysis: OpenAI on April 23 launched Workspace Agents for ChatGPT users on its Business ($20 per user per month), Enterprise, Edu, and Teachers subscription plans. The agents are built without code, connected to tools like Slack and Salesforce, and assigned persistent multi-step workflows: summarizing project updates, drafting replies, routing requests across teams. Workspace Agents replace custom GPTs (the previous tool that let organizations build specialized ChatGPT personas) and give those bots the ability to act on connected systems rather than just generate text in a chat window. Access is free until May 6, after which credit-based pricing begins. The timing alongside GPT-5.5 is deliberate: OpenAI is assembling a stack where the model, the agent layer, and the enterprise integrations are all its own, replicating the account lock-in that enterprise software firms spent decades cultivating.

Models · Industry · DeepSeek API Docs

DeepSeek V4 arrives one year after its first earthquake

Analysis: Chinese AI lab DeepSeek released a preview of its V4 model series on April 24, one year after DeepSeek R1 upended the assumption that frontier-level AI required frontier-level compute budgets. DeepSeek-V4-Pro uses a Mixture-of-Experts architecture (MoE, a design that activates only 49 billion of its 1.6 trillion parameters per inference call, slashing per-query cost) and scores 80.6% on SWE-bench Verified, within 0.2 percentage points of Anthropic's Claude Opus 4.6. A larger variant, V4-Pro-Max, outperforms GPT-5.2 and Gemini 3.0 Pro on reasoning benchmarks while trailing GPT-5.4 and Opus 4.6. Per token, the V4-Flash variant costs between one-eighteenth and 1/214th as much as GPT-5.5, depending on task. Model weights are open; the older V3 and V3.2 versions retire July 24. The previous shock was about cost. This release confirms the trajectory holds.

Industry · GeekWire

Microsoft offers a golden handshake, first time in 51 years

Analysis: Microsoft disclosed on April 23 a voluntary retirement program, the first in the company's 51-year history, offering US employees a financial payout and extended healthcare to leave without being forced out. The program targets staff whose combined age and years of service total 70 or above, excluding certain senior and sales-incentive roles. CNBC characterized the eligible pool as up to 7% of the US workforce, representing thousands of employees. The stated rationale is cost control during a massive AI infrastructure buildup. The mechanics are telling: voluntary framing limits legal exposure compared to forced layoffs, and the age-plus-tenure threshold preferentially removes the highest-paid, longest-serving staff, typically those most embedded in legacy products rather than AI-native work. Structurally, this is a bet on what the next decade of Microsoft products requires from its people.

Industry · CNN

The grid is full. The AI build-out is not.

Analysis: A CNN investigation published April 23 produced the bluntest assessment yet of AI's energy problem. 'Basically, we have run out of headroom, largely speaking, in the US,' said Ben Hertz-Shargel, an electrification expert at energy research firm Wood Mackenzie, describing the gap between grid supply and what AI's continued expansion demands. The forcing function is the shift from conversational AI to autonomous agents, which run continuous inference (repeatedly querying a model to complete long-running tasks) rather than bursty single queries, multiplying sustained electricity demand per user. Technical fixes exist: efficiency improvements, demand-response programs, small modular nuclear reactors. They are moving too slowly against current deployment pace. ITIF (the Information Technology and Innovation Foundation) found that new data center deals fell more than 40% between Q3 and Q4 2025, with only one-third of 240 gigawatts of planned capacity actually being built, suggesting constraints are already curbing expansion.

Evals · Industry · OpenAI

An AI told to be a doctor now scores above physicians on the test its maker designed

Analysis: OpenAI's April 23 release notes disclosed a finding that will travel well outside the AI industry: GPT-5.4, deployed inside ChatGPT for Clinicians (a workspace tailored for healthcare professionals), outperformed both other AI models and human physicians on HealthBench Professional, OpenAI's clinical reasoning benchmark. HealthBench Professional tests diagnostic reasoning, treatment selection, and evidence interpretation across complex medical scenarios. The caveat embedded in the result is important: OpenAI designed, administered, and published the benchmark on which its own product scored highest. Independent clinical validation has not followed at the same pace. The claim matters regardless of that caveat. A Fortune 500 bank executive quoted in Fortune about GPT-5.5's launch cited 'hallucination resistance' as the decisive enterprise value, showing that medical-grade accuracy language has already crossed into financial services procurement conversations.

Evals · Models · Handy AI (citing Artificial Analysis)

OpenAI says 60 percent fewer hallucinations. One benchmarker measures an 86 percent rate. Both are right.

Analysis: OpenAI's GPT-5.5 system card claims a 60% reduction in hallucination rate compared to GPT-5.4. Independent evaluation firm Artificial Analysis published a different figure the same week: on its own methodology, GPT-5.5 scores an 86% hallucination rate in absolute terms. Both numbers can be accurate simultaneously because they measure different things. OpenAI's '60% reduction' is a relative improvement over the prior version, tested on OpenAI's internal evaluation suite. Artificial Analysis's '86%' is an absolute rate on their benchmark, which checks whether models fabricate specific factual claims on demand. A FindSkill.ai analysis circulated April 24 put the practical implication plainly: what used to be considered good practice for high-stakes AI use (double-checking outputs, constraining model scope) has shifted to necessary practice with GPT-5.5 specifically, because the baseline hallucination level, even after a 60% improvement, remains significant. Hallucination is the current condition of frontier AI, not a temporary gap.
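The arithmetic distinction is easy to demonstrate. A minimal sketch, using hypothetical evaluation numbers chosen only to mirror the published claims (neither OpenAI's nor Artificial Analysis's underlying counts are disclosed):

```python
def relative_reduction(old_rate, new_rate):
    """Vendor-style claim: improvement as a fraction of the prior model's rate."""
    return (old_rate - new_rate) / old_rate

def absolute_rate(fabricated, total):
    """Benchmarker-style claim: share of prompts with a fabricated claim."""
    return fabricated / total

# If GPT-5.4 hallucinated on 20% of an internal suite and GPT-5.5 on 8%,
# that is a 60% relative reduction, on that suite.
vendor_claim = relative_reduction(0.20, 0.08)   # 0.60

# On a different, harder benchmark that demands specific factual claims,
# the same model can still fabricate on 86 of 100 prompts.
benchmark_rate = absolute_rate(86, 100)         # 0.86
```

A 60% drop from a high enough baseline still leaves a high absolute rate, which is why both headlines survive scrutiny.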

Industry · Zee Biz

US tech fires thousands. India's IT sector mostly did not get the memo.

Analysis: While Meta and Microsoft announced potential reductions totaling 23,000 positions on April 23, a Zee Biz analysis published April 24 pointed to a geographic divergence worth watching: Indian IT services firms, including Tata Consultancy Services, Infosys, and Wipro, show only limited headcount decline, and some are adding AI-adjacent staff. The mechanism is different from US tech. Vertically integrated US firms that build their own AI and run their own cloud are cutting the roles their internal tools replace. Indian IT services companies bill clients for implementation and integration work, meaning AI tools make them more productive as service providers rather than directly replacing them. The exposure is real but delayed: if AI agents eventually manage the implementation layer that Indian IT firms currently sell, the disruption arrives later, not never. For now, the first wave of AI labor displacement is concentrating heavily at US tech headquarters.

Models · CNBC

DeepSeek's biggest model yet is MIT-licensed and free to download

Analysis: DeepSeek, the Hangzhou-based Chinese AI startup that rattled chip stocks in early 2025 with its R1 model, dropped preview releases of DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24. V4-Pro runs 1.6 trillion total parameters with 49 billion active per query, using a mixture-of-experts design that routes only part of the network per request to cut compute costs. V4-Flash is a leaner 284 billion total parameters with 13 billion active. Both carry MIT licenses (free to use commercially and modify), run a 1-million-token context window, and score near GPT-5.4 and Gemini 3.1 Pro on standard reasoning benchmarks while falling marginally short at the very top. Any organization that wants near-frontier AI inference without routing data through a US cloud now has a credible option to download and self-host.
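The mixture-of-experts idea behind those "total versus active" parameter counts can be sketched in a few lines. This is a toy illustration of the general MoE routing pattern, not DeepSeek's actual V4 architecture; every size below is invented for readability:

```python
import math
import random

random.seed(0)

N_EXPERTS = 8   # total expert networks (real MoE models have far more capacity)
TOP_K = 2       # experts actually activated per token
DIM = 16        # hidden dimension of the toy model

# Each "expert" here is just a bias vector; real experts are full subnetworks.
experts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
# The router scores every expert for a given token.
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = [sum(w * xi for w, xi in zip(router[e], x)) for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=lambda e: scores[e])[-TOP_K:]
    weights = [math.exp(scores[e]) for e in top]
    total = sum(weights)
    weights = [w / total for w in weights]  # softmax over the chosen experts only
    # Only TOP_K of the N_EXPERTS experts are evaluated; the rest stay idle,
    # which is where MoE's per-query compute savings come from.
    out = [0.0] * DIM
    for w, e in zip(weights, top):
        for i in range(DIM):
            out[i] += w * (x[i] + experts[e][i])  # toy expert: shift the input
    return out

token = [random.gauss(0, 1) for _ in range(DIM)]
out = moe_forward(token)
active_fraction = TOP_K / N_EXPERTS  # 0.25 in this toy; ~3% for 49B of 1.6T
```

The per-query saving scales with the inactive fraction: a model holding 1.6 trillion parameters but touching only 49 billion per call does roughly 3% of the dense model's inference work.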

Models · OpenAI Blog

GPT-5.5 arrives at $5 per million tokens with a super-app ambition

Analysis: OpenAI released GPT-5.5 on April 23, rolling it out to ChatGPT Plus, Pro, Business, and Enterprise subscribers and to the API at $5 per million input tokens and $30 per million output tokens, with a 1-million-token context window. Company president Greg Brockman described the model as a step toward an AI super app, a single product meant to become the default for most knowledge work, with improvements in coding, computer use, and long-horizon research. The timing matters: the launch arrived one day before DeepSeek dropped V4-Pro under a free MIT license, which immediately raises a question about how long a $5-per-million-token frontier model holds its pricing premium when a comparable open-weight alternative is free to download.
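For anyone pricing this against a current contract, the list rates translate directly into per-call cost. A back-of-envelope sketch using the announced $5/$30 prices; the workload sizes are hypothetical:

```python
INPUT_PER_M = 5.00    # USD per 1M input tokens (GPT-5.5 list price)
OUTPUT_PER_M = 30.00  # USD per 1M output tokens (GPT-5.5 list price)

def request_cost(input_tokens, output_tokens):
    """Cost in USD of one API call at GPT-5.5 list pricing."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Hypothetical workload: 2,000 input tokens and 800 output tokens per call,
# at 10,000 calls per day.
per_call = request_cost(2_000, 800)   # $0.034 per call
per_day = per_call * 10_000           # $340 per day
```

Output tokens dominate at these rates (6x the input price), so prompt-heavy, terse-answer workloads are much cheaper than the reverse.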

Models · Evals · Fortune

Anthropic's most restricted model got out through a contractor's employee

Analysis: Bloomberg reported on April 22 that a Discord group accessed Anthropic's Claude Mythos Preview, a model the company has withheld from public release and characterizes as having cybersecurity capabilities it rates too dangerous to deploy openly. Access came through a third-party vendor environment, with no direct compromise of Anthropic's own systems, according to Cybernews. One employee at the unnamed contractor appears to have been the entry point. OpenAI CEO Sam Altman called Anthropic's framing fear-based marketing, but security professionals quoted by Fortune focus on the timeline compression: AI-enabled attacks are now outpacing defenders' ability to adapt. The structural problem the incident surfaces is that AI labs operate through networks of contractors, and the security of a tightly controlled model reaches only as far as the most exposed party in that chain.

Industry · Let's Data Science (relaying The Information)

Microsoft is keeping the Nvidia GPUs for OpenAI and its own teams

Analysis: The Information reported on April 24 that AI startups are losing access to Nvidia GPUs because cloud providers, led by Microsoft, are redirecting supply toward internal operations and anchor customers including OpenAI. The structural cause is a priority queue: cloud providers that made the largest advance purchase commitments with Nvidia fulfill the most profitable contracts first. Smaller customers on month-to-month or spot-rate contracts receive what remains. The practical result is that AI companies either pay premium rates from specialized GPU rental firms, delay product launches, or build on lower-tier hardware that limits what they can run. The GPU scarcity that defined AI's early expansion has not resolved; the supply just flows toward larger incumbents.

Industry · Anthropic Blog

NEC puts 30,000 employees on Claude and calls it Japan's largest AI workforce

Analysis: NEC Corporation, the Tokyo-based IT and electronics conglomerate, announced on April 23 a partnership making it Anthropic's first Japan-based global partner. The deal deploys Claude to approximately 30,000 NEC Group employees worldwide through the company's NEC BluStellar enterprise software platform, with the stated goal of building one of Japan's largest AI-native engineering organizations. Japan has lagged the US and much of Europe in pushing organization-wide AI adoption; NEC's move marks the shift from cautious pilot programs to the mandated rollouts that have characterized enterprise AI in 2026. For Anthropic, the deal extends geographic reach at a moment when OpenAI and Google are competing for the same enterprise contracts across Asia and positions Anthropic as the preferred provider in Japan's large corporate sector.

Industry · Anthropic News

Anthropic says Claude will never run ads, and frames it as a structural argument

Analysis: On April 16, Anthropic published a statement declaring Claude will remain permanently free of advertising and sponsored content. The company's argument is structural: advertising pushes an AI assistant to optimize for engagement and sponsor satisfaction rather than user benefit, and the two goals are not compatible in its view. Google's Gemini products sit within a business that operates the world's largest advertising network. Microsoft's Copilot lives inside products that serve commercial partners. Anthropic is positioning Claude as the assistant that answers only to the paying subscriber. The statement functions as an implicit critique of both competitors without naming them, and it creates a public commitment with real reputational cost if the company reverses. Whether Anthropic can sustain that stance depends on whether enterprise subscriptions and API contracts cover the cost of frontier-class inference (running the model for every user query).

Industry · Anthropic Research

Anthropic's data: AI could handle far more work than companies are giving it

Analysis: Anthropic published 'Labor market impacts of AI: A new measure and early evidence' in March 2026, using actual enterprise usage data from Claude to map where AI is touching the US workforce. The study found Claude usage concentrates in roles requiring an estimated 14.4 years of education, roughly an associate degree level, compared to 13.2 years across the broader economy, consistent with heavier use in white-collar occupations. Crucially, the researchers found no clear statistical impact on employment in exposed occupations through the measurement period, despite CEO Dario Amodei having previously said AI could disrupt half of entry-level white-collar work. Actual deployment remains a small fraction of what the tools are technically capable of performing. That gap will not stay this wide, and the study amounts to an early measurement of a displacement that has not yet appeared in the employment statistics.

Industry · EU AI Act Timeline

Every company running high-risk AI in Europe has four months left

Analysis: The EU AI Act's obligations for high-risk AI systems take full effect on August 2, 2026, four months from today. High-risk under the Act covers AI used in hiring, credit scoring, education admissions, law enforcement, biometric identification, and critical infrastructure, among other categories. The compliance requirements are specific: mandatory conformity assessments (formal audits verifying the system meets legal requirements before deployment), ongoing logging, human oversight mechanisms, and technical documentation. Systems placed in operation before August 2 have a grace period to August 2027, but any new high-risk system deployed after August 2 must comply immediately. The EU framework functions as a de facto global standard for multinationals that cannot afford separate technical stacks per jurisdiction, meaning the deadline affects organizations well outside Europe.

Industry · ConstructConnect

The electricity grid is the binding constraint on AI expansion

Analysis: Power infrastructure spending related to new data center construction is projected to grow 31.6% year-over-year in 2026, according to ConstructConnect forecast data cited in the Stanford AI Index. The binding constraint is the interconnection queue, the formal backlog of projects awaiting approval to connect to the electricity grid, which in some US regions stretches years. AI-optimized facilities now regularly require 100 to 500 megawatts per site, a range that once described small cities rather than single buildings. NVIDIA's engineering lead for energy systems told Data Center World 2026 attendees that operators are increasingly using on-site generation (gas turbines, backup generators) as a workaround, and characterized those solutions as temporary. AI infrastructure expansion is gated by grid expansion, a problem no chip roadmap resolves and one that will keep compute costs elevated longer than the hardware headlines suggest.


Want this in your inbox?

One email a day, zero hype.

A short read every morning: what actually changed in AI, and what it means for work and daily life. Free, unsubscribe anytime.