AI Socratic

AI Socratic Dec 2025

Federico Ulfo, Roberto Stagi · December 17, 2025 · 15 min read
Market News

The most important AI news and updates from last month: Nov 15 - Dec 15.

Events

AI Aperitivo 2.0

Milano · Tuesday, December 16

AI Builders Milan hosts the second AI Aperitivo 🍸🍷🫒🧀 — an evening of Socratic dialogues with Milan's top AI engineers, researchers, and founders.

AI Aperitivo Milan

AI Dinner 16.0

New York · Wednesday, December 17

AI NYC hosts another AI Dinner 🍲🍕🍺 — we'll discuss news and updates using this blog post to run the Socratic dialogues.

AI Dinner NYC

GPT-5.2 vs Opus 4.5 vs Gemini 3 vs Grok 4.1

The "Model Wars" have intensified with major releases from all top providers, focusing heavily on reasoning and efficiency.

GPT-5.2: OpenAI’s latest step is less “bigger model” and more “better worker”. It ships in Instant, Thinking, and Pro variants tuned for deep, multi-step knowledge work (coding, long-context synthesis, and tool-heavy agent workflows like spreadsheets and presentations). On ARC-AGI-2 (Verified), GPT-5.2 Thinking posts 52.9% and Pro reaches 54.2%, positioning it as OpenAI’s flagship for coding and agentic tasks. Even at higher per-token pricing, it’s pitched as cheaper per unit of quality thanks to improved token efficiency (note: GPT-5.1 already signaled massive efficiency gains, reaching o3 performance at 150x lower cost).
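The "cheaper per quality despite a higher per-token price" argument is simple arithmetic: what you pay for a task is price times tokens used. The sketch below uses made-up prices and token counts (they are assumptions for illustration, not published figures) to show how a pricier but more token-efficient model can still cost less per task.

```python
def cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Dollar cost of one task at a given per-million-token price."""
    return price_per_million_tokens * tokens_per_task / 1_000_000

# Hypothetical numbers for illustration only, not real pricing.
older = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=60_000)
newer = cost_per_task(price_per_million_tokens=14.0, tokens_per_task=30_000)

print(f"older model: ${older:.2f} per task")
print(f"newer model: ${newer:.2f} per task")
```

With these assumed inputs the newer model charges 40% more per token but uses half the tokens, so it comes out cheaper per task.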

Grok 4.1:

Gemini 3.0: Google released Gemini 3 (Pro and Flash), billed as a massive leap over GPT-5.1 in reasoning, speed, and video. It reportedly "one-shotted" an entire website build, leading some to declare front-end development "dead".

Claude Opus 4.5: Anthropic's new flagship model is a significant breakthrough. It outperforms predecessors while being cheaper than Sonnet 4.5. Notably, it embeds reasoning directly into files when traces are disabled and is marketed as the best model for coding and agentic computer use. Many engineers consider it the best coding model available.

Model Wars Cycle

ARC Prize Leaderboard

US AI startups are increasingly built on Chinese open-source foundations

Chinese open-source models (like DeepSeek and Qwen) have surpassed US models in global downloads (17% vs 15.8% market share).

Risks: imported censorship/ideology in weights; regulatory surprises if US decides some of those models are "foreign critical tech."

Payoff: price/performance and context length that are very attractive to early-stage founders.

DeepSeek V3.2 was also released.

Top 12 nations ranked by all-time Hugging Face downloads 🤗

Developer and National Market Share

Model Size Distribution

Model Modality Distribution

The "Agentic IDE" Wars

The battle for the developer environment has moved beyond autocomplete to full autonomous agency.

Google Antigravity

Google launched "Antigravity," an agent-first IDE positioning itself as a direct competitor to Cursor. It features Gemini 3 Pro and browser control for automated testing.

Controversy: Varun Mohan joined Google, leaving his team behind. Antigravity carries over Windsurf code, to the point that they didn't even change the name of the coding agent.

Cursor Composer 2.0

Cursor released Composer 2.0 with an agentic browser that allows parallel agents to code and self-test, claiming a 99.9% cost reduction compared to traditional dev teams.

Claude Code

Opus 4.5 is now available in Claude Code.

 * ▐▛███▜▌ *   Claude Code v2.0.69
* ▝▜█████▛▘ *  Opus 4.5 · Claude Max
 *  ▘▘ ▝▝  *   ~/projects/aisocratic

The Shift

Senior engineers are accepting more AI code than juniors because they know how to prompt and decompose work effectively: agents are amplifying senior skill rather than replacing it.


Genesis Mission: AI Manhattan Project

The intersection of AI and geopolitics has escalated to Manhattan Project levels.

The White House launched the Genesis Mission, a massive initiative using Department of Energy (DOE) supercomputers to build a national AI platform. The goal is to automate scientific research in biotech, nuclear, and quantum fields. This is a clear signal that the White House is favoring AI companies.

The Trump administration also recently approved the sale of H200 chips to China. Less than 24 hours later, China confirmed its ban on all NVIDIA chips, claiming Huawei is building something better.

ref: https://genesis.energy.gov/

Agentic AI Foundation & MCP

Anthropic, OpenAI, and Block created the Agentic AI Foundation (AAIF) under the Linux Foundation, donating MCP, AGENTS.md, and goose as founding projects.

MCP got a new spec in late November, pushing it from "tool calling" into long-running, production-grade workflows.
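For readers who haven't seen the "tool calling" baseline that MCP started from: a client invokes a server-side tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds that message. The tool name `search_docs` and its `query` argument are hypothetical, and the newer long-running workflow features are not shown here.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to be identifiable.
_ids = count(1)

def tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)

# `search_docs` and `query` are made-up names for illustration.
request = tool_call_request("search_docs", {"query": "nested learning"})
print(request)
```

A real client would send this over the transport negotiated with the server and match the response by `id`; that plumbing is left out to keep the sketch short.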


Claude used in a hack attack

Recently a Chinese state-sponsored attack used Claude to run 80-90% of the work, using MCP tools to harvest credentials, plant backdoors, and write exploits. The implication is that AI agents boost attacker scale and effectiveness. Take this with a grain of salt, though: Dario Amodei is focusing on the risks of AI and pushing for more restrictive regulation. He is spreading awareness, yes, but also fear, in support of stronger regulations that would benefit Anthropic.

Anthropic: Disrupting AI Espionage

Dario Amodei interview: https://www.youtube.com/embed/aAPpQC-3EyE?si=eJLwZFYiuwdFxx-I

Related to hack attacks, OpenAI disclosed a security incident at its analytics provider Mixpanel, potentially compromising API user data including names and locations.

OpenAI Mixpanel Incident


Research Papers

Google's Nested Learning paper

A new paper proposes neural networks as a hierarchy of learners that update parameters during inference, allowing for continuous learning without forgetting—potentially the "next Transformer" moment. https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

Sakana AI — Continuous Thought Machines

The Continuous Thought Machine (CTM) is an AI model that uniquely uses the synchronization of neuron activity as its core reasoning mechanism, inspired by biological neural networks. Unlike traditional artificial neural networks, the CTM uses timing information at the neuron level, which allows for more complex neural behavior and decision-making processes. This enables the model to “think” through problems step by step, making its reasoning process interpretable and human-like. Sakana's research reports improvements in both problem-solving capabilities and efficiency across various tasks. The CTM represents a meaningful step toward bridging the gap between artificial and biological neural networks, potentially unlocking new frontiers in AI capabilities.

https://sakana.ai/ctm/
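To make "synchronization of neuron activity" a bit more concrete, one simple way to measure it is the inner product between the activation histories of two neurons: the value is large when they rise and fall together and negative when they move in opposition. The toy sketch below illustrates that idea with hypothetical traces. It is an assumption-laden illustration of the concept, not Sakana's implementation.

```python
def synchronization(traces):
    """Pairwise inner products of neuron activation traces.

    traces[i][t] is the activation of neuron i at internal tick t.
    Entry [i][j] is large when neurons i and j move together over time.
    """
    n = len(traces)
    return [
        [sum(a * b for a, b in zip(traces[i], traces[j])) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical activation traces over four internal ticks.
traces = [
    [1.0, -1.0, 1.0, -1.0],   # neuron 0
    [1.0, -1.0, 1.0, -1.0],   # neuron 1: in phase with neuron 0
    [-1.0, 1.0, -1.0, 1.0],   # neuron 2: in anti-phase with neuron 0
]
sync = synchronization(traces)
print(sync[0][1], sync[0][2])
```

In this toy setup the resulting matrix, rather than a single activation snapshot, is the representation a downstream layer would read from.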

OpenAI's $1.4 Trillion Bet

OpenAI is projecting $100B in revenue by 2027 but is committing a staggering $1.4 Trillion to infrastructure.


Notable New Tools & Research

  • Nano Banana Pro: A standout tool this month for visuals. It can compress entire earnings PDFs into infographics, generate insights from papers, and create slides.

  • SAM 3 (Segment Anything 3): Scale AI and Meta released SAM 3 for open-source image/video segmentation and 3D reconstruction.

  • Intellect-3: A 100B+ parameter MoE (Mixture of Experts) model released by Prime Intellect (PI), trained using decentralized computing. It shows state-of-the-art performance in math and code.

  • NotebookLM got Deep Research


Other news, in short

  • Poetiq's AI agent surpasses 50% on ARC-AGI-2, reaching superhuman performance at ~$50/task, half the cost of the previous SOTA. This suggests agent scaffolding may matter more than raw model capability for certain reasoning tasks.

  • Disney invested $1B into OpenAI + 3-year licensing for Sora to use Disney/Marvel/Pixar/Star Wars characters, a gigantic signal about IP + AI video.

  • OpenAI is silently testing its next-gen image backend that people are informally calling "Image 2", allegedly considered in the same frontier tier as Nano Banana Pro.

World Models

SimWorld

SimWorld is an open-ended, realistic simulator for autonomous agents in physical and social worlds. The researchers built a tiny economy in which different models participate in a market and are challenged to make money, for example through food delivery. Claude and Qwen did well by taking a very risky approach, while other models played a more risk-averse game. As a result, Claude and Qwen posted high returns but also a large standard deviation across runs.

It's also hilarious to see how OpenAI lost their contract because Qwen and DeepSeek underbid them. https://simworld.org/
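The risk/return pattern described for Claude and Qwen (high average returns with a wide spread) is the kind of thing you would summarize with a mean and a standard deviation per strategy. The numbers below are invented for illustration and are not SimWorld results.

```python
from statistics import mean, stdev

# Hypothetical per-run profits in simulated dollars; not SimWorld data.
risky_runs = [120, -40, 200, -10, 180]
cautious_runs = [30, 35, 28, 32, 30]

def summarize(profits):
    """Return (average profit, sample standard deviation) for a list of runs."""
    return mean(profits), stdev(profits)

risky_mean, risky_std = summarize(risky_runs)
cautious_mean, cautious_std = summarize(cautious_runs)

print(f"risky:    mean={risky_mean:.1f}, std={risky_std:.1f}")
print(f"cautious: mean={cautious_mean:.1f}, std={cautious_std:.1f}")
```

With these made-up runs the risky strategy earns more on average but with far more variance, which is why a single run says little about which approach is better.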

Videos and Podcasts

Ilya at Dwarkesh Podcast

We are back in the age of research.

Podcast: Sakana – Continuous Thought Machines (CTM)

A new episode dives deep into Sakana AI's Continuous Thought Machines, exploring the underlying science and engineering of CTM and its parallels to biological neural timing and reasoning. If you're interested in the intersection of neuroscience and advanced AI research, this gives strong background and accessible explanations.

The Thinking Game

A journey into the heart of DeepMind, capturing a team striving to unravel the mysteries of intelligence and life itself.

Jeff Dean (Google DeepMind, cofounder of Google Brain & TensorFlow) spoke at Stanford AI Club on the biggest shifts in AI: foundation models scaling, better hardware (TPUs), tool-using agents, multimodal models, and why responsible deployment and real-world feedback matter most.
