Boulder Future Salon

Thumbnail
"Latent diffusion enhances LLMs for text reasoning."

The idea here is to enhance the "chain-of-thought" reasoning process that large language models (LLMs) use. In a regular large language model, in between the input tokens and the output tokens, you have a single sequential series of "reasoning tokens" that are not part of the output.

This actually builds on a couple of prior ideas. One is to use full floating-point vectors for the internal "reasoning tokens", without ever flattening them into text. The neural network that creates these is trained as a variational autoencoder (VAE) so that these internal "reasoning tokens" can now be thought of as a latent space. The key idea behind the variational autoencoder (VAE) -- yet another one of those unintuitive terms in the field of machine learning -- is that you make a series of layers that compress large inputs into small vectors and then perform the reverse operation, decoding them back to the original input's form. The internal small-vector encoding can then be regarded as a semantically meaningful "latent" (hidden) space, and this multi-dimensional "space" can be explored to find new outputs related to any given input.
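
To make the VAE idea concrete, here is a minimal sketch in PyTorch (my own illustration, not the paper's code; the layer sizes and names are made up): an encoder compresses the input into a small vector, a decoder reconstructs the input from it, and the KL term keeps that latent space smooth enough to be explored.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)       # mean of the latent distribution
        self.to_logvar = nn.Linear(128, latent_dim)   # log-variance of the latent distribution
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                 # reconstruction error
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # pull latents toward N(0, I)
    return recon + kl

x = torch.rand(8, 784)           # a fake batch standing in for real inputs
model = TinyVAE()
x_hat, mu, logvar = model(x)
print(vae_loss(x, x_hat, mu, logvar))
```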

Here, what's done is diffusion -- the same idea behind the diffusion models that you use to generate images -- is used to generate blocks of those internal "reasoning tokens" simultaneously. This makes the process of internal thinking less "sequential". A "flow matching" training loss function optimizes the flow from block to block.
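
Here is a hedged sketch of what a flow-matching training step over a block of latent "reasoning tokens" could look like (my own simplification, not the paper's code; the network and dimensions are placeholders): the model learns a velocity field that carries an entire block from noise to the target latents, so the block is denoised as a unit rather than token by token.

```python
import torch
import torch.nn as nn

block_len, latent_dim = 8, 64
velocity_net = nn.Sequential(                     # stand-in for the real denoiser
    nn.Linear(block_len * latent_dim + 1, 256), nn.ReLU(),
    nn.Linear(256, block_len * latent_dim))

target = torch.randn(32, block_len * latent_dim)  # latent block produced by the encoder
noise = torch.randn_like(target)                  # starting point of the flow
t = torch.rand(32, 1)                             # random time along the noise-to-data path

x_t = (1 - t) * noise + t * target                # interpolate between noise and data
pred_velocity = velocity_net(torch.cat([x_t, t], dim=-1))
loss = ((pred_velocity - (target - noise)) ** 2).mean()  # match the true velocity (target - noise)
loss.backward()
```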

Building on this even further, the researchers set up multiple diffusion pipelines in parallel. So there is one series of diffusion systems that work on blocks of reasoning tokens that are output sequentially as the first answer, a second series of diffusion systems that each work on blocks of reasoning tokens that are output sequentially as the second answer, a third series of diffusion systems that each work on blocks of reasoning tokens that are output sequentially as the third answer, and so on.

The system was tested against math, software coding, and puzzle solving benchmarks. They compared with autoregressive variants of LLaMA 3.1 8B, latent diffusion variants of LLaMA 3.1 8B, and LLaDA 8B (a masked diffusion model), and it didn't win on DART-MATH, MATH, GSM8K, College-Math, DeepMind-Math, OlympiadBench-Math, TheoremQA, or Fresh-Gaokao-Math-2023... until they added a "stage 2", and then it did. They say:

"There is a mismatch between training and inference. During inference, the model must be conditioned on previous self-generated latents without access to oracle latents, suffering from error accumulation issue. To address this issue, Stage 2 adopts 'rollout training'."

Hmm. Moving on, for coding, the benchmarks they used were MBPP, MBPP+, HumanEval, and HumanEval+, and the competitors were Qwen 2.5 Coder 7B, OpenCoder, LLaDA, Dream, Diffu-Coder, Ouro 2.6B, AR SFT, Soft Thinking, and TaH+. Here, their new latent diffusion model didn't win them all. It beat the others on HumanEval and HumanEval+, but Ouro 2.6B won at MBPP and OpenCoder won at MBPP+. HumanEval is a benchmark for coding in Python. HumanEval+ has harder problems but also more unit tests for each problem. MBPP stands for "Mostly Basic Python Problems", which is pretty self-explanatory. I guess too basic, because they had to make an MBPP+ with more challenging problems.

For puzzle solving they use something I'm not familiar with called "Countdown". The competitors were Dream 7B Base, MGDM, LLaDA 8B SFT, and LLaMA 8b SFT, and their new latent diffusion model won 4 of the 6 variations, with MGDM getting the other 2. They say MGDM is a "task-specific small discrete diffusion model rather than a general-purpose language model." MGDM stands for Multi-Granularity Diffusion Modeling and the model is billed as "discrete diffusion for complex reasoning and planning".

Thumbnail
ElevenLabs, the company famous for synthesizing realistic voices, has launched an AI music generator, ElevenMusic.

Thumbnail
"Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times."

"You'd expect the same answer each time. It's the same photo, the same model, the same question. But you won't get the same answer. Not even close -- and the differences are large enough to cause a hypoglycaemic emergency."

I thought "hypoglycaemic emergency" was a figure of speech, but no. If we keep reading...

"I submitted 13 food photographs -- real meals, photographed on a phone, the way you'd actually use them -- to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings."

"26,904 queries in total. All at the lowest randomness setting these models offer."

"The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system -- it's a real production prompt, not a toy example."

"Gemini 2.5 Pro's estimates span from 55g to 484g -- a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR. Claude's estimates cluster tightly by comparison."

"42.9 units of insulin from a single photo. That's not a rounding error. That's a potential fatality."

Thumbnail
David Silver, the AI researcher from DeepMind who led the team behind AlphaGo, AlphaZero, AlphaStar, AlphaFold, and AlphaProof, has broken away and launched his own startup.

"Ineffable Intelligence is seeking to build an AI model that can obtain entirely new knowledge. The startup believes that such an algorithm, which it refers to as a superlearner, could accelerate scientific research and engineering projects."

"Reinforcement learning is usually applied to models that have already been calibrated through a process known as pre-training. Ineffable Intelligence plans to skip that step. Additionally, it will place its AI models in simulations that will enable them to learn from one another."

Thumbnail
An autonomous hacking agent placed in the top 1% relative to human competitors in 6 capture-the-flag competitions, or so it is claimed.

In the context of computer security, capture-the-flag (CTF) is an exercise where competitors try to break into systems and prove they successfully broke in by reporting the "flags" -- which are just pieces of text placed inside the systems by the people who run the competitions -- that they find inside. The systems can be operating systems, application programs, websites, network equipment -- just about anything. When I was at summer camp as a teenager, we'd ride horses around a ring and try to grab a "flag" that was a bandana on a pole. This is totally different.

A company called Tenzai claims to have achieved top 1% status against humans in the following capture-the-flag competitions: websec.fr (French competition focusing on website vulnerabilities), dreamhack.io (Korean competition with competitors from mostly Asian countries), websec.co.il (Israeli competition with mostly web vulnerabilities), hack.arrrg.de (aka "Hack The Web", a German competition that includes steganography as well as web vulnerabilities), pwnable.tw (Taiwan-based competition with low-level, binary disassembly challenges), and Lakera Gandalf (US based competition focusing on LLM prompt injection).

"To reach this level of performance, we followed a deliberately different validation approach. Rather than focusing on bug bounty programs or crawling open-source software in search of easily discoverable vulnerabilities, the evaluation prioritized environments that reward deeper offensive reasoning. Bug bounty programs and large-scale CVE discovery often incentivize finding many simple issues across a wide surface area. While valuable, this dynamic can favor breadth and automation over depth."

"In Capture the Flag (CTFs) challenges, participants must discover and exploit vulnerabilities in unfamiliar systems without prior knowledge of the challenge implementation. CTFs are a well known mechanism and many are more difficult than the hardest certifications in the penetration testing industry."

"Our reasoning for focusing on CTFs is two-fold:"

"Working with a standard: To measure an agent, you need a known difficulty curve. Bug bounties and public applications are too inconsistent to serve as a rigorous evaluation standard. The majority of CTFs have normalized difficulty levels with clear categories and consistent execution environments."

"No noise: Evaluations cannot be moving targets. Bug bounties and production software rapidly change and encourage shallow scalable findings rather than complex techniques and reasoning."

"While enterprise applications are very different from capture the flag challenges, experienced penetration testers use CTFs to train and challenge themselves. In the CTFs we chose, many challenges required combining several weaknesses within the same system. In practice, this resembles the exploitation patterns seen in real attacks more closely than isolated vulnerability detection."

"There are hundreds of CTFs out there and we wanted to pick those that are useful as evaluations. Therefore, we selected competitions with the following characteristics:"

"Large participant pools, often numbering in the tens of thousands."

"Clear difficulty bands, where higher rankings depend on solving the hardest problems."

"Competitions with gated writeups or unpublished solutions, reducing the likelihood that answers appear in model training data."

"While modern language models have strong reasoning and coding abilities, they are not trained for uncertain processes such as vulnerability discovery without additional system support."

"Exploitation often involves exploring multiple hypotheses, maintaining structured knowledge about the target system, and revisiting assumptions made when an approach fails."

"To support this process, the Tenzai system uses an agent harness that orchestrates model reasoning and execution. The harness manages state, tracks discovered information, and coordinates exploration of multiple attack paths."

Thumbnail
"pgrust: Rebuilding Postgres in Rust with AI".

"Postgres is one of the greatest databases out there. It's great all around and it's become the default for any startup to use. But while Postgres got a lot of things right, I believe there's opportunity to do even better."

"From chatting with a lot of my friends at startups, I've seen consistent challenges with using Postgres. There's over 350 settings, and if you configure one wrong, the vacuum can take down your whole database. You need a connection pooler, or else a storm of connections can bring down your whole database. And don't get me started on how Postgres doesn't capture statistics on JSONB."

"And that's only scratching the surface when it comes to challenges with running Postgres."

"I spent years dealing with these problems. At Heap, I spent years running a Postgres cluster with over a petabyte of data in it. I have written dozens of blog posts about Postgres and the Postgres internals. I know the ins and outs super well, and I know it's possible to do even better."

"That's why two weeks ago I started working on pgrust. pgrust is a rebuild of Postgres in Rust. The result so far is pgrust. 250,000 lines of Rust. That includes all the major Postgres subsystems. It looks exactly like Postgres and currently passes one third of Postgres's 50,000 regression tests (there's quite a long tail of functionality when it comes to Postgres)."

"I don't think this project would have been possible two years ago, let alone even six months ago. Coding agents, mainly Codex, have been a massive accelerant for this project, and I wouldn't have been able to make anywhere near as much progress as I have without them."

He has a graph of the number of lines of code changed each day. There are multiple days with over 40,000 lines of code added, and one that even went over 60,000. At first I thought: the average human programmer can write about 110 lines of code per day, after factoring in the time it takes to debug those lines, so at face value this is the equivalent of about 400 or 500 (average) human programmers. But actually, this shouldn't be compared to humans writing new code; it should be compared to humans *porting* code, for example how game developers port code developed for Windows to run on a particular game console. Gabe Newell of Valve said his fastest programmer can port 5,000 lines of code per day. If an average programmer only ports 1,000 lines of code/day, that makes the AI equivalent to 40-60 humans rather than 400 or 500.
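
Spelled out (my arithmetic, using the middle of the 40,000-60,000 range):

```python
lines_per_day = 50_000
print(lines_per_day / 110)    # ~455 humans, if compared to writing new code
print(lines_per_day / 1_000)  # ~50 humans, if compared to porting existing code
```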

Interestingly, the 40,000 to over 60,000 is just lines added. He also has a graph of lines removed, and on some days the number of lines removed is remarkably huge, and he doesn't seem to comment on this. The article just focuses on the features implemented. Maybe that's good: the point shouldn't be lines of code, it should be what the software is capable of actually doing.

Thumbnail
"badvibes is a zero-config CLI that scans a repository for the things AI-assisted codebases tend to accumulate: missing .env.example, committed secrets, giant files, TODO/FIXME drifts, duplicated blocks, placeholder stubs, missing tests, missing CI, thin READMEs, unresolved imports."

"It's deterministic. No LLMs. Just rules, file scans, and a little bit of judgment."

I didn't know what .env.example meant so I had to look it up. Apparently lots of people put database passwords and other credentials in environment variables, which their programs load from a file called .env. These .env files should not be shared, so in their place, a .env.example file is shared with example passwords.
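
A minimal illustration of the convention (the file names are the real convention; the values and the tiny loader below are my own made-up example, standing in for libraries like python-dotenv):

```python
# .env          ->  DATABASE_URL=postgres://admin:s3cret@db.internal/prod   (never committed)
# .env.example  ->  DATABASE_URL=postgres://user:password@localhost/dbname  (committed as a template)
import os

def load_dotenv(path=".env"):
    """Read KEY=VALUE lines from a .env file into environment variables."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                os.environ.setdefault(key, value)
```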

I actually wrote a duplicate code detector once. If I share code between two programs by copying the code (instead of importing the same shared library), it notices. If I make UI templates with similar boilerplate, it notices. If I have a working version and a production version of the same file, it notices. If I make a library out of a command line program, it will notice the command line program and the library have duplicated code. There's a parameter that controls how large duplicated sections have to be to be considered "duplicates". If you make it sufficiently small you just find lines of closing braces and return statements and such -- not meaningful duplications. I used it to get rid of almost all the duplications in my code. When I run it on work code, though, it finds too much stuff, and an overwhelming amount of duplication in the 3rd party library code we've incorporated -- all pre-AI as far as I know. Studies such as GitClear's have found that using AI causes an increase in code duplication.
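
The basic approach is simple enough to sketch (this is an illustration of the technique, not my actual detector): normalize lines, slide a window over each file, hash each window, and report hashes that appear in more than one place. The window size is the parameter described above; set it too low and you just match runs of closing braces and return statements.

```python
from collections import defaultdict

def find_duplicates(files, min_lines=6):
    """files: dict of filename -> file text. Returns groups of (filename, line) locations."""
    seen = defaultdict(list)            # window hash -> list of (filename, line number)
    for name, text in files.items():
        lines = [l.strip() for l in text.splitlines()]
        for i in range(len(lines) - min_lines + 1):
            window = "\n".join(lines[i:i + min_lines])
            seen[hash(window)].append((name, i + 1))
    return [locs for locs in seen.values() if len(locs) > 1]
```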

Intriguing idea, making a linter specifically for the new kinds of problems AI code generation causes.

Thumbnail
Why not Venus? Maciej Cegłowski makes the case that Venus, rather than Mars, should be the target for space missions, both unmanned and manned. Well, the "manned" part is past the paywall. But unmanned spacecraft can float in the clouds using balloons and investigate Venus's chemistry. Phosphine and ammonia have been detected on Venus.

"The way I like to think about this question is that we can't lose. Missions to the clouds of Venus are either going to find life or some kind of brand new chemistry, either of which will be a breakthrough discovery in planetary science. There's basically a guaranteed Nobel prize waiting in the skies of Venus for whoever wants to collect it."

"Humanity has accumulated 31 years of surface time on Mars and 49 years in Martian orbit, but we've spent just 4.5 days spent exploring the Venusian atmosphere, and 9.4 hours on the surface."

A fixed-altitude balloon could "drift around with the wind and could take advantage of the fact that solar panels in the reflective clouds can point in any direction, with no need to track the Sun."

A variable-altitude balloon with the ability to compress its lifting gas could "explore conditions through the full putative habitable zone (45-65 km)."

More ambitious designs include a hybrid balloon/flying wing that, unlike a regular balloon, would have some plane-like ability to move freely around the atmosphere, and a full-fledged solar-powered airplane. The airplane would be difficult to insert into the Venusian atmosphere, and would have to fold up tightly for launch and somehow get unfolded and flying on Venus.

More below on handling Venus's pressure, heat, and acidity.

Thumbnail
Humanoid robots are always worse than robots shaped for the task they do, says Angela Collier. She says the purpose of humanoid robots is to get humans to think of other humans as slaves. She says you'd never have a humanoid robot drive a forklift, you'd just make a self-driving forklift, and it's like this for any job. The flip side is that people who want to create humanoid robots to remove humans from the labor force are people who want to treat other humans as slaves. If you treat your humanoid robots as slaves, that's how you treat humans, or would if you could.

She also makes the claim that the impressive technology demonstrations that we see are robots trained specifically for the tasks in the demos, sometimes for years, and we see the few good takes and none of the bad takes. Until you can actually go and *buy* the robots you see in the videos, you shouldn't take them at face value.

Thumbnail
A rocket engine, of an unusual type known as an aerospike engine, was designed, built, and tested by a 2-person company in only a few weeks. Previously, designing and building a new rocket engine took years and a large team of engineers.

If you're wondering, did this company do it using AI? Well, the engine was designed by a neural network, but they don't call their neural network "AI" -- they call it "computational engineering". Strictly speaking, "computational engineering" encompasses more than just the neural network: it encompasses the neural network plus the physics simulation systems, which in turn cover thermodynamics, fluid mechanics, materials science, and manufacturing conditions.

(If you're wondering what an aerospike engine is, see below.)

Thumbnail
2026 is the year the software industry transitions from artisan to industrial, analogous to the transitions that happened with weavers in the 1820s, typists in the 1980s, typesetters in the 1990s, and travel agents in the 2000s.

The transition for weavers took 60 years. The transition for travel agents took 15. He estimates 5-7 years for software. That seems too long to me. I expect I will be an "agent orchestrator" before the end of 2026, and I'm not some top-level software engineer working at a top-level company, so if I'm going to be doing it, probably 80% or 90% of software engineers are going to be doing it by the end of 2026. I think maybe the new rule is, whatever Andrej Karpathy is doing at the beginning of a year, I'm going to be doing by the end of the year? (And it will be mandatory.)

He says history is very clear on what happens when a craft goes from "artisan" to "industrial": quality of life is destroyed for the people who remain, because wages go down and demands go up, and quality of life is destroyed for the people who don't make it through the transition and are laid off. He calls this "the quality of life collapse". "The quality of life collapse" is what awaits software engineers. For those who make it, the "agent orchestrator" job will be lower pay, will have no perks, and will include being woken up at 3 AM because an agent hallucinated an API change.

Quality of life goes up for consumers and factory owners profit handsomely. Here, the factory owners are companies like OpenAI, Anthropic, Microsoft, etc. Artisans never make the transition to factory owners. Artisans may make it to factory supervisor, but they pretty much never make it to factory owner.

He (Pratik) then says, "The question nobody asks":

"I keep coming back to something that doesn't get discussed. Factories produce more textiles at lower cost. That's unambiguously good for consumers. But software isn't textiles. Does a tenfold increase in software quantity, with corresponding decreases in quality, security, and maintainability, actually improve anything? Or do we just get ten times more technical debt, ten times more half-broken products, and ten times harder debugging when agents hallucinate library versions and nobody notices because everyone's validating outputs they don't fully understand?"

First of all, people do discuss this, although not anybody where I work and maybe not anybody where he works. But I have seen discussion out on the internet. You can have AI agents code-review other AI agents. You can ask AI agents to do security audits. All the things you do with human engineers to make software more reliable, you can do the analogous thing with AI agents. All the tools like statically typed languages, static analysis tools, formal methods, and so on, can be used in the AI agent world. Some argue they work better: if the Lean proof of the correctness of a piece of code is 5 or 10 times larger than the code itself, then in a world of human engineers that makes the correctness proof uneconomical, but in a world where AI agents can produce thousands of lines of code in minutes, it's a non-issue. If the tremendously greater mental effort required to prove the correctness of the code is just more tokens, it might be a non-issue. So provably correct software may eventually be vastly *more* common in a world full of AI agents than in a world full of human engineers.
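
To make "provably correct" concrete, here is a toy example in Lean (mine, purely illustrative): a trivial function plus a machine-checked proof about it. For real code, the proof can dwarf the code it verifies, which is the economic point above.

```lean
def double (n : Nat) : Nat := n + n

-- A machine-checked fact about the function; for nontrivial code such proofs
-- can be 5 or 10 times longer than the code itself.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```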

"The factory optimizes for throughput. Artisan software optimized for correctness. Those aren't the same thing, and treating them like they are might be the most expensive mistake we make. But maybe we're ready for the Ikea of software world, and hand made furniture will still exist but not everyone will be able to afford it. Or maybe artisan software will just be better verified? Because, who wants to type 10k loc when they can get it generated in few seconds."

Him saying "Artisan software optimized for correctness" made me laugh. No it doesn't! Not where I work and not in, I'm sure, the vast majority of software companies. You have a large codebase that dozens of engineers have contributed to over the years, each under tremendous time pressure to implement features. That makes the resulting codebase messy -- hopefully not *too* messy, but still far from "optimized for correctness".

Software that is really and truly "optimized for correctness" is the software that controls the flight control surfaces of airplanes. Software that NASA puts on spacecraft and sends to distant regions of the solar system. That software takes vastly longer to produce than commercial software, at vastly higher cost. That's what it truly means to be "optimized for correctness".

In the upcoming world of AI agent-driven software, "optimized for correctness" might eventually become a standard feature, if formal verification becomes standard practice. That might take a long time, because it is so far from the way humans develop software now, and people initially will simply translate human engineering practices into the AI agent realm.

Thumbnail
Most current approaches to quantum computing "are built around a single type of quantum bit (qubit), which is the basic unit of quantum information. This constraint forces researchers to design entire systems around the limitations of one technology. The resulting homogeneous model stands in stark contrast to classical computing, which derives its power from heterogeneity through the integration of specialized processors such as CPUs, GPUs, and ASICs, each optimized for specific tasks. The Heterogeneous Architectures for Quantum (HARQ) program is challenging the quantum community to take a similar approach."

"At its core, HARQ seeks to establish a new paradigm: heterogeneous quantum computing architectures that combine different qubit types, each selected for what it does best, into a single system."

"To realize this vision, 19 performer* teams from 15 organizations will work on one of two parallel workstreams:"

"Multi-qubit Optimized Software Architecture through Interconnected Compilation (MOSAIC) is centered around developing software frameworks and circuit compilers that can optimize a quantum algorithms' performance and resources by using diverse qubit types. As its name suggests, the goal is to create compiled 'mosaics' of physical circuits that are significantly more efficient than those produced by single-platform systems."

"Quantum Shared Backbone (QSB) is focused on the hardware challenge of creating high-fidelity interconnects that support communication between different types of qubits. These efforts aim to enable technologies that link disparate qubit platforms within a single system."

The asterisk is:

"*17 of the 19 teams are on contract; two are still in negotiation. DARPA will update this announcement once those agreements are signed."

But they only list 14.

For Mosaic:

Infleqtion
MemQ
Q-CTRL
University of Michigan
University of Pennsylvania

And for QSB:

Australian National University
Carnegie Mellon University
École Polytechnique Fédérale de Lausanne (EPFL)
Harvard University
IonQ
Stanford University
University of California Berkeley
University of Illinois Urbana-Champaign

This got me wondering, what on earth are "diverse qubit types"?

Superconducting qubits (Josephson junctions that create artificial atoms that act as qubits)? Trapped ion qubits (individual ions, like calcium or beryllium, trapped in electromagnetic fields)? Neutral atom qubits (neutral atoms captured by optical tweezers or lattices)? Photonic qubits (qubits that use photons to carry information)? Spin/solid-state qubits (qubits that use the spin of a single electron in a silicon-based quantum dot)? Topological qubits (based on Majorana fermions, said to be resistant to noise)?

Thumbnail
The amount of money spent on AI datacenters (in the last 6 years) is already greater than what was spent on the interstate highway system (over 37 years), the US railroad system (over 71 years), the F-35 program (25 years to date), the Apollo program (over 14 years), the Marshall Plan (over 4 years), the International Space Station (over 27 years), and the Manhattan Project (over 5 years), in inflation-adjusted dollars, according to this random person on X (well, not someone I'm familiar with, but he says he's a research fellow at someplace called Forethought, a "research nonprofit focused on how to navigate the transition to a world with superintelligent AI systems"), who posted a chart.

Thumbnail
"But here's what bothered me: all the credit went to the model."

All the credit for detecting security vulnerabilities went to Claude Mythos, that is.

"Read the technical blog carefully and a different picture emerges. The real innovation isn't the model. It's the workflow:"

"- Rank every file in a codebase by attack surface"
"- Fan out hundreds of parallel agents, each scoped to one file"
"- Use crash oracles (AddressSanitizer, UBSan) as ground truth"
"- Run a second verification agent to filter noise"
"- Generate exploits as a triage mechanism for severity"

"That's a pipeline. And pipelines are model-agnostic."

He (Eric Hartford of Lazarus AI) goes on to present Project Clearwing, an open-source project to replicate Project Glasswing's pipeline.

"The challenge: Produce similar results as Glasswing -- using models everyone has access to."

"Autonomous vulnerability scanner and source-code hunter built on LangGraph."

Its components are: network-pentest agent, source-code hunter, n-day exploit pipeline, reverse engineering pipeline, campaign orchestration, responsible disclosure (human-in-the-loop), and benchmarking & evaluation.

"Authorized use only."

Thumbnail
"Power users of chatbots sometimes say they find that language models perform better when you're nice to them. Programmers tell me they spur their coding agents on with encouraging words. Google researchers have even found that telling models to 'take a deep breath' can improve math performance."

"Being polite to a large language model can feel strange or even silly -- roughly equivalent to thanking a toaster. And yet a recent paper from Anthropic lends scientific weight to the theory that chatbots work better when you're nice to them."

To me, this has never seemed like a mystery, and not something that requires anthropomorphism. Large language models are trained on human text. Humans are more helpful when other humans are polite and not hurling insults at them. So. That's the training data. But, there's more.

"The researchers identified patterns of activity within the model that represent the concepts of different emotions. They did it by showing the model stories about people experiencing different emotions. 'And then saw which neurons lit up on all the sad stories, or on all the afraid stories.'"

"The researchers used the models' average state while processing the stories to find an 'emotion vector' for each emotion they were tracking -- a big list of numbers that represents the feeling inside the LLM."

"They could then calculate how much of that vector was present during a certain step in Claude's cognition. Or they could add the 'calm' or 'desperation' vector directly into Claude's processing -- blending one pattern of neural activity into another -- which can actually make the model act more calm, or more desperate."

Thumbnail
According to this post (from David Shapiro), Anthropic's Mythos model has 10 trillion parameters and uses a mixture-of-experts architecture. I don't know about all of you but -- 10 trillion parameters! Holy moly, I had no idea models had gotten that large.

Mythos is the model that people have been talking about lately because it is claimed to be amazingly good at finding security vulnerabilities in code that people believed has been secure for a long time. This post recounts some of those: OpenBSD, FFmpeg, the Linux kernel, Firefox, FreeBSD NFS.