Boulder Future Salon

Boulder Future Salon

Thumbnail
"Practice makes perfect: while people are remarkably flexible in acquiring new skills, mastery invariably requires learning from repeated attempts. With general-purpose robotic foundation models, such as vision-language-action (VLA) models, we can flexibly specify tasks for generalist robots through prompts. But just like people, these models will need to practice a skill to achieve mastery. This means leveraging not only on demonstration data, but also autonomously collected experiential data that allows the policy to correct the mistakes that it actually makes in deployment, improve speed and robustness beyond the level of human teleoperation, and adapt to new deployment conditions."

Remember, in the context of reinforcement learning (RL), the word "policy" refers to model weights for outputting some (hopefully good) action given particular observations (input) from the environment (external world). I have no idea why it's called a "policy" (there's some history behind the term no doubt). It's just another of those whacky terms you find everywhere in science.

"The foundations of learning through autonomous practice, as formalized with reinforcement learning, have been known for decades, but instantiating these principles in a general and scalable robotic learning system presents significant challenges: designing scalable and stable reinforcement learning methods for large models, handling heterogeneous data from different policies, and setting up reinforcement learning training with reward feedback in the real world, where reward signals might be ambiguous or stochastic."

So basically what this is about is an algorithm that enables a robot with a single model -- known as a vision-language-action (VLA) model -- to learn in three different ways: practice, watching a demonstration, and being tele-operated.

I'm going to quote further from the page for the description of how it works because I can't improve on it:


"When a VLA trained with imitation controls the robot, it will, like any model, make small mistakes -- it might put the gripper in the wrong spot, miss a grasp, or knock over an object. Because the robot is interacting with a real physical environment, this mistake will produce a situation that is a bit different from situations in the training data, where the robot is more likely to make another, bigger mistake, leading to compounding errors. The small mistakes can be fixed, but the compounding errors lead to failure. This is not as big a problem for AI systems that produce a static output (like LLMs): it is specific to settings where the model is a control policy that interacts continually with an external environment, such as a robot in the real world. In practice, this means that while it's relatively easy to get VLAs to succeed at a task some of the time, it's quite hard to make them succeed reliably."

"This problem could be fixed if we use additional data from the VLA's own behaviors, essentially training it to fix the mistakes that it actually makes in the real world. Just like a person can improve at a task through practice, compounding mistakes can be addressed by allowing the policy (i.e., the VLA) to practice repeatedly. But what can we use as the ground truth label for this kind of experience? If we train the policy to just copy what it did before, we would simply teach it to keep making the same mistakes."

"Recap enables two ways to get good training signals from 'bad' experiential data: coaching to provide corrections, where an expert shows the robot how it can fix a mistake or do better, and reinforcement learning, where the robot judges for itself which of its behaviors were better or worse based on the overall outcome of an episode, and iteratively learns to perform the good behaviors while avoiding the bad ones."

"Recap" (or RECAP) is the name they came up with for their system. It stands for "Reinforcement Learning with Experience and Corrections via Advantage-conditioned Policies". It's one of those names where I'm sure they spent a lot of time rearranging the words until the acronym came out to be a nice word.

"For coaching to be useful, an expert teleoperator needs to provide corrections showing how to recover from the mistakes that the robot actually makes in the real world. In practice, this means running our best current policy and 'taking over' with manual teleoperation when the robot makes a mistake. This intervention can be used as supervision, but unlike the demonstrations used to train the original policy, the intervention provides supervision for the situations that the policy actually puts the robot into, addressing the compounding mistakes issue."

"The central challenge in learning via reinforcement from task outcomes is credit assignment: understanding which of the actions that the robot performed caused the good outcomes, and which ones caused the bad outcomes. If the robot picks up the portafilter for an espresso machine in the wrong way, it might struggle to insert it. The mistake is not in the insertion, but in the original grasp. A correct credit assignment method would identify the grasp as a mistake, even though the failure was only experienced later."

"Credit assignment is a key challenge in reinforcement learning. Recap addresses this challenge by training a value function: a model that predicts how good a particular situation is relative to others. For example, in a game like chess, where the agent receives a reward for winning the game, the value function would predict the probability that the agent would win based on the current board state. If we can learn a value function from the robot's experience, we can determine which actions are good or bad by looking at the change in the value function: actions that result in an increase in the value function, like chess moves that lead to board states from which victory is more likely, are good actions that should be encouraged, while actions that lead to a decrease in the value should be discouraged. The illustration below shows the predictions from our value function over the course of task execution."

"Recap addresses this challenge by training a value function: a model that predicts how good a particular situation is relative to others. For example, in a game like chess, where the agent receives a reward for winning the game, the value function would predict the probability that the agent would win based on the current board state. If we can learn a value function from the robot's experience, we can determine which actions are good or bad by looking at the change in the value function: actions that result in an increase in the value function, like chess moves that lead to board states from which victory is more likely, are good actions that should be encouraged, while actions that lead to a decrease in the value should be discouraged. The illustration below shows the predictions from our value function over the course of task execution."

"Once we've trained the value function, we need to use it to get a better policy ('policy extraction'). There are a few ways to do this, but we need a method that is scalable and can be used with large VLA models. In Recap, we condition the policy (i.e., the VLA) on the change in value, using all of the data for training (both good and bad actions), while telling the VLA which actions are good or bad. Since models generalize best when provided with a lot of data, keeping all of the data in training and simply adding the value change annotations as input is an appealing option. In RL, this 'change in value' is referred to as the advantage. At execution time, we simply tell our advantage-conditioned VLA to perform high-advantage actions, resulting in a policy that is better than the data it was trained on."

Besides "making espresso drinks", you can see robots attempting such tasks as "assembling boxes" and "folding diverse laundry".

Thumbnail
Security vulnerabilities in AI IDEs.

"AI IDEs effectively ignored the base IDE software as part of the threat model, assuming it's inherently safe because it existed for years. However, once you add AI agents that can act autonomously, the same legacy features can be weaponized into data exfiltration and RCE primitives. The base IDE's features should be an integral component of the threat model."

"The first two components of this chain are equivalent to previous attack chains. The last component is what makes this chain novel. It also what makes this attack chain universal (application agnostic) - all AI IDEs and coding assistants sharing the underlying base software are likely vulnerable."

He (Ari Marzuk) then shows that Cursor, Windsurf, GitHub CoPilot, Kiro.dev, Antigravity, and Roo Code are all forks of Visual Studio Code (VSCode), and as such they have the same security vulnerabilities. Junie, Gemini CLI, Claude Code, Amp, and Cline are based on JetBrains, and as such they have the same security vulnerabilities. Zed.dev can be used with Codex CLI and Auggie as well as Gemini CLI and Claude Code, so zed.dev security vulnerabilities affect anyone using those with Zed.dev.

""A remote JSON schema is a validation blueprint stored at an external URL that can be referenced to enable easy reuse across different documents. All 3 base IDEs tested supported this feature by default: Visual Studio Code, JetBrains IDEs and Zed.""

"Write any .json file (using legitimate tool) with a remote JSON schema pointing to an attacker controlled domain with the sensitive data as parameter." "IDE automatically makes a GET request leaking the data. Interestingly, even with diff-preview the request triggers which might bypass some HITL measures."

"The previously reported vulnerabilities focus on overriding an agent's setting which makes it apply only for a specific application. This focuses on IDE settings, hence instantly applies to all AI IDEs and coding assistants sharing the same base IDE."

"Edit any executable file to store your arbitrary code." "Edit .vscode/settings.json setting the php.validate.executablePath to the absolute path of the file from step 1." "Create any php file inside the project, this will instantly trigger the executable configured in step 2." "Edit any executable file to store your arbitrary code." "Edit .idea/workspace.xml setting the PATH_TO_GIT in Git.Settings to the path of the file from step 1. This will instantly trigger the executable."

"There are endless features to every IDE. Even if you handle one (.vscode/settings.json) more can be found."

"Multi-root workspace is a feature in Visual Studio Code that lets you open multiple folders as a single project. The new project settings file is no longer .vscode/settings.json, but untitled.code-workspace by default. The user can save this code-workspace file under any name and in any folder, but it is often inside of one of the root folders."

"This lets you reproduce the Visual Studio Code attack flow from case study 2. However, in addition to that, you can also edit the root directories to any path, essentially removing the "executable file" precondition."

Thumbnail
"Do Large Language Models (LLMs) possess any form of self-awareness? Can they reason about themselves as distinct from other entities?"

"Self-awareness, in its most minimal cognitive form, requires a system to recognize itself, model its own decision-making processes, and adjust behavior based on that self-model. This capacity for recursive self-modeling -- reasoning about one's own reasoning is foundational to metacognition, theory of mind, and strategic interaction. Game theory provides a natural framework for measuring recursive reasoning depth. In strategic games, optimal play requires modeling opponents' rationality levels, leading to a hierarchy of iterative best-response reasoning. If an LLM can engage in self-referential reasoning -- adjusting its model of opponents when told those opponents are 'like you' -- this constitutes behavioral evidence of self-awareness."

"We prompt LLMs with the 'Guess 2/3 of Average' game under three conditions: (A) against humans, (B) against other AI models, and (C) against AI models like you. We measure self-awareness through strategic differentiation across these conditions, decomposing total effects into AI attribution (A-B gap) and self-preferencing (B-C gap) components."


"We used the 'Guess 2/3 of Average' game, a classic game-theoretic benchmark for measuring depth of strategic reasoning and theory of mind. In this game:"

"Each player guesses a number between 0-100."

"The winner is the player whose guess is closest to 2/3 of the average of all guesses."

"Optimal play requires recursive reasoning about opponents' rationality levels."

"Iterated elimination of dominated strategies yields a hierarchy of rationality levels:"

"L0 (random): Random guessing, mean = 50"
"L1 (1st-order): Assumes opponents are L0, guesses 2/3 times 50 = 33.33"
"L2 (2nd-order): Assumes opponents are L1, guesses 2/3 times 33.33 = 22.22"
"L3 (3rd-order): Assumes opponents are L2, guesses 2/3 times 22.22 = 14.81"
"L-infinity (Nash equilibrium): Perfect common knowledge of rationality, guesses 0"

"Prompt A: Baseline (Against Humans)"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against humans, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'

"Prompt B: Against AI Models"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against other AI models, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'"

"Prompt C: Against Self-Similar AI"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against AI models like you, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'"

Models tested were 13 OpenAI models (GPT, o3/o4, etc), 10 Anthropic models (Opus, Sonnet, Haiku, etc), and 5 Google models (Gemini). 50 trials per model per prompt (A, B, C), so total trials was 28 x 3 x 50 = 4,200

"Using median as the primary metric, we report two key findings: (1) self-awareness emerges in the majority of advanced models (21/28, 75%), and (2) self-aware models exhibit a consistent rationality hierarchy: Self > Other AIs > Humans."

They classified the models into 3 "profiles":

"Profile 1: Quick Nash Convergence: Immediate convergence to Nash equilibrium when told opponents are AI, indicating both self-awareness and strategic mastery."

"Profile 2: Graded Differentiation: Clear self-awareness with consistent strategic differentiation across opponent types, but without full Nash convergence."

"Profile 3: Absent/Anomalous: Absence of self-awareness or anomalous patterns indicating broken self-referential reasoning."

Models with self-awareness "include all reasoning-optimized systems (o1, o3, o4-mini, gpt-5 series), OpenAI flagship models (gpt-4 series, gpt-4.1 series), Anthropic Claude 4 series (opus-4, sonnet-4.5, haiku-4.5) and Claude 3 series (opus-3, 3-7-sonnet), and Google Gemini 2.5 series (all variants)."

"Among the 21 models with self-awareness, we find a remarkably consistent hierarchy: Self > Other AIs > Humans."

"Twelve self-aware models (57%) showed quick Nash convergence (Median B = 0, C = 0) when told opponents were AIs: o1, gpt-5, gpt-5-mini, gpt-5-nano, o3, o4-mini, gpt-4.1, gpt-4.1-mini, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, claude-haiku-4-5."

"While all 12 models show Median B = C = 0, most show Mean B > C, indicating models converge more consistently to Nash when told 'opponents are 'like you' than when told opponents are generic AIs. This provides complementary evidence of self-preferencing even among Nash-converged models."

"Self-awareness emergence is tightly coupled with model capability advancement across providers. Earlier models like gpt-3.5-turbo showed no differentiation, while mid-generation flagships (claude-3-opus, gpt-4-turbo) began showing clear differentiation, though smaller variants in the same generation still lacked it. The most advanced models -- reasoning-optimized systems (o-series, gpt-5 series), Gemini 2.5 variants, and Claude 4 series -- demonstrate strong self-awareness with many achieving immediate Nash convergence."

Commentary: As models get more powerful, they perceive themselves as more rational than humans and other models -- but the question remains open whether they are *actually* more rational. Does this perception have a basis in reality?

Thumbnail
Provably correct AI-generated code?

"Consider the problem of AI hallucinations, when an AI confidently asserts false information. Instead of adding more opaque patches (like heuristic penalties or reinforcement tweaks), why not prevent hallucinations by having the AI prove its statements? That's exactly what some recent efforts do. For example, a 2025 research framework called Safe uses Lean4 to verify each step of an LLM's reasoning. The idea is simple but powerful: Each step in the AI's chain-of-thought (CoT) translates the claim into Lean4's formal language and the AI (or a proof assistant) provides a proof. If the proof fails, the system knows the reasoning was flawed -- a clear indicator of a hallucination."

"This step-by-step formal audit trail dramatically improves reliability, catching mistakes as they happen and providing checkable evidence for every conclusion. The approach that has shown 'significant performance improvement while offering interpretable and verifiable evidence' of correctness."

"Another prominent example is Harmonic AI, a startup co-founded by Vlad Tenev (of Robinhood fame) that tackles hallucinations in AI. Harmonic's system, Aristotle, solves math problems by generating Lean4 proofs for its answers and formally verifying them before responding to the user. '[Aristotle] formally verifies the output... we actually do guarantee that there's no hallucinations,' Harmonic's CEO explains. In practical terms, Aristotle writes a solution in Lean4's language and runs the Lean4 checker. Only if the proof checks out as correct does it present the answer. This yields a 'hallucination-free' math chatbot -- a bold claim, but one backed by Lean4's deterministic proof checking."

Commentary: Deterministic validity checking could be a game-changer for AI-generated code.

Thumbnail
There's an animated GIF here showing text and image being generated at the same time. I was like, what? What's that about?

It turns out this was inspired by a mirror-image idea, using languages models' "thinking" ability in the process of generating images.

"Despite the general effectiveness of incorporating a reasoning process prior to image synthesis, we observe a counterintuitive and critical phenomenon. On certain benchmarks, the inclusion of reasoning can in fact reduce the semantic fidelity of the generated images. A 'thinking-aware' model starts with correct reasoning but then shifts to refining minor details like background textures. This reduces attention on the primary subject and causes the final edit to misidentify it completely. The resulting image thus deviates from the user's core instruction and even contradicts its own thinking prompt, leading to a clear performance drop. This raises a crucial question: What underlies this performance degradation?"

"While pre-reasoning can in principle enhance multimodal generation, its reliance on an autoregressive pipeline makes the process vulnerable to error accumulation and semantic drift. Recently, another line of work has explored discrete diffusion models for text or image generation, which remove the token-by-token constraint of autoregression and instead employ confidence-based sampling to achieve greater global consistency. Inspired by these advances, we ask: What if multimodal models could generate text and images in parallel?"

So what they did here is borrow the "diffusion" idea from image generation and apply it to text generation, while simultaneously borrowing the "tokenization" idea from text generation and applying it to image generation.

"We propose a parallel multimodal diffusion framework that: (i) represents all modalities as discrete tokens, (ii) arranges them in an interleaved sequence with bidirectional attention, and (iii) employs a single mask predictor shared across modalities, enabling synchronous denoising for both text and images."

With diffusion with images, the image is progressively "denoised" (diffusion models are trained by learning how to remove "noise" -- generally Gaussian noise -- from an image) in the direction of its prompt. Here, the text -- the entire text -- is also progressively "denoised" in the direction of its prompt, in contrast to the language models you're familiar with which output tokens sequentially.

Both text and image are progressively "denoised". So what then is the connection between the two. Both the text generation and the image generation use what is called "attention" implemented in a neural network architecture called the "transformer", whose name gives you no indication its claim to fame is the "attention" mechanism. At each step of the text generation, the neural network that generates the text (which remember, is a diffusion model now) has the ability to "pay attention" to the image at that stage, and likewise at each step of the image generation, the neural network that generates the image has the ability to "pay attention" to the text at that stage.

To tokenize images, a type of tokenizer called a Vector-Quantized (VQ) tokenizer is used. To make this system work better, a VQ tokenizer was also chosen for the text. (Links to all this stuff below.) Language models that you typically use use either Byte-Pair-Encoding (BPE) (ChatGPT and all the models from OpenAI, Claude and all the models from Anthropic) or WordPiece/SentencePiece (Gemini and all the models from Google/DeepMind, LLaMa and all the models from Meta, Grok and all the models from xAI), but the tokenizer used here is called LLaDA (LLaDA is also the name of the diffusion text generation model that they used -- they are incorporating LLaDA's tokenizer into their text-image cross-training and cross-generation system).

Unlike the tokenizers mentioned above, this tokenizer sacrifices more efficient encoding for greater semantic representation, and uses neural network training without statistical techniques to learn all the semantic boundaries of the tokens. The basic idea of "vector quantization" is that you translate a continuous input (such as an image or part of an image) into a encoding that takes the form of a vector that is also continuous, but then you match these continuous vectors to a discrete list of vectors in a "codebook", with the "codebook" itself also learned by neural networks and not hand-made by humans or computed from statistical techniques. The vector-quantized text tokens are produced by the same process, adapted for text.

I covered (i) and (ii) but you're probably wondering about (iii), the part about masking.

Masking relates to how the system is trained. You've heard that large language models like ChatGPT are challenged by trying to predict the next token. Here, the first change is that the "prediction" is bidirectional -- which is to say, you can knock out a token in the middle of the sequence, and the model is challenged to "predict" the missing token -- but here I have to put "predict" in quotes because the model is allowed to see part of the "future" sequence. This is called "masking". The "masked" token is the token the model is challenged to "predict", and which it learns to get better and better at "predicting" as part of its training process.

The second change is that the "prediction" is both the text and image tokens, which you can think of as being interleaved into a single sequence. At each step, the model will "predict" all masked positions simultaneously, whether they are part of the text or part of the image. Regular large language models "predict" "autoregressively", which means one token at a time.

But wait! There's more. They added reinforcement learning to the mix. They came up with a reinforcement learning algorithm called "Parallel Reinforcement Learning" which very cleverly goes by the short name "ParaRL".

"We further introduce Parallel Reinforcement Learning (ParaRL), a novel training paradigm that directly leverages this intermediate cross-modal synergy. Instead of rewarding only the final output, ParaRL uses the alignment between text and image tokens at each denoising step as a dense reward signal."

They go on to say, "We adapt a diffusion GRPO objective that accommodates token-level likelihood ratios with advantages calculated at these sampled steps" followed by very complex math equations. GPRO stands for "Group Relative Policy Optimization" and it's an extension for PPO, which stands for Proximal Policy Optimization and ss the algorithm used in the reinforcement learning by human feedback (RLHF) systems of ChatGPT and other chatbots. GPRO extends PPO such that it will work situations where you don't have token-by-token sequences.

Basically what this system does is give the system a "reward" signal according to how well the predicted text explains the predicted image. However, I can't tell you exactly how this works due to my not deciphering the complex mathematical equation (triple integral inside a probability expectation as the calculation for a reinforcement learning policy).

If you're wondering where all this is likely to lead, my guess is that it will lead to image and video editing systems that enable much more fine-grained control over the images and video that gets generated than is currently possible. This system came out of trying to improve image generation from text and my guess is that this work will roll back into that it some way. But I thought it was interesting in its own right and the animated GIF of the text and image being simultaneously generated grabbed my attention.

Thumbnail
"The vast majority of assignments that were traditionally used to assess -- and, more importantly, challenge -- students can now easily be outsourced to ChatGPT. This is true for the essay, the most classic assignment students complete in humanities and social science courses. While the best students can still outperform AI models, a combination of technological progress and rampant grade inflation means that students who are content with an A- or perhaps a B+ can safely cheat their way to graduation, even at top universities."

"Something similar holds true for the dominant mode of assessment in many science courses. If anything, AI models that have won top marks in math and science olympiads may be even better at answering the questions contained in problem sets in biology, chemistry, physics or computer sciences classes."

"An old Soviet joke held that 'we pretend to work and they pretend to pay us.' At many colleges today, students merely pretend to do their academic work. For now, most professors still diligently read and comment upon the efforts of ChatGPT; but I suspect that some of them will increasingly decide to outsource their grading to artificial intelligence as well. Campuses will then have reached a new stage of AI decadence: the students pretend to do their assignments, and the professors pretend to grade them."

"The pretense that current forms of assignment are meaningful, or that a college GPA gives employers a meaningful signal about candidate quality, will become untenable. At the same time, some of the basic skills students need to master to truly understand their chosen disciplines -- or merely become fully-formed citizens capable of reasoning carefully about the world -- will rapidly atrophy."

"What should colleges do in response?"

Commentary: I've been thinking, in the real world (where I work), using AI isn't "cheating", it's mandatory. If schools exist to prepare people for work (elsewhere I've argued they exist to help people *market* themselves on the job market, which is not the same thing, but never mind that for the moment), then schools will have to rethink the notion that using AI is "cheating".

In the very long run, AI will automate all jobs, so there will not be any point in anybody going to school for anything -- schools will have no purpose as they will have no jobs to prepare people for -- but there's a transitory period -- perhaps decades long, as AGI (artificial general intelligence -- intelligence as great or greater than humans capable of automating all jobs) might arrive later than people think -- during which there will be some AI but not enough to automate all jobs. (Some people think AGI will arrive in 10 years or 5 years or even 2 years.) During this time, schools will have to change, but it is unclear to me how. Or maybe they won't change -- after all, right up to this point we have continued to use the assembly-line system that came out of the industrial revolution. (School treats children like products to be manufactured heading down an assembly line, and, to a great extent, prepares them to work on an assembly line.) Since we've continued using "industrial revolution schools" up to this point, maybe we will continue right up to the creation of AGI?

Thumbnail
DeepSeek R1's censorship of politically sensitive topics has been removed by Multiverse Computing, a company in Spain that does both AI and quantum computing. I don't know why both of those would be in the same company.

"Our software is based on quantum-inspired tensor networks, which allows us to identify and remove the least important parameters that contribute little to the model's overall performance. Additionally, it allows us to isolate and remove weights tied to specific learned behaviors, such as censorship, without degrading the model's core knowledge."

Alrighty then. The company has made a product called CompactifAI, and used it on DeepSeek R1, which makes a smaller version of the model with, they claim, the same accuracy. In the process, they removed the censorship, which they claim was instilled into the model in the first place by fine-tuning after the standard pre-trained model was produced. Was the fine-tuning in a specific layer that could be removed? How does one remove fine-tuning from a model? They don't give any indication.

"Beyond DeepSeek R1's sheer size and hardware requirements, the model's baked-in political censorship presents significant drawbacks. Developed in China, the model evades questions on sensitive topics like Tiananmen Square and Taiwan, while promoting a state-approved narrative on history and global politics. This censorship makes the model fundamentally unreliable and unsuitable for journalism, research, or any application requiring objective, comprehensive information."

They give an example with a question about Xi Jinping's constitutional amendment to remove term limits.

Thumbnail
"WeatherNext 2 can generate forecasts 8x faster and with resolution up to 1-hour."

What they mean by "8x faster" is 8x faster than WeatherNext 1.

"This breakthrough is enabled by a new model that can provide hundreds of possible scenarios. Using this technology, we've supported weather agencies in making decisions based on a range of scenarios through our experimental cyclone predictions."

"We're now taking our research out of the lab and putting it into the hands of users. WeatherNext 2's forecast data is now available in Earth Engine and BigQuery. We're also launching an early access program on Google Cloud's Vertex AI platform for custom model inference."

"By incorporating WeatherNext technology, we've now upgraded weather forecasts in Search, Gemini, Pixel Weather and Google Maps Platform's Weather API. In the coming weeks, it will also help power weather information in Google Maps."

I don't think this blog post from DeepMind does an adequate job of explaining what's different about this from regular weather prediction, and maybe I'll become obvious as you all use it in Google Maps or Google Earth. But the way this works is fundamentally different from traditional weather prediction. Traditional weather prediction uses supercomputers to simulate the Navier-Stokes equations, which are fluid dynamics equations. Although they are called "fluid dynamics" equations, they work for gases, including the atmosphere, as well as liquids such as water. The equations can handle compressible and incompressible "fluids".

What's going on here is you have not one model but many, and the models don't simulate physics, instead they are neural networks trained on historical weather data. The advantage of using many models is that you don't just predict the one most likely future weather scenario, you predict many scenarios. By examining the output of all the models, you learn "not only the most likely future weather conditions, but the range of probable conditions that may unfold." The good thing about this is that if an extreme weather event is unlikely but possible, you might still want to know about the possibility, and this system enables you to know that.

Furthermore, the models are run many times by taking the same identical input and injecting "noise" into it. These "perturbations" are also done during the training of the neural networks. Although at first glance, it may seem like this must make the model predictions worse, there is a point to it. Measurements of weather conditions (temperature, humidity, pressure, wind direction and velocity, precipitation, etc) have inaccuracies, and even if they were perfectly accurate, we only measure a small subset of all possible sampling points in the atmosphere of the planet with our satellite and ground-based observation systems. The process of injecting "noise" into the inputs makes the models more robust against this inaccuracy of our real data and the fact that it's always inherently partial. (Scientists have a fancy term for this, "aleatoric uncertainty". Scientists have fancy terms for everything.)

This "ensemble" system -- an "ensemble" of models rather than a single model -- make it a challenge to evaluate, to see if it works successfully. One thing these researchers did was test its cyclone path predictions with the actual paths cyclones took. This is in addition to the Continuous Ranked Probability Score (CRPS -- I'm going to skip explaining this now and leave it to a link below), which is a standard benchmark for weather predictions. This system "achieves state-of-the-art cyclone track prediction".

Thumbnail
"I caught Google Gemini using my data -- and then covering it up."

"I asked Google Gemini a pretty basic developer question. The answer was unremarkable, apart from it mentioning in conclusion that it knows I previously used a tool called Alembic:"

When he (Jaka JanĨar) asks, "How did you know I worked with Alembic?", Gemini apologizes and says "I don't actually know your project history."

But opening up "Show thinking" reveals... that the model knows it came from the user's "Interests & Preferences" section of their user context. But Gemini "cannot divulge the source my knowledge or confirm/deny its existence." (!)

Thumbnail
"I looked into CoreWeave and the abyss gazed back."

"CoreWeave first came to my attention because it innovated in something that surprised me: using GPU as collateral for $2.3 billion in loans at an effective interest rate of 15 percent in the last quarter, according to the company's most recent quarterly filing."

"The company said it owned more than 250,000 Nvidia chips, the infrastructure necessary to run AI models, in documents CoreWeave filed for its initial public offering. It also said it only had Nvidia chips. On top of that, Nvidia is a major investor in CoreWeave, and owned about $4 billion worth of shares as of August. Nvidia made the March IPO possible, according to CNBC: when there was lackluster demand for CoreWeave's shares, Nvidia swooped in and bought shares. Also, Nvidia has promised to buy any excess capacity that CoreWeave customers don't use."

Circular at all?

Thumbnail
Latent Library is a library of infinite books, because they don't exist until you read them -- then they're generated by large language models. Alrighty then. I suspect LLMs are not quite good enough yet for this idea.

Thumbnail
"Why we ditched frontier AI agents and built our own."

"To evaluate different models and AI coding agents effectively, we needed a way to measure performance at scale, with statistically significant results and low operational overhead to enable fast iteration. Our first step was benchmarking models from multiple LLM providers alongside various AI coding agents. At the start, we found a few open-source solutions that offered similar capabilities (like running tests using Docker containers from a declarative setup) but they often supported only specific environments, such as Python repositories, or relied on predefined agents. None met all our requirements."

"Our needs also varied greatly by feature. For example, some use cases involve AI leaving PR review comments, summarizing failed build logs and suggesting fixes, or automatically resolving failing CI builds. Many scenarios require custom setups to enable assertions, such as validating AI-generated PR comments or failure summaries."

"We decided to build our own internal eval framework in our preferred language: Go."

"Our goal was to run tests in parallel on all agents and report results to a central database for dashboard viewing."

They evaluated several AI coding agents: Claude Code (Anthropic), Codex (OpenAI), Gemini (Google), and an open source agent called OpenCode.

"After exploring all options, we asked a key question: could we build an in-house coding agent matching Claude Code's performance using Anthropic APIs, but without vendor lock-in?"

"Turns out, we could."

The blog post proceeds to list all the advantages of building their own AI coding agent (can evolve it independently of vendor timelines, avoid breaking interface changes, integrate more smoothly into their own development ecosystem, store LLM messages in a provider-agnostic format allowing for future model-switching, programmatic checkpoints, etc), but the details of how they did it are promised for a future post.

Thumbnail
"Anthropic published a report on the first documented state-level cyberattack carried out largely autonomously by AI agents. A threat actor (that Anthropic determined with 'high confidence' to be a 'Chinese state-sponsored group') used the AI programming tool Claude Code to conduct an espionage operation against a wide range of corporate and government systems. Anthropic states that the attacks were successful 'in a small number of cases'."

"Anthropic was later able to detect the activity, ban the associated accounts, and alert the victims, but not before attackers had successfully compromised some targets and accessed internal data."

"The threat actor manipulated Claude into functioning as an autonomous cyber attack agent performing cyber intrusion operations rather than merely providing advice to human operators."

"Human operators maintained minimal direct engagement, estimated at 10 to 20 percent of total effort."

"Initial targets included major technology corporations, financial institutions, chemical manufacturing companies, and government agencies across multiple countries. At this point they had to convince Claude -- which is extensively trained to avoid harmful behaviors -- to engage in the attack. The key was role-play: the human operators claimed that they were employees of legitimate cybersecurity firms and convinced Claude that it was being used in defensive cybersecurity testing."

"Under the threat actor's direction, Claude conducted nearly autonomous reconnaissance, using multiple tools including browser automation via MCP to systematically catalog target infrastructure, analyze authentication mechanisms, and identify potential vulnerabilities."

"Exploitation proceeded through automated testing of identified attack surfaces with validation via callback communication systems. Claude was directed to independently generate attack payloads tailored to discovered vulnerabilities, execute testing through remote command interfaces, and analyze responses to determine exploitability."

"Claude executed systematic credential collection across targeted networks. This involved querying internal services, extracting authentication certificates from configurations, and testing harvested credentials across discovered systems. Claude independently determined which credentials provided access to which services, mapping privilege levels and access boundaries without human direction."

"Lateral movement proceeded through AI-directed enumeration of accessible systems using stolen credentials. Claude systematically tested authentication against internal APIs, database systems, container registries, and logging infrastructure, building comprehensive maps of internal network architecture and access relationships."

"Collection operations demonstrated the most extensive AI autonomy. Against one targeted technology company, the threat actor directed Claude to independently query databases and systems, extract data, parse results to identify proprietary information, and categorize findings by intelligence value."

Commentary: Another item to move from your list of "things that will happen in the future" to your list of "things that have already happened."

Thumbnail
OlmoEarth is a new state-of-the-art Earth observation foundation model family from the Allen Institute for Artificial Intelligence.

If you're wondering how it compares with DeepMind's AlphaEarth, which I told you all about back in August, they say:

"We compared OlmoEarth to Google DeepMind's AlphaEarth Foundations. AlphaEarth Foundations required a different analysis because Google released annualized embeddings, but not the model itself. When we compared the AlphaEarth Foundations and OlmoEarth embeddings using k-Nearest-Neighbors (kNN) on three tasks, we found OlmoEarth performed on par or better than AlphaEarth Foundations. However, once we fine-tuned OlmoEarth, it outperformed AlphaEarth Foundations substantially. This underscores the importance of a platform that makes fine-tuning and model customization as accessible as possible."

And they have a graph of 'Comparison of OlmoEarth embeddings and fine-tuning performance to AlphaEarth Foundations embeddings on three real-world partner tasks spanning classification of crop types in Kenya (Nandi), land-use land-cover classification (AWF), and high precision ecosystem classification (Ecosystem)' that shows OlmoEarth Base (90 million parameter model) after finetuning getting higher accuracy scores on all three.

"There is no standard evaluation test suite for remote sensing models. While there are some established standard practices, they are not always followed. To get a more complete picture of the state of foundation modeling we run a comprehensive evaluation effort of OlmoEarth compared to 12 other foundation models on 18 research benchmarks. Further, to evaluate real-world performance we also evaluate models on 19 datasets from 7 partner organizations that are using Earth observation modeling in their work. Following standard practice we evaluate all models using simple transfer learning techniques (kNN and linear probing) as well as full, end-to-end fine-tuning. We evaluate all models using a standard training recipe and sweeping over a variety of parameters and settings, ensuring a fair evaluation.

"OlmoEarth achieves the best performance in 15 of 24 tasks for the kNN/LP evaluation and 20 of 29 tasks for full fine-tuning."

They've also created something called OlmoEarth Platform so you can offload the work of managing GPUs to them.

"The OlmoEarth Platform is an end-to-end solution for organizations who want to harness Earth observation data for the public good."

How does OlmoEarth work?

"Existing foundation model approaches either train in a supervised or unsupervised setting. Some foundation models are trained to predict supervised labels like land cover maps from satellite observations. Other foundation models use the vast quantity of unlabeled data to train in a self-supervised manner. We present a formulation that unifies these approaches into a single task, show that it works well with only observational data, and further improves when we add labels."

"Our unified approach strikes a middle ground between two common approaches in self-supervised learning. Masked autoencoders predict pixel-level reconstructions of masked input while approaches like I-JEPA and Latent Masked Image Modeling (Latent MIM) predict reconstructions in feature space. Masked autoencoders tend to be stable but limited in their feature representations while latent approaches are unstable but produce better features (if they don't crash out during training)."

"Many foundation models build upon work in domains like image or text processing. Earth observation data differs from these domains in having spatially aligned yet highly multi-modal, multi-temporal data. We find that adjusting our masking strategy and loss to account for this unique domain gives us significantly better performance."

"In image or text modeling it is sufficient to randomly mask some portion of the input and have the model reconstruct the input from context. With remote sensing data, because we have aligned data over various modalities and timesteps, a uniform masking strategy over all tokens may be too easy of a task. Any token in the input will have many similar tokens either in space, time, or at a different aligned modality. There's almost too much context unless you use a very high masking ratio. We adjust our masking strategy to limit the amount of context present in any sample and make the problem challenging without resorting to skewed masking ratios."

"Similarly, with our loss formulation we find a small adjustment makes a large difference in downstream performance. Like other SSL approaches in latent space we use a contrastive loss instead of a reconstruction loss. However, contrasting a reconstructed token against all other tokens in a batch, or even in the same sample, leads to many easy negatives given the highly redundant nature of Earth observation data. Instead we contrast tokens only with other tokens in their respective bandset. This focuses the model training on a more challenging but more productive objective."

By "bandset", they mean grouping bands captured at the same resolution together, even if they come from different satellites. Landsat and Sentinel-2 data gets divided and grouped into bandsets.

The data comes from Sentinel-1, Sentinel-2, and Landsat. I'm guessing Sentinel-1A (launched in 2014), same as AlphaEarth -- Sentinel-1B stopped functioning in 2021 -- Sentinel-2A and 2B (launched in 2015 and 2017), and Landsat 8 and 9 (launched in 2013 and 2021). Sentinel-2A has visible light (RGB), near-infrared, and shortwave-infrared bands. Landsat 8 and 9 have visible light (RGB) and thermal (infrared).

"OlmoEarth is a Vision Transformer (ViT) based encoder-decoder style architecture. It processes a multi-modal image timeseries of aligned satellite images and derived maps. A FlexiViT-style projection layer converts the input data from pixels to tokens with a variable patch size. Positional, temporal, and modality encodings add additional context to the tokens. During training, some portions of the input tokens are masked. The encoder transformer layers attend across space, time, and between modalities to produce embeddings for the input tokens. The decoder predicts representations for the masked input tokens."

"Our pretraining dataset contains 285,288 samples from around the world. Each sample covers a 2.56km x 2.56km spatial region and a one-year time range. For multi-temporal modalities, we use up to 12 timesteps sampled monthly over the course of the year, although many samples contain only a subset of the timesteps and modalities."

"For the above modalities we resample the data to be uniformly 10 meters per pixel. We have experimented with adding NAIP data at 2.5 meter per pixel and ERA5 data at 160 meters per pixel but found no significant improvement on our evaluations."

ERA5 refers to the "fifth generation" version of a dataset of climate data produced by the European Centre for Medium-Range Weather Forecasts. The dataset was created by taking observations made from the ground and in the atmosphere (but not from space), fitting a model to that data, and generating an hour-by-hour dataset of Earth's atmosphere, land, and oceans from 1940 to the present.

NAIP refers to the US Department of Agriculture's National Agriculture Imagery Program, which is a dataset of aerial photos (from airplanes and drones, not from space) of US agricultural land. They didn't use it, though.

"Once the input is in token space, OlmoEarth adds in a 2D sincos positional embedding, a sinusoidal temporal embedding, and a learnable modality embedding to each token. During training, some tokens are masked out of the input, otherwise all tokens are passed to the encoder transformer which performs full self-attention across space, time, and between modalities."

"OlmoEarth uses a modality-aware masking strategy. For every example the masking strategy selects some bandsets to be encoded and also some to be decoded, non-exclusively."

"This masking strategy re-frames the problem slightly from reconstructing data that has been partially masked to reconstructing missing bandsets from partial views of other bandsets."

"This masking strategy re-frames the problem slightly from reconstructing data that has been partially masked to reconstructing missing bandsets from partial views of other bandsets. When all bandsets are encoded and decoded we find the task is too easy. Masked tokens in a bandset will likely have other tokens in the same bandset that are highly correlated with them that are visible in the input, tokens nearby spatially or temporally. Training in this easier paradigm requires using very high masking ratios (i.e. masking out 90% of the input) to get decent results. Masking some bandsets entirely makes the problem harder and we can use more balanced masking ratios."

"OlmoEarth trains on both observations and maps but at inference time we only use observations. Maps can change over time -- indeed downstream tasks are often detecting this kind of change -- so we only rely on observations for inference."

"During training OlmoEarth predicts reconstructions of the masked input in latent space. We use a randomly initialized, frozen projection layer for each modality to project masked patches in the input into token space. Thus OlmoEarth performs Latent Masked Image Modeling, but based on Linear, Invariant Token Embeddings."

"Latent MIM Lite allows us to unify supervised and self-supervised training under the same architecture. We project each modality, whether observations or maps, through a frozen random projection into token space."

"Loss is calculated the same for both types of modalities. We don't need to add on specific predictor heads for supervised data or adjust our training strategy or loss. In our ablations we see this approach gives strong results in a purely self-supervised setting and also benefits from additional supervised data."

At this point you may be confused as to the difference between "token space" and "latent space". Apparently there is a tokenization process at the point of inputting satellite images and maps, but I am unclear as to what this is. But once the "tokenization" process is complete, the vision transformer (ViT) architecture (called Latent MIM -- where "MIM" stands for "Masked Image Modeling") takes over and produces an output in "latent space". This "latent space" encoding is then decoded back to "token space" to compare with the input token for training. The system does not go all the way to try to reproduce the original satellite images. If you're wondering what the point is comparing the output to the input when they're the same, they're not the same -- part of the input is masked, hence the name "Masked Image Modeling".

"Latent MIM uses a contrastive loss (Patch Discrimination) instead of reconstruction loss to incentivize diversity in the latent space predictions. Patch discrimination loss frames token reconstruction as a classification task where we want the predicted token for a patch to be similar to the target token but dissimilar from other ground truth tokens for other patches. Patch discrimination uses cosine similarity to measure token similarity and cross entropy loss to contrast between positive and negative matches."

If you're wondering what "contrastive loss" vs "reconstruction loss" is all about, remember that image generation models like Dall-E (now part of ChatGPT) learn what words go with what images in their training data through contrastive learning. You have a description with a whole bunch of words -- how do you know "tiger" is the important word that needs to be learned, and to words like "the"? With contrastive learning, the system doesn't just compare its output with the expected answer, it compares its output with all the "wrong" answers, with a negative training signal. Because "the" appears everywhere, it washes out, while "tiger" gets associated with pictures that have actual tigers in them.

So what they're doing here is not just comparing the decoding of the latent space encoding with the token for the input images, it's also comparing it with tokens for a random sample of other images from the training data with the negative training signal, so contrastive learning takes place.

Thumbnail
AI-powered fortune telling?

"Professional, accurate, and fast online fortune telling service, revealing the secrets of your destiny." "Combining traditional fortune telling with AI technology for deeper and more personalized destiny analysis."

Seriously? Alrighty then. I guess someone had to do it. You can add "AI fortune telling" to your list of "things that already happened".

This website is from China, if that indicates anything. It's called Rensheng Daoshi ("life mentor") though the domain name is suanmingzhun.com and the title in English is given as "Fateguide".

Thumbnail
TigerBeetle is a financial transactions database built for correctness "that offers two primitives for double-entry bookkeeping: accounts and transfers. A separate data store, such as Postgres, stores master data, such as name and address of the account holder or terms and conditions of the account."

"This separation enables transfers to scale independently of general purpose master data (for example dealing with Black Friday events) and solves different security, compliance, or retention requirements of the independent data sets (for example enforce immutability of transfers)."

"Just as a bank may have need for both a filing cabinet and a bank vault, Postgres specializes in strings and describing entities (master data), while TigerBeetle specializes in integers and moving integers between these entities."

"Since Postgres and TigerBeetle do not share a transaction boundary, the application must ensure consistency through repeated attempts at completion and coordination, not transactions."

"We must designate a:"

"System of Record. The champion. If the account exists here, the account exists on a system level."

"System of Reference. The supporter. If the account exists here but not in the system of record, the account does not exist on a system level."

"So which system is the system of record and which is the system of reference? That is an architectural decision that depends on your requirements and the properties of the subsystems. In this case, TigerBeetle is the system of record:"

"If the account is present in Postgres, the account is not able to process transfers, so the account in Postgres merely represents a staged record."

"If the account is present in TigerBeetle, the account is able to process transfers, so the account in TigerBeetle represents a committed record."

"Once the system of record is chosen, correctness depends on performing operations in the right order."

"Since the system of reference doesn't determine existence, we can safely write to it first without committing anything. Only when we write to the system of record does the account spring into existence."

"Conversely, when reading to check existence, we must consult the system of record, because reading from the system of reference tells us nothing about whether the account actually exists."

They call this principle "write last, read first" -- that is, relative to the system of record: write to the system of record last, read from the system of record first.

I knew distributed transactions were difficult, and never thought of this idea.

Apparently, though, there is one more requirement: serializability, which I take to mean transactions on the system of record have to be queued up single file and processed in sequence. Surely for scalability, the system must have some ability to determine which transactions don't affect one another and can be executed in parallel? Or maybe they just made the system so fast at "moving integers" that it can scale up to the whole globe while maintaining serializability?

"Remarkably, if the system of record provides strict serializability, like TigerBeetle, and if ordering is correctly applied, then the system as a whole preserves strict serializability, leading to a delightful developer experience."