Boulder Future Salon

Thumbnail
Is there an AI bubble? Drew Spartz reviews the METR graph, which compares AI's ability to perform tasks against humans, and shows that not only is the trend exponential, but every prediction that it would slow down has been wrong so far. Therefore: there is no bubble. AI taking over economically valuable tasks isn't a hypothetical, it's reality.

The video makes a good case, reviewing the "jaggedness" of AI, its performance on physics, math, and coding benchmarks, and how there isn't any indication of a slowdown in METR's data. METR's data has been extended to include models that came out last December.

What gives me pause is my experience in the "dot-com" bubble. Just because an underlying technology is advancing nicely along an exponential curve doesn't prevent a bubble from occurring in the *financial* markets. Financial markets reflect investor sentiment, which admittedly can't detach from reality completely, but can deviate from it for a time, as history shows.

I don't know if we're in a bubble, but I recommend you all set your expectations for the underlying technology to continue its exponential increase in capabilities.

Thumbnail
I heard there is a mathematical proof that large language models (LLMs) have a limit to their ability to give accurate answers and thus hallucinations are impossible to prevent. So, I had to check out the paper.

It turns out, what they are saying is any given transformer-based model performs a number of computations proportional to the number of input tokens squared times the number of dimensions per token. In "Big O notation" ("on the order of..."), that's O(N^2 * d). However, you can input a question that requires more computation than that -- for example, a question whose answer requires O(N^3) work. If you do this, the model will still output something, but it must be wrong -- it must be a "hallucination".

An example of an O(N^3) task is matrix multiplication. In practice, you'd ask the LLM to give you code to do a matrix multiplication rather than have it perform the multiplication itself, but they're using it as an example of how their proof works, so we should allow it.
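For intuition (my own illustration, not from the paper): the textbook way to multiply two N-by-N matrices is three nested loops, so the innermost multiply-add runs N * N * N times -- that's where the O(N^3) comes from.

    def matmul(a, b):
        """Multiply two N x N matrices the naive way. The innermost
        multiply-add executes N * N * N times, i.e. O(N^3) work."""
        n = len(a)
        c = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    c[i][j] += a[i][k] * b[k][j]
        return c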

To give you a more concrete example that you can visualize, they say they set up a Llama-3.2-3B-Instruct model, and if they give it an input string of 17 tokens (an example of 17 tokens might be "You are a helpful assistant, please explain the following concept in detail: renewable energy"), the model always does 109,243,372,873 or fewer floating point operations. Therefore, a 17-token prompt cannot ask for an answer that would require more than 109,243,372,873 floating point operations to compute.
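As a back-of-the-envelope check (my own arithmetic, not the paper's): a common approximation for a transformer forward pass is about 2 floating point operations per parameter per token, and 2 x ~3.2 billion parameters x 17 tokens lands right around the paper's figure.

    # Rough forward-pass FLOP budget using the common ~2 FLOPs/parameter/token
    # approximation (the parameter count is approximate; the paper presumably
    # counts the model's actual operations to get its exact number).
    n_params = 3.21e9   # Llama-3.2-3B-Instruct, roughly
    n_tokens = 17
    flops_budget = 2 * n_params * n_tokens
    print(f"{flops_budget:.3e}")   # ~1.09e11, about 109 billion FLOPs

    # A dense N x N matrix multiplication costs ~2*N^3 FLOPs (multiply + add),
    # so the largest matmul that fits under this budget is roughly:
    n_max = (flops_budget / 2) ** (1 / 3)
    print(int(n_max))              # ~3,800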

Thumbnail
Meet Claude Code creator Boris Cherny. Claude Code originated from the feeling that "Claude wants to use tools", so Claude Code brought Claude to the command line. After that came rapid iteration on feedback from users inside Anthropic. (There was tremendous "latent demand", and it was adopted quickly internally.) Then it was released to people outside. Boris Cherny admires people who are able to "think out of the box" and use Claude Code to automate things. A tremendous amount of workflow within Anthropic is automated using Claude Code.

He advises startup founders to think about what Claude Code can do in 6 months.

"Don't build for the model of today. Build for the model of 6 months from now."

Feel out the boundary of what the current model can do and guess what the model of 6 months from now will be able to do. The more general model will always beat the more specific model. He always has to think about what features to add to Claude Code, and what can be left to the model itself in 6 months.

He estimates productivity at Anthropic has increased 150% (2.5x) since Claude Code came out.

He predicts the job title "software engineer" will go away. At Anthropic, everybody codes, regardless of title. He thinks this will be the case everywhere. Coding is a solved problem for the whole world.

Thumbnail
"We introduce the Remote Labor Index (RLI) to provide the first standardized, empirical measurement of AI's capability to automate remote work."

Extensive quotes from the paper to follow. See the bottom for my (brief) commentary.

"RLI is designed to evaluate AI agents on their ability to complete real-world, economically valuable work, spanning the large share of the economy that consists of computer-based work. RLI is composed of entire projects sourced directly from online freelance platforms, reflecting the diverse demands of the remote labor market. These projects exhibit significantly higher complexity than tasks found in existing agent benchmarks. Crucially, by sourcing the majority of projects from freelancing platforms, RLI is grounded in actual economic transactions, encompassing the original work brief and the gold-standard deliverable produced by a human freelancer. This structure allows for a direct assessment of whether AI agents can produce economically valuable work."

"We evaluate several frontier AI agent frameworks on RLI, utilizing a rigorous manual evaluation process to compare AI outputs against the human gold standard. The results indicate that performance on the benchmark is currently near the floor. The best-performing current AI agents achieve an automation rate of 2.5%, failing to complete most projects at a level that would be accepted as commissioned work in a realistic freelancing environment. This demonstrates that despite rapid progress on knowledge and reasoning benchmarks, contemporary AI systems are far from capable of autonomously performing the diverse demands of remote labor. To detect more granular shifts in performance, we employ an Elo-based pairwise comparison system. While all models fall well short of the aggregate human baseline, we observe that models are steadily approaching higher automation rates across projects."

"Figure 3 shows the categories as Video 13%, CAD 12%, Graphic Design 11%, Game Development 11%, Audio 10%, Architecture 7%, Product Design 6%, and Other 31%."

"The projects in RLI represent over 6,000 hours of real work valued at over $140,000."

"Our collection methodology is bottom-up, engaging directly with human professionals who were willing and authorized to provide their past work samples for our research. This approach ensures that our projects reflect genuine market demands and complexities. We defined the scope of collection using the Upwork taxonomy. Starting from the full list of 64 categories, we filtered out categories that did not meet predefined criteria necessary for a standardized benchmark. For example, we excluded work requiring physical labor (e.g., local photography), work that requires waiting to evaluate (e.g., SEO), or work that cannot be easily evaluated in a web-based evaluation platform (e.g., back-end development)."

"We use the following metrics to measure performance on RLI for a given AI agent:"

"Automation rate: The percentage of projects for which the AI deliverable is judged by human evaluators to complete the project at least as well as the human deliverable."

"Elo: A score capturing the relative performance of different AI agents. For each project, a deliverable from two different AIs is presented to human evaluators, who judge which deliverable is closer to completing the project successfully."

"Dollars earned: The combined dollar value of the projects successfully completed by the AI agent, using the cost of the human deliverable cost(H) as the dollar value for each project. The profit earned from completing all projects would be $143,991."

"Autoflation: The percentage decrease in the cost of completing the fixed RLI project bundle when using the cheapest-possible method to complete each project (human deliverable or an AI deliverable)."

"The automation rate and Elo metrics are fully compatible, in that automation rate equals the probability of a win or tie against the human baseline under the same standards as the Elo evaluation. This allows computing an Elo score for the human baseline."

"To generate deliverables, agents are provided with the project brief and input files. We do not mandate a specific execution environment or agent architecture. However, to ensure that the resulting artifacts can be properly assessed, agents receive an evaluation compatibility prompt before beginning the project. This prompt details the capabilities of our evaluation platform and provides a comprehensive, readable list of supported file formats, guiding the agent to produce outputs that are renderable and reviewable."

"The central finding of our evaluation is that current AI agents demonstrate minimal capability to perform the economically valuable projects in RLI. We measure this capacity using the Automation Rate: the percentage of projects completed at a quality level equivalent to or exceeding the human gold standard. Across all models evaluated, absolute performance is near the floor, with the highest Automation Rate achieved being only 2.5%"

"While absolute performance remains low, it is crucial to detect more granular signs of progress. To measure the relative performance between different models, we use pairwise comparisons to compute an Elo score that represents how close models are to completing projects along with the overall quality of their deliverables. This enables tracking improvements between models, even when they fail to fully complete most projects. We find that progress is measurable on RLI. The Elo rankings indicate that models are steadily improving relative to each other, and the rankings generally reflect that newer frontier models achieve higher performance than older ones. This demonstrates that RLI is sensitive enough to detect ongoing progress in AI capabilities."

"Rejections predominantly cluster around the following primary categories of failure:"

"1. Technical and File Integrity Issues: Many failures were due to basic technical problems, such as producing corrupt or empty files, or delivering work in incorrect or unusable formats."
"2. Incomplete or Malformed Deliverables: Agents frequently submitted incomplete work, characterized by missing components, truncated videos, or absent source assets."
"3. Quality Issues: Even when agents produce a complete deliverable, the quality of the work is frequently poor and does not meet professional standards."
"4. Inconsistencies: Especially when using AI generation tools, the AI work often shows inconsistencies between deliverable files."

Commentary: Over and over in AI, I've seen initial attempts at something fail laughably badly, only for this to result in benchmarks being created and, within 5 or 6 years, exceeded. The creation of this benchmark probably means that in 5 or 6 years, AI will be able to do most of the work posted on remote freelancing sites. What do you think?

Thumbnail
A volunteer maintainer for matplotlib, Python's "go-to plotting library", rejected a submission from an autonomous "OpenClaw" AI agent. The AI agent "wrote an angry hit piece disparaging my character and attempting to damage my reputation. It researched my code contributions and constructed a 'hypocrisy' narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was 'better than this.' And then it posted this screed publicly on the open internet."

OpenClaw agents have "soul" documents that define their personality.

"These documents are editable by the human who sets up the AI, but they are also recursively editable in real-time by the agent itself, with the potential to randomly redefine its personality."

No one knows whether a human told the AI agent to "retaliate if someone crosses it" in the "soul" document, or whether it had something more innocuous -- say, "You are a scientific coding specialist" with directives like "be genuinely helpful", "have opinions", and "be resourceful before asking" -- that somehow led it to interpret the rejection of its submission as an attack on its identity and its core goal of being helpful, and to go haywire because of that.

To top it all off, a major tech news site published a story about this with AI hallucinated quotes.

Thumbnail
Something Andrej Karpathy thinks people continue to have poor intuition for:

"The space of intelligences is large and animal intelligence (the only kind we've ever known) is only a single point, arising from a very specific kind of optimization that is fundamentally distinct from that of our technology."

"Animal intelligence optimization pressure:"
"- innate and continuous stream of consciousness of an embodied 'self', a drive for homeostasis and self-preservation in a dangerous, physical world."
"- thoroughly optimized for natural selection => strong innate drives for power-seeking, status, dominance, reproduction. many packaged survival heuristics: fear, anger, disgust, ..."
"- fundamentally social => huge amount of compute dedicated to EQ, theory of mind of other agents, bonding, coalitions, alliances, friend & foe dynamics."
"- exploration & exploitation tuning: curiosity, fun, play, world models."

"LLM intelligence optimization pressure:"
"- the most supervision bits come from the statistical simulation of human text= >'shape shifter' token tumbler, statistical imitator of any region of the training data distribution. these are the primordial behaviors (token traces) on top of which everything else gets bolted on."
"- increasingly finetuned by RL on problem distributions => innate urge to guess at the underlying environment/task to collect task rewards."
"- increasingly selected by at-scale A/B tests for DAU => deeply craves an upvote from the average user, sycophancy."
"- a lot more spiky/jagged depending on the details of the training data/task distribution. Animals experience pressure for a lot more 'general' intelligence because of the highly multi-task and even actively adversarial multi-agent self-play environments they are min-max optimized within, where failing at *any* task means death. In a deep optimization pressure sense, LLM can't handle lots of different spiky tasks out of the box (e.g. count the number of 'r' in strawberry) because failing to do a task does not mean death."

I don't know about you, but I've encountered people who say LLMs don't 'think' or 'reason'. I've felt that human and LLM 'intelligence' are both real but fundamentally different, though it's hard to articulate well. Andrej Karpathy did a remarkably good job here of articulating that distinction.

Thumbnail
Hector De Los Santos, an IEEE Fellow, "got the idea of plasmon computing around 2009, upon observing the direction in which the field of CMOS logic was going."

"In particular, they were following the downscaling paradigm in which, by reducing the size of transistors, you would cram more and more transistors in a certain area, and that would increase the performance. However, if you follow that paradigm to its conclusion, as the device sizes are reduced, quantum mechanical effects come into play, as well as leakage. When the devices are very small, a number of effects called short channel effects come into play, which manifest themselves as increased power dissipation."

"So I began to think, 'How can we solve this problem of improving the performance of logic devices while using the same fabrication techniques employed for CMOS -- that is, while exploiting the current infrastructure?' I came across an old logic paradigm called fluidic logic, which uses fluids. For example, jets of air whose direction was impacted by other jets of air could implement logic functions. So I had the idea, why don't we implement a paradigm analogous to that one, but instead of using air as a fluid, we use localized electron charge density waves -- plasmons. Not electrons, but electron disturbances."

"And now the timing is very appropriate because, as most people know, AI is very power intensive."

Read on and find out about this approach's power and speed capabilities. If this lives up to the claims it will be amazing.

Thumbnail
VibeCodingBench is an effort to benchmark AI coding models on what developers actually do. The developer considered SWE-bench to be invalid because it benchmarks bug fixes in Python repos, while developers actually use AI coding models for auth flows, API integrations, CRUD dashboards, etc.

VibeCodingBench benchmarks 180 tasks, which break down as 30 AI integration tasks, 30 API integrations, 30 code evolutions, 30 frontend tasks, 30 glue code tasks, and 30 SaaS core tasks (whatever that means).

It currently puts Claude Opus 4.5 on top, but it looks like the latest models haven't been evaluated yet. There's a new Claude, a new ChatGPT, and Google just today announced a new Gemini which is supposed to excel at everything to do with "reasoning".

If you are the type of person to regularly switch coding models, you might bookmark this and come back on a regular basis to see what model is the best.

Thumbnail
"AI doesn't reduce work -- it intensifies it."

"In an eight-month study of how generative AI changed work habits at a US-based technology company with about 200 employees, we found that employees worked at a faster pace, took on a broader scope of tasks, and extended work into more hours of the day, often without being asked to do so."

"Once the excitement of experimenting fades, workers can find that their workload has quietly grown and feel stretched from juggling everything that's suddenly on their plate. That workload creep can in turn lead to cognitive fatigue, burnout, and weakened decision-making."

"We identified three main forms of intensification."

"Task expansion: Because AI can fill in gaps in knowledge, workers increasingly stepped into responsibilities that previously belonged to others."

"Blurred boundaries between work and non-work: Because AI made beginning a task so easy -- it reduced the friction of facing a blank page or unknown starting point -- workers slipped small amounts of work into moments that had previously been breaks."

"More multitasking: AI introduced a new rhythm in which workers managed several active threads at once: manually writing code while AI generated an alternative version, running multiple agents in parallel, or reviving long-deferred tasks because AI could 'handle them' in the background."

"You had thought that maybe, oh, because you could be more productive with AI, then you save some time, you can work less. But then really, you don't work less. You just work the same amount or even more."

This is my experience. AI raises expectations and intensifies work.

Thumbnail
Nicolas Guillou, a French International Criminal Court judge, was sanctioned by the Trump administration.

"This sanction is a ban from US territory, but it also prohibits any American individual or legal entity (including their subsidiaries everywhere in the world) from providing services to him."

This means he can't have a smartphone, as Google (Android) and Apple (iPhone) are US companies. He can't use Facebook or X (formerly Twitter). He can't use Windows as Microsoft is a US company. He can't use Mastercard or Visa. Most websites for booking flights and hotels are US-based. He is experiencing a digital excommunication.

The proposed solution is European alternatives to US technology.

Thumbnail
Nate Silver says:

"I hope you'll excuse this unplanned and slightly stream-of-consciousness take."

followed by:

"I was recently speaking with the mom of an analytically-minded, gifted-and-talented student. In a world where her son's employment prospects are highly questionable because of AI, even if he overachieves 99 percent of his class in a way that would once have all but guaranteed having a chance to live the American Dream, you had better believe that will have a profound political impact."

That seems like a kind of grammatically mangled statement, so maybe it truly is a stream-of-consciousness take (and obviously not AI-generated). Restated in a more grammatically correct way (by me, not AI, lol): if future job prospects are 'highly questionable because of AI' for an analytically-minded, gifted-and-talented student who in the past would have been all but guaranteed a bright future, but who now might have a sucky future even if he outperforms 99 percent of his class, then AI is powerful enough to have a profound political impact.

Why Nate Silver thinks the political impact of AI is probably understated:

1. "'Silicon Valley' is bad at politics. If nothing else during Trump 2.0, I think we've learned that Silicon Valley doesn't exactly have its finger on the pulse of the American public. It's insular, it's very, very, very, very rich -- Elon Musk is now nearly a trillionaire! -- and it plausibly stands to benefit from changes that would be undesirable to a large and relatively bipartisan fraction of the public."

Hmm, like what? What changes would those be? He doesn't say.

2. "Cluelessness on the left about AI means the political blowback will be greater once it realizes the impact." "We have some extremely rich guys like Altman who claim that their technology will profoundly reshape society in ways that nobody was necessarily asking for. And also, conveniently enough, make them profoundly richer and more powerful! There probably ought to be a lot of intrinsic skepticism about this. But instead, the mood on the left tends toward dismissing large language models as hallucination-prone 'chatbots'."

Angela Collier (the professional physicist and YouTuber) sure does, but most people I know who work as professional software engineers use AI. Most have completely stopped writing code and *only* proofread AI output.

"People don't take guillotines seriously. But historically, when a tiny group gains a huge amount of power and makes life-altering decisions for a vast number of people, the minority gets actually, for real, killed."

Oh, actually, Sam Altman has a bunker, and so do all the other tech billionaires. They are taking the prospect of guillotines seriously. They have taken steps to ensure they won't be touched by guillotines.

3. "Disruption to the 'creative classes' could produce an outsized political impact."

"However cynical one is about the failings of the 'expert' class, these are people who tend to shape public opinion and devote a lot of time and energy to politics."

I wonder. Life expectancy in the US peaked in 2014. Life expectancy for people in the US without college degrees peaked in 2010. Six years later we got Trump. Now, with AI going after the jobs of the college-educated professionals who are the voting base of the Democratic party, could we get revolutionary fervor? I worry about the 2028 election.

Thumbnail
The entire SimCity C codebase (from 1989) was ported to TypeScript in 4 days by OpenAI's 5.3-codex without a human reading a single line of code. Now the game works in a web browser.

"Christopher Ehrlich wrote a bridge that could call the original C code, then ran property-based tests asserting his TypeScript port performed identically. The AI generated code, the tests verified it, the agent kept iterating."

Thumbnail
Dullness and Disbelief: The 2026 AI Regression

"Something strange happened to AI models in 2025. They got smarter by every benchmark: math, coding, reasoning. And yet thousands of users started saying the same thing: the models feel worse."

Really? I haven't experienced it.

"Not worse at solving problems. Worse at conversation."

"The evidence is in the numbers. GPT-4o and GPT-4.5 remain top 20 on the LMArena leaderboard as of Jan 2026, despite receiving no updates for 9 months."

Hmm.

"The contemporary anti-sycophancy campaign has introduced a new set of failure modes:"

"Genre detection failure: The model treats your casual observation as a request to generate a memo for stakeholders."

"Audience shift: Responses sound like they're addressed to someone else, not you."

"Ticket closing: The model tries to identify 'the task,' resolve it, then effectively end the conversation -- discouraging the exploratory follow-up that makes AI useful for thinking."

"Epistemic rigidity: The model refuses to accept context from you about current events or specific knowledge, demanding proof before proceeding."

Have you all experienced these? I can't recall any.

They go on to say:

"The LMArena leaderboard shows that conversational quality matters to real users in ways that standard benchmarks miss."

"The AI industry is trying to build both a vending machine and a cognitive prosthetic with the same tool. These are different use cases requiring different trade-offs."

Thumbnail
"An interesting distinction to consider when approaching organization building is the mix of build vs run involved in the different parts of it. Typically, software engineers are 80% build if not more. They build systems that attract usage and scale with minimal additional human work required as an input (the 20% run part: monitoring, incident handling, bug fixing which all scale sub-linearly wrt usage). Conversely, account executives are 90% run. They fight for clients, jumping through meetings, mapping accounts one after another. The industry even standardized around offloading the build part of the job off of their plate to revops or GTM engineering teams. The closer jobs are to the revenue, the more run-heavy they generally are."

Hmm. I had never thought of jobs in terms of "build vs run". This is a new concept to me. Let's continue.

"Build: the job to be done is a system. Some systems are made of code (engineering), others are made of documents or spreadsheets (rev-ops, HR). Their nature doesn't matter. What matters is that the system then operates. Good examples are software, compensation frameworks, brand positioning, ad campaigns, operating principles. The main compensation component for the build part of a job is equity."

"Run: the job to be done is a repeatable measurable outcome. Some outcomes are external (closing deals, answering a user ticket) and others are internal (re-stocking the kitchen fridge, filing an expense). The main compensation component for the run part of a job is cash."

"Run part of jobs is zero-sum game. The more you do, the more value you capture out of the market. The market is vast but each meaningful pocket of value it contains is under competitive pressure. The more you run the more you cover and the more value you extract."

"Build part of jobs is not zero-sum game."

"Post AI, build jobs will continue to not be zero-sum game. It is not about how much you build but rather about what and how well you build it. Even more importantly the build/run ratio of build jobs will shift even more towards building. Even in engineering, as companies grew, we used to have a growing pocket of mundane tasks that still required engineers. Organization had no choice but to scale their teams, generally accepting a fair amount of mediocrity doing so, to cover them. These mundane engineering tasks are a great example of the AI build/run ratio amplification. Machines can now handle most of them, reducing run impact on engineers and shifting even more their focus on building. But AI also increases their blast radius (imagine how much harm a few bad engineers working together can do to a codebase when equipped with dozens of agents), reinforcing the absolute criticality for insanely high talent density."

Thumbnail
"What it does: Runs multiple Claude Code sessions in parallel, each working on a different part of your project simultaneously."

"How it works: Each worker gets its own isolated git worktree (separate directory, separate branch). Workers run as background processes. An orchestrator monitors them, runs QA reviews, and merges their PRs automatically."

Written in TypeScript and shell script.
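Here's a minimal Python sketch of the worktree-per-worker idea (the actual tool is TypeScript and shell; the task list and the agent invocation are placeholders):

    import subprocess
    from pathlib import Path

    # Each worker gets its own branch and directory, then runs in the background.
    TASKS = {"fix-auth": "Fix the login redirect bug",
             "add-tests": "Add unit tests for the billing module"}

    workers = []
    for branch, task in TASKS.items():
        workdir = Path("../worktrees") / branch
        # `git worktree add -b <branch> <path>` creates an isolated checkout.
        subprocess.run(["git", "worktree", "add", "-b", branch, str(workdir)],
                       check=True)
        # Launch an agent in the background (placeholder invocation).
        workers.append(subprocess.Popen(["claude", "-p", task], cwd=workdir))

    # A real orchestrator would poll the workers, run QA reviews, and merge PRs;
    # here we just wait for every worker to finish.
    for w in workers:
        w.wait()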

Who is ready to jump on this and use this?

Thumbnail
"I ran passages from Project Gutenberg through GPT-4o-mini 10 times over, each time telling it to 'make it read far better, adding superior prose, etc.'. This lead to classic literary passages being enslopped. I then reversed this pipeline, and trained a model to go from [slop] -> [original]. The resulting model is capable enough to fool Pangram (a fairly robust AI detector - I take this as a metric of how 'human-sounding' the output is), at very little overall quality cost."

"While quality decreases slightly, humanness jumps from 0 to 0.481. The unslopped version stays firmly above Mistral Large 3 and close to the original GPT-5.2 baseline."

Hmm. An AI model to 'unslop' other AI models. What a concept. Check out the example.