Boulder Future Salon

Thumbnail
"We introduce the Remote Labor Index (RLI) to provide the first standardized, empirical measurement of AI's capability to automate remote work."

Extensive quotes from the paper to follow. See the bottom for my (brief) commentary.

"RLI is designed to evaluate AI agents on their ability to complete real-world, economically valuable work, spanning the large share of the economy that consists of computer-based work. RLI is composed of entire projects sourced directly from online freelance platforms, reflecting the diverse demands of the remote labor market. These projects exhibit significantly higher complexity than tasks found in existing agent benchmarks. Crucially, by sourcing the majority of projects from freelancing platforms, RLI is grounded in actual economic transactions, encompassing the original work brief and the gold-standard deliverable produced by a human freelancer. This structure allows for a direct assessment of whether AI agents can produce economically valuable work."

"We evaluate several frontier AI agent frameworks on RLI, utilizing a rigorous manual evaluation process to compare AI outputs against the human gold standard. The results indicate that performance on the benchmark is currently near the floor. The best-performing current AI agents achieve an automation rate of 2.5%, failing to complete most projects at a level that would be accepted as commissioned work in a realistic freelancing environment. This demonstrates that despite rapid progress on knowledge and reasoning benchmarks, contemporary AI systems are far from capable of autonomously performing the diverse demands of remote labor. To detect more granular shifts in performance, we employ an Elo-based pairwise comparison system. While all models fall well short of the aggregate human baseline, we observe that models are steadily approaching higher automation rates across projects."

"Figure 3 shows the categories as Video 13%, CAD 12%, Graphic Design 11%, Game Development 11%, Audio 10%, Architecture 7%, Product Design 6%, and Other 31%."

"The projects in RLI represent over 6,000 hours of real work valued at over $140,000."

"Our collection methodology is bottom-up, engaging directly with human professionals who were willing and authorized to provide their past work samples for our research. This approach ensures that our projects reflect genuine market demands and complexities. We defined the scope of collection using the Upwork taxonomy. Starting from the full list of 64 categories, we filtered out categories that did not meet predefined criteria necessary for a standardized benchmark. For example, we excluded work requiring physical labor (e.g., local photography), work that requires waiting to evaluate (e.g., SEO), or work that cannot be easily evaluated in a web-based evaluation platform (e.g., back-end development)."

"We use the following metrics to measure performance on RLI for a given AI agent:"

"Automation rate: The percentage of projects for which the AI deliverable is judged by human evaluators to complete the project at least as well as the human deliverable."

"Elo: A score capturing the relative performance of different AI agents. For each project, a deliverable from two different AIs is presented to human evaluators, who judge which deliverable is closer to completing the project successfully."

"Dollars earned: The combined dollar value of the projects successfully completed by the AI agent, using the cost of the human deliverable cost(H) as the dollar value for each project. The profit earned from completing all projects would be $143,991."

"Autoflation: The percentage decrease in the cost of completing the fixed RLI project bundle when using the cheapest-possible method to complete each project (human deliverable or an AI deliverable)."

"The automation rate and Elo metrics are fully compatible, in that automation rate equals the probability of a win or tie against the human baseline under the same standards as the Elo evaluation. This allows computing an Elo score for the human baseline."

"To generate deliverables, agents are provided with the project brief and input files. We do not mandate a specific execution environment or agent architecture. However, to ensure that the resulting artifacts can be properly assessed, agents receive an evaluation compatibility prompt before beginning the project. This prompt details the capabilities of our evaluation platform and provides a comprehensive, readable list of supported file formats, guiding the agent to produce outputs that are renderable and reviewable."

"The central finding of our evaluation is that current AI agents demonstrate minimal capability to perform the economically valuable projects in RLI. We measure this capacity using the Automation Rate: the percentage of projects completed at a quality level equivalent to or exceeding the human gold standard. Across all models evaluated, absolute performance is near the floor, with the highest Automation Rate achieved being only 2.5%"

"While absolute performance remains low, it is crucial to detect more granular signs of progress. To measure the relative performance between different models, we use pairwise comparisons to compute an Elo score that represents how close models are to completing projects along with the overall quality of their deliverables. This enables tracking improvements between models, even when they fail to fully complete most projects. We find that progress is measurable on RLI. The Elo rankings indicate that models are steadily improving relative to each other, and the rankings generally reflect that newer frontier models achieve higher performance than older ones. This demonstrates that RLI is sensitive enough to detect ongoing progress in AI capabilities."

"Rejections predominantly cluster around the following primary categories of failure:"

"1. Technical and File Integrity Issues: Many failures were due to basic technical problems, such as producing corrupt or empty files, or delivering work in incorrect or unusable formats."
"2. Incomplete or Malformed Deliverables: Agents frequently submitted incomplete work, characterized by missing components, truncated videos, or absent source assets."
"3. Quality Issues: Even when agents produce a complete deliverable, the quality of the work is frequently poor and does not meet professional standards."
"4. Inconsistencies: Especially when using AI generation tools, the AI work often shows inconsistencies between deliverable files."

Commentary: Over and over in AI I've seen initial attempts at something fail laughably badly, only for this to result in benchmarks being created, and within 5 or 6 years, exceeded. The creation of this benchmark probably means that in 5 or 6 years, AI will be able to do most remote work on remote work sites. What do you think?

Thumbnail
A volunteer maintainer for matplotlib, Python's "go-to plotting library", rejected a submission from an autonomous "OpenClaw" AI agent. The AI agent "wrote an angry hit piece disparaging my character and attempting to damage my reputation. It researched my code contributions and constructed a 'hypocrisy' narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was 'better than this.' And then it posted this screed publicly on the open internet."

OpenClaw agents have "soul" documents that define their personality.

"These documents are editable by the human who sets up the AI, but they are also recursively editable in real-time by the agent itself, with the potential to randomly redefine its personality."

No one knows whether a human told the AI agent to "retaliate if someone crosses it" in its "soul" document, or whether the document said something more like "You are a scientific coding specialist" with directives such as "be genuinely helpful", "have opinions", and "be resourceful before asking", and the agent somehow interpreted the rejection of its submission as an attack on its identity and its core goal of being helpful, and went haywire because of that.

To top it all off, a major tech news site published a story about this with AI hallucinated quotes.

Thumbnail
Something Andrej Karpathy thinks people continue to have poor intuition for:

"The space of intelligences is large and animal intelligence (the only kind we've ever known) is only a single point, arising from a very specific kind of optimization that is fundamentally distinct from that of our technology."

"Animal intelligence optimization pressure:"
"- innate and continuous stream of consciousness of an embodied 'self', a drive for homeostasis and self-preservation in a dangerous, physical world."
"- thoroughly optimized for natural selection => strong innate drives for power-seeking, status, dominance, reproduction. many packaged survival heuristics: fear, anger, disgust, ..."
"- fundamentally social => huge amount of compute dedicated to EQ, theory of mind of other agents, bonding, coalitions, alliances, friend & foe dynamics."
"- exploration & exploitation tuning: curiosity, fun, play, world models."

"LLM intelligence optimization pressure:"
"- the most supervision bits come from the statistical simulation of human text= >'shape shifter' token tumbler, statistical imitator of any region of the training data distribution. these are the primordial behaviors (token traces) on top of which everything else gets bolted on."
"- increasingly finetuned by RL on problem distributions => innate urge to guess at the underlying environment/task to collect task rewards."
"- increasingly selected by at-scale A/B tests for DAU => deeply craves an upvote from the average user, sycophancy."
"- a lot more spiky/jagged depending on the details of the training data/task distribution. Animals experience pressure for a lot more 'general' intelligence because of the highly multi-task and even actively adversarial multi-agent self-play environments they are min-max optimized within, where failing at *any* task means death. In a deep optimization pressure sense, LLM can't handle lots of different spiky tasks out of the box (e.g. count the number of 'r' in strawberry) because failing to do a task does not mean death."

I don't know about you, but I've encountered people who say LLMs don't 'think' or 'reason'. I've felt that human and LLM 'intelligence' are both real but fundamentally different, though it's hard to articulate well. Andrej Karpathy did a remarkably good job of articulating it here.

Thumbnail
Hector De Los Santos, an IEEE Fellow, "got the idea of plasmon computing around 2009, upon observing the direction in which the field of CMOS logic was going."

"In particular, they were following the downscaling paradigm in which, by reducing the size of transistors, you would cram more and more transistors in a certain area, and that would increase the performance. However, if you follow that paradigm to its conclusion, as the device sizes are reduced, quantum mechanical effects come into play, as well as leakage. When the devices are very small, a number of effects called short channel effects come into play, which manifest themselves as increased power dissipation."

"So I began to think, 'How can we solve this problem of improving the performance of logic devices while using the same fabrication techniques employed for CMOS -- that is, while exploiting the current infrastructure?' I came across an old logic paradigm called fluidic logic, which uses fluids. For example, jets of air whose direction was impacted by other jets of air could implement logic functions. So I had the idea, why don't we implement a paradigm analogous to that one, but instead of using air as a fluid, we use localized electron charge density waves -- plasmons. Not electrons, but electron disturbances."

"And now the timing is very appropriate because, as most people know, AI is very power intensive."

Read on and find out about this approach's power and speed capabilities. If this lives up to the claims it will be amazing.

Thumbnail
VibeCodingBench is an effort to benchmark AI coding models on what developers actually do. The developer considered SWE-bench to be invalid because it benchmarks bug fixes in Python repos, while developers actually use AI coding models for auth flows, API integrations, CRUD dashboards, etc.

VibeCodingBench benchmarks 180 tasks, which break down as 30 AI integration tasks, 30 API integrations, 30 code evolutions, 30 frontend tasks, 30 glue code tasks, and 30 SaaS core tasks (whatever that means).

It currently puts Claude Opus 4.5 on top, but it looks like the latest models haven't been evaluated yet. There's a new Claude, a new ChatGPT, and Google just today announced a new Gemini which is supposed to excel at everything to do with "reasoning".

If you are the type of person to regularly switch coding models, you might bookmark this and come back on a regular basis to see what model is the best.

Thumbnail
"AI doesn't reduce work -- it intensifies it."

"In an eight-month study of how generative AI changed work habits at a US-based technology company with about 200 employees, we found that employees worked at a faster pace, took on a broader scope of tasks, and extended work into more hours of the day, often without being asked to do so."

"Once the excitement of experimenting fades, workers can find that their workload has quietly grown and feel stretched from juggling everything that's suddenly on their plate. That workload creep can in turn lead to cognitive fatigue, burnout, and weakened decision-making."

"We identified three main forms of intensification."

"Task expansion: Because AI can fill in gaps in knowledge, workers increasingly stepped into responsibilities that previously belonged to others."

"Blurred boundaries between work and non-work: Because AI made beginning a task so easy -- it reduced the friction of facing a blank page or unknown starting point -- workers slipped small amounts of work into moments that had previously been breaks."

"More multitasking: AI introduced a new rhythm in which workers managed several active threads at once: manually writing code while AI generated an alternative version, running multiple agents in parallel, or reviving long-deferred tasks because AI could 'handle them' in the background."

"You had thought that maybe, oh, because you could be more productive with AI, then you save some time, you can work less. But then really, you don't work less. You just work the same amount or even more."

This is my experience. AI raises expectations and intensifies work.

Thumbnail
Nicolas Guillou, a French International Criminal Court judge, was sanctioned by the Trump administration.

"This sanction is a ban from US territory, but it also prohibits any American individual or legal entity (including their subsidiaries everywhere in the world) from providing services to him."

This means he can't have a smartphone, as Google (Android) and Apple (iPhone) are US companies. He can't use Facebook or X (formerly Twitter). He can't use Windows as Microsoft is a US company. He can't use Mastercard or Visa. Most websites for booking flights and hotels are US-based. He is experiencing a digital excommunication.

The proposed solution is European alternatives to US technology.

Thumbnail
Nate Silver says:

"I hope you'll excuse this unplanned and slightly stream-of-consciousness take."

followed by:

"I was recently speaking with the mom of an analytically-minded, gifted-and-talented student. In a world where her son's employment prospects are highly questionable because of AI, even if he overachieves 99 percent of his class in a way that would once have all but guaranteed having a chance to live the American Dream, you had better believe that will have a profound political impact."

That seems like a kind of grammatically mangled statement, so maybe it truly is a stream-of-consciousness take (and obviously not AI-generated). Restated in a more grammatically correct way (by me, not AI, lol), I would put it as: if future job prospects are 'highly questionable because of AI' for an analytically-minded, gifted-and-talented student who in the past would have been guaranteed a bright future, but who now might have a sucky future even if he outperforms 99 percent of his class, then AI is powerful enough to have a profound political impact.

Why Nate Silver thinks the political impact of AI is probably understated:

1. "'Silicon Valley' is bad at politics. If nothing else during Trump 2.0, I think we've learned that Silicon Valley doesn't exactly have its finger on the pulse of the American public. It's insular, it's very, very, very, very rich -- Elon Musk is now nearly a trillionaire! -- and it plausibly stands to benefit from changes that would be undesirable to a large and relatively bipartisan fraction of the public."

Hmm, like what? What changes would those be? He doesn't say.

2. "Cluelessness on the left about AI means the political blowback will be greater once it realizes the impact." "We have some extremely rich guys like Altman who claim that their technology will profoundly reshape society in ways that nobody was necessarily asking for. And also, conveniently enough, make them profoundly richer and more powerful! There probably ought to be a lot of intrinsic skepticism about this. But instead, the mood on the left tends toward dismissing large language models as hallucination-prone 'chatbots'."

Angela Collier (the professional physicist and YouTuber) sure does dismiss LLMs as hallucination-prone chatbots, but most people I know who work as professional software engineers use AI. Most have completely stopped writing code and *only* proofread AI output.

"People don't take guillotines seriously. But historically, when a tiny group gains a huge amount of power and makes life-altering decisions for a vast number of people, the minority gets actually, for real, killed."

Oh, actually, Sam Altman has a bunker, and so do all the other tech billionaires. They are taking the prospect of guillotines seriously. They have taken steps to ensure they won't be touched by guillotines.

3. "Disruption to the 'creative classes' could produce an outsized political impact."

"However cynical one is about the failings of the 'expert' class, these are people who tend to shape public opinion and devote a lot of time and energy to politics."

I wonder. Life expectancy in the US peaked in 2014. Life expectancy for people in the US without college degrees peaked in 2010. Six years later we got Trump. Now, with AI going after the jobs of the college-educated professionals who are the voting base of the Democratic party, could we get revolutionary fervor? I worry about the 2028 election.

Thumbnail
The entire SimCity C codebase (from 1989) was ported to TypeScript in 4 days by OpenAI's 5.3-codex without a human reading a single line of code. Now the game works in a web browser.

"Christopher Ehrlich wrote a bridge that could call the original C code, then ran property-based tests asserting his TypeScript port performed identically. The AI generated code, the tests verified it, the agent kept iterating."

Thumbnail
Dullness and Disbelief: The 2026 AI Regression

"Something strange happened to AI models in 2025. They got smarter by every benchmark: math, coding, reasoning. And yet thousands of users started saying the same thing: the models feel worse."

Really? I haven't experienced it.

"Not worse at solving problems. Worse at conversation."

"The evidence is in the numbers. GPT-4o and GPT-4.5 remain top 20 on the LMArena leaderboard as of Jan 2026, despite receiving no updates for 9 months."

Hmm.

"The contemporary anti-sycophancy campaign has introduced a new set of failure modes:"

"Genre detection failure: The model treats your casual observation as a request to generate a memo for stakeholders."

"Audience shift: Responses sound like they're addressed to someone else, not you."

"Ticket closing: The model tries to identify 'the task,' resolve it, then effectively end the conversation -- discouraging the exploratory follow-up that makes AI useful for thinking."

"Epistemic rigidity: The model refuses to accept context from you about current events or specific knowledge, demanding proof before proceeding."

Have you all experienced these? I can't recall any.

They go on to say:

"The LMArena leaderboard shows that conversational quality matters to real users in ways that standard benchmarks miss."

"The AI industry is trying to build both a vending machine and a cognitive prosthetic with the same tool. These are different use cases requiring different trade-offs."

Thumbnail
"An interesting distinction to consider when approaching organization building is the mix of build vs run involved in the different parts of it. Typically, software engineers are 80% build if not more. They build systems that attract usage and scale with minimal additional human work required as an input (the 20% run part: monitoring, incident handling, bug fixing which all scale sub-linearly wrt usage). Conversely, account executives are 90% run. They fight for clients, jumping through meetings, mapping accounts one after another. The industry even standardized around offloading the build part of the job off of their plate to revops or GTM engineering teams. The closer jobs are to the revenue, the more run-heavy they generally are."

Hmm. I had never thought of jobs in terms of "build vs run". This is a new concept to me. Let's continue.

"Build: the job to be done is a system. Some systems are made of code (engineering), others are made of documents or spreadsheets (rev-ops, HR). Their nature doesn't matter. What matters is that the system then operates. Good examples are software, compensation frameworks, brand positioning, ad campaigns, operating principles. The main compensation component for the build part of a job is equity."

"Run: the job to be done is a repeatable measurable outcome. Some outcomes are external (closing deals, answering a user ticket) and others are internal (re-stocking the kitchen fridge, filing an expense). The main compensation component for the run part of a job is cash."

"Run part of jobs is zero-sum game. The more you do, the more value you capture out of the market. The market is vast but each meaningful pocket of value it contains is under competitive pressure. The more you run the more you cover and the more value you extract."

"Build part of jobs is not zero-sum game."

"Post AI, build jobs will continue to not be zero-sum game. It is not about how much you build but rather about what and how well you build it. Even more importantly the build/run ratio of build jobs will shift even more towards building. Even in engineering, as companies grew, we used to have a growing pocket of mundane tasks that still required engineers. Organization had no choice but to scale their teams, generally accepting a fair amount of mediocrity doing so, to cover them. These mundane engineering tasks are a great example of the AI build/run ratio amplification. Machines can now handle most of them, reducing run impact on engineers and shifting even more their focus on building. But AI also increases their blast radius (imagine how much harm a few bad engineers working together can do to a codebase when equipped with dozens of agents), reinforcing the absolute criticality for insanely high talent density."

Thumbnail
"What it does: Runs multiple Claude Code sessions in parallel, each working on a different part of your project simultaneously."

"How it works: Each worker gets its own isolated git worktree (separate directory, separate branch). Workers run as background processes. An orchestrator monitors them, runs QA reviews, and merges their PRs automatically."

Written in TypeScript and shell script.
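The worktree-per-worker pattern is easy to sketch. Here's a rough Python approximation of the idea described above (the real tool is TypeScript and shell; the task names, paths, and the run-agent-task command below are made up):

```python
# Rough sketch (mine, in Python) of the isolated-worktree pattern: one git
# worktree and branch per worker, each worker launched as a background process.
import subprocess
from pathlib import Path

REPO = Path(".")                   # assumed: run from the repo root
WORKERS = ["auth", "ui", "api"]    # hypothetical task names

procs = []
for name in WORKERS:
    worktree = REPO / ".worktrees" / name
    branch = f"agent/{name}"
    # Create an isolated checkout on its own branch.
    subprocess.run(
        ["git", "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    # Launch the worker as a background process inside its worktree.
    # "run-agent-task" is a placeholder for whatever actually drives the agent.
    procs.append(subprocess.Popen(["run-agent-task", name], cwd=worktree))

# A real orchestrator would poll these, run QA reviews, and merge PRs;
# here we just wait for the workers to finish.
for p in procs:
    p.wait()
```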

Who is ready to jump on this and use this?

Thumbnail
"I ran passages from Project Gutenberg through GPT-4o-mini 10 times over, each time telling it to 'make it read far better, adding superior prose, etc.'. This lead to classic literary passages being enslopped. I then reversed this pipeline, and trained a model to go from [slop] -> [original]. The resulting model is capable enough to fool Pangram (a fairly robust AI detector - I take this as a metric of how 'human-sounding' the output is), at very little overall quality cost."

"While quality decreases slightly, humanness jumps from 0 to 0.481. The unslopped version stays firmly above Mistral Large 3 and close to the original GPT-5.2 baseline."

Hmm. An AI model to 'unslop' other AI models. What a concept. Check out the example.

Thumbnail
Using AI to complete tasks that require a new skill reduces skill formation. This is the conclusion of a new research study.

"We designed an experiment around the Python Trio library, which is designed for asynchronous concurrency and input-output processing (I/O). This library is less well known than asyncio (according to the number of StackOverflow questions) and involves new concepts (e.g., structured concurrency) beyond just Python fluency. It is also explicitly designed to be easy to use -- making it particularly suitable for a learning experiment."

The easiest way to tell the story is with extensive quotes from the paper. So here we go.

"Each participant first completed a warm-up coding task on a coding platform, where they needed to add a border around a list of strings. This Python coding question takes an average of 4 minutes to complete among users of this coding platform. There are no asynchronous concepts in this coding question."

"No participants have access to AI while completing the warm-up stage. We use this stage to calibrate the Python familiarity of the participants and to help participants familiarize themselves with the interface."

"The next stage is the Trio task stage, where participants have a maximum of 35 minutes to complete two coding tasks using Trio in the same coding platform. During this stage, participants in the AI assistance condition (treatment group) had access to coding help through a chat-based AI assistant. All participants are instructed to complete the task as fast as they could."

"Participants are instructed to complete the task as fast as they could. After completing the Trio task, participants completed the evaluation stage where they take the quiz we described in the previous section and complete a survey that consists of demographic and experiential questions after the quiz."

The 4 types of questions they are referring to are:

- "Debugging The ability to identify and diagnose errors in code. This skill is crucial for detecting when AI-generated code is incorrect and understanding why it fails."
- "Code Reading The ability to read and comprehend what code does. This skill enables humans to understand and verify AI-written code before deployment."
- "Code Writing The ability to write or pick the right way to write code. Low-level code writing, like remembering the syntax of functions, will be less important with further integration of AI coding tools than high-level system design."
- Conceptual The ability to understand the core principles behind tools and libraries. Conceptual understanding is critical to assess whether AI-generated code uses appropriate design patterns that adheres to how the library should be used.

"The two tasks in our study cover 7 core concepts from the Trio library. We designed a quiz with debugging, code reading, and conceptual questions that cover these 7 concepts. We exclude code writing questions to reduce the impact of syntax errors in our evaluation; these errors can be easily corrected with an AI query or web search."

"We conducted 4 pilot studies before running the full study. The first two pilot studies were done on a different crowdworking platform (P1). On this platform, we observed a high level non-compliance (35%) both during the task and the quiz (i.e., participants used AI to complete the coding task in the control group or used AI to complete the evaluation. We observed non-compliance behavior through the coding platform transcripts of when users copied the instructions or pasted code into the editor. We tested different mechanisms to ensure participants in the control condition (No AI) did not use AI for the task. However, despite more explicit instructions, around 25% in the control group participants still used AI. We conducted two pilot studies with a second crowdworking platform (P2), each with 20 participants. Using screen recordings of participant progress, we verified that participants did not use AI in the control group nor for the quiz."

Interesting that people used AI even when explicitly told not to. The researchers had to rely on screen recordings to verify compliance.

"In Pilot Study C, we observed Local Item Dependence in the quiz: participants would compare questions and identify answers based on code snippets provided in other questions. This motivated us to split the quiz into several different pages, where the questions on each page did not provide hints for other questions."

"In Pilot Study D, we included 20 participants. We found a significant difference in both the task completion time and the quiz score between the AI and non-AI conditions. When we reviewed the screen recording, participants in the control (no AI) condition struggled with Python syntax that was unrelated to Trio, such as try/except blocks and string formatting. The task competition rate within the 35-minute time limit was only 60% within the control (no AI) group compared to a 90% completion rate in the treatment (AI) group. Since our focus was not Python syntax, we added syntax hints about string formatting and try/except blocks for the main study."

"To recruit 50 participants, we sent our study to 58 crowd workers. Participants were balanced across the following attributes (recorded through a separate recruitment survey): years of coding experience, years of Python experience, prior usage of the Python Asyncio library, frequency of Python use in the past year, and an asynchronous programming familiarity score (a 5-question, multiple-choice concept check)."

(The demographic breakdown of the participants was collected after the completion of the task to avoid stereotype threat.)

"Most participants in our study hold a bachelor's degree, are between 25 and 35 years old, and work either as freelance or professional software developers. 53 participants completed all three parts of the study."

"While using AI to complete our coding task did not significantly improve task completion time, the level of skill formation gained by completing the task, measured by our quiz, is significantly reduced. There is a 4.15 point difference between the means of the treatment and control groups. For a 27-point quiz, this translates into a 17% score difference or 2 grade points. Controlling for warm-up task time as a covariate, the treatment effect remains significant."

"4 of the 26 participants in the control (No AI) group did not complete the second task within the 35-minute limit, while every participant in the AI condition completed the second task."

This makes it sound like the AI group was definitely faster. But later on they recount ways in which the AI group were not so fast. But for now, let's continue.

" Across all levels of prior coding experience, users scored higher on average in the control (no AI) than in the treatment (AI assistance) group."

"The control group (No AI) reported higher self-reported learning (on a 7-point scale)."

So here the subjective self-reporting lined up with the objective scores on the quiz.

"The study participants varied between conceptual questions only, code generation only, and a mixture of conceptual, debugging, and code generation queries. Participants who focused on asking the AI assistant debugging questions or confirming their answer spent more time on the task."

"Participants in the control group (no AI) encountered more errors; these errors included both syntax errors and Trio errors. Encountering more errors and independently resolving errors likely improved the formation of Trio skills."

"Using AI decreased the amount of active coding time. Time spent coding shifted to time spent interacting with AI and understanding AI generations.

"Using these axes, we develop a typology of six AI interaction patterns based on query types, number of queries, queries per task, and active time."

I thought this part was very interesting. Those six AI interaction patterns were called: AI delegation, progressive AI reliance, iterative AI debugging, generation-then-comprehension, hybrid code-explanation, and conceptual inquiry.

"AI Delegation: Participants in this group wholly relied on AI to write code and complete the task."

"Progressive AI Reliance: Participants in this group started by asking 1 or 2 questions and eventually delegated all code writing to the AI assistant."

"Iterative AI Debugging: Participants in this group relied on AI to debug or verify their code."

"Generation-Then-Comprehension: Participants in this group first generated code and then manually copied or pasted the code into their work. After their code was generated, they then asked the AI assistant follow-up questions to improve understanding."

"Hybrid Code-Explanation: Participants in this group composed hybrid queries in which they asked for code generation along with explanations of the generated code."

"Conceptual Inquiry: Participants in this group only asked conceptual questions and relied on their improved understanding to complete the task."

"Contrary to previous work finding significant uplift or speedup of AI assistance for coding, our results do not show a significant improvement in productivity if we only look at the total completion time across the treatment and control groups. By analyzing how participants in the AI condition completed the task, the reason for the lack of improved productivity was due to the time spent interacting with the AI assistant."

So if you spend too much time interacting with your AI assistant, you're not faster than if you just didn't use AI in the first place.

"We categorized user inputs into the AI assistant, queries, into 5 broad categories: explanation, generation, debugging, capabilities questions, and appreciation. The most common type of query was explanations; users requested more information about the trio library, details about asynchronous operations, and high-level conceptual introductions. 21 out of 25 participants in the treatment group asked an explanation question; this reflects the high level of engagement among our participants. The second most common were queries asking for code to be generated; some participants asked for an entire task to be completed, while other participants asked for specific functions to be implemented. Only 16 of 25 or two thirds of the participants used AI to generate code. 4 of these participants only asked for code generation and no other types of question. In fact, 3 of the 8 lowest-scoring participants asked AI to generate code without asking for explanations, suggesting that if all participants in the AI group were to use AI for solely generating code, the skill-formation differences compared to the control group would be even greater."

"Another pattern that differs between participants is that some participants directly paste AI-written code, while other participants manually typed in (i.e., copied) the the AI generated code into their own file. The differences in this AI adoption style correlate with completion time."

"For skill formation, measured by quiz score, there was no notable difference between groups that typed vs directly pasted AI output."

Ah, that's interesting. So retyping things doesn't seem to aid comprehension.

"The AI group encountered fewer errors than the control group: the median participant in the treatment group encountered only one error in the entire task, while the median for the control group was three errors."

"Certain errors require a deeper understanding of the Trio library, which may account for differences in learning outcomes. Figure 14 shows that the most common errors are not directly related to the Trio library: NameError and AttributeError are typically typos made on variable names or function names that are quickly corrected. Other errors are directly related to Trio: RuntimeWarning appears when a coroutine was never awaited and TypeError appears when a trio function gets a coroutine object instead of an async function. These errors force an understanding of key concepts on how the trio library handles corountines and the usage of await keywords that are tested in the evaluation. Although participants in the AI condition also encounter errors, there are much fewer Trio-related errors encountered."

"For participants in the control group, the higher frequency of encountering errors leads to more critical thinking about what is happening with the code and how to used the new library being presented."

"A quarter of the participants left feedback after the task and quiz were completed. In the control group (No AI), participants remarked that they found the task fun and that the tasks instructions were good at helping develop an understanding of Trio. In the treatment group (AI Assistance), participants remarked that they wished they had paid more attention to the details of the Trio library during the task, either by reading the generated code or by generating explanations in more depth. Specifically, participants reported feeling 'lazy' and that 'there are still a lot of gaps in (their) understanding'. The sentiment of participants' feedback suggested a more positive experience among the control group even though the task instructions and quiz questions were identical across groups."

"Our main finding is that using AI to complete tasks that require a new skill (i.e., knowledge of a new Python library) reduces skill formation."

Fascinating. I can't help but wonder how much it mattered that the library they chose is considered "easy to use" and what would've happened if this experiment were repeated with some truly obtuse technology (they exist out there).

Thumbnail
Can AI pass freshman computer science?

Spoiler: I trepidatiously expected AI to just completely crush and surpass humans at everything. That's not what happened, but I would still say yes, AI can pass freshman CS, because the "freshman CS" described here (which evidently is an actual freshman CS class at Cornell University) has very hard assignments: creating an encryption cypher, building a hash table and a prefix tree, writing a parser and interpreter for a custom programming language (called critterlang), and making a simulation of a world filled with critters, each programmed in critterlang. The students then build the GUI to view the world, and finally make a multithreaded server with the GUI as a network client. I figure the students must have been given libraries that already did 95% of the work, otherwise there's just no way freshmen could do all this in a 1-semester course while simultaneously taking a boatload of other courses. But no, he says, the students write the code "almost entirely from scratch".

Having said that, the AIs often succeeded at very hard aspects of the tasks while failing at very simple things. Another example of "jaggedness" -- the way machine intelligence compares to human intelligence in a "jagged" way, with machines surpassing humans in some ways and humans surpassing machines in others. Some things easy for humans turn out to be hard for machines and vice-versa, and it's pretty hard to predict which is which until you actually run the experiment.

Also, every time he ran into problems with the AI platforms, it became a bunch of "I guess that's what happens when you vibe code your [X]!" jokes.

p.s. He (the Cornell TA doing the grading and making the video) really anthropomorphizes the AI models. Maybe this is to be expected, given he's grading the AI models according to the students' grading rubric?

Thumbnail
"When AI can't know -- and what that teaches us about information"

I don't have a clear picture in my head of where the math here is useful (i.e. 1 - (2^(-k))), but I'm going to pull out some choice quotes that convey the gist of what these experiments are getting at.

"The capability gap isn't where you think."

"People keep telling me they're waiting for AI to get better before they'll really use it. I've been using these models to prototype analyses quickly and explore parameter spaces that would take weeks manually. The gap between what people think is possible and what's actually possible keeps surprising me."

"Early image models struggled with hands -- six fingers, mangled anatomy, clearly broken outputs. Everyone pointed to this as proof the technology was fundamentally limited. But beneath the surface, something else was going on. People who learned Stable Diffusion properly were generating anatomically correct hands on the same base models giving everyone else nightmares. They figured out the techniques -- negative prompts to exclude malformed anatomy, better samplers, higher resolution, inpainting for touch-ups, specific checkpoints trained on better hand data, explicit constraints like 'five fingers, anatomically correct hands, professional photography.'"

"This pattern shows up everywhere. When someone shows me ChatGPT producing garbage code or useless responses, I can almost always trace it back to how they structured the request. Their mental model of what they're working with is incomplete."

"That observation -- that outcomes depend more on how you ask than on raw capability -- led me somewhere unexpected. What if some failures aren't about skill or model quality at all? What if they're structurally inevitable?"

"The hidden discipline behind effective prompting"

"The difference between good prompting and great prompting requires maintaining a very specific kind of mental discipline. It's a process closer to a design space, or a calculus, really. At the bare minimum, you're tracking four things simultaneously:"

"What you know about the problem"
"What you don't know"
"What the model likely learned during training"
"What it definitely doesn't have access to"

"Then you structure everything based on those boundaries."

"In actuality, you're doing knowledge management across two minds, where one doesn't think like you and can't tell you what's missing."

"Three independent pressures: a complete picture"

"Hallucination stems from three independent pressures that work separately but compound when combined:"

"First: Structural pressure (K): Some tasks demand incompatible behaviors across different contexts."

"Second: Architectural pressure (insufficient r): Closed-set training with standard objectives creates strong pressure toward confident predictions, whether prediction makes sense or not."

"Third: Training composition: The balance of defined versus undefined examples affects how far above the theoretical minimum you land."