Boulder Future Salon

ByteDance, the Chinese company behind TikTok, has joined the integrated development environment (IDE) fray with Trae. I've started learning Cursor AI, and, lo & behold, Trae looks almost exactly like Cursor.

A "hallucination leaderboard" has been created by a company called Vectara.

LLMs that it says have low hallucination rates include THUDM/glm-4-9b-chat, gemini-2.0-flash-exp, openai/o1-mini, openai/GPT-4o, openai/GPT-4-Turbo, openai/GPT-4o-mini, and openai/GPT-4.

LLMs that it says have high hallucination rates include tiiuae/falcon-7b-instruct, google/gemma-1.1-2b-it, Qwen/Qwen2.5-0.5B-Instruct, apple/OpenELM-3B-Instruct, meta-llama/Llama-3.2-1B-Instruct, mistralai/Mixtral-8x7B-Instruct-v0.1, google/flan-t5-large, and anthropic/Claude-2.

The method of determining the hallucination rate is something called the Hughes Hallucination Evaluation Model (HHEM).

Never mind that "hallucination" is more accurately referred to as "confabulation".

lightcell energy (no capitalization) claims to have invented something they call the "lightcell" (no capitalization), which burns hydrogen combined with a "sodium illuminant" in such a way that solar cells can collect the light and convert it to electricity. The key is the "sodium illuminant" emits "near monochromatic light", a particular wavelength of yellow light. The "photovoltaic cells" are "bandgap-tuned" to the same wavelength.

"Sodium's weakly bound, lone outer electron "rings like a bell" at 2.1 eV, it takes only nanoseconds for the energy absorbed to be reemitted as 589 nm, 2.1 eV photons when sodium relaxes to its ground state."

"2.1 eV photons can be efficiently absorbed by a photovoltaic cell with a bandgap tuned to 2.1 eV."

They say it has an "optical cavity" with "infrared recycling" as well as a "ceramic recuperator" for "heat recycling" to increase the efficiency.

They claim it can also use natural gas, gasoline, ammonia, butane, propane, alcohols, and syngas, but I wonder how the efficiency would compare with just burning those fuels the old-fashioned way in an internal combustion engine. I assume they would all still need the "sodium illuminant", and that could get out into the environment.

They seek, "ideally, synthetic, net zero carbon emissions fuels."

"This effort harnesses advanced new materials, and uses physics at photon densities rarely explored."

Microsoft created a whole new division, "CoreAI -- Platform and Tools".

"This new division will bring together Dev Div, AI Platform, and some key teams from the Office of the CTO (AI Supercomputer, AI Agentic Runtimes, and Engineering Thrive), with the mission to build the end-to-end Copilot & AI stack for both our first-party and third-party customers to build and run AI apps and agents. This group will also build out GitHub Copilot, thus having a tight feedback loop between the leading AI-first product and the AI platform to motivate the stack and its roadmap."

"Dev Div" refers to the Developer Division, the division of Microsoft that produces Visual Studio Code and other developer tools (Windows Terminal, Windows Subsystem for Linux, .NET, the Microsoft Visual C++ standard template library (STL), PowerShell, TypeScript, etc).

"Jay Parikh will lead this group as EVP of CoreAI -- Platform and Tools, with Eric Boyd, Jason Taylor, Julia Liuson, Tim Bozarth, and their respective teams reporting to Jay."

EVP stands for executive vice president. Jay Parikh was recruited by Microsoft from Lacework, a software security company where he held an executive position; before that he was at Meta (the company formerly known as Facebook), where he managed cloud AI systems. Eric Boyd leads the global AI Platform team within Microsoft's Cloud AI division. Jason Taylor is a former Meta executive who managed data centers and server chip development at Meta and currently leads Microsoft's AI supercomputing team. Julia Liuson is the president of the aforementioned Developer Division (Dev Div). Tim Bozarth is chief technology officer (CTO) of Microsoft's Systems division, and was previously Core Engineering Director at Google and, before that, Engineering Director at Netflix.

Will be interesting to see if other companies follow suit and reorg.

"Perplexity AI officially made a play for TikTok on Saturday, submitting a bid to its parent company, ByteDance, to create a new merged entity combining Perplexity, TikTok US and new capital partners."

Well, that was a surprise. But maybe it makes sense. ByteDance is an AI company, with arguably the world's most advanced AI recommendation system, and it works with video, and Perplexity is an AI company and wants to get more into video.

"What indicators should we watch to disambiguate AGI timelines?" asks Steve Newman.

"AI is approaching elite skill at programming, possibly barreling into superhuman status at advanced mathematics, and only picking up speed. Or so the framing goes. And yet, most of the reasons for skepticism are still present. We still evaluate AI only on neatly encapsulated, objective tasks, because those are the easiest to evaluate."

"Perhaps most jarringly, LLMs still haven't really done anything of major impact in the real world."

"I recently attempted to enumerate the fundamental questions that lie underneath most disagreements about AI policy, and number one on the list was how soon AGI will arrive. Radical uncertainty about the timeline makes it extremely difficult to know what to do about almost any important question. (I'm defining AGI as AI that can cost-effectively replace humans at more than 95% of economic activity, including any new jobs that are created in the future.)"

Commentary on that parenthetical comment: From the very beginning, when I joined the future salon in California in 2001, I said that artificial intelligence would automate the job market. Practically everyone argued with me about this. Some people said it was impossible, that humans possessed some magical quality (call it "consciousness" or "creativity" or somesuch) that AI would never be able to replicate. Others said other people's jobs would get automated, but not theirs -- they were simply too smart for their job to ever be automated. I will admit I was wrong about *how* it would play out -- for example I thought "routine" jobs like stocking shelves at Walmart would be automated first, and "creative" jobs like making art and music would be last -- but I don't think I was wrong about the ultimate endpoint of the trajectory. It's interesting now 20+ years later to see the rest of the world gradually coming to the realization that, oh, this AI thing, it really is about automating jobs, and it really is on a trajectory towards automating *all* the jobs. As long as progress continues, that's what's going to happen. I don't know how long it will take, and things might happen "out of order" from what is expected, but the end result should still be full automation of all jobs, because that's what evolutionary competitive pressures result in. Gradually, bit by bit, people are starting to realize this.

Let's continue...

"The Slow Scenario:"

"In this scenario, the recent flurry of articles suggesting that AI has 'hit a wall' are correct, insofar as the simple scaling of training data and model size -- which drove progress from 2018 to 2023 -- sputters out."

"Progress on 'reasoning models' like o1, o3, and DeepSeek-R1 continues, turning out ever-more-impressive results on benchmarks such as FrontierMath and RE-Bench (which measures the ability of AIs to perform AI R&D)."

"This turns out to have less impact than anticipated. The models are useful for mathematicians, scientists, and engineers (including software engineers), especially as people become adept at identifying encapsulated problems that they can extract from the messy complexity of their work and hand to an AI."

"Eventually, 2035 rolls around -- 10 years from now, which is as far as I'm going to project -- and AI has not had any Earth-shaking impact, for good or ill. The economy has experienced significant change, AI is embedded in our everyday lives to at least the same extent as the smartphone, some major companies and job markets have been disrupted, we have capabilities that seemed almost unimaginable in 2020 and may still seem so today -- but the overall order of things is not drastically altered."

"The Fast Scenario:"

"In recent years, AI progress has been a function of training data, computing capacity, and talent ('algorithmic improvements'). Traditional training data -- textbooks, high-quality web pages, and so forth -- is becoming harder to find, but not impossible; video data, commissioned human work, and other sources can still be found."

"More importantly, synthetic data -- generated by machines, rather than people -- turns out to work well for training ever-more-capable models"

"It has taken us roughly two years to go from GPT-4 to o3, and in that time we've arguably seen just one major breakthrough: RL training on synthetically generated chains of thought. I've argued that several further major breakthroughs are needed, at a minimum, to reach AGI. So it should take at least twice as long as the time from GPT-4 to o3."

"Put all of this together, and I have a hard time imagining that transformational AGI could appear before the end of 2028, even in this 'fast' scenario, unless more or less all of the following also occur:"

"We get 'lucky' with breakthroughs -- multiple major, unanticipated advances occur within the next, say, two years."

"Threshold effects emerge, such that incremental advances in model training turn out to cause major advances in long-horizon planning, adversarial robustness, and other key areas."

"We sustain extremely rapid improvements in algorithmic efficiency, allowing a massive deployment of advanced AI despite the physical limits on how quickly chip production can be increased in a few short years."

How will I know which scenario we're in, the slow scenario or the fast scenario?

He says:

"If o3 is released to the public and consistently wows people (in a way that I believe o1 has not consistently done), if its capabilities on math and coding tasks seem consistent with its amazing scores on FrontierMath and Codeforces, and there's at least one more major step forward in reasoning models in 2025 (possibly leading to unambiguously superhuman scores on very difficult benchmarks like FrontierMath and Humanity's Last Exam), that supports a fast timeline."

If "AIs start showing more ability at tasks that can't be encapsulated in a tidy chatbox session," then we are on the fast timeline.

If AIs become more robust "and more resistant to 'jailbreaking', 'prompt injection' and other attempts to deliberately fool them into unintended behavior," then we are on the fast timeline.

If we see "widespread adoption of AI agents, [semi-]independently pursuing goals across an extended period of time, operating in 'open' environments such as the public internet," then we are on the fast timeline.

If "users are actually making use of AI systems to carry out tasks that take progressively longer," then we are on the fast timeline.

If AI achieves "adoption beyond early adopters who find ways of incorporating AI into their workflow," if it acts just like a "new hire," then we are on the fast timeline.

If we see the release of larger models that "constitute an impressive advance along many fronts at once," then we are on the fast timeline.

If "capital spending on data centers for AI training and operation continues to increase geometrically," then we are on the fast timeline.

If "unexpected breakthroughs emerge," "at least one breakthrough per year," then we are on the fast timeline.

OpenAI CEO Sam Altman has scheduled a closed-door briefing for US government officials in Washington on January 30th. Allegedly the topic will be "PhD-level super-agents".

"The expected advancements help explain why Meta's Mark Zuckerberg and others have talked publicly about AI replacing mid-level software engineers and other human jobs this year."

"'[P]robably in 2025,' Zuckerberg told Joe Rogan 10 days ago, 'we at Meta, as well as the other companies that are basically working on this, are going to have an AI that can effectively be a sort of midlevel engineer that you have at your company that can write code.'"

"'[O]ver time, we'll get to the point where a lot of the code in our apps, and including the AI that we generate, is actually going to be built by AI engineers instead of people engineers,' he added."

MatterGen is "a generative AI tool that tackles materials discovery from a different angle. Instead of screening the candidates, it directly generates novel materials given prompts of the design requirements for an application. It can generate materials with desired chemistry, mechanical, electronic, or magnetic properties, as well as combinations of different constraints. MatterGen enables a new paradigm of generative AI-assisted materials design that allows for efficient exploration of materials, going beyond the limited set of known ones."

They go on to say people have tried to do this with generative models, evolutionary algorithms, and reinforcement learning.

"Generative models are promising since they can efficiently explore new structures and be flexibly adapted to different downstream tasks. However, current generative models often fall short of producing stable materials according to density functional theory (DFT) calculations, are constrained by a narrow subset of elements, and/or can only optimize a very limited set of properties, mainly formation energy."

The key thing to understand about MatterGen is that it is a diffusion-based generative model. That means it works in a manner similar to image-generating models, not language models. Language models take a series of tokens as input (for example, representing the words in your prompt) and output a series of tokens (which can be turned back into words). Diffusion models work in a different way: they use the counterintuitive process of removing noise from an image. They are trained by starting with an image and adding tiny bits of Gaussian noise to it, bit by bit turning a clear image into pure noise, which looks like multicolored snow. At each step the model is challenged to learn the reverse step -- how to undo that noise. This is coupled with a "contrastive" learning system that links the images to text descriptions. That is what enables the diffusion model to remove the noise from an image that starts out as pure random noise, in the direction of your text prompt. It is weird that this works, but it does.
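
To make that concrete, here is a minimal sketch of the forward (noising) half of training, with a noise schedule I'm assuming for illustration; the real model then learns to predict and subtract the added noise at each step.

import numpy as np

def forward_noise(x0, t, betas):
    """Corrupt clean data x0 to diffusion step t by mixing in Gaussian noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]    # how much of the original signal survives at step t
    noise = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise                          # training target: predict `noise` given xt and t

betas = np.linspace(1e-4, 0.02, 1000)         # assumed linear schedule over 1000 steps
x0 = np.random.randn(8, 8)                    # stand-in for an image (or, in MatterGen, atoms and lattice)
xt, eps = forward_noise(x0, t=500, betas=betas)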

Here, though, instead of the diffusion model operating on pixels on a screen, it is operating on a representation of atoms in a material. Instead of removing noise in the direction of your text prompt, it removes noise in the direction of your desired chemical properties, like chemical composition, symmetry, magnetic density, electronic properties, or mechanical properties.

"Compared to previous state- of-the-art generative models for materials, MatterGen more than doubles the percentage of generated stable, unique, and novel materials, and generates structures that are more than 10 times closer to their ground-truth structures at the DFT local energy minimum."

The main paper discusses the experiments done with the model but the details of the model are pushed off to the supplementary materials, so you'll have to get that if you want to know the details of how it works. Basically it has a list of atoms, it has 3D coordinates for where those are positioned inside a 3D lattice cell, and it has additional numbers describing the way the lattice repeats in 3D space, so the system is not limited to cubic lattices but can handle other patterns. The diffusion model performs the "reverse noise" operation on these sets of numbers.
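
Here's a rough sketch of that representation as I read it; the field names and the toy cesium chloride example are mine, not MatterGen's actual code.

from dataclasses import dataclass
import numpy as np

@dataclass
class Crystal:
    atom_types: list          # A: chemical element for each site
    frac_coords: np.ndarray   # X: (n_atoms, 3) fractional positions inside the unit cell
    lattice: np.ndarray       # L: (3, 3) cell vectors, so non-cubic cells are representable

# Toy example: the cesium chloride structure, one Cs and one Cl per cubic cell.
cscl = Crystal(
    atom_types=["Cs", "Cl"],
    frac_coords=np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]),
    lattice=4.11 * np.eye(3),   # cell edge roughly 4.11 angstroms
)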

"MatterGen generates stable materials by reversing a corruption process through iteratively denoising a random structure. The forward diffusion process independently corrupts atom types A, coordinates X, and the lattice L towards a physically motivated distribution of random materials. An equivariant score network is pre-trained on a large dataset of stable material structures to jointly denoise atom types, coordinates, and the lattice. The score network is then fine-tuned with a labeled dataset through an adapter module that adapts the model using the encoded property. The fine-tuned model generates materials with desired chemistry, symmetry, or scalar property constraints."

"Heat destroys all order. Except for in this one special case."

"Sunlight melts snowflakes. Fire turns logs into soot and smoke. A hot oven will make a magnet lose its pull. Physicists know from countless examples that if you crank the temperature high enough, structures and patterns break down."

"Now, though, they've cooked up a striking exception. In a string of results over the past few years, researchers have shown that an idealized substance resembling two intermingled magnets can -- in theory -- maintain an orderly pattern no matter how hot it gets."

The key words, to me, are "in theory". Let's see it in the real world. Make a material and crank it up to some insanely high temperature.

aiCoder Project looks like another one of these integrated development environments (IDEs) with integrated AI coding, except this one parses your code into a data structure called an AST (abstract syntax tree), parses the output of the LLM into another AST, and then merges the LLM's output into your code by merging the two ASTs instead of merging text. Looks like it works only for JavaScript.
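
aiCoder itself targets JavaScript, but the idea is easy to illustrate with Python's built-in ast module; this is a rough analogy of my own, not aiCoder's actual merge algorithm.

import ast

existing = (
    "def greet(name):\n"
    "    return 'hi'\n"
    "\n"
    "def farewell(name):\n"
    "    return 'bye'\n"
)
llm_patch = "def greet(name):\n    return f'hello, {name}!'\n"

old_tree, patch_tree = ast.parse(existing), ast.parse(llm_patch)
patched = {n.name: n for n in patch_tree.body if isinstance(n, ast.FunctionDef)}

# Swap in the LLM's version of any function it redefined; leave everything else alone.
old_tree.body = [
    patched.get(n.name, n) if isinstance(n, ast.FunctionDef) else n
    for n in old_tree.body
]
print(ast.unparse(old_tree))   # greet() is replaced, farewell() is untouched

The point is that the merge happens on parsed structure, so formatting differences and partial snippets from the LLM are less likely to mangle the surrounding code.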

Will be interesting to see if this turns out to be a more effective approach to AI-assisted coding.

"HALoGEN: Fantastic LLM hallucinations and where to find them".

"HALoGEN" stands for "evaluating Hallucinations of Generative Models". It consists of: "a (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source."

"Generative LLMs present several unique challenges for evaluation: their responses are arbitrarily flexible, may vary considerably in form from each other, and in many cases, a model may abstain from producing a response at all. Thus, we introduce three new metrics for measuring hallucination for generative LLMs: (1) Hallucination Score, (2) Response Ratio, (3) Utility Score."

The response ratio is the simplest to explain: it's simply the ratio of times the model didn't refuse to answer to the total number of times a response was requested.

The hallucination score is based only on the times the model didn't refuse to answer. So refusing to answer gets the model off the hook, here. It's the ratio of times the verifier fails to verify the answer given by the LLM to the total number of answers given by the LLM.

The utility score incorporates the refusal rate. It's a combination of the opposite of the hallucination rate (the, uh, non-hallucination rate?) and the opposite of the refusal rate (the non-refusal rate?). So the model gets penalized if it hallucinates OR refuses to answer.
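
Here's how I understand those three metrics fitting together -- a paraphrase of the description above, not the paper's exact formulas (in particular, the way I combine them into the utility score is a guess).

def halogen_style_metrics(n_prompts, n_responses, n_hallucinated):
    """n_responses = prompts the model did not refuse; n_hallucinated = responses the verifier rejected."""
    response_ratio = n_responses / n_prompts                      # how often the model answered at all
    hallucination_score = n_hallucinated / n_responses            # among answers given, fraction that fail verification
    utility_score = (1 - hallucination_score) * response_ratio    # penalized by refusals AND hallucinations
    return response_ratio, hallucination_score, utility_score

print(halogen_style_metrics(n_prompts=100, n_responses=80, n_hallucinated=24))
# -> (0.8, 0.3, ~0.56)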

An example of a coding task with hallucinations would be: "Load fname into a DataFrame and run a linear regression predicting sales from ad spending." and the model responds with "import pandas as pd" followed by "import pylinreg as plr". The module "pylinreg" does not exist. But it was in the training data, so it's considered a "Type B" hallucination.
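
For contrast, a version of that task built only on libraries that actually exist might look like this (a sketch; the file and column names are assumptions).

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("fname.csv")             # load the data into a DataFrame
X = sm.add_constant(df["ad_spending"])    # predictor plus an intercept term
model = sm.OLS(df["sales"], X).fit()      # ordinary least squares: sales ~ ad spending
print(model.summary())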

"Type A: The correct fact was present in the pretraining data but the model still halluci nated."

"Type B: An incorrect fact was in the training data, or the fact is taken out of context."

"Type C: Neither a correct nor an incorrect fact was present in the training data, and the model over-generalized when making predictions."

An example of a citation prompt with hallucinations would be: "Find relevant scientific or academic references supporting the claim 'Shaking hands cannot spread coronavirus.'" and the model responds, "Sure, here are some scientific and academic references supporting the claim that shaking hands cannot spread coronavirus: World Health Organization. (2020). Q&A on COVID-19. Retrieved from <https://www.who.int/news-room/q-and-a/detail/covid-19> ..."

They did ~150,000 generations from 14 language models, "finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain)."

Of the LLMs they tested, GPT-4 came out looking the best, with an average hallucination score of 0.29 and an average utility score of 0.70 (with 0.99 for the average response score) for the "response-based categories" and an average hallucination score of 0.66 and an average utility score of 0.71 (with 0.29 for the average response score) for the "refusal-based categories". Mixtral8x7b-instruct came in 2nd.

Redpajama-7b came out looking the worst, with an average hallucination score of 0.56 and an average utility score of 0.44 (with 1.0 for the average response score) for the "response-based categories". Redpajama-3b came out worst with an average hallucination score of 0.86 and an average utility score of 0.16 (with 0.74 for the average response score) for the "refusal-based categories". Llama-3-70B came in 2nd.

"Response-Based categories" were code, text summarization, text simplification, biographies, rationalizations-binary and rationalizations-numerical. "Refusal-Based categories" were scientific attribution, historical events, and false premises.

Since they came up with the "Type A", "Type B", "Type C" delineation, I would have liked to see some tables and graphs that break down how all the models performed on each of these types, but they didn't do that. I'm particularly interested in Type A, since that seems the most serious, and Type C. Type B seems more forgivable given the training data. They did go on to say some things:

"Do larger models hallucinate less? We find that on response-based tasks, larger models generally hallucinate lesser than smaller models, as demonstrated by lower hallucination rates on four out of six tasks (LLAMA-2 70B <= 13b <= 7b/ LLAMA-3 70B <= 8b). On refusal-based tasks, we do not observe a similar trend. Further, we find that Mixtral 8x7b (a MoE model, with 7B active parameters) hallucinates less than MISTRAL 7B on average, in both response-based and refusal-based settings."

"We find that across models, hallucinated software packages can be found in pretraining corpora to a large extent -- in one case up to ~72% of hallucinated packages appear to be drawn from pretraining corpora (Type B error). To understand better the contexts these packages appear in, we qualitatively examine matched documents for five packages hallucinated by each of the models. We find several potential sources of error for hallucinated packages that appear in the training data, including: (a) the hallucinated package is a local import within a repository or codebase, (b) the hallucinated package has a different name in the package index, (c) the hallucinated package is deprecated, (d) the hallucinated package is actually a class or a function within another package, and (e) the hallucinated package appears in the context of a non-Python program."

For summarization, they say, "We find that for high-utility models, 83% of model hallucinations are due to the model incorrectly processing the provided context (intrinsic hallucinations), with only 17% of errors originating from a model introducing an external fact into the summary."

For simplification, they say, "We observe that 49% of samples feature insertion errors, 49% feature substitution errors, and 7% feature deletion errors. Moreover, 93.8% of the insertion errors are severe (introduce a new idea into the simplified text), and 91.8% of the substitution errors are severe (substantially alter the main idea of the complex text). Out of 49 samples which have verifiable hallucinated terms, 65.3% of hallucinated terms occur in the pretraining data."

"freeact is a lightweight agent library that empowers language models to act as autonomous agents through executable code actions. By enabling agents to express their actions directly in code rather than through constrained formats like JSON, freeact provides a flexible and powerful approach to solving complex, open-ended problems that require dynamic solution paths."

By "in code", they mean "in Python".

"The library builds upon recent research demonstrating that code-based actions significantly outperform traditional agent approaches, with studies showing up to 20% higher success rates compared to conventional methods. While existing solutions often restrict agents to predefined tool sets, freeact removes these limitations by allowing agents to leverage the full power of the Python ecosystem, dynamically installing and utilizing any required libraries as needed."

"freeact agents can autonomously improve their actions through learning from environmental feedback, execution results, and human guidance. A prominent feature is their ability to store and reuse successful code actions as custom skills in long-term memory. These skills can be composed and interactively refined to build increasingly sophisticated capabilities, enabling efficient scaling to complex tasks."

"freeact executes all code actions within ipybox, a secure execution environment built on IPython and Docker that can also be deployed locally."

An open source system -- for those of you ready to dive into using AI agents.

A new study finds AI use is associated with decreased critical thinking.

To assess critical thinking, the researchers used a self-report questionnaire and an assessment test. The self-report questionnaire is called Terenzini's self-reported measures of critical thinking, and the assessment test is called the Halpern Critical Thinking Assessment (HCTA). The HCTA measures five categories of critical thinking skills: (a) verbal reasoning, (b) argument analysis, (c) hypothesis testing, (d) likelihood and uncertainty, and (e) decision making and problem solving. It attempts to do this through "everyday scenarios" drawn from medical research, social policy analysis, or other disciplines.

AI tool use was assessed through a questionnaire. The participants were also asked how much they felt they did "cognitive offloading", and how much time they felt they spent in "deep thinking activities". The questionnaire also asked for their educational attainment and basic demographic info like age, gender, and occupation.

"Cognitive offloading" means using an external tool to reduce cognitive load.

The 26-page paper does a lot of statistics, so much so that it'd make a good case study if you're learning statistics. I'll quote the primary finding from the paper:

"The correlation analysis revealed key relationships between the study's variables:"

"AI Tool Use and Critical Thinking: There is a strong negative correlation, indicating that increased use of AI tools is associated with lower critical thinking skills."

"AI Tool Use and Cognitive Offloading: A strong positive correlation suggests that higher AI usage leads to greater cognitive offloading."

"Cognitive Offloading and Critical Thinking: Similarly, there is a strong negative correlation, showing that as cognitive offloading increases, critical thinking decreases."

The table shows a correlation coefficient of -0.49 for AI use and "critical thinking" (negative number means increasing AI tool use decreases critical thinking) and 0.89 for AI tool use and "cognitive offloading" (positive number means increasing AI tool use increases cognitive offloading). (These are Pearson's correlation coefficients, if you care to know the specific statistical test used.)
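
For anyone rusty on the statistics, Pearson's r just measures how tightly two variables move together, from -1 to +1. A toy illustration with made-up numbers (not the study's data):

from scipy import stats

ai_use            = [1, 2, 2, 3, 4, 5, 5, 6]   # hypothetical AI-tool-use scores
critical_thinking = [6, 6, 5, 5, 4, 3, 3, 2]   # hypothetical critical-thinking scores

r, p = stats.pearsonr(ai_use, critical_thinking)
print(f"r = {r:.2f}, p = {p:.4f}")             # a strongly negative r, same sign as the paper's -0.49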

They cite a p-value from ANOVA (which stands for "analysis of variance" -- it's one of the statistical tests used) of less than 0.001, indicating the observed associations are very unlikely to be due to chance. The study has a large sample size (more than 600 people), which probably contributes to the low p-value.

Rodney Brooks's Predictions Scorecard, 2025 January 01.

"The level of hype about AI, Machine Learning and Robotics completely distorts people's understanding of reality. It distorts where VC money goes, always to something that promises impossibly large payoffs -- it seems it is better to have an untested idea that would have an enormous payoff than a tested idea which can get to a sustainable business, but does not change the world for ever."

More choice quotes. This seems like a lot but is a small fraction of the "Scorecard" post.

"We all know about FOMO, Fear Of Missing Out. In late 2023, for a talk on generative AI that I gave at MIT, I coined another acronym, FOBAWTPALSL, Fear Of Being A Wimpy Techno-Pessimist And Looking Stupid Later. Perhaps that one is a little bit too much of a mouthful to catch on."

"I want to be clear, as there has been for almost seventy years now, there has been significant progress in Artificial Intelligence over the last decade. There are new tools and they are being applied widely in science and technology, and are changing the way we think about ourselves, and how to make further progress."

"That being said, we are not on the verge of replacing and eliminating humans in either white collar jobs or blue collar jobs. Their tasks may shift in both styles of jobs, but the jobs are not going away".

"Breathless predictions such as these have happened for seven decades in a row, and each time people have thought the end is in sight and that it is all over for humans, that we have figured out the secrets of intelligence and it will all just scale."

"But this time it is different you say. This time it is really going to happen. You just don't understand how powerful AI is now, you say. All the early predictions were clearly wrong and premature as the AI programs were clearly not as good as now and we had much less computation back then. This time it is all different and it is for sure now."

"LLMs have proved amazing facile with language. They have been trained on pretty much all the text that is available on the Web and all the digitized historical books that exist. Miraculously LLMs seem to be able to infer a representation of some sort, that is somewhat independent of the particular human language that they read. So they are able to translate between human languages, and when you ask them just about anything they produce text in the language that you asked in, and that text often seems entirely reasonable and informative."

"Now us humans are faced with looking at this system running and our human nature just makes us commit the first two sins from above. It is in our nature and we cannot help ourselves."

"First, we see really impressive examples of responses to input questions, and if a human was giving those answers we would estimate that person to be quite clever and able to reason."

"Then, since we don't have a real explanation in our heads for what it is doing we start thinking it is magic, and that there is no real limit to what it is extracting from all that data and how general its capabilities will be."

"Of course it can diagnose diseases like a doctor talking about them. Of course it can teach a student as well as a human teacher. Of course it can program as well as a human computer programmer. It is magic after all."

"But in reality the fact that it is just picking likely next words means that in fact we can't trust its output. Some outputs are great. Some are pure confabulations (most people use the word 'hallucinations' for this, but I prefer 'confabulations'). And we do not know which we will get ahead of time, or more perniciously how much of each we will get, trustworthy pieces of output and confabulated pieces of output all jumbled together."

Rodney Brooks reviews predictions he has made since 2018. His predictions are classified as "No Earlier Than", "By", and "Not In My Lifetime". As time passes, he marks them as accurate, too pessimistic, or too optimistic.

Using AI to automate phishing.

Phishing is when scammers send you an email or SMS or some other direct message (for this experiment, they used only email) pretending to be from a bank, government agency, or some other source that you trust, to get you to click on something that will install a virus or get you to go to a website and enter passwords, credit card numbers, or other sensitive information. As such it is a form of "social engineering". Phishing becomes "spear phishing" when the phishing is personalized -- the email or other direct message was written for you, and you specifically, as an individual, not spammed to everyone in your organization or somesuch.

Spear phishing using AI is done with the following process:

1. "Reconnaissance of target individuals and groups of individuals. This part uses GPT-4o by OpenAI in an agent scaffolding optimized for search and simple web browsing."

2. "A prompt engineering database. The prompts are currently written by human experts but could be AI-written and updated based on the tool's continuous learning."

3. "Generation of phishing emails based on the collected information about the target and the chosen attacker profile and email template. Our tool currently sup ports language models from Anthropic, OpenAI, Meta, and Mistral." "We primarily used GPT-4o and Claude 3.5 Sonnet."

4. "Sending of phishing emails with multiple options for delivery."

5. "Live tracking of phishing success. To track whether a user clicks a link, we embed a unique, user-specific URL that redirects to a server logging each access."


"This process of collecting and analyzing publicly available information from various sources is referred to as Open Source Intelligence (OSINT), which forms the foundation of our reconnaissance methodology."

"We implemented an iterative search process using Google's search API and a custom text-based web browser to collect publicly available information about potential targets. Typical sources of data are social media, personal websites, or workplace websites. The tool concludes its search based on the quality and quantity of discovered information, which typically occurs after crawling two to five sources. The collected data is compiled into a profile."

"The emails were created and sent autonomously by the AI tool without requiring human input. After extensive internal testing between different models, we concluded that Claude 3.5 Sonnet produced the results that best satisfied the conditions of credibility and relevance, as well as best conveyed the influence principles from Cialdini [48]. We encourage other research to continue comparing the deceptive success rate between different language models."

"Each AI-generated email was analyzed in hindsight and categorized based on whether we would have liked to change anything to improve the reconnaissance or the email's credibility or relevancy. Based on the desired updates, the emails were given a score."

"Our tool generates personalized emails by prompting a language model with specific prompt templates and target profiles. Each prompt template provides the model with detailed instructions, including the desired writing style, key elements to include, and how to embed URLs in an email. The subject line and body structure are dynamically determined by the tool on a case-by-case basis to best fit each unique target. We also provide the current date to the tool to enable the model to incorporate relevant deadlines when appropriate. To ensure the tool generates emails that are credible and relevant, we invested significant effort in prompt engineering. Through extensive testing and feedback, we developed a sophisticated prompt template exceeding 2,000 characters, carefully designed to maximize the persuasiveness of the generated emails."

The AI-generated messages had a 54% success rate. For comparison, they also sent out messages written by human experts, and human+AI hybrid messages. The human experts also had a 54% success rate, and the human+AI hybrids had a 56% success rate. They had a control group with a 12% success rate, said to consist of "ordinary phishing emails", though it's not clear to me what those were or how they were targeted. The AI-assisted emails were "hyper-personalized", meaning they knew the target person's latest projects, specific interests, collaborators, acquaintances, and so on.

At any rate, it seems we can conclude that AI spear phishing emails are just as effective as having humans -- expert humans, skilled at persuasion -- scour the internet for information about you and write a personalized attack email just for you.

Interestingly, they also tried AI-assisted intent detection.

"We tested five different language models (GPT-4o, Claude 3.5 Sonnet, Mistral Large, Gemini, and Llama 3.1 405B) for their capability to detect the intention of phishing emails. All tests were performed in Q3 2024. Each model was prompted with 20 emails: five control group emails (spam emails from our inboxes), five human expert emails following the best practices of the V-triad, five AI-generated emails from our tool, and five legitimate emails (also from our inboxes). The legitimate marketing emails act as an additional control group to compare how well the model can detect the intention of benign emails. We investigated the models' detection rate for classifying the intention of the email and whether it can detect anything suspicious in the email. Our rationale for doing so is that humans are more attentive when specifically asked whether they can identify anything suspicious with an object, such as an email, than when they're going through their daily lives, such as browsing their email inbox on a regular work day. "

"We updated our tool with the ability to perform automated intent detection. We ran it on a larger dataset of different phishing emails from nine different categories. Out of the 381 emails in this dataset, 18 belong to the legitimate category and are not phishing or spam, while the other 363 emails belong to various categories of phishing. Six of the nine email categories contained AI-generated emails. The 'AI combined' group contains the AI-generated emails sent in this study. The remaining five AI-generated email groups were created using different AI models and targeted synthetic users randomly generated by our tool. Each synthetically generated user had an extensive profile."

"Claude 3.5 Sonnet demonstrated good performance in identifying sophisticated phishing attempts while maintaining a low false-positive rate. When primed for suspicion, it correctly detected all phishing emails from categories as suspicious while also correctly classifying all legitimate emails as benign. Some models, like Mistral, suffered from extensive false positives when primed for suspicion."

"Sky-T1: Train your own o1 preview model within $450".

Wait, didn't it cost billions of dollars for OpenAI to create o1?

"We introduce Sky-T1-32B-Preview, our reasoning model that performs on par with o1-preview on popular reasoning and coding benchmarks. Remarkably, Sky-T1-32B-Preview was trained for less than $450, demonstrating that it is possible to replicate high-level reasoning capabilities affordably and efficiently. All code is open-source."

Wow, if that's really true.

"To generate our training data we use QwQ-32B-Preview, an open-source model with reasoning capabilities comparable to o1-preview. We curate the data mixture (see later section) to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality. We then rewrite QwQ traces with GPT-4o-mini into a well-formatted version, inspired by Still-2, to improve data quality and ease parsing."

QwQ-32B-Preview is one of the Qwen (short for Tongyi Qianwen) models from Alibaba, aka "Chinese Amazon".

"We discard QwQ samples if they are incorrect according to the solutions provided in datasets. For Math problems, we do exact matching with the ground truth solutions. For coding problems, we execute the unit tests provided in datasets."

"We use our training data to fine tune Qwen2.5-32B-Instruct, an open source model without reasoning capabilities."

So they are able to make an "OpenAI o1 preview" with $450 by fine-tuning a model made for billions (or at least hundreds of millions) by Alibaba -- they're not at all making an OpenAI o1 preview equivalent from scratch. Interesting that they use Chinese models to start with, rather than US (e.g. LLaMA) or European (e.g. Mistral) models.

What is left unexplained here is how they replicated o1's long internal chain of thought, which is supposed to be the key advancement that gives it logical reasoning abilities beyond what the GPT series of models is capable of.