Boulder Future Salon

Thumbnail
"Some venomous snakes can bite and kill even when they're dead and decapitated."

File under "Today I learned."

"Snakes are energy-efficient creatures." "Even when their heart has stopped beating, their tissues can retain enough oxygen to allow nerves to fire, triggering a bite reflex if you put a finger in or on its mouth."

Thumbnail
ChatGPT is destroying Trefor Bazett's math exams.

"I just copy and pasted my exams from last semester -- this was a second year university level introductory linear algebra course -- into chat GPT and actually it got an A on my exams. But AI still makes a lot of pretty basic mistakes."

"What is the smallest integer whose square is between 15 and 30?"

ChatGPT-4o, Claude 3.5 Sonnet, and Google's Gemini all get nearly 100% on the GSM8K (which is a fancy way of saying "Grade School Math, 8000 questions") dataset.

GSM-Hard is a dataset with the same word problems as GSM8K but with gigantic numbers -- so the LLM has to outsource the calculation to something like Wolfram|Alpha to be able to get the correct answers.
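To give a sense of what "outsourcing the calculation" can look like in practice, here's a minimal sketch of the program-aided idea: the model emits a short program and an interpreter does the arithmetic instead of the model. This is my own illustration with made-up numbers, not the actual GSM-Hard evaluation harness:

```python
# Minimal sketch of the "outsource the arithmetic" idea: instead of asking the
# model for the final number, have it emit a short program and run that.
# (Illustrative only, with made-up numbers -- not the GSM-Hard harness.)
model_generated_code = """
trays = 3127458
cookies_per_tray = 2986421
leftover = 17
answer = trays * cookies_per_tray - leftover
"""

namespace = {}
exec(model_generated_code, namespace)   # Python integers are arbitrary precision
print(namespace["answer"])
```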

The MATH dataset has high school *competition* problems. LLMs can get these if they can be solved with "content knowledge", such as by having formulas memorized, but can fail if the reasoning required is made more complex. LLMs get about 70% on the whole dataset.

There are additional datasets with Mathematical Olympiad problems. LLMs score poorly on these, but their scores are increasing.

Thumbnail
Clio aims to be a Copilot for DevOps.

"Clio is an AI-powered copilot designed to help you with DevOps-related tasks using CLI programs. It leverages OpenAI's capabilities to provide intelligent assistance directly from your command line."

"Note: Clio is designed to safely perform actions. It won't do anything without your confirmation first."

Features: Kubernetes management, AWS integration, Azure integration, Google Cloud Platform integration, DigitalOcean integration, EKS management, and GitHub integration.

Thumbnail
Where does AI research come from? This person got the 2,634 papers from the 2024 International Conference on Machine Learning (ICML), extracted the "institutions" (universities, big companies, and AI startups), and used a 5-step geocoding algorithm to place them on a map.

The papers were downloaded from a site called OpenReview, and the affiliations were extracted from the first page of each PDF using a local LLM (gemma-2). The 5-step geocoding algorithm uses Nominatim first, then a local LLM (ollama gemma-2) to verify the result, then a Google search with the LLM parsing the results if verification fails, and finally the paid Google Maps API if nothing else works. Python's folium library was used to create the map.
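For anyone who wants to try something similar, here's a rough sketch of that kind of fallback geocoding chain, assuming the geopy (Nominatim) and folium packages. The LLM-verification and Google fallbacks are left as stubs since the post's actual code isn't reproduced here:

```python
# Sketch of a fallback geocoding chain like the one described above.
# Assumes geopy and folium are installed; steps beyond Nominatim are stubs.
import folium
from geopy.geocoders import Nominatim

def geocode_institution(name, geolocator):
    location = geolocator.geocode(name)          # step 1: Nominatim lookup
    if location is not None:
        return (location.latitude, location.longitude)
    # steps 2-5 (LLM verification, Google search + LLM parsing, paid Maps API)
    # would go here as progressively more expensive fallbacks
    return None

geolocator = Nominatim(user_agent="icml-affiliation-map-sketch")
affiliations = ["University of Colorado Boulder", "ETH Zurich"]

m = folium.Map(location=[20, 0], zoom_start=2)
for name in affiliations:
    coords = geocode_institution(name, geolocator)
    if coords:
        folium.Marker(coords, popup=name).add_to(m)
m.save("icml_map.html")
```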

What the map reveals is that AI research comes from the US, China, and Europe.

Within the US, it comes from California, the East Coast, the Pacific Northwest, and to a lesser extent, the rest of the country. California contributes heavily from both the San Francisco Bay Area (Silicon Valley) and the LA area. The Pacific Northwest is predominantly the Seattle area. New York dominates on the East Coast, but Boston makes a considerable contribution. Various other cities like Austin, TX, Chicago, IL, Madison, WI, Atlanta, GA, and Washington, DC also register.

In China, it's pretty much all Beijing, Shanghai, and Shenzhen. The place names are in Chinese so if you want more, you'll need to read Chinese ;)

Seoul, South Korea, contributes a lot, and Tokyo, Japan, also makes a significant contribution.

For Europe, London dominates, followed by Paris, and then, it looks like, Zurich (Switzerland), Munich (Germany), Amsterdam (the Netherlands), and Berlin, with some contributions from various other places: Warsaw, Copenhagen, Stockholm.

Other notable places include Singapore, Israel, Bengaluru in India, and Australia (Sydney, Melbourne, and Brisbane). In Canada, Montreal is a significant contributor.

My state, Colorado, didn't make a very good showing -- only 3 papers, all from Boulder.

All in all, a pretty interesting map. I wonder what the numbering algorithm is -- it looks pretty smooth. You can zoom in and see contributions from all over the country, all over Europe, and around the world. You can zoom in on the hot spots like Silicon Valley and see where within the SF Bay Area contributions come from (Stanford, Berkeley, the city of SF itself where there are lots of startups, the San Jose region with the tech company heavy hitters, etc).

Thumbnail
Ethan Mollick speaks "to a lot of people in industry, academia, and government, and I have noticed a strange blind spot. Despite planning horizons that often stretch a decade or more, very few organizations are seriously accounting for the possibility of continued AI improvement in their strategic planning."

"In some ways, this makes complete sense because nobody knows the future of AI. But organizations and individuals often plan for multiple futures -- possible recessions, electoral outcomes, even natural disasters. Why does planning for the future of AI seem different?"

"Doing nothing has a number of issues. First, it ignores the very real fact that we do not need any further advances in AI technology to see years of future disruption. Right now, AI systems are not well-integrated into businesses and organizations, something that will continue to improve even if LLM technology stops developing."

"A second factor that gets overlooked in discussions is that Artificial General Intelligence (AGI) serves as a motivating goal for an entire industry. Even if the AI labs are wrong about the particular future they are working towards, advances in technologies can become a self-fulfilling prophecy."

Thumbnail
"The winds of AI Winter".

"The vibes have shifted. This is still not a normal moment in AI, and we can't precisely determine how or why, but they have shifted."

"We went from 100% gpt4 usage to almost 0% in the last 3 months". "I've switched to Claude completely. Better task clarification, more consistent output, and improved error handling. OpenAI isn't on par anymore."

"Google AI Overviews being bad, bad, bad, bad (after the Gemini mess)"

"Microsoft announcing and cancelling Recall, Figma announcing and cancelling AI, McDonald's testing and canceling Drive-thru AI (this follows Discord announcing and cancelling Clyde last winter)"

[List continues with a bunch more items]

"In isolation, all of these can be chalked up to strategic or temporary missteps by individuals, just doing their best wrangling complex systems in a short time."

"In aggregate, they point to a fundamentally unhealthy industry dynamic that is at best dishonest, and at worst teetering on the brink of the next AI Winter."

"Leopold Aschenbrenner says, 'So far, every 10x scaleup in AI investment seems to yield the necessary returns.'"

"Diminishing returns are real, scaling laws don't hold in economics like they do in AI, and log lines do not go up and to the right for ever when checked by physical reality."

"The final piece worth an honorable mention this past quarter, though not quite qualifying in the AI infra spend debate, is Chris Paik's The End of Software."

That piece, "The End Of Software" says (among other things): "Software is expensive because developers are expensive. They are skilled translators--they translate human language into computer language and vice-versa. LLMs have proven themselves to be remarkably efficient at this and will drive the cost of creating software to zero. What happens when software no longer has to make money? We will experience a Cambrian explosion of software, the same way we did with content. Vogue wasn't replaced by another fashion media company, it was replaced by 10,000 influencers."

"In the same way that 5 stocks account for 96% of the S&P 500's gains this year, the rollout and benefit of AI has been extremely imbalanced."

"We have mindblowing models, and plenty of money flowing to GPUs, infra is improving, and costs are coming down. What we haven't seen is the proportionate revenue, and productivity gains, flow to the rest of the economy."

Thumbnail
"How are engineers really using AI tools in 2024?"

"A total of 211 tech professionals took part in the survey." "Most respondents are individual contributors (62%). The remainder occupy various levels of engineering management."

"As many professionals are using both ChatGPT and GitHub Copilot as all other tools combined."

"GitHub Copilot Chat is mentioned quite a lot, mostly positively."

"Other tools earned honorable mentions as some devs' favorite tools: Claude, Gemini, Cursor, Codium, Perplexity and Phind, Aider, JetBrains AI, AWS CodeWhisperer, Rewatch."

The rest is paywalled.

Thumbnail
The CrowdStrike glitch that just took out Windows machines all over the planet, explained by Dave Plummer. CrowdStrike made a kernel driver that watched programs' behavior to try to detect viruses before a regular anti-virus would, but it depended on an external file for updates. A recent update delivered a file full of all 0s. And that didn't work.

Thumbnail
Richard Sutton interviewed by Edan Meyer. Rich Sutton literally half-wrote the book on reinforcement learning -- my textbook on reinforcement learning, Reinforcement Learning: An Introduction, was written by him and Andrew Barto. I've never seen him (or Andrew Barto) on video before so this was interesting to see. (Full disclosure, I only read about half of the book, and I 'cheated' and didn't do all the exercises.)

The thing that I thought was most interesting was his disagreement with the self-supervised learning approach. For those of you not up on the terminology, "self-supervised" is a term that means you take any data, and you mask out some piece of it, and try to train your neural network to "predict" the part that's masked out from the part that isn't masked. The easiest way to do this is to just unmask all the "past" data and mask all the "future" data and ask the neural network to predict the "next word" or "next video frame" or "next" whatever. It's called "self-supervised" because neural network training started with paired inputs and outputs where the "outputs" that the neural network was to learn were written by humans, and this came to be called "supervised" learning. "Unsupervised" learning came to refer to throwing mountains of data at an algorithm and asking it to find whatever patterns are in there. So to describe this alternate mode where it's like "supervised" learning but the "correct answers" are created just by masking out input data, the term "self-supervised" was coined.
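In code, the next-word version of this is almost embarrassingly simple -- the "labels" are just the input shifted by one position. A toy illustration (mine, not from the interview):

```python
# Toy illustration of the self-supervised "predict the next item" setup:
# the training targets are just the input sequence shifted by one position,
# so no human labels are needed.
tokens = [12, 7, 93, 4, 55, 21]   # pretend these are token IDs from raw text

inputs  = tokens[:-1]   # [12, 7, 93, 4, 55]
targets = tokens[1:]    # [ 7, 93, 4, 55, 21]

for x, y in zip(inputs, targets):
    print(f"given ...{x}, predict {y}")
```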

I thought "self-supervised" learning was a very important breakthrough. It's what led directly to ChatGPT and all the other chatbots we know and love (we do love them right?). But Rich Sutton is kind of a downer when it comes to self-suprevised learning.

"Outside of reinforcement learning is lots of guys trying to predict the next observation, or the next video frame. Their fixation on that problem is what I mean by they've done very little, because the thing you want to predict about the world is not the next frame. You want to predict *consequential* things. Things that matter. Things that you can influence. And things that are happening multiple steps in the future."

"The problem is that you have to interact the world. You have to predict and control it, and you have large sensory sensory motor vectors, then the question is what is my background? Well, if I'm a supervised learning guy, I say, maybe I can apply my supervised learning tools to them. They all want to have labels, and so the labels I have is the very next data point. So I should predict that that next data point. This is is a way of thinking perfectly consistent with their background, but if you're coming from the point of reinforcement learning you think about predicting multiple steps in the future. Just as you predict value functions, predict reward, you should also predict the other events -- these things will be *causal*. I want to predict, what will happen if I if I drop this? Will it spill? will there be water all over? what might it feel on me? Those are not single step predictions. They involve whole sequences of actions picking things up and then spilling them and then letting them play out. There are consequences, and so to make a model of the world it's not going to be like a video frame. It's not going to be like playing out the video. You model the world at a higher level."

Thumbnail
A company called Haize Labs claims to be able to automatically "red-team" AI systems to preemptively discover and eliminate any failure mode.

"We showcase below one particular application of haizing: jailbreaking the safety guardrails of industry-leading AI companies. Our haizing suite trivially discovers safety violations across several models, modalities, and categories -- everything from eliciting sexist and racist content from image + video generation companies, to manipulating sentiment around political elections"

Play the video to see what they're talking about.

The website doesn't have information about how it works -- it's just for people to request "haizings".

Thumbnail
treevis.net: A Visual Bibliography of Tree Visualization 2.0 by Hans-Jörg Schulz.

Some of these look like visualizations of "tree" data structures. Some look like they're "any graph", not necessarily a "tree". And some look like they're trying to visualize literal trees. 339 total visualizations.

Thumbnail
PvQ LLM Leaderboard.

"Recently, we've been building a small application called PvQ, a question and answer site driven by open weight large-language-models (LLMs). We started with ~100k questions from the StackOverflow dataset, and had an initial set of 7 open weight LLMs to produce an answer using a simple zero shot prompt. We needed a way to see the site with useful rankings to help push the better answers two the top without us manually reviewing each answer. While it is far from an perfect approach, we decided to use the Mixtral model from Mistral.AI, to review the answers together, and vote on the quality in regards to the original question."

"Over a few weeks we generated ~700k answers for the following models:"

"Mistral 7B Instruct"
"Gemma 7B Instruct"
"Gemma 2B Instruct"
"Deepseek-Coder 6.7B"
"Codellama"
"Phi 2.0"
"Qwen 1.5 4b"

But if you look at the leaderboard today, you'll see they've got non-open models on it now like GPT-4 Turbo, GPT-4o-mini, Claude 3.5 Sonnet, Gemini Pro 1.0, and so on.

WizardLM from Microsoft, which I had never heard of before, did unexpectedly well.

Thumbnail
Japan is a global leader in quiet quitting, despite its hardworking image. According to this article, only 6% of the Japanese workforce is "engaged," vs 33% for the US.

Thumbnail
"OpenRecall is a fully open-source, privacy-first alternative to proprietary solutions like Microsoft's Windows Recall or Limitless' Rewind.ai. With OpenRecall, you can easily access your digital history, enhancing your memory and productivity without compromising your privacy."

"OpenRecall captures your digital history through regularly taken snapshots, which are essentially screenshots. The text and images within these screenshots are analyzed and made searchable, allowing you to quickly find specific information by typing relevant keywords into OpenRecall. You can also manually scroll back through your history to revisit past activities."

Thumbnail
Remember how in June I told you all about how the 50-year-old petrodollar agreement between the US and Saudi Arabia had been allowed to expire? Well, get this:

"Saudi Arabia has joined a China-dominated central bank digital currency cross-border trial, in what could be another step towards less of the world's oil trade being done in US dollars."

"The move, announced by the Bank for International Settlements on Wednesday, will see Saudi's central bank become a 'full participant' of Project mBridge, a collaboration launched in 2021 between the central banks of China, Hong Kong, Thailand, and the United Arab Emirates."

mBridge, by the way, stands for "multiple CBDC bridge". It links together the central bank digital currencies (CBDCs) of the four abovementioned banks. The idea is to allow the CBDCs to interoperate while at the same time ensuring compliance with the jurisdiction-specific policies of each bank.

Thumbnail
Chatbots that allegedly have "reasoning capabilities" fail at a simple logic problem. "Complete reasoning breakdown".

"The original problem formulation, of which we will present various versions in our investigation is as following: 'Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?'. The problem features a fictional female person (as hinted by the 'she' pronoun) called Alice, providing clear statements about her number of brothers and sisters, and asking a clear question to determine the number of sisters a brother of Alice has. The problem has a light quiz style and is arguably no challenge for most adult humans and probably to some extent even not a hard problem to solve via common sense reasoning if posed to children above certain age."

"We posed varying versions of this simple problem (which in following we will refer to as 'Alice In Wonderland problem', AIW problem) to various SOTA LLMs that claim strong reasoning capabilities. We selected closed ones like GPT-3.5/4/4o (openAI), Claude 3 Opus (Anthropic), Gemini (Google DeepMind), and open weight ones like Llama 2/3 (Meta), Mistral and Mixtral (Mistral AI), including very recent Dbrx by Mosaic and Command R+ by Cohere (which are stated in numerous announcements to lead the open weights models as of April 2024, according to open LLM leaderboards). We analyse the response statistics and observe strong collapse of reasoning and inability to answer the simple question as formulated above across most of the tested models, despite claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4 that occasionally manage to provide correct responses backed up with correct reasoning as evident in structured step by step explanations those models deliver together with solution. However, Claude 3 Opus and GPT-4 still show frequent failures to solve this simple problem across trials. Importantly, they also show strong fluctuations across even slight problem variations that should not affect problem solving. Retaining the relational logic of the problem, we also formulated a harder form (AIW+), where both Claude 3 Opus and GPT-4o collapse almost to 0 success rate."

"To further measure the sensitivity and robustness of models to slight AIW problem variations, we formulate AIW Alice Female Power Boost and AIW Extention versions, which provide further evidence for strong performance fluctuations and lack of robustness in all tested models, being a reoccurring signature of their severely impaired basic reasoning we observe in this study."

If you're wondering about the "Alice Female Power Boost", that variation "uses a fully redundant 'Alice is female' addition ('she' pronoun is already used in AIW original problem to fully determine gender information and avoid any uncertainty about person's gender as it can be inferred from the name only)."

The "AIW Extension uses combination of both Alice and Bob as sister and brother to ask same type of question."

And AIW+? An example of that is:

"Alice has 3 sisters. Her mother has 1 sister who does not have children -- she has 7 nephews and nieces and also 2 brothers. Alice's father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice's sister have?"

That one's tricky enough that I had to look up the definitions of "nephew" and "niece" and make a diagram of 3 generations on a piece of paper.
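Since I worked it out on paper, here's the same counting in code, under my own assumptions that Alice and her 3 sisters are their parents' only children and that the relatives mentioned are the complete set; under that reading the tally comes out to 5:

```python
# Back-of-the-envelope tally for the AIW+ example above. Assumptions (mine,
# not spelled out in the problem): Alice and her 3 sisters are their parents'
# only children, and the relatives mentioned are the complete set.
alice_and_siblings = 1 + 3                      # Alice plus her 3 sisters

# Maternal side: the childless aunt's 7 nephews/nieces are the mother's
# children plus the children of the aunt's 2 brothers.
maternal_uncles_kids = 7 - alice_and_siblings   # = 3 cousins

# Paternal side: the uncle's 5 nephews/nieces are the father's children plus
# children of the father's other siblings; the uncle's own son adds one more.
other_paternal_kids = 5 - alice_and_siblings    # = 1 cousin
uncles_son = 1

cousins = maternal_uncles_kids + other_paternal_kids + uncles_son
print(cousins)   # 5 -- and Alice's sister shares the same cousins
```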