Boulder Future Salon

Stable Video from Stability AI, the same company that made Stable Diffusion, has been released. Károly Zsolnai-Fehér of Two Minute Papers does a quick run-down, comparing it with existing systems like Runway, Emu Video, and Imagen Video. Stable Video was trained on 600 million videos.

Imperfections: The videos have to be short. Sometimes instead of real animation, you get camera panning. If you want text in your video, it will have trouble. It requires a lot of GPU memory to run. It can't do iterative edits, which Emu Video can do.

On the plus side, Stable Video is completely open source.

Optical illusions created with diffusion models. Images that change appearance when flipped or rotated. Actually these researchers created a general-purpose system for making optical illusions for a variety of transformations. They've named their optical illusions "visual anagrams".

Now, I know I told you all I would write up an explanation of how diffusion models work, and I've not yet done that. There's a lot of advanced math that goes into them.

The key thing to understand, here, about diffusion models, is that they work by taking an image and adding Gaussian noise... in reverse. You start with random noise, and then you "de-noise" the image step by step. And you "de-noise" it in the direction of a text prompt.

The way this process works is, you feed in the image and the text prompt, and what the neural network computes is the "noise". Crucially, this "noise" computation isn't a single number, it's a pixel-by-pixel noise estimate -- essentially another image. "Noise" compared to what? Compared to the text prompt. Amazingly enough, using this "error" to "correct" the image and then iterating on the process guides it into an image that fits the text prompt.
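To make that concrete, here's a toy sketch of the iterative de-noising loop in Python. The `predict_noise` function is a made-up stand-in for the trained network (a real diffusion model conditions on the prompt through billions of learned weights); the point is just the shape of the loop: start from pure noise, estimate the "noise" image, subtract it, repeat.

```python
import numpy as np

def predict_noise(image, prompt):
    """Made-up stand-in for the trained network's per-pixel noise
    estimate. A real model conditions on the prompt through learned
    weights; this toy version just makes the loop runnable."""
    rng = np.random.default_rng(sum(ord(c) for c in prompt))
    return 0.1 * image + 0.01 * rng.standard_normal(image.shape)

def denoise(prompt, steps=50, shape=(64, 64), seed=0):
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(shape)        # start from pure Gaussian noise
    for _ in range(steps):
        noise = predict_noise(image, prompt)  # the per-pixel "error" image
        image = image - noise                 # "correct" and iterate
    return image

img = denoise("an oil painting of a dog")
```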

The trick they've done here is, they first take the image and compute the "noise" on it the normal way. Then they take the image and put it through its transformation -- rotation, vertical flipping, or puzzle-piece-like rearrangement (rotation, reflection, and translation), then compute the "noise" on *that* image (using a different text prompt!) and then they do the reverse transformation on the "noise" image. They then combine the original "noise" and the reverse transformation "noise" by simple averaging.
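A minimal sketch of that combination step, again with a made-up `predict_noise` stand-in for the diffusion model (everything here is hypothetical except the structure the paper describes: transform, estimate noise under the second prompt, invert the transform, average the two estimates):

```python
import numpy as np

def predict_noise(image, prompt):
    """Made-up stand-in for a pixel-space diffusion model's per-pixel
    noise estimate (the paper uses DeepFloyd IF; nothing here is real)."""
    rng = np.random.default_rng(sum(ord(c) for c in prompt))
    return 0.1 * image + 0.01 * rng.standard_normal(image.shape)

def anagram_step(image, prompt_a, prompt_b, transform, inverse):
    # "Noise" for the image as-is, under the first prompt.
    noise_a = predict_noise(image, prompt_a)
    # "Noise" for the transformed image under the second prompt,
    # mapped back through the reverse transformation.
    noise_b = inverse(predict_noise(transform(image), prompt_b))
    # Combine the two estimates by simple averaging, then "correct".
    return image - 0.5 * (noise_a + noise_b)

# Example transformation: 180-degree rotation (it is its own inverse).
rot180 = lambda im: np.rot90(im, 2)
image = np.random.default_rng(0).standard_normal((64, 64))
for _ in range(50):
    image = anagram_step(image, "an old man", "a young woman", rot180, rot180)
```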

This only works for certain transformations. Basically the two conditions the transformation has to satisfy are "linearity" and "statistical consistency". By linearity, they mean diffusion models fundamentally think in terms of "signal + noise" as a linear combination. If your transformation breaks this assumption, your transformation won't work. By "statistical consistency" they mean diffusion networks assume the "noise" is Gaussian, meaning it follows a Gaussian distribution. If your transformation breaks this assumption, it won't work.

These assumptions hold for the 3 transformations I've mentioned so far: rotation, reflection, and translation. It also works for one more: color inversion. Like a photographic negative. The color values have to be kept centered on 0, though. Their examples are only black-and-white.

Another thing they had to do was use a different diffusion model because Stable Diffusion actually has "latent space" values that refer to groups of pixels. They used an alternative called DeepFloyd IF, where the "latent space" values are per-pixel. I haven't figured out exactly what "latent space" values are learned by each of these models so I can't tell you why this distinction matters.

Another thing is that the system also incorporated "negative prompting" in its "noise" estimate, but they discovered you have to be very careful with negative prompting. Negative prompts tell the system what it must *leave out* of the image rather than include. An example that illustrates the problem: suppose your prompt is "oil painting of a dog" and your negative prompt is "oil painting of a cat". Both contain "oil painting", so you're telling the system to simultaneously include and exclude "oil painting".

The website has lots of animated examples; check it out.

Visualization of how GPT works. This is an impressive visualization, where you can even mouse over the matrices and it'll show you not just the value in a cell but how that value was calculated. It has accompanying text with animations that play as you move through the text (with the spacebar), and you can replay animations with little "play" buttons. It uses a simplified 85,000-parameter GPT called Nano-GPT.

The heart of it is the "self attention" chapter. Remember, "transformer" is the nonsensical name for the blocks in neural networks that handle the "attention" mechanism (and is the "T" in "GPT"). The "self attention" chapter shows how there are learned weights (the animation shows only inference, not training, so you have to assume the weights already have the correct values from a training process not shown) for "Q", "K", and "V". These are combined with the input to form Q vectors, K vectors, and V vectors. "Q" stands for "query", "K" stands for "key", and "V" stands for "value", and this is supposed to remind you of doing a lookup in a key-value table.

But while that may be the general idea, the visualization here shows you what actually happens in "transformer" neural networks like GPT. The Q and K vectors are combined in such a way that only the current "Q" is used but "K" up to the current entry are used, so "K" is allowed to "see into the past". Q and K are combined into an "attention matrix". After a "normalization" step, this combined "attention matrix" is coupled with V to produce the output of the transformer block.
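Here's a small numpy sketch of that mechanism -- single-head causal self-attention, with the "see into the past" masking and the softmax "normalization" step. The weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention, as in the visualization:
    each position's Q attends only to K/V at positions up to itself."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # the "attention matrix"
    # Mask the future: position t may only "see" positions <= t.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    # "Normalization" step: softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # couple with V for the output

rng = np.random.default_rng(0)
T, d = 5, 8                                   # 5 positions, 8 dimensions
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

Note how the first position can only attend to itself, so its output is exactly its own V vector -- the "past" it is allowed to see is just itself.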

The text that accompanies the visualization explains the full context of this, including the "tokenization" at the beginning of the process (that produces "embeddings") and the softmax and logits that are used to pick the tokens that are output at the end of the process.

There are visualizations for GPT-2 and GPT-3 as well, which are much larger, but there is no accompanying text to walk you through those visualizations.

"llamafile is the new best way to run a LLM on your own computer".

llamafile is a system that packages an open source large language model's weights (in the GGUF format) together with the code needed to run it into a single executable file that runs on multiple operating systems.

Works on Linux, macOS (requires Xcode installed), Windows (requires extra steps because of the 4GB executable size limit), FreeBSD, NetBSD, and OpenBSD.

I haven't tried this, so if you have a chance to give it a whirl, let me know how it goes.

An autonomous excavator built a six-metre-high, sixty-five-metre-long dry-stone wall. It picks up boulders, scans them, algorithmically determines the optimal placement for each, and places them.

The wall is part of "a digitally planned and autonomously excavated landscape and park."

"Using sensors, the excavator can autonomously draw a 3D map of the construction site and localise existing building blocks and stones for the wall's construction. Specifically designed tools and machine vision approaches enable the excavator to scan and grab large stones in its immediate environment. It can also register their approximate weight as well as their centre of gravity. An algorithm determines the best position for each stone, and the excavator then conducts the task itself by placing the stones in the desired location."

"Our geometric planning algorithm uses a combination of constrained registration and signed-distance-field classification to determine how these should be positioned toward the formation of stable and explicitly shaped structures."

The paper is paywalled, but I can tell you -- because I was trying to figure out how a code CAD system called SummonScript works -- that a signed distance field is a grid of voxels where each cell holds the distance to the nearest point on an object's surface. By convention, positive numbers represent 'outside' the object, while negative numbers represent 'inside' the object.
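A minimal sketch for the intuition: a signed distance field for a sphere on a voxel grid, using the sign convention just described. (This is a generic SDF example, not anything from the paywalled paper.)

```python
import numpy as np

def sphere_sdf_grid(center, radius, n=32, extent=2.0):
    """Signed distance field for a sphere, sampled on an n^3 voxel grid:
    each cell stores the distance to the sphere's surface -- negative
    inside the object, positive outside, zero exactly on the surface."""
    coords = np.linspace(-extent, extent, n)
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    dist_to_center = np.sqrt((x - center[0])**2
                             + (y - center[1])**2
                             + (z - center[2])**2)
    return dist_to_center - radius

sdf = sphere_sdf_grid(center=(0, 0, 0), radius=1.0)
```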

As for "constrained registration", I don't know why they call it 'registration', but the basic idea is that you put in two geometric objects, and the algorithm figures out what geometric transformations (translations, rotations, and scaling) turns the first object into something as close as possible to the second object. It's called 'constrained' because you can tack on additional constraints that you want the algorithm to satisfy. These could be angles that the algorithm is not allowed to change or points that must remain aligned with other points. Since the research paper is paywalled I can't give any more specifics of the algorithm here. Obviously one of the constraints is that it can't do scaling since the size of the stones can't change.
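For a feel of what rigid registration (with scaling excluded by construction) looks like, here's the classic Kabsch algorithm in numpy. To be clear, this is a standard textbook method, not the paper's constrained-registration algorithm, which I haven't seen:

```python
import numpy as np

def rigid_register(source, target):
    """Minimal rigid registration (Kabsch algorithm): find the rotation R
    and translation t that best map the `source` points onto `target`.
    Scaling is excluded by construction -- the analogue of the constraint
    that stones can't change size."""
    src_c = source - source.mean(axis=0)      # center both point sets
    tgt_c = target - target.mean(axis=0)
    H = src_c.T @ tgt_c                       # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # avoid reflections
    D = np.diag([1.0] * (H.shape[0] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = target.mean(axis=0) - R @ source.mean(axis=0)
    return R, t

# Demo: rotate and translate a point cloud, then recover the transform.
rng = np.random.default_rng(0)
pts = rng.standard_normal((10, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
moved = pts @ R_true.T + np.array([1.0, 2.0, 3.0])
R, t = rigid_register(pts, moved)
```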

"A23a: World's biggest iceberg on the move after 30 years."

The iceberg, called A23a, split from the Antarctic coastline in 1986. But it swiftly grounded in the Weddell Sea, becoming, essentially, an ice island.

"I asked a couple of colleagues about why, after almost 40 years, A23a is on the move now, wondering if there was any possible change in shelf water temperatures that might have provoked it, but the consensus is the time had just come."

"Eventually it was going to decrease (in size) sufficiently to lose grip and start moving."

Going to do its little part to contribute to sea level rise.

"New wonder material is 5x lighter and 4x stronger than steel".

Impressive headline. What's going on here?

This has to do with a class of materials called "nanolattices". By "lattice", we mean a regular, repeating structure. "Nanoscale" means at the scale of nanometers. Remember your metric prefixes: as you get smaller, go milli-, micro-, nano-, each 1000x smaller than the previous. A nanometer is approximately the size of atoms and small molecules, so building at the "nanoscale" means building at the scale of atoms and molecules.

The technique here is called DNA origami. It works by using DNA to create "frames" to shape the material you want to create. The process starts with creating short strands of DNA called "staples"; a typical origami design might have 250 staples. These staples "fit together" to self-assemble into a shape. There are usually one or more long strands, extracted from bacteria (rather than synthesized from a DNA sequence), that act as "scaffolding", and the staples attach to this scaffolding. The self-assembly process is driven by the tendency of complementary DNA base pairs to bind to each other (A to T and C to G), as you may recall from biology.
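The base-pairing rule that drives the self-assembly is simple enough to sketch in a few lines of Python (purely illustrative; real origami design software handles thousands of bases, strand routing, and much more):

```python
# Complementarity rule behind self-assembly: a staple strand binds a
# scaffold region whose sequence is its reverse complement (A-T, C-G).
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    return "".join(PAIR[b] for b in reversed(seq))

def binds(staple, scaffold_region):
    """True if the staple is the reverse complement of the region."""
    return staple == reverse_complement(scaffold_region)

print(binds("ATTGC", "GCAAT"))
```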

It's not exactly intuitive how to make DNA that makes the shape you want, so the DNA origami equivalent of "CAD" software has been invented. (See below for more on that.)

The material being used here is silica (which the Popular Mechanics article calls "glass") and the next step in the process is called the sol-gel process. "Sol" means "solution" (*not* "solid" -- we're actually talking about the liquid state here), and "gel" means "solid". (Got that?) So the idea is to put in a liquid with your silica precursors suspended in a solution and have them solidify inside the DNA frames you created with your DNA origami technique.

The precursors used here are 3-aminopropyltriethoxysilane (aka APTES) and tetraethyl orthosilicate (aka TEOS). APTES has a central silicon atom with 3 oxygens around it plus an aminopropyl chain ending in an amino group (NH2). TEOS has a central silicon atom with 4 oxygens around it and nothing else, but beyond the oxygens there are carbons and hydrogens (C2H5 -- ethyl groups).

These combine to form cuboid nanolattices that are approximately 29 nanometers on each side. The frames created by DNA, however, are octahedral. The octahedrons nudge the cuboid nanolattices to form vertex-to-vertex connections between the DNA frames. Each octahedron has 6 neighboring octahedrons with complementary DNA at the vertices. The octahedral "units" repeat in space to form a cubic nanolattice with an edge length varying between 1 and 10 mm. (mm == millimeters! We've just switched from nanometers (nm) to millimeters (mm)!)

The manufacturing process results in lattices that tend to cluster in large clumps, with isolated particles in between the clumps. It's these isolated particles that the researchers selected for strength testing. And only those with a cuboid geometry. Small nanolattices (edge length < 3 mm) had compressive yield strength of above 2 GPa.

At this point we have to explain what that number means. Compressive yield strength has to do with how much compression a material can handle such that when the pressure is removed, it returns to its original shape. The number is expressed in units of pressure, in this case GPa, which means gigapascals. (2 GPa is about 300,000 psi, or pounds-per-square-inch, if you're familiar with that unit of pressure, but, metric system, people!)
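For the skeptical, the unit conversion checks out (1 psi is one pound-force per square inch, about 6894.76 pascals):

```python
# Sanity-check the GPa-to-psi figures quoted above.
PSI_IN_PA = 6894.757293  # pascals per psi

def gpa_to_psi(gpa):
    return gpa * 1e9 / PSI_IN_PA

print(round(gpa_to_psi(2.0)))  # about 290,000 psi, i.e. the "about 300,000" above
```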

As the edge length of the nanolattice increases, the compressive yield strength actually decreases -- though the relationship is nonlinear. This may seem counterintuitive, but the explanation the researchers gave for this is that as the edge length of the nanolattice increases, it increases the likelihood of defects such as vacancies and voids in the material. The more defects, the lower the resulting strength of the material.

The strongest nanolattices had a compressive yield strength of 4.8 GPa.

I tried to find the GPa for steel for comparison but my Google searches kept giving me numbers like 200 -- but that's the Young's modulus, not the compressive yield strength. I *think* the compressive yield strength for steel is usually less than 1 GPa and maxes out around 1.4 for the strongest steels. In the article they say the new material is 4x stronger, which would put steel at 1.2. (If any of you are steel experts, feel free to chime in.)

As for Young's modulus, the Young's modulus for bulk silica is 72 GPa, and they say their technique tends to lower it, so the idea is to keep it as close to 72 as possible. But wait, we haven't explained what Young's modulus is. It's a measure of the stiffness of a material. So if the Young's modulus for steel is 200, that means the new material is a lot less stiff. This may seem counterintuitive -- how can a material be stronger but less stiff? But remember, our definition of "strength" is compressive yield strength, which is how much pressure the material can take while still returning to its original shape after the pressure is removed. So it is allowed to "deform" as long as it goes back to its original shape.

Anyway, this is all lab work and we don't know if this will ever become a cost-effective commercial product that buildings or other structures will be constructed out of. Seems to me like the biggest issue is that the researchers were highly selective in which particles they strength tested. To bulk manufacture this material, you'd either have to invent a process that automates this selection process, and in the process throw away the bulk of what you manufacture, or invent a whole new process for producing the material.

The Q* model is real. Sam Altman, in an interview with The Verge, revealed that word of the model was "a leak." But he doesn't reveal anything about the model in the interview.

"The reports about the Q* model breakthrough that you all recently made, what's going on there?"

"No particular comment on that unfortunate leak. But what we have been saying -- two weeks ago, what we are saying today, what we've been saying a year ago, what we were saying earlier on -- is that we expect progress in this technology to continue to be rapid and also that we expect to continue to work very hard to figure out how to make it safe and beneficial."

The rest of the interview doesn't say much. Why was he fired? He doesn't say. What happened that resulted in him coming back? He doesn't say. How will OpenAI's governance structure change? He doesn't say.

"Designing a really good governance structure, especially for such an impactful technology, is not a one-week question. It's going to take a real amount of time for people to think through this, to debate, to get outside perspectives, for pressure testing. That just takes a while."

Hypothesis on what happened on SpaceX's Starship Integrated Flight Test 2. In the first stage, he thinks there was a turbo pump failure in one engine that caused it to shed shrapnel and cause a cascading failure.

For Starship itself, he thinks pressure from the center 3 engines resulted in a puncture that caused propellant to escape like air from a punctured aerosol can, causing Starship to spin and triggering the flight termination system, which is designed to detect when the vehicle will fail to follow its intended trajectory.

A machine learning model has predicted the properties of a class of high-temperature superconductors known as cuprates.

"Since the first high-temperature superconducting materials, known as the cuprates, were discovered in 1986, researchers have struggled to explain their properties and to find materials with even higher superconducting transition temperatures. One puzzle has been the cuprates' wide variation in transition temperature, ranging from below 10 K to above 130 K. Now Masatoshi Imada of Waseda University in Japan and his colleagues have used first-principles calculations to determine the order parameters -- which measure the density of superconducting electrons -- for four cuprate materials and have predicted the transition temperatures based on those order parameters. The researchers have also found what they believe is the fundamental parameter that determines transition temperature in a given material, which they hope will lead to the development of higher-temperature superconductors."

"They used a combination of numerical techniques, including one supplemented by machine learning, and did not require any adjustable parameters."

"AttackSpace is an open-source curated comprehensive list of LLM security methods and safeguarding techniques."

"Note: These examples are purely conceptual and do not include execution details. They are intended for illustrative purposes only and should not be used for any form of actual implementation or harm."

The page describes "Red Teaming", which is deliberately testing the safety of an AI system by trying to get it to do things it shouldn't do, and "Goal Misgeneralisation", which is their term for when a reward function isn't exact enough. For example, some years ago a reinforcement learning agent playing a boat racing video game discovered it could get points for collecting certain power-ups, wait a few seconds for them to reappear, then loop back and collect the points again. So it went in endless circles instead of actually trying to win the boat race.

The page has links to numerous research papers in both of these categories, then goes on to highlight two papers on "Mosaic Prompt" and "Cross-Lingual Attacks". "Mosaic Prompt" means breaking down impermissible content into small permissible components, where each component is queried independently and appears harmless. "Cross-Lingual Attacks" involve bypassing protections in English by using some other language where the AI system has not been trained to properly implement protections.

Breakthrough in neural networks that could enable them to get much deeper than current neural networks. "Towards training without depth limits: Batch normalization without gradient explosion".

So what this is about is making deep neural networks even deeper, and to do that problems with rank collapse, exploding and/or vanishing gradients have to be solved.

Before diving in, a bit of background information. The main idea of a neural network is that you simulate the connections between neurons as linear functions, which you represent mathematically as matrices. This enables you to bring a whole body of mathematics known as "linear algebra" to bear on the problem of making neural networks. The columns represent your inputs, the rows represent your outputs, and the entries of the matrix are your connection "weights", which are the parameters of the neural network (or at least most of them). If you have a lot of 0s, you have a sparsely connected neural network; if every cell is filled in, you have a densely connected network.

The way you train your network is by calculating the difference between the output and the "correct" output -- the "error". You use these "error" values to make adjustments to your parameters so that next time around, there will be less error.

If you have multiple layers, though, you have a problem. The first is that combining multiple layers of linear transformations is equivalent to a single linear transformation that combines them all. So you insert non-linear layers in between your linear layers. These have become known as "activation" layers. Some of these are complex functions, like hyperbolic tangent (tanh), and some are simple, like ReLU (short for "rectified linear unit", which just means y = x for positive numbers and y = 0 for negative numbers). What matters is that these nonlinear layers are differentiable. Because what you do in order to extend the "error correction" process into deeper layers is use the chain rule from calculus. So as long as you can calculate a "gradient" on every layer, where a "gradient" is the multi-dimensional equivalent of the "derivative" you learned in calculus, you can apply the chain rule and keep going into deeper layers. This is where we get terms like "gradient descent" (and "stochastic gradient descent"), because you "descend" the "gradient" to do your "error correction", and "backpropagation", because you propagate your "error correction" signal backwards through the layers, in the opposite direction of the "forward pass" from input to output.
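Here's the whole story in a tiny runnable example: a two-layer network with a ReLU activation in between, trained by computing the error and then backpropagating gradients through both layers with the chain rule. (A toy sketch with random data, not any particular real network.)

```python
import numpy as np

# Toy two-layer network trained exactly as described: forward pass,
# compute the "error", backpropagate gradients with the chain rule,
# adjust the weight matrices. Random data; purely illustrative.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8)) * 0.5   # layer 1 weights (4 inputs, 8 hidden)
W2 = rng.standard_normal((8, 1)) * 0.5   # layer 2 weights (8 hidden, 1 output)
x = rng.standard_normal((16, 4))         # inputs
y = rng.standard_normal((16, 1))         # "correct" outputs

def loss(W1, W2):
    return float(np.mean((np.maximum(x @ W1, 0.0) @ W2 - y) ** 2))

loss_before = loss(W1, W2)
lr = 0.05
for _ in range(200):
    h = x @ W1                        # linear layer 1
    a = np.maximum(h, 0.0)            # ReLU activation layer
    out = a @ W2                      # linear layer 2
    err = out - y                     # the "error"
    # Backward pass: chain rule, layer by layer, back to front.
    grad_W2 = a.T @ err / len(x)
    grad_h = (err @ W2.T) * (h > 0)   # ReLU derivative gates the signal
    grad_W1 = x.T @ grad_h / len(x)
    W2 -= lr * grad_W2                # gradient descent step
    W1 -= lr * grad_W1

final_loss = loss(W1, W2)
```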

When training neural networks, there is a technique called "batch normalization". "Normal" here is a reference to the "normal distribution" in statistics. What this does is adjust the outputs of a layer in a neural network such that the inputs to the next layer have an average of 0 and a standard deviation of 1. This has the effect of making the inputs more "orthogonal", which you can think of as being "less parallel". When two input calculations become "parallel" and the inputs to the next layer become identical, you have in essence lost an input parameter to the next layer. I'm using the word "parallel" because parallel lines close together or overlapping are easy to visualize, but the actual word for this in linear algebra is "collinear". This loss of an input parameter is called "rank collapse", because everything has to have strange names. OK, "rank" here is the linear algebra term for the number of linearly independent columns (or rows) in a matrix. So when your columns contain duplicates, you've in essence decreased the number of independent dimensions. Hence "rank collapse".
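The normalization itself is just a few lines. Here's a sketch, with the learned per-feature scale and shift (usually called gamma and beta) reduced to scalars for simplicity:

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch to mean 0, std 1, then
    apply the learned scale (gamma) and shift (beta)."""
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

# Activations with an arbitrary scale and offset get pulled back
# to mean 0, standard deviation 1, feature by feature.
rng = np.random.default_rng(0)
h = rng.standard_normal((128, 16)) * 3.0 + 5.0
out = batch_norm(h)
```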

So, orthogonal good. And "batch normalization" increases orthogonality, so people do it. Normalization is in fact done from the very start, with the initialization step, where all the parameters are initialized to random values. They're initialized to random values, but then they are normalized, and this helps keep everything orthogonal right from the start.

Paradoxically, however, batch normalization causes a problem called "exploding gradients". When reducing the amount of variance on the forward pass, it can actually get amplified on the backward pass, when the backpropagation algorithm updates the parameter values. The deeper the neural network, the greater the sensitivity to small changes in the parameters. Also there's that word "batch" in "batch normalization". Normalization is done in batches because the entire training set is too big, usually. But if a batch is not statistically representative of the entire training set, then the batch normalization process itself introduces noise into the process. This noise can get amplified in the backward pass and, because the effect is cumulative across layers, has a greater effect the deeper the neural network.

Various attempts to mitigate the exploding gradients problem have been invented, but those can paradoxically cause the reverse problem -- "vanishing gradients". What to do?

What the researchers here decided to do was to try to come up with a metric to track "orthogonality" directly. The metric they invented they call the "isometry gap". It is calculated from the Gram matrix X-transpose-times-X of a matrix X with d columns: the numerator is the determinant of that matrix taken to the dth root, the denominator is the trace of that matrix divided by d, and the isometry gap is the negative logarithm of the ratio. In symbols: -log( det(X^T X)^(1/d) / (tr(X^T X)/d) ).

That's probably not intuitive at all, so here's the intuition they describe for it: first think of the columns of X as representing a parallelogram, but with more dimensions -- the higher-dimension equivalent of a parallelogram. (This is called a "parallelepiped". Going from 2 to 3 dimensions, picture a box but allowing side lengths on some dimensions to be any length and allowing any angles.) This is formed simply by interpreting the columns of X as vectors. Now you can think of the determinant in the numerator as equivalent to the volume squared of this higher-dimension equivalent of a parallelogram.

This brings us to the trace term in the denominator. The intuition is that this represents "the sum of squared norms of the columns of X", or to put it in more everyday language, you're getting a combined measure of how far all the columns are from the origin. This is a bit hard to see because, first, the word "norms" here means "magnitudes" (distances from the origin) and has nothing to do with the word "normal" we've been using until now. And second, you wouldn't (or at least I wouldn't) guess this from how it's calculated. The X-transpose-times-X operation gives you a square matrix whose sides are both the number of columns of X -- the same number we've been calling "d" -- while the number of rows disappears (in linear algebra, this is known as a "Gram matrix"). Next, the trace operation sums up the diagonal elements of that matrix. The two operations put together sum up squares of elements of the original matrix X, like terms in the Pythagorean distance formula.

Anyway, the result of taking the ratio of these two terms, plus all the other details I'm skipping over -- the root d and divide by d and negative logarithm and everything -- is a number that "provides a scale invariant notion of volume and isometry."

"On the one hand, if there is any collinearity between the columns, the volume will vanish and the isometry gap will be infinity." "On the other hand, isometry gap = 0 implies X-transpose-times-X is a scaled identity matrix." So you can see here the intuition that this number = 0 means perfect orthogonality, and infinity represents "collinearity between the columns" and thus "rank collapse".
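The isometry gap is straightforward to compute. Here's a numpy sketch based on the description above, which you can check against the two extremes (orthonormal columns give 0; collinear columns give infinity):

```python
import numpy as np

def isometry_gap(X):
    """-log( det(X^T X)^(1/d) / (tr(X^T X)/d) ), where d is the number
    of columns of X. Zero when the columns are orthogonal with equal
    norms; infinite when any columns are collinear (rank collapse)."""
    d = X.shape[1]
    G = X.T @ X                        # the Gram matrix
    sign, logdet = np.linalg.slogdet(G)
    if sign <= 0:                      # zero volume: rank has collapsed
        return np.inf
    return -(logdet / d - np.log(np.trace(G) / d))

# Orthonormal columns (via QR) vs. identical (collinear) columns.
Q = np.linalg.qr(np.random.default_rng(0).standard_normal((6, 3)))[0]
collinear = np.ones((6, 3))
```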

The next concept we have to introduce is the concept of a Lyapunov function. These actually come from the realm of differential equations. The idea is that a differential equation can have an equilibrium point, but the equilibrium point may be stable or unstable. For certain classes of differential equations, if you can prove there exists a Lyapunov function, then you can prove the equilibrium of the differential equation is stable.

While the mathematical formalism (linked to below) is beyond my understanding of differential equations, the intuition here is that we want to invent a neural network such that the isometry gap function is always decreasing. Lyapunov functions can be thought of as "energy levels" for the differential equation which can be thought of as "wanting" to go to lower energy levels. The gradient of the Lyapunov function has to relate to the vectors of the differential equation. (See video explainer below for this intuition). If a function can be shown to be a Lyapunov function and shown to be decreasing over iterations of a system, that means that system will stabilize. And here, it will stabilize in a state with orthogonality in the input vectors at every layer, preventing rank collapse.

At this point we've established a metric for rank collapse and can demonstrate that batch normalization prevents rank collapse, but what about the problem of exploding gradients? They say that to avoid exploding gradients, you first make the number of training examples in each batch the same as the number of dimensions, and then you make sure all the parameters are orthogonal at initialization.

They say they need two main modifications to avoid gradient explosion: first, n = d, meaning the number of training examples in a batch is the same as the number of dimensions, and second, that the weights W for each layer l are random orthogonal matrices. Orthogonality here means W-transpose-times-W gives you the identity matrix, and the random matrices are drawn according to the Haar measure.
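For what it's worth, drawing a random orthogonal matrix from the Haar measure is a standard trick: QR-decompose a matrix of Gaussian random numbers, then correct signs using R's diagonal. A numpy sketch (scipy.stats.ortho_group does the same job):

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Draw a d x d orthogonal matrix distributed according to the Haar
    measure: QR-decompose a Gaussian matrix, then flip column signs by
    the signs of R's diagonal (a standard construction)."""
    Z = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))   # column-wise sign correction

W = haar_orthogonal(8, np.random.default_rng(0))
```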

The Haar measure comes from group theory (the mathematics of symmetry groups -- in this case "locally compact topological groups"), and in looking it up I immediately landed on a video about Lie groups which was incomprehensible to me, as well as the formal definition from Wolfram (linked to below). From what I can tell, the reason group theory is being brought into this is that group theory is immensely concerned with maintaining invariant properties no matter what operations are performed. In this case, what we want to guarantee remains invariant is orthogonality. As best I can tell, the idea is to construct orthogonal matrices in such a way that they remain (provably) orthogonal under matrix multiplication. The Haar measure over the orthogonal group enables this. They say something called Weingarten calculus is required to do the calculations.

In searching for information on Weingarten calculus, I came across a 14-page article and what is essentially a 49-page book, both of which you can download in PDF format. And both of which I found incomprehensible.

As for that bit about making the number of training examples in each batch the same as the number of dimensions, I speculate that that may have been a necessary assumption to make their mathematical proofs doable, but that it might not matter in the real world.

So, to sum up: it looks like the problems of exploding gradients and vanishing gradients have been solved. From this we should expect that in the not-so-distant future, neural networks will become incredibly deep. Wait, aren't they incredibly deep already? Well, they're going to become incredibly *incredibly* deep. With the problems of rank collapse, exploding gradients, and vanishing gradients solved, there will be essentially no limit to how deep a neural network can become.

Having said that, this paper seems to prove such a thing is possible but not actually do it. Yes, the researchers created bare-bones neural networks to demonstrate that it works, but the heavy lifting will have to be done by others in the future. The problems that need to be solved include: how to guarantee this system will work for a variety of activation functions. The work here only addressed linear transformations. Well, the mathematical proofs only applied to linear transformations. They did experiments with sin and tanh activation layers (appendix E starting on page 30). Their theory should be extendable to activation layers. Maybe these researchers will do that, or maybe that work will fall to others. Furthermore, this work relies on very advanced mathematics, such as Weingarten calculus, and that will have to be incorporated into industry-standard tools like PyTorch to become widely available to developers who want to make ultra-deep neural networks.

The key to it all seems to be to initialize all the parameters of the neural network such that they are both "orthogonal" for all inputs at every layer, and guaranteed to remain orthogonal throughout the lifetime of the neural network, no matter what transformations are done to the parameters during the neural network's training. Exactly how that is accomplished, I can't fully explain to you because it involves mathematics more advanced than I'm able to understand.
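As a toy illustration (mine, not the paper's) of why orthogonality matters here: an orthogonal matrix preserves the length of every vector it multiplies, so a deep stack of orthogonal layers passes a signal through without shrinking or blowing it up. That is exactly the exploding/vanishing problem in miniature.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 500

def random_orthogonal(n):
    # QR trick for a Haar-distributed orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

x0 = rng.standard_normal(n)
x = x0
for _ in range(depth):
    x = random_orthogonal(n) @ x  # each "layer" is an orthogonal map

# After 500 layers the vector's norm is unchanged (up to floating-point
# error): the signal neither explodes nor vanishes
print(np.isclose(np.linalg.norm(x), np.linalg.norm(x0)))  # True
```

Of course, a real network also has nonlinear activations between layers, which is exactly the part the paper's proofs don't yet cover.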

What are the implications of all this? Currently, neural networks are limited to a few hundred layers, and even then they "cheat" by having inputs that skip layers. (These are called residual networks.) If it's really true that the exploding/vanishing gradients problem has been solved, then once that has been incorporated into the toolchain, e.g. PyTorch, we could be seeing neural networks with thousands or millions of layers. And vastly more intelligent than today's neural networks.
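The residual "cheat" is simple to show in code: the layer's output is added back to its input, giving gradients a direct path backward past the transformation. A generic sketch (not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal residual block: the input 'skips' past the learned
    transformation and is added back in, so even if the transformation's
    gradient shrinks toward zero, the identity path carries the gradient
    through unharmed."""
    return x + w2 @ relu(w1 @ x)

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
w1 = rng.standard_normal((n, n))
w2 = rng.standard_normal((n, n))
y = residual_block(x, w1, w2)
print(y.shape)  # (8,)
```

Stacking many such blocks is how today's few-hundred-layer networks stay trainable; the orthogonality approach promises depth without needing the skip path.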

If people can get the number of layers into the billions, then neural networks will be approaching the size of the brain, which has ~100 billion neurons and ~1,000 trillion synapses. But considering current neural networks are in the mere hundreds of layers and are already showing considerable intelligence, I'm guessing AI systems will exceed human intelligence well before the number of layers extends into the billions.

Just to be clear, I'm the one speculating on thousands or millions (or billions) of layers, not the authors. (They never make any such claims.) I'm speculating on, if the problem of exploding/vanishing gradients is really and truly solved, what does that imply? I was actually thinking, billions of layers means there are layers billions of connections away from sensory input. I don't think, even though the brain has ~100 billion neurons, that any of them are billions of connections away from sensory input. At least, I would be very surprised if that was the case. So I think, in practice, nobody will ever make a neural network with billions of layers.

On the other hand, no one should ever say "never". Maybe someday neural networks will have billions of layers and be "superintelligences" that are as much smarter than us humans as we are smarter than ants.

ProPublica has a new "Nonprofit Explorer". "1.9M active nonprofits", "18M tax filings", "$3.6T total revenue".

CRISPR has been authorized for use as a gene therapy for sickle-cell disease and transfusion-dependent beta-thalassemia by the UK's Medicines and Healthcare products Regulatory Agency (MHRA). Not the FDA because this is the UK we're talking about.

"I am pleased to announce that we have authorised an innovative and first-of-its-kind gene-editing treatment called Casgevy, which in trials has been found to restore healthy haemoglobin production in the majority of participants with sickle-cell disease and transfusion-dependent beta-thalassaemia, relieving the symptoms of disease."

"Casgevy is designed to work by editing the faulty gene in a patient's bone marrow stem cells so that the body produces functioning haemoglobin. To do this, stem cells are taken out of bone marrow, edited in a laboratory and then infused back into the patient after which the results have the potential to be life-long."

"How California could pave the way for an industry standard EV diagnostic system."

"If your gas-powered car was manufactured after 1991, it is most likely equipped with a mandatory On Board Diagnostic (OBD) system -- a computer that monitors the health of the engine and various other components, effectively functioning like a doctor for your car."

"For electric cars, a standardized OBD system has been notably absent."

"It seems like the authorities have taken this matter seriously. From 2026, California Air Resources Board's Advanced Clean Cars II program would require automakers to equip their EVs with a standard diagnostic system, similar to the OBD II in internal combustion engine cars."

What is Q*? Did OpenAI discover a vastly more powerful new algorithm and did that play a role in precipitating the chaos at the company?

Where the term "Q*" seems to come from is, first, in reinforcement learning, there's typically a "v" function that represents the value of a state, and a "q" function that represents the value of a state combined with an action taken in that state. And then second, there's an algorithm called A* that's been around for decades in graph theory and finds an optimal path through a graph efficiently.
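For readers who haven't seen A*, here's a minimal textbook sketch (mine, with nothing to do with whatever OpenAI may have built): it always expands the node with the lowest f = cost-so-far + heuristic estimate, and finds an optimal path whenever the heuristic never overestimates the remaining cost. The tiny graph and heuristic values below are made up for illustration.

```python
import heapq

def a_star(graph, h, start, goal):
    """Find a cheapest path from start to goal.
    graph: node -> list of (neighbor, edge_cost)
    h: node -> heuristic estimate of remaining cost to goal."""
    # Each frontier entry: (f = g + h, g = cost so far, node, path)
    frontier = [(h[start], 0, start, [start])]
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        # Skip if we've already reached this node at least as cheaply
        if node in best_g and best_g[node] <= g:
            continue
        best_g[node] = g
        for nbr, cost in graph.get(node, []):
            heapq.heappush(
                frontier, (g + cost + h[nbr], g + cost, nbr, path + [nbr])
            )
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
h = {"A": 2, "B": 1, "C": 0}
print(a_star(graph, h, "A", "C"))  # (2, ['A', 'B', 'C'])
```

The direct A-to-C edge costs 4, but A* finds the cheaper A-B-C route (total cost 2) without exhaustively exploring everything, because the heuristic steers the search.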

So the speculation is that OpenAI discovered an algorithm that turbocharges the ability of large language models (LLMs) to reason logically. It would replace the cumbersome "chain of thought" (or "tree of thought") process, where the user has to say "think step by step" or manually step the LLM through each step, with a process where the LLM itself thinks step by step and incorporates its own learning mechanisms at each step, so that with every step it gets better at moving toward the goal.

This is all speculation (from David Shapiro). We'll probably find out before too long what Q* is really all about. The rumor is that the Q* algorithm makes GPT much, much better at solving math problems (without the help of an external math system like Wolfram|Alpha).