Boulder Future Salon

"Data centers achieve low Power Usage Effectiveness by operating on larger shares of renewable energy, like solar and wind. Facebook's low Power Usage Effectiveness is even more impressive when you consider that it maintains 18 data center campuses across the world, which occupy a total of 40 million square feet, all run solely on renewable energy. Many data centers operated by the most visible digital companies are similar. Google, Amazon, Microsoft, and Netflix all invest in renewable energy to fuel their data centers, but nowhere near 100% of the time. Most are still depending upon fossil fuels some of the time. Meanwhile, the majority of data centers in the United States are run by companies you probably haven't heard of. The six largest, publicly-traded data center operators in the United States are Equinix, Digital Realty, CyrusOne, CoreSite, QTS Realty Trust, and Switch Inc. Equinix alone maintains over 85 individual data center campuses. By size, the top ten largest data centers in the United States are mostly run by these relatively unknown companies and serve private industry and government, not consumers, except for two: Facebook, sixth on the list, and Microsoft, ninth. Ranked fifth is the NSA's largest data center known as Bumblehive." "According to Nature, data centers 'use an estimated 200 terawatt hours each year."

Altering emotions in video footage with AI. File under "Deepfakes". The way the system works is, it first performs face detection to find the face, then segments the face and generates a 3D model, then determines expression parameters for the face and captures the original lip movements, then feeds all this into the "manipulator" which alters the expression based on the emotion label you put in ("surprised"), then outputs the resulting 3D model which gets rendered by a "neural face renderer".
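To make the pipeline concrete, here's a structural sketch in Python. Everything in it -- the function names, the data shapes, the placeholder dicts -- is my own invention, not the researchers' actual code; it just shows how the stages chain together.

```python
# Structural sketch of the emotion-manipulation pipeline described above.
# All function names and data shapes are hypothetical placeholders, not the
# authors' actual API; each stub just threads a dict through the stages.

def detect_face(frame):
    # stage 1: face detection
    return {"bbox": (0, 0, 128, 128), "frame": frame}

def fit_3d_model(face):
    # stage 2: segment the face and fit a parametric 3D model
    return {"mesh": "3d_model_params", "face": face}

def extract_expression_and_lips(model):
    # stage 3: expression parameters plus the original lip motion,
    # captured so speech is preserved
    return {"expression": [0.1, 0.2], "lips": [0.5, 0.4], "model": model}

def manipulate(params, emotion_label):
    # stage 4: the "manipulator" alters the expression toward the target
    # emotion while keeping the captured lip parameters fixed
    out = dict(params)
    out["expression"] = [emotion_label, *params["expression"]]
    return out

def neural_face_render(params):
    # stage 5: a "neural face renderer" turns the edited 3D model into pixels
    return {"rendered_frame": True, "params": params}

def edit_emotion(frame, emotion_label="surprised"):
    face = detect_face(frame)
    model = fit_3d_model(face)
    params = extract_expression_and_lips(model)
    edited = manipulate(params, emotion_label)
    return neural_face_render(edited)

result = edit_emotion(frame="frame_0")
```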

Well, they say they capture the original lip movements, but to me, the lip movement on the result seems wrong for the words people are speaking. They also have a system where instead of using an emotion label, you give it a "reference clip" and it matches the expression in the "reference clip", but again, at least to me, the lip movement seems wrong for the words people are speaking. Their "speech preserving loss function" wasn't good enough. Once again, machines have trouble simulating the subtleties of real humans.

"It's sometimes possible for scientists to predict the behavior of an animal from a knowledge of its connectome." A "connectome" is a 3D map of the way neurons are connected, made by slicing a piece of brain into thousands of thin slices, staining them with heavy metals, imaging them under an electron microscope, and assembling the results into a 3D brain map. By overlaying "functional" imaging on the maps, researchers can learn how the network of connections fires during complex behaviors. It was used to predict the mating behavior of the roundworm C. elegans.

"In addition to explaining the underpinnings of behaviors, connectomics studies can also reveal subtle details about how those behaviors are wired into brains." What they're getting at here is different individuals have different connectomes, even if they're genetically identical, and the same individual has different connectomes over the course of their life, because the connections between neurons dramatically reorganize themselves between birth and adulthood, even in C. elegans. By comparing the connectomes of eight genetically identical roundworms ranging between larval and adult stages, they were able to figure out what connections were consistent from animal to animal and what neural activity was essential for survival.

Human brains are too big for this process to scale up to them. But in human brains, new types of cells never seen in other animals have been found, such as neurons with axons that curl up and spiral atop each other and neurons with two axons instead of one. No one knows if these are normal or one-offs due to the unique history and genetic makeup of the one person the tiny snippet of human brain came from. "If they could map equivalent samples from 100 human brains, then they would get some clarity on these unknowns, but at 1.4 petabytes per brain, that is unlikely to happen anytime soon."

"The human brain is often described in the language of tipping points: It toes a careful line between high and low activity, between dense and sparse networks, between order and disorder. Now, by analyzing firing patterns from a record number of neurons, researchers have uncovered yet another tipping point -- this time, in the neural code, the mathematical relationship between incoming sensory information and the brain's neural representation of that information.

"In search of an explanation, they turned to previous mathematical work on the differentiability of functions. They found that if the power law mapping input to output decayed any slower, small changes in input would be able to generate large changes in output."

"Conversely, if the power law decayed any faster, the neural representations would become lower-dimensional. They would encode less information, emphasizing some key dimensions while ignoring the rest."

A developer using OpenAI's GitHub Copilot says, "It feels a bit like there's a guy on the other side, which hasn't learned to code properly, but never gets the Syntax wrong, and can pattern match very well against an enormous database. This guy is often lucky, and sometimes just gets it."

He then goes on to complain about Grammarly, which he uses as a non-native English speaker, but it's not as good as GPT-3 (the underlying architecture of Copilot, trained on software source code rather than English text). "As a result of using the Copilot, I'm now terribly frustrated at any other Text Field. MacOS has a spell-checker that is a relic from the 90s, and its text-to-speech is nowhere near the Android voice assistant. How long will it take before every text field has GPT-3 autocompletion? Even more, we need a GPT-3 that keeps context across apps."

NeuroGen is a commercial service that uses AI to generate art from short text descriptions.

AI training speed is increasing faster than Moore's Law. (Not to be confused with Cole's Law: thinly sliced cabbage.) The performance of AI training is measured with a benchmark called MLPerf ("machine learning performance"), which more precisely is a set of benchmarks spanning many machine learning tasks, including computer vision, language, recommender systems, and reinforcement learning. It's developed by an industry consortium called ML Commons. Having said that, let's look at the performance increase.

"The increase in transistor density would account for a little more than doubling of performance between the early version of the MLPerf benchmarks and those from June 2021. But improvements to software as well as processor and computer architecture produced a 6.8-11-fold speedup for the best benchmark results. In the newest tests, called version 1.1, the best results improved by up to 2.3 times over those from June."

"According to Nvidia the performance of systems using A100 GPUs has increased more than 5-fold in the last 18 months and 20-fold since the first MLPerf benchmarks three years ago."

The article goes on to detail the accomplishments of industry heavyweights like Google, Microsoft, Intel, and Nvidia, as well as new startups making AI chips including Cerebras Systems, Graphcore, Habana Labs, and SambaNova Systems.

100 lessons from 1 year of AI research. Categorized as "research", "learning", "reinforcement learning", "workflow", "motivation", "support" and "mindset".

Looks like they are all pretty hard-won lessons.

"Research" has such lessons as "When evaluating ideas in papers, consider the compute and space required, and whether you can afford to try those ideas given your constraints" and "Simple, elegant ideas are preferable to complex, 'Frankenstein' ideas that can seem more like a series of hacks."

"Learning" has such lessons as "Ensure a strong mastery of foundations" and "Don't shy away from difficult topics -- break them down into more manageable components, focusing on the years-long goal of mastery."

"Reinforcement learning" has such lessons as "Consider the large number of possible hyperparameter configurations in experiment formulation" and "List all possible modes of failure after every experiment so that you can diagnose later problems easier."

"Workflow" has such lessons as "Justify experiment steps with the hypothesis it aims to answer, so that you don't waste time on experiments that don't generate insights."

"Motivation" has such lessons as "Remember that the goal of publishing should only be used for motivation and not your main objective of research."

"Support" has such lessons as "Be daring to reach out for help from seniors and more experienced people, even if it may seem trivial."

"Mindset" has such lessons as "Don't let yourself be too affected by the opportunity costs of doing research."

Automatic text summarization with machine learning -- an overview. Not a summary? Ba dum tiss.

I'll give you a summary of the article. It starts off by talking about reasons why people want to make summaries -- legal text summarization, news summarization, and headline generation, or just a desire to reduce the length of a document. Then it says, "In general, there are two different approaches for automatic summarization: extraction and abstraction."

Since machines can't do what humans do, which is to read the text entirely to develop understanding and then write a summary highlighting the main points, the extractive approach just extracts sentences from the text in their entirety. Usually this is done by some variation of latent semantic analysis used to identify semantically important sentences. Another technique uses convolutional neural networks to rank sentences.
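As a toy example of the extractive idea -- much simpler than the LSA or CNN-based rankers the article describes -- here's a frequency-based sentence scorer in Python, which scores each sentence by how many high-frequency content words it contains and then emits the top-k sentences in their original order:

```python
import re
from collections import Counter

# A minimal extractive summarizer (my sketch, not the article's methods):
# score each sentence by the average corpus frequency of its content
# words, keep the k highest-scoring sentences, and output them in the
# order they appeared in the source text.

STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "are", "and", "it"}

def summarize(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # pick top-k by score, then restore original document order
    top = sorted(sorted(sentences, key=score, reverse=True)[:k],
                 key=sentences.index)
    return " ".join(top)
```

Since whole sentences are copied verbatim, the output is always grammatical -- the classic advantage of extraction over abstraction.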

The abstractive summarization approach involves interpreting the text using advanced natural language techniques. Since this can be regarded as a sequence mapping task where the source text should be mapped to the target summary, sequence to sequence neural network models are used.

Most use an encoder-decoder model, where the original text is transformed into some internal representation, which is then in turn used to generate the summary. Some use attention-based systems. Sometimes convolutional neural networks are added to these to reduce repetition and semantic irrelevance in the summary.
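The encoder-decoder shape can be sketched in a few lines of numpy. This toy has untrained random weights, so its "summary" is gibberish, but the structure -- fold the source into an internal representation, then unroll a decoder from it -- is the real thing:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 20, 8  # toy vocabulary and hidden size (my choices)

# Untrained toy encoder-decoder (a structural sketch, not a working
# summarizer): the encoder folds the source token embeddings into a
# single internal state; the decoder unrolls from that state, emitting
# the argmax token at each step.

E = rng.normal(size=(VOCAB, D))      # shared embedding table
W_enc = rng.normal(size=(D, D))      # encoder recurrence
W_dec = rng.normal(size=(D, D))      # decoder recurrence
W_out = rng.normal(size=(D, VOCAB))  # hidden state -> vocabulary logits

def encode(tokens):
    h = np.zeros(D)
    for t in tokens:
        h = np.tanh(h @ W_enc + E[t])  # fold each source token into the state
    return h

def decode(h, steps=3):
    out, t = [], 0                     # token 0 doubles as "start of summary"
    for _ in range(steps):
        h = np.tanh(h @ W_dec + E[t])
        t = int(np.argmax(h @ W_out))  # greedy pick; real systems use beam search
        out.append(t)
    return out

summary = decode(encode([3, 7, 7, 12]))
```

An attention mechanism would replace the single state `h` with a weighted mix over all the encoder states at every decoding step.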

A variant called "Pointer-Generator" mixes copying words "pointed to" in the source text with generating words from a fixed vocabulary of 50k words. "The architecture can be viewed as a balance between extractive and abstractive approaches."
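The mixture itself is simple enough to show numerically. These are toy numbers of mine (in the real model the mixing weight p_gen is computed by the network at every decoding step): the final distribution is p_gen times the vocabulary distribution plus (1 - p_gen) times the attention weights scattered onto the source words.

```python
import numpy as np

# Numerical sketch of the Pointer-Generator mixture (toy values):
# P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on w in the source)

p_gen = 0.7                               # learned per step in the real model
p_vocab = np.array([0.5, 0.3, 0.2, 0.0])  # over a 4-word toy vocabulary
attention = np.array([0.9, 0.1])          # over 2 source positions
source_ids = [3, 1]                       # the source words, as vocabulary ids

copy = np.zeros_like(p_vocab)
np.add.at(copy, source_ids, attention)    # scatter attention onto vocab slots
p_final = p_gen * p_vocab + (1 - p_gen) * copy
```

Note that word 3 has zero probability under the vocabulary distribution but still ends up with 0.27 of the final mass purely from copying -- that's how the pointer lets the model produce rare or out-of-vocabulary source words.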

A variation on this theme uses a reinforcement learning system to guide another neural network to the most salient parts of the input.

Generative adversarial networks (GANs), such as those used to generate photorealistic images, have an analog here too: people have tried adding "adversarial processes" to reinforcement learning-based summarization systems.

Another combination of the extractive and abstractive approaches is to use an extractive system to select salient sentences and then an abstractive system to write the final summary using those.

Machine Learning for Art is a collection of tools for making art with machine learning. Models for DeepDream, neural style transfer, salient object detection (detecting foreground objects), image-to-image translation, StyleGAN2 (generates photorealistic images), super-resolution (also known as upsampling), cartoonization, semantic segmentation (pixel-by-pixel labeling of what's in an image), text-to-speech synthesis, reversible generative models and GAN inversion (going from images to parameters of a generative model rather than the other way around), processing faces (detecting faces in images, identifying people, tracking faces from image to image), photo sketching (going from photographs to sketches), lip-syncing videos, and optical flow (detecting and measuring motion in video).

You know how "automation levels" are defined for autonomous cars? Someone decided to do the same for kitchens. Level 0 is nothing, level 1 is "smart ovens and fryers remove the need to be vigilant in the cooking process and to monitor the oil temperature", level 2 is "assembling bowls, putting toppings on pizzas or stirring rice in a wok", level 3 is "starting from dough and ending into a piping hot sliced pizza" with staff there to catch mistakes, level 4 is the same but no mistakes, staff needed only for custom orders, not routine quality control, and level 5 is you can ask for "a large pie with bacon-pineapple and cheesy crust" and it's done!

Robots that reproduce! Not really like you're thinking, though. These "robots" have a biological component, made from frog cells, a species called Xenopus laevis, in fact, hence the name "Xenobots". "'We have the full, unaltered frog genome, but it gave no hint that these cells can work together on this new task,' of gathering and then compressing separated cells into working self-copies." "These are frog cells replicating in a way that is very different from how frogs do it."

Combining neural network generated faces with traditional 3D animation rendering. From Disney Research. The idea is to fill in parts that are hard to do with traditional 3D modeling, such as the eyes and the inside of the mouth, to make a fully photorealistic render. In addition, the neural rendering from the face is blended with the 3D model to make the whole face more photorealistic.

As long as it's just still photos, it looks totally photorealistic, but the photorealism stops as soon as the animations start. The traditional 3D models don't perfectly match the movement of real humans. As a result, the animations look like regular 3D movie animations, maybe just a little better. So we haven't yet reached the point where we can fire all the human actors and generate all our movies with only computers.

They allude to an "optimization technique" but don't spell out what it is, so I checked out the paper. Basically they don't train the neural network, which is StyleGAN2, on the details of the whole face, just the details of the parts they want to fill in, such as the eyes. So the face is approximately right but not exact, and the neural rendering is blended with the traditional 3D rendering. The optimization process additionally partitions the parameters, which puts further constraints on them.

Integrating attention and convolution. You know, it never ceases to amaze me how people keep surprising me with stuff I never would've thought of doing. Convolution and attention are totally different design paradigms, and I never would've expected putting them together would result in anything good. Convolution, you'll recall, is used in vision processing so you have one set of parameters that is applied all over an image, so the result is the same regardless of whether the cat is on the left side of the image or the right side of the image (in mathspeak we would say convolution is "translation invariant"). Attention, on the other hand, was invented for language processing, so that, as the output sentence is generated, different parts of the input could be "paid attention to" by the attention mechanism, allowing translations where word order is different in different languages to be accounted for and so on. These "attention-based" neural networks are called "transformers" and they have somehow escaped the domain of language translation and are now used in vision processing and lots of other things. Tesla cars rely on vision transformers (ViTs as they are called) for all their vision.

A simple way you might put convolutional neural networks and attention-based neural networks together is simply to chain them in a sequence. Split your input off into 3 copies, run each through a convolutional neural network, and then use those as your query, key, and value. (One of these days I'm going to have to really dig into transformers and understand how the parameters learned in the "query", "key", and "value" networks result in this high-level phenomenon of "attention" -- I admit right now it's not intuitive to me. For now, suffice to say, "attention" works by having 3 sets of parameters, "query", "key", and "value", and when used, the output of the "attention" system is the thing to be paid attention to. If the thing to be paid attention to is part of the original input, the "attention" is called "self-attention". We need a term for it because people have invented other stuff to be paid attention to, of course.) The output is then used to refer back to the original image to tell you what you need to pay attention to in your original input image. Oh, I forgot to mention, let's say you're making a neural network that's an image classifier. So you put in an image and it will tell you if it's a cat, or whatever.
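Here's what that chained design looks like as a numpy sketch, with 1x1 convolutions (which for a flattened image are just per-position linear maps) producing the three copies, followed by scaled dot-product self-attention. Untrained random weights; structure only, sizes are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 4  # toy: 6 image patches, 4 channels each

# Three 1x1 convolutions turn the same input into query, key, and
# value; scaled dot-product self-attention then mixes the patches.

X = rng.normal(size=(N, D))                # flattened patch features
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # the three "convolved" copies
scores = Q @ K.T / np.sqrt(D)              # how strongly each patch attends to each
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
out = weights @ V                          # attention-weighted mix of the values
```

Each row of `weights` is a probability distribution over patches, so every output position is a convex combination of the value vectors -- that's the "paying attention" part made literal.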

Anyway, this simple model isn't what they do here. Well, it almost is. In addition to what's above, they also take the 3 convolutions and feed them into a fully connected network, and then a "shift operation", which shifts features by an amount that depends on the "kernel size" of the convolutions -- larger kernels have more parameters and look at a larger portion of the image. They then take the output of this and tack it on to the output from the attention system described above.

You wouldn't think this would work but this network outperforms a boatload of others on the ImageNet classification test. (The boatload of networks have names like ResNet, AA-ResNet, SASA-ResNet, LR-Net, BoTNet, SAN, PVT, T2T-ViT, ConT, CVT, and Swin.) It also outperforms other networks on semantic segmentation, which is where the neural network assigns each pixel in an image to a semantic meaning (roadway, sidewalk, tree, building, etc).

"Does this sound familiar? Your organization wants to adopt AI. You invest in a team to identify use cases where AI can help. Based on the effort and potential impact, you prioritize a few cases for a deep dive. You line up the right data, pilot the use case. If it performs well enough, it is moved into production and becomes part of a business process."

"How often does this work out? Conventional wisdom is to cast a wide net, and start with a big list of use cases. Many will be cut because they don't seem that valuable, many are technologically impossible from the outset, don't have sufficient data, or ultimately don't demonstrate predictive power or sufficient performance. But maybe 1 in 50 make it to being deployed in production."

The reasons: AI is often an underwhelming substitute for previous models, even the best AI gives uncertain results, individual "use cases" aren't transformative to the business, and the usability of modern AI is still a nascent, largely academic field.

Imagioo is an image search engine based on OpenAI's CLIP system. It searches images from Wikipedia.

To give it a whirl, I put in the word "motorcycle" and all the pictures I got were in fact motorcycles. I tried "border" and got borders. Being a little more daring I put in "respect" just to see what would happen and I got soccer (football) players. And plaques in various languages. Go figure.
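Under the hood, CLIP-style search is just cosine similarity in a shared embedding space: the query text and every image are embedded into the same space, and search ranks images by the angle to the query. A sketch with made-up 3-dimensional "embeddings" (real CLIP vectors have hundreds of dimensions):

```python
import numpy as np

# CLIP-style retrieval sketch: toy vectors standing in for real text
# and image embeddings, which CLIP learns to put in one shared space.

def cosine_rank(query_vec, image_vecs):
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ q            # cosine similarity of each image to the query
    return np.argsort(-sims)   # indices of images, best match first

# pretend embeddings: image 2 points almost the same way as the query
query = np.array([1.0, 0.0, 0.0])
images = np.array([[0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.9, 0.1, 0.0]])
ranking = cosine_rank(query, images)
```

This also explains the "respect" result: the query embedding lands somewhere in concept space, and the nearest image embeddings -- soccer players, plaques -- are whatever Wikipedia photos happen to sit closest to it.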