About a year and a half ago, Yann LeCun realized that he had been wrong. LeCun, chief scientist at Meta’s AI lab and one of the most influential AI researchers in the world, had been trying to give machines a basic understanding of how the world works. This kind of common sense was supposed to come from training neural networks to predict what would happen next in video clips of everyday events. But it turned out that predicting the upcoming frames of a video pixel by pixel was simply too complex. LeCun hit a wall.
Hope for AGI
Now, after months of searching for what was missing, he has a bold new vision for the next generation of AI. In a draft shared with MIT Technology Review, LeCun outlines an approach that he believes will one day give machines the common sense they need to navigate the world. For LeCun, the idea could be a first step toward machines that can think and plan ahead like humans, which many refer to as artificial general intelligence (AGI). In doing so, he is also moving away from the hottest machine learning trends of the moment and reviving some old ideas that had fallen out of fashion.
But his vision is far from comprehensive; it may raise more questions than it answers. The biggest question mark, as LeCun himself states, is that he does not yet know how to build what he is describing. At the heart of the new approach is a neural network that can learn to look at the world at different levels of detail. Because this network does not need to make pixel-perfect predictions, it can focus only on the features of a scene that are relevant to the task at hand. LeCun couples this core network with another, called the configurator, which determines what level of detail the task requires and adjusts the overall system accordingly.
For LeCun, AGI will be part of how we interact with technology in the future. His vision is shaped by that of his employer, Meta, which is pushing a virtual-reality metaverse. In his opinion, in 10 or 15 years people will no longer carry smartphones in their pockets. Instead, they will wear augmented reality glasses equipped with virtual assistants that guide them through their day. “Basically, for these assistants to be of any use to us, they have to have more or less human intelligence,” he says.
“Yann has been talking about a lot of these ideas for a while,” says Yoshua Bengio, an AI researcher at the University of Montreal and scientific director at the Mila Quebec Institute. “But it’s good to see everything together in one big picture.” Bengio thinks LeCun is asking the right questions. He also finds it exciting that LeCun is willing to publish a document that contains so few answers. It’s more of a research proposal than a set of real results, he says. “People talk about these things privately, but they’re not usually made public,” says Bengio. “Because that’s risky.”
A matter of common sense
LeCun has been working on AI for almost 40 years. In 2018, he shared the Turing Award, arguably the most important prize in computer science, with Bengio and Geoffrey Hinton for their groundbreaking work on deep learning. “Making machines behave like humans and animals was the goal of my life,” he says.
LeCun believes that the brains of humans and animals run a type of simulation of the world, which he calls a world model. This model is learned in infancy and is how we manage to make good guesses about what’s going on around us. Infants learn the basics in the first few months of life by observing the world, says LeCun. It is enough for a child to see a ball fall a few times to get a feel for how gravity works.
“Common sense” is the collective term for this type of intuitive thinking. It includes an understanding of simple physical relationships: for example, knowing that the world is three-dimensional and that objects do not disappear when they are out of sight. It lets us predict where a bouncing ball or a speeding bike will be in a few seconds. And it helps us connect the dots between incomplete pieces of information: when we hear a metallic crash coming from the kitchen, we can make an educated guess that someone dropped a pan, because we know what kinds of objects make that noise and when they make it.
In short, common sense tells us which events are possible and which are impossible—and which events are more likely than others. It allows us to anticipate the consequences of our actions and make plans—ignoring irrelevant details. But it’s difficult to teach machines common sense. Today’s neural networks would have to be shown thousands of examples before they start recognizing such patterns.
So, in many ways, common sense boils down to the ability to predict what will happen next. “That’s the essence of intelligence,” says LeCun. Because of this, he and several other researchers have used video clips to train their models. However, with previous machine learning techniques, the models had to predict exactly what would happen in the next frame and generate it pixel by pixel.
“Imagine holding up a pen and releasing it,” says LeCun. Common sense tells us that the pen will fall, but not exactly where it will land. Predicting that would require crunching some difficult physics equations.
That’s why LeCun is now trying to train a neural network that focuses only on the relevant aspects of the world: it is supposed to predict that the pen will fall, but not exactly how. He sees this trained network as the equivalent of the world model that living beings rely on.
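The difference between the two kinds of prediction can be made concrete with a toy example. The sketch below is purely illustrative and is not taken from LeCun's draft: the 5x5 frames, the hand-written `encode` function and both error measures are assumptions chosen to show the idea. A predictor that must commit to exact pixels is penalized when the pen lands in a different spot, while a predictor that only tracks an abstract feature (here, the pen's height) is right in both cases.

```python
# Toy illustration only: the frames, the encoder and the error measures are
# hypothetical and are not taken from LeCun's draft.

def make_frame(row, col, size=5):
    """Return a size x size binary image with a single 'pen' pixel."""
    return [[1 if (r == row and c == col) else 0 for c in range(size)]
            for r in range(size)]

def pixel_error(a, b):
    """Count the pixels on which two frames disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def encode(frame):
    """Hand-written 'encoder': keep only the pen's height, drop its column."""
    for r, row in enumerate(frame):
        if any(row):
            return r
    return None

# The pen is released at the top and ends on the floor (last row), but
# whether it lands in column 1 or column 3 is unpredictable.
future_a = make_frame(row=4, col=1)
future_b = make_frame(row=4, col=3)

# A pixel-level predictor has to commit to one exact frame.
prediction = make_frame(row=4, col=1)

pixel_err_a = pixel_error(prediction, future_a)
pixel_err_b = pixel_error(prediction, future_b)

# In the abstract space both futures look the same: the pen is on the floor.
feature_err_a = abs(encode(prediction) - encode(future_a))
feature_err_b = abs(encode(prediction) - encode(future_b))

print(pixel_err_a, pixel_err_b)      # 0 2
print(feature_err_a, feature_err_b)  # 0 0
```

In a real system the encoder would be learned rather than written by hand, and choosing which details to discard is exactly the hard part that the article describes as unsolved.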
LeCun explains that he has built an early version of this world model that can do basic object recognition. Now he is working on training it to make predictions. But how the configurator, which is also necessary, is supposed to work remains a mystery. LeCun envisions this neural network as the controller for the entire system. It would decide what kind of predictions the world model should be making at any given point in time and what level of detail it needs to focus on to make those predictions possible. It would also adjust the world model as needed.