Common Sense and Natural Interaction: A Turning Point for Robotics?

It feels like we’re at a critical stage in robotics at the moment. We’re a year or two into the rapidly scaling push to leverage recent advances in AI and integrate them with embodied robots: not a long time in historical terms, but long enough to get an updated feel for where things may be going.

In just the last two weeks there have been major new demos, partnerships, funding announcements and startups: a new video demo overnight from Figure (in collaboration with OpenAI); a new startup, Physical Intelligence, founded by researchers out of Stanford, UC Berkeley, DeepMind and elsewhere; ongoing development by Tesla on Optimus and by Agility; and non-humanoid players like Covariant.

I got a few of my thoughts onto video, particularly emphasizing that:

➡️ the two combinatorial game changers here are the provision of a workable “common sense” for robots about the world around them, plus a sophisticated, natural, iterative interaction experience with the people they’re working with or working for

➡️ I think there’s currently too much focus on raw robot capability, and not enough on the flexibility in how (whatever their capability) these technologies can be useful. Unlike unforgiving applications such as autonomous vehicles, there’s a lot more flexibility in what combination of capabilities and weaknesses makes for a commercially viable product. In some applications, robots can work slower, more clumsily, make regular mistakes, be supported by teleop etc… and still be of sufficient use – contrast with AVs, where viability across these dimensions is much more restricted

➡️ some of the sense of progress is masked / not readily apparent because of “more contrived” past demos, where heavy scripting (sometimes outright fraudulent), 10x speed-ups, video cuts & offscreen teleop made robots look better than they were – these new demos may still be cherry-picked, but they are likely much more “representative” of what’s actually happening

➡️ there are valid concerns around moving to more realistic, cluttered “real world” environments, and around the technical problems of grasping, manipulation etc… but these new AI-driven capabilities (understanding, interaction etc…) may enable commercially acceptable workarounds, even if the way a robot achieves its task is not graceful and it’s still not as capable or fast as a person

➡️ some of the current expectation flow around this tech is: companies make big predictions / hype about specific use cases – those specific use cases (may) end up being underwhelming – people react by disregarding the entire technology’s promise – that last step is not something I’d advise here: the initial use-case guesses can be wrong and the technology can still be incredibly disruptive – just not always in the ways these companies are spruiking.

#Figure update: https://www.youtube.com/watch?v=Sq1QZB5baNw

Full Video Notes

If you’ve been keeping an eye on the robotics news in just the last couple of weeks, you will have seen a large number of major funding announcements, new startups, new collaborations, and in particular, new robot demos hitting the internet. I want to spend a little bit of time today going through some of these developments, providing my perspective on them, and especially on some of the responses, both positive and negative, that we’ve seen in the community.

So, what’s happening? Well, companies who develop robot platforms, humanoids and other types of robots, have been making a lot of progress in pairing and integrating these systems with recent (or recently popularized) developments in large language models, foundation models, and the so-called new breed of artificial intelligence systems. This is really significant for two reasons. The first is that robots have typically struggled to act as if they have some degree of common sense, and these systems give robots a lot of common sense essentially for free (well, not without some tweaking and modification). Now, we can argue philosophically about whether they really understand things, but from a commercial point of view, what really matters is that they can act as if they have a common-sense understanding of the world.

There have been huge advances in the ability of robots to act in this manner, both in terms of understanding tasks, objects and relationships, and in visually assessing what is happening in a scene. The second aspect of this advance is that these large language models, and the interfaces on top of them like ChatGPT, which you may have played with, as well as more sophisticated or specialized versions, give these systems a very intuitive way to interact with people, whether co-workers or simply the people who own these robots, in a sophisticated, iterative manner. That’s something we really haven’t had before in robotics at this scale. And it’s the combination of these two things happening at the same time that matters, because no robot is going to be perfect at everything it does. One of the obvious remedies is having people tell it, teach it, or correct it, and having that mechanism available through these sophisticated interactive interfaces. It’s that combination of things that makes this really groundbreaking.
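To make that “iterative interaction” point a bit more concrete, here’s a minimal sketch of the kind of correction loop I’m describing. To be clear, this is purely illustrative: `plan_actions` and `execute` are invented stand-ins for whatever planner and controller a real system would use, not any actual product’s API.

```python
# Hypothetical sketch of a language-in-the-loop correction cycle.
# All function names are invented stand-ins, not a real robot or LLM API.

def plan_actions(instruction: str, scene: str, feedback: list[str]) -> list[str]:
    """Stand-in for a foundation-model planner: turns a natural-language
    instruction (plus accumulated human corrections) into action steps."""
    # A real system would query a vision-language model here.
    return [f"find objects relevant to: {instruction}",
            f"act in scene '{scene}' using hints: {feedback}"]

def execute(step: str) -> None:
    """Stand-in for the robot's low-level controller."""
    print(f"executing: {step}")

def interactive_session(instruction: str, scene: str) -> None:
    feedback: list[str] = []
    while True:
        for step in plan_actions(instruction, scene, feedback):
            execute(step)
        reply = input("Anything to correct? (blank = done) ")
        if not reply:
            break
        # The human's correction becomes context for the next attempt,
        # which is the iterative interaction doing the work.
        feedback.append(reply)

interactive_session("clear the dishes from the table", "kitchen")
```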

So if you’ve been online, you probably will have seen a variety of videos of robot arms and humanoid robots doing various tasks. Now, one way to emphasize how impressive a lot of these advances are is to look critically at past demos. A lot of the populist portrayals of robots, especially in the last 10 or 20 years, have involved videos that were highly scripted: videos that were shot 58 times and then cut together from the takes where the robot happened to work, videos that were sped up, videos where it was not initially disclosed that there was substantial teleoperation or an offscreen operator, and, perhaps most infamously, videos where the robot was basically reading from a rigid script and no genuine interaction was occurring. All of these have had a lot of attention and exposure.

So it’s understandable that people look at some of these more recent videos and perhaps aren’t as impressed, because they’ve seen this stuff faked before. But these new videos, whilst I’m sure there is still some cherry-picking, and they’re showing us the better runs (even the poorer runs they show are probably cherry-picked in and of themselves), are far more organic and genuine in showing what the robot can do in the context in which it’s presented.

Now, looking at these videos, we can notice a few things. These are fairly clean, uncluttered environments. The objects being handled, like the dishes and plates, are relatively lightweight, and there aren’t a lot of distractor objects or clutter for the robot to navigate around. These are valid observations. We will see over time whether these systems can become more capable at dealing with realistic, more challenging environments, but what they can already do is quite impressive, especially given there is a lot less cheating going on in these videos compared to, say, 10 or 20 years ago.

But the most important points I want to make in this video are these. Firstly, there has been a lot of backlash around the proposed use cases of some of these robots. One of the main proposed use cases is replacing workers in high-throughput factory and manufacturing environments. There’s obviously a substantial financial incentive for the companies to pursue this, but they are attempting to get robots to do something that people are already very good at, very efficient at, and very quick at. Whether that pans out or not is not actually the ultimate point: the technology, and its potential disruptive impact, is separate from the predictions of the people who are creating it.

They have a hunch and a financial incentive to target certain areas, but that doesn’t necessarily mean they’ll be correct. And just because they’re not correct does not mean the technology won’t be incredibly (or at least potentially) disruptive in other areas.

One of the areas that has had a lot of backlash in recent years has been autonomous vehicles: many billions of dollars have been invested, and there’s still no firm indication that this will be a widely deployed, ubiquitous technology in the near future. Now, there are a lot of interesting differences to note between these recent developments in humanoid robotics paired with large language models and other technologies, versus autonomous vehicles. The first is that for many robots you can employ crutches that are economically acceptable.

So, let’s take the example of a relatively expensive robot that does the cleaning and tidying in your house. That robot has a lot of options that can compensate for it being good but not perfect at what it does. First of all, it can work slowly, which is the model that robot vacuum cleaners use: it can work over many hours to achieve something that a human cleaner might do in half an hour or an hour, and that’s still acceptable in many cases from a user perspective. It is also likely acceptable to have a remote teleoperation mode that kicks in when the robot gets into trouble; after all, the current model already involves human cleaners being inside your house and seeing everything in it. The extent to which this is economically viable will depend on the context and on the utility and utilization factor of the robot.
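As a rough sketch of what that fallback might look like structurally (the confidence threshold and function names here are made-up assumptions, not any company’s actual system):

```python
import random

# Toy sketch of an autonomy-with-teleop-fallback loop.
# The threshold and all function behaviour are invented for illustration.

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cutoff: below this, escalate to a human

def attempt_autonomously(task: str) -> float:
    """Stand-in for the robot attempting a task; returns self-assessed confidence."""
    return random.uniform(0.4, 1.0)

def request_teleop(task: str) -> None:
    """Stand-in for handing control to a remote human operator."""
    print(f"[teleop] remote operator completes: {task}")

def run(tasks: list[str]) -> None:
    for task in tasks:
        confidence = attempt_autonomously(task)
        if confidence >= CONFIDENCE_THRESHOLD:
            print(f"[auto] completed '{task}' (confidence {confidence:.2f})")
        else:
            # The occasional human touchpoint is the economic crutch:
            # it only has to be cheap relative to the value of the task.
            request_teleop(task)

run(["vacuum the lounge", "load the dishwasher", "untangle the cables"])
```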

We can see that this teleoperation-fallback model is the one some autonomous vehicle companies settled on, rather than automating everything entirely: a highly automated process with regular human touchpoints to check in on the robotic system. Utilization rate is another key factor here. These robot platforms will come down in hardware price quite quickly, and maybe you’ll pay a service subscription for them, but if one is stuck in just your home and not doing much for most of the day, you’re not utilizing it a lot, and it could be a relatively expensive asset to have. This is a concern for robots deployed in bursty activities, like processing agricultural produce, and in a lot of other domains. And so people are looking at ways to either make the platform cheap enough that this isn’t a concern, achieve a higher utilization rate, or get the system to do more useful things, which means it’s used more of the time.
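A toy back-of-envelope calculation shows why utilization dominates the economics here. Every number below is invented purely for illustration:

```python
# Cost per useful hour of a robot, amortized over its lifetime.
# All figures are made up to illustrate the effect of utilization.

PLATFORM_COST = 30_000   # hypothetical purchase price, in dollars
LIFETIME_YEARS = 5

for useful_hours_per_day in (1, 4, 12):
    utilization = useful_hours_per_day / 24
    useful_hours = LIFETIME_YEARS * 365 * useful_hours_per_day
    cost_per_useful_hour = PLATFORM_COST / useful_hours
    print(f"{useful_hours_per_day:>2} useful h/day ({utilization:.0%} utilization): "
          f"${cost_per_useful_hour:.2f} per useful hour")
```

With these made-up numbers, the same hardware goes from roughly $16 per useful hour at one hour a day to under $1.50 at twelve, which is why bursty workloads are a worry and higher utilization is so attractive.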

There’s been a lot of commentary about the fact that the tactile sensing, grasping, and manipulation aspects of these systems are not directly solved or addressed by these recent developments in AI. But the key here is that they may be indirectly improved: the robot can compensate for clumsiness, or for not being as dextrous as a human, by changing how it does the task, doing the task more slowly, or using coping strategies it can now potentially devise because it has a richer, more functionally relevant understanding of the environment in which it’s operating.

That’s not the whole story with all these new technological developments. One of the key things to watch out for is that a lot of people will be assuming a continuous, fairly rapid rate of progress in all the different components of what these systems are doing. In some areas, that faith in progress will be justified: there will likely be very rapid progress in some aspects, and you can already see speed-ups compared to demos from only a few weeks or months ago. But in other areas, for whatever reason, performance will plateau, and the key will be whether there are widespread, useful deployments of these technologies that can make use of the imperfect skills and capabilities of these platforms.

In the digital domain, there are lots of examples where AI has been deployed that is not particularly sophisticated or capable, yet is still commercially viable. This is also going to be true in the robotics domain, but it may not be in the target domains, such as high-throughput manufacturing and logistics, that many of the companies developing these systems cite as their primary motivation. It may be in other domains: domains where regular interaction with people is more acceptable, where the speed of completing tasks is not as critical, and where making the occasional error is not catastrophic and doesn’t shut down operations or result in an injury, for example.

So it’s a very exciting space, and it’s going to develop very quickly. There are a number of key players operating in the field right now, and it’ll be interesting to see the responses from some of the other players, because the pressure is on after some of these recent video demonstrations. The real test will come when these companies start deploying at scale in some of the target environments where they envisage these robots making a big difference. And they won’t really be able to hide; it will be very clear whether these systems are sufficiently capable and useful to really move beyond the pilot stage and go into enduring deployment or not.

We’ll also know over the next one to two years which areas have improved rapidly and continue to improve, and which have plateaued somewhat. We can predict what they might be, but we really don’t know, and I doubt the people in these companies know for sure; they’re trying to improve everything. In a year or two, though, we’ll have a much better snapshot of where things are at.

So it’s a super exciting space. Don’t let the hype carry you away, but also remember: the fact that companies may be wrong about the predicted initial use cases of these robots doesn’t mean the technology won’t be incredibly transformative; it just might be transformative in somewhat different ways to the initial scenarios that were pitched. Let’s watch this space together and see what happens.