Multiple AI models help robots execute complex plans more transparently
The HiP framework, developed at MIT CSAIL, draws on the expertise of three different foundation models to generate detailed plans for robots, helping them execute multi-step tasks in households, factories, and on construction sites. Image credit: Alex Shipps/MIT CSAIL
by Alex Shipps | MIT CSAIL
Boston MA (SPX) Jan 16, 2024

Your daily to-do list is likely pretty straightforward: wash the dishes, buy groceries, and other minutiae. It's unlikely you wrote out "pick up the first dirty dish," or "wash that plate with a sponge," because each of these miniature steps within the chore feels intuitive. While we can routinely complete each step without much thought, a robot requires a complex plan that involves more detailed outlines.

MIT's Improbable AI Lab, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has offered these machines a helping hand with a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, feasible plans with the expertise of three different foundation models. Like OpenAI's GPT-4, the foundation model that ChatGPT and Bing Chat were built upon, these foundation models are trained on massive quantities of data for applications like image generation, text translation, and robotics.

Unlike RT-2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on a different data modality. Each foundation model captures a different part of the decision-making process, and the models then work together when it's time to make decisions. HiP removes the need for paired vision, language, and action data, which is difficult to obtain, and it also makes the reasoning process more transparent.
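To make that division of labor concrete, here is a minimal Python sketch of how three separately trained models might be composed at decision time. The class names and interfaces below are hypothetical stand-ins for the language reasoner, video model, and action model, not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    description: str  # symbolic sub-goal produced by the language model

class LanguageReasoner:
    """Stand-in for an LLM trained only on text."""
    def decompose(self, goal: str) -> list[SubGoal]:
        # A real system would query a pre-trained LLM; here the tea-making
        # example from the article is hard-coded for illustration.
        return [SubGoal("fill a pot with water"),
                SubGoal("boil the pot"),
                SubGoal("pour the water over tea leaves")]

class VisualWorldModel:
    """Stand-in for a video diffusion model trained only on video."""
    def imagine(self, subgoal: SubGoal, observation: str) -> list[str]:
        # Returns an imagined trajectory of future observations (frames).
        return [f"frame showing progress on: {subgoal.description}"]

class ActionPlanner:
    """Stand-in for an egocentric action model."""
    def act(self, frames: list[str]) -> list[str]:
        # Maps imagined observations to low-level actions.
        return [f"move toward: {frame}" for frame in frames]

def plan(goal: str, observation: str) -> list[str]:
    llm, video, actor = LanguageReasoner(), VisualWorldModel(), ActionPlanner()
    actions: list[str] = []
    for subgoal in llm.decompose(goal):               # linguistic reasoning
        frames = video.imagine(subgoal, observation)  # physical plausibility
        actions += actor.act(frames)                  # grounded execution
    return actions

print(plan("make a cup of tea", "kitchen counter with a pot and sink"))
```

Each component could, in principle, be swapped out and pre-trained on its own data source, which is the point of the compositional design.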

What's considered a daily chore for a human can be a robot's "long-horizon goal" - an overarching objective that involves completing many smaller steps first - requiring sufficient data to plan, understand, and execute objectives. While computer vision researchers have attempted to build monolithic foundation models for this problem, pairing language, visual, and action data is expensive. Instead, HiP represents a different, multimodal recipe: a trio that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.

"Foundation models do not have to be monolithic," says NVIDIA AI researcher Jim Fan, who was not involved in the paper. "This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent."

The team believes that their system could help these machines accomplish household chores, such as putting away a book or placing a bowl in the dishwasher. Additionally, HiP could assist with multistep construction and manufacturing tasks, like stacking and placing different materials in specific sequences.

Evaluating HiP
The CSAIL team tested HiP on three manipulation tasks, where it outperformed comparable frameworks. The system reasoned by developing intelligent plans that adapt to new information.

First, the researchers requested that it stack different-colored blocks on each other and then place others nearby. The catch: Some of the correct colors weren't present, so the robot had to place white blocks in a color bowl to paint them. HiP often adjusted to these changes accurately, especially compared to state-of-the-art task planning systems like Transformer BC and Action Diffuser, by adjusting its plans to stack and place each square as needed.

Another test: arranging objects such as candy and a hammer in a brown box while ignoring other items. Some of the objects it needed to move were dirty, so HiP adjusted its plans to place them in a cleaning box, and then into the brown container. In a third demonstration, the bot was able to ignore unnecessary objects to complete kitchen sub-goals such as opening a microwave, clearing a kettle out of the way, and turning on a light. Some of the prompted steps had already been completed, so the robot adapted by skipping those directions.

A three-pronged hierarchy
HiP's three-pronged planning process operates as a hierarchy, with the ability to pre-train each of its components on different sets of data, including information outside of robotics. At the bottom of that order is a large language model (LLM), which starts to ideate by capturing all the symbolic information needed and developing an abstract task plan. Applying the commonsense knowledge it finds on the internet, the model breaks its objective into sub-goals. For example, "making a cup of tea" turns into "filling a pot with water," "boiling the pot," and the subsequent actions required.
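As an illustration of this first stage, the snippet below prompts an off-the-shelf LLM through the OpenAI Python SDK to produce an ordered sub-goal list. The model choice, prompt wording, and parsing are assumptions made for illustration; the article does not specify which LLM HiP uses or how it is prompted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def decompose(goal: str) -> list[str]:
    prompt = (
        f"Break the task '{goal}' into short, ordered sub-goals a robot "
        "could execute, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative choice, not necessarily HiP's model
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content or ""
    # Keep non-empty lines, trimming simple list markers.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

print(decompose("make a cup of tea"))
# Expected shape (actual output varies run to run):
# ['fill a pot with water', 'boil the pot', 'steep the tea', ...]
```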

"All we want to do is take existing pre-trained models and have them successfully interface with each other," says Anurag Ajay, a PhD student in the MIT Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. "Instead of pushing for one model to do everything, we combine multiple ones that leverage different modalities of internet data. When used in tandem, they help with robotic decision-making and can potentially aid with tasks in homes, factories, and construction sites."

These models also need some form of "eyes" to understand the environment they're operating in and correctly execute each sub-goal. The team used a large video diffusion model, which collects geometric and physical information about the world from internet footage, to augment the initial planning completed by the LLM. In turn, the video model generates an observation trajectory plan, refining the LLM's outline to incorporate new physical knowledge.
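The sketch below caricatures this second stage: starting from noise and iteratively denoising into an imagined trajectory of observations conditioned on a sub-goal embedding. The array shapes, step count, and denoise_step placeholder are all assumptions; a real video diffusion model would run a trained network at each step.

```python
import numpy as np

def denoise_step(frames: np.ndarray, subgoal_embedding: np.ndarray,
                 t: int) -> np.ndarray:
    # Placeholder for one conditional denoising step: a trained model would
    # predict and remove noise conditioned on the sub-goal. This toy update
    # simply pulls the frames toward the conditioning signal.
    return frames * 0.9 + 0.1 * subgoal_embedding

def imagine_trajectory(subgoal_embedding: np.ndarray,
                       n_frames: int = 8, steps: int = 50) -> np.ndarray:
    # Start from pure noise and iteratively denoise into a plausible
    # sequence of future observations.
    frames = np.random.randn(n_frames, *subgoal_embedding.shape)
    for t in reversed(range(steps)):
        frames = denoise_step(frames, subgoal_embedding, t)
    return frames  # imagined observations handed to the action model

traj = imagine_trajectory(np.zeros(16))
print(traj.shape)  # (8, 16): eight imagined frames in a toy feature space
```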

This process, known as iterative refinement, allows HiP to reason about its ideas, taking in feedback at each stage to generate a more practical outline. The flow of feedback is similar to writing an article, where an author may send their draft to an editor, and with those revisions incorporated, the publisher reviews for any last changes and finalizes the piece.
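Here is a toy version of that feedback loop, under the assumption that the drafting stage can sample several candidate sub-goals and a downstream stage can score their physical feasibility; every function is a hypothetical placeholder rather than the paper's actual mechanism.

```python
import random

def propose_subgoals(goal: str, n: int = 4) -> list[str]:
    # Stand-in for sampling several candidate decompositions from the LLM.
    return [f"{goal} (candidate {i})" for i in range(n)]

def feasibility_score(subgoal: str) -> float:
    # Stand-in for the video model's estimate of how physically plausible
    # a candidate is, given what it learned from internet footage.
    return random.random()

def refine(goal: str) -> str:
    # Draft several candidates, then keep the one the downstream
    # model judges most feasible before passing it on.
    candidates = propose_subgoals(goal)
    return max(candidates, key=feasibility_score)

print(refine("fill a pot with water"))
```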

In this case, the top of the hierarchy is an egocentric action model, which uses a sequence of first-person images to infer which actions should take place based on the robot's surroundings. During this stage, the observation plan from the video model is mapped over the space visible to the robot, helping the machine decide how to execute each task within the long-horizon goal. If a robot uses HiP to make tea, this means it will have mapped out exactly where the pot, sink, and other key visual elements are, and can begin completing each sub-goal.
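One way to read this final stage is as an inverse-dynamics-style mapping: given two consecutive imagined frames, infer the action that transitions between them. The sketch below uses a simple frame difference as a stand-in for a trained egocentric action model; everything here is illustrative, not the paper's method.

```python
import numpy as np

def infer_action(frame_t: np.ndarray, frame_next: np.ndarray) -> np.ndarray:
    # A trained egocentric action model would predict the motor command
    # that turns frame_t into frame_next; this placeholder just returns
    # the frame difference as a stand-in "action".
    return frame_next - frame_t

def actions_from_trajectory(frames: np.ndarray) -> list[np.ndarray]:
    # Walk the imagined observation trajectory pairwise.
    return [infer_action(frames[i], frames[i + 1])
            for i in range(len(frames) - 1)]

traj = np.random.randn(8, 16)  # 8 imagined frames, toy 16-dim features
print(len(actions_from_trajectory(traj)))  # 7 inferred actions
```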

Still, the multimodal work is limited by the lack of high-quality video foundation models. Once available, they could interface with HiP's small-scale video models to further enhance visual sequence prediction and robot action generation. A higher-quality version would also reduce the current data requirements of the video models.

That being said, the CSAIL team's approach used only a small amount of data overall. Moreover, HiP was cheap to train and demonstrated the potential of using readily available foundation models to complete long-horizon tasks. "What Anurag has demonstrated is proof-of-concept of how we can take models trained on separate tasks and data modalities and combine them into models for robotic planning. In the future, HiP could be augmented with pre-trained models that can process touch and sound to make better plans," says senior author Pulkit Agrawal, MIT assistant professor in EECS and director of the Improbable AI Lab. The group is also considering applying HiP to solving real-world long-horizon tasks in robotics.

Ajay and Agrawal are lead authors on a paper describing the work. They are joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du '19; former postdoc Abhishek Gupta, who is now an assistant professor at the University of Washington; and former graduate student Shuang Li PhD '23.

The team's work was supported, in part, by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiatives, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).

Research Report: "Compositional Foundation Models for Hierarchical Planning"

Related Links
Computer Science and Artificial Intelligence Laboratory (CSAIL)