Combining next-token prediction and video diffusion in computer vision and robotics

مجتبی اسد

In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to analyze data and predict what to do next. For instance, you’ve likely used next-token prediction models like ChatGPT, which anticipate each word (token) in a sequence to form answers to users’ queries. There are also full-sequence diffusion models like Sora, which convert words into dazzling, realistic visuals by successively “denoising” an entire video sequence.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a simple change to the diffusion training scheme that makes this sequence denoising considerably more flexible.

When applied to fields like computer vision and robotics, next-token and full-sequence diffusion models have capability trade-offs. Next-token models can generate sequences that vary in length. However, they generate each token without awareness of desirable states in the far future (for example, they cannot steer generation toward a goal 10 tokens away) and thus require additional mechanisms for long-horizon (long-term) planning. Diffusion models can perform such future-conditioned sampling, but lack the ability of next-token models to generate variable-length sequences.
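The trade-off can be made concrete with a toy sketch. Everything in it is an assumption for illustration: the function names `next_token_rollout` and `full_sequence_denoise` are hypothetical, and a simple neighbour-averaging step stands in for a learned denoiser. It is not how ChatGPT, Sora, or the CSAIL method works; it only contrasts generating until a stop token appears with refining a fixed-length sequence whose final state is pinned to a goal.

```python
import random
from typing import Callable, List


def next_token_rollout(step: Callable[[List[int]], int], start: int,
                       stop: int, max_len: int = 50) -> List[int]:
    """Autoregressive generation: append one token at a time until `stop` appears."""
    seq = [start]
    while seq[-1] != stop and len(seq) < max_len:
        seq.append(step(seq))  # each choice sees only the past, never a future goal
    return seq


def full_sequence_denoise(length: int, goal: float, n_steps: int = 200,
                          seed: int = 0) -> List[float]:
    """Refine a whole fixed-length sequence at once, with the last state pinned to `goal`."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from pure noise
    x[0], x[-1] = 0.0, goal  # condition on the start state and on the future goal
    for _ in range(n_steps):
        # Toy stand-in for a learned denoiser (assumption): average each
        # interior state with its two neighbours; the endpoints stay fixed.
        interior = [0.5 * (x[i - 1] + x[i + 1]) for i in range(1, length - 1)]
        x = [x[0]] + interior + [x[-1]]
    return x


# The same next-token procedure yields sequences of different lengths.
short = next_token_rollout(lambda s: s[-1] + 1, start=0, stop=3)
long = next_token_rollout(lambda s: s[-1] + 1, start=0, stop=7)

# The full-sequence procedure always returns exactly `length` states,
# but every state is shaped by the goal at the far end.
path = full_sequence_denoise(length=8, goal=7.0)

print(len(short), len(long))
print([round(v, 3) for v in path])
```

In this sketch the rollout length depends on when the stop token happens to appear, while the denoised path has a length fixed in advance and ends exactly at the requested goal.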

CSAIL researchers wanted to combine the strengths of both models, so they created a sequence model training technique called “Diffusion Forcing.” The name comes from “Teacher Forcing,” the conventional training scheme that breaks down full sequence generation into the smaller, easier steps of next-token generation (much like a good teacher simplifying a complex concept).
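As a minimal sketch of the Teacher Forcing idea only (not of Diffusion Forcing itself), the snippet below splits one sequence into next-token training examples, each pairing a ground-truth prefix with the token that follows it. The helper name `teacher_forcing_pairs` and the example sentence are hypothetical, chosen purely for illustration.

```python
from typing import List, Sequence, Tuple


def teacher_forcing_pairs(tokens: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Split one sequence into (ground-truth prefix, next token) training pairs."""
    # The prefix always comes from the real data, never from the model's own
    # earlier predictions; that is the "teacher" part of the scheme.
    return [(list(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


pairs = teacher_forcing_pairs(["the", "robot", "picks", "up", "the", "cup"])
for prefix, target in pairs:
    print(prefix, "->", target)
```

A six-token sequence yields five small prediction problems, which is the sense in which the scheme turns one hard full-sequence task into several easier next-token steps.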
