What I found interesting at NeurIPS 2024
NeurIPS 2024 was held in Vancouver from Dec 10 to Dec 15, 2024.
“Sequence to Sequence Learning with Neural Networks” (Test of Time award talk) by Ilya Sutskever (co-founder of OpenAI, co-founder of Safe Superintelligence)
This talk attracted a lot of attention, both at the conference and on social media, probably because of Ilya’s stature and the points he made.
Ilya talked about:
The era of scaling pre-training (as opposed to post-training and inference-time scaling) will end: even though compute keeps growing through better hardware, better algorithms and larger clusters, data is not growing as fast because there is only one internet.
What’s next after the end of pre-training?
Agents
Synthetic data
Inference-time compute (e.g. o1)
Some believe that Ilya’s comment applies only to human-generated language data.
“Toward industrial artificial intelligence” by Sepp Hochreiter (inventor of the LSTM, co-founder of NXAI, professor at Johannes Kepler University Linz)
Sepp talked about:
Three phases of AI: basic development, scaling up (e.g. ResNet, and the Transformer, which he characterized as ResNet plus attention) and industrialization.
During scaling up, we focus on a few model architectures, i.e. model variety decreases. Now we are at the end of scaling up and are entering the industrialization phase, where we have to adapt ML to different domains and different applications, so there will again be a variety of models.
He drew an analogy with “smart factories for mass individualization”, and cited the steam engine and the Haber-Bosch process as technologies that went through the same three phases (Haber got the Nobel Prize for scaling up the process, not inventing it; Bosch got the Nobel Prize for industrializing it).
He also believes that the scaling of pre-training will end, as we have run out of language data.
He introduced xLSTM, a recently released update of the LSTM that excels at applications beyond language (e.g. biology, robotics) because its inference is faster than that of Transformers or state-space models such as Mamba, and he shared results showing xLSTM being faster and more accurate than the alternatives in those areas.
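To make the architectural change concrete, here is a minimal NumPy sketch of the stabilized exponential gating in xLSTM’s sLSTM cell, loosely following the xLSTM paper (Beck et al., 2024). The single-cell scalar form and the variable names are my simplifications, not NXAI’s implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def slstm_step(x, h, c, n, m, W):
    """One step of a single-cell sLSTM (xLSTM), loosely following Beck et al., 2024.
    Exponential input/forget gates are stabilized by a running-max state m."""
    v = np.array([x, h])                  # current input and previous hidden state
    i_t = W["i"] @ v                      # log-space input gate pre-activation
    f_t = W["f"] @ v                      # log-space forget gate pre-activation
    o_t = sigmoid(W["o"] @ v)             # output gate
    z_t = np.tanh(W["z"] @ v)             # candidate cell input

    m_new = max(f_t + m, i_t)             # stabilizer keeps exp() bounded
    i_gate = np.exp(i_t - m_new)          # stabilized exponential input gate
    f_gate = np.exp(f_t + m - m_new)      # stabilized exponential forget gate

    c_new = f_gate * c + i_gate * z_t     # cell state
    n_new = f_gate * n + i_gate           # normalizer state
    h_new = o_t * (c_new / n_new)         # normalized hidden state
    return h_new, c_new, n_new, m_new

# Tiny usage example with random weights (purely illustrative).
rng = np.random.default_rng(0)
W = {k: rng.normal(size=2) for k in ("i", "f", "o", "z")}
h, c, n, m = 0.0, 0.0, 0.0, -np.inf
for x in [0.5, -1.0, 2.0]:
    h, c, n, m = slstm_step(x, h, c, n, m, W)
print(h)
```

The exponential gates (instead of sigmoids) let the cell revise its stored memories more aggressively, which is the key change over the classic LSTM.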
“AI for autonomous driving at scale” by Vincent Vanhoucke (Principal engineer at Waymo)
Personally, this is the talk I enjoyed the most. I think there is much to learn from Waymo: it is one of the AI systems most widely deployed in the physical world, and it has to handle the complexities of interacting with humans at close distance and high frequency, in a safe and reliable manner.
Some of the salient points from Vincent’s sharing:
Waymo is deployed in four cities (SF, LA, Austin and Phoenix) and is making 150,000 trips a week. Miles driven are also increasing rapidly.
Even though a Waymo has only two degrees of freedom (steering and acceleration), it operates in a real-time, high-dimensional, multimodal and unforgiving environment where other actors don’t play nice, weather conditions vary, and there is a long-tail distribution of rare scenarios. The main challenge is in sensing and understanding the world.
Waymo used to use a Transformer (MotionLM) to model the trajectories of other actors on the road, but recently started using diffusion, which has shown even better results than MotionLM.
In terms of scaling laws, they still apply, but MotionLM is more “efficient”, with the difference being orders of magnitude, i.e. scaling laws depend on the data distribution.
In terms of reinforcement learning, Waymo uses a combination of behavioral cloning (BC) and reinforcement learning (soft actor-critic, SAC), where SAC is used both in-distribution and out-of-distribution but BC is used only in-distribution (see the sketch after this list).
Waymo is currently looking into the use of VLMs and Chain-of-Thought (CoT) but this is still pretty much in the research phase.
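As a rough illustration of how a BC loss can be combined with an actor-critic objective, here is a minimal PyTorch sketch. The network shapes, the loss weighting and the `in_distribution` flag are my assumptions, not Waymo’s actual recipe; the critic’s own training loop is also omitted.

```python
import torch
import torch.nn as nn

# Policy maps an 8-d state to 2 actions (steering, acceleration);
# the critic scores (state, action) pairs. Shapes are illustrative.
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
q_net = nn.Sequential(nn.Linear(8 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def actor_loss(states):
    # SAC-style actor objective, simplified: maximize Q of the policy's
    # actions. (The entropy term is dropped because this toy policy is
    # deterministic, and the critic's TD training is omitted.)
    actions = torch.tanh(policy(states))
    return -q_net(torch.cat([states, actions], dim=-1)).mean()

def bc_loss(states, expert_actions):
    # Behavioral cloning: regress onto logged expert actions.
    return ((torch.tanh(policy(states)) - expert_actions) ** 2).mean()

def training_step(states, expert_actions, in_distribution, bc_weight=1.0):
    loss = actor_loss(states)                    # RL signal: used everywhere
    if in_distribution:                          # BC signal: in-distribution only
        loss = loss + bc_weight * bc_loss(states, expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Illustrative call on random data.
s, a = torch.randn(32, 8), torch.randn(32, 2).clamp(-1, 1)
print(training_step(s, a, in_distribution=True))
```

The intuition: BC is reliable only where demonstrations exist, while the RL signal can still shape behavior in states the demonstrations never covered.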
ML for systems / systems for ML - Richard Ho (Head of Hardware, OpenAI), Jeff Dean (Chief Scientist, Google Research), Tim Kraska (Director of Applied Science, Amazon)
In terms of ML for systems, at Amazon and Google, ML is already in production systems and on the critical path.
For example, at Amazon, Auto-WLM in Redshift intelligently schedules workloads to maximize throughput and horizontally scales clusters in response to workload spikes. While traditional heuristic-based workload management requires a lot of manual tuning for each specific workload (e.g. the concurrency level, the memory allocated to queries, etc.), Auto-WLM does this tuning automatically and as a result can quickly adapt and react to workload changes and demand spikes. In terms of ML, Auto-WLM uses locally-trained query performance models to predict the execution time and memory needs of each query, and uses these predictions to make intelligent scheduling decisions.
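As a toy illustration of the idea (not Amazon’s implementation): train a regression model on past query executions to predict runtime, then use the predictions to order and admit queries. The features and the shortest-job-first policy below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy sketch of a learned workload manager: a locally trained model
# predicts query execution time from query-plan features, and the
# scheduler admits the shortest predicted queries first. Features and
# policy are illustrative assumptions, not Auto-WLM's actual design.

rng = np.random.default_rng(0)
# Features per past query: [num_joins, scanned_gb, estimated_output_rows]
X_hist = rng.uniform(0, 10, size=(500, 3))
y_hist = X_hist @ np.array([2.0, 1.5, 0.3]) + rng.normal(0, 1, 500)  # runtime (s)

model = GradientBoostingRegressor().fit(X_hist, y_hist)

def schedule(pending, concurrency=4):
    """Order pending queries by predicted runtime (shortest first)
    and admit up to `concurrency` of them."""
    preds = model.predict(np.array([q["features"] for q in pending]))
    order = np.argsort(preds)
    return [pending[i] for i in order[:concurrency]]

pending = [{"id": i, "features": rng.uniform(0, 10, 3)} for i in range(10)]
print([q["id"] for q in schedule(pending)])
```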
This is surprising because it was only six years ago at NeurIPS that the idea of using ML in databases (e.g. for scheduling and indexing) was mooted.
Tim emphasised that at Amazon there is a very strong bias towards simple ML solutions with 10x gains rather than complex solutions with marginal gains, because of operational needs, e.g. start with simpler, more understandable models such as boosting before progressing to reinforcement learning.
Jeff also shared the techniques that Google is using to optimize the inference cost and latency of LLMs, e.g. distillation, speculative decoding and quantization (a sketch of speculative decoding follows).
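Speculative decoding in a nutshell: a small draft model proposes several tokens cheaply, and the large model verifies the whole block at once, keeping the longest accepted prefix. Here is a simplified greedy-verification sketch; `draft_next` and `target_logits` are hypothetical stand-ins, and the full algorithm accepts/rejects stochastically to preserve the target distribution exactly.

```python
import numpy as np

def draft_next(tokens):          # hypothetical cheap draft model
    return (sum(tokens) * 31 + 7) % 100

def target_logits(tokens):       # hypothetical expensive target model
    rng = np.random.default_rng(sum(tokens))
    return rng.normal(size=100)

def speculative_step(tokens, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2) Verify: the target model scores each proposed position; in a
    # real system this is ONE batched forward pass over the whole block,
    # which is where the latency win comes from.
    accepted, ctx = [], list(tokens)
    for t in proposal:
        best = int(np.argmax(target_logits(ctx)))
        if best != t:
            accepted.append(best)   # replace first disagreement, then stop
            break
        accepted.append(t)          # draft token confirmed by the target
        ctx.append(t)
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```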
In terms of systems for ML, Richard mentioned that at OpenAI it’s not just about maximum FLOPs but about the delicate balance between compute, memory and interconnect, e.g. overprovisioning compute without sufficient memory bandwidth or I/O leads to poor utilization.
Energy is also a fixed and limited resource, so he optimizes for effective compute per watt instead of just FLOPs.
Richard also talked about optimizing systems end-to-end. For example, AlphaChip might be able to reduce macro placement from six weeks to 24 hours, but that’s one part of a 20- to 24-month cycle of getting the whole system into production. Vertically integrated optimization is currently challenging and no single human can do it alone, because upstream and downstream components affect one another, e.g. LEC (logic equivalence checking) affects DRC (design rule checks), CDC (clock domain crossing) and RDC (reset domain crossing).
At OpenAI, given how many components these systems have and how large they are, a small drop in MTBF (mean time between failures) leads to a big drop in uptime, so more effort needs to go into increasing the RAS (reliability, availability, serviceability) of the systems (a back-of-the-envelope illustration follows).
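A minimal sketch with made-up numbers (mine, not OpenAI’s): with N components failing independently, the system-level MTBF is roughly the component MTBF divided by N, so availability degrades sharply at scale.

```python
# Back-of-the-envelope: a cluster fails whenever any one of its N
# components fails (serial failure model), so system MTBF ~ MTBF / N
# and availability = MTBF_sys / (MTBF_sys + MTTR).

def availability(component_mtbf_h, n_components, mttr_h):
    system_mtbf = component_mtbf_h / n_components
    return system_mtbf / (system_mtbf + mttr_h)

for mtbf in (50_000, 40_000):   # a 20% drop in component MTBF...
    print(f"MTBF={mtbf}h -> availability={availability(mtbf, 10_000, 1.0):.1%}")
# MTBF=50000h -> availability=83.3%
# MTBF=40000h -> availability=80.0%
```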
Video generation - amongst others, talks by Ishan Misra (Meta) and Doyup Lee (Runway)
SOTA is the generation of frames and shots (up to seconds long); we are still some way away from generating scenes → sequences → films.
Empirically, it is observed that these models are much better at fluid dynamics than at Newtonian mechanics. Humans are also much worse at detecting inconsistencies in fluid dynamics than in Newtonian mechanics.
The release of Sora has been described as a GPT-1 moment, but there’s no clear consensus on how to get from GPT-1 to GPT-4 in video generation. Some believe that we need better representations or better model architectures, while others believe that we just have to scale compute and data.
The debate on autoregressive vs diffusion for video generation is still unresolved: diffusion produces more realistic videos, while autoregressive allows companies to reuse their existing pipelines.
For diffusion-based models, the trend is towards flow matching, i.e. solving an ODE instead of an SDE (a minimal sketch follows).
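A minimal flow-matching sketch on 2-D toy data, assuming the rectified-flow formulation (straight-line interpolants with constant target velocity); the toy data and network here are mine, and real video models do this over latent video tensors.

```python
import torch
import torch.nn as nn

# Learn a velocity field v(x, t) such that integrating dx/dt = v(x, t)
# from noise (t=0) to t=1 yields data. Sampling is then a deterministic
# ODE solve -- no stochastic (SDE) sampler needed.

net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_data(n):                       # toy "data": ring of radius 2
    theta = torch.rand(n, 1) * 6.2831853
    return torch.cat([2 * theta.cos(), 2 * theta.sin()], dim=1)

for step in range(1000):                  # training: regress velocities
    x1 = sample_data(256)                 # data sample
    x0 = torch.randn_like(x1)             # noise sample
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1            # straight-line interpolant
    v_target = x1 - x0                    # its constant velocity
    v_pred = net(torch.cat([xt, t], dim=1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: Euler integration of the learned ODE from noise to data.
x = torch.randn(5, 2)
for i in range(50):
    t = torch.full((5, 1), i / 50)
    x = x + net(torch.cat([x, t], dim=1)) / 50
print(x)   # points should land near the radius-2 ring
```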
Text conditioning, which has worked well for image generation, might not work as well for video because (1) good text captions are lacking, especially for longer-form videos, and (2) text might not be the best representation, e.g. for describing the moves made by a stuntman on a motorbike.
An interesting thought experiment proposed by Pauline Luc (Google): no one would drive blindfolded while being instructed only by voice. Therefore, text conditioning is unlikely to be sufficient.
Runway has increased its focus on integrating with creators’ existing workflows to discover, define and steer foundation-model capabilities (e.g. keyframe conditioning), instead of training its own models.
Data curation, i.e. assembling high-quality, large-scale, diverse videos, is very important for pre-training, but when asked, Doyup couldn’t reveal how it is done at Runway, possibly due to copyright issues.
At Meta, compute remains a key issue in scaling MovieGen. According to Ishan, they have more data than compute.
Robotics - amongst others, talks by Sergey Levine (co-founder of Physical Intelligence, professor at UC Berkeley) and Chelsea Finn (co-founder of Physical Intelligence, professor at Stanford)
In two separate talks, they covered the following points about their work at Physical Intelligence:
Pi0 vision-language-action (VLA) model
A pre-trained vision-language model (VLM) with a flow-matching “action expert” attached to the end of the VLM to produce continuous actions
Independently developing the post-training data (fewer, higher-quality examples) and the pre-trained model (more, lower-quality data) leads to better performance
Even though the results from Pi0 are promising, Chelsea and Sergey admit that generalization is limited (narrow, single-application), language following is limited, and prompts are still relatively simple.
Embodied CoT and improvement through real-time human voice feedback (still experimental)
A two-level hierarchical recipe with a higher-level vision-language model focused on reasoning and planning, and a lower-level vision-language-action model focused on executing the plan (proposed; a sketch follows)
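A minimal sketch of what such a hierarchy could look like; `vlm_plan`, `vla_act` and `apply_action` are hypothetical stand-ins, not Physical Intelligence’s models or API.

```python
# Two-level hierarchy: a high-level VLM plans in language (slow, runs
# rarely), a low-level VLA turns each subtask into action chunks (fast
# control loop). All functions below are illustrative stubs.

def vlm_plan(image, instruction):
    # High level: reason about the scene and decompose the task.
    return ["locate the mug", "grasp the mug", "place mug on shelf"]

def vla_act(image, subtask):
    # Low level: map (observation, subtask) to a short action chunk,
    # e.g. a sequence of end-effector deltas.
    return [(0.01, 0.0, -0.02)] * 10

def apply_action(action):
    pass  # stand-in for sending commands to the robot controller

def run_episode(get_image, instruction):
    plan = vlm_plan(get_image(), instruction)         # plan once (or rarely)
    for subtask in plan:
        for action in vla_act(get_image(), subtask):  # execute at high rate
            apply_action(action)

run_episode(lambda: None, "put the mug on the shelf")
```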
On a separate note, Ted Xiao (Google DeepMind) mentioned that the most surprising thing over the past few years is the large variance in the sample efficiency required for different tasks, embodiments and techniques: on one end, very efficient imitation learning reaches high dexterous success rates from dozens to fewer than 100 robot episodes; on the other, good bin picking needs hundreds of thousands of autonomous online and offline learning episodes. This huge variance, five or six orders of magnitude in time and cost, makes research in robotics especially hard.
Planning / digital agents
Digital agents are still not very robust on trip-planning and calendar-scheduling benchmarks, e.g. they violate constraints or fail to meet the goal.
Interestingly, the lack of robustness in digital agents necessitates stronger AI safety measures, as some actions cannot be easily reversed (such as ordering something) or might lead to undesirable consequences (such as downloading malware).
Given the inherently stochastic nature of foundation models and the fact that we don’t understand their failure modes very well yet (LLMs are both powerful and puzzling at times), humans-in-the-loop will still be needed for digital-agent products (though it may be better to call it AI-in-the-loop).
I believe that we are going to see more innovation at the UI level for digital agents in order to accommodate humans-in-the-loop in the products.
Inference-time scaling
This topic was widely discussed, given the launch of o1 and its successors.
We have progressed in three waves: from pre-training to post-training, and now to inference-time scaling.
Techniques include Chain-of-Thought (CoT) and search such as MCTS (see the sketch at the end of this section).
Unsurprisingly, many folks were interested in the inference-time scaling behind o1.
The researchers from OpenAI didn’t reveal everything, but based on their presentations and the reverse engineering done by others, it is believed to be CoT plus RL on the CoT in post-training.
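The simplest way to see inference-time scaling is best-of-N sampling: draw several CoT answers and keep the one a verifier scores highest. The sketch below uses hypothetical `generate` and `score` stand-ins; o1 is believed to rely on RL-trained CoT rather than best-of-N, so this only illustrates the “more compute per query” idea.

```python
import random

# Best-of-N: spend more inference compute by drawing N candidate chains
# of thought and keeping the highest-scoring one. `generate` and `score`
# are stand-ins for an LLM sampler and a verifier / reward model; they
# are NOT OpenAI's implementation.

def generate(prompt, temperature=0.8):
    # Stand-in for sampling one CoT + answer from an LLM.
    return f"reasoning... answer={random.randint(0, 9)}"

def score(prompt, candidate):
    # Stand-in for a verifier or reward model scoring the candidate.
    return random.random()

def best_of_n(prompt, n=16):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# More samples -> more compute -> (with a good verifier) better answers.
print(best_of_n("What is 12 * 13?", n=4))
print(best_of_n("What is 12 * 13?", n=64))
```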