Elon Wuz Right

TikTok team proves Elon correct - images alone would be sufficient

Take a single photo, and get an understanding of the 3D positioning of the objects in the photo… better than LiDAR. That’s the promise of this work from TikTok, which was done during Lihe Yang’s PhD internship (!) at the company. In a nod to Meta’s Segment Anything, it is titled Depth Anything.

Results on various datasets the model was not trained on, covering parking, home automation, gaming, driving, furniture placement, and architecture

Elon of course spotted this way early:

While no one believed Elon when he said Tesla didn’t need LiDAR or radar, and images alone would be sufficient, it seems like the TikTok team has proven him correct.

Features of this work:

  • Goal was to build a foundation model for depth estimation from a single image

  • Did not rely on the classical approach of collecting accurately measured ground-truth depth maps for every training image

  • Instead collected a large unlabelled dataset (62 million images), which would form the basis for training the “student” model

  • Then built an annotation model to label this dataset

  • The annotation model, the “teacher”, was trained on a labelled dataset of 1.5 million images

  • This worked because of scale! They had many failures along the way
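The teacher/student flow in the list above can be sketched in a few lines. This is a toy illustration of the data flow only, not the Depth Anything training code: a 1-D line fit stands in for the depth network, the dataset sizes are scaled down, and every name here (fit_line, true_depth, and so on) is hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D data (stand-in for training a model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict(model, xs):
    """Apply a fitted (slope, intercept) model to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]


def true_depth(x):
    # Hypothetical ground-truth relation, used only to generate toy labels.
    return 2.0 * x + 1.0


rng = random.Random(0)

# Small labelled set (stands in for the 1.5 million labelled images).
labeled_x = [rng.uniform(0, 1) for _ in range(15)]
labeled_y = [true_depth(x) + rng.gauss(0, 0.05) for x in labeled_x]

# Large unlabelled set (stands in for the 62 million unlabelled images).
unlabeled_x = [rng.uniform(0, 1) for _ in range(620)]

# 1. Train the teacher (annotation model) on the labelled data.
teacher = fit_line(labeled_x, labeled_y)

# 2. Use the teacher to pseudo-label the unlabelled data.
pseudo_y = predict(teacher, unlabeled_x)

# 3. Train the student on labelled plus pseudo-labelled data.
student = fit_line(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print("teacher (slope, intercept):", teacher)
print("student (slope, intercept):", student)
```

In this toy the student can do little more than reproduce the teacher’s line, so it shows the order of the steps rather than the accuracy gain the authors attribute to scale.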

The exciting part of all of this is that it looks like vision alone is enough for a lot of tasks in the physical world.

