Learning Large Touch-Vision-Language Models Using Self-Supervised Robot Learning

Abstract

Humans depend on the integration of multiple sensory inputs, including but not limited to vision, language, audio, and touch, to successfully carry out daily tasks. Giving robots an analogous ability to perceive and process information from different sensory modalities enables a richer understanding of the physical environment. Although recent research has focused on the relationship between vision and language, as well as vision and tactile modalities, few studies explore the connection between language and touch. It is worth noting that humans often use language to articulate tactile experiences, describing the texture, material, softness, slipperiness, and roughness of objects. In addition, auditory cues can offer insight into tactile details, particularly with regard to object material properties or the properties of the environment in contact with the object. In this ongoing collaboration, we will build a multi-modal latent space spanning vision, touch, language, and audio, which can be used for downstream tasks such as cross-modal generation, zero-shot deployment to tactile manipulation, and semantic identification from touch.

Overview
See the project website here: https://sites.google.com/berkeley.edu/ssvtp for an overview, video, and links to the previous paper. In our current work, we are pursuing a new multimodal model that aligns tactile data, together with data from other sensory modalities, with language; a minimal sketch of this idea is shown below.
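To make the idea of a shared latent space concrete, the following is a minimal illustrative sketch of CLIP-style contrastive alignment between a tactile encoder and paired embeddings from another modality (e.g., vision or language). The encoder architecture, embedding dimension, temperature, and loss formulation here are assumptions for illustration only, not the project's actual implementation.

```python
# Illustrative sketch (assumed, not the project's code): contrastive alignment
# of a tactile encoder with paired vision/language embeddings in a shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TactileEncoder(nn.Module):
    """Maps a tactile image (e.g., a tactile sensor frame) to the shared latent space."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so all modalities live on the same unit hypersphere.
        return F.normalize(self.backbone(x), dim=-1)


def contrastive_loss(touch_emb: torch.Tensor, other_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired touch / other-modality embeddings."""
    logits = touch_emb @ other_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = TactileEncoder()
    touch = torch.randn(8, 3, 64, 64)                   # batch of tactile frames
    paired = F.normalize(torch.randn(8, 512), dim=-1)   # paired vision/text embeddings
    loss = contrastive_loss(encoder(touch), paired)
    loss.backward()
    print(f"contrastive loss: {loss.item():.3f}")
```

Under this kind of alignment, downstream tasks such as semantic identification from touch can, in principle, be posed as nearest-neighbor lookup of a tactile embedding against text embeddings in the shared space.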

Links
Previous Project Page: https://sites.google.com/berkeley.edu/ssvtp
Paper: https://www.roboticsproceedings.org/rss19/p018.pdf

Researchers
Max Fu, UC Berkeley, https://max-fu.github.io/
Justin Kerr, UC Berkeley, https://kerrj.github.io/
Raven Huang, UC Berkeley, https://qingh097.github.io/
Gaurav Datta, UC Berkeley
Jaimyn Drake, UC Berkeley
Mustafa Mukadam, Meta AI, https://www.mustafamukadam.com/
Joe Ortiz, Meta AI, https://joeaortiz.github.io/
Roberto Calandra, TU Dresden, https://lasr.org/
Mike Lambeta, Meta AI
Ken Goldberg, UC Berkeley, https://goldberg.berkeley.edu