
Meta’s ImageBind AI model uses six different data types


Meta has announced a new open-source AI model, ImageBind, that brings together six different data streams: text, audio, visual, depth, thermal (temperature) and movement data. The model is a research project for now, with no immediate consumer or practical applications, and is part of Meta's effort to create multimodal AI systems that "learn from all possible types of data around them".

The announcement also shows that Meta is willing to share AI research at a time when generative AI work from companies like OpenAI and Google is becoming increasingly secretive. Meta's track record on transparency is far from perfect, but for an AI that can take audio, temperature and even movement data as input, working in the open can save Meta a lot of hassle, even if it means sharing its research with everyone.

The approach is supposed to bring machines one step closer to the human ability to learn “simultaneously, holistically, and directly” from multiple information sources without the need for researchers to organise and label raw training data. 

Meta's ImageBind will be able to work backwards to generate images and audio from related audio or image sources, not just text prompts. | Source: Meta

As mentioned before, there are no consumer or practical applications for ImageBind in the near future, but Meta is showing off the AI's capabilities, which include generating audio from images, retrieving related images from audio prompts, retrieving images and audio from text, retrieving related images from combined audio and image queries, and generating images from audio prompts.

The core concept here is to combine different input types into a single "embedding space", giving the AI more data points to learn from and analyse. Linking multiple data types in one shared embedding space has been central to the recent rapid growth of generative AI technologies.
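A minimal sketch of the idea, using hypothetical hand-made 3-dimensional vectors (real ImageBind embeddings come from trained neural encoders and are far higher-dimensional): once every modality lives in the same vector space, retrieving an image from an audio clip is just a nearest-neighbour search by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical pre-computed embeddings: every modality is mapped into
# one shared 3-d space. The file names and values are illustrative only.
image_embeddings = {
    "photo_of_dog.jpg":   [0.90, 0.10, 0.00],
    "photo_of_ocean.jpg": [0.00, 0.20, 0.95],
}
audio_embedding_bark  = [0.88, 0.15, 0.05]  # a dog barking
audio_embedding_waves = [0.05, 0.10, 0.90]  # waves crashing

def retrieve(query_vec, candidates):
    """Return the candidate whose embedding is closest to the query."""
    return max(candidates, key=lambda k: cosine(query_vec, candidates[k]))

print(retrieve(audio_embedding_bark, image_embeddings))   # → photo_of_dog.jpg
print(retrieve(audio_embedding_waves, image_embeddings))  # → photo_of_ocean.jpg
```

Because the query and the candidates share one space, the same `retrieve` function works in any direction: audio-to-image, text-to-audio, and so on, which is what makes the cross-modal capabilities listed above possible.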

For context, AI image generators like DALL-E and Midjourney rely on systems that pair text and images during training. The model looks for patterns in visual data that correspond to text descriptions, which is what lets these systems work backwards from a text prompt to generate a new image. With the additional modalities ImageBind processes, those capabilities extend well beyond text-to-image.
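The pairing idea behind such text-image systems can be sketched with a CLIP-style contrastive loss: each text embedding in a batch is trained to score highest against its own paired image. The vectors and temperature below are toy assumptions, not Meta's actual setup.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_loss(text_vecs, image_vecs, temperature=0.1):
    """CLIP-style loss: text i should match image i (same batch index).
    Vectors are assumed L2-normalised, so a dot product is cosine similarity."""
    loss = 0.0
    for i, t in enumerate(text_vecs):
        # Similarity of text i against every image in the batch.
        sims = [sum(a * b for a, b in zip(t, v)) / temperature
                for v in image_vecs]
        probs = softmax(sims)
        loss += -math.log(probs[i])  # cross-entropy with the true pair
    return loss / len(text_vecs)

# Toy batch of two (text, image) pairs in a shared 2-d space.
texts      = [[1.00, 0.00], [0.00, 1.00]]
images     = [[0.95, 0.31], [0.31, 0.95]]  # roughly aligned with their texts
misaligned = [[0.31, 0.95], [0.95, 0.31]]  # the same images, pairs swapped

print(contrastive_loss(texts, images) < contrastive_loss(texts, misaligned))  # True
```

Minimising this loss pulls matching pairs together in the embedding space and pushes mismatched ones apart; ImageBind's contribution is extending that alignment from two modalities to six.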


Yadullah Abidi

Yadullah is a Computer Science graduate who writes, edits, shoots and codes all things cybersecurity, gaming and tech hardware. When he isn't, he streams himself racing virtual cars. He's been writing and reporting on tech and cybersecurity for websites like Candid.Technology and MakeUseOf since 2018.