Human beings use our five senses to perceive images, smells, sounds, textures and flavors, and we rely on all of this input to understand the world, act in it and gain knowledge about it. AI is used in a wide range of practical applications; its main goal is to solve specific problems or improve efficiency in specific tasks, as well as to explore and better understand intelligence itself. One of the ways AI pursues this goal is by simulating certain aspects of human intelligence, focusing mainly on imitating and processing human “perception” and response through data of a single type obtained from a single channel, whether visual or auditory. This is changing: more holistic models are being developed that are aligned with the real world, which is multimodal. Ignoring the diversity of information that we receive and process simultaneously is a limiting factor when applying AI to real problems. Conversely, the ability to process data of different types in an integrated manner represents a qualitative leap in AI technology, because it makes systems adaptable to different situations, allows a better understanding of the world, and yields richer and more precise solutions.

Within generative AI in particular, efforts are being made in this direction, and multimodal data is increasingly being incorporated into learning models. Today, multimodal generative AI ranks among the main technological challenges in Artificial Intelligence for 2024.

From a historical perspective, multimodal AI is not new. As early as 1968, Terry Winograd created SHRDLU, a system that could manipulate and reason about a world of blocks following instructions from a user. Similarly, Siri (Apple, 2011) can be considered an example of multimodal AI, where the input is the human voice and the output can be an action or text.

Currently, multimodal AI is one of the focuses of the large technology companies, which are continuously striving to add capabilities and stay at the forefront of this technology. OpenAI developed DALL-E, a program capable of generating images from text descriptions and/or commands, and integrated it into ChatGPT Plus earlier this year, allowing users to generate images with the DALL-E 3 model inside the ChatGPT Plus chatbot. More recently, OpenAI launched GPT-4V, which can interpret images and voice along with text: users can upload images, ask questions about them, and receive visual answers.
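
As a hedged illustration of the image-generation side, the sketch below calls the DALL-E 3 model through OpenAI’s Python SDK; the prompt, image size and output handling are illustrative assumptions, not details from this article.

```python
# Minimal sketch: text-to-image generation with the "dall-e-3" model via
# the OpenAI Python SDK. The prompt and size below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor illustration of an assistant that sees, hears and reads",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)  # URL of the generated image
```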

Meta, another strong competitor in the development of AI technology, created the multimodal model SeamlessM4T, which can translate and transcribe nearly 100 languages across text and speech, enabling direct communication between two people who speak different languages.
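
For a sense of how such a model is driven, here is a minimal sketch using the Hugging Face transformers port of SeamlessM4T; the checkpoint name, language codes and generate arguments are assumptions taken from that library’s documented interface, so verify them before use.

```python
# Sketch (assumed API): English text in, Spanish speech out with
# SeamlessM4T through Hugging Face transformers.
from transformers import AutoProcessor, SeamlessM4TModel

checkpoint = "facebook/hf-seamless-m4t-medium"  # assumed model id
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4TModel.from_pretrained(checkpoint)

# Encode an English sentence; "eng" and "spa" are SeamlessM4T language codes.
inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")

# Generate a Spanish waveform for the sentence.
audio = model.generate(**inputs, tgt_lang="spa")[0].cpu().numpy().squeeze()
print(audio.shape)
```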

The same idea of combining text and voice in multiple languages underlies Whisper, OpenAI’s speech recognizer, trained on 680,000 hours of data collected from the web, which is also capable of identifying the language being spoken and translating speech into English.
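
Because Whisper is released as an open-source Python package, its use can be sketched directly; the audio file name below is a placeholder.

```python
# Minimal sketch: transcription and speech translation with the
# open-source Whisper package (pip install openai-whisper).
import whisper

model = whisper.load_model("base")  # small multilingual checkpoint

# Transcribe in the original language; Whisper also detects which it is.
result = model.transcribe("interview.mp3")  # placeholder file name
print(result["language"], result["text"])

# Translate the same speech into English.
translated = model.transcribe("interview.mp3", task="translate")
print(translated["text"])
```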

Google, for its part, is close to launching Gemini, which is expected to be ChatGPT’s direct competition. The bet on Gemini lies in its multimodal capacity: it is designed to process and understand different types of data such as audio, text, images and video. OpenAI got ahead by launching GPT-4V, but we must not forget that Google has the advantage of the huge repository of images and videos collected through its search engine and YouTube.

But multimodality should not be limited to just text, images and voice, and that is how Meta sees it: beyond SeamlessM4T, it is developing ImageBind, a multimodal system that incorporates text, images, videos, audio, and temperature and movement measurements. The vision is to eventually add sensory data such as touch and smell, among others.
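
Systems of this kind rest on a shared embedding space into which every modality is projected, so that inputs can be compared across modalities. The sketch below illustrates only that idea, with hypothetical stand-in encoders; it is not ImageBind’s actual API.

```python
# Conceptual sketch: a shared embedding space across modalities.
# The encoders are hypothetical stand-ins, not ImageBind components.
import torch
import torch.nn.functional as F

def embed(encoder: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Project an input into the shared space and L2-normalize it."""
    return F.normalize(encoder(x), dim=-1)

# Stand-in encoders mapping each modality into the same 512-d space.
text_encoder = torch.nn.Linear(300, 512)   # e.g. from text features
audio_encoder = torch.nn.Linear(128, 512)  # e.g. from audio features

text_emb = embed(text_encoder, torch.randn(1, 300))
audio_emb = embed(audio_encoder, torch.randn(1, 128))

# Cosine similarity across modalities; training would push matching
# text/audio pairs toward 1.0 and mismatched pairs toward 0.
print((text_emb @ audio_emb.T).item())
```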

Finally, CoDi (Composable Diffusion), from Microsoft, is a generative AI model that is capable of simultaneously processing different types of data and generating a coherent composition of several of them.

In terms of applications, multimodal AI opens up a wide spectrum of possibilities, and the examples are countless. The following paragraphs give a small selection to illustrate this.

One of the most commonly mentioned domains when talking about multimodality is health, where the combination of diverse data such as medical images, patient history and sensor data can improve both the diagnosis and the treatment of diseases.
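
As a hedged sketch of how such a combination is typically wired up, the model below concatenates features extracted from a medical image with tabular patient-history features before a shared classifier head; all dimensions and layers are illustrative assumptions, not a real clinical system.

```python
# Illustrative late-fusion classifier: image features + patient history.
# Feature dimensions are arbitrary placeholders.
import torch
import torch.nn as nn

class MultimodalDiagnosisModel(nn.Module):
    def __init__(self, image_dim=512, history_dim=32, n_classes=2):
        super().__init__()
        self.image_branch = nn.Sequential(nn.Linear(image_dim, 128), nn.ReLU())
        self.history_branch = nn.Sequential(nn.Linear(history_dim, 32), nn.ReLU())
        self.head = nn.Linear(128 + 32, n_classes)  # acts on fused features

    def forward(self, image_feats, history_feats):
        # Fuse the two modalities by concatenation, then classify.
        fused = torch.cat(
            [self.image_branch(image_feats), self.history_branch(history_feats)],
            dim=-1,
        )
        return self.head(fused)

model = MultimodalDiagnosisModel()
logits = model(torch.randn(4, 512), torch.randn(4, 32))  # batch of 4 cases
print(logits.shape)  # torch.Size([4, 2])
```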

In the automotive sector, for example, multimodal AI improves safety by fusing data from cameras, radar, lidar and other sensors to make quick decisions in complex driving environments.

In the area of education, multimodal AI can analyze texts, class videos and exams in order to adapt content to each student and provide personalized feedback.

In the entertainment sector, multimodal AI is used to create immersive experiences in augmented reality applications by combining visual, auditory and tactile elements.

In the field of accessibility, multimodal AI can help people with disabilities by enabling more natural communication with computers, for example by translating spoken language into written language or vice versa, or by manipulating images and videos through spoken instructions.

In conclusion, multimodal AI has the potential to take us to a new level of digital intelligence, making technology more inclusive and efficient across a wide spectrum of applications, without forgetting the ethical implications of its implementation.

This discipline within AI is just beginning. There is still a long way to go: from solving the exponential growth in computational resources required each time a new data modality is incorporated, through the integration of diverse data itself, which is already quite challenging, to the incorporation of new sensory modalities whose digitalization is not yet fully developed, such as smell, taste and touch.

At the rate at which large technology companies are researching and advancing multimodal AI, it is quite likely that we will be seeing high-impact results in the coming months and years.

References

  1. OpenAI. Whisper. https://openai.com/research/whisper (n.d.).
  2. Hu, Luhui. “Beyond GPT-4: What’s New? Four major trends in Gen AI: LLM to…” Towards AI, September 2023.
  3. “The 10 Biggest Generative AI Trends For 2024 Everyone Must Be Ready For Now.” Forbes.com.
  4. “DALL-E 3 is now available on ChatGPT.” Hipertextual.com.
  5. “GPT-4V: the new version of ChatGPT launched by OpenAI.” Planeta Chatbot.
  6. “Multimodal Artificial Intelligence: Revolution in AI Comprehension.” Civilsdaily.
  7. “Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation.” Microsoft Research.
María Eugenia Fuenmayor
Scientific Director of Digital Technologies

Eurecat
