Multimodal Conversational AI: The Future of Natural Language Interaction

[Image: road connection near green trees in suburb. Photo by Kelly on Pexels.com]

<update> 15/03/23: It seems this article was spot on, as OpenAI just released GPT-4 (https://openai.com/product/gpt-4), which is multimodal! </update>

Multimodal Conversation is something I came across recently, so I decided to do a little bit of research!

What is Multimodal Conversational AI?

Multimodal conversational AI is a type of conversational AI that uses multiple modes to enable human-like communication. It can combine various modalities such as voice, text, images, and video to create a more seamless and natural interaction between humans and machines.

For example, imagine you are ordering food through a chatbot. Instead of typing out your order, you could take a picture of the menu and send it to the chatbot. The chatbot could then use image recognition to identify the items you want to order and confirm your order through voice interaction. This type of multimodal interaction is more natural and efficient than typing out your order, and it can also be more accessible for people with disabilities or those who prefer non-textual communication.

While this may not seem tremendously practical, it does show the idea behind the concept 🙂

Is Multimodal really more efficient?

I must say that I initially doubted the claim that “multimodal is more efficient than typing out an order”, but after some more research, the evidence does tend to point in that direction.

https://www.researchgate.net/publication/221491208_The_efficiency_of_multimodal_interaction_a_case_study

Ultimately, it will depend on the task at hand. Most likely, the efficiency gains show up in more complex tasks.

Why Multimodal Conversational AI Matters

Multimodal conversational AI has the potential to revolutionize the way we interact with technology. By combining multiple modalities, it can create a more natural and intuitive communication experience. It can also make communication more flexible and accessible, enabling people to interact with technology in the way that feels most comfortable to them.

Multimodal conversational AI can also enable new use cases for conversational AI. For example, in healthcare, multimodal conversational AI could be used to enable patients to communicate with doctors and nurses through a variety of modalities, including voice, text, and video. This could help patients with disabilities or those who are unable to communicate effectively through traditional means.

Challenges and Opportunities

While multimodal conversational AI has great potential, there are also challenges that must be addressed. One challenge is the complexity of processing multiple modalities. Combining different modalities requires complex algorithms that can process and analyze data from different sources.
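To make the complexity a bit more concrete, here is a minimal Python sketch of one simplified way to picture the problem: every modality gets its own handler, and each handler turns its input into plain text that the rest of the chatbot can work with. Everything in it is an assumption made for illustration. The names UserInput, HANDLERS and to_text are hypothetical, and the voice and image handlers are placeholders standing in for real speech-to-text and image recognition models.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class UserInput:
    modality: str  # "text", "voice" or "image" (hypothetical labels)
    data: str      # raw text, or a file name standing in for audio/image data


def handle_text(data: str) -> str:
    # Text needs no conversion, only light cleanup.
    return data.strip()


def handle_voice(data: str) -> str:
    # Placeholder: a real system would call a speech-to-text model here.
    return f"[transcript of {data}]"


def handle_image(data: str) -> str:
    # Placeholder: a real system would call an image recognition model here.
    return f"[description of {data}]"


# One handler per modality; supporting a new modality means adding an entry.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "text": handle_text,
    "voice": handle_voice,
    "image": handle_image,
}


def to_text(user_input: UserInput) -> str:
    """Convert any supported modality into text for the dialogue logic."""
    handler = HANDLERS.get(user_input.modality)
    if handler is None:
        raise ValueError(f"Unsupported modality: {user_input.modality}")
    return handler(user_input.data)
```

Even in this toy version, each modality needs its own processing step before the conversation logic can use it, and in a real system every one of those steps is a model with its own errors and latency.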

Another challenge is ensuring that multimodal conversational AI systems are accessible to all users, regardless of their abilities or communication preferences.

However, there are also many opportunities for innovation in this area. Multimodal conversational AI can enable new use cases, such as virtual shopping assistants that can help customers find and purchase products through a combination of voice and image recognition. It can also improve the accessibility of technology, enabling people with disabilities to interact with technology in a more natural and intuitive way.

The easy way in

Typical examples of multimodal AI include pictures and video, but we can have a sort of low-level multimodal approach. Imagine a typical conversational AI interaction, enhanced with forms or action buttons (voice or clickable). For some use cases, this can tremendously improve the efficiency of interaction, and it’s not as complex as those more advanced use cases.
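As a rough illustration of this low-level approach, here is a small Python sketch of a bot message that carries action buttons next to its text, plus a function that accepts either a button click or a typed or spoken answer. The names Button, BotMessage and resolve_reply are hypothetical and not taken from any particular chatbot framework.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Button:
    label: str    # text shown (or read aloud) to the user
    payload: str  # machine-readable value sent back when the button is chosen


@dataclass
class BotMessage:
    text: str
    buttons: List[Button] = field(default_factory=list)


def resolve_reply(message: BotMessage, user_input: str) -> Optional[str]:
    """Map a click payload, or a typed/spoken label, to a button payload."""
    normalized = user_input.strip().lower()
    for button in message.buttons:
        # A click sends the payload; typing or speaking sends the label.
        if normalized in (button.payload.lower(), button.label.lower()):
            return button.payload
    return None  # no button matched: treat the input as free text


confirm = BotMessage(
    text="Do you want to confirm your order?",
    buttons=[Button("Yes", "confirm_order"), Button("No", "cancel_order")],
)
```

With this message, resolve_reply(confirm, "yes") and a click that sends "confirm_order" both lead to the same action, while anything else falls back to normal free-text handling.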

Conclusion

Multimodal conversational AI is a very interesting concept, and there are some use cases where it can have a tremendous impact. It brings additional challenges and complexity, but it also paves the way to exploring new opportunities and use cases. I wouldn’t say multimodal is the future of conversational AI, far from it, but it is safe to say that it offers another approach to standard conversational AI and has the potential to push some boundaries. Ultimately, time will tell.

Note: A couple of paragraphs were generated with the help of ChatGPT.