In Large Language Models (LLMs), modality refers to the different formats of information a model can process and generate. Text, images, audio (including speech), and video are examples of modalities. Some LLMs, called multimodal LLMs, are capable of handling more than one modality, often in combination.
Here’s a list of different modalities that are being explored in the field of language model development:
Text: The primary modality for LLMs. These models are designed to understand, interpret, and generate text-based content. This includes tasks like answering questions, writing essays, composing emails, generating code, and more. Examples: GPT-3, GPT-4, Gemini, PaLM 2
Images: Some advanced models understand and generate image content. For example, they can create images from text descriptions or analyze and describe the content of an image (as sketched in the code example after this list). Example: DALL-E 2
Audio: There’s ongoing research into enabling LLMs to process and generate audio content. This includes understanding spoken language (speech-to-text, also shown in the sketch after this list), generating spoken language (text-to-speech), and creating music or sound effects.
Video: A more complex modality that involves understanding the visual content of videos and potentially interpreting or generating video clips from text descriptions.
Tactile: A more experimental and less developed modality. Research is exploring how models could interact with tactile information, which could have applications in robotics or virtual reality.
Olfactory and Gustatory: Highly experimental modalities involving smell and taste, respectively. They remain largely theoretical and are not commonly seen in practical applications.
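To make the image and audio entries above more concrete, here is a minimal sketch of non-text modalities in practice. It assumes the Hugging Face transformers library is installed and that the BLIP image-captioning and Whisper speech-recognition checkpoints named in the code are available; the file names photo.jpg and recording.wav are placeholder assumptions, not part of any particular model's API.

```python
# Minimal sketch of non-text modalities, assuming the Hugging Face
# "transformers" library is installed; model and file names are illustrative.
from transformers import pipeline

# Image modality: describe the content of an image (image-to-text).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("photo.jpg")[0]["generated_text"]
print("Image description:", caption)

# Audio modality: transcribe spoken language (speech-to-text).
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
transcript = transcriber("recording.wav")["text"]
print("Transcript:", transcript)
```

In a true multimodal LLM these capabilities live inside a single model rather than in separate pipelines; the sketch only illustrates the kinds of inputs and outputs each modality involves.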
While text remains the primary and most developed modality for language models, the integration of other modalities such as images, audio, and video is a rapidly growing area of research and development in generative artificial intelligence.
See Also: Large Language Models