Ars Technica
On Monday, researchers from Microsoft unveiled Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, do visual text recognition, pass visual IQ tests and understand natural language instructions. The researchers believe that multimodal artificial intelligence—which incorporates different modes of input such as text, audio, images, and video—is a key step toward creating artificial general intelligence (AGI) that can perform general human-level tasks.
“Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of acquiring knowledge and grounding in the real world,” the researchers write in their academic paper, “Language Is Not All You Need: Aligning Perception with Language Models.”
Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, captioning images, and taking a visual IQ test with 22–26 percent accuracy (more on that below).
An example provided by Microsoft of Kosmos-1 answering questions about images and websites. (Credit: Microsoft)
An example of a “multimodal chain-of-thought prompt” provided by Microsoft for Kosmos-1. (Credit: Microsoft)
An example of Kosmos-1 answering a visual question, provided by Microsoft. (Credit: Microsoft)
While the media buzzes with news about large language models (LLMs), some AI experts point to multimodal AI as a possible path toward artificial general intelligence, a hypothetical technology that would ostensibly be able to replace humans at any intellectual task. AGI is the stated goal of OpenAI, a key Microsoft business partner in the field of artificial intelligence.
In this case, Kosmos-1 appears to be a pure Microsoft project without OpenAI’s involvement. The researchers call their creation a “multimodal large language model” (MLLM) because its roots lie in natural language processing, like a text-only LLM such as ChatGPT. And it shows: For Kosmos-1 to accept image input, the researchers must first translate the image into a special series of tokens (basically text) that the LLM can understand. The Kosmos-1 paper describes this in more detail:
For the input format, we flatten input as a sequence decorated with special tokens. Specifically, we use <s> and </s> to denote start- and end-of-sequence. The special tokens <image> and </image> indicate the beginning and end of encoded image embeddings. For example, “<s> document </s>” is a text input, and “<s> paragraph <image> Image Embedding </image> paragraph </s>” is an interleaved image-text input. … An embedding module is used to encode both text tokens and other input modalities into vectors. Then the embeddings are fed into the decoder. For input tokens, we use a lookup table to map them into embeddings. For the modalities of continuous signals (e.g., image and audio), it is also feasible to represent inputs as discrete code and then regard them as “foreign languages.”
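To make that flattening scheme concrete, here is a minimal Python sketch. The special token names follow the quoted passage, but the toy word-level tokenizer, the Image placeholder, and the “<image embedding>” marker are illustrative assumptions, not Microsoft’s actual implementation.

```python
# A minimal sketch of the interleaved input format quoted above. The special
# token names follow the paper's description; everything else here is a
# hypothetical stand-in rather than Microsoft's actual code.
from typing import List, Union

BOS, EOS = "<s>", "</s>"                    # start / end of the whole sequence
IMG_START, IMG_END = "<image>", "</image>"  # wrap the encoded image embeddings


class Image:
    """Placeholder for an image whose encoded embeddings would be spliced in."""


def flatten_input(segments: List[Union[str, Image]]) -> List[str]:
    """Flatten interleaved text and images into one decorated token sequence."""
    tokens = [BOS]
    for seg in segments:
        if isinstance(seg, str):
            tokens.extend(seg.split())  # toy word-level tokenization
        else:
            # A real MLLM would splice in the image encoder's continuous
            # embeddings here; this sketch only marks where they would sit.
            tokens += [IMG_START, "<image embedding>", IMG_END]
    tokens.append(EOS)
    return tokens


# An interleaved image-text input, analogous to
# "<s> paragraph <image> Image Embedding </image> paragraph </s>"
print(flatten_input(["An aerial photo of a city,", Image(), "taken at dusk."]))
```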
Microsoft trained Kosmos-1 using data from the web, including excerpts from The Pile (an 800GB English text resource) and Common Crawl. After training, they evaluated Kosmos-1’s abilities on several tests, including language understanding, language generation, OCR-free text classification (reading text without relying on optical character recognition), image captioning, visual question answering, web page question answering, and zero-shot image classification. In many of these tests, Kosmos-1 outperformed current state-of-the-art models, according to Microsoft.
Of particular interest is Kosmos-1’s performance on Raven’s Progressive Matrices, which measure visual IQ by presenting a sequence of shapes and asking the test-taker to complete the sequence. To test Kosmos-1, the researchers fed it the puzzle one candidate at a time, with each option filled in, and asked whether the completed answer was correct. Kosmos-1 could answer a question on the Raven test correctly only 22 percent of the time (26 percent with fine-tuning). This is by no means a slam dunk, and errors in methodology could have affected the results, but Kosmos-1 did beat random chance (17 percent) on the Raven IQ test.
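As a rough Python sketch of that testing procedure, each candidate completion is shown to the model in turn and the one it most endorses as correct is selected; the StubModel class and its yes_probability method below are hypothetical placeholders, not the real Kosmos-1 interface.

```python
# A rough sketch of the multiple-choice scoring described above: the puzzle is
# presented once per candidate answer, and the model is asked whether the
# filled-in option is correct. StubModel and yes_probability are hypothetical
# placeholders, not the actual Kosmos-1 API.
import random


class StubModel:
    """Stand-in for Kosmos-1; returns a fake probability of answering 'yes'."""

    def yes_probability(self, completed_puzzle: str, prompt: str) -> float:
        return random.random()  # a real model would score the image plus prompt


def solve_raven_item(model: StubModel, candidate_puzzles: list) -> int:
    """Return the index of the candidate the model most endorses as correct."""
    prompt = "Here is the puzzle with one option filled in. Is it correct?"
    scores = [model.yes_probability(p, prompt) for p in candidate_puzzles]
    return max(range(len(scores)), key=scores.__getitem__)


# Six candidate completions of the same matrix; picking at random would be
# right about 17 percent of the time, the chance baseline cited above.
candidates = [f"matrix_with_option_{i}.png" for i in range(6)]
print(solve_raven_item(StubModel(), candidates))
```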
However, while Kosmos-1 represents only early steps in the multimodal domain (an approach also being pursued by others), it’s easy to imagine that future optimizations could bring even more significant results, allowing AI models to perceive any form of media and act on it, which would greatly enhance the abilities of AI assistants. In the future, the researchers say, they would like to scale up Kosmos-1 in model size and incorporate speech capability as well.
Microsoft says it plans to make Kosmos-1 available to developers, though the GitHub page cited by the paper has no apparent code for Kosmos at the time of this story’s publication.