How Meta's AI model CM3leon compares with OpenAI's DALL-E 2
Meta has been busy in the AI game. Over the past few months, the company has released multiple AI models for various purposes. Now, it has introduced an AI-powered image generator dubbed 'CM3leon' (pronounced like "chameleon"), which the company claims can handle both text-to-image and image-to-text generation. Let's see how it fares against OpenAI's DALL-E 2 image generator.
Why does this story matter?
The number of image generators in the market has grown rapidly. Major tech players and multiple start-ups have entered the segment. However, these tools have yet to reach their full potential in terms of performance. Meta aims to change that with CM3leon. The company took a different approach to developing CM3leon, and it seems to have paid off.
Diffusion-based image generators are computationally intensive
Major AI image generators in the market, including DALL-E 2, rely on a process called diffusion. In diffusion, an AI model starts from random noise and progressively removes it (denoising) until a target image emerges. As impressive as diffusion is, it is a computationally heavy process, which makes it expensive to operate. Meta's CM3leon, meanwhile, relies on a mechanism used in transformer models called "attention."
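To see why diffusion is costly, consider a stripped-down sketch of its sampling loop: the network has to run once for every denoising step, so dozens or hundreds of forward passes go into a single image. This is a toy illustration, not DALL-E 2's actual sampler; `reverse_diffusion`, the update rule, and `dummy_model` are all placeholder assumptions.

```python
import torch

def reverse_diffusion(model, steps=50, shape=(1, 3, 64, 64)):
    # Start from pure Gaussian noise and gradually denoise it.
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        predicted_noise = model(x, t)    # one full network pass per step
        x = x - predicted_noise / steps  # toy update; real samplers follow a noise schedule
    return x  # the final tensor is the generated image

# Dummy stand-in for a trained noise-prediction network.
dummy_model = lambda x, t: 0.1 * x
image = reverse_diffusion(dummy_model)
print(image.shape)  # torch.Size([1, 3, 64, 64])
```

The expensive part is the loop: the generation cost grows with the number of denoising steps, since each one requires a full pass through the network.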
'Attention method' allows for parallel processing
Transformers use the attention mechanism to make sense of sequences. The model weighs how relevant each part of the input is to every other part while solving the task at hand. Because these relevance scores can be computed in parallel, processing is much faster. This makes it easier to scale up large image-generation models without training costs ballooning.
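For context, the snippet below sketches the standard scaled dot-product attention used in transformers. It is a generic textbook implementation, not CM3leon's code; the key point is that the relevance scores for all positions are computed in a few batched matrix multiplies, which GPUs can process in parallel.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Score every position against every other position in one matrix multiply.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)  # relevance of each token to each other token
    return weights @ v                   # weighted mix of the values

# Toy usage: self-attention over a sequence of 5 tokens with 16-dim embeddings.
x = torch.randn(1, 5, 16)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([1, 5, 16])
```

Unlike the step-by-step diffusion loop above, there is no sequential dependency between positions here, which is what allows the whole computation to run in parallel.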
Meta's model can generate images and text
DALL-E 2 is only capable of generating images from text input. CM3leon, on the other hand, can go beyond that: it can generate sequences of text and images. This makes it one of the first models that can write captions for an image. According to the company, CM3leon's ability to generate both images and text improved its performance across various tasks.
CM3leon has more parameters than DALL-E 2
Meta's CM3leon has seven billion parameters, while OpenAI's DALL-E 2 runs on 3.5 billion. DALL-E 2's predecessor, the original DALL-E, had 12 billion parameters. CM3leon was also trained on millions of licensed images from Shutterstock.
CM3leon can generate captions and answer questions about images
CM3leon's text capabilities differentiate it from DALL-E 2. Meta's model can perform various text tasks, including generating short or long captions and answering questions about an image. In examples shared by Meta, the model describes images in detail. In this area, CM3leon performed even better than specialized image-captioning models.