Breakthroughs in Image-Making AI Produce Photorealistic Images from Natural Text

An AI space race is occurring in text-to-image technology

Elliot Leavy
05/31/2022

A month after OpenAI showed off its new picture-making neural network, DALL-E 2, Google Brain has revealed its own text-to-image model, Imagen.

For those unaware, both DALL-E 2 and Imagen use natural language processing (NLP) to translate text into imagery. Just a month ago, the tech world was awed by OpenAI’s venture into this space when it demonstrated the capabilities of DALL-E 2. And as far back as last year, graphics card company NVIDIA announced GauGAN AI, its own deep learning NLP model that lets anyone channel their imagination into a photorealistic masterpiece.

Now, Google has unveiled its own model which, according to the company, outstrips its competitors in almost every way.

Imagen has shown the ability to produce remarkably high-resolution images of almost anything it is asked for, from robots dining next to the Eiffel Tower to corgis in sushi houses. It scored higher than its OpenAI rival on a standard measure for rating the quality of computer-generated images, and its pictures were also preferred by a group of human judges.
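The “standard measure” in question is most likely the Fréchet Inception Distance (FID), which compares statistics of deep image features extracted from generated and real images; lower scores mean the generated images look more like the real ones. Below is a minimal sketch of the calculation in Python, assuming the feature means and covariances have already been computed (the function name and arguments are illustrative, not from Google’s or OpenAI’s code):

```python
import numpy as np
from scipy import linalg


def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    """Compute FID between two sets of image features.

    mu1, sigma1: mean vector and covariance matrix of features for the
                 generated images.
    mu2, sigma2: the same statistics for the reference (real) images.
    Lower is better; identical distributions give an FID of 0.
    """
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```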

The potential of this technology is limited only by the imaginations of its users. Such images could completely revamp the stock image space, and also allow the creation of entirely new artworks from a description alone.

These technologies are also an incredible demonstration of deep tech in action. Translating from text to images is a highly multi-modal problem: a user may simply type ‘bird’, but there are many different images of birds that correspond to the text description “bird”.

While traditionally very difficult, this multi-modal understanding is helped by the sequential structure of text, which lets a model predict the next word conditioned on the image as well as the previously predicted words. Such learning has also been made easier by the advancement of GANs (Generative Adversarial Networks), a framework that creates an adaptive, learned loss function well suited to multi-modal tasks such as text-to-image generation.
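To make that “adaptive loss” idea concrete, the sketch below shows how a text-conditional GAN is typically wired up: a discriminator learns to tell real image/caption pairs from generated ones, and the generator is trained to fool it. This is a simplified, PyTorch-style sketch under assumed interfaces (the generator, discriminator and text embeddings are placeholders); it illustrates the GAN framework in general rather than Imagen itself, which, as described below, is a diffusion model.

```python
import torch
import torch.nn.functional as F


def text_conditional_gan_losses(generator, discriminator, real_images, text_embeddings):
    """One training step of a text-conditional GAN objective (illustrative only).

    The discriminator acts as a learned, adaptive critic of image/text pairs;
    the generator is optimised to fool it.
    """
    noise = torch.randn(real_images.size(0), generator.latent_dim)
    fake_images = generator(noise, text_embeddings)

    # Discriminator: real pairs should score high, generated pairs low.
    d_real = discriminator(real_images, text_embeddings)
    d_fake = discriminator(fake_images.detach(), text_embeddings)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator: make the discriminator believe the generated pairs are real.
    g_fake = discriminator(fake_images, text_embeddings)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```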

According to the Imagen website, the way Google’s model works is by using a “large frozen T5-XXL encoder to encode the input text into embeddings. A conditional diffusion model maps the text embedding into a 64×64 image. Imagen further utilizes text-conditional super-resolution diffusion models to upsample the image 64×64→256×256 and 256×256→1024×1024.”
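In plainer terms, the prompt is encoded once, and three generative stages then run in sequence: a base text-to-image diffusion model followed by two super-resolution stages. The Python sketch below traces that data flow using stub classes, since Google has not released any code; every class name, method and value here is an illustrative placeholder rather than Google’s actual API.

```python
import numpy as np


class StubT5Encoder:
    """Stand-in for the frozen T5-XXL text encoder (its weights are never fine-tuned)."""

    def encode(self, prompt: str) -> np.ndarray:
        # A real encoder would return learned text embeddings; we return random
        # numbers purely to show how data flows through the cascade.
        return np.random.rand(4096)


class StubDiffusionModel:
    """Stand-in for a text-conditional diffusion model."""

    def sample(self, text_embedding, resolution, low_res_image=None) -> np.ndarray:
        # A real model would run iterative denoising, conditioned on the text
        # (and, for the super-resolution stages, on the lower-resolution image).
        return np.random.rand(resolution, resolution, 3)


def generate_image(prompt: str) -> np.ndarray:
    """Cascade described on the Imagen page: text -> 64x64 -> 256x256 -> 1024x1024."""
    encoder, base, sr_256, sr_1024 = (StubT5Encoder(), StubDiffusionModel(),
                                      StubDiffusionModel(), StubDiffusionModel())
    embedding = encoder.encode(prompt)                        # frozen T5-XXL encoder
    image_64 = base.sample(embedding, resolution=64)          # base text-to-image model
    image_256 = sr_256.sample(embedding, 256, image_64)       # first super-resolution stage
    image_1024 = sr_1024.sample(embedding, 1024, image_256)   # second super-resolution stage
    return image_1024


print(generate_image("a corgi eating sushi").shape)  # (1024, 1024, 3)
```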

Depending on your disposition, the results can be perceived as a triumph of technology or just another reason to distrust what you view online. In any case, concerns have been raised about how this technology would behave if it were used to depict humans instead of corgis.

That is because (as with much of AI today) the datasets these models are trained on are scraped from the internet. This means they are likely to inherit the biases we see every day across the web, and those biases are in turn likely to be translated into the imagery. For example, if a user were to request an image of a doctor, what would that doctor look like? These are the ongoing concerns in this space, and questions are being raised as to whether racial stereotypes will permeate the technology (as we have seen before).

To Google’s credit, it addresses this on its Imagen webpage (which is also where you can best see the results of the technology): “First, downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.”

The page continues: “Second, the data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups.”

Finally, Google states that “A conceptual vocabulary around potential harms of text-to-image models and established metrics of evaluation are an essential component of establishing responsible model release practices. While we leave an in-depth empirical analysis of social and cultural biases to future work, our small scale internal assessments reveal several limitations that guide our decision not to release our model at this time.”

Either way, with the rise of text-to-image technologies, the old idiom that a picture is worth a thousand words seems outdated. In this new reality, a picture is now worth about a word or two.
