An image can convey a great deal, yet it may also be marred by issues such as motion blur, haze, noise, and low dynamic range. These problems, commonly referred to as degradations in low-level computer vision, can arise from difficult environmental conditions like heat or rain, or from limitations of the camera itself. Image restoration, a core challenge in computer vision, strives to recover a high-quality, clean image from one exhibiting such degradations. It is an ill-posed inverse problem: for any given degraded image, there may be multiple plausible reconstructions. Some approaches target specific degradations, such as reducing noise or removing blur or haze.
While these methods can yield good results for particular issues, they often fail to generalize across different types of degradation. Many frameworks employ a generic neural network for a wide range of image restoration tasks, but each network is trained separately. The need for a different model for each type of degradation makes this approach computationally expensive and time-consuming, which has led recent work to focus on All-In-One restoration models. These models use a single, deep blind restoration model that addresses multiple levels and types of degradation, often employing degradation-specific prompts or guidance vectors to improve performance. Although All-In-One models typically show promising results, they still struggle with the inherent ambiguity of inverse problems.
InstructIR represents a groundbreaking approach in the field, being the first image restoration framework designed to guide the restoration model through human-written instructions. It can process natural language prompts to recover high-quality images from degraded ones, considering various degradation types. InstructIR sets a new standard in performance for a broad spectrum of image restoration tasks, including deraining, denoising, dehazing, deblurring, and enhancing low-light images.
This article covers the InstructIR framework in depth: we explore its mechanism, methodology, and architecture, along with its comparison against state-of-the-art image restoration frameworks. So let’s get started.
Image restoration is a fundamental problem in computer vision since it aims to recover a high-quality, clean image from an image that exhibits degradations. In low-level computer vision, degradation is a term used to describe unpleasant effects observed within an image, such as motion blur, haze, noise, low dynamic range, and more. Image restoration is a complex inverse problem because there may be multiple valid solutions for restoring any given image. Some frameworks focus on specific degradations, such as reducing noise (denoising), removing blur (deblurring), or clearing haze (dehazing).
Recent deep learning methods have displayed stronger and more consistent performance than traditional image restoration methods. These models use neural networks based on Transformers and Convolutional Neural Networks, which can be trained independently for diverse image restoration tasks and are able to capture and enhance local and global feature interactions, resulting in satisfactory and consistent performance. Although some of these methods work adequately for specific types of degradation, they typically do not extrapolate well to other types. Furthermore, while many existing frameworks use the same neural network for a multitude of image restoration tasks, each network is trained separately for its task. Employing a separate neural model for every conceivable degradation is thus impractical and time-consuming, which is why recent image restoration frameworks have concentrated on All-In-One restoration models.
All-In-One, multi-degradation, or multi-task image restoration models are gaining popularity in the computer vision field since they are capable of restoring multiple types and levels of degradation in an image without training a separate model for each degradation. All-In-One image restoration models use a single deep blind image restoration model to tackle different types and levels of image degradation. Different All-In-One models implement different approaches to guide the blind model in restoring the degraded image: for example, an auxiliary model that classifies the degradation, or multi-dimensional guidance vectors or prompts that help the model restore different types of degradation within an image.
With that being said, we arrive at text-based image manipulation, which has been implemented by several frameworks in the past few years for text-to-image generation and text-based image editing tasks. These models often utilize text prompts describing actions or images, along with diffusion-based models, to generate the corresponding images. The main inspiration for the InstructIR framework is the InstructPix2Pix framework, which lets the model edit an image using user instructions that tell it what action to perform, instead of text labels, descriptions, or captions of the input image. As a result, users can use naturally written text to instruct the model without providing sample images or additional image descriptions.
Building on these foundations, the InstructIR framework is the first computer vision model that employs human-written instructions to achieve image restoration and solve inverse problems. Given natural language prompts, the InstructIR model can recover high-quality images from their degraded counterparts while taking multiple degradation types into account. The InstructIR framework delivers state-of-the-art performance on a wide array of image restoration tasks, including image deraining, denoising, dehazing, deblurring, and low-light image enhancement. In contrast to existing works that achieve image restoration using learned guidance vectors or prompt embeddings, the InstructIR framework employs raw user prompts in text form. The framework generalizes to restoring images using human-written instructions, and its single all-in-one model covers more restoration tasks than earlier models. The following figure demonstrates diverse restoration samples from the InstructIR framework.
InstructIR: Method and Architecture
At its core, the InstructIR framework consists of a text encoder and an image model. It uses NAFNet, an efficient image restoration model that follows a U-Net architecture, as the image model. Furthermore, it implements task routing techniques to successfully learn multiple tasks with a single model. The following figure illustrates the training and evaluation approach of the InstructIR framework.
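To make this two-component design concrete, here is a minimal PyTorch sketch of the overall flow. Both submodules are placeholders standing in for the real components, so this illustrates the composition rather than the official implementation.

```python
import torch
import torch.nn as nn

class InstructIRSketch(nn.Module):
    """High-level sketch: a sentence encoder turns the instruction into a
    fixed-size embedding, and a U-Net-style image model (NAFNet in the paper)
    restores the image conditioned on that embedding."""

    def __init__(self, text_encoder: nn.Module, image_model: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder  # frozen encoder + small trainable projection
        self.image_model = image_model    # NAFNet-style U-Net with instruction conditioning

    def forward(self, degraded: torch.Tensor, instruction_tokens: torch.Tensor):
        text_emb = self.text_encoder(instruction_tokens)  # (B, d) instruction embedding
        return self.image_model(degraded, text_emb)       # restored image
```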
Drawing inspiration from the InstructPix2Pix model, the InstructIR framework adopts human-written instructions as the control mechanism, since the user does not need to provide any additional information. These instructions offer an expressive and clear way to interact, allowing users to point out the exact location and type of degradation in the image. Furthermore, using free-form user prompts instead of fixed degradation-specific prompts enhances the usability and applications of the model, since it can also be used by users who lack domain expertise. To equip InstructIR with the capability of understanding diverse prompts, the authors used GPT-4, a large language model, to create diverse requests, with ambiguous and unclear prompts removed in a filtering step.
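For a feel of what such training prompts look like, here are a few hypothetical instruction-to-task pairs written in the spirit of the GPT-4-generated dataset; the actual prompts in the paper's dataset differ.

```python
# Hypothetical instruction/task pairs illustrating the style of the
# GPT-4-generated training prompts; the real dataset is far larger
# and filtered to remove ambiguous requests.
INSTRUCTION_EXAMPLES = {
    "denoising": "Please get rid of the grainy specks all over my photo.",
    "deblurring": "My picture came out blurry, can you sharpen it?",
    "dehazing": "Remove the fog so the mountains are clearly visible.",
    "deraining": "Take the rain streaks out of this shot.",
    "low-light enhancement": "The photo is way too dark, brighten it up.",
}
```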
Text Encoder
A text encoder is used by language models to map user prompts to a text embedding, a fixed-size vector representation. Traditionally, the text encoder of a CLIP model is a vital component of text-based image generation and manipulation models for encoding user prompts, since CLIP excels at aligning text with visual content. However, user prompts describing degradations usually contain little to no visual content, so a large CLIP encoder would add significant computational overhead for such tasks without a corresponding benefit. To tackle this issue, the InstructIR framework opts for a text-based sentence encoder that is trained to encode sentences into a meaningful embedding space. Sentence encoders are pre-trained on millions of examples, yet they are compact and efficient compared to traditional CLIP-based text encoders while still being able to encode the semantics of diverse user prompts.
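As a quick illustration, the snippet below encodes a couple of user prompts with a compact sentence encoder via the sentence-transformers library. The specific checkpoint name is an assumption for this example (the framework uses a compact BGE-family encoder, as discussed below).

```python
# Encode user instructions into fixed-size embeddings with a compact
# sentence encoder; the checkpoint is illustrative, not the paper's exact one.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
prompts = [
    "Remove the noise from my photo.",
    "My image is too dark, can you fix it?",
]
embeddings = encoder.encode(prompts)
print(embeddings.shape)  # (2, 384) for this checkpoint
```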
Text Guidance
A major aspect of the InstructIR framework is the use of the encoded instruction as a control mechanism for the image model. Building on this, and inspired by task routing for multi-task learning, the InstructIR framework proposes an Instruction Condition Block (ICB) to enable task-specific transformations within the model. Conventional task routing applies task-specific binary masks to channel features; however, since the InstructIR framework does not know the degradation in advance, this technique cannot be applied directly. Instead, given the image features and the encoded instruction, the framework produces a routing mask with a linear layer activated by the Sigmoid function, yielding a set of per-channel weights conditioned on the text embedding, i.e., a c-dimensional per-channel soft mask. The model further enhances the conditioned features using a NAFBlock, and applies the NAFBlock together with the Instruction Condition Block to condition the features at both the encoder and decoder blocks.

Although the InstructIR framework does not condition the neural network filters explicitly, the mask enables the model to select the channels most relevant to the image content and the instruction.
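A minimal PyTorch sketch of this routing idea follows. The NAFBlock is stood in for by a simple convolutional block, so this demonstrates the gating mechanism rather than reproducing the paper's exact block.

```python
import torch
import torch.nn as nn

class InstructionConditionBlockSketch(nn.Module):
    """Sketch of the ICB routing idea: a linear layer with a sigmoid turns
    the instruction embedding into per-channel weights that gate the image
    features, and a block (NAFBlock in the paper, a placeholder conv here)
    refines the conditioned features."""

    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.to_mask = nn.Linear(text_dim, channels)
        # Placeholder for the NAFBlock used in the actual architecture.
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
        )

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Per-channel soft routing mask in [0, 1], shaped (B, C, 1, 1).
        mask = torch.sigmoid(self.to_mask(text_emb))[:, :, None, None]
        return self.block(feats * mask)
```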
InstructIR: Implementation and Results
The InstructIR model is trainable end to end, and the image model does not require pre-training; only the text embedding projection and the classification head need to be trained. The text encoder is initialized from a BGE encoder, a BERT-like encoder pre-trained on a massive amount of supervised and unsupervised data for general-purpose sentence encoding. The InstructIR framework uses the NAFNet model as its image model; the NAFNet architecture consists of a four-level encoder-decoder with a varying number of blocks at each level, plus four middle blocks between the encoder and the decoder to further enhance the features. Furthermore, instead of concatenation, the skip connections in the decoder use addition, and the Instruction Condition Block (ICB) is applied for task routing only in the encoder and decoder.

The model is optimized using a reconstruction loss between the restored image and the ground-truth clean image, together with a cross-entropy loss for the intent classification head of the text encoder. Training uses the AdamW optimizer with a batch size of 32 and a learning rate of 5e-4 for nearly 500 epochs, with cosine annealing learning rate decay. Since the image model comprises only 16 million parameters and there are only about 100 thousand learned text projection parameters, the InstructIR framework can be easily trained on standard GPUs, reducing computational costs and increasing applicability.
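Putting the reported recipe together, here is a minimal training-loop sketch. The `model` and `train_loader` names are assumptions standing in for the actual pipeline, and the L1 reconstruction loss is a common choice for this kind of objective rather than a detail confirmed above.

```python
# Training-loop sketch matching the reported recipe: AdamW, lr 5e-4,
# batch size 32, cosine annealing over ~500 epochs. `model` and
# `train_loader` are assumed to exist; the loss choice is illustrative.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
criterion = torch.nn.L1Loss()  # reconstruction loss vs. the clean image

for epoch in range(500):
    for degraded, clean, instruction_emb in train_loader:
        restored = model(degraded, instruction_emb)
        loss = criterion(restored, clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```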
Multiple Degradation Results
For multiple degradations and multi-task restoration, the InstructIR framework defines two initial setups (summarized in the snippet after this list):
- 3D for three-degradation models to tackle degradation issues like dehazing, denoising, and deraining.
- 5D for five-degradation models to tackle image denoising, deblurring, dehazing, deraining, and low-light enhancement.
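For reference, the two setups amount to the following task lists; the names here are informal labels for this article, not identifiers from the codebase.

```python
# Informal summary of the two multi-task setups described above.
SETUPS = {
    "3D": ["dehazing", "denoising", "deraining"],
    "5D": ["dehazing", "denoising", "deraining",
           "deblurring", "low-light enhancement"],
}
```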
The performance of the 5D models is demonstrated in the following table, which compares them with state-of-the-art image restoration and all-in-one models.
As can be observed, the InstructIR framework, with a simple image model and just 16 million parameters, successfully handles five different image restoration tasks thanks to the instruction-based guidance, and delivers competitive results. The following table demonstrates the performance of the framework in the 3D setup, with results comparable to those above.
The main highlight of the InstructIR framework is instruction-based image restoration, and the following figure demonstrates the model's ability to understand a wide range of instructions for a given task. Also, when given an adversarial instruction, the InstructIR model performs an identity mapping, a behavior that was not explicitly forced during training.
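To illustrate what instruction-driven inference looks like in practice, the following hypothetical snippet runs the same trained network with different prompts. Here `model` refers to the trained InstructIR-style network from the earlier sketches, the encoder checkpoint is the same illustrative one used above, and the file path is a placeholder.

```python
# Hypothetical inference sketch: one set of weights, behavior routed by the prompt.
import torch
from sentence_transformers import SentenceTransformer
from torchvision.io import read_image

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # illustrative checkpoint
degraded = read_image("photo.png").float().unsqueeze(0) / 255.0  # placeholder path

for prompt in [
    "Remove the rain streaks from this picture.",
    "Sharpen this blurry shot.",
    "It's too dark, brighten the photo.",
]:
    emb = torch.from_numpy(encoder.encode([prompt]))  # (1, d) instruction embedding
    with torch.no_grad():
        restored = model(degraded, emb)  # same model, different restoration behavior
```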
Final Thoughts
Image restoration is a fundamental problem in computer vision: it aims to recover a high-quality, clean image from a degraded one, where degradation refers to unpleasant effects such as motion blur, haze, noise, and low dynamic range. In this article, we have covered InstructIR, the first image restoration framework that guides the restoration model using human-written instructions. Given natural language prompts, the InstructIR model recovers high-quality images from their degraded counterparts while accounting for multiple degradation types, and it delivers state-of-the-art performance across a wide array of image restoration tasks, including deraining, denoising, dehazing, deblurring, and low-light image enhancement.