Diffusion processes are best known from text-to-image models. The idea behind them is rather simple: take an image and gradually destroy its information by adding Gaussian noise. A diffusion model then learns to reconstruct the original information, i.e., to remove the noise added between two consecutive steps. Fully trained, such models can create photo-realistic images from pure noise.
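To make this concrete, here is a minimal sketch of one DDPM-style training step, under the usual assumptions (a linear noise schedule and a `denoiser` network that predicts the added noise); it is meant to illustrate the general idea, not the exact setup used in our work.

```python
import torch

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, 0)  # cumulative signal-retention factors

def training_step(denoiser, x0):
    """x0: clean images, shape (B, C, H, W); denoiser: any network predicting the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                      # random diffusion step per sample
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    # Forward process: gradually destroy the information with Gaussian noise.
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The model learns to undo this, i.e. to predict the noise that was added.
    pred = denoiser(x_t, t)
    return torch.nn.functional.mse_loss(pred, noise)
```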
To create an image from text, the diffusion process has to be conditioned on a text prompt. Conditioning is a way to provide additional information that guides the generative process.
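One common way to realize such conditioning (an assumption for illustration, not necessarily the mechanism used in our model) is to project the conditioning signal into the same space as the timestep embedding and add the two before they enter the denoiser backbone:

```python
import torch
import torch.nn as nn

class ConditionInjector(nn.Module):
    """Projects a conditioning vector and adds it to the timestep embedding (hypothetical helper)."""
    def __init__(self, cond_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, embed_dim)

    def forward(self, t_embed: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # t_embed: (B, embed_dim) timestep embedding, cond: (B, cond_dim) conditioning signal
        return t_embed + self.proj(cond)
```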
While generative models are not that common in Earth observation, we wondered whether we could use a diffusion process to pretrain our models in a self-supervised fashion.
So we trained a diffusion model on our ben-ge dataset. We feed the Sentinel-1 and Sentinel-2 raster data as different bands into the model and experiment with the other data modalities as input for the conditioning mechanism.
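The sketch below shows how such multi-modal inputs could be assembled; the channel counts and the choice of non-raster modalities (elevation, climate zone) are assumptions for illustration, not the exact ben-ge layout used in the paper.

```python
import torch

def build_diffusion_inputs(s1, s2, elevation, climate_zone_onehot):
    """
    s1:  Sentinel-1 SAR bands, e.g. (B, 2, H, W) for VV/VH (assumed)
    s2:  Sentinel-2 bands,     e.g. (B, 12, H, W)          (assumed)
    elevation:           (B, 1) scalar summary per patch   (assumed)
    climate_zone_onehot: (B, K) one-hot climate class      (assumed)
    """
    # Raster modalities are stacked along the channel axis and denoised jointly,
    # so the model has to keep S1 and S2 consistent for the same scene.
    x0 = torch.cat([s1, s2], dim=1)                        # (B, 14, H, W)
    # Non-raster modalities become a conditioning vector for the denoiser.
    cond = torch.cat([elevation, climate_zone_onehot], dim=1)
    return x0, cond
```

Combined with a denoiser that accepts the conditioning vector, for instance via the injection scheme sketched above, training then proceeds exactly as in the unconditional training step.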
Did it work?

As the figure above shows, the model is definitely able to generate realistic-looking Sentinel-2 scenes. More importantly, the Sentinel-1 SAR data that it generates for the same scene is highly consistent with the Sentinel-2 data, which supports the notion that the model learns useful information. But are the learned representations useful for downstream tasks such as land use/land cover classification and segmentation?
The answer seems clear: diffusion pretraining works well for the segmentation task, but less so for the classification task. Furthermore, conditioning the model seems to improve the results.
For more details, please consult our contribution to IGARSS 2024 (see below).
Resources
- Alexander Lontke, Michael Mommert, Damian Borth, "Multi-Modal Diffusion for Self-Supervised Pretraining", IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2024.