pix2pix - Research Paper Digest

Image-to-Image Translation with Conditional Adversarial Networks

by BAIR | 21 NOV 2016
arXiv: 1611.07004v1

Transformation result


  • One-to-many and many-to-one relationships between images are analogous to language translation in many senses.
  • ▢ Identify underlying patterns in language-modelling work that could be application-agnostic.
  • U-Net preserves low-level information better than a plain autoencoder: input/output image pairs share many low-level details, which an autoencoder bottleneck would lose.
    • The advantage is not specific to cGAN; it also holds with L1 loss.
  • ▢ Do sentences in two languages also share this attribute?
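A toy 1-D sketch of the U-Net point above. This is an illustration only: a real U-Net uses learned convolutions, whereas here the encoder is a fixed pair-average and the decoder a nearest-neighbour upsample (both assumptions made for the demo). It shows that fine detail is destroyed by the bottleneck but is still available to the decoder when the encoder input is concatenated as a skip channel.

```python
def downsample(signal):
    """Encoder step (toy): average neighbouring pairs, halving the length."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Decoder step (toy): nearest-neighbour upsampling, doubling the length."""
    return [value for value in signal for _ in range(2)]


def decode_without_skip(x):
    # Plain encoder-decoder: everything has to pass through the bottleneck.
    return upsample(downsample(x))


def decode_with_skip(x):
    # U-Net style: concatenate the upsampled bottleneck with the matching
    # encoder-side signal, giving two channels per position.
    bottleneck = upsample(downsample(x))
    return list(zip(bottleneck, x))


# A signal made of sharp edges (low-level detail).
edge = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]

plain = decode_without_skip(edge)
with_skip = decode_with_skip(edge)

print(plain)                       # every edge has been averaged away
print([s for _, s in with_skip])   # the skip channel still carries the edges
```

In this toy the bottleneck output is flat, so no later layer could recover the edges from it; with the skip channel the decoder only has to learn to read them back.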



$$\mathcal{L} _{\mathit{cGAN}} (G, D) = \begin{aligned} & \hspace{0.2cm}\mathbb{E} _{x,y \sim p_{data}(x,y)}[\log D(x,y)] \\ + & \hspace{0.2cm} \mathbb{E}_{x\sim p_{data}(x),z \sim p(z)}[\log(1-D(x, G(x,z)))]\end{aligned}$$
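A minimal numeric sketch of the objective above, assuming the two expectations are estimated by averaging over a batch. The function name `cgan_objective` and the example probabilities are made up for illustration; `d_real` stands for D(x, y) on real pairs and `d_fake` for D(x, G(x, z)) on generated pairs.

```python
import math


def cgan_objective(d_real, d_fake):
    """Batch estimate of L_cGAN(G, D).

    d_real: discriminator outputs D(x, y) for real (input, target) pairs.
    d_fake: discriminator outputs D(x, G(x, z)) for generated pairs.
    All values are probabilities strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# D separates real from fake well: the objective is close to its maximum of 0.
confident = cgan_objective(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# D is fully fooled and outputs 0.5 everywhere: the objective is -2 * log(2).
fooled = cgan_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident, fooled)
```

D tries to push this value up, while G tries to push it down by making D(x, G(x, z)) large.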

  • Using a GAN automates the design of the loss function, so developers only need to specify the high-level goal.
  • cGAN gives sharper images and better segmentation than an unconditional GAN because it learns a structured loss.
  • ▢ Predicted translation semantic tree as the backbone for prior conditionals.


  • PixelGAN cannot increase spatial sharpness; PatchGAN yields good results; ImageGAN brings no significant improvement but adds O(n) cost.
  • PatchGAN is useful for scaling to large images.
  • Could be used as a conv. filter to generate
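A small sketch of why a patch discriminator scales with image size. The layer list below is an assumption, not something stated in these notes: it follows the commonly used 70x70 PatchGAN layout of five 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1. The code computes the receptive field of one output unit and the size of the output grid for a given input size.

```python
# (kernel_size, stride) per convolution; an assumed 70x70 PatchGAN layout.
PATCH_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def receptive_field(layers):
    """Input pixels seen by a single output unit, working back from the output."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(n, layers, padding=1):
    """Side length of the output grid for an n x n input."""
    for kernel, stride in layers:
        n = (n + 2 * padding - kernel) // stride + 1
    return n


print(receptive_field(PATCH_LAYERS))    # size of the patch each output judges
print(output_size(256, PATCH_LAYERS))   # real/fake decisions per side at 256 px
print(output_size(512, PATCH_LAYERS))   # same weights, larger image, more patches
```

Because the discriminator is purely convolutional, the same weights apply to any image size; only the number of per-patch real/fake decisions grows.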

Loss Function

  • L1 leads to a narrower distribution than the ground truth. It prefers grayish colors (the median of the color distribution).
  • The discriminator can pick up on grayish outputs as a feature that marks an image as fake.
  • Adversarial loss can push the distribution closer to the ground truth. It performs a kind of "sharpening":
    • Edges -> not blurred (sharp lines)
    • Colors -> not the median (more colorful)
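A toy check of the "L1 prefers the median" point above. The sample values are invented: they stand for one color channel of a single pixel, where 0 is gray and values near -1 or +1 are saturated colors. The sketch searches for the single constant prediction that minimises the total L1 loss over all plausible values.

```python
import statistics


def best_constant(samples, loss):
    """Grid-search the constant prediction in [-1, 1] with the lowest total loss."""
    candidates = [i / 100 for i in range(-100, 101)]
    return min(candidates, key=lambda c: sum(loss(c, s) for s in samples))


# Plausible values for the pixel: mostly saturated, occasionally gray (made up).
plausible = [-0.9, -0.8, -0.7, 0.0, 0.7, 0.8, 0.9]

l1_choice = best_constant(plausible, lambda c, s: abs(c - s))

print(l1_choice)                      # the L1-optimal prediction
print(statistics.median(plausible))   # the median of the plausible values
```

Although most plausible values are saturated, the L1-optimal prediction sits at the gray median, which is why L1-only outputs look desaturated and their color distribution is narrower than the ground truth.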

Color Distribution


  • Evaluation metrics determine the behaviour of both G and D.
  • FCN-8s scores evaluate whether a common segmentation engine can identify the generated images.
  • ▢ Perplexity -> needs approximation functions of the syntax tree (real vs fake).
  • Image GAN is useful for ambiguous tasks (large, highly detailed output space), but plain L1 may be the better choice when the output space is small (segmentation/classification).
  • (Table 1) cGAN generally outperforms GAN. Interestingly, L1+GAN scores slightly higher on per-pixel accuracy.
  • The paper is the first to demonstrate that a GAN can generate discrete labels rather than only continuous-valued variation (images).
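A minimal sketch of the segmentation-style scores mentioned above (per-pixel accuracy, per-class accuracy, class IoU), assuming label maps are flattened into lists of integer class ids. The function names and the example labels are hypothetical; in an FCN-style evaluation, `pred` would be the segmentation network's output on a generated image and `truth` the label map the image was generated from.

```python
def per_pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def per_class_accuracy(pred, truth, classes):
    """Mean over classes of the fraction of that class's pixels labelled correctly."""
    scores = []
    for c in classes:
        total = sum(t == c for t in truth)
        if total:
            hits = sum(p == c and t == c for p, t in zip(pred, truth))
            scores.append(hits / total)
    return sum(scores) / len(scores)


def class_iou(pred, truth, classes):
    """Mean over classes of intersection-over-union."""
    scores = []
    for c in classes:
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        if union:
            scores.append(inter / union)
    return sum(scores) / len(scores)


# Made-up flattened label maps with three classes.
truth = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 0]

pixel_acc = per_pixel_accuracy(pred, truth)
class_acc = per_class_accuracy(pred, truth, classes=[0, 1, 2])
mean_iou = class_iou(pred, truth, classes=[0, 1, 2])
print(pixel_acc, class_acc, mean_iou)
```

Per-pixel accuracy rewards getting large regions right, while class IoU also penalises spilling one class into another, so the two scores can rank methods differently.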

Table 1


highlighted ref:

  • Optimization - Instance normalization. D. Ulyanov, arXiv 1607.08022
  • Conditional GAN - Conditional generative adversarial nets. M. Mirza, arXiv 1411.1784