Neural style transfer
For the last few weekends, I’ve been experimenting with neural style transfer. Style transfer combines two images to create a new image with the content of one image rendered in the style of the other.
For example, using the technique from the paper A Neural Algorithm of Artistic Style and code from Keras’s example implementation, I can produce stylized images of Mt. Hood.
Original | Tsunami | The Scream | Primordial Chaos |
---|---|---|---|
This post shows what the technique is doing and includes an implementation heavily based on Keras’s example implementation.
After some nice examples of how it works in theory, I also note some lessons learned and some weird results.
Source images
This technique combines two images: a content image and a style image.
(As usual, the full notebook is on github.)
Start with a pretrained network
Often in deep learning, you need to train a network on a lot of data to get good results. However, this neural style transfer technique starts with a network that has already been trained on huge amounts of data.
I used the VGG19 model packaged with Keras, which has been trained on the gigantic ImageNet dataset. VGG19 was trained on ImageNet to classify what is in an image (e.g. to tell that a picture is of a “car”).
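Loading the pretrained network is a one-liner with Keras’s applications module; a minimal sketch that produces the summary shown below:

```python
from keras.applications import vgg19

# Load VGG19 with ImageNet weights. Dropping the classification head
# (include_top=False) keeps only the convolutional blocks and lets the
# model accept images of any size.
model = vgg19.VGG19(weights='imagenet', include_top=False)
model.summary()
```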
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, None, None, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, None, None, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, None, None, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, None, None, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, None, None, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, None, None, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, None, None, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, None, None, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, None, None, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, None, None, 256) 590080
_________________________________________________________________
block3_conv4 (Conv2D) (None, None, None, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, None, None, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, None, None, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block4_conv4 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, None, None, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_conv4 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, None, None, 512) 0
=================================================================
Total params: 20,024,384
Trainable params: 20,024,384
Non-trainable params: 0
_________________________________________________________________
VGG19 examples
A hand-wavy explanation of why this is useful is that the trained VGG19 network has already learned a good way to represent images. For example, to identify “a car”, maybe the network learned that a car has shiny surfaces (and superficial features like “shiny” are, in theory, detected by the earlier blocks of the network) and that a car has four wheels and is usually on a road (and more abstract features, such as neurons for “wheel” and for “road”, live in the later blocks of the network). The neural style transfer technique I’m using works by trying to reproduce the abstract features of the content image, such as “wheels” and “road”, while matching the superficial features of the style image, such as “shiny” or “lots of blue brush strokes”.
To get an idea of those “wheel” neurons, we can look at what VGG19 produces at different layers.
Content image convolutional outputs
Let’s look inside and see how the network represents the content image. I’ll go to various layers and print out what a few features look like.
I’m only printing 3 filters from each layer (for example, block5_conv1 has 512 filters), so this is just a peek into how the network is representing the image.
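Here is a sketch of how those intermediate outputs can be pulled out of the network, assuming the backend-function style used by the Keras example (the image path and layer name are placeholders):

```python
import numpy as np
from keras import backend as K
from keras.applications import vgg19
from keras.preprocessing.image import load_img, img_to_array

model = vgg19.VGG19(weights='imagenet', include_top=False)

def layer_outputs(image_path, layer_name, n_filters=3):
    # Preprocess the image the way VGG19 expects (BGR, mean-subtracted).
    img = img_to_array(load_img(image_path))
    img = vgg19.preprocess_input(np.expand_dims(img, axis=0))

    # Build a backend function from the model input to the chosen layer.
    layer = model.get_layer(layer_name).output
    f = K.function([model.input], [layer])
    features = f([img])[0]                 # shape: (1, height, width, channels)
    return features[0, :, :, :n_filters]   # just a few filters to plot

# e.g. layer_outputs('content.jpg', 'block5_conv1')
```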
Style image convolutional outputs
For fun, I can do the same on the style image.
Setting up a loss function
In order to make an image that combines the images, I’ll have the network nudge the pixel values of a third image, which I call the “variable image.” To guide how to nudge the image, the paper describes a loss function based on a “content loss” and a “style loss”. The Keras code adds a “variational loss.”
Content loss
The content loss tries to get a deep layer of the variable image, such as block5_conv2, to match the same layer computed on the content image. Since deep layers should represent larger, abstract items (“a tower”, “some grass”, “the sky”), the content loss should encourage the resulting image to share the content image’s content.
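In code, this is just the sum of squared differences between the two layers’ activations; a sketch following the Keras example:

```python
from keras import backend as K

def content_loss(content_features, variable_features):
    # Both tensors are activations of the same deep layer (e.g. block5_conv2),
    # one computed on the content image and one on the variable image.
    return K.sum(K.square(variable_features - content_features))
```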
Style loss
The style loss tries to get shallower layers of the variable image, such as block1_conv1, to match those of the style image. Instead of matching the layer values directly, the style loss is based on the Gram matrix of the layer’s vectorized feature maps. That means all information about location is squashed away.
(Aside: I wrote a little about another use of the Gram matrix here.)
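A sketch of the Gram-matrix-based style loss, following the Keras example (channels-last layout assumed; each feature tensor is the layer’s activations for a single image, with shape (height, width, channels)):

```python
from keras import backend as K

def gram_matrix(features):
    # Flatten the spatial dimensions so each row holds one channel's
    # activations, then take inner products between channels. Any
    # information about *where* a feature occurred is squashed away.
    flattened = K.batch_flatten(K.permute_dimensions(features, (2, 0, 1)))
    return K.dot(flattened, K.transpose(flattened))

def style_loss(style_features, variable_features, img_height, img_width):
    channels = 3
    size = img_height * img_width
    s = gram_matrix(style_features)
    v = gram_matrix(variable_features)
    return K.sum(K.square(s - v)) / (4.0 * (channels ** 2) * (size ** 2))
```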
Variational loss
The variational loss penalizes the variable image if pixels next to each other are very different. This smooths the resulting image.
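A matching sketch of the total variation loss from the Keras example (x is the variable image tensor with shape (1, height, width, 3)):

```python
from keras import backend as K

def total_variation_loss(x, img_height, img_width):
    # Compare each pixel with its neighbors one step down and one step right.
    a = K.square(x[:, :img_height - 1, :img_width - 1, :] -
                 x[:, 1:, :img_width - 1, :])
    b = K.square(x[:, :img_height - 1, :img_width - 1, :] -
                 x[:, :img_height - 1, 1:, :])
    return K.sum(K.pow(a + b, 1.25))
```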
Defining which layers to use
Typically, “shallow layers” should have a non-zero style weight, and “deeper layers” should have a non-zero content weight. I set my implementation up so I could configure which layers to use and how much to weight each layer.
Now I’ll define a loss_from_layer_weights function based on the provided layer_weights.
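A minimal sketch of what that configuration and function might look like, built from the loss pieces above (the LayerWeight namedtuple and the specific weights are illustrative, not the exact values I used):

```python
from collections import namedtuple
from keras import backend as K

LayerWeight = namedtuple('LayerWeight', ['layer_name', 'content_weight', 'style_weight'])

# Shallow layers get a style weight, deep layers get a content weight.
layer_weights = [
    LayerWeight('block1_conv1', content_weight=0.0, style_weight=1.0),
    LayerWeight('block2_conv1', content_weight=0.0, style_weight=1.0),
    LayerWeight('block5_conv2', content_weight=1.0, style_weight=0.0),
]

def loss_from_layer_weights(layer_weights, features_by_layer, variable_image,
                            total_variation_weight, img_height, img_width):
    # features_by_layer maps a layer name to its (content, style, variable)
    # feature tensors.
    loss = K.variable(0.0)
    for lw in layer_weights:
        content_f, style_f, variable_f = features_by_layer[lw.layer_name]
        if lw.content_weight:
            loss = loss + lw.content_weight * content_loss(content_f, variable_f)
        if lw.style_weight:
            loss = loss + lw.style_weight * style_loss(
                style_f, variable_f, img_height, img_width)
    return loss + total_variation_weight * total_variation_loss(
        variable_image, img_height, img_width)
```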
Initializing the image
My favorite results happened when I initialized with random noise.
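That initialization is just a uniform noise array, roughly in the range of VGG19-preprocessed pixel values (a sketch; img_height and img_width are the working image dimensions):

```python
import numpy as np

# Start the variable image from random noise instead of the content image.
x = np.random.uniform(-128.0, 128.0, (1, img_height, img_width, 3))
```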
Fitting the model
With f_loss_and_grads, I have a function that can tell me how far the variable image is from the ideal image. Now I need a way to update the variable image to make it closer to ideal!
There are a few choices of optimizers. The Keras example uses scipy’s fmin_l_bfgs_b. It needs a function that returns the loss function’s value for an image, and a function that returns the loss function’s gradients for an image.
(I also used BFGS in my logistic regression post!)
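A condensed sketch of how f_loss_and_grads and those two wrappers might look (loss, variable_image, img_height, and img_width are assumed to be defined as above; the Keras example instead wraps this in an Evaluator class that caches the values so the network only runs once per step):

```python
import numpy as np
from keras import backend as K

# One backend call returns both the loss value and its gradients
# with respect to the variable image.
grads = K.gradients(loss, variable_image)[0]
f_loss_and_grads = K.function([variable_image], [loss, grads])

def loss_for_scipy(x):
    x = x.reshape((1, img_height, img_width, 3))
    return f_loss_and_grads([x])[0]

def grads_for_scipy(x):
    x = x.reshape((1, img_height, img_width, 3))
    return f_loss_and_grads([x])[1].flatten().astype('float64')
```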
Now I run the following code block on a GPU a few times until I get a good image.
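A sketch of that loop (the iteration count is arbitrary, and deprocess_image is the Keras example’s helper for undoing the VGG19 preprocessing):

```python
from scipy.optimize import fmin_l_bfgs_b

x = x.flatten()  # start from the random-noise image above
for i in range(20):
    x, min_loss, info = fmin_l_bfgs_b(
        loss_for_scipy, x, fprime=grads_for_scipy, maxfun=20)
    print('Iteration %d, loss: %.2f' % (i, min_loss))

# Undo the VGG19 preprocessing before displaying or saving the result.
result = deprocess_image(x.copy().reshape((img_height, img_width, 3)))
```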
<img width=300px src="/assets/2019-03-17-result.png">
Results
Now that I have an optimized image, I can compare it to the originals. I should see that the content layers match the content image, and I might notice some similarity to the style image.
Etc
This project was neat. Besides making images I think are cool, it’s an example of a deep learning project that doesn’t need a lot of custom data: I just need the style image and content image and the VGG19 network weights.
Failed images
In this post, I showed some cool images, but many of the combinations I tried resulted in muddy images that didn’t look very good. I think there’s a reason certain images (Starry Night, The Scream, Tsunami, Picasso) are so common in neural style transfer posts! There are a lot of parameter choices, and the scaling of the image influences what the layers detect. It’s hard to tell whether a combination would have turned out better if I had tweaked the parameters just right, or whether the method simply wouldn’t work for it. For example, in the image below, some parameters weren’t weighted right.
Oh no, the Uppsala cathedral is a ghost. style, content
Also some of my favorite results didn’t really copy the style, but just made a cool looking result (Primordial Chaos, and the Sutro Tower image below).
Code
I really like being able to implement methods from scratch.
So I feel a little weird that this post is mostly the provided Keras example, with different configuration (I like my namedtuples) and some code deleted or moved around.
I started this post by trying to implement neural style transfer in Keras from scratch, based on the paper. My method was slow and the images weren’t turning out very well (looking back, that might be partly due to the technique and partly due to not running on a GPU). A few things in that implementation also felt awkward: for example, I was feeding in the same image in each epoch.
GPUs
My laptop was pretty slow at generating the images, so I used a GPU in Colaboratory.
Other posts
- One more time, the provided Keras example