The last few weekends, I’ve been experimenting with neural style transfer. Style transfer combines two images to create a new image with content similar to one image but in the style of the other.
|Original|Tsunami|The Scream|Primordial Chaos|
This post shows what the technique is doing and includes an implementation heavily based on Keras’s example implementation.
After some nice examples of how it should work in theory, I also note some learnings and weird results.
This technique combines two images: a content image and a style image.
(As usual, the full notebook is on GitHub.)
Start with a pretrained network
Often in deep learning, you need to train a network on a lot of data to get good results. However, this neural style transfer technique starts with a network that has already been trained on huge amounts of data.
I used the VGG19 packaged with Keras, which has been trained on the gigantic ImageNet dataset. On ImageNet, VGG19 was trained to classify what is in an image (e.g. tell that a picture is of a “car”).
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    input_1 (InputLayer)         (None, None, None, 3)     0
    block1_conv1 (Conv2D)        (None, None, None, 64)    1792
    block1_conv2 (Conv2D)        (None, None, None, 64)    36928
    block1_pool (MaxPooling2D)   (None, None, None, 64)    0
    block2_conv1 (Conv2D)        (None, None, None, 128)   73856
    block2_conv2 (Conv2D)        (None, None, None, 128)   147584
    block2_pool (MaxPooling2D)   (None, None, None, 128)   0
    block3_conv1 (Conv2D)        (None, None, None, 256)   295168
    block3_conv2 (Conv2D)        (None, None, None, 256)   590080
    block3_conv3 (Conv2D)        (None, None, None, 256)   590080
    block3_conv4 (Conv2D)        (None, None, None, 256)   590080
    block3_pool (MaxPooling2D)   (None, None, None, 256)   0
    block4_conv1 (Conv2D)        (None, None, None, 512)   1180160
    block4_conv2 (Conv2D)        (None, None, None, 512)   2359808
    block4_conv3 (Conv2D)        (None, None, None, 512)   2359808
    block4_conv4 (Conv2D)        (None, None, None, 512)   2359808
    block4_pool (MaxPooling2D)   (None, None, None, 512)   0
    block5_conv1 (Conv2D)        (None, None, None, 512)   2359808
    block5_conv2 (Conv2D)        (None, None, None, 512)   2359808
    block5_conv3 (Conv2D)        (None, None, None, 512)   2359808
    block5_conv4 (Conv2D)        (None, None, None, 512)   2359808
    block5_pool (MaxPooling2D)   (None, None, None, 512)   0
    =================================================================
    Total params: 20,024,384
    Trainable params: 20,024,384
    Non-trainable params: 0
    _________________________________________________________________
A hand-wavy explanation of why this is useful is that the trained VGG19 network has already learned a good way to represent images. For example, to identify “a car”, maybe the network learned that a car has shiny surfaces, that it has four wheels, and that it is usually on a road. In theory, superficial features like “shiny” are detected by the lower-numbered blocks of the network, while more abstract features, such as neurons for “wheel” and “road”, sit in the higher-numbered blocks. The neural style transfer technique I’m using works by trying to draw the abstract features from the content image, such as “wheels” and “road”, while matching the superficial features from the style image, such as “shiny” or “lots of blue brush strokes”.
To get an idea of those “wheel” neurons, we can look at what VGG19 produces at different layers.
Content image convolutional outputs
Let’s look inside and see how the network represents the content image. I’ll go to various layers and print out what a few features look like.
I’m only printing 3 filters from each layer (for example, block5_conv1 has 512 filters), so this is just a peek into how the network is representing the image.
Style image convolutional outputs
For fun, I can do the same on the style image.
Setting up a loss function
To make an image that combines the two images, I’ll nudge the pixel values of a third image, which I call the “variable image.” To guide the nudging, the paper describes a loss function made up of a “content loss” and a “style loss”. The Keras code adds a “variational loss” (named total variation loss in the Keras example).
The content loss tries to get a deep layer, such as block5_conv2, of the content image to match the same layer of the variable image. Since the deep layers should represent larger, abstract items (“a tower”, “some grass”, “the sky”), the content loss should encourage the resulting image to share the content of the content image.
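As a rough illustration of the content loss, here is a minimal NumPy sketch. It assumes the loss is a plain sum of squared differences between two feature maps, which is how I understand the Keras example to compute it. The random arrays are stand-ins for real VGG19 layer outputs, and the function name is my own.

```python
import numpy as np

def content_loss(content_features, variable_features):
    """Sum of squared differences between two feature maps of the same shape."""
    return float(np.sum((variable_features - content_features) ** 2))

# Stand-in feature maps shaped (height, width, channels). In the real setup
# these would be the outputs of a deep VGG19 layer for each image.
rng = np.random.default_rng(0)
content_features = rng.normal(size=(4, 4, 8))
variable_features = content_features + 0.1  # every value is off by 0.1

identical_loss = content_loss(content_features, content_features)
shifted_loss = content_loss(content_features, variable_features)
```

The loss is zero only when the two feature maps agree everywhere, so minimizing it pushes the variable image toward producing the same deep-layer activations as the content image.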
The style loss tries to get shallower layers, such as block1_conv1, from the style image to match the variable image.
Instead of matching the layer’s values directly, the style loss is based on the Gram matrix of the layer’s vectorized features. That means all information about location is discarded.
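To make the "location is discarded" point concrete, here is a small NumPy sketch of a Gram matrix and a style loss built on it. The normalization constant follows the form I believe the paper and the Keras example use, but treat it as an assumption. The feature map is random stand-in data, not real VGG19 output.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel inner products of a (height, width, channels) feature map."""
    channels = features.shape[-1]
    flat = features.reshape(-1, channels)  # one row per spatial location
    return flat.T @ flat                   # shape (channels, channels)

def style_loss(style_features, variable_features):
    """Squared difference between Gram matrices, scaled by layer size (assumed scaling)."""
    height, width, channels = style_features.shape
    size = height * width
    diff = gram_matrix(style_features) - gram_matrix(variable_features)
    return float(np.sum(diff ** 2) / (4.0 * channels ** 2 * size ** 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 3))

# Shuffle the spatial locations: same feature vectors, different positions.
flat = features.reshape(-1, 3)
shuffled = flat[rng.permutation(len(flat))].reshape(4, 4, 3)

same_gram = np.allclose(gram_matrix(features), gram_matrix(shuffled))
```

Because the Gram matrix sums over every location, moving feature vectors around the image leaves it unchanged. That is why the style loss can match textures and colors without copying the layout of the style image.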
The variational loss penalizes the variable image if pixels next to each other are very different. This smooths the resulting image.
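A simplified NumPy sketch of this penalty is below. It sums squared differences between neighboring pixels. The Keras example uses a slightly different formula, so this is an illustration of the idea and not a copy of that code.

```python
import numpy as np

def total_variation_loss(image):
    """Sum of squared differences between vertically and horizontally adjacent pixels."""
    vertical = image[1:, :, :] - image[:-1, :, :]
    horizontal = image[:, 1:, :] - image[:, :-1, :]
    return float(np.sum(vertical ** 2) + np.sum(horizontal ** 2))

# A perfectly flat image versus pure noise, both shaped (height, width, channels).
flat_image = np.full((8, 8, 3), 0.5)
rng = np.random.default_rng(0)
noisy_image = rng.uniform(size=(8, 8, 3))

flat_penalty = total_variation_loss(flat_image)
noisy_penalty = total_variation_loss(noisy_image)
```

A flat image gets no penalty and a noisy one gets a large penalty, so adding this term to the loss nudges the optimizer toward smoother results.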
Defining which layers to use
Typically, “shallow layers” should have a non-zero style weight, and “deeper layers” should have a non-zero content weight. I set my implementation up so I could configure which layers to use and how much to weight each layer.
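One way such a configuration could look is sketched below. The `LayerWeight` tuple, the `combine_losses` helper, the weights, and the per-layer loss values are all hypothetical names and made-up numbers for illustration. The notebook's own configuration may be organized differently.

```python
from collections import namedtuple

# Hypothetical configuration record: one entry per layer that contributes to the loss.
LayerWeight = namedtuple("LayerWeight", ["layer_name", "content_weight", "style_weight"])

layer_weights = [
    LayerWeight("block1_conv1", content_weight=0.0, style_weight=1.0),  # shallow: style only
    LayerWeight("block2_conv1", content_weight=0.0, style_weight=1.0),  # shallow: style only
    LayerWeight("block5_conv2", content_weight=1.0, style_weight=0.0),  # deep: content only
]

def combine_losses(layer_weights, content_losses, style_losses):
    """Weighted sum of per-layer losses, skipping terms whose weight is zero."""
    total = 0.0
    for lw in layer_weights:
        if lw.content_weight:
            total += lw.content_weight * content_losses[lw.layer_name]
        if lw.style_weight:
            total += lw.style_weight * style_losses[lw.layer_name]
    return total

# Made-up per-layer loss values, just to exercise the weighting.
content_losses = {"block1_conv1": 9.0, "block2_conv1": 9.0, "block5_conv2": 2.0}
style_losses = {"block1_conv1": 0.5, "block2_conv1": 0.25, "block5_conv2": 9.0}

total_loss = combine_losses(layer_weights, content_losses, style_losses)
```

Keeping the weights in one list makes it easy to try a different mix of layers without touching the loss code.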
Now I’ll define a loss_from_layer_weights based on the provided
Initializing the image
My favorite results happened when I initialized with random noise.
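For reference, a random-noise starting image could be built along these lines. This is a sketch that assumes pixel values in the 0 to 255 range and a leading batch dimension. It skips any VGG19-specific preprocessing, and the function name is my own.

```python
import numpy as np

def init_random_noise(height, width, seed=None):
    """Uniform noise image shaped (1, height, width, 3) with values between 0 and 255."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0.0, 255.0, size=(1, height, width, 3))
    return noise.astype("float32")

variable_image = init_random_noise(64, 96, seed=0)
```

The other common choice is to start from a copy of the content image, which tends to keep more of the original layout.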
Fitting the model
With f_loss_and_grads, I have a function that can tell me how far the variable image is from the ideal image. Now I need a way to update the variable image to make it closer to ideal!
There are a few choices of optimizers. The Keras example uses fmin_l_bfgs_b. It needs a function that returns the loss function’s value for an image, and a function that returns the loss function’s gradients for an image.
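Since one call usually computes the loss and the gradients together, a small caching wrapper can serve them as two separate callables. The sketch below is modeled loosely on the pattern in the Keras example. The class and the toy quadratic loss are my own stand-ins, and the wrapper assumes the optimizer asks for the loss at a point just before asking for the gradient there.

```python
import numpy as np

class Evaluator:
    """Caches one loss-and-gradient computation so it can be served as two calls."""

    def __init__(self, loss_and_grads):
        self.loss_and_grads = loss_and_grads
        self._grads = None
        self.calls = 0  # how many times the expensive function actually ran

    def loss(self, x):
        value, grads = self.loss_and_grads(x)
        self.calls += 1
        self._grads = np.asarray(grads, dtype="float64").ravel()
        return float(value)

    def grads(self, x):
        # Assumes loss(x) was called just before this for the same x.
        grads, self._grads = self._grads, None
        return grads

def toy_loss_and_grads(x):
    """Stand-in for the real loss: squared distance to a fixed target."""
    target = np.array([1.0, -2.0, 3.0])
    diff = x - target
    return np.sum(diff ** 2), 2.0 * diff

evaluator = Evaluator(toy_loss_and_grads)
x0 = np.zeros(3)
loss_value = evaluator.loss(x0)
grad_value = evaluator.grads(x0)
```

If SciPy is installed, `evaluator.loss` and `evaluator.grads` are the kind of callables that `scipy.optimize.fmin_l_bfgs_b` accepts as its `func` and `fprime` arguments, with the flattened image as `x0`.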
(I also used BFGS in my logistic regression post!)
Now I run the following code block on a GPU a few times until I get a good image.
<img width=300px src="/assets/2019-03-17-result.png">
Now that I have a better image, I can compare it to the originals. I should see that its content layers match those of the content image, and I might see that its style layers resemble those of the style image.
This project was neat. Besides making images I think are cool, it’s an example of a deep learning project that doesn’t need a lot of custom data: I just need the style image and content image and the VGG19 network weights.
In this post, I showed some cool images, but many of the combinations I tried resulted in muddy images that didn’t look very good. I think there’s a reason certain images (Starry Night, The Scream, Tsunami, Picasso) are so common in Neural Style Transfer posts! There are a lot of parameters to choose, and the scale of the image influences what the layers detect. It’s hard to tell whether a combination would have turned out better with different parameters, or whether the method just doesn’t work for it. For instance, in the example below, some of the weights were off.
Also, some of my favorite results didn’t really copy the style, but just made a cool-looking result (Primordial Chaos, and the Sutro Tower image below).
I really like being able to implement methods from scratch.
So I feel a little weird that this post is mostly the provided Keras example, with different configuration (I like my namedtuples), and some code deleted or moved around.
I started this post by trying to implement Neural Style Transfer in Keras from scratch, working from the paper. My method was slow and the images weren’t turning out very well (looking back, that might be partly because of the technique and partly because I wasn’t running on a GPU). Also, a few things in the implementation felt awkward: for example, I was feeding in the same image in each epoch.
My laptop was pretty slow at generating the images, so I used a GPU in Colaboratory.
- One more time, the provided Keras example