Oh, I wrote a masters thesis! I’m hoping to post the thesis later, but today I wanted to post about a cool system I built for training and organize a bunch of neural network models I trained for my thesis.
While the neural network theory is fun, a lot of the work in deep learning projects is tuning hyperparameters and juggling models. I built a system to help train models and manage results. I thought it was pretty cool! But I also think I would rewrite parts of the system before using it for a future project. So to aid the development of future systems, I wanted to summarize some of my favorite ideas from this system. (In a similar way, Daisy is built using my favorite ideas from the first system I built earlier this year!)
Eh, I’m still working on creative naming. The name Daisy is because I was reading The Great Gatsby!
Here is how I will use a few terms:
- model: Deep learning usually involves learning some \( f \) in \( y = f(x) \), where \( x \) is an input (usually some multi-dimensional vector representing the features) and \( y \) is an output (one or more predicted classes or numerical values.)
- architecture: I’m using architecture to represent everything about the model before the model is trained. In this case, the “architecture” includes, for example, if the neural network uses convolutional layers, where exactly dropout is used, which loss function it uses to train, whether it uses multi-task learning, which activation functions, and so on.
- framework: A deep learning software package that trains models and runs models on input. For example, Nematus, fairseq, or tensor2tensor. The framework usually has an executable for training a model that takes in options to choose which architecture to use.
- architecture configuration: metadata needed to define an architecture. Usually consists of a reference to the framework, which parameters to call it with, and the dataset to train on. I use the architecture configuration to provide reproducibility: it makes it easy to track down how to produce the model.
Given an architecture configuration, the Daisy interface tells the framework to train a model that uses that architecture. Daisy helps monitor the model training. When the framework has finished training a model, Daisy then provides an interface to manage the trained model and information about its performance. Because of the systematic way Daisy organizes the folder structure, I can easily load the files Daisy produces and analyze the models’ results.
Defining the architecture configuration
Daisy uses a YAML file to specify the framework, dataset, and architecture. An abbreviated example of a config file is shown below:
I’ll show how the configuration is used below.
I really like using structured, human-readable data. Since the architecture configuration already data, it’s easy to use code to compare models, to modify architecture configurations, and to generate the command to run. I usually write a template and then generate several configurations based on it. For example, I can generate configurations for several architectures that use different kernel sizes.
I commit the architecture configurations to a Git repo.
Training the model
Once the architecture is defined using a configuration file, I can launch a job to train the model with that architecture. This stage stores information that will be useful for post-processing, including the trained model, the output from the job, and the model’s predictions.
I train the models on one of the university’s GPU clusters. Each model is identified by the filename of the configuration file and an ID that is the time the model training was launched.
Example command walkthrough
In this example, the config is called
2018_07_21_cs_finnish_k6_f200.yaml. Daisy first creates a folder for the model using the config_id, framework, and timestamp, called
r1532308429__nematus__2018_07_21_cs_finnish_k6_f200. It then generates a command for the framework specified.
Below is the Daisy-generated command based on the Nematus framework, the Daisy configuration used to define an architecture, and the Daisy folder structure that is produced.
The annotations are explained below.
- The Python to use is defined per-framework and per-environment in an environment configuration file. This way I can tell it to use conda with Python 2 for Nematus on the GPU, Python 3 for FairSeq on the GPU, or to use
- One of the cool things about Daisy is that it provides the same interface for different frameworks.
When launching a job, the system looks up a special script associated with the framework. I had wrapper scripts called
fairseq.py. Each defines the command to use for training and validation. In the above image,
nematus.pygenerates the template highlighted in purple and fills in with the appropriate values as described in this list.
- The Nematus framework is set up to output to its
nematus.pygives Nematus the folder
model_out, which Daisy gives it for this purpose.
- To generate the dataset location, Daisy combines the folder from the environmental configuration file (
~/data), with the architecture task description (
finnish), and the framework wrapper (
train-source). See below for more thoughts on the datasets.
- The architecture configuration defines the options and flags that should be sent into the program. A few might be preprocessed in
nematus.py, such as those in 6.
- Borrowing from my first system Arcadio, I used YAML to specify a convolutional component that I added to the Nematus framework for my thesis.
nematus.pyknew to convert this parameter into JSON for Nematus.
- The output from the framework is stored in the
daisyfolder. Updates from Daisy are logged in
daisy/job.outfile (the current Slurm job id, when Daisy starts validating, and the model accuracy.) When the model starts training, Daisy saves copy of the architecture configuration to
daisy/config.json. When the network finishes training, Daisy saves a copy of the predictions for the models in
One of my favorite features was a script that helped me monitor the progress of model training. For example, the image above shows the interface displaying the last 5 models trained. Below I describe different components.
- The config id, which helps identify which job is running.
- The model id includes the timestamp of when the model began training, so I write it as a human-readable timestamp.
- I can extract details from the
config.jsonfile. In this example, I’ve extracted the dataset. I had the best accuracies for each dataset on the wall behind my desk, so this was useful for initially checking how the accuracy of the model compares.
- I also compute and show the accuracy of the latest model. (For slower runs, I could compute the accuracy for the model from each epoch. For quicker runs, I only computed the accuracy using the best model.)
- The GPU cluster used Slurm to manage jobs. I show the latest job ID associated with the model. As part of displaying this interface, I also check the running jobs in Slurm and highlight which of my jobs are currently running. I can also use the job id to check which command Slurm is running.
- I had a few additional scripts that use the model’s folder. For example, I had an alias for tailing the training log.
- I added additional hints as needed. I think in the future, I would move this code into
When I’ve confirmed the on-GPU post-processing steps are done, I’ll commit and archive the model. Committing means moving the daisy folder (
config.json, predictions, training log) to my Git repo.
Since the Daisy folder usually contained all that I needed about a model, I usually deleted the trained model. If I wanted to explore a model more (like plot the values of the attention mechanisms), I would transfer the trained model to my local machine too.
After commiting the Daisy folders to the Git repo, I pull them onto my local machine for post-processing using Jupyter Lab and Pandas. Below I’ll show a few neat things I could do with the model information from Daisy.
Displaying all models for a dataset
One view of the model performance was to list all models for a given dataset ordered by accuracy. I visualized this by displaying a table using Markdown in iPython and highlighted the baseline.
The above image identifies a model by its config’s filename. However, while I’m using somewhat human-readable config_ids to describe the architecture, they aren’t the best.
First, typos can happen (“whoops, the config says to use 200 filters but the filename says 300”) and the format is hard to get right (“Now I’m changing the architecture to use different filters for each of 5 layers, but the current filename format only allows one value”).
For this reason, the source of truth is the
config.json file that’s saved. When I go to analyze the results, I try to use the contents of the config instead of the names.
I have code that can dig through
config.json and extract some information to populate a Pandas DataFrame.
I also generated LaTeX tables from the DataFrame.
One idea I took from Lematus/Nematus was to save the model’s predictions from the validation dataset. This meant I could slice up the predictions without needing the model. For example, I was investigating seq2seq architectures, so I manually found how to convert the input into the target sequence, and then compared it to how different models actually tried to modify it. One of them is shown below:
Commentary and future work
Overall, this was a useful project for managing models.
Hundreds of models
I’m counting over 400 models that I used. Most of them only took around an hour to train (yay ablation studies!), the GPU cluster allows me to run 10 jobs at a time, and I deleted most models’ parameters once I got the prediction and accuracy.
Do I really need a config file?
I think I might be able to get the same benefits of having a configuration file by saving the configuration used with the model (which I do already) and having tools to pretty-print the configuration used or tools to train a new model with a modification (different dataset, different architecture).
Improving code for parsing
I would also try to move more of the code that parsed the output files into
Automatic hyperparameter selection
It would be cool to automatically select hyperparameters by iterating/randomizing/evolving to find the hyperparameters with the highest accuracy.
Datasets have per-dataset requirements (downloading, how training/dev/test sets are divided, how inputs and targets delimited or if they are in separate files), and per-framework requirements (how the framework needs the dataset to be formatted). I already had the datasets downloaded and formatted in the same way, so I only needed to worry about per-framework requirements. For example, Nematus requires a dictionary of the dataset’s vocabulary (
nematus.py can generate. Fairseq required the data to be in a different format, so
fairseq.py can format the data in that way.