In Data Science courses, homework and projects are done in Jupyter notebooks, and students are not taught to write pipelines. Working in Jupyter notebooks, for all its convenience, has drawbacks. Suppose you build several types of models with multiple options for filling gaps (mean, median), generate a set of engineered features, and try different ways of splitting the sample.
You can put all of this code into one notebook and log metrics and configs there, but the code will turn out cumbersome and slow, and to run experiments you will have to skip or comment out cells that should not be executed.
To solve these problems, I recommend using pipelines to automate machine learning workflows. The main purpose of a pipeline is control: a well-organized pipeline makes the implementation more flexible.
Follow the link to view the sample code for the pipeline. The structure is as follows:
- config — folder with configuration files in yaml format
- data — folder with data files
  - external — data from external sources
  - interim — intermediate data
  - raw — raw data
  - processed — data after processing
- experiments — configs and logs of model runs
- models — folder for storing trained models
- src — source code: working with data, evaluating models, helper functions, training models, and feature engineering
Working with the pipeline consists of the following steps:
- Creating configs
- Preparing features
- Splitting into samples
- Training the model
- Measuring model quality
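The steps above can be sketched as a minimal script in which each stage is an isolated function driven by the config. All names and the toy logic here are hypothetical, a sketch of the shape of such a pipeline rather than a real training loop:

```python
import json

def load_config(text):
    # Step 1: creating configs (parsed here from a JSON string;
    # in a real project this would be a yaml file in config/)
    return json.loads(text)

def build_features(raw, cfg):
    # Step 2: preparing features - fill gaps with the configured strategy
    values = [v for v in raw if v is not None]
    if cfg["fillna"] == "mean":
        fill = sum(values) / len(values)
    else:  # "median"
        fill = sorted(values)[len(values) // 2]
    return [fill if v is None else v for v in raw]

def split(rows, cfg):
    # Step 3: splitting into samples
    k = int(len(rows) * cfg["train_size"])
    return rows[:k], rows[k:]

def train(train_rows, cfg):
    # Step 4: training - a constant-mean baseline stands in for a real model
    return sum(train_rows) / len(train_rows)

def evaluate(model, test_rows):
    # Step 5: measuring quality - mean absolute error of the baseline
    return sum(abs(model - v) for v in test_rows) / len(test_rows)

cfg = load_config('{"fillna": "mean", "train_size": 0.75}')
features = build_features([1.0, None, 3.0, 5.0], cfg)
train_rows, test_rows = split(features, cfg)
model = train(train_rows, cfg)
print(evaluate(model, test_rows))
```

Because each stage takes the config as an argument, swapping mean for median imputation or changing the split ratio means editing the config, not the code.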
Cookiecutter Data Science
Let’s now try to look at different pipeline development practices.
The first thing I would recommend is to try the Cookiecutter Data Science project structure.
This structure is quite logical, standardized, and flexible. All you need is to install it and start the project:
pip install cookiecutter
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
The directory structure of your new project looks like this:
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
├── docs <- A default Sphinx project; see sphinx-doc.org for details
├── models <- Trained and serialized models, model predictions, or model summaries
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
├── references <- Data dictionaries, manuals, and all other explanatory materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
For your own projects, you can slightly adapt the structure: for example, I don’t need the src/features, reports, and references folders in my Computer Vision projects.
More details can be found in the Cookiecutter Data Science documentation.
In this part of the article, we will talk about a library for managing configurations in machine learning projects — Hydra.
What exactly is the problem, and why did I start using Hydra? When running Python scripts, the number of command-line arguments keeps growing, even though many of them could be grouped. Here is an abridged example of such a script:
import argparse

# in the full script, model_names is derived from the available architectures
model_names = ['resnet18', 'resnet34', 'resnet50']

parser = argparse.ArgumentParser(description='Training script')
parser.add_argument('data', metavar='DIR',
                    help='path to dataset')
parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet18',
                    help='model architecture: ' +
                         ' | '.join(model_names) +
                         ' (default: resnet18)')
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--epochs', default=90, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int, metavar='N',
                    help='mini-batch size (default: 256), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.1, type=float,
                    metavar='LR', help='initial learning rate', dest='lr')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                    help='momentum')
parser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,
                    metavar='W', help='weight decay (default: 1e-4)',
                    dest='weight_decay')
Switching to configuration files is a common way to control this growing complexity. Configuration files can be hierarchical and can reduce the complexity of the code that defines command-line arguments. But they have drawbacks of their own:
- While experimenting, you may need to run the application with different configuration options. At first you can simply edit the configuration file before each run, but you will soon find it hard to track which changes belong to which run.
- Configuration files become monolithic. If, for example, you want your code to use different parameters for the ImageNet dataset and for the CIFAR-10 dataset, you have two options: maintain two configuration files, or put both sets of parameters in a single file and somehow use only the parts you need at runtime.
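To make the second drawback concrete, here is a sketch of such a monolithic config; the file name and parameter names are hypothetical:

```yaml
# config.yaml - one file carries both datasets' settings,
# and the code has to pick the right block at runtime
dataset:
  imagenet:
    path: /datasets/imagenet
    num_classes: 1000
  cifar10:
    path: /datasets/cifar10
    num_classes: 10
```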
Hydra is the solution to all of the inconveniences above. Hydra lets you compose configurations: the composition can come from configuration files and from the command line, and everything in the composed configuration can also be overridden on the command line.
# conf/dataset/cifar10.yaml holds the dataset parameters;
# conf/config.yaml selects it via the defaults list:
defaults:
  - dataset: cifar10
# app.py
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    my_app()
The default dataset (cifar10) is used at startup. But you can also override parameters from the command line:
python app.py dataset.path=/datasets/cifar10
Another cool feature is multirun: Hydra’s ability to run your function multiple times, creating a new configuration object for each run. This is very convenient for sweeping parameters without writing extra code. For example, we can run all 4 combinations (2 datasets × 2 optimizers):
python app.py --multirun dataset=imagenet,cifar10 optimizer=adam,nesterov
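Conceptually, multirun enumerates the Cartesian product of the override values. The same 2 × 2 sweep can be sketched in plain Python, with a hypothetical `run_experiment` stub standing in for one training job:

```python
from itertools import product

datasets = ["imagenet", "cifar10"]
optimizers = ["adam", "nesterov"]

def run_experiment(dataset: str, optimizer: str) -> str:
    # stand-in for launching one training job with this config
    return f"dataset={dataset} optimizer={optimizer}"

# one job per (dataset, optimizer) pair - 4 jobs in total
jobs = [run_experiment(d, o) for d, o in product(datasets, optimizers)]
for job in jobs:
    print(job)
```

Hydra does this enumeration for you and gives each job its own composed config and working directory.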
To learn more about Hydra, I suggest reading the official documentation and tutorials.
In this short note, I have tried to describe an approach to pipeline development, the configuration problems you encounter when writing pipelines, and some of the features that Hydra offers.