Data Science, ML and Analytics Engineering

Machine learning pipeline-basics. Cookiecutter and Hydra

In Data Science courses, homework and projects are done in Jupyter Notebooks and students are not taught to write pipelines. The fact is that working in Jupyter Notebooks, despite its convenience, also has disadvantages. For example, you build several types of models with multiple options for filling in gaps (mean, median), generate a feature engineering set, and apply different options for splitting the sample.

You can put all this code in one Jupyter Notebooks and log metrics and configs. The code will turn out to be cumbersome and slow. To run experiments, you will need to either jump over or comment on cells that don’t need to be run.

To solve these problems, I recommend using pipeline to automate machine learning workflows. The main purpose of creating a pipeline is control. A well-organized pipeline makes implementation more flexible.

Read more