Data Science, ML and Analytics Engineering

How to Become a Machine Learning Engineer

First, let’s define the difference between a Machine Learning Engineer and a Data Scientist. While a Data Scientist tends to work on modeling and focus on the intricacies of algorithms, a Machine Learning Engineer is more likely to deploy that model to a production environment, where it interacts with users, and to automate training, monitoring, and feature collection. In many companies, one specialist performs both roles.

In this article, we will look at what skills a Machine Learning Engineer needs.

Demand

In 2022, Machine Learning Engineer ranked among the top 10 highest-paid professions. Average salaries in the US range from $115,000 to $171,000 per year. Machine Learning Engineers working in Natural Language Processing became the most sought-after specialists, with an average salary of $160,227. Source

Unfortunately, studies of data professions in the Kazakhstan and Russia markets do not separate Machine Learning Engineers from Data Scientists by salary level. Therefore, we will focus on the Data Scientist.

In Kazakhstan, the average salary is 682 thousand tenge per month. More details can be found in the study.

In Russia, demand for Data Scientists grew by 93%, while salaries grew by 11%. Salary ranges for Data Scientists, in thousand rubles: 20 – 200 for juniors, 60 – 300 for middles, and 100 – 700 for seniors and leads. Read more here.

Distribution of salaries by grades

What does a Machine Learning Engineer do?

A Machine Learning Engineer is much more likely to roll out a model to production. This skill is where machine learning engineers and data scientists differ the most.

It is tempting to think of a Machine Learning Engineer as simply an operations engineer for machine learning models. This is a misconception: a Machine Learning Engineer can focus not only on deploying and operating models, but also on running and optimizing the machine learning algorithms themselves.

Some companies prefer an all-round Data Scientist who can both work with machine learning algorithms and put these solutions into production. Still, many companies prefer to separate the two roles. It can be difficult for one person to do everything from start to finish, so having two people, one to build the model and one to deploy it, is often a more efficient approach.

What a Machine Learning Engineer Should Know

Machine Learning Engineering is an engineering specialization that combines computer science and data science. It requires knowledge of computer science fundamentals, algorithms, and data structures, as well as the ability to write maintainable, production-quality code covered by tests. Sometimes a Machine Learning Engineer also needs to know specific areas of data science, such as Computer Vision or Natural Language Processing, in order to build effective systems. The job also requires many tools, such as Docker, Flask, MLFlow, Airflow, and FastAPI, to name just a few.

Separately, I will single out cloud platforms: AWS, Google Cloud, Azure. Cloud services are now extremely popular and more and more companies, projects and startups are working in the clouds.

Algorithms and data structures

Algorithms and data structures play a critical role in the development and optimization of machine learning systems. They are used to process and analyze large amounts of data, design ML systems and develop complex APIs.

Some of the main reasons why algorithms and data structures are important to machine learning include:

  • Efficiency: Choosing the right algorithm and data structure can greatly increase the efficiency of a machine learning model. For example, using fast sorting algorithms can significantly reduce model training time.
  • Data complexity: Some machine learning problems can be very complex and require special algorithms and data structures. For example, text analysis can use natural language processing algorithms that are designed to work with unstructured data.
  • Scaling: As data size grows, more efficient algorithms and data structures may need to be used. Some machine learning algorithms can be very resource intensive and do not scale well with data growth. Choosing the right data structure can help ensure model scalability.
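As a small illustration of the efficiency point above, here is a minimal sketch comparing membership lookups in a Python list and a set. The data and the `count_hits` helper are made up for this example; the actual timings depend on the machine.

```python
import timeit

# Hypothetical data: 200,000 user IDs and a few IDs to look up.
user_ids_list = list(range(200_000))
user_ids_set = set(user_ids_list)
queries = [199_999, 100_000, -1]


def count_hits(container, items):
    """Count how many items are present in the container."""
    return sum(1 for item in items if item in container)


# Both containers give the same answer...
hits_list = count_hits(user_ids_list, queries)
hits_set = count_hits(user_ids_set, queries)

# ...but the list is scanned element by element (O(n) per lookup),
# while the set uses hashing (O(1) on average per lookup).
list_time = timeit.timeit(lambda: count_hits(user_ids_list, queries), number=5)
set_time = timeit.timeit(lambda: count_hits(user_ids_set, queries), number=5)
print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```

The same idea applies to feature lookups and deduplication steps in data preparation, where the container choice can dominate the running time.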

Programming in Python and C++

Python and C++ are the programming languages most commonly used by Machine Learning Engineers, as both studies confirm:

Machine Learning Engineer Skills
Top 25 Machine Learning Engineer Skills

Python and machine learning libraries such as scikit-learn, PyTorch, and others allow developers to quickly build and train machine learning models. C++ can be used to write more optimized code and speed up machine learning models.

C++ is a more complex programming language than Python, but it provides significantly better performance. This makes it a good choice for applications where performance is critical.

Let’s take a closer look at the benefits of C++:

Performance: C++ is compiled to native machine code, which typically makes it faster than interpreted languages such as Python.

Multithreading: C++ has a wide range of libraries that can be used to create parallel applications and perform computations on multiple threads.

Low-level programming: C++ allows you to have finer control over your computer’s hardware resources than high-level languages like Python. This allows the code to be better optimized for working with memory and other hardware resources.

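Reliability: C++ is a statically typed language, so many type errors are caught at compile time rather than at run time. Static typing alone does not prevent pointer misuse or out-of-bounds array access; those still require care, for example by using smart pointers and standard containers.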

Integration: C++ can be used to write libraries that can be used from other programming languages, including Python. This allows you to use the performance and threading advantages of C++ in machine learning systems written in other programming languages.

Tools

In addition to the necessary basic work skills, experience with data and software engineering tools is also required. For example: Git, Docker, Flask, MLFlow, Airflow, FastAPI.

Let’s take a look at some of the most popular tools:

Docker: Docker is a platform for developing, delivering and running applications in containers. It can be used to create isolated environments for running machine learning models, making them easy to deploy and scale.

Git and GitHub: Git is a version control system that allows you to track changes in your code. GitHub is a hosting service for Git repositories. They are both widely used for developing and collaborating on machine learning projects.

FastAPI is a modern Python web framework for building fast and efficient APIs. It relies on modern Python features such as type hints and asynchronous (async/await) request handling to create high-performance web services.

Airflow is a workflow management tool that can be used to schedule, monitor, and execute tasks in machine learning systems. Airflow allows you to create and run machine learning pipelines that can consist of any number of tasks. It can be used to automate processes such as model training, testing, performance evaluation, and model deployment.
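To make the idea of a pipeline of dependent tasks concrete, here is a minimal sketch using only the Python standard library. It is not Airflow's API: the task names and the `run_pipeline` helper are hypothetical, and the sketch only illustrates how an orchestrator resolves the order in which tasks must run.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "prepare_data": set(),
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}


def run_pipeline(dependencies, actions):
    """Run each task's action only after all of its upstream tasks have run."""
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()
        executed.append(task)
    return executed


# Placeholder actions; in a real pipeline these would train, test, and deploy.
actions = {name: (lambda name=name: print(f"running {name}")) for name in pipeline}
order = run_pipeline(pipeline, actions)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency resolution.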

Cloud platforms

Cloud services are now extremely popular, and more and more companies, projects, and startups work in the cloud. Among Machine Learning skills, cloud experience is more often associated with higher salaries:

Cloud platforms

Cloud services are very useful for machine learning engineers, as they provide a ready-made infrastructure and tools for training models and deploying machine learning applications.

Some of the more popular cloud services for machine learning include:

Amazon Web Services (AWS): AWS provides many services and tools useful for machine learning, including Amazon SageMaker, which provides tools for training, optimizing, and deploying machine learning models. AWS also offers many other services, such as data storage, computing resources, and databases.

Microsoft Azure: The core ML product is Azure Machine Learning, which helps with training models and building machine learning applications. Azure also offers databases, data storage, and other services.

Google Cloud Platform (GCP): The core ML product is Google Cloud AI Platform, a tool for training, optimizing, and deploying machine learning models.

All these and other cloud services let you work with large amounts of data and run experiments with various models. They also provide high fault tolerance, scalability, and security, which is important when working with sensitive data.

System and ML design

You need to understand how to define business requirements, for example load, latency, and storage. You also need to be able to design a project's architecture in terms of databases, real-time systems, load balancers, microservices, Kafka, and so on.

In addition to system design, special attention should be paid to ML design. This covers topics such as building recommender systems, search, fraud detection, prediction, and the like.

For each category, learn the algorithmic framework, how to serve real-time predictions, how to compute features in real time, and how to train models with an orchestrator such as Airflow.

Conclusion

The role of Machine Learning Engineer requires knowledge of not only programming languages, but also data and software engineering tools and libraries, for example, Git, Docker, Flask, MLFlow, Airflow, FastAPI. Experience with cloud services is also required, especially Amazon Web Services, Microsoft Azure, and Google Cloud Platform. It is important to be able to work out the architecture of projects in terms of a database, a real-time system, a load balancer, microservices, Kafka, etc.

Model deployment with Google Cloud Functions

In this note, I will show how to deploy a model for free, up to a certain level of usage, without writing a microservice. Such a solution is easy to integrate into, for example, a web service. All you need is Google Cloud Functions.

Google Cloud Functions follows the serverless approach: you get server-side compute without renting or purchasing hardware. The provider manages the infrastructure resources, and configures and maintains them.

The main advantages of Google Cloud Functions are automatic scalability, high availability, and fault tolerance.
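As a rough sketch of what such a function might look like: a Python HTTP Cloud Function receives a Flask request object and returns a response. The `predict` name, the JSON payload shape, and the fixed linear "model" below are assumptions for illustration only, not the code from the full post. `FakeRequest` is a local stand-in so the handler can be tried without deploying anything.

```python
import json

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.5, -0.25]
BIAS = 0.1


def predict(request):
    """HTTP entry point; expects a JSON body like {"features": [1.0, 2.0]}."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    headers = {"Content-Type": "application/json"}
    if not isinstance(features, list) or len(features) != len(WEIGHTS):
        body = {"error": f"expected 'features' with {len(WEIGHTS)} numbers"}
        return json.dumps(body), 400, headers
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return json.dumps({"score": score}), 200, headers


class FakeRequest:
    """Minimal local substitute for the Flask request object (testing only)."""

    def __init__(self, payload):
        self._payload = payload

    def get_json(self, silent=False):
        return self._payload


body, status, _ = predict(FakeRequest({"features": [1.0, 2.0]}))
print(status, body)
```

In a real deployment the weights would be replaced by a model loaded once at import time, so that it is reused across invocations of the same instance.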


Calculating Monthly Recurring Revenue (MRR) in Python

What is Monthly Recurring Revenue?

Monthly Recurring Revenue (MRR) is regular monthly income. This metric is used primarily in subscription models. Revenue from longer plans must be normalized to a monthly amount; for example, an annual payment is spread over 12 months.

Why is it valuable?

With a subscription service, payments are regular or periodic, so MRR shows how much money we can expect to earn and how effective the business is. We can then increase MRR by moving customers to a more expensive plan or by reducing customer churn.

The problem

For this task, use the new dataset: https://alimbekov.com/wp-content/uploads/2021/03/mrr.csv

Structure:

  • customer_id – already familiar customer ID
  • first_order – Subscription start date
  • EndDate – Subscription end date
  • rate – subscription plan (monthly, semi-annual, annual)
  • Amount – amount paid
  • commission – payment system commission

We will use the following formula to calculate MRR: MRR = new + old + expansion + reactivation – churn – contraction

  1. new MRR – the first payment of a new client
  2. old MRR – recurring customer payment
  3. expansion MRR – increase in MRR due to the new tariff
  4. contraction MRR – decrease in MRR due to the new tariff
  5. churn MRR – MRR lost due to termination of payment
  6. reactivation MRR – return of a client who had an outflow of MRR
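The formula and the monthly normalization can be sketched in a few lines of Python. The plan names in `PLAN_MONTHS` are assumed from the description of the rate column and may not match the values in the CSV, and the component values in the example are made up.

```python
# Assumed plan lengths in months; the actual `rate` values may be spelled differently.
PLAN_MONTHS = {"monthly": 1, "semi-annual": 6, "annual": 12}


def monthly_amount(amount, rate):
    """Spread a payment evenly over the months covered by its plan."""
    return amount / PLAN_MONTHS[rate]


def total_mrr(new, old, expansion, reactivation, churn, contraction):
    """MRR = new + old + expansion + reactivation - churn - contraction."""
    return new + old + expansion + reactivation - churn - contraction


# Made-up component values for one month, for illustration only.
example_mrr = total_mrr(
    new=300, old=1000, expansion=50, reactivation=20, churn=120, contraction=30
)
print(monthly_amount(1200, "annual"))  # 100.0
print(example_mrr)  # 1220
```

The harder part of the task, assigning each customer's monthly amount to one of the six components, depends on comparing that customer's amounts in consecutive months and is not shown here.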


Cohort Analysis in Python

Cohort Analysis

What is cohort analysis?

Cohort analysis is the study of the characteristics of cohorts (also called vintages or generations): groups united by a common time-based characteristic.

A cohort/vintage/generation is a group formed in a specific way based on time: for example, the month of registration, the month of the first transaction, or the first visit to the site. Cohorts are very similar to segments, with the difference that a cohort is defined by a time period, while a segment can be based on any other characteristic.
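As a minimal sketch of this definition, the code below assigns each customer to a cohort by the month of their first transaction and counts how many customers from each cohort are active in later months. The transactions and helper names are made up for illustration; a real analysis would more likely use pandas.

```python
from collections import defaultdict
from datetime import date

# Made-up transactions: (customer_id, payment date).
transactions = [
    ("a", date(2021, 1, 5)),
    ("a", date(2021, 2, 7)),
    ("b", date(2021, 1, 20)),
    ("c", date(2021, 2, 3)),
    ("c", date(2021, 4, 1)),
]


def month_index(d):
    """Convert a date to a running month number so months can be subtracted."""
    return d.year * 12 + d.month - 1


def cohort_table(rows):
    """Map (cohort month, months since first payment) to a count of active customers."""
    first_month = {}
    for customer, day in rows:
        m = month_index(day)
        first_month[customer] = min(first_month.get(customer, m), m)

    active = defaultdict(set)
    for customer, day in rows:
        cohort = first_month[customer]
        offset = month_index(day) - cohort
        label = f"{cohort // 12}-{cohort % 12 + 1:02d}"
        active[(label, offset)].add(customer)
    return {key: len(customers) for key, customers in active.items()}


table = cohort_table(transactions)
for (cohort, offset), count in sorted(table.items()):
    print(cohort, offset, count)
```

Dividing each count by the cohort's size at offset 0 turns this table into retention rates.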

Why is it valuable?

This kind of analysis can be helpful when it comes to understanding the health of your business and the stickiness of your customers. Stickiness is critical, as it is much cheaper and easier to retain a customer than it is to acquire new ones. Also, your product evolves over time. New features are added and removed, design changes, etc. Observing individual groups over time is the starting point for understanding how these changes affect user/group behavior.


RFM analysis in Python

The problem

Perform an RFM analysis. It divides users into segments based on how recently they paid (Recency), how often they paid (Frequency), and the total amount of their payments (Monetary).

  • Recency – the difference between the current date and the date of the last payment
  • Frequency – number of transactions
  • Monetary – amount of purchases

These three indicators must be calculated separately for each customer. Then assign each indicator a score from 1 to 3 or from 1 to 5. The wider the range, the narrower the segments we get.

Scores can be assigned using quantiles: we sort the data by one of the criteria and divide it into equal-sized groups.
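The scoring step can be sketched as follows. This is a simplified rank-based version with made-up payments and a hypothetical `quantile_scores` helper; it splits customers into near-equal groups by rank, so customers with tied values may land in different groups.

```python
from datetime import date


def quantile_scores(values, n_groups=3, higher_is_better=True):
    """Score each value from 1 to n_groups by rank, in near-equal groups."""
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=not higher_is_better
    )
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * n_groups // len(values) + 1
    return scores


# Made-up payments per customer: (payment date, amount).
today = date(2021, 3, 1)
payments = {
    "a": [(date(2021, 2, 20), 50.0), (date(2021, 1, 10), 30.0), (date(2020, 12, 1), 20.0)],
    "b": [(date(2020, 11, 5), 200.0)],
    "c": [(date(2021, 1, 15), 40.0), (date(2020, 10, 1), 10.0)],
}

customers = sorted(payments)
recency = [(today - max(d for d, _ in payments[c])).days for c in customers]
frequency = [len(payments[c]) for c in customers]
monetary = [sum(amount for _, amount in payments[c]) for c in customers]

# A recent payment is good, so a lower recency gets a higher score.
r = quantile_scores(recency, higher_is_better=False)
f = quantile_scores(frequency)
m = quantile_scores(monetary)

rfm = {c: f"{r[i]}{f[i]}{m[i]}" for i, c in enumerate(customers)}
print(rfm)
```

Each customer ends up with a three-digit segment code, where the digits are the Recency, Frequency, and Monetary scores in that order.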

For this task, we use the public dataset https://www.kaggle.com/olistbr/brazilian-ecommerce and its olist_orders_dataset.csv and olist_order_payments_dataset.csv files. You can join them on order_id.
