Data Science - Renat Alimbekov's personal blog

Deep dive into LLM Part Two

In the first part we covered the practical side of the deep dive into LLMs.

In this part we will talk about key papers that will help in understanding LLMs and passing interviews =) But more on that later.

It all starts with the first GPT

Then I recommend reading the InstructGPT paper, which discusses training with human feedback.

Then there are a couple of interesting papers:
– SELF-INSTRUCT
– Information Retrieval with Contrastive Learning

Then I recommend that you familiarize yourself with two truly iconic papers, LoRA and QLoRA, which address the following problems:
– training speed
– computing resources
– memory efficiency
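
To make the savings concrete, here is a minimal sketch of the parameter arithmetic behind LoRA: instead of updating a full d x k weight matrix, it trains two low-rank factors, B (d x r) and A (r x k), while the original weights stay frozen. The layer size and rank below are illustrative assumptions, not values taken from the papers.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update: B (d x r) plus A (r x k)."""
    return r * (d + k)


# Hypothetical layer size and rank, chosen only for illustration.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={r}): {lora:,} trainable parameters")
print(f"share of full: {lora / full:.4%}")
```

QLoRA goes one step further and keeps the frozen base weights in 4-bit precision, which is where most of the additional memory saving comes from.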

Two more equally important papers are PPO and DPO. Understanding these works will help with reward modeling.
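
As a small illustration of the DPO idea, here is a sketch of the loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected answers under the trained policy and under a frozen reference model; the beta value and the numbers in the example are made up.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each answer
    # compared with the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy prefers the chosen answer more strongly than the reference does.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss goes down as the policy raises the chosen answer and lowers the rejected one relative to the reference, without training a separate reward model as the PPO-based setup does.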

And finally:
– Switch Transformers – as the foundation for Mixture of Experts
– Mixtral of Experts – as Open Source SOTA
– Llama 2
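
To get a feel for the Mixture of Experts idea before reading, here is a toy sketch of top-1 routing in the spirit of Switch Transformers: a router scores the experts, the token goes to the single most probable one, and the output is scaled by that probability. The token is a plain number and the router logits are passed in by hand; in a real model the logits come from a learned layer, so treat every name here as a stand-in.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token: float, router_logits: Sequence[float],
                 experts: Sequence[Callable[[float], float]]) -> Tuple[int, float]:
    """Send the token to the single most probable expert (top-1 routing)."""
    probs = softmax(router_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # The expert output is scaled by its gate probability.
    return best, probs[best] * experts[best](token)


# Toy experts: plain functions standing in for feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
expert_index, output = switch_route(3.0, [0.1, 2.0, -1.0], experts)
print(expert_index, round(output, 4))
```

Mixtral uses the same principle but routes each token to two experts instead of one.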

Happy reading everyone

Deep dive into LLM Part One

I’ve started delving deeper into LLMs, and personally, I find it much easier to dive in through practice.

This way, one can grasp all the key concepts and outline a list of papers for further exploration.

I began with the StackLLaMA note: A hands-on guide to train LLaMA with RLHF

Here, you can immediately familiarize yourself with the concepts of Reinforcement Learning from Human Feedback, effective training with LoRA, PPO.

You’ll also get acquainted with the Hugging Face library zoo: accelerate, bitsandbytes, peft, and trl.

The note uses the StackExchange dataset, but for variety, I can recommend using the Anthropic/hh-rlhf dataset

In the second part, we’ll go through key papers.

Machine learning pipeline-basics. Cookiecutter and Hydra

In Data Science courses, homework and projects are done in Jupyter Notebooks and students are not taught to write pipelines. The fact is that working in Jupyter Notebooks, despite its convenience, also has disadvantages. For example, you build several types of models with multiple options for filling in gaps (mean, median), generate a feature engineering set, and apply different options for splitting the sample.

You can put all this code in one Jupyter Notebook and log metrics and configs, but the code will turn out cumbersome and slow. To run experiments, you will need to either skip or comment out the cells that don’t need to be run.

To solve these problems, I recommend using a pipeline to automate machine learning workflows. The main purpose of creating a pipeline is control. A well-organized pipeline makes implementation more flexible.
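
Here is a minimal sketch of what "control" can look like: every experiment is described by a config object, and the same functions run for each combination of options instead of commenting cells in and out. It uses only the standard library as a stand-in; with Hydra the configs would typically live in YAML files. The class, function, and field names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean, median
from typing import Callable, Dict, List, Optional

IMPUTERS: Dict[str, Callable[[List[float]], float]] = {"mean": mean, "median": median}


@dataclass(frozen=True)
class Config:
    imputer: str       # how to fill gaps: "mean" or "median"
    test_share: float  # fraction of rows held out for testing


def fill_gaps(column: List[Optional[float]], strategy: str) -> List[float]:
    """Replace None values with the mean or median of the known values."""
    known = [v for v in column if v is not None]
    fill = IMPUTERS[strategy](known)
    return [fill if v is None else v for v in column]


def split(rows: List[float], test_share: float):
    """Hold out the last test_share of rows (no shuffling, for brevity)."""
    cut = len(rows) - int(round(len(rows) * test_share))
    return rows[:cut], rows[cut:]


def run(config: Config, column: List[Optional[float]]) -> dict:
    """One experiment: impute, split, and report what was done."""
    train, test = split(fill_gaps(column, config.imputer), config.test_share)
    return {"config": config, "train": train, "test": test}


# Toy column with gaps; each experiment is defined by a Config, not by cells.
data = [1.0, None, 2.0, 10.0, None, 3.0]
results = [run(Config(i, s), data) for i, s in product(IMPUTERS, [0.2, 0.5])]
for r in results:
    print(r["config"], r["train"], r["test"])
```

Adding a new imputation option or split ratio then means adding one value to the grid, and every result carries the config that produced it.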

How to Become a Machine Learning Engineer

First, let’s define the difference between a Machine Learning Engineer and a Data Scientist. While a Data Scientist may work more on modeling and focus on the intricacies of algorithms, a Machine Learning Engineer is more likely to work on deploying the same model in a production environment that will interact with users or automate learning, monitoring, feature collection. Very often in companies, these two duties are performed by one specialist.

In this article, we will look at what skills a Machine Learning Engineer needs.

Demand

In 2022, Machine Learning Engineer was among the top 10 highest-paid professions. Salaries in the US range from $115,000 to $171,000 per year on average. Machine Learning Engineers in the field of Natural Language Processing became the most sought-after specialists; their average salary is $160,227. Source

In studies of data professions in the market of Kazakhstan and Russia, unfortunately, Machine Learning Engineer and Data Scientist specialists are not separated by salary levels. Therefore, we will focus on the Data Scientist.

In Kazakhstan, the average salary is 682 thousand tenge per month. More details can be found in the study.

In Russia, demand for Data Scientists grew by 93%, while salaries grew by 11%. Salary ranges for Data Scientists: 20 – 200 thousand rubles for juniors, 60 – 300 for middles, and 100 – 700 for seniors and leads. Read more here.

What does a Machine Learning Engineer do?

A Machine Learning Engineer is much more likely to roll out a model to production. This skill is where machine learning engineers and data scientists differ the most.

One might think that a Machine Learning Engineer is simply an engineer who operates machine learning models. This is a misconception: a Machine Learning Engineer can focus not only on deploying and operating models, but also on running and optimizing machine learning algorithms.

Some companies prefer an all-round Data Scientist who can both work with machine learning algorithms and put these solutions into production, but many others prefer to separate the two roles. It can be difficult for one person to do everything from start to finish, so having two people, one to build the model and one to deploy it, is often a more efficient approach.

What a Machine Learning Engineer Should Know

Machine Learning Engineer is an engineering specialization that combines computer science and data science. It requires knowledge of computer science fundamentals, algorithms, and data structures, as well as the ability to write maintainable, production-quality code covered with tests. At the same time, a Machine Learning Engineer sometimes needs to know specific areas of data science, such as Computer Vision or Natural Language Processing, in order to build effective systems. Many tools are also required for the job: Docker, Flask, MLFlow, Airflow, and FastAPI, just to name a few.

Separately, I will single out cloud platforms: AWS, Google Cloud, Azure. Cloud services are now extremely popular and more and more companies, projects and startups are working in the clouds.

Algorithms and data structures

Algorithms and data structures play a critical role in the development and optimization of machine learning systems. They are used to process and analyze large amounts of data, design ML systems and develop complex APIs.

Some of the main reasons why algorithms and data structures are important to machine learning include:

Efficiency: Choosing the right algorithm and data structure can greatly increase the efficiency of a machine learning model. For example, using fast sorting algorithms can significantly reduce model training time.
Data complexity: Some machine learning problems can be very complex and require special algorithms and data structures. For example, text analysis can use natural language processing algorithms that are designed to work with unstructured data.
Scaling: As data size grows, more efficient algorithms and data structures may need to be used. Some machine learning algorithms can be very resource intensive and do not scale well with data growth. Choosing the right data structure can help ensure model scalability.
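
A tiny illustration of the efficiency point, assuming a hypothetical stop-word filter in a text preprocessing step: both functions return the same result, but the list version scans the whole list for every token, while the set version does a hash lookup.

```python
def filter_with_list(tokens, stop_words):
    """Each lookup scans the list: O(len(stop_words)) per token."""
    stop_list = list(stop_words)
    return [t for t in tokens if t not in stop_list]


def filter_with_set(tokens, stop_words):
    """Each lookup is a hash probe: O(1) on average per token."""
    stop_set = set(stop_words)
    return [t for t in tokens if t not in stop_set]


tokens = "the model learns the structure of the data".split()
stop_words = ["the", "of", "a", "an"]
print(filter_with_set(tokens, stop_words))
```

With four stop words the difference is invisible, but with a vocabulary of hundreds of thousands of entries and millions of tokens the choice of data structure starts to dominate preprocessing time.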

Recommended courses:

Programming in Python and C++

Python and C++ are the programming languages that Machine Learning Engineers use most often, as both studies confirm:

Python and machine learning libraries such as scikit-learn, PyTorch, and others allow developers to quickly build and train machine learning models. C++ can be used to write more optimized code and speed up machine learning models.

C++ is a more complex programming language than Python, but it provides significantly better performance. This makes it a good choice for applications where performance is critical.

Let’s take a closer look at the benefits of C++:

Performance: C++ is a compiled language, which improves performance over interpreted languages such as Python.

Multithreading: C++ has a wide range of libraries that can be used to create parallel applications and perform computations on multiple threads.

Low-level programming: C++ allows you to have finer control over your computer’s hardware resources than high-level languages like Python. This allows the code to be better optimized for working with memory and other hardware resources.

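Reliability: C++ is a statically typed language, so the compiler catches many type errors before the program ever runs. Static typing alone does not protect against pointer misuse or out-of-bounds array access, so these still require care, for example by using smart pointers and standard containers.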

Integration: C++ can be used to write libraries that can be used from other programming languages, including Python. This allows you to use the performance and threading advantages of C++ in machine learning systems written in other programming languages.

Recommended courses:

Tools

In addition to the necessary basic work skills, experience with data and software engineering tools is also required. For example: Git, Docker, Flask, MLFlow, Airflow, FastAPI.

Let’s take a look at some of the most popular tools:

Docker: Docker is a platform for developing, delivering and running applications in containers. It can be used to create isolated environments for running machine learning models, making them easy to deploy and scale.

Git and GitHub: Git is a version control system that allows you to track changes in your code. GitHub is a hosting service for Git repositories. They are both widely used for developing and collaborating on machine learning projects.

FastAPI is a modern Python web framework for building fast and efficient APIs. It relies on modern Python features such as type hints and asynchronous execution to create high-performance web services.

Airflow is a workflow management tool that can be used to schedule, monitor, and execute tasks in machine learning systems. Airflow allows you to create and run machine learning pipelines that can consist of any number of tasks. It can be used to automate processes such as model training, testing, performance evaluation, and model deployment.
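
Airflow describes such a pipeline as a directed acyclic graph (DAG) of tasks. The sketch below is not Airflow's API; it uses the standard library's graphlib only to show the core idea of declaring dependencies and letting the tool work out the execution order. The task names follow the steps listed above, and prepare_data is an added assumption.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
pipeline = {
    "train_model": {"prepare_data"},
    "test_model": {"train_model"},
    "evaluate_performance": {"test_model"},
    "deploy_model": {"evaluate_performance"},
}


def run_task(name: str) -> str:
    """Stand-in for real work; a scheduler would execute an operator here."""
    return f"finished {name}"


# Resolve the dependencies into a valid execution order and run the tasks.
order = list(TopologicalSorter(pipeline).static_order())
log = [run_task(task) for task in order]
print(order)
```

A real orchestrator adds what this toy lacks: scheduling, retries, monitoring, and running independent tasks in parallel.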

Recommended courses:

Cloud platforms

Cloud services are now extremely popular, and more and more companies, projects and startups are working in the cloud. Among Machine Learning skills, cloud experience is often a plus for salary growth:

Cloud services are very useful for machine learning engineers, as they provide a ready-made infrastructure and tools for training models and deploying machine learning applications.

Some of the more popular machine learning cloud services include:

Amazon Web Services (AWS): AWS provides many services and tools that can be useful for machine learning, including Amazon SageMaker, which provides tools for training, optimizing, and deploying machine learning models. In addition, AWS has many other services such as data storage, computing resources, databases, and so on.

Microsoft Azure: The core ML product is Azure Machine Learning, which helps with model training and building machine learning applications. Azure also has databases, data storage, and other services.

Google Cloud Platform (GCP): Google Cloud AI Platform is a tool for training, optimizing, and deploying machine learning models.

All these and other cloud services allow you to work with large amounts of data and conduct experiments with various models. They also provide high fault tolerance, scalability and security, which is important when working with sensitive data.

System and ML design

You need to understand how to define business requirements, for example load, latency, and storage. You should also be able to work out a project’s architecture in terms of databases, real-time systems, load balancers, microservices, Kafka, etc.

In addition to system design, special attention should be paid to the use of ML. This section covers examples of topics such as recommender building, search, fraud detection, prediction, and the like.

For each category, learn the algorithmic framework, how to produce real-time predictions, how to do feature engineering in real time, and how to train models with an orchestrator like Airflow.

Recommended courses:

Conclusion

The role of Machine Learning Engineer requires knowledge of not only programming languages, but also data and software engineering tools and libraries, for example, Git, Docker, Flask, MLFlow, Airflow, FastAPI. Experience with cloud services is also required, especially Amazon Web Services, Microsoft Azure, and Google Cloud Platform. It is important to be able to work out the architecture of projects in terms of a database, a real-time system, a load balancer, microservices, Kafka, etc.

Machine Learning Models in production: Flask and REST API

A trained machine learning model alone will not add value for the business. The model must be integrated into the company’s IT infrastructure. Let’s develop a REST API microservice to classify Iris flowers. The dataset consists of the length and width of two parts of the Iris flower: sepal and petal. The target variable is the Iris variety: 0 – Setosa, 1 – Versicolor, 2 – Virginica.
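
Before any framework comes into play, it helps to picture the contract of such a service: four measurements in, a class index and variety name out. Below is a framework-agnostic sketch of that contract; a Flask route would simply wrap a function like this. The JSON field names, the handler name, and the toy threshold model are all assumptions for illustration.

```python
import json

VARIETIES = {0: "Setosa", 1: "Versicolor", 2: "Virginica"}
FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")


def handle_predict(body: str, predict_fn) -> tuple:
    """Turn a JSON request body into a (status code, JSON response) pair."""
    try:
        payload = json.loads(body)
        row = [float(payload[name]) for name in FEATURES]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": f"expected numeric fields {list(FEATURES)}"})
    label = int(predict_fn(row))
    return 200, json.dumps({"class": label, "variety": VARIETIES[label]})


# Hypothetical stand-in for a trained model: a single threshold on petal length.
def toy_predict(row):
    return 0 if row[2] < 2.5 else 1


status, response = handle_predict(
    '{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}',
    toy_predict,
)
print(status, response)
```

Keeping validation and prediction in a plain function like this also makes the service easy to test without starting a web server.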

Saving and loading a model

Before moving on to developing the API, we need to train and save the model. We take the RandomForestClassifier model. Now let’s save the model to a file and load it to make predictions. This can be done with pickle or joblib.

import pickle
filename = 'model.pkl'
with open(filename, 'wb') as f:  # clf is the trained classifier
    pickle.dump(clf, f)

We’ll use pickle.load to load and validate the model.

with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)  # mean accuracy on the test set
print(result)
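
The snippets above rely on clf, X_test and y_test from the training code. If you want to try the save and load round trip on its own, here is a self-contained sketch with a stand-in "model": a dictionary of made-up class centroids rather than a fitted RandomForestClassifier. Only the pickle calls mirror the original; everything else is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in "model": per-class centroids of (petal length, petal width).
# The values are made up for illustration, not fitted on the Iris data.
model = {0: (1.5, 0.3), 1: (4.3, 1.3), 2: (5.6, 2.0)}


def predict(centroids, sample):
    """Return the class whose centroid is nearest to the sample."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], sample))
    return min(centroids, key=sq_dist)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:  # save
        pickle.dump(model, f)
    with open(path, "rb") as f:  # load
        loaded_model = pickle.load(f)

# The restored object should behave exactly like the original.
sample = (4.5, 1.4)
print(predict(model, sample), predict(loaded_model, sample))
```

One thing to keep in mind for production: unpickling can execute arbitrary code, so only load model files that come from a source you trust.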

The code for training, saving and loading the model is available in the repository — link