Data Science, ML and Analytics Engineering

Deep dive into LLM Part Two

In the first part, we covered the practical side of a deep dive into LLMs.

In this part, we will talk about the key papers that will help you understand LLMs and pass interviews =) But more on that later.

It all starts with the first GPT paper.

Next, I recommend reading the InstructGPT paper, which covers training with human feedback (RLHF).

Then there are a couple of interesting papers:
SELF-INSTRUCT
Information Retrieval with Contrastive Learning

Then I recommend familiarizing yourself with two truly iconic papers, LoRA and QLoRA, which address the following problems (a short sketch follows the list):
– training speed
– compute requirements
– memory efficiency
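
A minimal sketch of what these papers buy you in practice, assuming the Hugging Face peft and bitsandbytes libraries: LoRA trains small low-rank adapters on top of frozen weights, and QLoRA additionally loads the frozen base model in 4-bit NF4 precision. The checkpoint name and hyperparameters below are illustrative placeholders, not values from the papers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA idea: keep the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # placeholder checkpoint
    quantization_config=bnb_config,
)

# LoRA idea: train small low-rank adapters instead of all the weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters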

Two more equally important papers are PPO and DPO. Understanding these works will help with reward modeling.
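
For intuition, here is a small sketch of the DPO loss as defined in the paper: the policy is nudged to prefer the chosen answer over the rejected one, relative to a frozen reference model. The variable names are mine, and the per-sequence log-probabilities are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over per-sequence log-probabilities."""
    # How much more likely the policy makes each answer vs. the reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push up the margin between chosen and rejected log-ratios, scaled by beta
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```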

And finally (a toy routing sketch follows the list):
Switch Transformers – as the foundation of Mixture-of-Experts models
Mixtral of Experts – as the open-source SOTA
Llama 2
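
To make the Mixture-of-Experts idea concrete, here is a toy top-k routing layer. Switch Transformers route each token to a single expert (k=1) and Mixtral to two (k=2); all dimensions and the expert architecture below are invented for illustration.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer with top-k token routing."""

    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # router: scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.gate(x)                         # router logits per token
        weights, idx = scores.topk(self.k, dim=-1)    # pick the top-k experts per token
        weights = weights.softmax(dim=-1)             # normalize their mixing weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Only the selected experts run for each token, which is how MoE models add parameters without a proportional increase in compute per token.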

Happy reading, everyone!

Deep dive into LLM Part One

I’ve started delving deeper into LLMs, and personally, I find it much easier to learn by diving in through practice.

This way, one can grasp all the key concepts and outline a list of papers for further exploration.

I began with the StackLLaMA note: A hands-on guide to train LLaMA with RLHF



Here, you can immediately familiarize yourself with the concepts of Reinforcement Learning from Human Feedback (RLHF), efficient training with LoRA, and PPO.

You’ll also get acquainted with the Hugging Face library zoo: accelerate, bitsandbytes, peft, and trl.
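
For orientation, here is roughly how those pieces fit together in a PPO loop. This mirrors the classic trl API used in the StackLLaMA note (trl's interface has changed in newer releases, so treat it as a sketch rather than copy-paste code); the checkpoint name is a placeholder.

```python
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
config = PPOConfig(model_name=model_name, learning_rate=1.41e-5)

# Policy with a value head for PPO; a frozen copy serves as the reference model.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step: queries in, sampled responses, rewards from a reward model.
# query_tensors, response_tensors, and rewards are assumed to be built elsewhere:
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```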

The note uses the StackExchange dataset, but for variety, I can recommend the Anthropic/hh-rlhf dataset.
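
Loading it takes a single call with the Hugging Face datasets library; each record pairs a human-preferred dialogue with a rejected one:

```python
from datasets import load_dataset

# Human-preference pairs for RLHF reward modeling.
ds = load_dataset("Anthropic/hh-rlhf")

example = ds["train"][0]
print(example["chosen"])    # the dialogue the human preferred
print(example["rejected"])  # the dialogue the human rejected
```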

In the second part, we’ll go through key papers.

Machine learning pipeline basics: Cookiecutter and Hydra

In Data Science courses, homework and projects are done in Jupyter notebooks, and students are not taught to write pipelines. The fact is that working in notebooks, despite its convenience, has drawbacks. For example, you build several types of models with multiple options for filling in missing values (mean, median), generate a set of engineered features, and apply different options for splitting the sample.

You can put all this code in a single notebook and log the metrics and configs, but the code will turn out cumbersome and slow. To run experiments, you will have to skip over or comment out cells that should not be executed.

To solve these problems, I recommend using pipelines to automate machine learning workflows. The main purpose of a pipeline is control: a well-organized pipeline makes the implementation more flexible and the experiments reproducible.
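
As a taste of the Hydra part, here is a minimal sketch (Hydra ≥ 1.2): experiment variants become config values and command-line overrides instead of commented-out cells. The config layout and field names are invented for illustration.

```python
# train.py — assumes a conf/config.yaml like:
#   model:
#     type: random_forest
#     n_estimators: 200
#   imputation: median

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def train(cfg: DictConfig) -> None:
    # Each experiment variant (imputation strategy, model type, split)
    # is a config value instead of a cell you comment in and out.
    print(f"Training {cfg.model.type} with imputation={cfg.imputation}")

if __name__ == "__main__":
    train()
```

Variants are then selected from the command line, e.g. `python train.py imputation=mean model.n_estimators=500`, and Hydra logs each run's config for you.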

Read more

Skills for different Data Scientist levels

Data Science spans a wide range of skills and varying levels of knowledge and experience. The competencies required of a beginner Data Scientist differ from those required of an experienced one. This note is based on my observations and experience as a Head of Machine Learning and Data Science, where I led a team of 35+ people across 7 streams: Fintech, Devices, MobileAd and GEO, Computer Vision, NLP, Internal Projects, and CVM.

In this note, we will consider general skills, without delving into the specifics of NLP and Computer Vision specializations.

Read more