Renat Alimbekov's personal blog -Data Science, ML and Analytics Engineering

How to Speed Up LLMs and Reduce Costs: Edge Models

RouteLLM reduces your costs for using LLMs by 3.6 times.

It chooses whether to use a strong or weak LLM model depending on the complexity of the user’s query. This optimizes the balance between cost and quality of the response.

The Python library allows you to use this approach directly.

import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
# Replace with your model provider, we use Anyscale's Mixtral here.
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

client = Controller(
  routers=["mf"],
  strong_model="gpt-4-1106-preview",
  weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

Trending Articles on Large Language Models

Google DeepMind has developed a multi-pass online approach using reinforcement learning to enhance the self-correction capabilities of large language models (LLMs).

It has been shown that supervised fine-tuning (SFT) is ineffective for learning self-correction and faces a mismatch between training data and the model’s responses. To address this issue, a two-stage approach is proposed, which first optimizes self-correction behavior and then uses an additional reward to reinforce self-correction during training. This method relies entirely on data generated by the model itself.

When applied to the Gemini 1.0 Pro and 1.5 Flash models, it achieved record-breaking self-correction performance, improving the baseline models by 15.6% and 9.1%, respectively, in MATH and HumanEval tests.

All the Latest in the World of LLM

Over the past month, there have been some very interesting and significant events in the world of Large Language Models (LLM).

Major companies have released fresh versions of their models. First, Google launched two new models, Gemini: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002.

Key Features:

More than a 50% price reduction for the 1.5 Pro version
Results are delivered twice as fast with three times lower latency

The main focus has been on improving performance and speed and reducing costs for models intended for industrial-grade systems.

Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002.

Details here

Key Trends in LLM Reasoning Development

In these notes, I’d like to highlight the latest trends and research in reasoning and new prompting techniques that improve output.

Simply put, reasoning is the process of multi-step thinking, where several consecutive steps of reflection are performed, with each step depending on the previous one.

It may seem that Reasoning and Chain of Thought (CoT) are the same thing. They are related but represent different concepts.

Reasoning is a general concept of thinking and making inferences. It encompasses any forms of reflection and conclusions. Chain of Thought is a specific technique used to improve reasoning by adding intermediate steps to help the model clearly express its thoughts and reach more accurate solutions.

Pandas for Data Science

With this note, I am launching a series of articles for beginners in Data Science and Machine Learning. We’ll start by exploring Pandas. While there are many articles on Pandas available online, I want to focus on practical techniques for using Pandas in Data Science projects and model building.

Dataset: We will use the German Credit Risk dataset from Kaggle.

The dataset contains information on credit data:

Age
Sex
Job
Housing
Saving accounts
Checking account
Credit amount
Duration
Purpose