RouteLLM cuts the cost of using LLMs by roughly 3.6x.
For each user query, it decides whether to send the request to a strong or a weak model based on the query’s complexity, optimizing the trade-off between cost and response quality.
The Python library lets you apply this approach directly:
import os

from routellm.controller import Controller

# API keys for both providers of the underlying models.
os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
# Replace with your own model provider; Anyscale's Mixtral is used here.
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

# The controller routes each query to the strong or the weak model
# using the matrix factorization ("mf") router.
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)
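Once the controller is created, you query it through an OpenAI-compatible interface; the router name and a cost threshold are encoded in the model field. The threshold below follows the example in the project’s README and should be calibrated for your own traffic:

response = client.chat.completions.create(
    # Format: router-<router name>-<threshold>. Queries the router deems
    # complex go to the strong model, the rest to the weak one.
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Hello!"}],
)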
Here’s how the router works:
- The router is trained on preference data from 80,000 “battles” on the Chatbot Arena platform.
- To combat data sparsity, models are clustered into tiers: strong models are drawn from the top tiers and weak ones from the third tier.
- The authors test several routing approaches: matrix factorization, a BERT classifier, similarity-weighted (SW) ranking, and a causal LLM classifier based on Llama 3 8B.
- Matrix factorization performs best, cutting costs by 3.66x while maintaining quality comparable to GPT-4.
Routers generalize well across different strong and weak model pairs without needing retraining.
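To make the matrix factorization idea concrete, here is a minimal, hypothetical sketch (the names, dimensions, and threshold are illustrative, not the paper’s implementation): the router learns an embedding per model, scores a projected query embedding against both, and routes on the predicted probability that the strong model wins the “battle”.

import torch
import torch.nn as nn

class MFRouter(nn.Module):
    """Toy matrix-factorization router: a bilinear score between learned
    model embeddings and a projected query embedding."""

    def __init__(self, query_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.project = nn.Linear(query_dim, hidden_dim)      # query -> latent space
        self.strong = nn.Parameter(torch.randn(hidden_dim))  # strong-model vector
        self.weak = nn.Parameter(torch.randn(hidden_dim))    # weak-model vector

    def win_probability(self, query_emb: torch.Tensor) -> torch.Tensor:
        z = self.project(query_emb)
        # Higher score -> the strong model is more likely to win this battle.
        return torch.sigmoid(z @ self.strong - z @ self.weak)

router = MFRouter(query_dim=768)
query = torch.randn(768)   # e.g., a sentence embedding of the user query
p_strong_wins = router.win_probability(query)
threshold = 0.5            # the cost/quality trade-off knob
choice = "strong" if p_strong_wins > threshold else "weak"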
For more details, see the paper.
Overview of Small Language Models (SLMs)
An excellent survey of small language models (SLMs), covering definitions, applications, enhancement techniques, reliability, and much more.
Compact LLMs for Edge Devices
Meta released quantized Llama 3.2 models with 1B and 3B parameters: an ideal choice for edge devices and on-device deployments, prioritizing privacy and speed while retaining nearly full accuracy!
In brief:
- Based on Llama 3.2 with 1B and 3B parameters.
- 2–3x faster inference than the original models.
- 45–60% reduction in model size and memory usage.
- Retains nearly full accuracy.
- Uses 4-bit groupwise quantization for weights and 8-bit dynamic quantization for activations for optimal performance (a minimal sketch follows this list).
- The quantization scheme was designed with PyTorch’s ExecuTorch runtime and Arm CPUs in mind.
- Best suited for: knowledge extraction, summarization, and instruction following.
- Available on Hugging Face.
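The core of the memory savings is the weight format. Below is a minimal sketch of symmetric 4-bit groupwise quantization in PyTorch; it illustrates the idea only and is not Meta’s exact quantization-aware training recipe (the group size here is an arbitrary choice):

import torch

def quantize_groupwise_4bit(weights: torch.Tensor, group_size: int = 32):
    """Each group of `group_size` consecutive weights shares one scale,
    so an outlier in one group does not degrade precision elsewhere."""
    w = weights.reshape(-1, group_size)
    # One scale per group, mapping the max magnitude onto the int4 range [-8, 7].
    scales = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scales), -8, 7)
    return q.to(torch.int8), scales  # int4 values stored in int8 containers

def dequantize(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scales).reshape(shape)

w = torch.randn(4, 64)
q, s = quantize_groupwise_4bit(w)
w_hat = dequantize(q, s, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error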
Ministral 3B and Ministral 8B
Mistral AI has announced the Ministraux: two advanced models for local and edge computing. Ministral 3B and 8B support context lengths of up to 128k tokens, offering high performance at low latency. These models are well suited to tasks that demand privacy and speed, from local analytics to autonomous robotics.
Ministral 8B uses an interleaved sliding-window attention pattern to optimize memory usage and inference speed, and both models can be fine-tuned to handle data processing and API calls in agent workflows. They offer an efficient option for local workloads that must run without an internet connection. Mistral once again raises the bar for compact language models.
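For quick experimentation, the models are also available through Mistral’s hosted API. A minimal sketch, assuming the current mistralai Python SDK (v1) and the ministral-8b-latest model id; for truly offline edge deployments you would run the weights locally instead:

import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# A single chat call against the hosted Ministral 8B model.
response = client.chat.complete(
    model="ministral-8b-latest",
    messages=[{"role": "user", "content": "Summarize the benefits of edge LLMs in two sentences."}],
)
print(response.choices[0].message.content)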
Conclusion
- RouteLLM cuts LLM costs by up to 3.66x by routing each query to either a strong or a weak model.
- Meta’s quantized Llama 3.2 models (1B and 3B parameters) are compact models for edge devices, reducing memory usage by 45–60% while maintaining accuracy.
- Ministral 3B and 8B are high-performance models for local computing that support context up to 128k tokens, ideal for privacy-sensitive tasks.