# Llama.cpp

llama-cpp-python is a simple Python binding for @ggerganov's llama.cpp. This package provides:
- Low-level access to the C API via the ctypes interface
- High-level Python API for text completion (see the sketch after this list)
  - OpenAI-like API
  - LangChain compatibility
  - LlamaIndex compatibility
- OpenAI compatible web server
  - Local Copilot replacement
  - Function Calling support
  - Vision API support
  - Multiple Models
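
To make the high-level API concrete, here is a minimal completion sketch using llama-cpp-python directly; the model path and prompt are placeholders, not part of this integration guide:

```python
from llama_cpp import Llama

# Load a local GGUF model (the path is a placeholder).
llama = Llama(model_path="./models/7B/ggml-model.gguf")

# OpenAI-style text completion.
output = llama(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=32,
    stop=["Q:", "\n"],  # stop generating before the model starts a new question
    echo=True,  # include the prompt in the returned text
)
print(output["choices"][0]["text"])
```
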
## Overview

### Integration details

| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| ChatLlamaCpp | langchain-community | ✅ | ❌ | ❌ |

### Model features

| Tool calling | Structured output | JSON mode | Image input | Audio input | Video input | Token-level streaming | Native async | Token usage | Logprobs |
|---|---|---|---|---|---|---|---|---|---|
| ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |

## Setup

To get started and use all the features shown below, we recommend using a model that has been fine-tuned for tool calling.

We will use Hermes-2-Pro-Llama-3-8B-GGUF from NousResearch.

> Hermes 2 Pro is an upgraded version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities, but also excels at Function Calling.

See our guides on local models to go deeper.

### Installation

The LangChain LlamaCpp integration lives in the `langchain-community` and `llama-cpp-python` packages:

```python
%pip install -qU langchain-community llama-cpp-python
```
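
Note that the `n_gpu_layers` setting in the instantiation below only takes effect if llama-cpp-python was built with GPU support. As a sketch, a CUDA-enabled install typically looks like the following; the exact CMake flag varies by backend (CUDA, Metal, ROCm) and package version:

```bash
# Reinstall llama-cpp-python with the CUDA backend enabled.
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```
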
## Instantiation

Now we can instantiate our model object and generate chat completions:

```python
import multiprocessing

from langchain_community.chat_models import ChatLlamaCpp

# Path to your model weights
local_model = "local/path/to/Hermes-2-Pro-Llama-3-8B-Q8_0.gguf"

llm = ChatLlamaCpp(
    temperature=0.5,
    model_path=local_model,
    n_ctx=10000,
    n_gpu_layers=8,
    n_batch=300,  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.
    max_tokens=512,
    n_threads=multiprocessing.cpu_count() - 1,
    repeat_penalty=1.5,
    top_p=0.5,
    verbose=True,
)
```
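
Once instantiated, the model can be invoked like any LangChain chat model. The translation prompt below is just an illustrative example:

```python
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
print(ai_msg.content)
```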
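
Because the Setup section recommends a tool-calling model, here is a minimal sketch of function calling with `bind_tools`. The `get_weather` tool is a hypothetical stub, and ChatLlamaCpp generally needs an explicit `tool_choice` to force a tool call:

```python
from langchain_core.tools import tool


@tool
def get_weather(location: str) -> str:
    """Get the current weather in a given location."""
    # Hypothetical stub: swap in a real weather lookup.
    return f"It is sunny in {location}."


llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    # Force the model to call get_weather; the model is not left to pick a tool on its own.
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)

ai_msg = llm_with_tools.invoke("What is the weather like in Ho Chi Minh City today?")
print(ai_msg.tool_calls)
```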