LLamaSharp 0.11.2

dotnet add package LLamaSharp --version 0.11.2                
NuGet\Install-Package LLamaSharp -Version 0.11.2                
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="LLamaSharp" Version="0.11.2" />                
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add LLamaSharp --version 0.11.2                
#r "nuget: LLamaSharp, 0.11.2"                
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install LLamaSharp as a Cake Addin
#addin nuget:?package=LLamaSharp&version=0.11.2

// Install LLamaSharp as a Cake Tool
#tool nuget:?package=LLamaSharp&version=0.11.2                


LLamaSharp is a cross-platform library to run 🦙LLaMA/LLaVA models (and others) on your local device. Based on llama.cpp, inference with LLamaSharp is efficient on both CPU and GPU. With its higher-level APIs and RAG support, LLamaSharp makes it convenient to deploy LLMs (Large Language Models) in your application.

Please star the repo to show your support for this project!🤗


<details> <summary>Table of Contents</summary> <ul> <li><a href="#Documentation">Documentation</a></li> <li><a href="#Console Demo">Console Demo</a></li> <li><a href="#Integrations & Examples">Integrations & Examples</a></li> <li><a href="#Get started">Get started</a></li> <li><a href="#FAQ">FAQ</a></li> <li><a href="#Contributing">Contributing</a></li> <li><a href="#Join the community">Join the community</a></li> <li><a href="#Star history">Star history</a></li> <li><a href="#Contributor wall of fame">Contributor wall of fame</a></li> <li><a href="#Map of LLamaSharp and llama.cpp versions">Map of LLamaSharp and llama.cpp versions</a></li> </ul> </details>

📖Documentation

📌Console Demo

<table class="center"> <tr style="line-height: 0"> <td width=50% height=30 style="border: none; text-align: center">LLaMA</td> <td width=50% height=30 style="border: none; text-align: center">LLaVA</td> </tr> <tr> <td width=25% style="border: none"><img src="Assets/console_demo.gif" style="width:100%"></td> <td width=25% style="border: none"><img src="Assets/llava_demo.gif" style="width:100%"></td> </tr> </table>

🔗Integrations & Examples

There are integrations for the following libraries, making it easier to develop your app. The integrations for semantic-kernel and kernel-memory are developed in the LLamaSharp repository, while the others are developed in their own repositories.

  • semantic-kernel: an SDK that integrates LLM like OpenAI, Azure OpenAI, and Hugging Face.
  • kernel-memory: a multi-modal AI Service specialized in the efficient indexing of datasets through custom continuous data hybrid pipelines, with support for RAG (Retrieval Augmented Generation), synthetic memory, prompt engineering, and custom semantic memory processing.
  • BotSharp: an open-source machine learning framework for building AI bot platforms.
  • Langchain: a framework for developing applications powered by language models.

The following examples show how to build apps with LLamaSharp.

LLamaSharp-Integrations

🚀Get started

Installation

To achieve high performance, LLamaSharp interacts with a native library compiled from C++, called the backend. We provide backend packages for Windows, Linux and macOS with CPU, CUDA, Metal and OpenCL support. You don't need to deal with the C++ code yourself; just install the backend packages.

If no published backend matches your device, please open an issue to let us know. If compiling C++ code is not difficult for you, you could also follow this guide to compile a backend yourself and run LLamaSharp with it.

  1. Install the LLamaSharp package from NuGet:
PM> Install-Package LLamaSharp
  2. Install one or more of the backend packages, or use a self-compiled backend.

  3. (optional) For Microsoft semantic-kernel integration, install the LLamaSharp.semantic-kernel package.

  4. (optional) To enable RAG support, install the LLamaSharp.kernel-memory package (this package currently only supports net6.0 or higher), which is based on the Microsoft kernel-memory integration.
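As a sketch, a typical CPU-only setup can be installed from the command line like this (backend package names follow the `LLamaSharp.Backend.*` naming scheme; check NuGet for the full list matching your hardware):

```shell
# Core library
dotnet add package LLamaSharp --version 0.11.2
# CPU backend; swap for a LLamaSharp.Backend.Cuda* package if you have a matching CUDA install
dotnet add package LLamaSharp.Backend.Cpu --version 0.11.2
```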

Model preparation

There are two popular formats for LLM model files today: the PyTorch format (.pth) and the Huggingface format (.bin). LLamaSharp uses GGUF files, which can be converted from both of these formats. To get a GGUF file, there are two options:

  1. Search for the model name + 'gguf' on Huggingface; you will find many model files that have already been converted to GGUF format. Pay attention to their publishing time, because some older files only work with older versions of LLamaSharp.

  2. Convert the PyTorch or Huggingface format to GGUF format yourself. Please follow the instructions in this part of the llama.cpp readme to convert them with the python scripts.

Generally, we recommend downloading quantized models rather than fp16 ones, because quantization significantly reduces the required memory while only slightly impacting generation quality.
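For option 2, the conversion roughly follows the shape below (a sketch only: script names, flags and the quantize binary vary between llama.cpp versions, so check the readme of the llama.cpp checkout you are using):

```shell
# Convert a PyTorch / Huggingface checkpoint to a GGUF file in fp16
python convert.py ./path/to/model --outtype f16 --outfile model-f16.gguf
# Quantize the fp16 file, e.g. to Q4_K_M, to reduce memory usage
./quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```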

Example of LLaMA chat session

Here is a simple example of chatting with an LLM-based bot in LLamaSharp. Please replace the model path with your own.

using LLama.Common;
using LLama;

string modelPath = @"<Your Model Path>"; // change it to your own model path.

var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024, // The maximum context length, used as the chat memory.
    GpuLayerCount = 5 // How many layers to offload to GPU. Please adjust it according to your GPU memory.
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Add chat histories as prompt to tell AI how to act.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.");
chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

ChatSession session = new(executor, chatHistory);

InferenceParams inferenceParams = new InferenceParams()
{
    MaxTokens = 256, // No more than 256 tokens should appear in the answer. Remove this if the anti-prompt alone is enough control.
    AntiPrompts = new List<string> { "User:" } // Stop generation once antiprompts appear.
};

Console.ForegroundColor = ConsoleColor.Yellow;
Console.Write("The chat session has started.\nUser: ");
Console.ForegroundColor = ConsoleColor.Green;
string userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    // Stream the response token by token.
    await foreach (
        var text
        in session.ChatAsync(
            new ChatHistory.Message(AuthorRole.User, userInput),
            inferenceParams))
    {
        Console.ForegroundColor = ConsoleColor.White;
        Console.Write(text);
    }
    Console.ForegroundColor = ConsoleColor.Green;
    userInput = Console.ReadLine() ?? "";
}

For more examples, please refer to LLamaSharp.Examples.

💡FAQ

Why isn't the GPU used even though CUDA is installed?
  1. If you are using backend packages, make sure you have installed the CUDA backend package that matches the CUDA version on your device. Note that before LLamaSharp v0.10.0, only one backend package should be installed.
  2. Add NativeLibraryConfig.Instance.WithLogs(LLamaLogLevel.Info) to the very beginning of your code. The log will show which native library file is loaded. If the CPU library is loaded, please try to compile the native library yourself and open an issue about it. If the CUDA library is loaded, check that GpuLayerCount > 0 when loading the model weights.
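A minimal sketch of this diagnostic, assuming the LLamaSharp 0.11.x API (the NativeLibraryConfig call must run before the first use of the native library, and the model path is a placeholder for your own):

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Must run before anything touches the native library,
// so the log can show which backend file is loaded.
NativeLibraryConfig.Instance.WithLogs(LLamaLogLevel.Info);

var parameters = new ModelParams(@"<Your Model Path>")
{
    GpuLayerCount = 20 // Must be > 0, otherwise nothing is offloaded to the GPU.
};
using var model = LLamaWeights.LoadFromFile(parameters);
```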
Why is inference slow?

Firstly, due to the large size of LLM models, generating output takes longer than with other kinds of models, especially when you are using a model larger than 30B.

To see whether it's a LLamaSharp performance issue, please follow the two tips below.

  1. If you are using CUDA, Metal or OpenCL, set GpuLayerCount as large as possible.
  2. If it's still slower than you expect, try running the same model with the same settings in the llama.cpp examples. If llama.cpp outperforms LLamaSharp significantly, it's likely a LLamaSharp bug; please report it to us.
Why does the program crash before any output is generated?

Generally, there are two possible causes for this problem:

  1. The native library (backend) you are using is not compatible with your LLamaSharp version. If you compiled the native library yourself, make sure you have checked out llama.cpp at the commit corresponding to your LLamaSharp version, which can be found at the bottom of this README.
  2. The model file you are using is not compatible with the backend. If you are using a GGUF file downloaded from huggingface, please check its publishing time.
Why does my model generate output infinitely?

Please set an anti-prompt or a maximum token count when executing the inference.
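For example, reusing the InferenceParams type from the chat example above (the "User:" anti-prompt assumes a transcript-style prompt like the one in that example):

```csharp
using LLama.Common;

var inferenceParams = new InferenceParams()
{
    MaxTokens = 256,                           // Hard cap on the number of generated tokens.
    AntiPrompts = new List<string> { "User:" } // Stop as soon as the model starts a new "User:" turn.
};
```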

🙌Contributing

Any contribution is welcome! There's a TODO list in the LLamaSharp Dev Project and you can pick an interesting item to start with. Please read the contributing guide for more information.

You can also do one of the following to help us make LLamaSharp better:

  • Submit a feature request.
  • Star and share LLamaSharp to let others know about it.
  • Write a blog post or demo about LLamaSharp.
  • Help to develop the Web API and UI integrations.
  • Just open an issue about a problem you've met!

Join the community

Join our chat on Discord (please contact Rinne to join the dev channel if you want to be a contributor).

Join QQ group

Star history

Star History Chart

Contributor wall of fame

LLamaSharp Contributors

Map of LLamaSharp and llama.cpp versions

If you want to compile llama.cpp yourself, you must use the exact commit ID listed for each version.

LLamaSharp | Verified Model Resources | llama.cpp commit id
v0.2.0 | This version is not recommended for use. | -
v0.2.1 | WizardLM, Vicuna (filenames with "old") | -
v0.2.2, v0.2.3 | WizardLM, Vicuna (filenames without "old") | 63d2046
v0.3.0, v0.4.0 | LLamaSharpSamples v0.3.0, WizardLM | 7e4ea5b
v0.4.1-preview | Open llama 3b, Open Buddy | aacdbd4
v0.4.2-preview | Llama2 7B (GGML) | 3323112
v0.5.1 | Llama2 7B (GGUF) | 6b73ef1
v0.6.0 | - | cb33f43
v0.7.0, v0.8.0 | Thespis-13B, LLaMA2-7B | 207b519
v0.8.1 | - | e937066
v0.9.0, v0.9.1 | Mixtral-8x7B | 9fb13f9
v0.10.0 | Phi2 | d71ac90
v0.11.1, v0.11.2 | LLaVA-v1.5, Phi2 | 3ab8b3a

License

This project is licensed under the terms of the MIT license.

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 is compatible.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 is compatible.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 is compatible.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
.NET Core netcoreapp2.0 was computed.  netcoreapp2.1 was computed.  netcoreapp2.2 was computed.  netcoreapp3.0 was computed.  netcoreapp3.1 was computed. 
.NET Standard netstandard2.0 is compatible.  netstandard2.1 was computed. 
.NET Framework net461 was computed.  net462 was computed.  net463 was computed.  net47 was computed.  net471 was computed.  net472 was computed.  net48 was computed.  net481 was computed. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen40 was computed.  tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 

NuGet packages (10)

Showing the top 5 NuGet packages that depend on LLamaSharp:

Package Downloads
Microsoft.KernelMemory.AI.LlamaSharp

Provide access to OpenAI LLM models in Kernel Memory to generate text

LLamaSharp.semantic-kernel

The integration of LLamaSharp and Microsoft semantic-kernel.

LLamaSharp.kernel-memory

The integration of LLamaSharp and Microsoft kernel-memory. It could make it easy to support document search for LLamaSharp model inference.

LangChain.Providers.LLamaSharp

LLamaSharp Chat model provider.

LangChain.Providers.Automatic1111

Automatic1111 Stable Diffusion model provider.

GitHub repositories (4)

Showing the top 4 popular GitHub repositories that depend on LLamaSharp:

Repository Stars
SciSharp/BotSharp
AI Multi-Agent Framework in .NET
microsoft/kernel-memory
RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
CodeMazeBlog/CodeMazeGuides
The main repository for all the Code Maze guides
jxq1997216/AITranslator
A GUI tool that uses large language models to translate the untranslated files exported by MTool.
Version Downloads Last updated
0.19.0 1,310 11/8/2024
0.18.0 7,617 10/19/2024
0.17.0 838 10/13/2024
0.16.0 45,917 9/1/2024
0.15.0 18,168 8/3/2024
0.14.0 3,132 7/16/2024
0.13.0 51,694 6/4/2024
0.12.0 82,462 5/12/2024
0.11.2 15,000 4/6/2024
0.11.1 843 3/31/2024
0.10.0 6,298 2/15/2024
0.9.1 10,052 1/6/2024
0.9.0 547 1/6/2024
0.8.1 22,222 11/28/2023
0.8.0 21,477 11/12/2023
0.7.0 1,849 10/31/2023
0.6.0 2,369 10/24/2023
0.5.1 6,971 9/5/2023
0.4.2-preview 1,945 8/6/2023
0.4.1-preview 1,251 6/21/2023
0.4.0 11,418 6/19/2023
0.3.0 10,703 5/22/2023
0.2.3 714 5/17/2023
0.2.2 639 5/17/2023
0.2.1 677 5/12/2023
0.2.0 866 5/12/2023

LLamaSharp 0.11.2 fixed the performance issue of LLaVA on GPU and improved the log suppression.