Tokenizers.DotNet 1.1.3

Additional Details

The latest release lets you handle the exceptions thrown by this library. For details, see the release page:
https://github.com/sappho192/Tokenizers.DotNet/releases/tag/1.2.0

There is a newer version of this package available.
See the version list below for details.
dotnet add package Tokenizers.DotNet --version 1.1.3

NuGet\Install-Package Tokenizers.DotNet -Version 1.1.3
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Tokenizers.DotNet" Version="1.1.3" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
Directory.Packages.props:
<PackageVersion Include="Tokenizers.DotNet" Version="1.1.3" />
Project file:
<PackageReference Include="Tokenizers.DotNet" />

paket add Tokenizers.DotNet --version 1.1.3

#r "nuget: Tokenizers.DotNet, 1.1.3"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#addin nuget:?package=Tokenizers.DotNet&version=1.1.3
Install Tokenizers.DotNet as a Cake Addin

#tool nuget:?package=Tokenizers.DotNet&version=1.1.3
Install Tokenizers.DotNet as a Cake Tool

Tokenizers.DotNet

.NET wrapper of HuggingFace Tokenizers library


Nuget Package list

Package                                   Description
Tokenizers.DotNet                         Core library
Tokenizers.DotNet.runtime.win-x64         Native bindings for Windows x64
Tokenizers.DotNet.runtime.win-arm64       Native bindings for Windows arm64
Tokenizers.DotNet.runtime.linux-x64       Native bindings for Linux x64
Tokenizers.DotNet.runtime.linux-arm64     Native bindings for Linux arm64

Requirements

  • .NET 6 or above
  • (Build) Latest Rust

Supported functionalities

  • Download tokenizer files from the Hugging Face Hub
  • Load a tokenizer file (.json) from local storage
  • Encode a string to tokens
  • Decode tokens to a string

How to use

(1) Install the packages

  1. From NuGet, install the Tokenizers.DotNet package.
  2. Then install the Tokenizers.DotNet.runtime.<OS>-<ARCH> package that matches your platform (e.g. win-x64 or linux-arm64; see the NuGet package list above).
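
To make step 2 concrete, here is a minimal sketch that derives the runtime package name from the current OS and process architecture. It uses only the standard System.Runtime.InteropServices API and does not call Tokenizers.DotNet itself. The helper name RuntimePackageFor is hypothetical, and the sketch assumes that only the four OS/architecture pairs in the package list above are available.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: build the runtime package name for an OS and CPU architecture.
// Only the four OS/architecture pairs from the package list are recognized.
string RuntimePackageFor(OSPlatform os, Architecture arch)
{
    string osPart =
        os == OSPlatform.Windows ? "win" :
        os == OSPlatform.Linux ? "linux" :
        throw new PlatformNotSupportedException($"No runtime package is listed for OS '{os}'.");

    string archPart = arch switch
    {
        Architecture.X64 => "x64",
        Architecture.Arm64 => "arm64",
        _ => throw new PlatformNotSupportedException(
            $"No runtime package is listed for architecture '{arch}'."),
    };

    return $"Tokenizers.DotNet.runtime.{osPart}-{archPart}";
}

// Detect the OS this process runs on. Any OS other than Windows or Linux is
// labelled OSX here for simplicity and is reported as unsupported below.
OSPlatform currentOs =
    RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? OSPlatform.Windows :
    RuntimeInformation.IsOSPlatform(OSPlatform.Linux) ? OSPlatform.Linux :
    OSPlatform.OSX;

try
{
    // ProcessArchitecture is used because the native library is loaded into this process.
    string package = RuntimePackageFor(currentOs, RuntimeInformation.ProcessArchitecture);
    Console.WriteLine($"dotnet add package {package}");
}
catch (PlatformNotSupportedException ex)
{
    Console.WriteLine(ex.Message);
}
```

The printed command omits a version, so it would install whatever runtime package version NuGet resolves by default; pin a version explicitly if your project needs one.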

(2) Write the code

See the following example code:

using Tokenizers.DotNet;

// Download skt/kogpt2-base-v2/tokenizer.json from the hub
var hubName = "skt/kogpt2-base-v2";
var filePath = "tokenizer.json";
var fileFullPath = await HuggingFace.GetFileFromHub(hubName, filePath, "deps");
Console.WriteLine($"Downloaded {fileFullPath}");

// Create a tokenizer instance
var tokenizer = new Tokenizer(vocabPath: fileFullPath);
var text = "음, 이제 식사도 해볼까요";
Console.WriteLine($"Input text: {text}");
var tokens = tokenizer.Encode(text);
Console.WriteLine($"Encoded: {string.Join(", ", tokens)}");
var decoded = tokenizer.Decode(tokens);
Console.WriteLine($"Decoded: {decoded}");

Console.WriteLine($"Version of Tokenizers.DotNet.runtime.win: {tokenizer.GetVersion()}");

Console.WriteLine("--------------------------------------------------");
// Use another tokenizer
// Download openai-community/gpt2/tokenizer.json from the hub
hubName = "openai-community/gpt2";
filePath = "tokenizer.json";
fileFullPath = await HuggingFace.GetFileFromHub(hubName, filePath, "deps");

// Create a tokenizer instance
var tokenizer2 = new Tokenizer(vocabPath: fileFullPath);
var text2 = "i was nervous before the exam, and i had a fever.";
Console.WriteLine($"Input text: {text2}");
var tokens2 = tokenizer2.Encode(text2);
Console.WriteLine($"Encoded: {string.Join(", ", tokens2)}");
var decoded2 = tokenizer2.Decode(tokens2);
Console.WriteLine($"Decoded: {decoded2}");

Console.WriteLine($"Version of Tokenizers.DotNet.runtime.win: {tokenizer2.GetVersion()}");
Console.ReadKey();

How to build

  1. Prepare the following:
    1. Rust build system (cargo)
    2. .NET build system (dotnet 6.0, 7.0, 8.0, 9.0)
    3. PowerShell (7.4.2 or above recommended)
  2. Bump the version number in NATIVE_LIB_VERSION.txt
  3. Run build_all_clean.ps1
    1. To build Tokenizers.DotNet.runtime.<OS> only, run build_rust.ps1
    2. To build Tokenizers.DotNet only, run build_dotnet.ps1

The build artifacts will be placed in the nuget directory.

Cross-platform build

You can use Docker to compile this library for Windows x64/arm64 and Linux x64/arm64.

Run update_version.ps1 before running Docker to update the package version.

Windows:

PS > docker build -f Dockerfile -t ghcr.io/sappho192/tokenizers.dotnet:latest .
PS > docker run -v .\nuget:/out --rm ghcr.io/sappho192/tokenizers.dotnet:latest

Linux/macOS:

$ docker build -f Dockerfile -t ghcr.io/sappho192/tokenizers.dotnet:latest .
$ docker run -v ./nuget:/out --rm ghcr.io/sappho192/tokenizers.dotnet:latest

Built packages will be in the nuget folder.

Compatible and additional computed target framework versions (.NET):
  • net6.0 is compatible. Computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows.
  • net7.0 is compatible. Computed: net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows.
  • net8.0 is compatible. Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.
  • net9.0 is compatible. Computed: net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows.
  • net6.0: No dependencies.
  • net7.0: No dependencies.
  • net8.0: No dependencies.
  • net9.0: No dependencies.

NuGet packages (1)

Showing the top 1 NuGet packages that depend on Tokenizers.DotNet:

Package Downloads
EDMTranslator

Text translator library based on LLM models, especially EncoderDecoderModel in HuggingFace

GitHub repositories

This package is not used by any popular GitHub repositories.