BlingFireNuget 0.1.5
dotnet add package BlingFireNuget --version 0.1.5
NuGet\Install-Package BlingFireNuget -Version 0.1.5
<PackageReference Include="BlingFireNuget" Version="0.1.5" />
paket add BlingFireNuget --version 0.1.5
#r "nuget: BlingFireNuget, 0.1.5"
// Install BlingFireNuget as a Cake Addin
#addin nuget:?package=BlingFireNuget&version=0.1.5
// Install BlingFireNuget as a Cake Tool
#tool nuget:?package=BlingFireNuget&version=0.1.5
Bling Fire
Introduction
Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), and we help Bing be smarter. Here we want to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing, such as tokenization, multi-word expression matching, unknown word guessing, and stemming / lemmatization, to mention just a few.
Bling Fire Tokenizer Overview
Bling Fire Tokenizer provides state-of-the-art performance for natural language text tokenization. Bling Fire supports four tokenization algorithms:
- Pattern-based tokenization
- WordPiece tokenization
- SentencePiece Unigram LM
- SentencePiece BPE
Bling Fire provides a uniform interface for working with all four algorithms, so it makes no difference to the client whether the tokenizer is used for XLNET, BERT, or your own custom model.
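To make one of the four algorithms concrete, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting in plain Python. It is only an illustration of the general technique: the function name `wordpiece_tokenize`, the `##` continuation prefix, the `[UNK]` token, and the toy vocabulary are assumptions made for this example, and it does not use or reproduce Bling Fire's finite-state implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into subword pieces.

    Illustrative only: names and conventions here are assumptions,
    not Bling Fire's API.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"token", "##ization", "##izer", "un", "##known"}

print(wordpiece_tokenize("tokenization", TOY_VOCAB))
print(wordpiece_tokenize("unknown", TOY_VOCAB))
```

With this toy vocabulary, "tokenization" splits into "token" followed by "##ization", and a word with no matching pieces maps to the single unknown token.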
Model files describe the algorithms they are built for and are loaded on demand from an external file. There are also two default models, for NLTK-style tokenization and sentence breaking, which do not need to be loaded. The default tokenization model follows the logic of NLTK, except that hyphenated words are split and a few "errors" are fixed.
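As a rough illustration of what pattern-based tokenization with split hyphenated words can look like, the sketch below uses a single regular expression in plain Python. The pattern and the function name `simple_words` are assumptions for this example; the actual rules of the default model are more elaborate, and whether the hyphen itself is kept as a token is also an assumption here.

```python
import re

# Assumed toy pattern: a run of word characters, or any single
# non-space, non-word character (punctuation, hyphen, etc.).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_words(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


print(simple_words("Low-latency tokenizers, please!"))
```

Because the hyphen is not a word character, "Low-latency" comes out as three tokens ("Low", "-", "latency") rather than one.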
Normalization can be added to each model, but is optional.
Differences between the algorithms are summarized here.
The Bling Fire Tokenizer high-level API is designed so that it requires minimal or no configuration, initialization, or additional files, and is friendly for use from languages like Python, Ruby, Rust, C#, JavaScript (via WASM), etc.
We have precompiled some popular models and listed them below with source code references:
File Name | Models it should be used for | Algorithm | Source Code |
---|---|---|---|
wbd.bin | Default Tokenization Model | Pattern-based | src |
sbd.bin | Default model for Sentence breaking | Pattern-based | src |
bert_base_tok.bin | BERT Base/Large | WordPiece | src |
bert_base_cased_tok.bin | BERT Base/Large Cased | WordPiece | src |
bert_chinese.bin | BERT Chinese | WordPiece | src |
bert_multi_cased.bin | BERT Multi Lingual Cased | WordPiece | src |
xlnet.bin | XLNET Tokenization Model | Unigram LM | src |
xlnet_nonorm.bin | XLNET Tokenization Model w/o normalization | Unigram LM | src |
bpe_example.bin | A model to test BPE tokenization | BPE | src |
xlm_roberta_base.bin | XLM Roberta Tokenization | Unigram LM | src |
laser100k.bin | Trained on a language-balanced WikiMatrix corpus of 80+ languages | Unigram LM | src |
uri250k.bin | URL tokenization model trained on random URLs from the web | Unigram LM | src |
Oh yes, it is also the fastest! We compared Bling Fire with tokenizers from Hugging Face: Bling Fire runs 4-5 times faster than Hugging Face Tokenizers (see also the Bing Blog Post). We also compared the Bling Fire Unigram LM and BPE implementations to the same ones in the SentencePiece library, and our implementation is ~2x faster (see the XLNET benchmark and BPE benchmark). Not to mention that our default models are 10x faster than the same functionality from SpaCy; see the benchmark wiki and this Bing Blog Post.
So if low-latency inference is what you need, then you have to try Bling Fire!
- .NETCoreApp 3.1: No dependencies.
NuGet packages (2)
Showing the top 2 NuGet packages that depend on BlingFireNuget:
Package | Downloads |
---|---|
SS.SemanticKernel.Extensions: This is a SemanticKernel extension built on the Embedding codebase. | |
BlingFireNetStandard: BlingFire wrapper for .Net Standard, see https://github.com/microsoft/BlingFire for details. | |
GitHub repositories (1)
Showing the top 1 popular GitHub repositories that depend on BlingFireNuget:
Repository | Stars |
---|---|
Azure-Samples/semantic-kernel-rag-chat: Tutorial for ChatGPT + Enterprise Data with Semantic Kernel, OpenAI, and Azure Cognitive Search | |
BlingFire wrapper for .Net Core, see https://github.com/microsoft/BlingFire for details.