DocumentChunker 1.0.0

.NET CLI:
dotnet add package DocumentChunker --version 1.0.0

Package Manager (run this in the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):
NuGet\Install-Package DocumentChunker -Version 1.0.0

PackageReference (for projects that support PackageReference, copy this XML node into the project file):
<PackageReference Include="DocumentChunker" Version="1.0.0" />

Paket CLI:
paket add DocumentChunker --version 1.0.0

Script & Interactive (the #r directive can be used in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script's source code):
#r "nuget: DocumentChunker, 1.0.0"

Cake:
// Install DocumentChunker as a Cake Addin
#addin nuget:?package=DocumentChunker&version=1.0.0

// Install DocumentChunker as a Cake Tool
#tool nuget:?package=DocumentChunker&version=1.0.0

Document Chunker Utility Library

This library provides utility classes to break down large files, such as PDF, DOCX, and HTML, into smaller text chunks for creating corpora for RAG prototyping.

Purpose

The primary goal of this library is to assist in creating a corpus for prototyping or testing Retrieval-Augmented Generation (RAG) systems. It is not limited to that purpose, though: it can be used in any application that needs to split large text files into manageable pieces.

Features

  • Supports breaking down the following file types:
    • PDF files
    • DOCX (Microsoft Word) files
    • HTML content
  • Provides efficient processing for large files.
  • Generates precise and context-preserving text chunks.

Licensing

This library is provided under the Apache License. Refer to the repository's NOTICE file for information on the open-source projects leveraged by this library, which are distributed under various permissive open source licenses.

Usage

Installation

You can include this library in your .NET project using your preferred method (for example, the NuGet package or a project reference).

Example Code

Here's a basic example of how to use the library:

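// Example usage of the Document Chunker Library
// (also import the library's own namespace, which is not shown on this page)
using System;

// Sentence-based chunking with maxWordsPerChunk set to 11
var config = new ChunkerConfig(maxWordsPerChunk: 11, chunkType: ChunkType.Sentence);
var chunker = new PdfDocumentChunker(config);
var filePath = "example.pdf";

// Chunks are produced as an async stream, so print each one as it arrives
await foreach (var chunk in chunker.ExtractChunksAsync(filePath))
{
    Console.WriteLine(chunk);
}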
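
To give a feel for what sentence-based chunking with a word limit might do, here is a minimal, self-contained sketch. It is not the library's implementation: SentenceChunkSketch and its Chunk method are hypothetical names introduced only for illustration, the sentence splitting is deliberately naive, and the real behavior of maxWordsPerChunk and ChunkType.Sentence may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Usage: group sentences into chunks of at most 6 words.
var text = "RAG needs a corpus. Chunking builds one. " +
           "Very long sentences are cut when they exceed the word limit.";
foreach (var chunk in SentenceChunkSketch.Chunk(text, maxWords: 6))
{
    Console.WriteLine(chunk);
}

// Hypothetical helper for illustration only; NOT part of DocumentChunker.
static class SentenceChunkSketch
{
    public static IEnumerable<string> Chunk(string text, int maxWords)
    {
        if (maxWords < 1) throw new ArgumentOutOfRangeException(nameof(maxWords));

        // Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+");
        var current = new List<string>();

        foreach (var sentence in sentences)
        {
            var words = sentence.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length == 0) continue;

            // Flush the current chunk if this whole sentence would overflow it.
            if (current.Count > 0 && current.Count + words.Length > maxWords)
            {
                yield return string.Join(" ", current);
                current.Clear();
            }

            // A sentence longer than the limit is cut into maxWords-sized pieces.
            foreach (var word in words)
            {
                if (current.Count == maxWords)
                {
                    yield return string.Join(" ", current);
                    current.Clear();
                }
                current.Add(word);
            }
        }

        // Emit whatever is left over.
        if (current.Count > 0) yield return string.Join(" ", current);
    }
}
```

The sketch keeps whole sentences together where they fit and only cuts inside a sentence when that sentence alone exceeds the limit. That is one plausible reading of "context-preserving" chunks, not a statement of how the library behaves.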
        

Requirements

  • .NET Framework/SDK versions:
    • .NET Framework 4.6.2 or above.
    • .NET 6.0 or above for modern .NET platforms.
    • Compatible with .NET Standard 2.0 for broader compatibility.

Contribution

Contributions are welcome! Please feel free to submit issues or pull requests to improve the library.


For more details about its features and implementation, check out the NOTICE file and LICENSE file included in the project.

Compatible and additional computed target framework versions, by product:

  • .NET
    • Compatible: net6.0, net7.0, net8.0, net9.0
    • Computed: net5.0, net5.0-windows, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows
  • .NET Core
    • Computed: netcoreapp2.0, netcoreapp2.1, netcoreapp2.2, netcoreapp3.0, netcoreapp3.1
  • .NET Standard
    • Compatible: netstandard2.0
    • Computed: netstandard2.1
  • .NET Framework
    • Compatible: net462
    • Computed: net461, net463, net47, net471, net472, net48, net481
  • MonoAndroid
    • Computed: monoandroid
  • MonoMac
    • Computed: monomac
  • MonoTouch
    • Computed: monotouch
  • Tizen
    • Computed: tizen40, tizen60
  • Xamarin.iOS
    • Computed: xamarinios
  • Xamarin.Mac
    • Computed: xamarinmac
  • Xamarin.TVOS
    • Computed: xamarintvos
  • Xamarin.WatchOS
    • Computed: xamarinwatchos

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
1.0.0 73 2/19/2025