ProSol.WebScrap
2.0.2
dotnet add package ProSol.WebScrap --version 2.0.2
An HTML parser for extracting text from web pages with CSS selectors.
Purpose
The purpose of this library is to extract the essential data from a web page and return it to the user in JSON format.
It could be further used for:
- Analyzing the essential data, for example in charts, diagrams, or plain tables.
- Tracking the history of the essential data, such as sale prices, currency rates, or user activity.
- Searching for specific essential data, such as a word mentioned across multiple HTML resources: a movie title, a product name, or any other mention.
Usage
Let's make a console demo and install the package:
dotnet new console -n WebScrap.Demo.CLI
cd WebScrap.Demo.CLI
dotnet add package ProSol.WebScrap --version 2.0.2
And try the following code:
using ProSol.WebScrap;
var request = "https://en.wikipedia.org/wiki/Food_energy";
// Download the html:
using var client = new HttpClient();
using var response = await client.GetAsync(request);
var html = await response.Content.ReadAsStringAsync();
// Run the WebScrapper:
var css = "#firstHeading";
var result = WebScrapper
    .Run(html, css)
    .ToJsonString();
// Get the results:
Console.WriteLine(result);
// OUTPUT:
// [{"key":"#firstHeading","values":[{"value":"Food energy"}]}]
Console.Read();
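To use the result in code rather than just print it, the JSON string can be read back with `System.Text.Json`. The sketch below is not part of the library; it assumes the output always has the shape shown in the demo (an array of objects with a `key` and a `values` array of `value` entries) and uses the sample output verbatim in place of a live request.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Sample output copied from the demo above; a real run would use `result`.
var json = """[{"key":"#firstHeading","values":[{"value":"Food energy"}]}]""";

// Collect "selector: text" pairs from the assumed JSON shape.
var lines = new List<string>();

using var doc = JsonDocument.Parse(json);
foreach (var entry in doc.RootElement.EnumerateArray())
{
    var selector = entry.GetProperty("key").GetString();
    foreach (var item in entry.GetProperty("values").EnumerateArray())
    {
        var text = item.GetProperty("value").GetString();
        lines.Add($"{selector}: {text}");
    }
}

foreach (var line in lines)
{
    Console.WriteLine(line);
}
// Prints: #firstHeading: Food energy
```

If the output shape differs in other versions of the package, the property names above would need to be adjusted.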
Known Issues
The project is currently under active development. It has some known issues, some of them obvious, that are not a priority right now.
CSS
- Multiple comma-separated CSS selectors in a single entry are not supported.
- Attribute-based CSS selectors are not supported.
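Until comma-separated selectors are supported, one possible workaround is to split the list on the caller's side and run each selector separately. The sketch below is an illustration only: `SplitSelectors` and `AddIfNotEmpty` are hypothetical helpers, not part of ProSol.WebScrap, and the splitting only tracks parentheses and brackets, not quoted strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of ProSol.WebScrap): splits a selector list
// on top-level commas, ignoring commas nested inside (...) or [...].
static List<string> SplitSelectors(string selectorList)
{
    var parts = new List<string>();
    var current = new StringBuilder();
    var depth = 0;

    foreach (var ch in selectorList)
    {
        if (ch is '(' or '[') depth++;
        else if (ch is ')' or ']') depth = Math.Max(0, depth - 1);

        if (ch == ',' && depth == 0)
        {
            AddIfNotEmpty(parts, current);
            current.Clear();
        }
        else
        {
            current.Append(ch);
        }
    }

    AddIfNotEmpty(parts, current);
    return parts;
}

// Hypothetical helper: stores the trimmed selector unless it is empty.
static void AddIfNotEmpty(List<string> parts, StringBuilder current)
{
    var selector = current.ToString().Trim();
    if (selector.Length > 0) parts.Add(selector);
}

foreach (var selector in SplitSelectors("#firstHeading, .mw-body p"))
{
    Console.WriteLine(selector);
}
// Prints:
// #firstHeading
// .mw-body p
```

Each resulting selector could then be passed to `WebScrapper.Run(html, selector)` as in the demo, assuming one selector per call.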
HTML
- The object model returns tags in reverse order.
- Non-Unicode text is not converted.
Goals
This project aims to extract text from HTML in a performant way.
Extract text
Plain text
: The tool must extract plain text from HTML.
User-defined result structure
: The amount of text and its structure are defined by the user via multiple CSS selectors.
Performance
Parallel processing
: All CSS selectors should process the HTML in parallel.
Stream-based processing
: The processed parts of the HTML should be released from memory.
Footnote
- Versioning complies with SemVer 2.0.0. Please refer to semver.org for details.
- Please refer to the Changelog for progress.
Product | Versions (compatible and additional computed target frameworks) |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
Dependencies (net8.0):
- ProSol.Html.TagsProvider (>= 2.0.0)
- ProSol.Messaging (>= 4.0.0)