Introduction To The HtmlAgilityPack Library

If you haven't heard of the HtmlAgilityPack before, you could be in for a treat. It's an open source library hosted on CodePlex that greatly eases HTML screen scraping.

What is screen scraping?

Screen scraping is the term for downloading a web page and then parsing its HTML to pull information out of it. In the old days, before RSS feeds, this was the only way to get at a website's information. With the advent of syndicated content and web services it has become a lot easier to get information from other sites, but there are still situations where you need to go back to screen scraping to get the job done.

One example of where you would still use this technique today is a price comparison site. When the complete stock information is not made available, you have to read the data out of the website itself. Other uses include weather data monitoring, website change detection (such as spotting the release of a new version of software) and getting meta information from a page.

Doesn't .NET support this natively?

The .NET Framework comes with built-in classes that manipulate XML documents with ease. In most cases, however, you won't get very far trying to screen scrape with them, because they require well-formed mark-up and most web sites don't provide it. The XML standard has a strict policy of failing at the first error, so unless you give these classes perfectly formed mark-up you won't be able to use them.
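To make the "fail at the first error" behaviour concrete, here is a minimal sketch that feeds two snippets to the built-in XmlDocument class. The class name StrictXmlDemo, the method TryLoadAsXml and both sample strings are made up for this illustration.

```csharp
using System;
using System.Xml;

public static class StrictXmlDemo
{
    // Hypothetical helper: returns true if the mark-up loads as XML,
    // false if the XML parser rejects it.
    public static bool TryLoadAsXml(string markup)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException ex)
        {
            // The parser stops at the first well-formedness error.
            Console.WriteLine("XML parser gave up: " + ex.Message);
            return false;
        }
    }

    public static void Main()
    {
        // Typical real-world HTML: the <p> and <br> tags are never closed.
        string sloppyHtml = "<html><body><p>Price: 9.99<br></body></html>";

        // The same content written as well-formed XML.
        string tidyMarkup = "<html><body><p>Price: 9.99<br /></p></body></html>";

        Console.WriteLine("Sloppy HTML loaded: " + TryLoadAsXml(sloppyHtml));
        Console.WriteLine("Tidy mark-up loaded: " + TryLoadAsXml(tidyMarkup));
    }
}
```

The first snippet is the kind of mark-up browsers render without complaint, yet XmlDocument refuses it because the closing body tag arrives while the br element is still open.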

The HtmlAgilityPack provides a set of classes that make it easy to download HTML pages into memory and then query them using XPath syntax. It doesn't matter if the page isn't standards compliant; the library will just do the best it can with what it has.
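As a rough sketch of what that looks like in practice, the example below loads a deliberately sloppy HTML string and pulls values out of it with an XPath query. It assumes the HtmlAgilityPack assembly is referenced by the project. The class name ScrapeDemo, the method ExtractPrices, the sample mark-up and the "price" CSS class are all invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, not part of .NET

public static class ScrapeDemo
{
    // Hypothetical helper: returns the text of every <span class="price">.
    public static List<string> ExtractPrices(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: unclosed tags are not fatal

        var prices = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//span[@class='price']");
        if (nodes == null)
        {
            return prices;
        }

        foreach (HtmlNode node in nodes)
        {
            prices.Add(node.InnerText.Trim());
        }
        return prices;
    }

    public static void Main()
    {
        // Invented sample page: the <p> and <br> tags are never closed.
        string html =
            "<html><body>" +
            "<p>Widget <span class=\"price\">9.99</span><br>" +
            "<p>Gadget <span class=\"price\">4.50</span>" +
            "</body></html>";

        foreach (string price in ExtractPrices(html))
        {
            Console.WriteLine(price);
        }
    }
}
```

To work against a live page rather than a string, the library's HtmlWeb class can fetch and parse in one step, for example new HtmlWeb().Load(url), where url is whatever address you want to scrape. Checking the result of SelectNodes for null before looping is worth the extra lines, since a page layout change is the most common reason a scraper stops finding anything.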

Where can I download it?

The CodePlex project can be found at the following location:

To download it, just click the Download Now button on the right-hand side of the page and then press I Agree to accept the CodePlex license agreement.

The pack comes with a couple of examples to get you started, and there are more posts in this series here on this site.

More In This Series

This article is part of a series. You can find more posts in this series here:
