
Convert HTML tables to a DataSet

Summary: An example of how to extract data from HTML tables and create a DataSet object containing that data.

Requirements
I recently needed to do some "screen scraping" from a locally installed third-party web application and then manipulate the results. The application in question wrote its results out to the page inside <table> tags, and there were several of these tables on the page. I decided to read all of these HTML tables, identifying them with a Regular Expression, and then convert them into one DataSet on which I could perform the required manipulation.

Sample Data
To recreate the page that we need to scrape, I've created a simple function that builds an HTML page containing two tables. You can use this function whilst testing, but I imagine that in a real-life situation you will want to retrieve the HTML directly from the web page, or maybe read it from a local file. The function I created looks like this, although feel free to modify it if you need to:
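As a rough illustration only (written in Python rather than the article's .NET, with a Python-style name `get_html` and made-up table contents that are assumptions of this sketch), a sample-page builder along these lines might look like:

```python
def get_html():
    """Build a small HTML page containing two tables for testing.

    The table contents are invented for illustration. The first table has
    a <th> header row; the second has data rows only and uses upper-case tags.
    """
    return """<html>
<body>
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Alice</td><td>Darlington</td></tr>
  <tr><td>Bob</td><td>Durham</td></tr>
</table>
<TABLE border="1">
  <TR><TD>1</TD><TD>Apples</TD></TR>
  <TR><TD>2</TD><TD>Pears</TD></TR>
</TABLE>
</body>
</html>"""
```

Mixing a table with headers and one without, and mixing tag case, gives the later extraction steps something realistic to cope with.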


Data Extraction
Whichever method we use to retrieve this HTML, we then need to be able to extract the relevant table elements. I decided to use a Regular Expression to do this (adding some options in to make sure that the case and any line breaks were ignored), specifically this one which targets the beginning and end <table> tags:
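A minimal sketch of such a pattern, assuming Python's `re` module instead of the .NET Regex class (the pattern text and the name `find_tables` are assumptions of this sketch, not the original expression):

```python
import re

# Lazily capture everything between an opening <table ...> tag and the next
# closing </table> tag. IGNORECASE handles <TABLE>, and DOTALL lets "." match
# line breaks so that tables spanning several lines are found.
TABLE_PATTERN = re.compile(r"<table\b[^>]*>(.*?)</table>", re.IGNORECASE | re.DOTALL)


def find_tables(html):
    """Return the inner markup of each <table> element found in html."""
    return TABLE_PATTERN.findall(html)


sample = "<TABLE>\n<tr><td>a</td></tr>\n</TABLE><table class='x'><tr><td>b</td></tr></table>"
tables = find_tables(sample)
```

Because the match is lazy, it stops at the first closing tag it meets, so nested tables would not be handled correctly by a pattern like this.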

This will return all of the text in between the <table> tags and will allow us to then apply further Regular Expressions to get the text inside all of the <th>, <tr> and <td> tags. As some of the tables returned to me had <th> tags, and some didn't, I decided to include a check in the function to see if they did exist. If they did, I would use the text inside these tags for the column names in my DataTable; if they didn't exist, I would simply create a default naming scheme (e.g. Column1, Column2 etc).
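The header check could be sketched as follows, again in Python with assumed patterns and a hypothetical helper name `column_names`:

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
# \b stops "<th" from also matching tags such as <thead>.
TH_PATTERN = re.compile(r"<th\b[^>]*>(.*?)</th>", FLAGS)
ROW_PATTERN = re.compile(r"<tr\b[^>]*>(.*?)</tr>", FLAGS)
TD_PATTERN = re.compile(r"<td\b[^>]*>(.*?)</td>", FLAGS)


def column_names(table_html):
    """Use <th> text as column names, else fall back to Column1, Column2, ..."""
    headers = [h.strip() for h in TH_PATTERN.findall(table_html)]
    if headers:
        return headers
    # No <th> tags: count the cells in the first row and generate default names.
    first_row = ROW_PATTERN.search(table_html)
    cell_count = len(TD_PATTERN.findall(first_row.group(1))) if first_row else 0
    return [f"Column{i}" for i in range(1, cell_count + 1)]
```

For example, `column_names("<tr><th>Name</th><th>City</th></tr>")` gives `["Name", "City"]`, while a table with two `<td>` cells per row and no headers gives `["Column1", "Column2"]`.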

Logic
The logic of the function was actually fairly simple and could be broken down into the following "pseudo" steps:

  1. Retrieve each instance of the table elements on the page.
  2. Loop through each table, performing the following checks.
  3. Check for the existence of <th> tags to determine if we know the names of the columns, otherwise just add a default name for each column.
  4. Loop through the rows of the table and for each column, add the value to our column in the DataTable.

Implementation
Translating these steps into .NET, I came up with a function named "ConvertHTMLTablesToDataSet", which accepts the full HTML string, performs the actions we identified above and then returns a DataSet with a corresponding DataTable for each HTML table that was found:
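To illustrate the four steps end to end, here is a hedged Python analogue. It is not the .NET function itself: plain dictionaries stand in for the DataSet and DataTable objects, and the patterns and the name `convert_html_tables_to_dataset` are assumptions of this sketch.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", FLAGS)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", FLAGS)
TH_RE = re.compile(r"<th\b[^>]*>(.*?)</th>", FLAGS)
TD_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", FLAGS)


def convert_html_tables_to_dataset(html):
    """Return one {"columns": [...], "rows": [...]} dict per <table> in html."""
    dataset = []
    # Steps 1 and 2: find every table and loop through them.
    for table_html in TABLE_RE.findall(html):
        # Step 3: use <th> text as column names when it exists.
        columns = [h.strip() for h in TH_RE.findall(table_html)]
        rows = []
        # Step 4: loop through the rows and collect each cell value.
        for row_html in ROW_RE.findall(table_html):
            cells = [c.strip() for c in TD_RE.findall(row_html)]
            if not cells:
                continue  # header-only or empty row: nothing to add
            if not columns:
                # No <th> tags were found, so fall back to default names.
                columns = [f"Column{i}" for i in range(1, len(cells) + 1)]
            rows.append(cells)
        dataset.append({"columns": columns, "rows": rows})
    return dataset


html = (
    "<table><tr><th>Name</th><th>City</th></tr>"
    "<tr><td>Alice</td><td>Darlington</td></tr></table>"
    "<TABLE><TR><TD>1</TD><TD>Apples</TD></TR></TABLE>"
)
dataset = convert_html_tables_to_dataset(html)
```

Here `dataset` holds two entries: the first uses the `<th>` text for its column names, and the second falls back to `Column1` and `Column2`. Cell values are kept as raw strings, so any markup inside a cell is left untouched.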


Viewing the results
If you want to test this function, you can create a simple .aspx page with a Panel on it:

And then create a dynamic GridView for each DataTable, e.g.:

You may also need to include the following Import statements on your page:

When you run this test page in your development environment, if you have used the sample data from the GetHTML function above you should see the following tables:
[Figure: DataTable results]

Considerations and Improvements
You may want to extend the functionality of this approach. For example, the function assumes that the HTML that is retrieved will be valid and in the correct format. I was lucky in the sense that I knew exactly what would be included in the HTML before writing the function. However, if you are retrieving data from an external site this may not always be the case, so you may want to build in your own validity check and associated error handling.
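One possible validity check, sketched in Python under the assumption that each table has already been extracted into a list of column names and a list of rows (the function name `validate_table` and the error messages are hypothetical):

```python
def validate_table(columns, rows):
    """Raise ValueError if the extracted table has no columns or is ragged."""
    if not columns:
        raise ValueError("table has no columns")
    for index, row in enumerate(rows, start=1):
        if len(row) != len(columns):
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {len(columns)}"
            )


# A well-formed table passes silently.
validate_table(["Column1", "Column2"], [["1", "Apples"], ["2", "Pears"]])
```

A caller could catch the ValueError and skip or log the offending table rather than letting one malformed table abort the whole conversion.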
