Summary: An example of how we can extract data from HTML tables and create a DataSet object containing this data.
Requirements

I recently needed to do some "screen scraping" from a locally installed third-party web application and then do some data manipulation based on the results. The application in question wrote the results out to the page inside <table> tags, and there were several of these tables on the page. I decided to read all of these HTML tables, identifying them with a Regular Expression, and then convert them into one DataSet on which I could perform the required manipulation.

Sample Data

To recreate the page that we need to scrape, I've created a simple function that builds an HTML page containing two tables. You can use this function whilst doing your testing, but I imagine that in a real-life situation you will want to retrieve the HTML directly from the web page, or maybe read all the lines from a locally based file. The function I created looks like this, although feel free to modify it if you need to:

    ' Requires "Imports System.Text" at the top of the file (for StringBuilder)
    Private Function GetHTML() As String

        ' Declarations
        Dim sb As New StringBuilder

        ' Create a valid HTML file
        sb.AppendLine("<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">")
        sb.AppendLine("<html>")
        sb.AppendLine("<head>")
        sb.AppendLine("<meta http-equiv=""Content-Type"" content=""text/html; charset=iso-8859-1"">")
        sb.AppendLine("<title>Title</title>")
        sb.AppendLine("</head>")
        sb.AppendLine("<body>")

        ' Table One (with headers)
        sb.AppendLine("<table>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<th>Table 1 - Header 1</th>")
        sb.AppendLine("<th>Table 1 - Header 2</th>")
        sb.AppendLine("<th>Table 1 - Header 3</th>")
        sb.AppendLine("</tr>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<td>Table 1 - Row 1 - Column 1</td>")
        sb.AppendLine("<td>Table 1 - Row 1 - Column 2</td>")
        sb.AppendLine("<td>Table 1 - Row 1 - Column 3</td>")
        sb.AppendLine("</tr>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<td>Table 1 - Row 2 - Column 1</td>")
        sb.AppendLine("<td>Table 1 - Row 2 - Column 2</td>")
        sb.AppendLine("<td>Table 1 - Row 2 - Column 3</td>")
        sb.AppendLine("</tr>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<td>Table 1 - Row 3 - Column 1</td>")
        sb.AppendLine("<td>Table 1 - Row 3 - Column 2</td>")
        sb.AppendLine("<td>Table 1 - Row 3 - Column 3</td>")
        sb.AppendLine("</tr>")
        sb.AppendLine("</table>")

        ' Table Two (without headers)
        sb.AppendLine("<table>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<td>Table 2 - Row 1 - Column 1</td>")
        sb.AppendLine("<td>Table 2 - Row 1 - Column 2</td>")
        sb.AppendLine("<td>Table 2 - Row 1 - Column 3</td>")
        sb.AppendLine("</tr>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<td>Table 2 - Row 2 - Column 1</td>")
        sb.AppendLine("<td>Table 2 - Row 2 - Column 2</td>")
        sb.AppendLine("<td>Table 2 - Row 2 - Column 3</td>")
        sb.AppendLine("</tr>")
        sb.AppendLine("<tr>")
        sb.AppendLine("<td>Table 2 - Row 3 - Column 1</td>")
        sb.AppendLine("<td>Table 2 - Row 3 - Column 2</td>")
        sb.AppendLine("<td>Table 2 - Row 3 - Column 3</td>")
        sb.AppendLine("</tr>")
        sb.AppendLine("</table>")

        ' Close the HTML elements
        sb.AppendLine("</body>")
        sb.AppendLine("</html>")

        Return sb.ToString

    End Function

Data Extraction

Whichever method we use to retrieve this HTML, we then need to be able to extract the relevant table elements. I decided to use a Regular Expression to do this (adding some options to make sure that case and any line breaks were ignored), specifically this one, which targets the beginning and end <table> tags:

    <table[^>]*>(.*?)</table>

This will return all of the text in between the <table> tags and will allow us to then apply further Regular Expressions to get the text inside all of the <th>, <tr> and <td> tags. As some of the tables returned to me had
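A minimal sketch of the extraction step described above, not the article's original code. The module name TableScraper, the function name ExtractTables, and the <tr>, <th> and <td> patterns are assumptions modelled on the <table> expression. RegexOptions.IgnoreCase and RegexOptions.Singleline are assumed to be the options that ignore case and line breaks.

```vbnet
Imports System.Data
Imports System.Text.RegularExpressions

Module TableScraper

    ' Assumed options: ignore case, and let "." match line breaks.
    Private Const Options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

    ' Hypothetical helper: turns every <table> in the HTML into a DataTable.
    Public Function ExtractTables(html As String) As DataSet
        Dim ds As New DataSet()

        For Each tableMatch As Match In Regex.Matches(html, "<table[^>]*>(.*?)</table>", Options)
            Dim dt As New DataTable()
            ds.Tables.Add(dt)

            For Each rowMatch As Match In Regex.Matches(tableMatch.Groups(1).Value, "<tr[^>]*>(.*?)</tr>", Options)
                Dim rowHtml As String = rowMatch.Groups(1).Value

                ' A row of <th> cells names the columns instead of adding data.
                Dim headers As MatchCollection = Regex.Matches(rowHtml, "<th[^>]*>(.*?)</th>", Options)
                If headers.Count > 0 Then
                    For Each header As Match In headers
                        dt.Columns.Add(header.Groups(1).Value.Trim())
                    Next
                    Continue For
                End If

                Dim cells As MatchCollection = Regex.Matches(rowHtml, "<td[^>]*>(.*?)</td>", Options)
                If cells.Count = 0 Then Continue For

                ' Tables without headers get default column names.
                While dt.Columns.Count < cells.Count
                    dt.Columns.Add()
                End While

                Dim row As DataRow = dt.NewRow()
                For i As Integer = 0 To cells.Count - 1
                    row(i) = cells(i).Groups(1).Value.Trim()
                Next
                dt.Rows.Add(row)
            Next
        Next

        Return ds
    End Function

End Module
```

Calling TableScraper.ExtractTables(GetHTML()) on the sample page should give a DataSet with two DataTables, the first with columns named after its <th> cells and the second with default column names. The sketch does not handle nested tables, and duplicate header names would make DataColumnCollection.Add throw.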
Posted on 04/12/2007 01:38:21
1. HY Lee 02/02/2008 04:42:43
Thanks!!
2. R Peters 19/02/2008 15:49:37
Thanks! I tried this with my code and it is great, but is there a way to get the details from HTML controls that are in a table? If there is a drop down in the table, how can you get the selected value? Thanks!
3. kanedogg 03/03/2008 16:14:59
R Peters, in answer to your question: I would simply add some JavaScript to the equation. The code for a normal "get value" in JavaScript is below. For example, say your drop-down list's id is ddl1 and there is a textbox (id textbox1) that you want to give this value to:

    var ddlVAL = document.getElementById('ddl1').value;
    var El_NEW = document.getElementById('textbox1');
    El_NEW.value = ddlVAL;

and so on! Hope this helps. Cheers
4. XAncholy 19/06/2008 09:46:11
Thanks for this great code. What if a table has 2 columns with no headers, BUT the header for each row is in Table 1, column 1, row x? E.g.:

    Name | Donald Duck
    Age  | 150

How would I then capture it?
5. Goce Ristanoski 21/08/2008 07:33:25
I need to extract data from an HTML page that has nested tables in it, and the application I am developing is in C#. Is there a way that this code would work with nested tables, and what would the code look like in C#?