Loop Through Multiple HTML Tables in HTML Agility Pack
I followed the example at the link below and was able to parse an HTML table into a DataTable successfully. http://blog.ditran.net/parsing-html-table-to-c-usable-datalist/ But I am not
Solution 1:
I'll keep the first answer for reference, but below is a method that will split the original html into a string array with each string element containing the HTML for one table:
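// Required namespaces: System, System.Linq (for ToArray), HtmlAgilityPack.
public static string[] ParseHtmlSplitTables(string htmlString)
{
    string[] result = new string[] { };
    if (!String.IsNullOrWhiteSpace(htmlString))
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var tableNodes = doc.DocumentNode.SelectNodes("//table");
        if (tableNodes != null)
        {
            // Keep the full markup of each table, including the <table> tag itself.
            result = Array.ConvertAll<HtmlNode, string>(tableNodes.ToArray(), n => n.OuterHtml);
        }
    }
    return result;
}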
With the result you can then proceed to parse each table:
string[] htmlTables = ParseHtmlSplitTables(htmlString);
foreach (string html in htmlTables)
{
    List<List<KeyValuePair<string, string>>> parseResult = ParseHtmlToDataTable(html);
    DataTable dataTable = ToDataTable(parseResult);
}
Solution 2:
Since you want to parse multiple HTML tables, you should return a DataSet that has one DataTable per HTML table. If table headers are present, the code below adds the column names to the corresponding DataTable. The HTML table id is used as the name of the DataTable, so you can use it to access the table directly from the DataSet.
Method to convert HTML tables to a DataSet:
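public static DataSet HtmlTablesToDataset(string html)
{
    var resultDataset = new DataSet();
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null, not an empty collection, when nothing matches.
    var tables = doc.DocumentNode.SelectNodes("//table");
    if (tables == null)
    {
        return resultDataset;
    }

    foreach (HtmlNode table in tables)
    {
        // The table's id attribute becomes the DataTable name.
        var resultTable = new DataTable(table.Id);

        // "tr" matches only direct children of <table>; rows inside
        // <thead>/<tbody> would need ".//tr" instead.
        var rows = table.SelectNodes("tr");
        if (rows != null)
        {
            foreach (HtmlNode row in rows)
            {
                var headerCells = row.SelectNodes("th");
                if (headerCells != null)
                {
                    foreach (HtmlNode cell in headerCells)
                    {
                        resultTable.Columns.Add(cell.InnerText);
                    }
                }

                var dataCells = row.SelectNodes("td");
                if (dataCells != null)
                {
                    // Add default-named columns if the row has more cells than headers.
                    while (resultTable.Columns.Count < dataCells.Count)
                    {
                        resultTable.Columns.Add();
                    }
                    var dataRow = resultTable.NewRow();
                    for (int i = 0; i < dataCells.Count; i++)
                    {
                        dataRow[i] = dataCells[i].InnerText;
                    }
                    resultTable.Rows.Add(dataRow);
                }
            }
        }
        resultDataset.Tables.Add(resultTable);
    }
    return resultDataset;
}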
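One caveat with naming each DataTable after the HTML table id: DataSet.Tables.Add throws a DuplicateNameException when a table with the same name already exists, so HTML that repeats an id would make the method fail. The following is only a sketch of one possible workaround. UniqueTableName, the "table" fallback name, and the "_2" suffix scheme are all hypothetical and not part of the original answer.

```csharp
using System;
using System.Data;

public static class TableNameSketch
{
    // Hypothetical helper: returns a name that is not yet used in the DataSet.
    // Falls back to "table" (an arbitrary choice) when the HTML table has no id.
    public static string UniqueTableName(DataSet ds, string id)
    {
        string baseName = string.IsNullOrWhiteSpace(id) ? "table" : id;
        string candidate = baseName;
        int suffix = 2;

        // Append _2, _3, ... until the name is free.
        while (ds.Tables.Contains(candidate))
        {
            candidate = baseName + "_" + suffix;
            suffix++;
        }
        return candidate;
    }

    public static void Main()
    {
        var ds = new DataSet();

        // Simulate three HTML tables: two sharing the id "t1" and one without an id.
        foreach (string id in new[] { "t1", "t1", "" })
        {
            ds.Tables.Add(new DataTable(UniqueTableName(ds, id)));
        }

        foreach (DataTable table in ds.Tables)
        {
            Console.WriteLine(table.TableName);
        }
    }
}
```

In HtmlTablesToDataset this would replace new DataTable(table.Id) with new DataTable(UniqueTableName(resultDataset, table.Id)), assuming renamed tables are acceptable to the caller.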
Test code:
var resultDS = HtmlTablesToDataset(html);
foreach (DataTable dt in resultDS.Tables)
{
    Console.WriteLine("Table: " + dt.TableName);
    string line = "";
    foreach (DataColumn col in dt.Columns)
    {
        line += col.ToString() + " ";
    }
    Console.WriteLine(line.Trim());
    foreach (DataRow row in dt.Rows)
    {
        line = "";
        foreach (DataColumn col in dt.Columns)
        {
            line += row[col].ToString() + " ";
        }
        Console.WriteLine(line.Trim());
    }
}
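The name-based lookup mentioned in this answer can be sketched as follows. To keep the sketch runnable without HTML Agility Pack, the DataSet is built by hand as a stand-in for what HtmlTablesToDataset would return for two tables with ids t1 and t2. BuildSampleDataSet and the class name are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Data;

public static class DataSetLookupSketch
{
    // Stand-in for the result of HtmlTablesToDataset: two tables named after
    // the HTML ids 't1' and 't2', each with columns Col1 and Col2.
    public static DataSet BuildSampleDataSet()
    {
        var ds = new DataSet();
        foreach (string name in new[] { "t1", "t2" })
        {
            var table = new DataTable(name);
            table.Columns.Add("Col1");
            table.Columns.Add("Col2");
            ds.Tables.Add(table);
        }
        ds.Tables["t1"].Rows.Add("1", "2");
        ds.Tables["t1"].Rows.Add("3", "4");
        ds.Tables["t2"].Rows.Add("5", "6");
        ds.Tables["t2"].Rows.Add("7", "8");
        return ds;
    }

    public static void Main()
    {
        DataSet ds = BuildSampleDataSet();

        // Look a table up by the HTML id that was used as its name.
        DataTable t2 = ds.Tables["t2"];
        Console.WriteLine(t2.Rows[1]["Col2"]); // prints 8

        // The name indexer returns null for an unknown table, so check first.
        if (!ds.Tables.Contains("t3"))
        {
            Console.WriteLine("No table named t3");
        }
    }
}
```

Checking Contains before indexing avoids a NullReferenceException when the HTML has no table with the expected id.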
Sample HTML:
string html =
@"
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <table id='t1'>
      <tr>
        <th>Col1</th>
        <th>Col2</th>
      </tr>
      <tr>
        <td>1</td>
        <td>2</td>
      </tr>
      <tr>
        <td>3</td>
        <td>4</td>
      </tr>
    </table>
    <table id='t2'>
      <tr>
        <th>Col1</th>
        <th>Col2</th>
      </tr>
      <tr>
        <td>5</td>
        <td>6</td>
      </tr>
      <tr>
        <td>7</td>
        <td>8</td>
      </tr>
    </table>
  </body>
</html>
";