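Parsing tables and cells with Html Agility Pack in C#
I need to parse HTML code. More specifically, I need to parse each cell of every row in all tables. Each row represents a single object, and each cell represents a different property. I want to parse these so that I can write an XML file containing all of the data (without the useless HTML markup). I have been able to parse each column from the HTML file, but now I don't know what my options are for writing this to an XML file. I am baffled.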
HTML:
<tr><tr>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF">
1
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="left">
<a href="/ice/player.htm?id=8471675">Sidney Crosby</a>
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="center">
PIT
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="center">
C
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
39
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
32
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
33
</td>
<td class="statBox sorted" style="border-width:0px 1px 1px 0px; background-color: #E0E0E0" align="right">
<font color="#000000">
65
</font>
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
20
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
29
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
10
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
1
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
3
</td>
<td class="statBox" style="border-width:0px 0px 1px 0px; background-color: #FFFFFF" align="right">
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
0
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
154
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
20.8
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
21:54
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
22.6
</td>
<td class="statBox" style="border-width:0px 0px 1px 0px; background-color: #FFFFFF" align="right">
55.7
</td>
</tr></tr>
C#:
using System;
using HtmlAgilityPack;

namespace Stats
{
    class StatsParser
    {
        private string htmlCode;
        // ToShortDateString() can contain '/' characters, which are invalid in file names,
        // so a fixed yyyy-MM-dd format is used instead.
        private static string fileName = "[" + DateTime.Now.ToString("yyyy-MM-dd") + " NHL Stats].xml";

        public StatsParser(string htmlCode)
        {
            this.htmlCode = htmlCode;
            this.ParseHtml();
        }

        public void ParseHtml()
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(htmlCode);
            try
            {
                // Get all tables in the document (SelectNodes returns null when nothing matches)
                HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
                // Iterate all rows in the first table
                HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
                for (int i = 0; i < rows.Count; ++i)
                {
                    // Iterate all columns in this row.
                    // Note: the exact match on @class skips cells such as class="statBox sorted".
                    HtmlNodeCollection cols = rows[i].SelectNodes(".//td[@class='statBox']");
                    if (cols == null)
                        continue; // e.g. a header row with no matching cells
                    for (int j = 0; j < cols.Count; ++j)
                    {
                        // Get the trimmed value of the column and print it
                        string value = cols[j].InnerText.Trim();
                        if (value != "")
                            System.Windows.Forms.MessageBox.Show(value);
                    }
                }
            }
            catch (NullReferenceException)
            {
                // Thrown when no table or no row was found
                System.Windows.Forms.MessageBox.Show("Exception!!");
            }
        }
    }
}
XML:
<?xml version="1.0" encoding="utf-8" ?>
<Stats Date="2011-01-01">
  <Player Rank="1">
    <Name>Sidney Crosby</Name>
    <Team>PIT</Team>
    <Position>C</Position>
    <GamesPlayed>39</GamesPlayed>
    <Goals>32</Goals>
    <Assists>33</Assists>
  </Player>
</Stats>
Answers (2)
After looking around MSDN, I finally found an implementation that solves my problem:
which gives the following XML file as output:
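For illustration only (this is not the answerer's original code): one option documented on MSDN for writing XML is `XmlWriter`. The sketch below assumes the cell texts of each row have already been collected into string arrays in the column order of the sample row (rank, name, team, position, games played, goals, assists). The class name `StatsXmlWriterSketch`, the method `Write`, and the `rows` parameter are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Xml;

static class StatsXmlWriterSketch
{
    // Hypothetical helper: writes one <Player> element per row to 'output'.
    // Assumes each row holds at least the first seven cells of the sample table.
    public static void Write(TextWriter output, IEnumerable<string[]> rows, DateTime date)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("Stats");
            writer.WriteAttributeString(
                "Date", date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));

            foreach (string[] cells in rows)
            {
                if (cells.Length < 7)
                    continue; // skip header or incomplete rows

                writer.WriteStartElement("Player");
                writer.WriteAttributeString("Rank", cells[0]);
                writer.WriteElementString("Name", cells[1]);
                writer.WriteElementString("Team", cells[2]);
                writer.WriteElementString("Position", cells[3]);
                writer.WriteElementString("GamesPlayed", cells[4]);
                writer.WriteElementString("Goals", cells[5]);
                writer.WriteElementString("Assists", cells[6]);
                writer.WriteEndElement(); // </Player>
            }

            writer.WriteEndElement(); // </Stats>
            writer.WriteEndDocument();
        }
    }
}
```

To write a file, pass a `StreamWriter`, for example `using (var file = new StreamWriter("stats.xml")) StatsXmlWriterSketch.Write(file, rows, DateTime.Today);`. `XmlWriter` streams the elements out as they are written, so the full document never has to be held in memory.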
What I meant in my comment was that you're doing in code (the nested loops) what the right XPath can do for you. Using LINQ to XML can make this even simpler to write. But now that we see how you want your XML file formatted, we can offer our own answers. I'd write the ParseHtml() method like so:
Produces the output:
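As a rough sketch of the idea described in this answer (not the answerer's original code), a single XPath expression can select the data rows and LINQ to XML can build the document. It assumes the Html Agility Pack package is referenced and that the cells appear in the column order of the sample row. The class `StatsXmlSketch` and the method `ToXml` are hypothetical names introduced here.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

static class StatsXmlSketch
{
    // Hypothetical helper: converts the stats table markup into an XDocument.
    public static XDocument ToXml(string html, DateTime date)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var stats = new XElement("Stats",
            new XAttribute("Date",
                date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)));

        // Select every row that has at least one statBox cell.
        // contains() also matches class="statBox sorted".
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//tr[td[contains(@class, 'statBox')]]");

        if (rows != null)
        {
            var players =
                from row in rows
                let cells = row.SelectNodes("td")
                               .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                               .ToList()
                where cells.Count >= 7 // assumed minimum: rank through assists
                select new XElement("Player",
                    new XAttribute("Rank", cells[0]),
                    new XElement("Name", cells[1]),
                    new XElement("Team", cells[2]),
                    new XElement("Position", cells[3]),
                    new XElement("GamesPlayed", cells[4]),
                    new XElement("Goals", cells[5]),
                    new XElement("Assists", cells[6]));
            stats.Add(players);
        }

        return new XDocument(new XDeclaration("1.0", "utf-8", null), stats);
    }
}
```

Calling `StatsXmlSketch.ToXml(htmlCode, DateTime.Today).Save("stats.xml")` would write the file. `XDocument.Save` emits the XML declaration, while `ToString()` omits it.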