C# scraping HTML/.asp
I'm working on a "personal can-it-work" sort of project, and I have everything working great except for parsing some information from a .asp source file into my program.
This is the parsing code I have so far:
// parse out the results
try
{
    int snr_start = result.IndexOf("SNR");
    int snr_end = result.IndexOf("</TR>", snr_start);
    snr = result.Substring(snr_start, snr_end - snr_start);
    snr = snr.Substring(snr.IndexOf("<TD>") + 1);
    snr = snr.Substring(0, snr.Length - 6);
    iSNR = Convert.ToInt32(snr.Substring(0, snr.IndexOf(" ")));
    int dnpwr_start = result.IndexOf("Downstream Power", snr_end);
    int dnpwr_stop = result.IndexOf("</TR>", dnpwr_start);
    dnpwr = result.Substring(dnpwr_start, dnpwr_stop - dnpwr_start);
    dnpwr = dnpwr.Substring(dnpwr.IndexOf("<TD>") + 1);
    dnpwr = dnpwr.Substring(0, dnpwr.IndexOf("<TABLE") - 1);
    iDPWR = Convert.ToInt32(dnpwr.Substring(0, dnpwr.IndexOf(" ")));
    int uppwr_start = result.IndexOf("Upstream Power", dnpwr_stop);
    int uppwr_stop = result.IndexOf("</TR>", uppwr_start);
    uppwr = result.Substring(uppwr_start, uppwr_stop - uppwr_start);
    uppwr = uppwr.Substring(uppwr.IndexOf("<TD>") + 1);
    uppwr = uppwr.Substring(0, uppwr.IndexOf("</TD>") - 1);
    iUPWR = Convert.ToInt32(uppwr.Substring(0, uppwr.IndexOf(" ")));
}
catch
And this is the source file and the information I'm trying to scrape from it (SNR, Downstream Power, Upstream Power):
<td class="headerR">Downstream Power</td>
<td class="contentL">1.0 dBmV</td>
</tr>
<tr>
<td class="headerR">SNR</td>
<td class="contentL">39.656 dB</td>
</tr>
<tr>
<td class="headerR">Upstream Power</td>
<td class="contentL">42.0 dBmV</td>
</tr>
Not too sure where I'm going wrong, but any help would be greatly appreciated.
The point of the project is to parse the signal levels off of my modem (I'm an MSO employee) for extended monitoring. If needed, I can post the full source from the .asp page.
Thanks,
Matt
Comments (3)
This should work if you only want to pull the data from one table:
I am not too keen on using those string methods for screen scraping unless it's your last resort.
You can try using some Regex... or, even better, if you can guarantee that your HTML source is well formed (XHTML), you could load it (or the snippet of XML you want) into an XmlDocument object and use XPath, or use LINQ to XML (XLinq) if you're on .NET 3.5.
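A minimal sketch of the LINQ to XML route, assuming the rows you care about are well-formed XML and use the `headerR` / `contentL` classes shown in the question. The `SignalTable` class and `ReadRows` method are made-up names for illustration, not part of any library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SignalTable
{
    // Assumption: rowsXml is a well-formed run of <tr>...</tr> elements.
    // Returns a map from the label cell text to the value cell text.
    public static Dictionary<string, string> ReadRows(string rowsXml)
    {
        // Wrap the rows in a single root so the fragment parses as one XML document.
        XElement root = XElement.Parse("<table>" + rowsXml + "</table>");

        return root.Elements("tr")
            .Select(tr => new
            {
                // Element and attribute names are case-sensitive in XML.
                Label = tr.Elements("td")
                    .FirstOrDefault(td => (string)td.Attribute("class") == "headerR"),
                Value = tr.Elements("td")
                    .FirstOrDefault(td => (string)td.Attribute("class") == "contentL")
            })
            // Skip rows that do not have both cells.
            .Where(row => row.Label != null && row.Value != null)
            .ToDictionary(row => row.Label.Value.Trim(), row => row.Value.Value.Trim());
    }
}
```

With the rows from the question, `SignalTable.ReadRows(rows)["SNR"]` would give you the text of the SNR value cell. If the real page is not valid XML (unclosed tags, stray `&`), `XElement.Parse` will throw, so this only helps when the well-formedness assumption actually holds.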
ASP source? Your best bet is probably a regular expression - they're designed for this kind of task. Any kind of scraping usually means it will be worth your while to dig into them.
What language are you using to parse it? If .NET, you can get your name/value pairs easily with the Regex class.
Something like this for the regex:
Then you can loop through the captures and get your list of name/value pairs:
"Downstream Power":"1.0 dBmV"
"SNR":"39.656 dB"
"Upstream Power":"42.0 dBmV"
Should be straightforward.
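For what it's worth, here is a rough sketch of the Regex approach in C#. The pattern is my own guess based on the HTML posted in the question (a `headerR` cell followed by a `contentL` cell), and `ModemPage`, `ReadPairs` and `ReadNumber` are hypothetical names, so adjust to whatever the real page looks like:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ModemPage
{
    // Assumed pattern: a label cell (class "headerR") followed by a value cell
    // (class "contentL"), with optional whitespace or newlines between them.
    private static readonly Regex RowPattern = new Regex(
        @"<td\s+class=""headerR"">\s*(?<name>[^<]+?)\s*</td>\s*" +
        @"<td\s+class=""contentL"">\s*(?<value>[^<]+?)\s*</td>",
        RegexOptions.IgnoreCase);

    // Collects every label/value pair found in the page source.
    public static Dictionary<string, string> ReadPairs(string html)
    {
        var pairs = new Dictionary<string, string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            pairs[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return pairs;
    }

    // Takes the numeric part in front of the unit, e.g. "39.656 dB".
    public static double ReadNumber(string valueWithUnit)
    {
        string number = valueWithUnit.Split(' ')[0];
        // Invariant culture so "." is always read as the decimal separator.
        return double.Parse(number, CultureInfo.InvariantCulture);
    }
}
```

`ReadNumber` returns a `double` rather than an `int` because the sample values (39.656, 1.0, 42.0) carry a decimal part. `RegexOptions.IgnoreCase` is there so the match does not depend on whether the page writes `<td>` or `<TD>`.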